Nokogiri: Simple Scrape in 5 Easy Pieces

Flatiron School / 18 October 2012

The following is a guest post by Aaron Streiter and originally appeared on his blog. Aaron is currently a student at The Flatiron School. You can learn more about him here, or follow him on Twitter here.

In 5 simple steps, you can extract information from almost any web page.

My goal in this post is to reduce scraping to its simplest form and make it as accessible as possible to novices of Ruby and CSS. A great teacher once said that “the internet is your database,” and here is a tool that will help you treat it that way.

1) Install the Nokogiri gem

After a bitter feud with Hpricot, Nokogiri has emerged as the Ruby HTML/XML parser library of choice.
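The gem is normally installed by running `gem install nokogiri` in a terminal. As a quick sanity check, a small script along these lines reports whether Ruby can load it. The file name `check_nokogiri.rb` and the method name `nokogiri_status` are made up for this sketch.

```ruby
# check_nokogiri.rb (hypothetical file name)

# Returns a short message saying whether the Nokogiri gem can be loaded.
def nokogiri_status
  require 'nokogiri'
  "Nokogiri #{Nokogiri::VERSION} is installed"
rescue LoadError
  # require raises LoadError when the gem is not available.
  "Nokogiri is not installed yet; run `gem install nokogiri` in a terminal"
end

puts nokogiri_status
```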

2) Create the following .rb file

3) Insert your favorite URL

In my example I will use the Huffington Post home page.
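Before handing a URL to the scraper, it can help to confirm that it is a complete http or https address. The sketch below uses only Ruby's standard `uri` library. The address shown is an assumption about where the Huffington Post home page lives today, and `http_url?` is a made-up helper name.

```ruby
require 'uri'

# Assumption: the Huffington Post home page is currently served from here.
home_page = "https://www.huffpost.com/"

# True when the string parses as an http or https URL that includes a host.
def http_url?(string)
  uri = URI.parse(string)
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

puts http_url?(home_page)     # true
puts http_url?("not a url")   # false
```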

4) Choose a CSS selector

I want to find the selector for the main headline of the entire site. Google Chrome’s “Inspect Element” tool reveals that the main headline sits inside an h1 tag.

5) Run the program

Outputs: “SUPREMES: OHIO EARLY VOTING CAN PROCEED”

Once you have this program up and running, experiment! Each HTML tag, id, and class on a page can be selected with Nokogiri. If you would like to learn more about scraping with Nokogiri and its many possibilities, I would suggest visiting Getting Started with Nokogiri.
