The following is a guest post by Aaron Streiter and originally appeared on his blog. Aaron is currently a student at The Flatiron School. You can learn more about him here, or follow him on Twitter here.
My goal in this post is to reduce scraping to its simplest form and make it as accessible as possible to newcomers to Ruby and CSS. A great teacher once said that “the internet is your database,” and here is a tool that will help you treat it that way.
1) Install the Nokogiri gem
After a bitter feud with Hpricot, Nokogiri has emerged as the Ruby HTML/XML parser library of choice.
2) Create the following .rb file
3) Insert your favorite URL
In my example I will use the Huffington Post home page.
4) Choose a CSS selector
I want to find the selector for the main headline of the entire site. Google Chrome’s “Inspect Element” tool reveals that the main headline sits inside an h1 tag.
5) Run the program
Outputs: “SUPREMES: OHIO EARLY VOTING CAN PROCEED”
Once you have this program up and running, experiment! Any HTML tag, id, or class on any page can be selected with Nokogiri. If you would like to learn more about scraping with Nokogiri, I would suggest visiting Getting Started with Nokogiri.
Make yourself useful.