The following is a guest post by Kevin Curtin and originally appeared on his blog. Kevin is currently a student at The Flatiron School. You can learn more about him here, or follow him on Twitter here.
One of our recent projects at The Flatiron School was to build a scraper that went to various job sites and scraped a bunch of data about job openings.
After my initial implementations here and here, I noticed a pattern emerging. I was able to abstract the pattern and create a Scraper object that allows me to scrape different sites with a single object and less code.
The Pattern (RE: The Anatomy of a Scraper)
- Visit some type of index page
- Collect all of the links you want to visit
- Cycle through the links and visit each individual page
- Gather the data you need as you go
- (optional) Save it to a database
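The steps above can be sketched in miniature. This is not the post's actual code: a hash of canned HTML stands in for real HTTP fetching, and a regex stands in for a proper parser like Nokogiri, so the flow of the pattern is easy to follow end to end.

```ruby
# Canned pages standing in for a real site, so no network is needed.
PAGES = {
  "/jobs"   => '<a href="/jobs/1">Dev</a><a href="/jobs/2">Ops</a>',
  "/jobs/1" => '<h1>Dev</h1>',
  "/jobs/2" => '<h1>Ops</h1>'
}

def fetch_page(url)
  PAGES.fetch(url) # stand-in for an HTTP GET
end

# 1. Visit some type of index page
index_html = fetch_page("/jobs")

# 2. Collect all of the links you want to visit
links = index_html.scan(/href="([^"]+)"/).flatten

# 3-4. Cycle through the links, visiting each page and gathering data
jobs = links.map do |link|
  page = fetch_page(link)
  page[/<h1>(.*?)<\/h1>/, 1] # crude extraction; use a parser in real code
end

# 5. (optional) Save `jobs` to a database here
```

Here `links` comes out as `["/jobs/1", "/jobs/2"]` and `jobs` as `["Dev", "Ops"]`; swapping `fetch_page` for a real HTTP client turns the same shape into a working scraper.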
I chose to initialize my object with four parameters:
- A human friendly name
- An index URL
- The base URL for the individual links you want to visit (these are sometimes different from the index URL)
- A database connection
Each of these parameters is needed for the next steps in the pattern. You cannot scrape a site if you don't know which page to start on or how to construct the links to the pages you need to visit. While the database connection will eventually become optional, it makes sense to initialize it when the instance is first created so that it is available whenever we choose to write information.
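An initializer along these lines might look like the following. The attribute names, the use of `attr_reader`, and the example values are my assumptions, not the post's actual code:

```ruby
class Scraper
  attr_reader :name, :index_url, :base_url, :db

  # name      - a human friendly name for the scraper
  # index_url - the page that lists the links to visit
  # base_url  - prefix used to build each individual page's URL
  # db        - a database connection, used when saving results
  def initialize(name, index_url, base_url, db)
    @name      = name
    @index_url = index_url
    @base_url  = base_url
    @db        = db
  end
end

# One instance per site is enough, e.g.:
scraper = Scraper.new("Example Jobs",
                      "http://example.com/jobs",
                      "http://example.com",
                      nil) # db connection would go here
```

Keeping the four values as plain constructor arguments is what lets one object scrape different sites: only the data changes, not the code.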
Next, I will get into collecting the individual page URLs and tasking the object with scraping for the information you need.
Make yourself useful.