I had originally intended to have a working, if very simple, CLI search engine ready in time for this post; however, that is not the case. I only have part of one. I’ve hit a wall, but let me present what I have working so far.
My goal for this little project was to gain greater exposure to the Nokogiri gem. I wanted to be able to run my script from the terminal, pass in a search term, and get back a list of ten or so URLs ranked by how often the search term appears on each page.
Before I begin coding I like to sketch out, either in my head or on paper, the basic architecture the program will have. Coding is complex, and I’m still too much of a novice to see exactly what the program will look like, but I think these little preparatory exercises help keep my code targeted and avoid too much sloppiness. In this case, I started by sketching in my text editor. Here are my original notes:
The structure of the program I currently have is surprisingly close to this. I have the two classes: WebPage and Links. Each web page is an instance of WebPage and each web page has links. However, one important distinction is that the way I implemented the Links class, each WebPage instance only has one instance of the Links class. Instances of the Links class are like a collection object of all the links on a given web page.
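To make that relationship concrete, here is a minimal sketch of the two classes. It is an illustration rather than the actual code from this project: the constructor arguments, the `Enumerable` mixin, and the use of `URI.join` to resolve relative links are all assumptions, and how the raw hrefs get pulled out of the page (with Nokogiri, say) is left out.

```ruby
require 'uri'

# Collection-like object holding every link found on one web page.
class Links
  include Enumerable

  # base_url: URL of the page the links came from (hypothetical argument)
  # hrefs:    raw href strings already extracted from that page
  def initialize(base_url, hrefs)
    # Resolve relative hrefs against the page URL and drop duplicates.
    # Note: URI.join raises on malformed hrefs; real pages would need a rescue.
    @urls = hrefs.map { |href| URI.join(base_url, href).to_s }.uniq
  end

  def each(&block)
    @urls.each(&block)
  end
end

# One instance per web page; each page owns exactly one Links instance.
class WebPage
  attr_reader :url, :links

  def initialize(url, hrefs)
    @url = url
    @links = Links.new(url, hrefs)
  end
end

page = WebPage.new('http://example.com/a/',
                   ['b.html', '/c', 'http://example.org/', 'b.html'])
page.links.each { |link| puts link }
```

Because `Links` mixes in `Enumerable`, a page’s links can be iterated, mapped, or counted like any other Ruby collection.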
Although I was able to implement the basic design and relationship between instances of WebPage and Links, I had trouble (surprise surprise) making an effective ranking algorithm. I thought what I had designed was pretty doable, but when I ran it, it was terribly slow.
The general design of this algorithm was that it would take a seed URL, use that URL to make an instance of the WebPage class and it would then instantiate an instance (does that sound redundant?) of the Links class. Using this collection-like instance of the Links class, each URL would then be instantiated into new WebPage objects. And the process would repeat itself with each link.
Each instance of the WebPage class had an instance variable ‘@search_frequency’ that kept track of the number of occurrences of the search term. Every site would be ranked based on search frequency and the top 20 or so would be pushed into a ‘top_hits’ WebPage class array. My idea was that as the crawler did its thing, the array of top hits would continuously change and shuffle around the rankings. To keep things efficient and practical, and avoid having the program run too slowly (which it did anyway), it would “prune” those pages that didn’t make it into the top_hits array by not opening any more links from those pages. The program would stop running once the top_hits array stopped changing for several iterations. That is, with each iteration the current top hits ranking order would be compared with several previous rankings and, if they were equal, the program would return the top hit URLs along with each URL’s frequency of the search term.
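The loop described above can be sketched roughly as follows. This is a toy reconstruction under stated assumptions, not the code I actually ran: `FAKE_WEB`, `Page`, `fetch`, `crawl`, and the `top_n` and `patience` parameters are all hypothetical names, and the “web” is an in-memory hash so the example runs without a network connection.

```ruby
# Hypothetical stand-in for the web: url => [search_frequency, outgoing links].
FAKE_WEB = {
  'seed' => [1, %w[a b c]],
  'a'    => [5, %w[d]],
  'b'    => [0, %w[e]],
  'c'    => [3, []],
  'd'    => [4, []],
  'e'    => [9, []]
}.freeze

Page = Struct.new(:url, :search_frequency, :links)

# Pretend to download a page and count the search term on it.
def fetch(url)
  frequency, links = FAKE_WEB.fetch(url)
  Page.new(url, frequency, links)
end

def crawl(seed_url, top_n: 2, patience: 3)
  seen = { seed_url => fetch(seed_url) }
  top_hits = [seen[seed_url]]
  history = [] # ranking order recorded after each iteration

  loop do
    # Pruning: only pages currently in top_hits get their links opened.
    new_urls = top_hits.flat_map(&:links).uniq.reject { |url| seen.key?(url) }
    new_urls.each { |url| seen[url] = fetch(url) }

    # Re-rank everything seen so far; ties are broken by URL.
    top_hits = seen.values
                   .sort_by { |page| [-page.search_frequency, page.url] }
                   .first(top_n)
    history << top_hits.map(&:url)

    # Stop once the ranking has been identical for `patience` iterations.
    break if history.size >= patience && history.last(patience).uniq.size == 1
  end

  top_hits
end

crawl('seed').each { |page| puts "#{page.url}: #{page.search_frequency}" }
```

In this toy data the page with the highest count (‘e’) is never found, because it is only reachable through ‘b’, which gets pruned early. That is one of the weaknesses of ranking and pruning in the same pass.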
Obviously, this algorithm has a lot of problems. But I wasn’t trying to compete with Google, I just wanted a cute little CLI application that I could brag about 🙂
In addition to it running awfully slowly, I had trouble instantiating new WebPage objects in my code. Not too sure what that’s all about, but I’m still looking into it.
Anyway, so as not to leave you totally disappointed, I’ve taken the parts of my code that worked and made a script that gives the frequency of a search term in a list of web pages that you pass it at the command line.
Here’s a quick demo:
1) first, in the console, I run: $ ruby simple_crawler.rb
2) then I enter the search term
3) then the search URLs
easy as 1..2..3
It really is pen island!
And finally, my code so far… As you can see, there’s some stuff in there that’s intended for a potential search engine but that, as the code stands now, doesn’t make all that much sense.
A little more
I’m not sure if I’ll actually keep working on this, but it did provide good practice in basic object-oriented design and in using the Nokogiri gem. Ultimately, at this stage, learning and improving as much as I can is all I really care about 🙂
I’m also going to look into using the wombat gem. That looks pretty rad.