Wednesday, October 20, 2010

Silly Mistake

So recently I stumbled across this webcomic, Wonderella, which is awesome! After I read a few pages, I decided to do what I do with all webcomics I like: download the whole thing, because I can't stand that 30-second to 1-minute wait between pages. Previously, I used to do this sort of thing with wget, trying to work out the rule for the page URLs, and invariably there would be some complication or other.

Then I wised up and started using Mechanize, which made things a breeze. Sometimes I've even started multiple threads or processes and split the tasks among them so that the whole download happens faster. But this time I thought, "I've been thinking and writing about evented I/O for a bit now, and someone on the Node.js IRC channel mentioned he used Node.js for HTTP scraping, so why not try the same?"

I decided to try it with EventMachine, because that way I could also test my (then) work-in-progress pet project, which I'm going to write about soon. So I started reading about EventMachine and used hpricot for the HTML parsing. In 30 minutes I had the script ready and had tested bits of it.

I asked it to get just the first four comics from the archive page, and it worked. So I ran it again, this time asking it to get the entire thing. But nothing happened!! Nothing was downloaded. Little print statements let me know that it had correctly parsed the archive page and was going after the right pages, but nothing! I let it run for a while, but still nothing :/.

It took me a while to realize I had been solving the wrong problem. I wanted to do things in parallel, but my bottleneck wasn't how much memory I could afford on my machine, which is where evented I/O helps you. My bottleneck was how many connections the web server would allow me to make. I was attempting to download all the comics simultaneously, and that didn't work.
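A minimal sketch of the fix for that bottleneck: cap the number of downloads in flight instead of starting them all at once. This is not the script from this post, and it swaps EventMachine for plain Ruby threads from the standard library. The names `download_all`, `max_connections`, and `fetch` are made up for illustration, and the fake fetcher stands in for a real HTTP request.

```ruby
require "thread"

# Run `fetch` over every URL using at most `max_connections` worker threads,
# so the server never sees more than that many requests at once.
# `fetch` is a hypothetical callable standing in for the real download.
def download_all(urls, max_connections:, fetch:)
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close # once closed and empty, `pop` returns nil and workers stop

  results = {}
  lock = Mutex.new

  workers = Array.new(max_connections) do
    Thread.new do
      while (url = queue.pop)
        body = fetch.call(url)
        lock.synchronize { results[url] = body }
      end
    end
  end

  workers.each(&:join)
  results
end

# Usage with a fake fetcher (no network access); the URLs are placeholders.
fake_fetch = ->(url) { sleep 0.01; "contents of #{url}" }
comic_urls = (1..10).map { |i| "http://example.invalid/comic/#{i}" }
pages = download_all(comic_urls, max_connections: 4, fetch: fake_fetch)
puts pages.size
```

The right value for `max_connections` depends on what the server tolerates, so treat 4 as a placeholder rather than a recommendation.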

So like yeah, funny story :/. I just redid it with Mechanize and it was delicious. Here is the code if anyone is curious what EM code looks like.
