You don't need to use jQuery to scrape HTML pages with node.js and jsdom

I saw this article earlier called "jsdom + jQuery in 5 lines with node.js" and thought it was pretty cool. I tried it out, and it worked like a charm to parse a webpage. I didn't realize how easy it was to use jsdom with an external library like that, and I was pretty impressed with how fast it processed and returned the results.

Later, I happened to see that Google was killing their Buzz service, and when I went to remove it from my profile page, I saw the notice that you can export your Buzz content through Google's Takeout service. So I clicked a couple of buttons, and the GOOG created a file of my Buzz posts that I could download. But they didn't create a JSON document or anything like that. Instead, they exported each post as an HTML page and then zipped them all up. I never actually used Buzz for anything, but as soon as it came out, I had it re-post my Twitter feed, which I use daily. So I ended up with 2600+ HTML files of my tweets, which is pretty good. To my knowledge, there's not an easy way to get them directly from Twitter.

So now I had a little project! I could use node.js to parse the HTML pages and extract the post content and timestamp into a DB, then maybe I'd post them on my blog or something. Or maybe just look at them... it's not important, I just wanted to test out jsdom some more in a real-life task.

I wrote a little script that looked in the Buzz post directory and parsed each of the HTML pages one by one. Thankfully, node.js v0.4.12 has synchronous versions of all its file functions, so I could easily just grab the files one by one and not worry about callbacks, etc. Jsdom, however, is asynchronous, so I had to put the whole thing in a recursive function, which I think is sort of a pain, but nothing overly complex.
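A minimal sketch of that pattern, not the original script: the file list is read synchronously, and a recursive helper only moves on to the next file once the asynchronous step for the current one has called back. The names `listHtmlFiles`, `processInOrder` and `worker`, and the `buzz` directory in the usage comment, are made up for illustration. In the real script the worker would be the jsdom call.

```javascript
var fs = require('fs');
var path = require('path');

// Synchronously list the .html files in a directory (hypothetical helper).
function listHtmlFiles(dir) {
    return fs.readdirSync(dir).filter(function (name) {
        return path.extname(name) === '.html';
    }).map(function (name) {
        return path.join(dir, name);
    });
}

// Run an asynchronous worker over the items one at a time.
// worker(item, next) must call next() when it is finished with the item.
function processInOrder(items, worker, done) {
    var index = 0;
    function step() {
        if (index >= items.length) { return done(); }
        var item = items[index++];
        worker(item, step);
    }
    step();
}

// Example use (assumed directory name):
// processInOrder(listHtmlFiles('buzz'), function (file, next) {
//     console.log('would parse ' + file);
//     next();
// }, function () { console.log('done'); });
```

When the worker really is asynchronous, each `step` is re-entered from the worker's callback on a fresh stack, so this kind of recursion should not get deeper as the file count grows.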

But then, something went wrong. I would loop through about 300 files, and then get this error:

FATAL ERROR: CALL_AND_RETRY_2 Allocation Failed - process out of memory

Joy. I had zero idea why. Jsdom has some memory leaks it seems, but there were no obvious solutions that I could find by searching the web or by trial and error. I thought maybe I was using recursion too deeply (I'm not sure what V8's limit is, but I thought it was a lot), so I started looking around for sync libraries to see if I could somehow fix the memory problem with some sort of magic like promises or whatever. I also tried to delete the variables on each loop, messed with jsdom's parameters, etc. Nothing worked.

Finally, after messing with it for several *hours*, it dawned on me that I was killing an ant with a bazooka. On every loop, jsdom was loading the entire jQuery script and parsing ~9000 or so lines of code, almost none of which I was using, just to pull the values of exactly two HTML tags. Well, duh.

So I re-wrote the script to not use jQuery, and instead just pulled out the data I was looking for using plain ol' JavaScript and the DOM and *poof*, the memory problem went away. Makes sense if you think about it: why use jQuery for something as simple as pulling up a few values? Unless you're parsing a really complicated page, why not just use the basic DOM instead?
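A rough sketch of what pulling two values with the plain DOM can look like. This is an illustration, not the actual script: `firstText` and `extractPost` are hypothetical helpers, and the class names `post-content` and `post-date` are placeholders rather than whatever the exported Buzz pages really used.

```javascript
// Return the trimmed text of the first element with the given class,
// or null when the page has no such element.
function firstText(document, className) {
    var nodes = document.getElementsByClassName(className);
    return nodes.length > 0 ? nodes[0].textContent.trim() : null;
}

// Pull the post content and timestamp out of one parsed page.
// 'post-content' and 'post-date' are placeholder class names.
function extractPost(document) {
    return {
        content: firstText(document, 'post-content'),
        timestamp: firstText(document, 'post-date')
    };
}
```

Inside a jsdom callback this would be called as `extractPost(window.document)`. Checking the length first avoids a TypeError on pages where the element is missing.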

Here's an example which pulls the top story from Reuters. Truly 5 lines of code, and no ginormous libraries required.

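var jsdom = require('jsdom');

// The first argument is the URL of the page to load (left blank here).
jsdom.env('', function(err, window){
    // Grab the text of the first element with the class 'topStory'.
    var post = window.document.getElementsByClassName('topStory')[0].textContent;
    console.log(post);
});
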

All that said, I have to say that node.js is really a pain in the ass to do any sort of general scripting with. Looping through and parsing a bunch of files is a basic system admin task - it should not require me to jump through hoops of callbacks and recursion, etc. Maybe node isn't the right tool, but it'd be awfully nice to be able to use JavaScript in a more general way, no? Rather than having to have a mish-mash of Python, PHP, Perl, Ruby and whatever else littering my system (not to mention my brain), having more tasks written in a common language like JavaScript would be nice.

There really needs to be a simple way to just declare an async method as synchronous, so that you can whip up basic system tasks without having to write a bunch of interlocking spaghetti code. Sure, it's not the best way to develop for high-end, scalable systems, but for a lot of what programmers do, that sort of basic functionality would make a lot of scripts cleaner, simpler and easier to develop and maintain.

Okay, I'm going to bed now.

