Using PHP to scrape web sites as feeds

A few weeks ago I subscribed to John Gruber's Daring Fireball blog after I exchanged an email or two with him about what was then the upcoming iPhone... John's been keeping close tabs on iPhone news, and I've been enjoying his posts. About a week ago, I noticed I wasn't getting all the updates on his blog, and went looking around to see if I was missing some sort of uber-feed. Turns out I wasn't the only one with this problem: the feed with the links was only available to paid subscribers.

Well, being completely, utterly broke, and with my current job LITERALLY being to scrape web pages for a living, I couldn't resist the temptation to take two minutes and write a script to, um, help remind me to check John's blog. (Sorry John!) Since John has now opened up his feed to all, I thought it would be both educational - and cleansing - to post the code here for all to see... both for the would-be poverty-stricken entrepreneur types, and for those who'd like to make sure their content isn't being grabbed if they don't want it to be.

Here's the code:

<?php

    $url = 'http://daringfireball.net/';
    $title = 'Daring Fireball links';
    $description = 'Links';

    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

    header('Content-type: text/xml; charset=utf-8', true);

    echo '<?xml version="1.0" encoding="UTF-8"?'.'>' . PHP_EOL;
    echo '<rss version="2.0">' . PHP_EOL;
    echo '<channel>' . PHP_EOL;
    echo '  <title>' . $title . '</title>' . PHP_EOL;
    echo '  <link>' . $url . '</link>' . PHP_EOL;
    echo '  <description>' . $description . '</description>' . PHP_EOL;

    // Grab the page with curl: fake the user agent, follow redirects (and
    // set the referer along the way), and return the HTML as a string.
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 2);

    $html = curl_exec($curl);

    // Convert everything to HTML entities so multi-byte characters
    // survive the trip through the DOM parser.
    $html = @mb_convert_encoding($html, 'HTML-ENTITIES', 'utf-8');

    curl_close($curl);

    // Parse the HTML with the DOM extension; the @ hides the warnings
    // that real-world markup inevitably triggers.
    $dom = new DOMDocument();

    @$dom->loadHTML($html);

    // Walk every element, remembering the latest h2 (the date heading)
    // and writing out an <item> for each permalink anchor inside a dt.
    $nodes = $dom->getElementsByTagName('*');

    $date = '';

    foreach($nodes as $node){

        if($node->nodeName == 'h2'){
            $date = strtotime($node->nodeValue);
        }

        if($node->nodeName == 'dt'){

            $inodes = $node->childNodes;

            foreach($inodes as $inode){

                // Each link post is an <a class="permalink"> whose title
                // attribute holds the headline.
                if($inode->nodeName == 'a' && $inode->getAttribute('class') == 'permalink'){
                    echo '<item>' . PHP_EOL;
                    echo '<title>' . @mb_convert_encoding(htmlspecialchars($inode->getAttribute('title')), 'utf-8') . '</title>' . PHP_EOL;
                    echo '<link>' . htmlspecialchars($inode->getAttribute('href')) . '</link>' . PHP_EOL;
                    if($date){
                        echo '<pubDate>' . date(DATE_RSS, $date) . '</pubDate>' . PHP_EOL;
                    }
                    echo '</item>' . PHP_EOL;
                }
            }
        }
    }

    echo '</channel></rss>';

?>

I'll explain the code in chunks...

The first bit sets the params I'll use below in the RSS feed I generate - the original URL, the title of the site, and the description. These are the basics for RSS and will be fine for our purposes. Then, to make sure the script isn't easily bounced, I set a variable that makes the user agent look like Googlebot. That helps a lot with the variety of sites out there that just check for that and nothing else. (If you're doing any sort of Googlebot check without verifying the IP, you're not being smart.) Next I just echo out the top part of the RSS feed with my params, nothing special. I'm using UTF-8 to encode the XML; this is important when I'm writing out stuff later, so keep it in mind.
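
(For what it's worth, if you do want to catch pretenders like my script, the usual trick is a reverse DNS lookup on the visitor's IP followed by a forward lookup to confirm. Here's a rough sketch of that check - the helper name is just something I made up:)

<?php

    // Rough sketch: verify a claimed Googlebot by reverse DNS, then make
    // sure the hostname resolves back to the same IP. (Helper name is made up.)
    function looks_like_real_googlebot($ip){
        $host = gethostbyaddr($ip);   // e.g. crawl-66-249-66-1.googlebot.com
        if(!preg_match('/\.googlebot\.com$/i', $host)){
            return false;
        }
        return gethostbyname($host) == $ip;   // forward lookup must match
    }

?>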

The next chunk is the curl calls to go get the content. PHP has a few ways to do this, but one of the best things about PHP is that much of the functionality it provides is just a light wrapper around a native library - in this case, libcurl. Here the options are set to make sure it passes my custom user agent, follows redirects, automatically sets the referer on those redirects (in case the site watches for that), and hands the page back to me as a string rather than spitting it straight out, since I want to process it before echoing anything.
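
(If you don't have the curl extension handy, one of those other ways is plain old file_get_contents() with a stream context - roughly equivalent to the fetch above, minus a few of the options:)

<?php

    // Roughly the same fetch without curl: a stream context lets you set
    // the user-agent header and a timeout on file_get_contents().
    $context = stream_context_create(array(
        'http' => array(
            'header'  => "User-Agent: Googlebot/2.1 (http://www.googlebot.com/bot.html)\r\n",
            'timeout' => 2
        )
    ));

    $html = @file_get_contents('http://daringfireball.net/', false, $context);

?>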

Side note - another great thing about PHP is that you can test it on the command line easily: just type "php scriptname.php" and it'll run your code, and you can see the HTML output right there, along with any errors. You can also make the error output on the command line more verbose, which helps a lot when debugging these quick scripts. So as I'm writing my code, I normally stop and echo things out as I go along, to check that everything is working as it should up to that point.
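
(For example, a couple of lines like these at the top of a script will surface every notice and warning while you're testing:)

<?php

    // Crank error reporting all the way up and print everything to the
    // terminal while developing - just remember to pull these out later.
    error_reporting(E_ALL);
    ini_set('display_errors', '1');

?>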

Now, the next step is to pass the HTML I got back from curl to the DOM - which, again, is just a light wrapper around libxml - and it works really well for processing HTML docs as well as XML. No messing with regular expressions here - someone else has already figured out all the hard stuff about parsing markup and recovering from errors, so why try to re-create that wheel? (I firmly believe in that Jamie Zawinski quote: "You have a problem and you think, I know, I'll use regular expressions... now you have two problems.") One tip, though: even though the DOM has various ways to specify the encoding, it seems to be happiest when you just convert whatever you're passing it to use HTML entities. So that's what I do using the "multi-byte" string converter. I use this rather than the regular converter because it makes sure the odd Japanese or Finnish character doesn't get munged and mess up the works. That said, I still put the @ symbol in front of those calls, which tells PHP to ignore any errors, as something always goes wrong somewhere.
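
(Just to show what I mean about the error recovery, here's a little toy example - feed the DOM some busted markup and it quietly patches it up for you:)

<?php

    // Toy example: libxml closes the tags this sloppy snippet left open.
    $dom = new DOMDocument();
    @$dom->loadHTML('<p>Some <b>broken <i>markup</p>');
    echo $dom->saveHTML();

?>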

Okay, once we have the DOM, it's just a matter of iterating through the "nodes" (i.e. tags/elements, etc.) until you find what you're looking for. In this case, John has his links in specially marked anchors, which are usually inside dt's. However, the descriptive text isn't really easily grabbable, as it's not in a containing DIV or anything, so I decided to ignore that and go with the stuff that's easy to snag - like the title attribute on the anchor tag. By looking for "a" nodes that also have a "class" attribute of 'permalink', I'm able to write out the individual items from there without much more logic. And as I was iterating through the nodes, I also took a sec to grab the h2 tags, which seem to hold the date. RSS items don't actually need a pubDate, but they're nice to have, so if I can grab that info from a tag, I do. Another great thing about PHP? The strtotime() function, which will read just about any date/time in plain English and convert it into a usable timestamp. It's a *very* convenient function to have. Finally, before I write out the title in my item, I need to make sure I convert the ampersands into &amp;s and also convert the outgoing text back into UTF-8 (as the XML is in UTF-8).
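
(If you haven't played with strtotime() before, this is the sort of thing I mean - hand it a date the way a human would write it and it just works:)

<?php

    // strtotime() turns a plain-English date into a Unix timestamp, and
    // DATE_RSS formats it the way the feed spec wants it.
    $date = strtotime('Tuesday, 10 July 2007');
    echo date(DATE_RSS, $date) . PHP_EOL;   // e.g. Tue, 10 Jul 2007 00:00:00 -0700

?>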

The end result is a reasonable and valid feed that shouldn't have duplicate updates, which is "good enough" for your feed reader to call periodically to get the links. Most sites nowadays have feeds, and full feeds at that, but every once in a while you'll run across a site that's adamant about making you go there to view their content. As I'm not re-publishing this feed, just using it for private use, it's pretty close to just visiting the site anyway (if you're being very morally ambiguous).

If you're not a PHP hacker and you want to try this sort of thing out, Dapper provides point-and-click scraping tools that might work for you as well. They've open sourced a bit of their code, and they actually use the Firefox engine to do the scraping, so it looks like a very neat way of getting data out of various sites.

I hope that was educational in a good way. :-)

-Russ
