WideFinder in PHP

Tim Bray is experimenting with Erlang, and created a little test script called WideFinder which has sort of taken on a life of its own. Since everyone was donating versions in their favorite languages, I wanted to see how a PHP command-line script would compare, so I whipped one up and tweaked it a bit to get it to run faster. Now, I understand the real test here is spreading the load across processes and cores, and maybe even trying better algorithms or strategies like MapReduce, but I just wanted to see what I could get running and compare.

Here it is:


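#!/usr/bin/php -q
<?php

$time_start = microtime(true);

// Matches requests of the form "GET /ongoing/When/<3 digits>x/<yyyy>/<mm>/<dd>/<name> "
// and captures the "<yyyy>/<mm>/<dd>/<name>" part
$regex = '#GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) #';

$file = fopen($argv[1], 'r');

$urls = array();

// fgets() returns false at the end of the file, which ends the loop
while (($line = fgets($file)) !== false) {
        if (preg_match($regex, $line, $matches)) {
                // Start the counter at zero the first time a URL shows up,
                // so the increment never hits an undefined index
                if (!isset($urls[$matches[1]])) {
                        $urls[$matches[1]] = 0;
                }
                $urls[$matches[1]] += 1;
        }
}

fclose($file);

// Sort by hit count, highest first, keeping the URL keys
arsort($urls);

// Keep the ten most requested URLs
$urls = array_slice($urls, 0, 10);

print_r($urls);

echo 'Executed in ' . (microtime(true) - $time_start) . ' seconds' . "\n";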


Even though it's a simple script, you always learn stuff doing these sorts of things, and this time was no different for me. I duplicated Tim's example data file until it was 300MB and 1.6 million lines long, and then ran the above. The fastest time I got was just below 7 seconds, with about 7.5 seconds being the norm. This is on my desktop machine, which is a not particularly powerful 3GHz P4 with 1.7GB of RAM.
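For anyone who wants to build a similar test file, here is a minimal sketch of the "duplicate the sample until it is big enough" step. The function name grow_log and its parameters are made up for illustration, and it assumes the sample file is small enough to hold in memory.

```php
<?php
// Hypothetical helper (not part of the original script): writes copies of
// $source into $target until $target holds at least $min_bytes bytes.
// Returns the number of bytes written.
function grow_log(string $source, string $target, int $min_bytes): int
{
    $sample = file_get_contents($source);
    if ($sample === false || $sample === '') {
        throw new RuntimeException("Cannot read a non-empty sample from $source");
    }

    $out = fopen($target, 'w');
    if ($out === false) {
        throw new RuntimeException("Cannot open $target for writing");
    }

    $written = 0;
    while ($written < $min_bytes) {
        $n = fwrite($out, $sample);
        if ($n === false) {
            fclose($out);
            throw new RuntimeException("Write to $target failed");
        }
        $written += $n;
    }

    fclose($out);
    return $written;
}
```

Because whole copies of the sample are written, the result can overshoot the requested size by up to one sample's worth of bytes, and every line stays intact.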

Stuff I learned:

* Tim's Ruby script actually runs in around 6.5 seconds on the same box on the same file, with less variation in speed, which is a second or more faster than my typical PHP run. I guess Ruby doesn't completely suck. :-)

* Putting the regular expression into its own variable increased the speed of the loop by more than a second, which probably shouldn't surprise me, but did. There doesn't seem to be any way of explicitly pre-compiling the regex, though, like there is in Java.

* I was checking for the end of the file on each pass through the loop, but doing it the above way (just stopping when fgets() returns false instead of a line) saved about half a second as well.

* The sorting functions have almost no effect on the final speed. Disk I/O and the regular expressions are where most of the time goes.
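One way to check the last point on your own data is to time the scan and the sort separately. The following is only a sketch: the function name profile_widefinder and the keys of the array it returns are invented for this example, and it needs PHP 7 or later for the ?? operator.

```php
<?php
// Hypothetical helper: runs the same count-and-sort logic as the script above,
// but times the scan (read + regex) and the sort as two separate phases.
function profile_widefinder(string $path): array
{
    $regex = '#GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) #';
    $counts = array();

    $file = fopen($path, 'r');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path");
    }

    // Phase 1: read each line and count the matching URLs
    $scan_start = microtime(true);
    while (($line = fgets($file)) !== false) {
        if (preg_match($regex, $line, $matches)) {
            $counts[$matches[1]] = ($counts[$matches[1]] ?? 0) + 1;
        }
    }
    $scan_seconds = microtime(true) - $scan_start;
    fclose($file);

    // Phase 2: sort by hit count and keep the top ten
    $sort_start = microtime(true);
    arsort($counts);
    $top = array_slice($counts, 0, 10);
    $sort_seconds = microtime(true) - $sort_start;

    return array(
        'top'          => $top,
        'scan_seconds' => $scan_seconds,
        'sort_seconds' => $sort_seconds,
    );
}
```

Calling profile_widefinder($argv[1]) and comparing scan_seconds with sort_seconds shows how the time splits between the two phases on a given file. The numbers will vary by machine and by how much of the file is already in the disk cache.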

That was fun... though I really need to get up to speed on a real programming language some day, like Python. But if you look at the above script, do you see those two array functions (arsort and array_slice)? I love that stuff! Not having to write that by hand is my favorite part of PHP. I'd much rather lego that type of code together than write it out on my own, no matter how simple it is.
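To see what those two functions do on their own, here is a tiny sketch with made-up hit counts (the paths and numbers are illustrative only, not taken from the real log):

```php
<?php
// Made-up hit counts keyed by article path (illustrative data only)
$hits = array(
    '2007/09/20/Alpha' => 3,
    '2007/09/21/Beta'  => 10,
    '2007/09/22/Gamma' => 7,
    '2007/09/23/Delta' => 1,
);

// arsort() sorts by value, highest first, and keeps each key attached to its value
arsort($hits);

// array_slice() keeps the first two entries; string keys are preserved
$top = array_slice($hits, 0, 2);

print_r($top);
```

This prints Beta with 10 hits followed by Gamma with 7. The key detail is that arsort keeps the path attached to its count, which a plain rsort would not, and array_slice leaves string keys alone, so the top-ten list still says which URL each number belongs to.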

-Russ
