I'm doing some analysis on the logs I've kept for Mowser over the past few months. It got roughly 2.6MM mobile page views in October and 3.25MM in November - add a bit for December and I have 6,605,088 records to parse. The first thing I wanted to work out is a better way of guesstimating my Unique Users, which is a real pain in the ass to determine since every phone out there goes through some sort of centralized IP from the carrier.
I think I've mentioned this before, but for every mobile page that I serve, I write a row in my database with the standard log stuff like request and referrer, and I also serialize the entire PHP $_SERVER array and throw it into a blob so I can analyze it later (there's lots of stuff that just doesn't appear in an extended Apache log). Well, now is later... So I just wrote a little script that went through the stored request blobs, deserialized them, and pulled out the HTTP headers into a unique array, so I could start to look at them and figure out how to determine MSISDN numbers or other unique identifiers. Pretty straightforward, right?
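A rough sketch of that kind of script, for illustration only. It's in Python rather than PHP, it is not the actual code I ran, and it assumes each stored blob has already been deserialized into a plain dict shaped like $_SERVER. PHP exposes most request headers there as keys starting with HTTP_, so the sketch just collects those key names. The function names and the sample rows are made up.

```python
# Sketch: collect the unique HTTP header names seen across logged requests.
# Assumption: each request is a dict shaped like PHP's $_SERVER, already
# deserialized from the stored blob. The sample rows below are invented.

def header_names(server_vars):
    """Return the header-style keys (HTTP_*) from one $_SERVER-like dict."""
    return {key for key in server_vars if key.startswith("HTTP_")}


def unique_headers(requests):
    """Return a sorted list of every distinct header name across all requests."""
    seen = set()
    for server_vars in requests:
        seen |= header_names(server_vars)
    return sorted(seen)


# Hypothetical sample data standing in for rows pulled from the database.
sample_requests = [
    {"HTTP_USER_AGENT": "ExamplePhone/1.0", "HTTP_ACCEPT": "*/*",
     "REMOTE_ADDR": "10.0.0.1"},
    {"HTTP_USER_AGENT": "OtherPhone/2.0", "HTTP_VIA": "1.1 example-gateway",
     "REQUEST_URI": "/"},
    {"HTTP_USER_AGENT": "ExamplePhone/1.0",
     "HTTP_X_UP_CALLING_LINE_ID": "0000000000"},
]

for name in unique_headers(sample_requests):
    print(name)
```

Note that a prefix check like this would miss Content-Type and Content-Length, which PHP stores as CONTENT_TYPE and CONTENT_LENGTH without the HTTP_ prefix.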
Now, I understand that HTTP headers are extensible - Mowser adds a few itself in addition to the standard proxy ones it passes - but how many would you think there were? 100? 200? Nope. To my bewilderment, there were 599 unique HTTP header types passed to the server. Considering the millions of page views, that might not actually be all that bad, but as I'm starting to slog through them it seems like a fuck of a lot to me.
Is this a surprise to you? Maybe I just never paid attention before... but wow. If you'd like the list, I threw the output of the script in a text file here. It's not the most interesting chunk of data in the world, but it really goes to show how much data is being passed around right under our noses, so I thought I'd share.
I think the next step will be to re-run the query, but this time keeping track of the counts, in hopes that some of the more popular headers bubble up to the top. Wish me luck.
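For what it's worth, the counting pass could look something like the sketch below. Same caveats as anything I haven't actually run against the real data: it's Python, not PHP, it assumes the blobs are already deserialized into $_SERVER-style dicts, and the function name and sample rows are invented for the example.

```python
# Sketch: count how many requests carried each HTTP header, most common first.
# Assumption: each request is a dict shaped like PHP's $_SERVER, already
# deserialized from the stored blob. The sample rows below are invented.
from collections import Counter


def count_headers(requests):
    """Return a Counter mapping header name (HTTP_* key) to request count."""
    counts = Counter()
    for server_vars in requests:
        # Dict keys are unique, so each header counts once per request.
        counts.update(key for key in server_vars if key.startswith("HTTP_"))
    return counts


# Hypothetical sample data standing in for rows pulled from the database.
sample_requests = [
    {"HTTP_USER_AGENT": "ExamplePhone/1.0", "HTTP_ACCEPT": "*/*",
     "REMOTE_ADDR": "10.0.0.1"},
    {"HTTP_USER_AGENT": "OtherPhone/2.0", "HTTP_VIA": "1.1 example-gateway"},
    {"HTTP_USER_AGENT": "ExamplePhone/1.0", "HTTP_ACCEPT": "text/html",
     "HTTP_X_UP_CALLING_LINE_ID": "0000000000"},
]

# most_common() sorts by count, highest first.
for name, seen in count_headers(sample_requests).most_common():
    print(f"{seen:>8}  {name}")
```

With millions of rows you'd want to stream them from the database rather than load everything into a list, but the counting itself stays the same.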