Let the microblogs bloom

[image]

I was just about to embark on a post yesterday about my latest obsession, web-based forums (actually, it's the return of an old obsession), when identi.ca launched its open source PHP-based Twitter clone, so I just had to try it out. I threw it up on foozik.com if you want to see. It took me a while to get the dependencies working, but it seems pretty cool.

It's a great effort, looks good, and has been promoted in all the right ways. Evan (the guy behind identi.ca and the laconi.ca code base) did a great job creating a nice little project with some cool features like OpenID, Jabber support, and the beginnings of a federation system.

Looking at the code, however, it's doomed.

The core architecture just isn't made to scale, and a day after launch identi.ca already seems to be paying the price, even after adding a bunch more servers. Here's the problem in a few lines of code:


$notice = DB_DataObject::factory('notice');
# XXX: chokety and bad
$notice->whereAdd('EXISTS (SELECT subscribed from subscription where subscriber = '.$profile->id.' and subscribed = notice.profile_id)', 'OR');
$notice->whereAdd('profile_id = ' . $profile->id, 'OR');
$notice->orderBy('created DESC');

Even the comment admits this is "chokety and bad". Ignoring the use of PEAR's DB_DataObject (an abstraction layer on top of your database that you can't afford to have), this code shows that the design of the system is fundamentally flawed. The core problem is the query itself, which is expensive as hell: "Get all the notices (messages) written by me or by anyone I'm subscribed to, newest first." Oh, man. As the database grows, the indexes will have to get huge, and as there are more subscribers and more subscriptions between them, it's going to be impossible for that query to keep up.
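To make the shape of that query concrete, here's a minimal sketch of it in Python with an in-memory SQLite database. This is not the Laconi.ca code or its real schema: the table layout, the sample rows, and the function name timeline_on_read are assumptions made up for illustration. The point is only that the whole timeline gets rebuilt from the notice and subscription tables each time the function is called.

```python
import sqlite3

# Hypothetical, simplified schema; the real Laconi.ca tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE notice (
        id INTEGER PRIMARY KEY,
        profile_id INTEGER,
        content TEXT,
        created INTEGER
    );
    CREATE TABLE subscription (subscriber INTEGER, subscribed INTEGER);
""")

# Made-up sample data: profile 1 follows 2 and 3, profile 2 follows 3.
conn.executemany("INSERT INTO subscription VALUES (?, ?)", [(1, 2), (1, 3), (2, 3)])
conn.executemany(
    "INSERT INTO notice (profile_id, content, created) VALUES (?, ?, ?)",
    [(1, "mine", 1), (2, "from 2", 2), (3, "from 3", 3), (4, "stranger", 4)],
)


def timeline_on_read(conn, profile_id):
    """Rebuild one user's timeline from scratch, as the quoted query does."""
    rows = conn.execute(
        """
        SELECT content FROM notice
        WHERE EXISTS (SELECT 1 FROM subscription
                      WHERE subscriber = ? AND subscribed = notice.profile_id)
           OR profile_id = ?
        ORDER BY created DESC
        """,
        (profile_id, profile_id),
    )
    return [row[0] for row in rows]


print(timeline_on_read(conn, 1))
```

With four rows this is instant, of course. The trouble is that every call has to consider the whole notice table against the subscription table, and both keep growing.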

The lesson from Twitter is that microblogs aren't Content Management Systems at all, but are instead Messaging systems, and have to be architected as such. SMTP or EDI are our models here, not publishing or blogs.

Here's how a microblog system has to work to scale: All the messages created by users have to go into a Queue when they're created, and an external process then has to go through them one by one and figure out which messages go into which subscribers' message lists. As the system grows and more messages are created, the messages may arrive in your "inbox" more slowly, but they will still arrive. This type of system can be easily broken up across dedicated servers, with multiple processes handling different parts of the read/write process. The individual user message lists can also be cached more easily, because once a page of messages is created, it doesn't change.
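The queue-and-inbox flow described above can be sketched in a few lines of Python. Everything here is a toy: the in-memory deque stands in for a real message queue, the dictionaries stand in for storage, and the names (post, deliver_pending, read_inbox, the sample users) are all made up for illustration. It's a sketch of the idea, not a proposal for how Laconi.ca should implement it.

```python
from collections import defaultdict, deque

# Hypothetical data: publisher -> list of users subscribed to them.
subscribers = {"alice": ["bob", "carol"], "bob": ["carol"]}

queue = deque()              # messages waiting to be delivered
inboxes = defaultdict(list)  # user -> delivered messages, newest first


def post(author, text):
    """Writing a message only appends to the queue, so it stays cheap."""
    queue.append((author, text))


def deliver_pending():
    """The 'external process': copy each queued message into every inbox."""
    delivered = 0
    while queue:
        author, text = queue.popleft()
        # The author sees their own message too, like the original query.
        for user in [author, *subscribers.get(author, [])]:
            inboxes[user].insert(0, (author, text))
            delivered += 1
    return delivered


def read_inbox(user):
    """Reading is a plain lookup of a precomputed list, no joins involved."""
    return list(inboxes[user])


post("alice", "hello")
post("bob", "hi")
print(read_inbox("carol"))  # nothing has been delivered yet
deliver_pending()
print(read_inbox("carol"))  # now holds both messages, newest first
```

The work moves from read time to write time. If the delivery process falls behind, inboxes update late, but reading one never gets more expensive.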

If you don't set it up like this? Well, you're seeing what happens at Twitter and identi.ca now. As the number of users scales up and the number of messages increases, the load quickly overwhelms any relational database until it becomes impossible to keep up. Structuring microblog systems essentially like self-contained web-mail is the key. Just think about it - Hotmail survived and scaled after it launched in the 1990s when 512MB of RAM was considered a lot and 10GB hard drives were "big". It's not about *power* or throwing more/bigger hardware at the problem, it's about architecture.

Another example: Lots of web forums out there get millions of new posts from tons of users (GaiaOnline.com had 5MM posts last week, for example), but they scale just fine because the forum format has been designed to scale. These sites work because they simplify queries, facilitate caching, and don't guarantee instant updates. Simplification and caching are to me the fundamental aspects of web scaling. This is the way it's been for a decade now... Want to survive a Slashdotting, for example? Get your DB out of there and export your pages as static .html files. These same principles can be applied to a microblog/messaging system as well.
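As a rough illustration of the static-file idea, here's a small Python sketch that renders a page of messages to an .html file the first time it's requested and serves it from disk after that. The cache directory, the function names, and the load_messages callback are all hypothetical, and it assumes user names are safe to use in file names. A real system would also need to decide when a cached page gets thrown away.

```python
import html
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # hypothetical cache location
render_count = 0                      # counts how often we do the real work


def render_page(user, messages):
    """Build the HTML for one page of messages (the expensive step)."""
    global render_count
    render_count += 1
    items = "\n".join(f"<li>{html.escape(m)}</li>" for m in messages)
    return f"<h1>{html.escape(user)}</h1>\n<ul>\n{items}\n</ul>\n"


def get_page(user, page_no, load_messages):
    """Serve the static file if it exists; otherwise render and save it."""
    path = CACHE_DIR / f"{user}-{page_no}.html"
    if not path.exists():
        page = render_page(user, load_messages(user, page_no))
        path.write_text(page, encoding="utf-8")
    return path.read_text(encoding="utf-8")


def load_messages(user, page_no):
    # Stand-in for the database lookup; only called on a cache miss.
    return ["first <post>", "second post"]


first = get_page("russ", 1, load_messages)
second = get_page("russ", 1, load_messages)
print(render_count)  # the second request never reached render_page
```

Older pages of a message list are the easy case here, since nothing new ever gets added to them.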

Once this is widely accepted (and I'm sure there are many who would argue with me), the thing that will separate these types of services won't be whether they stay up (à la Twitter), but how fast your subscription messages are updated. Some services might be smaller or offer more features but not update as quickly, whereas others will pride themselves on being as close to real-time as possible. The key is that it's all about messaging, not publishing. (Oh, and this also facilitates federation, but that's another topic.)

That said, all is not lost for the Laconi.ca code. The good thing is that it's open source, and not only that, but it uses the GNU Affero license, which means any changes made on anyone's server need to be released back as well. Hopefully now that there's a code base to start from, some others (like myself) who are too lazy to start from scratch can get in there and start tweaking and shaping the code into something that will fundamentally scale better.

I envision lots of microblog services out there, actually. Even though services do have a network effect going for them (the reason Twitter has survived for so long), the idea of smaller microblog services is very appealing. One comment on Identi.ca yesterday spelled out its usefulness perfectly - a person wanted a version she could use in her classroom as a way for students to ask questions. Very cool - yes, you could use a live chat room or a simple forum or e-group, but bringing the subscription model into it would add a very interesting dynamic that I think better reflects how people really interact. That's just one example, but I think there are even more out there.

Just my thoughts for now. I'm going to write more about web forums in a bit.

-Russ
