The Future of Web Aggregation

Let's think back a few years ago to what I think we can all consider the heyday of news feeds - a time when blogs were still a relatively new concept, companies like Feedburner were acquisition targets, Atom was an upstart format taking on the hegemony of RSS, and things like Twitter and Facebook were only just gaining traction.

For a while there, there was a lot of buzz around news feeds as consumers got more used to the idea. Yahoo!, for example, added feeds to its My Yahoo! page, Google created iGoogle, mobile phone makers all rushed to create "widgets", etc. I remember the various news reader options like Bloglines, NetNewsWire, NewzCrawler, FeedDemon, etc. Apple even embraced the idea of "Podcasts" directly on their devices, and the whole concept of feeds became mainstream.

Since then, however, feeds have started to fade in importance. Google introduced its Reader, and everyone who was a real news junkie moved to using that. Everyone else just uses Facebook or Twitter. In fact, if you're a new news site or blog, it's pretty much required to have a Twitter and FB account to push out your updates nowadays.

"Following" has replaced "subscribing" it seems.

(As an aside, considering the wars that were fought over "full news feeds", the shift to Twitter as a primary news reader blows my mind. Apparently, links and snippets are enough for the vast majority of Internet users. Not even original URLs matter much either - shortened, trackable links are fine too. Imagine trying to take that position a few years ago?)

How far have feeds fallen from grace? Well, a prime example is Facebook - they're notorious for keeping their data locked up and providing few feeds, if any. Yes, there are a few still working, but you have to hunt to find them, and you still miss most of what's happening in your network. But even more telling, some of the hottest new startups that have gotten press lately don't use feeds at all. Check out Quora, for instance. Being able to keep track of all those question/answer threads and various categories via feeds used to be a number-one priority for new sites, but Quora hasn't even bothered. Same goes for Hunch as well.

What surprised me the most, however, was when I read in a Wired news article that the new iPad-based news reader Flipboard doesn't use RSS or Atom feeds *at all*. It simply scrapes the original content pages using a version of the code that powers Readability, the JavaScript-based bookmarklet for making websites more readable that Apple recently integrated into Safari. That same article points out that Instapaper does the same sort of thing.

So wow, RSS and Atom XML feeds are starting to become as irrelevant as "Made for Netscape" buttons and <blink> tags. As someone who has written their own custom news reader and uses it daily, I'm pretty amazed.

But wait a second - without any standard semantic markup to take the place of a feed's title, pubdate and description elements (the basics of any feed), it's going to be really hard to grab a page's content correctly, no? Well, it's not easy - and not always perfect - but it's definitely possible, especially with better and better web-parsing engines hosted right on the server itself.
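
To make that concrete, here's a minimal sketch (in Python, using lxml, which sits on top of libxml2) of how you might recover feed-like fields - title, pubdate, description - from a plain HTML page that has no feed at all. The meta-tag names and the "first long paragraph" fallback are just my own heuristics, not any particular product's algorithm.

```python
# A minimal sketch: pull feed-like fields out of raw HTML when no feed exists.
from lxml import html

def extract_entry(page_source):
    doc = html.fromstring(page_source)

    def meta(name):
        # Check both OpenGraph-style property= and plain name= meta tags.
        hits = doc.xpath('//meta[@property=$n or @name=$n]/@content', n=name)
        return hits[0].strip() if hits else None

    title = meta('og:title') or (doc.findtext('.//title') or '').strip()
    pubdate = meta('article:published_time') or meta('date')
    summary = meta('og:description') or meta('description')

    if not summary:
        # Fall back to the first reasonably long paragraph of body text.
        for p in doc.xpath('//p'):
            text = p.text_content().strip()
            if len(text) > 80:
                summary = text[:300]
                break

    return {'title': title, 'pubdate': pubdate, 'summary': summary}
```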

In fact, much of this problem reminds me a lot of what I did with Mowser a couple of years ago to re-purpose web content for the mobile web: doing intelligent things like taking out styles and functionality that mobiles couldn't handle, swapping out Google AdSense JavaScript for its mobile equivalent, etc. What I really built into the server was a virtual web-parsing engine. It used libxml2 underneath to consume the HTML, and then I would walk the DOM and strip out the stuff that wasn't needed. Mowser's main strength (in my opinion) was letting publishers leave hints in their markup to help better process their pages (using class attributes, since you can have multiple per element), rather than leaving it all up to the ability of automatic transcoders, which are notoriously bad at it. I also had the idea of publishing a wiki-style site where regular users could help give hints for sites as well, improving the overall quality of the final mobile page.
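
Roughly, that kind of DOM walk looks something like this - again a sketch in Python with lxml, not the actual Mowser code. The hint class names ("hint-skip", "hint-main") are invented here for illustration; they aren't the vocabulary Mowser actually used.

```python
# Sketch of a Mowser-style pass: strip what a phone can't handle, honor
# publisher hints left in class attributes.
from lxml import html

STRIP_TAGS = {'script', 'style', 'object', 'embed', 'iframe', 'form'}

def mobilize(page_source):
    doc = html.fromstring(page_source)

    for el in list(doc.iter()):
        if not isinstance(el.tag, str):
            continue  # skip comments and processing instructions
        classes = (el.get('class') or '').split()
        if el.tag in STRIP_TAGS or 'hint-skip' in classes:
            # drop_tree() removes the element but keeps surrounding text flow.
            el.drop_tree()
        else:
            # Throw away inline styling the transcoder can't rely on.
            el.attrib.pop('style', None)

    # If the publisher marked the main content block, return just that part.
    main = doc.xpath('//*[contains(concat(" ", normalize-space(@class), " "), " hint-main ")]')
    return html.tostring(main[0] if main else doc, pretty_print=True)
```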

In a similar vein, Instapaper is now starting an effort to create a "community text-parsing configuration" for its service. The idea is that users of Instapaper's text-only feature can create a configuration which helps the text parser suss out the important bits of a web page, and it also lets publishers help by leaving hints in their site's class attributes.
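
As a guess at what a shared configuration like that could look like in practice - and this is purely illustrative, not Instapaper's actual format - you could imagine a per-site map from hostname to the selector that marks the article body, with a dumb fallback for sites nobody has configured yet:

```python
# Illustrative per-site parsing config: hostname -> XPath for the article body.
from urllib.parse import urlparse
from lxml import html

SITE_CONFIG = {
    'example-blog.com': {'body': '//div[@class="post-body"]'},
    'example-news.com': {'body': '//div[@id="article"]'},
}

def extract_body(url, page_source):
    doc = html.fromstring(page_source)
    host = urlparse(url).netloc.lower()
    if host.startswith('www.'):
        host = host[4:]

    rule = SITE_CONFIG.get(host)
    if rule:
        hits = doc.xpath(rule['body'])
        if hits:
            return hits[0].text_content().strip()

    # No community rule yet: fall back to the longest paragraph on the page.
    candidates = [p.text_content().strip() for p in doc.xpath('//p')]
    return max(candidates, key=len) if candidates else ''
```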

For my service, I had to deal with forms, dynamic content, tables, lists and images, etc., so I was never able to get it perfect. But even just trying to extract only story content (titles, paragraphs, etc.) like Flipboard and Instapaper are doing is not always going to work well. Web pages can be created in a million different ways, and usually are.

This brings me to one of the cooler sites I've seen lately: ScraperWiki. It's sort of like Yahoo! Pipes, but rather than having to deal with drag-and-drop icons to do the parsing, you get to write the actual code (in PHP or Python) yourself. It's a very, very cool idea (and one, I have to say, I've had before). Any scrapers you write are shared, wiki-style, with everyone else so they can be re-used and modified. This spreads the burden of accurately parsing a site across lots of people, and could be a killer service.
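
To give a flavor of it, here's a toy example of the kind of scraper you might share on a site like that - plain Python, with a made-up URL and page structure, since a real scraper would target one specific site and then get refined wiki-style by other users:

```python
# Toy shared scraper: pull structured rows (title, link) out of a listing page.
import urllib.request
from lxml import html

def scrape_headlines(url='http://example.com/news'):
    page = urllib.request.urlopen(url).read()
    doc = html.fromstring(page)

    rows = []
    for item in doc.xpath('//div[@class="story"]'):
        rows.append({
            'title': item.findtext('.//h2', default='').strip(),
            'link': (item.xpath('.//a/@href') or [''])[0],
        })
    return rows
```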

In fact, I propose that scraping is the obvious evolution of web-based content distribution. The fact is that separate but equal is never equal - in this case, a separate feed for a web site's content will always be inherently inferior to its main HTML content. (This goes for "mobile-specific" versions as well.) Search engines already do some sort of page "scraping" in order to extract content and index the results, right? This new trend is essentially bringing functionality that has been limited to specific use cases (search) to a whole new generation of applications.

Sites like Quora and Hunch no longer have to worry about having feeds. They know they can rely on their users to use social networks to distribute links, and users will either view their sites via a web browser, or any service they use will mimic a browser (i.e. scraping) to grab that content automatically. No need to worry about full-content vs. summaries, etc.

All this brings me, finally, to my main thought: Are we on the verge of seeing a new generation of news readers?

The basic feed reader functionality of dumbly polling a site over and over again, grabbing the XML when it updates, checking for duplicates and then presenting the results in a long list is simply not compelling to the average user. Only the most hardcore of the info junkies (like myself) want to deal with that sort of deluge of content. Though the core idea is great - bringing information from sites all over the internet into one spot for quick and easy access - the implementations up until now have left a lot to be desired.
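
That whole loop fits in a few lines, which is kind of the point. Here's the "dumb polling" in miniature - a sketch assuming the feedparser library, with placeholder feed URLs:

```python
# Minimal feed-reader loop: poll, parse, dedupe, dump into one long list.
import time
import feedparser

FEEDS = ['http://example.com/atom.xml', 'http://example.org/rss']
seen = set()

def poll_once():
    fresh = []
    for url in FEEDS:
        parsed = feedparser.parse(url)
        for entry in parsed.entries:
            # Use the entry id if the feed provides one, otherwise the link.
            key = entry.get('id') or entry.get('link')
            if key and key not in seen:
                seen.add(key)
                fresh.append((entry.get('title', ''), entry.get('link', '')))
    return fresh

while True:
    for title, link in poll_once():
        print(title, link)
    time.sleep(15 * 60)  # ...and then do it all over again in fifteen minutes
```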

The newest content and community startups have apparently recognized this and forgone feeds altogether. Why bother? Only a select few people actually use the feeds anyway; the rest simply use them as an easy way to steal content for AdSense-driven spam sites. Sites that want to share their content do so now by providing custom APIs, with developer keys, etc. - it's much easier to track and, in the end, provides a richer experience for their users.

The newest aggregators have also recognized the problems with news readers - and by extension, feeds themselves - and are instead focusing on making content easier and more enjoyable to consume. Whether it's Pulse's new image-heavy horizontal layout for the iPad or Instapaper's text-only view of saved pages, the idea is to get away from the Google Reader river of news.

By breaking out of the box that feeds create, aggregators can do so much more for their users. Think beyond scraped content as just a simple replacement for feed content; instead, think about the intelligence that content parsers have to contain in order to work. To extract the correct content from a website, a crawler has to intelligently process a page's structure and function, maybe even its security and scripting logic as well. This is a huge leap forward because it enables more intelligence and awareness to be integrated into the experience, which is exactly what users want.
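
For a sense of what that intelligence looks like at its simplest, here's a rough approximation of the kind of scoring a Readability-style parser does to find the story on a page: reward blocks with lots of paragraph text, punish blocks that are mostly links (navigation, sidebars). The real algorithms are more involved; the thresholds and weights below are arbitrary.

```python
# Crude content scoring: find the block most likely to be the article body.
from lxml import html

def best_content_block(page_source):
    doc = html.fromstring(page_source)
    best, best_score = None, 0.0

    for block in doc.xpath('//div | //td | //article'):
        text = block.text_content()
        text_len = len(text.strip())
        if text_len < 200:
            continue  # too short to be the story

        link_len = sum(len(a.text_content()) for a in block.findall('.//a'))
        link_density = link_len / text_len
        paragraphs = len(block.findall('.//p'))

        score = paragraphs * 25 + text_len * (1.0 - link_density)
        if score > best_score:
            best, best_score = block, score

    return best  # may be None if the page has no obvious article body
```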

Flipboard, for example, is interesting not only because it formats the results into a friendly, newspaper-like experience, but also because it generates the news from your contacts on Twitter and Facebook. It doesn't just dumbly grab a list of things you already read; it integrates with your existing network to get content you didn't know existed. It doesn't care whether the link it's grabbing has a feed or not - it parses the page, pulls out a headline and some content, and presents it to the user to browse at their leisure. This sort of thing is just the beginning.

Every day I read my news reader and I get frustrated with the experience in various ways. I see more duplicate content than I need to (especially with today's content mills repeating every little item), or I see too little content from some sources and too much from others. Some sites constantly add inane images to attract my attention or include linked feed advertising; others include great illustrations that I don't want to miss, but sometimes do because I mentally start to ignore all the images. I also can't get to any information from within my company's firewall, though there are plenty of updates to internal blogs and wikis I'd like to keep track of. And despite keeping track of hundreds of sites, I still end up clicking over to Facebook once in a while to catch up on shared videos or links or other updates, as that stuff doesn't have a feed.

Essentially, it's becoming more and more work to separate signal from noise, and it never seems that everything you want to keep track of has a feed. I can't imagine what it must be like if your job is to parse news for a living. Imagine being an analyst for a bank and having to wade through the cruft you'd get in a news reader every day, not to mention the monthly publications, etc.

What I think is going to happen is that both browsers and aggregator services in the cloud are going to start enabling a lot more logic and customization. We see the start of it now with Greasemonkey scripts and browser plugins and extensions, but I think another level of user-friendly artificial intelligence is needed. I'm talking about applications that parse web pages, gather content, and display it all intelligently, economically and without magic (which is almost always wrong), as directed by your specific choices of what you think is good and bad. ScraperWiki is a first step towards this sort of thing, but really it's only just the beginning.

Anyways, a few years ago I decided that the mobile web as a separate entity was a dead end because of quickly improving mobile browsers, and it turns out I was pretty spot on. It never dawned on me that the same logic could be applied to web feeds - thanks to things like quickly improving server-side parsers and bad user experiences - but now I'm seeing that it does. I personally still wouldn't launch a new site today without a decent feed, but I bet it won't be long before I stop worrying about it, and I bet there are a lot of other web developers who feel that way already.

-Russ