HTML5 Microdata - Thoughts on semantic data, mobile adaption and Facebook FBML

Posted Thursday, October 21, 2010 11:40 am

A few years ago, I had an idea that was essentially a reaction to the introduction of "microformats" - embedding useful data into web pages by using "semantically correct" tags around information like address info or calendar stuff, and then using CSS to make that content blend into the page. The problem (to me) is that microformats were really stretching some of the tag definitions, and there was no real organizing scheme - all the different specs seemed like one-offs. I started messing with a parser, and just gave up because you basically had to hard-code the tags you were looking for, in the order they were supposed to be found.

So my idea was something I called "microschema". It was essentially to do the same sort of data marking, but by overloading an HTML element's class attribute to do it. This works because class can contain multiple entries separated by spaces - some could be used by CSS for formatting, the others for denoting key-names for the data in the element.

I wrote a long email to someone (who never responded - thanks!), but never got around to posting about it on my blog, and then moved on. I dug out the email - here's a bit of it:

Microschema - a proposal for standard markup rules to better enable Microformats

The general idea is to define a set of concrete rules for extracting semantic data from a web page, using low touch approaches to existing standards and practices that have been pioneered by the Microformats group. Instead of using something as complex as RDF or applying arbitrary new namespaces on top of XHTML in order to give the underlying markup more meaning, they instead base their formats on existing XHTML elements and attributes enabling easier implementation and faster adoption.

The problem, however, with Microformats as they are currently created is that they are designed one at a time, using a set of design patterns as a guide, rather than building on a solid set of basic parser rules. Not only is this process slow, but it's error-prone and not scalable - each new microformat will have its own quirks and need its own specifically designed parser to interpret it, rather than using a more generic, yet standardized, system which can be coded into a parser once and re-used. This is what the Microschema ruleset aims to do - helping make Microformat rules clearer, save time in creating new microformats, and spur adoption as well. The Microschema isn't a generic set of rules in the sense of a full-on schema, with ordering and value constraints, the same way that microformats are not meant to replace full-on standards-body created data specs.

Simply put, Microschema is a set of practices that people and parsers can follow to unambiguously identify information on a web page by clearly marking field - value pairs inside normal tags. Microformats can then build on top of these rules to group together these pairs into data standards such as vCard or vCal.

Microschema is an effort to take interpretation out of Microformats. Instead of religiously trying to divine the true meaning of the original HTML spec and finding appropriate new uses for existing tags (i.e. address, abbr, etc.), Microschema is simply about using the tools at hand to create concrete rules about parsing meta-data out of XHTML in a standard, open way.

Example:
        <div>
        <span class="tel home">1-123-123-1234</span>
        <span class="tel office">1-123-123-1235</span>
        <span class="tel office">1-123-123-1236</span>
        </div>
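
A rough sketch of what "coded into a parser once and re-used" could look like for markup like the above, in Python using only the standard library. The names ClassFieldParser and extract_fields are made up for illustration, and the sketch assumes every tag is explicitly closed. It also doesn't try to decide which class tokens are data keys and which are just CSS hooks; that is left to whatever consumes the pairs.

```python
from html.parser import HTMLParser


class ClassFieldParser(HTMLParser):
    """Collect (class tokens, text) pairs for every element with a class attribute.

    Illustrative only: assumes well-formed markup where every tag is closed.
    """

    def __init__(self):
        super().__init__()
        self.fields = []  # finished (tokens, text) pairs, in closing-tag order
        self._stack = []  # one entry per open element: None or [tokens, text parts]

    def handle_starttag(self, tag, attrs):
        # The class attribute is a space-separated list of tokens.
        tokens = (dict(attrs).get("class") or "").split()
        self._stack.append([tuple(tokens), []] if tokens else None)

    def handle_data(self, data):
        # Text counts toward every enclosing element that carries class tokens.
        for entry in self._stack:
            if entry is not None:
                entry[1].append(data)

    def handle_endtag(self, tag):
        if not self._stack:
            return
        entry = self._stack.pop()
        if entry is not None:
            tokens, parts = entry
            self.fields.append((tokens, "".join(parts).strip()))


def extract_fields(html):
    """Hypothetical helper: return all (class tokens, text) pairs in the markup."""
    parser = ClassFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Fed the example above, extract_fields would give one pair per span, each with its class tokens (such as "tel" and "home") and the phone number as text. Nothing in the parser knows about phone numbers specifically, which is the point.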

I actually didn't throw away the idea completely - it led to some of the stuff I worked on within Mowser (my mobile transcoding startup, for those with short memories). Re-using the "class name as data marker" idea, I wanted to use class names as hints for the transcoder to do things like hide areas of a page from being processed (like side-bars, embedded Flash, etc.) or to show certain areas only to mobiles, etc. I don't think I ever got it fully implemented though - I just relied on handheld CSS to do most of the work, but the core of the idea was there: Embed key-field information in the class attributes to help a client or server-side parser better analyze and present the data inside the marked tags. Just recently in fact, Instapaper started supporting this exact idea, so it's obviously a concept that is very useful.

This brings us to the future - I ran across an article on HTML5 Microdata the other day in Webmonkey (they're still around!??!) called Microdata: HTML5's Best-Kept Secret, and what do you know? The HTML5 folks have added a few new attributes (itemscope, itemtype and itemprop) that can go on any tag, letting developers assign arbitrary key-names to the data within the tags. Well, hey! Would you look at that! No need to clumsily overload the class attribute any more, now you can specifically mark elements as belonging to one type of "vocabulary" (aka "namespace"). Apparently, Google is already parsing Microdata when it's spidering web pages.

Here's an example:

<div itemscope itemtype="http://data-vocabulary.org/Organization">
    <h1 itemprop="name">Hendershot's Coffee Bar</h1>
    <p itemprop="address" itemscope itemtype="http://data-vocabulary.org/Address">
      <span itemprop="street-address">1560 Oglethorpe Ave</span>,
      <span itemprop="locality">Athens</span>,
      <span itemprop="region">GA</span>.
    </p>
</div>
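
To give a feel for how little a parser needs to know to pull that apart, here is a minimal sketch in Python (standard library only). It is not the full Microdata extraction algorithm: MicrodataParser and parse_microdata are names I made up, it assumes every tag is explicitly closed, and it only ever uses an element's text as the property value (no special handling for links, meta tags and so on).

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Build nested dicts from itemscope / itemtype / itemprop attributes.

    Simplified sketch: text-only values, well-formed markup assumed.
    """

    def __init__(self):
        super().__init__()
        self.items = []    # top-level items found in the document
        self._frames = []  # one frame per currently open element

    def _enclosing_item(self):
        # The nearest open element that started an item, if any.
        for frame in reversed(self._frames):
            if frame["item"] is not None:
                return frame["item"]
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        names = (attrs.get("itemprop") or "").split()
        owner = self._enclosing_item()
        item = None
        if "itemscope" in attrs:
            item = {"type": attrs.get("itemtype"), "properties": {}}
            if names and owner is not None:
                # A nested item is the value of the property that names it.
                for name in names:
                    owner["properties"].setdefault(name, []).append(item)
            else:
                self.items.append(item)
            names = []  # the value is the item itself, not its text
        self._frames.append(
            {"item": item, "names": names, "owner": owner, "text": []}
        )

    def handle_data(self, data):
        # Only elements holding a plain-text property collect text.
        for frame in self._frames:
            if frame["names"]:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if not self._frames:
            return
        frame = self._frames.pop()
        if frame["names"] and frame["owner"] is not None:
            value = "".join(frame["text"]).strip()
            for name in frame["names"]:
                frame["owner"]["properties"].setdefault(name, []).append(value)


def parse_microdata(html):
    """Hypothetical helper: return the list of top-level items."""
    parser = MicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

Running parse_microdata on the coffee bar markup should yield a single Organization item whose "address" property is itself an Address item with street-address, locality and region. Property values are kept in lists because the same property name can appear more than once.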

That's pretty cool. I'm a firm believer in the "One Web" school of thought when it comes to web pages (i.e. mobile or separate versions of websites are inherently inferior to the main desktop version, and thus users will naturally demand full-browser capabilities rather than settle for lesser quality, regardless of convenience). But I think there is probably quite a bit that can be done to help enhance the experience - say on a mobile handset, tablet or television - by embedding additional data within Microdata.

Then again, it's been a few years since I was thinking about this a lot, and technology trends have moved on. I wonder if this sort of "hidden semantic identifier" within HTML is actually even really needed at all. Look at how Facebook implemented their various widgets and APIs using FBML (or now XFBML). It's their own proprietary markup language that is embedded directly into regular HTML pages. The FBML markup is ignored by the browser, but picked out by their JavaScript and processed into visible widgets when the page is rendered.

If you look at it, the fundamental idea is exactly the same - humans get one form of the information contained within the markup, and the computers get a different version that can be used to help the presentation, functionality, or for data processing. Why bother with embedded attributes like Microdata at all when it could be as simple as developers deciding on some semi-standard tags like <data name="key">value</data> which browsers will ignore, but parsers can use?
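
For what it's worth, the parser side of that would be tiny. A sketch in Python, where the data tag and its name attribute are just the made-up convention from the paragraph above, DataTagParser and extract_pairs are names invented for illustration, and nested data tags are not handled:

```python
from html.parser import HTMLParser


class DataTagParser(HTMLParser):
    """Collect (key, value) pairs from hypothetical <data name="key"> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._key = None  # name of the data tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "data":
            self._key = dict(attrs).get("name")
            self._text = []

    def handle_data(self, data):
        # Only text inside an open, named data tag is collected.
        if self._key is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "data" and self._key is not None:
            self.pairs.append((self._key, "".join(self._text).strip()))
            self._key = None


def extract_pairs(html):
    """Hypothetical helper: return all (key, value) pairs in the markup."""
    parser = DataTagParser()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Everything outside the data tags is simply skipped, so the surrounding page can be as messy as it likes.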

One thing is for sure: embedded RDF, Microformats and various other semantic namespace ideas really haven't worked, so it's encouraging to see this obviously important concept still evolving. It'll be interesting to see in a few years whether Microdata has taken off, or whether it too disappears into technical obscurity like the various schemes preceding it.

-Russ
