Making the Internet Robot-friendly

6 March 2012

Microdata, part of the HTML5 specification, is a way of semantically marking up web content so that machines can understand it more easily. It’s still a fairly new standard, so it isn’t widely adopted yet, but Google has started using it to add extra information to search result listings, which is bound to give it a boost.

The philosophy behind microdata is to add extra markup to human-readable content so that it also carries meaning for machines. At the moment these ‘machines’ are mostly the crawlers that search engines use to index the web, but in theory the same markup could be useful for all sorts of purposes: scientists could use it to share and data-mine scientific results, and apps could use it to provide product price comparisons and ratings, or information about events and locations.

To test out microdata, I had a go at using it to mark up my About page:

<div itemscope itemtype="http://data-vocabulary.org/Person">
    Hi, I'm <span itemprop="name">Matt Swain</span>.
    I'm a <span itemprop="title">PhD student</span> at the
    <span itemprop="affiliation">University of Cambridge</span>.
</div>

The additional itemscope, itemtype and itemprop attributes leave the content completely unchanged for a human reader, but tell a machine that the div describes a Person and which pieces of text are that person's name, title and affiliation. To check that it works, Google offers a tool that displays all the structured data its web crawlers managed to extract from a page.
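
To get a feel for what a machine can pull out of markup like this, here is a minimal Python sketch that reads the snippet above using only the standard library's html.parser module. It is an illustration rather than a description of how Google's crawler works: the names MicrodataExtractor and extract_microdata are made up for this example, and the code assumes a single, flat item whose itemprop elements contain only text (no nested items or nested tags).

```python
from html.parser import HTMLParser

# The About page snippet from this post.
SAMPLE = """
<div itemscope itemtype="http://data-vocabulary.org/Person">
    Hi, I'm <span itemprop="name">Matt Swain</span>.
    I'm a <span itemprop="title">PhD student</span> at the
    <span itemprop="affiliation">University of Cambridge</span>.
</div>
"""


class MicrodataExtractor(HTMLParser):
    """Collect itemprop values from one flat microdata item (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.item_type = None       # value of itemtype on the itemscope element
        self.properties = {}        # itemprop name -> text content
        self._current_prop = None   # itemprop currently being read, if any
        self._buffer = []           # text chunks collected for that itemprop

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # itemscope is a boolean attribute, so only its presence matters.
        if "itemscope" in attrs:
            self.item_type = attrs.get("itemtype")
        if "itemprop" in attrs:
            self._current_prop = attrs["itemprop"]
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that sits inside an itemprop element.
        if self._current_prop is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Assumption: itemprop elements hold plain text, so the next end tag
        # closes the property being read.
        if self._current_prop is not None:
            self.properties[self._current_prop] = "".join(self._buffer).strip()
            self._current_prop = None


def extract_microdata(html):
    """Return (item_type, properties) for the first flat item in the HTML."""
    parser = MicrodataExtractor()
    parser.feed(html)
    parser.close()
    return parser.item_type, parser.properties


if __name__ == "__main__":
    item_type, properties = extract_microdata(SAMPLE)
    print(item_type)
    print(properties)
```

The text between the tags is ignored entirely; only the item type and the three labelled values survive, which is exactly the point of adding the attributes. A real consumer would also need to handle nested items, multiple values per property, and attributes such as href or content that carry the value instead of the element text.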

As the tool shows, the main practical benefit at the moment is that Google can display author information, such as a photo, next to pages you have created in its search results.