Tom Heath's Displacement Activities

Posts tagged with "amazon"

Where is the business value in Linked Data?

Where is the business value in Linked Data? What is the business case for exposing your organisation's data assets according to Linked Data principles and best practices, and being a good citizen of the Web of Data? Whenever I ask myself this question I'm tempted to give some trite answer like "you've got to be in it to win it". Ultimately I think this is true, or at least will be in time, in just the same way that businesses in the nineties asked themselves about the value of having a Web site, and (hopefully) came to realise that this was a moot point; not having a Web site was not an option.

However, I'm impatient, and want to see everyone participating in a Web of Data sooner rather than later. I also want to have a meaningful answer when other people ask the business value question, that isn't just a flimsy "trust me" or an arrogant "you'll see". With that in mind I've tried to clarify my thoughts on the subject and spell them out here.

The first issue to address relates to publishing data on the Web full-stop - we'll get to Linked Data specifics later. APIs for public access to data are now widespread on the Web, with sites like Amazon and Flickr being good examples. A common reaction to this kind of arrangement is to think in terms of the data having been given away, and wonder about how this affects the bottom line.

For both Amazon and Flickr the data represents a core asset, but the route of openness has enabled them to foster communities that use this data and drive revenue generation in other areas, whether that's selling stuff (Amazon) or collecting annual subscription fees (Flickr). People may pay for goods or pay an annual subscription, but my guess is that (perhaps in contrast to enterprises) individuals are unlikely to pay in large numbers for data. In the case of Amazon the data either isn't *that* important to people /really/, is available from other sources, or would become available from other sources if Amazon began to charge at all, or charged more than a nominal amount. For Flickr the same rules apply, except that people are even less likely to pay a separate fee to access a pool of data that they themselves have contributed to. The key point here is that providing APIs to their data has allowed Amazon and Flickr to drive additional traffic into their established revenue channels.

Seen this way, an organisation with rich data assets has two choices. The first is to open up access to its data, and understand that the challenge is now not just about having quality data, but enabling others to create value around these assets and therefore ultimately do the same for the organisation. The second option is to keep the data locked away like the crown jewels, while the organisation and the data itself are slowly rendered irrelevant by cheaper or more open alternatives.

An interesting example in this case is the UK Government's approach to the Ordnance Survey, the national mapping agency. Rather than accepting that taxpayers have already financed the creation of the OS's phenomenal data assets and should therefore have the right to re-use them as they see fit, the UK government requires the OS to generate revenue. Whilst the OS itself is making some great efforts to participate in the Semantic Web, to a large extent their hands are tied. This opens the door (or creates the door in the first place) for initiatives such as OpenStreetMap.

The kind of scenario I can imagine is this: the government continues to not "get it", Ordnance Survey data remains largely inaccessible to those who can't afford to license it, OpenStreetMap data becomes good enough for 80% of use cases, fewer people license OS data, OS raises prices to recoup the lost revenue, less popular locations stop being mapped as they are deemed unprofitable, even fewer people buy OS data, the OS and all its data assets are sold at a fraction of their former "value".

What the UK government doesn't fully understand (despite things like the "Show us a better way" competition), but which has been well demonstrated in the US, is that opening up access to data creates economic benefits in the wider economy that can far outstrip those gained from keeping the data closed and attempting to turn it into a source of revenue. Organisations whose data assets have not been created using public funds may not have the same moral obligation to open up, but the options remain the same: open up or be rendered irrelevant by someone who does.

So if the choice is between openness or obsolescence, how does Linked Data help? Let's look at Amazon and Flickr again. Both these services make data easily available, but have compelling reasons for data consumers to link back to the original site, whether that's to gain revenue from affiliate schemes or to save the hassle of having to host one's own photos at many different resolutions. The net result is the same in both scenarios: more traffic channelled to the site of the data provider.

A typical Web2.0 scenario is that data is accessed from an API, processed in some way, and re-presented to users in a form that differs somehow from the original offering provided by the data publisher -- a mashup. This difference may be in the visual presentation of the data, in added value created by combining the data with that from other sources, or in both. Either way, this kind of mashup is likely to be presented to the user as an HTML document, perhaps with some AJAX embellishments to improve the user experience.

The extent to which the creator of the mashup chooses to link back to the data source is a function of the rewards on offer and the conditions under which the data can be used. Not all services will have the same compelling reasons for data consumers to link back to the data providers themselves, as not all data publishers will be able to afford the kind of affiliates scheme run by Amazon. However, even in cases such as a book mashup based on Amazon data, where the creator links back to Amazon prominently in order to gain affiliate revenue, both the data publisher and the application creator lose. Or at the very least they don't win as much as they could.

This may sound counter-intuitive, so let's look at the details. In processing data to create a mashup, the connection between the data and the data provider is effectively lost. This is a result of how conventional Web APIs typically publish their data. The code snippet below shows data from the Amazon E-commerce Service about the book "Harry Potter and the Deathly Hallows":

<ItemAttributes>
<Author>J. K. Rowling</Author>
<Creator Role="Illustrator">Mary GrandPré</Creator>
<Manufacturer>Arthur A. Levine Books</Manufacturer>
<ProductGroup>Book</ProductGroup>
<Title>Harry Potter and the Deathly Hallows (Book 7)</Title>
</ItemAttributes>

If you look at elements such as <Author>, you'll see that author names are given simply as text strings. The author herself is not identified in a way that other data sources on the Web can point to. She does not have a unique identity, but exists only in the context of this document that describes a particular book. There is no unique identifier for this person that can be looked up to obtain more information. As a result this output from Amazon represents a data "blind alley" from which there's nowhere to go. There is nothing in the data itself that leads anywhere, or even points back to the source - in effect the connection between the data and the data publisher is lost.
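To make the "blind alley" concrete, here is a minimal Python sketch that parses a trimmed copy of the fragment above and reports everything it says about the author. The function name describe_author is my own, purely for illustration, and the indentation of the XML is added for readability; nothing here is part of Amazon's API.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the fragment above; only elements shown in the post are used.
SNIPPET = """
<ItemAttributes>
  <Author>J. K. Rowling</Author>
  <Creator Role="Illustrator">Mary GrandPré</Creator>
  <Manufacturer>Arthur A. Levine Books</Manufacturer>
  <ProductGroup>Book</ProductGroup>
</ItemAttributes>
"""


def describe_author(xml_text):
    """Return everything the fragment tells us about the author."""
    root = ET.fromstring(xml_text.strip())
    author = root.find("Author")
    return {
        "name": author.text,                # a bare text string
        "attributes": dict(author.attrib),  # no id, no URI, nothing to look up
    }


if __name__ == "__main__":
    print(describe_author(SNIPPET))
```

All a consumer gets back is a name and an empty set of attributes. Any link between this "J. K. Rowling" and the same person in another data set has to be guessed by matching strings.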

The connection between publisher and data may be reinstated to some degree in the form of HTML links back to the data source, but by this point the damage is done. These links are tenuous at best and enforced mainly by economic incentives or licensing requirements. In Web2.0-style mashups based on these principles there is no reliable way to express the relationships between the various pieces of source data in a way that can be reused to build further mashups - the effort is expended once for a human audience and then lost.

In contrast, Linked Data mashups (or "meshups" as they sometimes get called) are simply statements linking items in related data sets. Crucially these items are identified by URIs starting "http://", each of which may have been minted in the domain of the data publisher, meaning that whenever anyone looks up one of these URIs they may be channeled back to the original data source. It is this feature that creates the business value in Linked Data compared to conventional Web APIs. Rather than releasing data into the cloud untethered and untraceable, Linked Data allows organisations and individuals to expose their data assets in a way that is easily consumed by others, whilst retaining indicators of provenance and a means to capitalise on or otherwise benefit from their commitment to openness. Minting URIs to identify the entities in your data, and linking these to related items in other data sets presents an opportunity to channel traffic back to conventional Web sites when someone looks up those URIs. It is this process that presents opportunities to generate business value through Linked Data principles.
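As a rough illustration of that point, the sketch below holds a tiny "meshup" as plain subject/predicate/object statements and recovers the publishers' domains from the identifiers alone. The example.com and example.org URIs, the review predicate, and the function name source_hosts are all hypothetical, made up for this sketch; only the Dublin Core creator property is a real vocabulary term.

```python
from urllib.parse import urlparse

# Hypothetical URIs minted in two made-up publisher domains.
BOOK = "http://books.example.com/id/book/deathly-hallows"
AUTHOR = "http://books.example.com/id/person/jk-rowling"
REVIEW = "http://reviews.example.org/id/review/42"

# A meshup is just a set of statements linking items across data sets.
TRIPLES = [
    (BOOK, "http://purl.org/dc/terms/creator", AUTHOR),
    # Hypothetical predicate in the reviewer's own namespace.
    (REVIEW, "http://reviews.example.org/ns#reviews", BOOK),
]


def source_hosts(triples):
    """List the domains that the subject and object URIs lead back to."""
    hosts = set()
    for subject, _predicate, obj in triples:
        for term in (subject, obj):
            if term.startswith("http://"):
                hosts.add(urlparse(term).netloc)
    return sorted(hosts)


if __name__ == "__main__":
    print(source_hosts(TRIPLES))
```

However many times these statements are copied or recombined, each identifier still names the domain that minted it, so a lookup has somewhere to go. That is the property the XML fragment from the conventional API lacks.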

On the Web, but not *In* the Web

In my recent Talk with Talis podcast, Paul Miller and I got chatting about the conceptual difference between exposing data on the web using Web2.0-style APIs (such as Amazon), and serving up Linked Data (also look here for TimBL's original Design Issues document, which spells out what must rapidly be becoming "the four commandments of Linked Data"). The discussion centers around the "On the Web, but not In the Web" distinction. Kingsley liked the discussion, and suggested it should be blogged for posterity, so here is a transcribed excerpt (starting at 28m41s through the podcast):

Paul Miller: You said that reviews you put into Revyu.com are available on the web as a normal review, and also available on the Semantic Web, to be embedded in other places. Now, how is that different to me doing a review on Amazon, and cutting and pasting it and sticking it into epinions, or my blog, or whatever?

Tom Heath: OK, so, if you do the review in Amazon it will be available on the Web in two ways. It'll be available on the HTML Web for people to browse with their browser, and the review would also be available through the Amazon Web Services API, which means that it is reusable to an extent: I can query the Amazon Web Services API and retrieve that information and do something with it. But this kind of highlights a really key distinction between Web2.0 APIs and the Semantic Web, or the Web of Data, or the Linked Data Web, or however you choose to name it, in that by default if you write a review in Revyu then it's there available, it has a URI, people can make other statements about it, they can reference it in other RDF statements on the Semantic Web, and they can also link to it from the HTML Web.

So, in contrast, if you write a review in Amazon, then the ability to link that review with other bits of information is very limited. You can't necessarily easily say that the review references a certain item or is provided by a certain person, in any way other than embedding this information in XML elements within the results from the Amazon Web Services API. So, this information is available on the Web, but it's not really in the Web, if that distinction makes sense.

It's a distinction that Tim Berners-Lee has, um, well I'm not sure if he's explicitly made the distinction but he always uses the phrase "in the Web" and I never really understood, I never really got why he was using this form of words until recently, when it dawned on me that something being on the Web doesn't really make it in the Web, and I think that's the key distinction between data from Amazon, the Amazon API, or any of the other Web2.0 kind of APIs, that it's there available on the Web but it's not really in the Web, because it's hard to link it together, which is something that RDF does very well, which XML doesn't really do.