Linked Data: my challenge

What with Gordon Brown’s recent (just an hour or so ago) announcement of lots of digital goodness at the “Building Britain’s Digital Future” event, the focus sharpens once again on Linked Data.

I’ve been sitting on the sidelines sniping gently at Linked Data since it apparently replaced the Semantic Web as The Next Big Thing. I remained cynical about the SW all the way through, and as of right now I remain cynical about Linked Data as well.

This might seem odd from someone obsessed with – and a clear advocate of – the opening up of data. I’ve blogged about, talked about and written papers about what I’ve come to call MRD (Machine Readable Data). I’ve gone so far as to believe that if it doesn’t have an API, it doesn’t – or shouldn’t – exist.

So what is my problem with Linked Data? Surely what Linked Data offers is the holy grail of MRD? Shouldn’t I be embracing it as everyone else appears to be?

Yes. I probably should.

But… Linked Data runs headlong into one of the things I also blog about all the time here, and the thing I believe in probably more than anything else: simplicity.

If there is one thing I think we should all have learned from RSS, simple APIs, YQL, Yahoo Pipes, Google Docs, etc, it is this: for a technology to gain traction it has to be not only accessible, but simple and usable, too.

Here’s how I see Linked Data as of right now:

1. It is completely entrenched in a community who are deeply technically focused. They’re nice people, but I’ve had a good bunch of conversations and never once has anyone been able to articulate for me the why or the how of Linked Data, and why it is better than focusing on simple MRD approaches, and in that lack of understanding we have a problem. I’m not the sharpest tool, but I’m not stupid either, and I’ve been trying to understand for a fair amount of time…

2. There are very few (read: almost zero) compelling use-cases for Linked Data. And I don’t mean the TBL “hey, imagine if you could do X” scenario, I mean real use-cases. Things that people have actually built. And no, Twine doesn’t cut it.

3. The entry cost is high – deeply arcane and overly technical, whilst the value remains low. Find me something you can do with Linked Data that you can’t do with an API. If the value was way higher, the cost wouldn’t matter so much. But right now, what do you get if you publish Linked Data? And what do you get if you consume it?

Now, I’m deeply aware that I don’t actually know much about Linked Data. But I’m also aware that when someone like me – with my background and interests – doesn’t know much about Linked Data, there is a massive problem somewhere in the chain.

I genuinely want to understand Linked Data. I want to be a Linked Data advocate in the same way I’m an API/MRD advocate. So here is my challenge, and it is genuinely an open one. I need you, dear reader, to show me:

1. Why I should publish Linked Data. The “why” means I want to understand the value returned by the investment of time required, and by this I mean compelling, possibly visual and certainly useful examples

2. How I should do this, and easily. If you need to use the word “ontology” or “triple” or make me understand the deepest horrors of RDF, consider your approach a failed approach

3. Some compelling use-cases which demonstrate that this is better than a simple API/feed based approach

There you go – the challenge is on. Arcane technical types need not apply.

28 thoughts on “Linked Data: my challenge”

  1. I’m still a fan of the original guidelines for Linked data – paraphrased:

    Give each thing a ‘permanent’ page on the web.
    Put information about that thing on that page.
    Put connections from that page to other pages to make it more easily understood.

    Wikipedia does this excellently, without having to think about RDF/SW.

    For me though, a SPARQL endpoint containing data is not the same as having the data on the web.

    The metaphor that works for me is that SPARQL endpoints are to Linked data as access to an undocumented SQL server is to CSV files.

  2. @Ben – thanks, that’s useful. I have to say, SPARQL brings me out in a rash, but I’ve been trying to work out whether that’s because I don’t know it or because it is genuinely hard…

    Got any examples of this kind of data in use (i.e. being consumed as opposed to being produced)?

  3. I find myself in a very similar position, having been recently introduced to the world of Linked Data by colleagues. So I started attending “the meetings” and found many people asking the same awkward questions you do. There are practical solutions to some low-level questions, some theorising on some high-level ones, and a great deal unanswered in between. But I’ve found it to be an open and welcoming community, even to a newbie like me. My advice is to meet them halfway: go to some meetups, pose lots of questions; that way you’ll raise the bar on everyone’s understanding.

  4. @Anthony – that’s interesting, maybe I should simply make a bit more effort to engage with the people rather than trying to engage with the technology. Which particular meetings do you go to / would you recommend?

  5. “The metaphor that works for me is that SPARQL endpoints are to Linked data as access to an undocumented SQL server is to CSV files.”

    This could be true, but this is the whole point of ontology. As long as someone uses an ontology correctly, or devises their own in a meaningful way, then the SPARQL endpoint is documented, in as much as it can be asked to describe all its own concepts.
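
    For instance (a sketch, not specific to any one endpoint), you can ask an endpoint to enumerate the classes and predicates it actually uses:

        # List every class used in the store.
        SELECT DISTINCT ?class WHERE { ?s a ?class }

        # List every predicate used in the store.
        SELECT DISTINCT ?predicate WHERE { ?s ?predicate ?o }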

  6. Thanks for the concise, provocative post, Mike! There is an old marketing adage that says, “Perception is… all there is!” and yours is a legitimate sampling of the perceptions of those who haven’t (yet…) drunk the #LinkedData Kool-Aid! 😉

    I would like to challenge you on your Linked Data vs. API assertion. Having constructed a few over the years AND having engaged in ensuing political battles over their design, I would argue that (a) good APIs are (also?) hard, and (b) by adopting Sir Tim’s core principles, the remaining interoperability problem is “simply” one of establishing data relationships.

    In my view, the top “competitor” to the linked data approach in the open data space is not access via API, but rather the downloading of CSV files. Why? Because virtually any data store can export and ingest CSV, because it is relatively easy to create code (script or otherwise) to parse into one’s current model, and it is bog simple to point, click and download.

    In my view, the value of the linked data approach is to make the datasets easier to find and easier to access with a minimum of source-specific coding.

    • “In my view, the top “competitor” to the linked data approach in the open data space is not access via API, but rather the downloading of CSV files.”

      I have to agree here. And the steps towards creating linked data instead of CSV are not that arduous, provided you can get your input early enough in the process to start generating unique URIs for each thing you want to describe.

  7. @John – thanks for commenting

    I guess I’m not quite saying “API vs Linked Data” from a technical perspective – I’m starting to understand that – er – I don’t actually really understand what the differences are in any great depth 🙂

    The comparison was less technical, more: “Hey, I can do APIs but I can’t do Linked Data. Help me understand why.”

    In terms of making datasets easier to find and less source-specific: agree, this seems absolutely key. But at the same time, and calling back to your “perception” point, it seems to me telling that there is a thriving community around lightweight web-based APIs (see the hordes of examples at http://www.programmableweb.com/) and very little around RDF/LD. This may be a matter of time (although, as I say, it’s not like these approaches haven’t been around since the dawn of the web) but IMO it is more about too much complexity. This is what concerns me.

  8. “1. It is completely entrenched in a community who are deeply technically focused. They’re nice people, but I’ve had a good bunch of conversations and never once has anyone been able to articulate for me the why or the how of Linked Data, and why it is better than focusing on simple MRD approaches, and in that lack of understanding we have a problem. I’m not the sharpest tool, but I’m not stupid either, and I’ve been trying to understand for a fair amount of time…”

    It was a huge learning curve for me too; but once I got through everything it turns out that it’s really rather simple. Since then I’ve been able to introduce a few fellow developers to linked data in a couple of hours – much better than the 6-month learning curve I went through. It is important to note, though, that this is not only a whole new tech stack but also a whole old tech stack, used properly – so much of the comprehension time comes from getting rid of old bad habits and misunderstandings about HTTP, URIs and the like. One caveat: having been a developer for years, I’ve found that many “new” developers can pick up linked data and the enabling technologies much more easily than old devs, purely for the aforementioned reason.

    “2. There are very few (read: almost zero) compelling use-cases for Linked Data. And I don’t mean the TBL “hey, imagine if you could do X” scenario, I mean real use-cases. Things that people have actually built. And no, Twine doesn’t cut it.”

    If it helps any, I rebuilt a full application to do things the linked data way: dropped the RDBMS, used only a quad store, all data models in OWL, all data in RDF, accessed via HTTP and SPARQL rather than SQL. The system dropped from 2500+ classes to under 30, it is at least 10x faster, and the functionality exposed (both implemented and unimplemented) is virtually endless.

    The key features of linked data behind the firewall (i.e. setting aside consuming and exposing linked open data) are the predicate, the URI and the triple/quad store – together this little setup lets you “link” all your data, store it all in a single “table” and know what that data is thanks to the predicate (which is a class + property in a single value).

    This opens up a new realm of functionality to us developers, allowing you to query all your data as if it were in the same table, store any data, extend classes however you want, and also multi-extend classes (which you can’t do any other way) – for example “class A extends B, C, D, E”. Further, to extend your system by, say, adding 2 new properties to a class, all you need to do is change the class blueprint (the OWL file) and everything else is done: the data is magically stored because it’s RDF, it’s all linked in and queryable via SPARQL and the quad store, and all you need to do is allow the app to save the new predicate and use it where needed.
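
    To sketch that “change the blueprint” step (a hedged example: the class and property names are hypothetical, and it assumes a store that accepts SPARQL Update):

        PREFIX owl:  <http://www.w3.org/2002/07/owl#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

        # Adding two new properties to a class is just a few more
        # triples in the ontology – no schema migration, no ALTER TABLE.
        INSERT DATA {
          <http://example.org/schema#phone>    a owl:DatatypeProperty ;
              rdfs:domain <http://example.org/schema#Person> .
          <http://example.org/schema#employer> a owl:ObjectProperty ;
              rdfs:domain <http://example.org/schema#Person> .
        }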

    All that is behind the firewall; when you put 2+2 together and realise that everybody’s data could be stored the same way, using open models (OWL) that are web accessible, then you suddenly see that you can build applications which literally have access to all the data in the world.

    A simple example is the good old “select your country” drop-down. Normally this means: build a country table, find all the countries + associated ISO codes and insert them, make a Country class, then write all the code to work with it. In the linked data world, it’s just a simple SPARQL query to the web and you have all that data instantly with no real coding – further still, if a country changes name or a new one comes about… then your data is magically updated too. That’s just a basic “select your country” – grok that you can do that for everything and you won’t look back 🙂
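
    To sketch that country query (assuming the public DBpedia endpoint as the source; any endpoint publishing country data would do):

        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX dbo:  <http://dbpedia.org/ontology/>

        # One query replaces the hand-maintained country table.
        SELECT ?country ?name WHERE {
          ?country a dbo:Country ;
                   rdfs:label ?name .
          FILTER (lang(?name) = "en")
        }
        ORDER BY ?name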

    re: “3. The entry cost is high – deeply arcane and overly technical, whilst the value remains low. Find me something you can do with Linked Data that you can’t do with an API. If the value was way higher, the cost wouldn’t matter so much.”

    Agreed about the cost (at the minute; as it’s adopted, more demos and code will be available and the cost of entry will be much lower). As for the value: in the 10 years I’ve been developing, nothing has ever come close. Developing is fun again, the world is my oyster, and I have some very happy clients who pay me very well – I can offer my clients functionality they couldn’t dream of, and that virtually no other dev can offer them; not only that, but things that used to take me weeks I can now do in minutes/hours. I’ve never seen anything of higher value.

    re: “But right now, what do you get if you publish Linked Data? And what do you get if you consume it?”

    You have to go behind the firewall first to see the benefits, as outlined above; then you’ll see what can be gained by publishing and consuming. Really though, focus on getting it implemented without the “open” bit, then you’ll naturally just want to open up your data and consume other people’s 🙂

    do hope that helps!

    Nathan

  9. Mike, w.r.t. http://www.programmableweb.com I believe this is a valid question, but I think that rich set of accessible APIs is more suited to the “mashup” phase of dataset-based application construction than the “meshup” phase.

    To clarify: by “mashup,” I mean those applications that are constructed by marshalling functionalities exposed via web APIs. By “meshup,” I mean datasets that are constructed by combining datasets, usually published and accessed via #linkeddata principles.

    Clearly the two can and must co-exist. I think the difference is that web APIs are expected to be specialized, but access to datasets should be, in the first approximation, uniform, allowing the specialization to happen (possibly, usually) in the specific vocabularies.

    Then, given a standard approach to accessing data (like URLs, HTTP and HTML for web pages) communities can focus on shared vocabularies; see for example @iand’s OpenVocab project: http://open.vocab.org/

  10. @Nathan – thanks, great comment – really useful. I might drop you a line via your blog if that’s ok, get some more specific pointers as to where and how to start…

  11. @John – yes, thanks. Agree about co-existence. I guess I’m partly trying to see parallels to the undoubted success of web APIs in the emerging Linked Data field. It could be that the latter is simply much heavier-weight (in which case I should just forget trying to understand and evangelise about it)…

  12. 1. Why I should publish Linked Data. The “why” means I want to understand the value returned by the investment of time required, and by this I mean compelling, possibly visual and certainly useful examples

    Two benefits. One is that I, “standard developer”, can come along and query your entire site as if it had an API – “select all posts with a comment made by me” – and use that data. The second benefit is that you can do this too, saving you lots of coding and doing away with much of your application altogether: simply query and render the results (hence why I mentioned dropping 2500 classes down to 30 earlier). Linked data removes the need for 80% of code, IMHO; and all the code you do write is truly re-usable (with the exceptions of specific queries and presentation templates).
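
    A sketch of that “posts with a comment made by me” query, assuming the site publishes its posts and comments using the SIOC and FOAF vocabularies (the person URI is a hypothetical placeholder):

        PREFIX sioc: <http://rdfs.org/sioc/ns#>
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>

        # Every post that has at least one comment made by me.
        SELECT DISTINCT ?post WHERE {
          ?comment sioc:reply_of ?post ;
                   foaf:maker <http://example.org/people/me#id> .
        }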

    2. How I should do this, and easily. If you need to use the word “ontology” or “triple” or make me understand the deepest horrors of RDF, consider your approach a failed approach

    Consider what I’d say a failed approach, then; the ontology and the triple are as needed as the SQL, the POJO, the class, the struct, the array and the variable. The ontology and the triple are where the power sits; remove them from your equation and you’re simply going through a pointless, confusing exercise which you’ll reap no benefits from (although others will, as they can consume your data and treat your entire site as an API).
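
    For what it’s worth, a triple is nothing scarier than subject, predicate, object – a sketch (the post URI is hypothetical):

        PREFIX dc: <http://purl.org/dc/elements/1.1/>

        # subject                       predicate  object
        INSERT DATA {
          <http://example.org/posts/42> dc:title   "Linked Data: my challenge" .
        }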

    3. Some compelling use-cases which demonstrate that this is better than a simple API/feed based approach.

    Ack, the use-case is everything – you need to grok that first. If ten sites publish linked data, then I can query those ten sites in a single query as if all the data were local, in a single table, in a format I already understand. Or, in API terms: I can call ten remote APIs in a single call and correlate, filter and query the results, all in one line of “query” – show me another way of doing that and I’ll eat my hat. Swap out the ten with one million and it’s still possible. I’d say show me a compelling use-case not to adopt that level of functionality and MRD-ness.
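
    A sketch of that multi-site query using SPARQL federation (the endpoint URLs are hypothetical placeholders; in practice you would add one SERVICE block per site):

        PREFIX foaf: <http://xmlns.com/foaf/0.1/>

        # Pull names from two remote endpoints in a single query.
        SELECT ?name WHERE {
          { SERVICE <http://site-one.example/sparql> { ?p foaf:name ?name } }
          UNION
          { SERVICE <http://site-two.example/sparql> { ?p foaf:name ?name } }
        }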

    regards !

  13. @Richard Watson

    “This could be true, but this is the whole point of ontology. As long as someone uses an ontology correctly, or devises their own in a meaningful way, then the SPARQL endpoint is documented, in as much as it can be asked to describe all its own concepts.”

    I don’t really see the difference – I can query a SQL server to find header values and keys.

    The one advantage an ontology has over ordered data is that typically it is published, dereferenceable and has documentation associated with its URI outlining the usage, underlying models and structure.

    Take for example data.gov.uk:

    Listing all the types of things in the endpoints, I find generic ontologies for data (SCOVO), geo, etc, but the ontologies I am most interested in (the ones peculiar to data.gov.uk itself) look like dereferenceable HTTP ontologies, but aren’t:

    http://transport.data.gov.uk/0/ontology/traffic#CountPoint

    What’s a ‘Count Point’? How is this gathered? What does this constitute?

    I tried a DESCRIBE query on the endpoint, in case it held the descriptive data on this URI, only for it to return an empty result set.
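
    For reference, the query was of this shape:

        DESCRIBE <http://transport.data.gov.uk/0/ontology/traffic#CountPoint>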

    If you have seen the work that I do, or the things I advocate, you will see that I have already been won over by the arguments for Linked Data as a mechanism of publication.

    However, it’s still very much a Curate’s Egg – the great cases of LoD publishing (BBC, the interconnected sets of medical gene/drug data, etc, etc) are somewhat spoiled by the publishing attempts that don’t provide a good basis or springboard for those that are new to the area.

    • “I don’t really see the difference – I can query a SQL server to find header values and keys.”

      OK, well assuming you (a) have a connection to the SQL endpoint (unusual for the ordinary individual wanting to get public data) and (b) are familiar with the particular brand of SQL syntax they are using, you could get a list of tables and fields. For one site.

      So if (for instance) every local authority were to publish its waste collection timetables via SQL, you would need a connection to each one and to work out the particular fields in each case. Then you would have to work out a separate query for each one to return data in the format you require and store that somewhere. Then you would have to perform a query on the stored data.

      On the other hand, if they all had a SPARQL endpoint you could find out how many bins are collected on a Wednesday with one query.
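
      As a sketch (assuming, hypothetically, that the authorities shared a common waste vocabulary – the ex: namespace here is made up):

          PREFIX ex: <http://example.org/waste#>

          # Count every collection scheduled for a Wednesday,
          # across all the data published with this vocabulary.
          SELECT (COUNT(?collection) AS ?total) WHERE {
            ?collection a ex:BinCollection ;
                        ex:collectionDay "Wednesday" .
          }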

      “I tried a DESCRIBE query on the endpoint, in case it held the descriptive data on this URI, only for it to return an empty result set.”

      The truth is that at the moment it’s all a bit ropey, mainly because nobody looks at it. The momentum is growing, maybe not quickly enough, but remember that the ordinary web was the preserve of academia for quite a few years before “ordinary” people got the whole idea of it.

      My own view is that once someone gets a real handle on what an RDF browser could really do and manages to make a good one then we’ll see people not only using such tools but demanding better access to data.

    • “http://transport.data.gov.uk/0/ontology/traffic#CountPoint and others aren’t dereferenceable”

      There could be many reasons for this, but the point is to get these ontologies out there rather than have a link that goes nowhere. A lot of these government sites are being given the advice “go ugly early” which is fine from the point of view of getting the data out there, but it doesn’t make it a perfect example of how linked data makes a difference in the long term.

  14. Yes, I understand your concern, but don’t worry. Making Linked Data from CSVs is going to be a click-button-oriented Wizard affair (any second now, I will be unveiling this amongst other things).

    Sharing queries that start from the insight and expose the query and the data behind it is the key (what I call Meshups rather than Mashups).

    Links:

    1. http://delicious.com/kidehen/meshup — shared meshup collection on del.icio.us
    2. http://bit.ly/9QKvEm — recent example of Geo Spatial enhanced Linked Data page for iPhone etc..
    3. http://bit.ly/d17xK0 — Getting The Linked Data Value Pyramid Layers Right

    Kingsley

    • “Making Linked Data from CSVs is going to be a click-button-oriented Wizard affair (any second now, I will be unveiling this amongst other things).”

      Sounds almost too good to be true, and obviously I eagerly await the Wizard’s arrival, but how am I going to get around the fact that I could have 10 CSV files all referencing a John Smith and no way to know if they are the same John Smith or different ones?

  15. @Kingsley – sounds intriguing – look forward to hearing more about it. And thanks also for the links – really helpful

  16. Richard,

    This is what’s going to happen re. CSV to Linked Data Wizard. You import your CSV data into Virtuoso (as you do today re. Access or Excel):

    1. A Wizard then enables you to organize the tabular representation.
    2. You import.
    3. You publish as RDF-based Linked Data.

    Now, let’s say you have 10 CSV files with data related to “John Smith”. You will have an Identifier for each of the CSV entities and you will be able, post RDF View Generation, to make assertions like:
    <identifier-from-file-1> owl:sameAs <identifier-from-file-2> .

    Once you make the assertion above, Virtuoso’s reasoning will handle the data reconciliation for you via conditional application of a rules context for “co-references”.
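
    In executable form, that assertion is a single triple – a sketch with hypothetical identifiers standing in for the ones generated from two of the imported files:

        PREFIX owl: <http://www.w3.org/2002/07/owl#>

        # Declare the two John Smiths to be the same resource; the
        # reasoner merges their descriptions once the rule is enabled.
        INSERT DATA {
          <http://example.org/csv1#john-smith> owl:sameAs
              <http://example.org/csv2#john-smith> .
        }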

    Here is an example of the kind of resource description presentation that you will see re. co-referenced data in Virtuoso:

    1. http://bit.ly/bc9P7M — I have many Identifiers 🙂

    Virtuoso allows you to mesh/smush (perform union expansion of co-referenced data) conditionally, i.e. you enable a given rules context, e.g. owl:sameAs inference.

    Kingsley
