Pushing MRD out from under the geek rock

The week before last (30th June – 1st July 2009), I was at the JISC Digital Content Conference having been asked to take part in one of their parallel sessions.

I thought I’d use the session to talk about something I’m increasingly interested in – the shifting of the message about machine readable data (think APIs, RSS, OpenSearch, Microformats, Linked Data, etc.) from the world of geek to the world of non-geek.

My slides are here:

[slideshare id=1714963&doc=dontthinkwebsitesthinkdatafinal-090713100859-phpapp02]

Here’s where I’m at: I think that MRD (That’s Machine Readable Data – I couldn’t seem to find a better term..) is probably about as important as it gets. It underpins an entire approach to content which is flexible, powerful and open. It embodies notions of freely moving data, it encourages innovation and visualisation. It is also not nearly as hard as it appears – or doesn’t have to be.
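To make the idea concrete, here’s a tiny sketch (the record is entirely made up) of what “machine readable” buys you: the same content that would normally be locked into an HTML page, expressed as JSON that any script, site or service can consume in a line or two.

```python
import json

# A hypothetical museum object record. On a typical website this data is
# baked into an HTML page; exposed as JSON (or RSS, or via an API) the
# same content becomes trivially reusable by other sites and services.
record = {
    "id": "obj-1234",
    "title": "Steam engine model",
    "maker": "Unknown",
    "date": "c. 1890",
    "image": "http://example.org/images/obj-1234.jpg",
}

# Machine readable: one line to serialise, one line for anyone to parse.
payload = json.dumps(record)
parsed = json.loads(payload)
print(parsed["title"])
```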

In the world of the geek (a world I dip into just long enough to see the potential before heading back out into the sun), the proponents of MRD are many and passionate. Find me a Web2.0 application without an API (or one “on the development road-map”) and I’ll find you a pretty unusual company.

These people don’t need preaching at. They’re there, lined up, building apps for Twitter (to the tune of 10x the traffic which visits twitter.com), developing a huge array of services and visualisations, graphs, maps, inputs and outputs.

The problem isn’t the geeks. The problem is that MRD needs to move beyond the realm of the geek and into the realm of the content owner, the budget holder and the strategist for these technologies to become truly embedded. We need to have copyright holders and funders lined up at the start of a project, prepared for the fact that our content will be delivered through multiple access routes, across unspecified timespans and to unknown devices. We need our specifications to be focused on re-purposing, not on single-point delivery. We need solution providers delivering software with web APIs built in. We need to be prepared for a world in which no-one visits our websites any more, instead picking, choosing and mixing our content from externally syndicated channels.

In short, we now need the relevant people evangelising about the MRD approach.

Geeks have done this well so far, but now they need help. Try searching for “ROI for APIs” (or any combination thereof) and you’ll find almost nothing: very little evidence of how much APIs cost to implement, what cost savings you’re likely to see from them or how they reduce content development time, and few guidelines on how to deal with the copyright issues around syndicated content.

Partly, this knowledge gap exists because many of the technologies we’re talking about are still quite young. But a lot of the problem is about the communication of technology – the divided worlds that Nick Poole (Collections Trust) speaks about. This was the core of my presentation: ten reasons why MRD is important, from the perspective of a non-geek (links go to relevant slides and examples in the slide deck):

  1. Content is still king
  2. Re-use is not just good, it’s essential
  3. “Wouldn’t it be great if…”: Life is easier when everyone can get at your data
  4. Content development is cheaper
  5. Things get more visual
  6. Take content to users, not users to content (“If you build it, they probably won’t come”)
  7. It doesn’t have to be hard
  8. You can’t hide your content
  9. We really is bigger and better than me
  10. Traffic

All this is a starter for ten. Bigger, better and more informed people than me probably have another hundred reasons why MRD is a good idea. I think this knowledge may be there – we just need to surface and collect it so that more (of the right) people can benefit from these approaches.

Scraping, scripting, hacking

I just finished my talk at Mashed Library 2009 – an event for librarians wanting to mash and mix their data. My talk was almost definitely a bit overwhelming, judging by the backchannel, so I thought I’d bang out a quick blog post to try and help those I managed to confuse.

My talk was entitled “Scraping, Scripting and Hacking your way to API-less data”, and was intended to give a high-level overview of some of the techniques that can be used to “get at data” on the web when the “nice” options of feeds and APIs aren’t available to you.

The context of the talk was this: almost everything we’re talking about with regard to mashups, visualisations and so on relies on data being available to us. At the cutting edge of Web2 apps, everything has an API, a feed, a developer community. In the world of museums, libraries and government, this just isn’t the case. Data is usually held on-page as HTML (XHTML if we’re lucky), and programmatic access is nowhere to be found. If we want to use that data, we need to find other ways to get at it.
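To give a flavour of the techniques, here’s a minimal screen-scraping sketch using only the Python standard library. The page markup and its “object” class are hypothetical stand-ins for a collections page that offers no feed or API – the data only exists as markup:

```python
from html.parser import HTMLParser

# Hardcoded HTML standing in for a hypothetical museum collections page.
PAGE = """
<html><body>
  <ul class="objects">
    <li class="object">Steam engine model</li>
    <li class="object">Brass microscope</li>
    <li class="object">Penny farthing</li>
  </ul>
</body></html>
"""

class ObjectScraper(HTMLParser):
    """Collect the text of every <li class="object"> element."""
    def __init__(self):
        super().__init__()
        self.in_object = False
        self.objects = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "object") in attrs:
            self.in_object = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_object = False

    def handle_data(self, data):
        if self.in_object and data.strip():
            self.objects.append(data.strip())

scraper = ObjectScraper()
scraper.feed(PAGE)
print(scraper.objects)  # ['Steam engine model', 'Brass microscope', 'Penny farthing']
```

In real use you’d fetch the page over HTTP first; the fragile part is always the markup itself, which can change under you without warning – which is exactly why proper APIs beat scraping.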

My slides are here:

[slideshare id=1690990&doc=scrapingscriptinghacking-090707060418-phpapp02]

A few people asked that I provide the URLs I mentioned together with a bit of context. Many of the slides above have links to examples, but here’s a simple list for those who’d prefer that:

Phew. Now I can see why it was slightly overwhelming 🙂

If you love something, set it free

Last week, I had the privilege of being asked to be one of the keynote speakers at a conference in Amsterdam called Kom je ook? (roughly, “Are you coming too?”), which describes itself as “a symposium for cultural heritage institutions, theatres and museums”.

I was particularly excited about this one: firstly, my fellow keynoters were Nina Simon (Museum 2.0) and Shelley Bernstein (Community Manager at the Brooklyn Museum) – both very well known and very well respected museum and social web people. Second (if I’m allowed to generalise): “I like the Dutch” – I like their attitude to new media, to innovation and to culture in general. And third: it looked like fun.

Nina talked about “The Participatory Museum” – in particular she focussed on an oft-forgotten point: the web isn’t social technology per se; it is just a particularly good tool for making social technology happen. The fact that the online medium allows you to track, access, publish and distribute is a good reason for using the web, BUT the fact that this happens to populate one space shouldn’t limit your thinking to that space, and shouldn’t alter the fact that this is always, always about people and the ways in which they come together. The changing focus of the museum – moving from being a content provider to being a platform provider – also rang true with me in so many ways. Nina rounded off with “ten tips for social technology” (slide 12 onwards).

Shelley gave another excellent talk on the incredible work she is doing at the Brooklyn Museum. She and I shared a session on Web2 at Museums and the Web 2007, and once again it is the genuine enthusiasm and authenticity which permeates everything she does which really comes across. This isn’t “web2 for web2’s sake” – this is genuine, pithy, risky, real content from enthused audiences who really want to take part in the life of the museum. 

My session was on setting your data and content free:

[slideshare id=768086&doc=mikeellisifyoulovesomethingsetitfreefinal-1227110930707512-9&w=425]

Hopefully the slides speak for themselves, but in a nutshell my argument is that although we’ve focussed heavily on the social aspects of Web2.0 from a user perspective, it is the stuff going on under the hood which really pushes the social web into new and exciting territory. It is the data sharing, the mashing, the APIs and the feeds which are at the heart of this new generation of web tools. We can resist the notion of free data by pretending that people use the web (and our sites) in a linear, controlled way, but the reality is we have fickle and intelligent users who will get to our content any which way they can. Given this, we can either push back against freer content by pretending we can lock it down, or – as I advocate – do what we can to give users access to it.

Are synapses intelligent?

It’s hard not to be fascinated by the emerging and developing conversations around museums and the Semantic Web. Museums, apart from anything else, have lots of stuff, and a constant problem finding ways of intelligently presenting and cross-linking that stuff. Search is OK if you know what you’re looking for, but browse as an alternative is usually a terribly pedestrian experience, failing to match the serendipity and excitement you get in a physical exhibition or gallery.

During the Museums and the Web conference, there was a tangible thread of conversation and thought around the API’d museum, better ways of doing search, and varied opinions about openness and commerce – but the endless tinnitus of the Semantic Web was never far from people’s consciousness.

As well as the ongoing conversation, there were some planned moments too, among them a workshop run by Eric Miller (ex-W3C Semantic Web guru), Ross Parry‘s presentation and discussion of the AHRC-funded “Cultural Semantic Web” think tank, and the coolness of Open Calais being applied to museum collections data by Seb Chan at the Powerhouse (article on ReadWriteWeb here – nice one Seb!).

During the week I also spent some time hanging out with George Oates and Aaron Straup Cope from Flickr, and it’s really from their experiences that some thoughts started to emerge which I’ve been massaging to the surface ever since.

Over a bunch of drinks, George told me a couple of fairly mind-blowing statistics about the quantity of data on Flickr: more than 2 billion images, with new ones being uploaded at a rate of more than 3 million a day…

What comes with these uploads is data – huge, vast, obscene quantities of data – tags, users, comments, links. And that vat of information has a value which is hugely amplified because of the sheer volume of stuff.

To take an example: at the individual tag level, the flaws of misspellings and inaccuracies are annoying and troublesome, but at a meta level these inaccuracies are ironed out; flattened by sheer mass: a kind of bell-curve peak of correctness. At the same time, inferences can be drawn from the connections and proximity of tags. If the word “cat” appears consistently – in millions and millions of data items – next to the word “kitten” then the system can start to make some assumptions about the related meaning of those words. Out of the apparent chaos of the folksonomy – the lack of formal vocabulary, the anti-taxonomy – comes a higher-level order. Seb put it the other way round by talking about the “shanty towns” of museum data: “examine order and you see chaos”.
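The co-occurrence idea can be sketched in a few lines. The tagged items below are a made-up toy dataset; at Flickr scale the same counting is what lets the clusters emerge:

```python
from collections import Counter
from itertools import combinations

# Each set is one item's tags. Across millions of items, strong
# co-occurrence (e.g. "cat" alongside "kitten") is a signal that the
# tags are related in meaning - order out of folksonomic chaos.
tagged_items = [
    {"cat", "kitten", "pet"},
    {"cat", "kitten"},
    {"cat", "pet", "garden"},
    {"kitten", "cat", "cute"},
    {"garden", "flower"},
]

cooccurrence = Counter()
for tags in tagged_items:
    for a, b in combinations(sorted(tags), 2):
        cooccurrence[(a, b)] += 1

# The most frequent pairs hint at the strongest semantic relationships.
print(cooccurrence.most_common(3))
```

Misspellings and junk tags show up as low-count pairs and simply never rise above the noise floor, which is the “flattened by sheer mass” effect in miniature.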

The total “value” of the data, in other words, really is way, way greater than the sum of the parts.

This is massively, almost inconceivably powerful. I talked with Aaron about how this might one day be released as a Flickr API: a way of querying the “clusters” in order to get further meaning from phrases or words submitted. He remained understandably tight-lipped about the future of Flickr, but conceptually this is an important idea, and it leads the thinking in some interesting directions.

On the web, the ideas of the wisdom of crowds and massively distributed systems are hardly new. We really is better than me.

I got thinking about how this can all be applied to the Semantic Web. It increasingly strikes me that the distributed nature of the machine processable, API-accessible web carries many similar hallmarks. Each of those distributed systems – the Yahoo! Content Analysis API, the Google postcode lookup, Open Calais – are essentially dumb systems. But hook them together; start to patch the entire thing into a distributed framework, and things take on an entirely different complexion.
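As a sketch of that “dumb parts, smart whole” idea, here are two hypothetical stand-in functions (the real versions would be remote calls to services like Open Calais for entity extraction and a geocoding lookup) chained into a single pipeline. All the names and coordinates here are illustrative, not real API calls:

```python
# Each service is "dumb" on its own; chained together they turn raw
# text into mappable data.

def extract_places(text):
    """Stand-in for an entity-extraction service call."""
    known = ["Bath", "Montreal"]  # a real service would recognise far more
    return [name for name in known if name in text]

def geocode(place):
    """Stand-in for a geocoding service call."""
    coords = {"Bath": (51.38, -2.36), "Montreal": (45.50, -73.57)}
    return coords.get(place)

def pipeline(text):
    # Chain the services: text -> place names -> coordinates.
    return {place: geocode(place) for place in extract_places(text)}

print(pipeline("The conference moves from Bath to Montreal this year."))
```

Swap either stub for a real web service and the pipeline shape stays the same – which is exactly the distributed-framework point.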

I’ve supped many beers with many people over “The Semantic Web”. Some have been hardcore RDF types – with whom I usually lose track at about paragraph three of our conversation, but stumble blindly on in true “just be confident, hopefully no-one will notice you don’t know what you’re talking about” style. Others have been more “like me” – in favour of the lightweight, top-down, “easy” approach. Many people I’ve talked to have simply not been given (or been able to give) any good examples of what or why – and the enduring (by now slightly stinky, embarrassing and altogether fishy) albatross around the neck of anything SW is that no-one seems to be doing it in ways that anyone even vaguely normal can understand.

Here’s what I’m starting to gnaw at: maybe it’s here. Maybe if it quacks like a duck, walks like a duck (as per the recent Becta report by Emma Tonkin at UKOLN) then it really is a duck. Maybe the machine-processable web that we see in mashups, APIs, RSS, microformats – the so-called “lightweight” stuff that I’m forever writing about – maybe that’s all we need. Like the widely accepted notion of scale and we-ness in the social and tagged web, perhaps these dumb synapses, when put together, are enough to give us the collective intelligence – the Semantic Web – that we have talked and written about for so long.

Here’s a wonderful quote from Emma’s paper to finish:

“By ‘semantic’, Berners-Lee means nothing more than ‘machine processable’. The choice of nomenclature is a primary cause of confusion on both sides of the debate. It is unfortunate that the effort was not named ‘the machine-processable web’ instead.”

RSS search results

A quickie (as I’ve only got a week to go until Museums and the Web, and I have a workshop on blogging, a workshop on mashups, a professional forum on “openness” and a “blogathon” to prepare…)


I’ve been playing about with Yahoo! Pipes a fair bit this week and preparing some stuff for the mashup workshop. In doing so, an idea I’ve floated before (see slide 33 “provide alternative routes” on the presentation below) came up to the surface again with a very, very simple suggestion:

All museums (everyone, actually) should provide their search results as RSS

Now I can hear some people at the back shuffling around uncomfortably and muttering things like “shoehorning technology”, “RDF”, “Z39.50”… I have my anti-makelifemoredifficult earplugs in, though, and I can’t really think of any practical reasons why this isn’t a hugely good idea.

Searching the web for things like “museum search RSS” doesn’t get me anything useful. From a previous bookmark I had a link to the AADL catalogue – they do it and here’s a search for “Montreal” delivered via RSS.

I’m immediately able to mash this up using Pipes, hack about with the URL, style it using my own XSLT.
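Producing such a feed is genuinely cheap. Here’s a minimal sketch, using only the Python standard library, of turning a (made-up) set of search results into RSS 2.0 – the titles and URLs are hypothetical stand-ins for whatever your collections search actually returns:

```python
import xml.etree.ElementTree as ET

# Hypothetical search results for the query "Montreal".
results = [
    {"title": "Montreal street scene, 1905",
     "link": "http://example.org/objects/1"},
    {"title": "Map of Montreal harbour",
     "link": "http://example.org/objects/2"},
]

# Build a minimal RSS 2.0 document: one <channel>, one <item> per hit.
rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = 'Search results for "Montreal"'
ET.SubElement(channel, "link").text = "http://example.org/search?q=Montreal"
ET.SubElement(channel, "description").text = "Collection search results as a feed"

for result in results:
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = result["title"]
    ET.SubElement(item, "link").text = result["link"]

print(ET.tostring(rss, encoding="unicode"))
```

Hang that off the same query string as your HTML results page and anyone can subscribe to a search, pipe it into Pipes, or restyle it however they like.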

Assuming this isn’t an original idea (and Owen Stephens seemed to think it was a good one and had been done within some library systems) – why aren’t more institutions doing it?

Comments please!

[slideshare id=176658&doc=web2-and-distributed-services-mike-ellis-v2-1195806210354162-3&w=425]

Introducing OneTag

You might have noticed I’ve been a bit quiet on the blog front for the last couple of weeks. That’s because I’m having a drive to get some ideas out into the world, and have therefore been knee-deep in code on my latest project most evenings.

I’ve put together an idea for people who run conferences or events. It’s called OneTag (www.onetag.org). It’s very simple conceptually, although as I’m discovering, a complete *dog* to code… The idea is that it aggregates all the “buzz” about a particular (live) event and then provides the means to view this in different ways. Find out more at http://www.onetag.org/ot/about.asp.
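For the curious, the core of the idea can be sketched in a few lines of Python. The feeds below are hardcoded stand-ins for the blog and Twitter feeds OneTag actually fetches (the real code is, as noted, rather more of a dog):

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Two tiny RSS documents standing in for fetched event feeds.
FEEDS = [
    """<rss version="2.0"><channel>
         <item><title>mw2008 day one notes</title>
               <pubDate>Wed, 09 Apr 2008 10:00:00 +0000</pubDate></item>
       </channel></rss>""",
    """<rss version="2.0"><channel>
         <item><title>Photos from Montreal #mw2008</title>
               <pubDate>Thu, 10 Apr 2008 09:30:00 +0000</pubDate></item>
       </channel></rss>""",
]

def aggregate(feeds):
    """Merge items from several RSS feeds into one stream, newest first."""
    items = []
    for xml in feeds:
        for item in ET.fromstring(xml).iter("item"):
            items.append({
                "title": item.findtext("title"),
                "date": parsedate_to_datetime(item.findtext("pubDate")),
            })
    return sorted(items, key=lambda i: i["date"], reverse=True)

for entry in aggregate(FEEDS):
    print(entry["date"].isoformat(), entry["title"])
```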

Usual “it’s a beta” disclaimers apply…

I’ve agreed with David Bearman and Jennifer Trant that I’ll be trialling the system during the Museums and the Web 2008 conference in Montreal.

I need your help…

First off, if you’re going to the conference and intend to blog, twitter or upload any photos, the global tag follows the same pattern as previous years and is therefore mw2008. If you’re blogging then just add this as a tag or category; if you’re twittering then please use the hashtag #mw2008 as part of your tweet.

Second, if you’re the owner of a blog or other social networking site, will be blogging about the conference and have feed addresses you can supply me with, then let me know in the comments or via email and I’ll add these to the OneTag aggregator.

Finally, if you’d like to get access to the mw2008 OneTag feeds and views to help me test them then do feel free to get in touch – again, via email if you know it or using the comments to this post. Alternatively, tweet me direct at http://twitter.com/dmje.

I’m at the stage where as many critical eyes as possible are going to help muchly…

Thanks in advance!

2008 (a little late…)

If you write a blog, I’m discovering that you pretty much have to do a January post with either a review of the previous year or a punt at what the future holds. I’ll leave the review bit to others, but here’s my personal mind-dump for the big things of 2008…

Facebook, Schmasebook

I reckon Facebook as it is now is going to fall way off the public radar during 2008. The disclaimer “as it is now” is my get-out clause – if Facebook find a way of changing the rules around applications, finally expose their Social Graph data to the world or make some serious amends to browse and search then they may have a hope. But right now, here’s a typical Facebook experience:

“Hey, I’ve been invited to this Facebook thing. Now let me see. Wow, Bob is there. And Jon. And Jane. I’ll invite them all to be my friends. Cool. Look, Bob got married. He looks old! I can’t believe he has kids. Look – Jane went to the shops. Now…um..Right, might post some pictures. Nice. Someone added a comment. Cool. Now I’m going to look for my mate Jon…Damn, there are 4 million Jons. Never mind…Er…Who the hell is this inviting me to be a friend? I never heard of her…Wait, WTF is a FunWall? Why have I got 35 invites? I don’t need this noise. Life is too busy. For now I might just add Facebook emails to my spam filter…Oh crap, all my friends have joined AnotherDamnSocialNetwork.com. What, you mean I have to re-input all my data? Sod that, I’m off”

Facebook has also reminded us of something else: we lose touch with some people for a very good reason. 🙂

Some say Twitter is the new Facebook. And although I don’t actually do it much (yeah, ‘course I got an account…) I’d say that the single best thing about Facebook is the updates feature, which is basically…Twitter.

It’ll be interesting to see whether the geek adoption of Twitter goes mainstream in ’08.

Signal / Noise

Dunbar wrote about 150 being the maximum number of individuals that any one person can maintain relationships with at any one time. I don’t know of any equivalent measure for information input, but as RSS continues to spread, I reckon we’ll hear more about what we should do about the sheer quantity of incoming material. Certainly I’m seeing a lot of buzz (not all of it just about it being January and everyone having a good spring clean…) and the beginnings of some products (AideRSS, SocialStream, etc.) which help us cut down on the inputs. We all took to RSS because it lets us do more with less: I’ve just cut down my feed list massively and I’m still trawling through 400+ articles a day. Something is gonna break :-). While you’re at it, check out this great post which Mr Pope sent over to me – a similar kind of sentiment about keeping up with tech, or not, more to the point…

More “Everyware”

The iPhone will obviously be seen as the beginning of the mobile web, although of course the reality is that many of us have been “browsing” online since that WAP thing back in the ’90s. User adoption, as always, drives public perception, which drives investment, which drives adoption…

The iPhone does two major things in one blast:

1. The “usually crap” experience you have during mobile web surfage is buried under full-page zoomable browsing, easy(ish) typing and widgets, all of which manage to surface the web without the chuff you usually get around mobile browsing. In short, the iPhone is primarily a sexy and really usable interface. For many people – especially the queues of teens you see hanging around O2 shops where they have the iPhone/iPod Touch available to fiddle with – this is enough.

2. Apple have done a very cunning deal in the UK with The Cloud whereby all iPhone users get access to any hotspot as part of their contract. In one swoop, you’ve got on-the-go internet access at a huge range of hotspots with no hassle.

Two continuing and evolving approaches: RSS and OpenID

We tech types might all be 100% familiar with the format and ease of use of RSS. Continuing support from browsers (IE’s friendly feed view, etc.) means that it’s going to keep hitting the mainstream over the coming months. I’m willing to bet, however, that new tech will appear around OPML and feed analysis. I’m also pretty sure that we will see more of RSS as a portable data format (for instance, getting search results like these from Technorati) – not “just some news headlines” but a further extension of RSS, and a great example of how to extend a simple format to do interesting things.
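As an illustration of the OPML side of this: OPML is the standard way feed readers exchange subscription lists, and pulling every feed URL out of one takes only a few lines of standard-library Python. The OPML document below is a made-up example:

```python
import xml.etree.ElementTree as ET

# A hypothetical OPML export from a feed reader.
OPML = """<opml version="2.0">
  <body>
    <outline text="Museums">
      <outline text="Museum blog" type="rss"
               xmlUrl="http://example.org/museum/feed.xml"/>
      <outline text="Library blog" type="rss"
               xmlUrl="http://example.org/library/feed.xml"/>
    </outline>
  </body>
</opml>"""

# Every <outline> with an xmlUrl attribute is a feed subscription;
# outlines without one (like "Museums" above) are just folders.
feeds = [node.get("xmlUrl")
         for node in ET.fromstring(OPML).iter("outline")
         if node.get("xmlUrl")]
print(feeds)
```

Feed-analysis tools start exactly here: once the subscription list is machine readable, ranking, pruning and recommending feeds becomes plain data-crunching.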

OpenID is another approach which has been around a looong time but is finally seeing some serious heavy hitters in the form of Google, IBM and Flickr joining the party. I’m also noticing many more sites supporting OpenID – maybe some museums during 2008…?

And finally to the biggest keyword of them all for 2008: APML..

Attention, please

The notion of “Attention Data” has been chugging around tech circles pretty hard for a couple of years. The underlying question about the openness of the Social Graph is integral to this, and has to a certain extent driven attention data back into the limelight. Fundamentally, we’re all giving any site we visit something incredibly valuable: our attention. Anything you do online carries with it an implication about what you like and who you are – understandably very powerful information, both for developers and advertisers. Attention Profile Markup Language (APML) is an XML-based standard for attempting to capture this activity, and I reckon it’s going to be big in the coming months and years – not just capturing attention, but doing useful and interesting stuff with it as well…
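The underlying idea is easy to sketch: boil a browsing history down to a weighted profile of interests. This toy example (the history is invented, and this is not APML’s actual schema – just the concept behind it) shows the shape of the data an attention profile carries:

```python
from collections import Counter

# A hypothetical browsing history, reduced to one topic per page view.
history = ["museums", "museums", "apis", "rss", "museums", "rss"]

# Turn raw attention (visit counts) into normalised interest weights -
# the kind of portable profile an APML-style format is designed to carry.
counts = Counter(history)
total = sum(counts.values())
profile = {topic: round(n / total, 2) for topic, n in counts.items()}
print(profile)  # e.g. {'museums': 0.5, 'apis': 0.17, 'rss': 0.33}
```

The value (and the privacy question) comes when a profile like this moves between services with you, rather than staying locked inside one site.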

And that, as they say, is it for now…Let’s see where it all goes…