I first wrote about hoard.it, the bootstrapped “API spider” that Dan Zambonini and I built, back in 2008. We followed up the technology with a paper for Museums and the Web 2009, and in that paper talked about some possible future directions for the service. You’ll see if you scroll down the paper that there is a section entitled “next steps” which outlines some of these.
I’ve always been very excited by the possibilities that the hoard.it model might uncover. Not, as you’ll see from the paper, because this is a new or particularly original approach – screenscraping is as old as the hills – but because it offers up some rapid and very easy solutions to some bigger problems. These problems are even older than screenscraping: legacy content, no resource, little expertise, and they’re firmly entrenched in pretty much every web presence there is.
Much as the world seems to be heading along a Linked Data route, it comes back to me time and time again that however beautiful these RDF or RDFa or even API approaches are in theory, they mean little without uptake and critical mass. Simplicity of approach defines RSS as a success, for example, and even though it isn’t perfect by a long shot, it gets things done rapidly and easily and as a consequence has been a cornerstone of the Machine Readable Data web that we see today.
This post isn’t, however, about Linked Data – I’ll leave that for another day (let’s leave it as: I’m ready to be convinced, but I’m not there yet). However, one of the major concerns I have about the LD approach is that, however you look at it, it requires a fair amount of effort to make the changes required. One day, I and others may be convinced that this effort is worth it, but for now this still leaves the huge problem that we outlined in our paper: legacy systems, legacy sites and legacy content – all of it put together at huge expense and effort, and none of it available in this new MRD world.
I got thinking about the hoard.it approach over the weekend and started to focus in on the ideas articulated in the “template mapping” part of the paper. Lurking at the back of my brain has been the question about who defines the mapping, and whether there might be any way to make this more flexible: make it something an organisation, individual or even a crowd-sourced democracy might be able to define.
The idea I’d like to float is this:
1. A page, section or even entire website has a corresponding file which lays out a simple schema. This schema maps things in the html DOM to data fields using jquery-like (simple, universal, understandable!) DOM pointers. Here’s what a single field definition might look like:
DC.Title > $("div#container ul.top-nav li:eq(0)").text();
This example means, hopefully obviously: “the content at the DOM location div#container ul.top-nav li:eq(0) is the DC.Title field”
2. If the institution creates this file then they can use the “rel” attribute of a link tag somewhere in the head to point to this file and the mapping type – so for instance you might have <link rel="simple.dublincore.transformation" href="somelocalfile.txt" /> or even <link rel="hcard.microformat.transformation" />. This means that any MRD parser (or search engine!) which comes to the page could quickly parse the content, apply the transformation and then return the data in a pre-defined format: XML, RSS, vCard or maybe even RDF (I don’t know – could you? Help me out here!).
3. If the institution doesn’t create this file, then anyone could write a transformation using this simple approach, publish their transformation file on the web and then use it to get data from the page using a simple parser. I’m writing an example parser as we speak – right now all it accepts is the source url (the data) and the definition url (the transformation file) but this could be much more effective with some further thought…
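To make the steps above a little more concrete, here is a minimal Python sketch. It is not the example parser mentioned above, just an illustration under some assumptions: the line grammar is guessed from the single DC.Title example, the rule that a mapping link has a rel value ending in ".transformation" is inferred from the two rel values shown, and the function and class names are made up for this sketch. It only reads the definitions and discovers the link; actually evaluating a jquery-style selector such as li:eq(0) against the DOM would need a selector engine and is left out.

```python
import re
from html.parser import HTMLParser

# Assumed grammar for one field definition, based on the single example:
#   FIELD > $("SELECTOR").text();
FIELD_LINE = re.compile(r'^\s*([\w.]+)\s*>\s*\$\("(.+)"\)\.text\(\);?\s*$')


def parse_transformation(text):
    """Return a {field_name: selector} dict from the text of a mapping file."""
    mapping = {}
    for line in text.splitlines():
        match = FIELD_LINE.match(line)
        if match:  # blank or unrecognised lines are skipped
            field, selector = match.groups()
            mapping[field] = selector
    return mapping


class TransformationLinkFinder(HTMLParser):
    """Collect <link> elements whose rel value ends in '.transformation'."""

    def __init__(self):
        super().__init__()
        self.links = []  # (rel, href) tuples; href is None when absent

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = attributes.get("rel") or ""
        if rel.endswith(".transformation"):
            self.links.append((rel, attributes.get("href")))


def find_transformation_links(html):
    """Return every (rel, href) pair that points at a mapping file."""
    finder = TransformationLinkFinder()
    finder.feed(html)
    return finder.links


page = ('<html><head><link rel="simple.dublincore.transformation" '
        'href="somelocalfile.txt" /></head><body></body></html>')
print(find_transformation_links(page))
print(parse_transformation('DC.Title > $("div#container ul.top-nav li:eq(0)").text();'))
```

A parser built this way would first look for the link in the head and, failing that, fall back to a definition url supplied by whoever is asking for the data.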
And that, simply, is it. The advantages are:
1. Machine Readable Data (albeit simple MRD) out of the box: no changes to the page (apart from a link tag with a rel attribute if you want it). No arcane languages, no API development. No time lag (I’d see most applications reading this data in real-time – just an HTTP request to the page which is then parsed).
2. Any changes to the structure of the page (the long-recognised problem with screen-scraping) can be quickly accommodated in a single defining file for the whole part of the site which has that “shape”. If this file is democratised in some way then the community could spot errors caused by changed structure in the page and amend the file accordingly.
3. Multiple files can be defined for a single page. Likewise, sub-pages or sections can have their own “cascade-like” files which are specific to that particular content shape.
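One possible reading of “cascade-like” is that a more specific file overrides the fields of a more general one, much as CSS rules do. The sketch below assumes exactly that, and also assumes a hypothetical registry that maps URL prefixes to already-parsed field mappings; neither the registry nor the selectors in it come from hoard.it.

```python
from urllib.parse import urlsplit


def cascade_mappings(page_url, registry):
    """Merge every mapping whose prefix matches, most specific last."""
    parts = urlsplit(page_url)
    target = parts.netloc + parts.path
    merged = {}
    # Shorter prefixes are applied first so longer (more specific) ones win.
    for prefix in sorted(registry, key=len):
        if target.startswith(prefix):
            merged.update(registry[prefix])
    return merged


# Hypothetical registry: prefix -> {field: selector}
registry = {
    "example.org/": {"DC.Title": "h1", "DC.Publisher": "div#footer span.org"},
    "example.org/collections/": {"DC.Title": "div#object h2.name"},
}

print(cascade_mappings("http://example.org/collections/123", registry))
print(cascade_mappings("http://example.org/about", registry))
```

Under this reading a collections page takes its title selector from the section file and its publisher selector from the site-wide file, while any other page on the site only sees the site-wide file.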
I know of two approaches which are similar but different. Firstly, the YQL-like approach to screen scraping which is very, very elegant but also a) specific to Yahoo! (a company not in best financial and future-proof shape) and b) as far as I know, can’t be specified for collections of pages but rather on a page-by-page basis (let me know if this isn’t the case..). The second approach, which is also similar but in different ways, is GRDDL. This is more like the open approach I suggest here, but has the problems of a) being based around XSL, which means the source document has to be valid XML, and b) requiring on-page edits, too.
These are the only two conceptual approaches that I know of which are similar, but it could well be that there are well-defined lightweight ways of doing this which others have written about.
I’d very much like your feedback on this roughly-shaped idea. I’ll get the parser up and running shortly and it’ll then become clear whether it might be an interesting approach – but if you know of similar ways of doing this stuff, I’d love to hear them.
If I understand it, then I like the idea… It sounds like you’re trying to fill a similar need to RDFa and microformats: enabling people to pull data from web pages with minimal overhead in exposing the data. However, it occurs to me that your method would require the document structure to mirror quite closely the structure of the data you want to expose. Whilst this might be the case for certain elements, like the doc title example (which, incidentally, is generally going to contain unstructured data), I doubt it will be the case for all interesting data on a page. But if it’s about applying an external description of data on a page, is that what XSLT does?
@Mark – thanks for commenting, really useful.
You’re absolutely right – the fact that data needs to have a certain shape (and also the fact that the data needs to be exposed on-page at all!) is to a certain extent a downside of the proposed approach. Also it is probably the case that in reality a fair amount of munging will be required of extracted data. I guess a real-world test is the only way of finding out if this is enough of a barrier to make the idea a no-go or not…
Or how about just sticking with CSS selectors and CSS syntax for the metadata file? MSS?
div#container ul.top-nav { metadata: DC.Title }
Of course, HTML already has a <title> element that you’d really want to map to DC.Title, and a page that doesn’t provide that probably has additional problems.
One nice feature of your idea is that it’s not incompatible at all with RDFa (my preference) or with HTML5 microdata. DC metadata could be mapped to RDFa attributes as well as to HTML element text.
@Sean – thanks for commenting
Yeah, I originally thought about using CSS as the selector, but did some reading and thought that the jquery approach allows a cleverer means of DOM selection. So, for example, you can exclude certain elements based on filters etc. I’m not a massively good CSS’er (like everything else, I hack it to make it work) but my impression was that DOM selection via CSS is a bit more limited?
re. RDFa – yes and microdata – yes, absolutely. If you have a look at my comment on this post http://doofercall.blogspot.com/2008/05/screen-scraping-and-posh.html you’ll see that I suggest a hierarchical approach to grabbing data from the page, with “most accurate” at the top (including API, microformats, RDFa) and “least accurate” (the DOM approach) at the bottom. I think remaining compatible with these emerging approaches as well as with what has gone in the past is pretty key
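That hierarchy could be sketched as an ordered list of extractors that is walked from “most accurate” to “least accurate”, stopping at the first one that returns something. The two extractors below are toy, regex-based stand-ins invented for this sketch (a real one would use a proper RDFa or DOM parser); only the ordering idea comes from the comment above.

```python
import re


def from_rdfa(html, field):
    """Toy stand-in: read the text of an element carrying property="FIELD"."""
    match = re.search(rf'property="{re.escape(field)}"[^>]*>([^<]*)<', html)
    return match.group(1) if match else None


def from_title_element(html, field):
    """Toy stand-in for the DOM approach: only knows how to find DC.Title."""
    if field != "DC.Title":
        return None
    match = re.search(r"<title>([^<]*)</title>", html)
    return match.group(1) if match else None


# Ordered from "most accurate" to "least accurate".
EXTRACTORS = [("rdfa", from_rdfa), ("title-element", from_title_element)]


def extract_field(html, field, extractors=EXTRACTORS):
    """Return (source_name, value) from the first extractor that finds the field."""
    for name, extractor in extractors:
        value = extractor(html, field)
        if value is not None:
            return name, value
    return None, None


print(extract_field('<title>Page</title><h1 property="DC.Title">Bronze axe</h1>', "DC.Title"))
print(extract_field("<title>Page</title><h1>Bronze axe</h1>", "DC.Title"))
```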
When RDF first came along, ten years ago, the proposal was that you could publish separate, machine-readable metadata for your web page and link to it from the head, using a link tag. Remember how we all rushed out and created RDF metadata files for our pages? Me neither. That precedent makes me very nervous about putting your machine-readable data in a separate file – in all likelihood, it will be forgotten about and won’t get updated when the site changes. Hence the aversion to hidden metadata in the microformats community.
One point about embedding the metadata directly in the HTML – if you have access to add a link tag, surely that means you can add class="dc_title" (microformats-style) or property="DC.Title" (RDFa-style) to the appropriate HTML tag? This approach is much more robust against changes in the surrounding HTML and easier to maintain going forward.
I do agree that with the majority of museums seeing digitisation as publishing their records in HTML, there needs to be a mechanism of some sort to embed catalogue data in HTML. And it needs to be a mechanism with low enough barrier to entry that it doesn’t require huge technical skills to set up.
Jim – cheers for the comment
Forgive me if I’ve got this wrong but I think you’ve misunderstood what I’m suggesting. The idea isn’t to hold data outside the page, but to hold a file which holds the *data shape* which refers to the existing data/html *on* the page. Sorry if I didn’t make this clear!
The content in the file *could* be included somehow in the page, but the reality is that more than one page is likely to be represented in the same template shape – so it is better, like CSS, to hold this externally.
Make more sense?