Are PDF files open data?

Over the weekend James Arthur Cattell (doer of open data and one of the organisers of @UKGovCamp and @ODCamp) asked for thoughts on PDFs and open data.

As James notes in a follow-up blog post, it prompted some strong (and predictable) responses:

It was Matt Shaw’s answer that got my attention:

https://twitter.com/MattShaw85/status/703634892345745410

My response:

Matt made a suggestion:

https://twitter.com/MattShaw85/status/703640463375515649

HTML is a good solution to the ‘closed’ nature of PDF. As Janet Davis noted:

But HTML isn’t always straightforward. In many organisations the people who are responsible for the data are not connected to the people who make web pages. Getting a new web page commissioned is often a herculean effort of inter-departmental trial by support ticket. Sending a PDF file to be linked on an existing ‘open data’ page is easier. Of course, that doesn’t mean poor technical and support infrastructure should be a reason not to open data, as Alex Blandford insightfully subtweeted:

And as @AubrieMcg noted:

They are right. But that’s about infrastructure. Until someone adds a ‘File > Save As > JSON on my website’ option, File > Save As > PDF might do the job. At least it’s there and it’s usable. It’s that idea of usability that most interests me.

I suggested to Alex that a PDF with the boundaries of back gardens the council wanted to compulsorily purchase would be more useful to my mum (true story, by the way) than a web page. Why not:

Fair point. But as Tim Blackwell noted:

So yes, the web page might be an image as well, but the PDF is a record, and an entirely appropriate one for the need being expressed. Of course there is a ‘slippy slope’ argument here that data in a PDF is somehow in the control of the messenger, but I’m not sure how the same data, parsed and rendered via JSON in an ad-sponsored mobile app, wouldn’t be open to the same accusations.

Uses and gratifications

Giuseppe Sollazzo had an interesting and philosophical reflection on the debate, citing Aristotle’s view on potentiality and actuality:

This is exactly the situation of our PDF: data that is locked in a printable format exists potentially but not in actuality. The potential for it to become Open Data is there, but in actual fact the Open Data does not exist yet. This has nothing to do with licensing, but with format: a PDF that is licensed in a compliant way to the Open Definition is Open — but it’s not even Data, never mind whether it’s Open. Certainly, you could extract data from a PDF. (And, anyway, whether the data in itself will be Open might depend on the licensing.)

He makes the broad point (via rugby) that data needs to be used to realise its full potential: “use that data wisely; power applications; change the way the operations are managed”.

That was the interesting focus of the debate for me: not whether it was open, but what a user wanted to do with the data. Yes, a PDF might make it really difficult for you to run your open data application for fun and profit, but for others it might be just enough for them to use it.

The idea that open data is only of ‘use’ if it’s machine-readable and unicorn-friendly seems limiting. Yes, PDF makes it really hard for tech people to do what they want with data. But the formats that make it really easy for tech people to do what they want with data make it really hard for an average Joe to use it. Yes, even HTML.

As Giuseppe points out above, it’s the licence, not the format, that makes it open. Even Tim Berners-Lee thinks that publishing data in PDF format is open data, albeit one-star. That’s a philosophical, maybe even ideological, reading of usefulness that goes beyond functional needs.

And if open data is going to fulfil some of its potential for social and democratic innovation, and not just the technological and economic, the idea of usability needs to be more inclusive.

The open image used in this post is from Mikol on Flickr