Radically Open Cultural Heritage Data at SXSW Interactive 2012

April 11, 2012


Posted by Adrian Stevenson

I had the privilege of attending the annual South by Southwest Interactive, Film and Music conference (SXSW) a few weeks ago in Austin, Texas. I was there as part of the ‘Radically Open Cultural Heritage Data on the Web’ Interactive panel session, along with Jon Voss from Historypin, Julie Allinson from the University of York digital library, and Rachel Frick from the Council on Library and Information Resources (CLIR). We were delighted to see that Mashable.com voted it one of ’22 SXSW Panels You Can’t Miss This Year’.

All of our panellists covered themes and issues addressed by the Discovery initiative, including the importance of open licences, and the need for machine-readable data via APIs to facilitate the easy transfer, aggregation and link-up of library, archive and museum content.

Jon gave some background on the ‘Linked Open Data in Libraries, Archives and Museums’ (LOD-LAM) efforts around the world, talking about how the first International LODLAM Summit held in San Francisco last year helped galvanise the LODLAM community. Jon also covered some recent work Historypin are doing to allow users to dig into archival records.

Julie then covered some of the technical aspects of publishing Linked Data through the lens of the OpenArt Discovery project, which recently released the ‘London Art World 1660-1735’ data. She mentioned some of the benefits of the Linked Data approach, and explained how they’ve been linking to VIAF for names and Geonames for location.

I gave a quick overview of the LOCAH and Linking Lives projects, before giving a heads-up on the World War One Discovery project. LOCAH has been making archival records from the Archives Hub national service available as Linked Data, and Linking Lives is a continuation project that’s using Linked Data from a variety of sources to create an interface based around the names of people in the Archives Hub. After attempting to crystallise what I see as the key benefits of Linked Data, I finished up by focusing on particular challenges we’ve met on our projects.

Rachel considered how open data might affect policies, procedures and the organisational structure of the library world. She talked about the Digital Public Library of America, a growing initiative started in October 2010. The DPLA vision is to have an “open distributed network of comprehensive online resources that draw on the nation’s living history from libraries, universities, archives and museums to educate, inform, and empower everyone in current and future generations”. After outlining how the DPLA is aiming to achieve this vision, she explained how interested parties can get involved.

There’s an audio recording of the panel on our session page, as well as recordings of all sessions mentioned below on their respective SXSW pages. I’ve also included the slides for our session at the bottom of this post.

Not surprisingly, there were plenty of other great sessions at SXSW. I’ve picked a few highlights that I thought would be of interest to readers of this blog.

Probably of most relevance to Discovery was the lightning-fast ‘Open APIs: What’s Hot and What’s Not’ session from John Musser, founder of Programmableweb.com, who gave us what he sees as the eight hottest API trends. He mentioned that the REST style of software architecture is rapidly growing in popularity, being regarded as easier to use than other API technologies such as SOAP (see image below). JSON is very popular, with 60% of APIs now supporting it, and it was also noted that one in five APIs don’t support XML.


The rise of REST – ‘Hot API Protocols and Styles’ from John Musser of Programmableweb.com at SXSW 2012

Musser suggested that APIs need to be supported, with hackathons and funded prizes being a good way to get people interested. He noted that the hottest trend right now is VCs providing significant funding to incentivise people to use their APIs, Twilio being one of the first to do this. He also mentioned that your API documentation needs to be live if you’re to gain interest and maintain use. Invisible mashups are also hot, with operating systems such as Apple’s OS cited as examples. Musser suggested the overall meta-trend is that APIs are now ubiquitous. John has since made his slides available on SlideShare.
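To make the REST-and-JSON trend concrete, here is a minimal sketch (my own illustration, not Musser’s) of the style he described as ‘hot’: one HTTP GET returning JSON that parses straight into native data structures, with none of the envelope and WSDL machinery a SOAP client needs. The endpoint URL and response fields are placeholders.

```python
import json
import urllib.request

# Placeholder endpoint -- any JSON-over-HTTP API follows the same pattern.
URL = "https://api.example.org/v1/items?q=cultural+heritage"

with urllib.request.urlopen(URL) as response:
    data = json.load(response)  # the JSON body becomes plain dicts and lists

# Hypothetical response shape: {"results": [{"title": ..., "year": ...}, ...]}
for item in data.get("results", []):
    print(item.get("title"), item.get("year"))
```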

The many users of laptops amongst us will have been interested to hear about the ‘Future of Wireless Power’.  The session didn’t go into great detail, but the message was very much “it’s not a new technology, and it’ll be here very soon”. Expect wireless power functionality in mobile devices in the next few years, using the Qi standard.

Some very interesting folks from MIT gave the thought-provoking ‘MIT Media Lab: Making Connections’ session. Joi Ito, Director of the MIT Media Lab, explained how it’s all about the importance of connecting people, stating that “we’re now beyond the cognitive limits of individuals, and are in an era where we rely on networks to make progress”. He suggested that traditional roadmaps are outmoded, and that we should throw them away and embrace serendipity if we’re to make real progress in technology. Ito mentioned that MIT has put significant funding into undirected research and an ‘anti-disciplinary’ approach. He said that we now have much agility in hardware as well as software, and that the agile software mentality is being applied to hardware development. He pointed to a number of projects that are embracing these ideas – idcubed, Affectiva, Sourcemap and Formlabs.

Josh Greenberg talked about ‘macroscopy’ in the ‘Data Visualization and the Future of Research’ session, which is essentially about how research is starting to be done at large scale. Josh suggested that ‘big data’ and computation are now very important for doing science, with macroscopy being the application of big data to research. He referred to the ‘Fourth Paradigm’ book, which presents the idea that research is now about data-intensive discovery. Lee Dirks from Microsoft gave us a look at some new open source tools they’ve been developing for data visualisation, including Layerscape, which allows users to explore and discover data, and ChronoZoom, which looked useful for navigating through historical big data. Lee mentioned ChronoZoom was good for rich data sources such as archive and museum data, demoing it using resources relating to the Industrial Revolution.

So that was about it for the sessions I was able to get to as part of the SXSW Interactive conference. It was a really amazing event, and I’d highly recommend it to anyone as a great way to meet some of the top people in the technology sector, and of course, hear some great sessions.

The slides from our session:



The Digital Public Library of America. Highlights from Robert Darnton’s recent talk

January 24, 2012

I was fortunate to be among those attending Robert Darnton’s talk on the Digital Public Library of America initiative last week. Harvard Professor and Director of the Harvard Library, Darnton is a pivotal figure behind DPLA, and his talk – most concurred – was both provocative and inspirational. More than a description of the DPLA initiative, Darnton framed his talk with key issues and questions for us to reflect upon. How can we provide a context where as much knowledge as possible is freely available to all? Where can we leverage the internet to change the emerging patterns of locked-down and monopolised chains of supply and demand? And as Professor David Baker highlighted in his introduction of Darnton, there is much alignment here with the broader and more aspirational ethos of Discovery: a striving to support new marketplaces, new patterns of demand, new business models – all in the ideal pursuit of the Public Good. Arguably naïve aspirations, but certainly the tenor in the room was one of consensus, a collective pleasure at being both challenged and inspired. Like Discovery, the DPLA is a vision, a movement, tackling these grand challenges while also striving to make practical inroads along the way.

The remainder of this post attempts to capture Darnton’s key points, and also highlight some of the interesting themes emerging in the Q&A session that followed.

————-

 “He who receives ideas from me, receives instruction himself without lessening mine; as he who lights his taper at mine receives light without darkening me” Thomas Jefferson

 

To frame his talk, Darnton invoked this oft-cited tenet of Thomas Jefferson – that the spread of knowledge benefits all. He aptly applied this principle to the internet, and specifically to Open Access for the Public Good and the assumption that one citizen’s benefit does not diminish another’s. But of course, he cautioned, this does not mean information is free, and we face a challenging time where, even as more knowledge is being produced, an increasingly smaller proportion of it is being made openly available to the public. To illustrate this, he pointed out that academic journal prices have increased at four times the rate of inflation, and these rates are expected to continue to rise even as universities and libraries face increasing cutbacks. We need to ask how that increase in price can be sustained. Health care may be a Public Good, but information about health is monopolised by those who will push it as far as the market will bear.

Darnton acknowledged that publishers will reply by deprecating the naiveté of the Jeffersonian framing of the issue. And, he conceded, journal suppliers clearly add value; it’s fair they should benefit – but how much? Publishers often invoke the concept of a ‘marketplace of ideas’. But in a free marketplace, the best will survive. For Darnton, we are not currently operating in a free marketplace, as demand is simply not flexible – publishers create niche journals, territorialise, and then crush the competition.

The questions remain, then: how can we provide a context where as much knowledge as possible is freely available to all? Where can we leverage the internet to change these locked-down and monopolised chains of supply and demand? The remainder of Darnton’s talk outlined the approaches being taken by the DPLA initiative. It’s early days, he acknowledged, but significant inroads are already being made.

So what is DPLA? A brief overview

Darnton addressed, relatively briefly, the scope and content of DPLA, the costs, the legal issues being tackled, technical approaches, and governance.

Scope and content: Like Discovery, the DPLA is not to be One Big Database – instead, the approach is to establish a distributed system aggregating collections from many institutions. The vision is to provide one-click access to many different resource types, with the initial focus on producing a resource that gives full-text access to books in the public domain, e.g. from HathiTrust, the Internet Archive, and US and international research libraries. He also carefully highlighted that the DPLA vision is being openly and deliberately defined in a manner that makes the service distinct from those offered by public libraries, for instance by excluding everything from the last 5-10 years (with a moving wall that shifts annually as more content becomes available in the public domain).

The key tactic to maximise impact and reduce costs will be to aggregate collections that already exist, so when it opens it will likely contain only a stock of public domain items, growing as fast as funding permits. To achieve this it will be designed to be as interoperable as possible with other digital libraries (for example, an agreement has already been made with Europeana). So far funding has been dedicated to building this technical architecture, but there is also a strong concentration on ongoing digitisation and on collaboratively funding such initiatives.

In terms of legal issues, Darnton anticipates that DPLA will be butting heads against copyright legislation – he clearly has strong personal views in this area (e.g. referring to the Google Books project as a ‘great idea gone wrong’, with Google failing to pursue making the content available under Fair Use), but he was careful to distinguish these views from any DPLA policy in this regard. As DPLA will be not-for-profit, he suggested that it might stand a good chance of invoking the Fair Use defence in the case of orphan works, for example, though he acknowledged this is difficult and new territory. Other open models referenced included a Scandinavian-style licence for public digital access to all books. He also sees potential for private partnerships in developing value-added, monetised services such as apps, while keeping the basic open access principles of the DPLA untouched.

The technical work of DPLA is still very much in progress, with a launch date of April 2013 for a technical prototype along with 6 winning ideas from a beta sprint competition. More information will be released soon.

In terms of governance, a committee has been convened and has only just started to study options for running DPLA.

Some questions from UK stakeholders

The Q&A session was kicked off by Martin Hall, Vice-Chancellor of Salford University, who commented that in many ways there is much to be hopeful for in the UK in terms of the Open agenda. Open Access is going strongly in the UK, with 120 open access repositories and, he stated, a government that seems to ‘get it’, largely because of a fascination with forces in the open market. As a result there is a clause in new policy about making public datasets available ‘openly’. This is quite an extraordinary statement, Hall commented, given the implications for public health and so on, and possibly indicates a step change. It all perhaps contributes to the quiet revolution occurring around Open Access.

Darnton responded by highlighting that in the USA they may have open access repositories, but there is a low compliance rate in terms of depositing (and of course this is an issue in the UK too). But Harvard has recently mandated deposit, and while compliance was less than 4% before, it is now over 50%, and the repository “is bulging with new knowledge.”

In addition, Darnton reminded the group, while the government might be behind ‘Open’, we still face opposition from the private sector. A lot of vested interests feel threatened by open access, and there is always a danger of vested interest groups capturing the attention of government. It’s good, he said, to see hard business reasons being argued as well as cultural ones, but we need to be very careful.

Building on this issue, Eric Thomas, Vice-Chancellor of Bristol University, raised the issue of demonstrating public value – how do we achieve this? He noted that the focus of Darnton’s talk was on the supply side, but what about demand? To what extent is the DPLA looking at ways to demonstrate public value, i.e. ‘this is what is happening now that couldn’t happen before…’?

In his response, Darnton referred to a number of grassroots approaches that are addressing this ‘demand’ side of the equation, including a roving Winnebago ‘roadshow’ to get communities participating in curating and digitising local artefacts. In short, DPLA is not about a website, but an organic, living entity… This approach, he later commented, was about encouraging participation from the top down and the bottom up.

Alistair Dunning from JISC posed the question of what will ‘stop people from going to Google?’ Darnton was keen to point out that while he critiqued Google’s approach to the million books copyright situation, DPLA is in no way about ‘competing’ with Google. People must and will use Google, and DPLA will open its metadata and indexes to ensure they are discoverable by search engines. DPLA would highly value a collaborative partnership with Google.

Peter Burnhill from EDINA raised the critical question of licensing. Making data ‘open’ through APIs can allow people to do ‘unimaginable things’; what will the licensing provision for DPLA be? CC0? Darnton acknowledged that this was still a matter of debate in terms of policy decisions, especially around content. He agreed that there were unthought-of possibilities in terms of apps using DPLA, and they want to add value by taking this approach (and presumably consider sustainability options moving forward). In short, the content would be open access and the metadata likely openly licensed, but reuse of the content itself *could* be commercialised in order to sustain the DPLA.

In a later comment, Caroline Brazier from the British Library expressed admiration for the vision, the energy and the drive. She explained that from the BL perspective ‘we’re there for anybody who wants to do research’, and highlighted how the British Library and the community more broadly have a huge amount to do to push on with advocacy, particularly around copyright issues. This forces institutions of all sizes to rethink their roles in this environment – there are no barriers here, she suggested: we can do things differently. We need to think individually about what we do uniquely. What do we do? What do we invest in? What do we stop doing? Funding will be precious, and we really need to maximise the possibility of getting funding.

Darnton agreed, and stated that there is a role for any library that has something unique to make it available (and of course, the British Library is the pinnacle of this). The US has many independent research libraries (the Huntington, the Newberry, etc.) and the DPLA very much wants to make room for them; they want to reach out to these research libraries, which may be open-minded but remain behind closed doors as far as the broader public is concerned.

The final (and perhaps one of the most thought-provoking questions) came from Graham Taylor from the Publishers Association. He stated that he concurred with much of what Darnton had to say (perhaps surprising, he suggested, given his role) but he did comment that throughout the afternoon he had “not heard anything good about publishers.” So, he asked, where do publishers fit? In many regards, publishers are the risk-takers, the ones who work to protect intellectual property, and get all works out there – including those that pose ‘risk’ because they are not guaranteed blockbusters.

Darnton strongly agreed that publishers do add value, but, he explained, what he’s attacking is excessive, monopolistic commercial practice to such an extent that it is damaging the world of knowledge. He was struck by Taylor’s comment on risk-taking, though, for indeed publishing is a very risky business. But sometimes the way risk is dealt with is unfortunate, with the emphasis on the blockbuster as opposed to a quality, sound backlist. So what can be done about this risk-taking and sharing the burden? Later this year, he said, Harvard would be hosting a conference that explores business opportunities in open access publishing. If publishers are gatekeepers of quality, how can open access be used to the benefit of publishing, and so alleviate that risk-taking and raise quality?


Looking out towards an ever widening horizon

September 22, 2011

Being asked to judge a competition is, to be frank, a terrifying experience and one for which I never quite feel up to the task. As the deadline for submitting my marks approaches my anxiety levels climb until I end up wondering whether asking my next-door neighbour to do it for me is a feasible and/or ethical option. And then, finally, I sit down with the judging criteria and the scoring sheet in front of me and everything starts to flow in an orderly and quietly magical way. I look at each entry in turn and give them marks and make supporting comments based on the judging criteria and my personal perspective. For a couple of hours I give my full attention to the competition entries in front of me and don’t let any thoughts about the rest of the world (particularly the rest of the judging panel or the competition entrants themselves) enter my head. Once I’ve scored every entry, I spend a very short amount of time checking that the scores I’ve given still make sense when I take a step back and look at them in relation to each other. Then I submit my scoring sheet and celebrate with a cup of tea. And then I start worrying again but only until I see the announcement of the winning entries and can feel reassured that my personal perspective was not so very far off the perspectives of the other judges.

Anyhow, this blogpost wasn’t meant to be a blow-by-blow account of the trials and tribulations of being a competition judge – what I really want to do is share some of my thoughts about the competition entries. What struck me straight off was the wide range of use cases, the deep originality demonstrated by the 11 entries and the potential many of them had for being turned into real-world applications with relatively little redevelopment work. In fact I gave six of the entries top marks for the ‘What potential does it have’ score because using them instantly sparked my imagination and I could see how the applications either opened up possibilities for other applications or had the potential to bring a new audience to a particular dataset. For instance, one of the winning entries, ‘What’s About’, uses your current location to reveal nearby English Heritage ‘nationally important places’ by visualising them on a Google Map. Straight away I could see that this could be useful for helping individual users discover places of interest on their doorstep that they might not be aware of. With very little further development I could imagine this being useful as a ‘virtual fieldtrip’ tool in geography or history classes. Or as a pre- and/or post-school-trip tool that allows students to explore the actual site of the visit (via Google Streetview) and read related books via the link to the British Library dataset. Or as a leisure and tourism search tool that could draw tourists to lesser-known sites of interest. The list goes on and on.
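To give a sense of how little machinery sits behind an idea like this, here is a rough sketch of the core step (my own illustration, not the What’s About code): filter a list of ‘nationally important places’ by distance from the user’s current position. The place records and coordinates below are invented for the example.

```python
from math import asin, cos, radians, sin, sqrt

# Illustrative records only -- the real application draws on the English Heritage
# 'nationally important places' open dataset.
PLACES = [
    {"name": "Whitworth Hall", "lat": 53.4668, "lon": -2.2339},
    {"name": "Castlefield Roman Fort", "lat": 53.4753, "lon": -2.2542},
]

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearby(lat, lon, radius_km=2.0):
    """Return the places within radius_km of the user's current position."""
    return [p for p in PLACES if distance_km(lat, lon, p["lat"], p["lon"]) <= radius_km]

print(nearby(53.47, -2.24))  # e.g. a user standing in central Manchester
```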

One of my areas of expertise is usability, so I was particularly interested in how easy each entry was to use. In some ways this was not a simple thing to judge because the entries ranged from those aimed squarely at folks with technical expertise, such as Mark van Harmelen’s ‘Command Line Ruby Database-free Processor’, through to applications aimed at end users with no technical or specialist knowledge, such as Alex Parker’s Timeline application, and others which had a foot in both camps, such as Thomas Meehan’s Lodopac. Also, the nature of the entries themselves varied greatly, from fully functioning user interfaces through to more conceptual, demonstrator-style entries. The only way for me to score the entries fairly was to take each application on its own merit. Generally speaking, though, I gave entries a higher score if they were easy to use on first view, and lower marks if an application couldn’t be used without referring to the supporting notes, or if it was aimed at a non-technical end user but, in practice, that user would need some technical expertise in order to get to grips with it. I wasn’t worried or adversely influenced by the odd technical glitch as long as it wasn’t obviously linked to a deeper user experience problem.

As a member of the JISC Activity Data programme synthesis team it was gratifying to see the OpenURL router dataset being utilised by four of the entries (one of which was from the team at EDINA who actually originated the data as part of their JISC AD project).

- Command Line Ruby Database-free Processor

- OpenURL Router Data Prototype Recommender [EDINA entry]

- OpenURL.ac.uk Stat Explorer

- Using Gource to Visualise OpenURL Router Data

Two of the OpenURL entries were also notable for their potential as serendipity/distraction engines – namely the applications submitted by EDINA and Chris Keene, which very quickly led me to intriguing-looking articles within one or two clicks of their respective interfaces. [Can I interest anyone else in Vogel et al’s Acoustic analysis of the effects of sustained wakefulness on speech or maybe in Benatar’s Why it is Better Never to Come Into Existence?]

Looking back at the accompanying notes I submitted, here’s a selection of my verbatim comments which highlight a few points of personal interest:

  • Composed impressed me with how tightly Owen Stephens had integrated the display of the MusicNet record within the relevant Copac record. My comment: “[...] the way the bookmarklet returns the results into the blank space on the right of the Copac record seems nothing short of magical. It would be good to explore how it could be expanded to work for non-music records.”
  • Timeline had me doing virtual cartwheels in the comment box. A subset of my comments: “A small amount of additional development would make it endlessly browseable, something I could get lost in for hours. [..]  A really elegant interface which could be used as another route to discovering items of interest in any visual collection [...].” I could also see potential for Alex’s interface to be combined with Yogesh Patel’s Discovery Map to add a geographical element to the interface.
  • Using Gource to Visualise OpenURL Router Data stood out for me because Tony Hirst’s use of visualisations “[...] elevate[s] data to something that is potentially engaging for an audience.” Not only are tools like Gource helping us to, literally, see data in a new way but visualisations present data in a way that, I would argue, increases its shareability to a wider audience [see websites like http://infosthetics.com/ and http://dailyinfographic.com/ for example].

Reflecting on the entries as a whole I was impressed by the quality of lots of the entries’ supporting notes (which made my job as a judge much easier) and greatly inspired by the possibilities that these competition entries open up in their wake. It’s worth mentioning that all the entries are open source (as a central condition of the competition) so I’ll be watching with interest to see what new applications and use cases emerge in response to the competition entries in the forthcoming months and beyond.


Developers’ entries help us explore new possibilities in discovery

September 15, 2011

It really was a tough call to pinpoint a clear winner for the #discodev competition. After we gave people a bit more time, using some of the August lull to work on applications, we ended up with a really good array of entries demonstrating a wide range of possibilities. A key judging criterion (obviously) concerns the usability of the application. But judging aside, I am personally less concerned with how usable a rapidly developed application is – and some of these applications have worked very effectively with complex and often dense datasets – than with how much it gets me thinking about potential use cases and benefits.

To a large degree, the Discovery programme is about identifying the potential, and where appropriate finding ways to build on someone’s seed of an idea. Applications such as Yogesh Patel’s experiment with Archives Hub linked data might only scratch the surface of the dataset, but they still prompt us to think about some of the great potential that exists. Along with What’s About, it hints at the potential of combining historic and contemporary geospatial data to provide new routes through to content: to explore the world of ‘exploration’ spatially, as opposed to through the linear and hierarchical structure of the archival description. I think the archival community especially is hungry for examples to help us get past some of our entrenched thinking about what discovery interfaces look like. Along with initiatives such as HistoryPin and OCLC’s MapFAST, these applications give us something tangible to react to and explore ideas around discovering library, archival, or museum data geospatially.

We’re also learning more about the potential of Linked Data. The entry from Mathieu D’Aquin, Discobro, complements the research and development activity of the JISC-funded LOCAH project perfectly in this regard. These are projects that enable the archival community to see how EAD rendered as linked data can become more embedded within the wider web of data; and instantly (it seems to me) we’re forced beyond the finding aid and the document-centric mindset, and into thinking about our descriptions as data that needs to be interlinkable to be found and used. It is remarkable how well Discobro works. My own search for the Stanley Kubrick archives in the Archives Hub using the bookmarklet immediately provided multiple links out to DBpedia entries on Kubrick’s life, cinematography, and films. All this is achieved not through a manual mashing of data, but through an automatic ‘meshing’ that can scale (which is perhaps one of the most heady promises of Linked Data).
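Discobro does this meshing in the browser via its bookmarklet, but the underlying move – follow a name from an archival description out to DBpedia and pull back related resources – is easy to sketch. The query below is my own illustration against DBpedia’s public SPARQL endpoint, not the project’s code.

```python
import json
import urllib.parse
import urllib.request

# DBpedia's public SPARQL endpoint; in practice the Archives Hub linked data is the
# starting point, but the 'follow the links out' step looks much like this.
ENDPOINT = "https://dbpedia.org/sparql"

QUERY = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?film ?label WHERE {
  ?film dbo:director dbr:Stanley_Kubrick ;
        rdfs:label ?label .
  FILTER (lang(?label) = "en")
} LIMIT 10
"""

params = urllib.parse.urlencode(
    {"query": QUERY, "format": "application/sparql-results+json"}
)
with urllib.request.urlopen(f"{ENDPOINT}?{params}") as response:
    results = json.load(response)

# Each binding links the archival subject (Kubrick) out to a related DBpedia resource.
for binding in results["results"]["bindings"]:
    print(binding["film"]["value"], "-", binding["label"]["value"])
```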

Will Linked Data be The Way Forward? The jury’s still out, but applications such as Discobro help us understand in much more tangible terms what benefits might be delivered.

And some applications demonstrated benefits that we can work on delivering much more immediately. For me the stand-out here is the OpenURL Router Recommender developed by Dimitrios Sferopoulos and Sheila Fraser at EDINA. My brain’s whirring with the possibilities of how we can include this functionality in article search services at the local or national level (for example, embedding it into the newly designed Zetoc, which will be launched later this year). The use case for recommender functions is already proven, although we have more to learn about such functions in academic and teaching contexts, but what EDINA have demonstrated is what you can achieve through the network effect – gathering data centrally. Patterns and relationships between articles emerge that are not readily available through other means. It’s simple, and the data’s already there waiting to be exploited. As a result we can provide routes through to discovery based on communities of use and disciplinary context, not descriptive metadata alone.
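For anyone curious what a recommender of this kind is doing under the hood, the basic pattern is simple to sketch: count which articles are requested together within the same OpenURL Router sessions, then surface the most frequent co-occurrences. The snippet below is a toy illustration with invented session data, not the EDINA prototype itself.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Illustrative OpenURL Router-style log: each inner list holds the articles
# requested within one user session (the real data is far richer than this).
sessions = [
    ["article-A", "article-B", "article-C"],
    ["article-A", "article-C"],
    ["article-B", "article-C", "article-D"],
]

# Count how often each pair of articles is requested in the same session.
co_occurrence = defaultdict(Counter)
for session in sessions:
    for a, b in combinations(set(session), 2):
        co_occurrence[a][b] += 1
        co_occurrence[b][a] += 1

def recommend(article, n=3):
    """People who requested this article also requested..."""
    return [other for other, _ in co_occurrence[article].most_common(n)]

print(recommend("article-A"))  # -> ['article-C', 'article-B'] for this toy log
```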

Neeta Patel’s simple visualisation of the MOSAIC circulation data demonstrates something similar. Through my involvement with the SALT and Copac Collections Management projects, we know that libraries are already using their circulation data (if they collect it) to inform collection management decisions, but that often this work involves scrutinising spreadsheets and figures. Visual views of the data can really help support such analysis, and give that at-a-glance overview that can often tell a whole story.

There’s obviously a lot more that could be said about these entries (I wish I could touch on them all) and hopefully we’ll hear some views from my Discovery cohorts. I’m now interested in seeing what conversations open up as a result, and what practical work we can carry forward through new collaborations.


#discodev — announcing the first Discovery developer competition

July 3, 2011

I’m pleased to announce that we’re teaming up with UKOLN’s Developer Community Supporting Innovation (DevCSI) to run a developer competition throughout July 2011.

There’s a lot of talk about the potential of open data, and how it can support innovation, but we want to try and drive that innovation in ways that help us understand in practical terms what’s possible, and what future use (and business) cases might look like.

It’s fairly simple — we want developers to build open source software applications or tools using at least one of our 10 open data sources collected from libraries, museums and archives. Other sources may be used (we encourage it).

This is a chance to win a nice chunk of Amazon vouchers (from £30 to £100) or an Eee Pad Transformer (a.k.a. a hackable equivalent to the iPad).

Enter simply by blogging about your application and emailing the blog post URI to me (joy.palmer@manchester.ac.uk) by the deadline of 2359 (your local time) on Monday 1 August 2011 (now extended to 2359 on Monday 22 August 2011).

More information about the rules, criteria for judging, and ideas for what might be worth trying etc., is all detailed here.

p.s. If you blog, tweet, etc, then use: #discodev (uh huh).


Discovery conference, May 26th. An overview of the day’s discussions

July 1, 2011

Here is a summary of the main ideas and themes from the presentations and discussions at the Discovery Conference at the Wellcome Institute, London, on 26 May 2011. It’s based on notes taken at the time, and is therefore by necessity to some extent selective, but I’ve tried to be comprehensive and true to the spirit of the day. I’ve included references to some of the key twitter themes as these help to highlight issues of interest to the community.

Jane Plenderleith of Glenaffric Ltd (and member of the Discovery Communications Team)

Opening Address from David Baker, JISC

Our starting point was the RDTF Vision Statement of 2009. Since then there’s been some discussion about scope, suggesting that the vision should not be limited to UK HE. Following some heated discussion at the 2010 JISC conference, the vision is about opening access for all. But we have to start somewhere, hence the focus on UK HE. In our definition of the future state of the art, it’s important not to try to project too far forward, so the focus is on what we aim to achieve by 2012.

We are aiming for integrated seamless access, focusing on UK HE in the first instance, with a thorough and open aggregated layer, designed to work with all search engines, through a diverse range of personalised and innovative discovery services. Increasing efficiencies is clearly important for sector leaders and managers – the potential of open data to address this priority needs to be emphasised.

At the moment we are in Phase 1 of this process, focusing on open data. More detail about the call from JISC for moving into Phase 2 will follow. Phase 1 achievements include:

- Excellent Open Bibliographic Data Guide

- Projects

- Metadata Guidelines

- Newsletter (people can sign up, keep in touch, feedback, engage in interchange of ideas and experiences)

There was a successful event in April, with good engagement, proposals for further work, and suggestions for a ‘Call to Arms’. This has resulted in the Statement with eight Open Metadata Principles in the pack for today’s event.

Eight projects have already been funded, focusing on a broad and appropriate range of issues, providing a test-bed for the Phase 1 work, and giving us a good platform on which to build Phases 2 and 3.

We are working on making metadata open and easier to use, distilling advice and guidance from the eight projects. Key stakeholder engagement is vital to this process. This new phase of work is under the brand ‘Discovery’. RDTF was a clunky old name, but when boiled down to the essence, it’s about developing a metadata discovery ecology for UK education and research.

Engaging stakeholders and developing critical mass is key. With the community we want to explore what open data makes possible. Since the first event in Manchester on 18 April many people have signed up to the Statement of Principles. Today’s event is about using this momentum to move the open data agenda forward.

Part 1: The Demand Side — User Expectations in Teaching, Learning and Research

Keynote 1: Stuart Lee

Stuart was addressing the conference (by filmed videolink) wearing two hats, one as a researcher in the humanities and one as an IT service manager at the University of Oxford.

He started with a historical overview of his data usage techniques: ‘When I was writing my PhD thesis, I had to produce a glossary. The normal method at the time was using a card system, which took a long time (a friend took one and a half years). But I was trained to use a text analysis tool, so it took me three hours. Later I was asked to produce a monograph of my thesis, but instead I made a pdf and put my thesis on the web. I didn’t know at the time that this was in fact open publishing, but this had far more impact than if I had published in book form. It’s been downloaded thousands of times, and made my reputation in this field.’

Stuart went on to reflect on how researchers in the humanities work, and what open data might mean. Researchers in the humanities never really finish. Projects have a long life span and are often revisited. We work in an iterative cycle; our research is unbounded and incomplete. We don’t just publish and move on to the next thing. We tend to work alone, in our own way, not in teams in labs. Print is a very important medium for us. We use primary and secondary resources, and we find stuff through browsing catalogues.

In a nutshell, we just want to find ‘useful stuff’. Modern researchers are less worried about provenance, they are more concerned with usefulness. Many collections that we use are built by other academics working in our field.

We use tools to edit, analyse and compare data. We need to organise material so we can quickly find it. We have to present our work in a particular way – present an argument, combine primary and secondary resources. Citation is important, particularly of recognised names in the field. The material we produce has to be safeguarded, archived, so we can come back to it and others can use it. We want it to be available for a long long time. It does not go out of date like science stuff does.

So what opportunities does open data present for researchers in the humanities? We are very interested in open data and the Discovery agenda. We can now achieve the previously impossible – find relevant resources quickly, deal with mass quantities of data (example: corpus linguistics), achieve low cost distribution (example: iTunesU). Storage is no longer a problem, we can search across data silos from our desk and take advantage of cross-searching possibilities.

Perhaps we undervalue serendipity when we are looking for resources. If you are scanning books in the library, you find useful stuff on either side of the one you are looking for. If you are browsing data on a keyword search this throws up lots of possibilities.

There are a lot of chances for collaboration using online tools. We work very much in our own sub domain, with international connections in our field. We need better bibliographic tools like Mendeley.

Inevitably, open data poses some problems and challenges. Who is a researcher? Increasingly, libraries have to meet the needs of people beyond the usual HE sphere, including the public and corporate bodies. There’s a lack of awareness about what is available. There’s a need for better standardisation. Text analysis tools haven’t advanced much in 20 years, and training undergraduates in their use is still necessary. There is still a problem with accessing data when we don’t know its provenance.

We need to break free of the stranglehold of academic publishers – we in the humanities are every bit as fed up about this as people working in the sciences. The system we have at present is unsustainable. We need to make metadata open to make it easier to find things. There are more challenges relating to the analysis of data, and preserving knowledge. We need support for adopting open content, both top down and bottom up.

Stuart ended with some comments on the changing nature of the library itself, the concept being no longer of a physical building, but a whole plethora of bodies holding information and making it available in what Stuart called the ‘cl**d’ (he doesn’t care for the word).

Keynote 2: Peter Murray-Rust

PMR’s focus was researchers in the STEM field, and he was provocative from the outset. How many practising scientists are in the room? None. That’s the problem – scientists have no use for university libraries and repositories.

There are global and domain solutions to resource discovery. We have the technical solutions – what we need to make this happen is political will. For example, only those universities which have mandated publishing work in repositories (such as Ghent, Queensland, and to some extent Southampton) actually use them.

By comparison, look at the Open Street Map project (an open information resource for global maps). People have really contributed to this. They even held mapping parties. Example: after the earthquake they created a digital map of Haiti in two days for the rescue services. That’s the power of crowd sourcing. But there is no sense of the power of this in JISC – their strategy is to rely on publishers getting the stuff for us. But publishers, says PMR, produce garbage (this remark aroused amused assent from most of the people in the room).

PMR continues in this provocative vein. ‘It is quite simple for us to produce our own discovery data. Example: I have an interest in UK theses, so I went to EThOS. I went with a simple and fundamental question – trying to access all Chemistry theses published in the UK in 2010. But they are scattered over different repositories, not searchable, and not available in any integrated way. In France they have the SUDOC catalogue, with 9 million bibliographic references. If there is one clear message from today, it’s “do what the French do”.’

It is technically trivial to turn documents into pdf, but this is an appalling way of managing data. PDF is like turning a cow into a hamburger. You can’t turn the hamburger back into a cow. (The twittersphere took up this comment and retweeted it many times).

Another example of where it doesn’t work: I put 2,000 objects into the Cambridge DSpace, but then I couldn’t get them out. I had to write some bash code to get my own objects back out again.

More provocation: We are paralysed by the absurdity of copyright. I know people who delight in not doing anything because of copyright. Any small interaction that is not automatic kills open data. Google just goes and does it.

PMR’s solution to these problems was to build his own repository – a graduate student did this in a year, which now costs about 0.25 FTE to maintain each year. Some funding was secured under JISC Expo to make open bibliographic data available. We have ‘liberated’ 25 million bibliographic references. It’s important to aim outwards not inwards. Example: PubMed is funded in the UK by Wellcome Trust. This organisation has done more than the whole of UK HE to push the open data agenda forward.

For PMR, what would really make this work is support from the major research funders. Wellcome, RCUK, Cancer Research UK. But they are not here today. If the funders were to mandate that all the work they fund is published openly, and state that if you don’t publish your data you won’t get another grant, this would have a serious impact. All that would then be required would be to manage the bibliography, and that’s easy. Open data just requires political will and management to make it happen.

Research Conversation

The opening keynotes gave rise to a lively debate about open data for research, with comments and questions from lots of people. The tweet wall was also animated, echoing key points and making further suggestions and generating ideas. Here’s a summary of the main questions and comments from this session:

What is the value of open data to researchers? What’s the value of a map to geographers? It’s a vital resource – we need to know who is doing what, with links to everything, with that we have the complete spine of scholarship. Bibliography is the map of scholarship. There are also management uses for data about published papers.

PMR said that data are complicated, diverse, and domain dependent. Every discipline is different and has its own views on what data is. It will take 25 years to sort what scientific data actually is on a technical level.

How important is provenance? Researchers care about provenance and how something came about. We need to exercise critical appraisal when assessing how information sources are constructed. But while provenance is important, it is also incredibly difficult. In the first instance what’s important is that the data is available.

What do we have to say to the funders to make them listen? Funders want the work they fund to be widely used, discovered, read, computed, built on.

The tweet wall at this point was alive with comments about IPR and copyright risks.

Is there the same ethos of collaboration and openness for museum data? Museums are protective of what is effectively their life’s work. There are copyright worries about the protection of intellectual capital. Providing an open record to a world where it might be challenged or used in a context for which it was not intended is quite challenging for museums. But it was also noted that there are people in museums who do want to share.

Should publicly funded research institutes make their data openly available? PMR praised organisations like BAS and NERC which are dedicated to maintaining data and making it publicly available. He noted that in academic communities this practice is variable. Some researchers would die rather than make their data available, while others are doing this quite freely. In some places there is an embargo on publication for five years, in case people might find out what they are doing. Issues relating to university ethics and data storage policies were mentioned.

It was suggested that what is needed in the sector is strong leadership promoting open data. There’s a particular problem with senior academic managers, working in a factional REF-dominated culture of competition. In industry, competitors manage to work together on issues of common interest, while still maintaining competition.

David Baker summarised the key issue: it is becoming apparent that the political and legal challenges to open data are more difficult than the technical.

Keynote 3: Drew Whitworth

Drew’s focus was the role of open data for teaching and learning in a variety of formal and informal contexts. A key theme was information as a resource in the environment – it does not diffuse itself evenly, it can be controlled, polluted, degraded. Drawing on Rose Luckin’s 2010 work ‘An ecology of resources’, Drew noted that an ecology evolves in a dynamic way. When you use resources you transform them into something else. This can be a problem – if we transfer resources into pollution, we are not using them in a sustainable way. Sustainable development means you meet current needs without damaging the process of meeting needs in the future. How are you using information now? Are you developing resources that will lead to enhanced resources in the future? We need to use resources now to build resources for the future.

In his book Information Obesity (2009) Drew presented the argument that while logically, information is a good thing and we need it, a lot of information can be a bad thing (why do we talk about information ‘overload’ not ‘abundance’?) It is the same with food – it is possible to have too much food or the wrong kind of food. Fitness means eating smaller amounts of right kind of food. We are under pressure to consume, and this works for information as well as food. Obesity is not just about over-consumption. It depends on individuals, and purpose. Athletes process lots of calories. Some of us can process lots of information. But we don’t want to turn learners into information processing machines.

Drew described the JISC-funded MOSI-ALONG project which was trying to connect museum artefacts of local relevance with real people and stories from the community.

In summary: we have to remember that learning and information processing happen all the time through communities. If we don’t look after our information environment it will become polluted. Environments are healthiest when they are diverse. We need to look after these environments and protect them against storms and natural disasters. It falls upon people in workplaces, and business leaders, to make sure the information environment on which we depend is sustainable. Our task is to look after these environments, and it’s everyone’s responsibility.

Teaching and Learning Conversation

Does the UK discovery ecotecture need to concern itself with usability or are we simply aiming to get the stuff out there? We definitely need a usability strategy. Otherwise people can just shove data in and it’s unusable.

We also need to be aware of our filtering strategies. We are programmed to filter sensory information all the time. We have known tendencies to filter out information that challenges our primary beliefs. You want to give help and guidance, you have a mental model of the data, you have some organising principles, but you need to build in some flexibility in case your mental models do not match those of your data users. This is key to effective use of these resources for learning and teaching. We need to guide, help, but not fix and control.

There is a danger in the paradigm of respected provenance: we need to be wary of gate-keeping, and think about filtering throughout the chain of use. But from a metadata standpoint, if we try to predict how users will use data, we tie ourselves in knots. For the Discovery initiative there is a sense that we just have to get the content out there so that communities can start to repurpose it for their needs.

Usability and discovery are different but related. The challenge is – we are used to usability in terms of HCI making it easy for people to navigate and use.

But how do we channel that thinking about flexible usability while still making it possible for people to uncover the complexity of the data?

Whatever usability criteria there are need to be continuously reviewed in the light of how people are using the data. Any organising principle can become too restrictive. Scaffolding learning is a good principle – but when the job is done the scaffolding comes down. The challenge is finding a way to use scaffolding for information retrieval then take it down so people can find for themselves.

We need a discussion about the nature of infrastructure, so the scaffolding notion is useful. If we immediately apply this – we have processes that generate metadata, and much of it is context bound. We are moving towards just-in-time metadata, that is generated from processes. You might need the scaffold for 10 seconds or 10 months.

The elephant in the room is VLEs. People who are populating VLEs are not putting together temporary scaffolding, it’s a bit more permanent. There are competing approaches to describing resources and we need to take this into account. The problem with VLEs for learners is that the second they leave the institution they no longer have access.

Information is a resource in a context. What else is necessary in order to turn information into learning? It’s not really possible to say what turns information into an educational resource: the quality of the teacher, the motivation of the student, the relevance in context. You cannot reduce education to a science; it is unpredictable, conversational, context specific.

There has to be redundancy to make our ecology healthy and diverse. Funders make decisions. JISC has a pretty good approach, do consultation, collaboration before they set priorities. For others funding is based more on political expediency, and this is worrying. There is a need to prioritise developments, but let’s do this in the right way and leave some room for flexibility.

At this point in the proceedings there was a welcome break for lunch.

Part 2: The Supply Side: Opportunities to Expand Access and Visibility

Summing up the morning, David Baker said:

Seamless access, flexible delivery, personalisation – if we can put these three together, there is a very exciting future.

The afternoon session was chaired by Nick Poole. HE/FE says ‘we need this’ – just do it. Politicians say just do it. We say do it, but do it well. The afternoon session presented examples of people who have just done it.

Veronica Adamson: The Art of the Possible, Special Collections

Key points from the discussion:

Special Collections may be the key that unlocks potential of open data for many people – it resonates, there is an understanding, examples of where LAM can really work together.

Having a business case is essential – LAM managers need to be able to make the case for open data on the basis of efficiency savings, improving the quality of learning and teaching, enhancing research output, widening participation, raising the profile of the institution.

Collections experts may not be the best people to make the business case for managers. It’s not just about listing benefits, it’s about costs and benefits.

There is some interesting work by Keith Walker at IoE looking at learning trails in museums.

Peter Burnhill: Aggregation as Tactic

Aggregation means combining different sources of data, seeing the machine as user.

There is a purpose beyond discovery.

Do we need to decouple the metadata layer from the presentation layer? This is a techie question but it’s important.

Supply and demand – maybe we are all middle folk. We are adding value by bringing different streams of data together, making them more amenable for access. Aggregation is an intervention with some purpose.

There was some activity on the tweet wall at this point relating to important parallels between the discovery agenda and OER in terms of aggregation as a tactic.

 

The final session was a conference discussion about the scope of projects which might catalyse collaboration, focusing on events, celebrations and anniversaries in 2012. The debate continues.

