Jul 112012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: A Repository-based Architecture for Capturing Research Projects at the Smithsonian Institution
Speaker(s): Thorny Staples

I have recently returned to the Smithsonian. I got into repositories through lots of digital research projects. I should start off by saying that I’ll show you screenshots for a system that allows researchers to deposit data from the very first moment of research, it’s in their control until it goes off to curators later.

I’m sure most of you know of the Smithsonian. We were founded to be a research institute originally – museums were a result of that. We have 19 museums, 9 scientific research centers, 8 advances study centres, 22 libraries, 2 major archives and a zoo (Washington zoo). We focus on longterm baseline research, especially in biodiversity and environmental studies, lots of research in cultural heritage areas. And all of this, hundreds of researchers working around the world, has had no systematic data management of digital researvh content (except for SAO who work under contract for NASA).

So the problem is that we need to capture research information as it’s created and make it “durable” – it’s not about presevation but about making it durable. The Smithsonian is now requiring a data management plan for ALL projects of ANY time. This is supposed to say where they will put their digital information, or at least get them thinking about it. But we are seeing very complex arrays of numerous types of data. Capturing the full structure and context of the research content is neccasary. It’s a network model, it’s not a library model. We have to think network from the very beginning.

We have to depend on the researvchers to do much of the work, so we have to make it easy. They have to at least minimally describe their data but they have to do something. And if we want them to do it we must provide incentives. It’s not about making them curators. They will have a workspace, not an archive. It’s about a virtual research environment but a repository-enables VRE. Primary goal is to enhance their research capabilities, leaving trusted data as their legacy. So to deliver that we have to care about a content creation and management environment, an analysis environment and a dissemination environment. And we have to think about this as two repositories: there is the repository for the researcher, they are data owners, they set policies, they have control – crucial buy-in and crucial concept for them; And then we have to think about an interoperable gathering service – a place researcher content feeds into and also cross search/access to multiple repositories back in the other direction as these researchers work in international teams.

Key to the whole thinking is the concept of the web as the model. It’s a network of nodes that are units of content, connected by arcs that are relationships. I was attracted to Fedora because of the notion of a physical object and a way to create networks here. Increasingly content will not be sustainable as discrete packages. We will be maintaining our part of the formalized world-wide web of content. Some policies will mean we can’t share everything all the time but we have to enable that, that’s where things are going. Information objects should be ready to be linked, not copied, as policy permits. We may move things from one repository to another as data moves over to curatorial staff but we need to think of it that way.

My conceptual take here is that a data object is one unit of content – not one file. E.g. a book is one object no matter how many pages (all of which could be objects). By the way this is a prototype, this isn’t a working service, it’s a prototype to take forward. And the other idea that’s new is the “concept object”. This is an object with a metadata about the project as a whole then a series of concept objects for the components of that project. If I want to create a virtual exhibition I might build 10 concept objects for those paintings and then pull up those resources.

So if you come into a project you see a file structure idea. Theres an object at the top for the project as a whole. Your metadata overview, which you can edit, lets you define those concepts. The researcher controls every object and all definitions. The network is there, they are operating within it. You can link concepts to each other, it’s not a simple hierachy. And you can see connections already there. You can then ingest objects – right now we have about 8 concept types (e.g. “Research site, plot or area”). When you pick that you then pick which of several forms you want to use. When you click “edit” you can see the metadata editor in a simple web form prepopulated with existing record. And when you look at resources you can see any resources associated with that concept. You can upload resources without adding metadata but it will show in bright yellow to remind you to add metadata. And you can attach batches of resources – and these are offered depending where you are in the network.

And if I click in “exhibit” – a link on each concept – you can see a web version of the data. This takes advantage of the adminstrator screen but allows me to publish my work to the web. I can keep resources private if I want. I can make things public if I want. And when browsing this I can potentially download or view metadata – all those options defined by researcher’s setting of policies.


Q1 – Paul Stanhope from University of Lincoln) Is there any notion of concepts being bigger than the institution, being available to others

A1) We are building this as a prototype, as an idea. So I hope so. We are a good microcosm for most types of data – when the researcher picks that they pick metadata schemas behind the scenes. This think we built is local but it could be global, we’re building it in a way that could work that way. With the URIs othwe intstitutions can link their own resources etc.

Q2) Coming from a university, do you think there’s anything different about your institution? Is there a reason this works differently?

A2) One of the things about the Smithsonian is that all of our researchers are Federal employees and HAVE to make their data public after a year. That’s a big advantage. We have other problems – funding, the government – but policy says that the researchers have to

Q3 – Joseph Green from University College Dublin) How do you convey the idea of concept objects etc. to actual users – it looks like file structures.

A3) Well yes, kind of the idea. If they want to make messy structures they can (curators can fix). The only thing they need is a title for their concept structure. They do have a file system BUT they are building organising nodes here. And that web view is an incentive – it’ll look way better if they fill in their metadata. Thats the beginning… for tabular data objects for instance they will be required to do a “code book” to describe the variables. They can do this in a basic way or they can do better more detailed code book and it will look better on the web. We are trying to incentivise  at every level. And we have to be fine with ugly file structures and live with it.

Topic: Open Access Repository Registries: unrealised infrastructure?
Speaker(s): Richard Jones, Sheridan Brown, Emma Tonkin

I’m going to be talking about an Open Access Repositories project that we have been working on, funded by JISC, looking at what Open Access repositories are being used for and what their potential is via stakeholder interviews, via a detailed review of ROAR and OPENDOAR, and somerecommendations.

So if we thought about a perfect/ideal repository as a starting point… we asked out stakeholders what they would want. They would want it to be authoritative – the right name, the right URL; they want it to be reliable; automated; broad scope; curated; up-to-date. The idea of curation and the role of human intervention would be valuable although much of this would be automated. People particularly wanted the scope to be much wider. If a data set changes there are no clear ways to expand the registry and that’s an issue. But all of those terms are really about the core things you want to do – you all want to benchmark. You want to compare yourself to others and see how you’re doing. And in our sector and funders they want to see all repositories, what are the trends, how are we doing with Open Access. And potentially ranking repositories or universities (like Times HE rankings) etc.

But what are they ACTUALLY being used for right now? Well mainly use them for documenting their own existing repositories. Basic management info. Discovery. Contact info. Lookups for services – use registry for OAI-PMH endpoints. So that’s I think, it looks as if we’re falling a bit short! So, a bit of background on what OA repository registries there are. So we have OpenDOAR, ROAR (Registry of Open Access Repositories) – those are both very broad scope repositories, well known and well used. But there is also the Registry of Biological Repositories. There is re3data.org – all research data so it’s a content type specific repository registry. And, more esoterically, is the Ranking Web of World Repositories. Not clear if this is a registry or a service on a registry. And indeed that’s a good question… what services run on registries. So things like BASE search for OAI-PMH endpoints, very similar to this is Institutional Respositories Search based at Mimas in the UK. Repository 66 is a more novel idea – mashup with Google Maps to show repositories around the world. Then there is the Open Access Repository Junction a multideposit tool for discovery and use of Sword endpoints.

Looking specifically at OpenDOAR and ROAR. OpenDOAR is run at University at Nottingham (SHERPA) and it uses manual curation. Only lists OA and Full-text repositories. It’s been running since 2005. Whereas DOAR is principally Repository Manager added records. No manual curation. And lists both full-text and metadata only. Based at University of Southampton and running EPrints 3, inc. SNEEP elements etc. Interestingly both of these have policy addition as an added value service. Looking at the data here – and these are a wee bit out of date (2011). There seems to be big growth but some flattening out in OpenDOAR in 2011 – probably approaching full coverage. ROAR has a larger number of repositories due to difference in listing but quite similar to OpenDOAR (and ROAR harvests this too). And if we look at where repositories are both ROAR and OpenDOAR are highly international. Slightly more European bias in OpenDOAR perhaps. The coverage is fairly broad and even around the globe. When looking at content type OpenDOAR is good at classifying material into types, reflective of manual curation. We expect this to change over time, especially datasets. ROAR doesn’t really distinguish between content types and repository types – it would be interesting to see these separately. We also looked at what data you typically see about the repository in any record. Most have name, URL, location etc. OpenDOAR is more likely to include a description and contact details than is the case in ROAR. Interestingly the machine to machine interfaces are a different story. OpenDOAR didn’t have any RSS or SWORD endpoint information at all, ROAR had little. I know OpenDOAR are changing this soon. This field has been added on later in ROAR and no-one has come back to update this new technology, that needs addressing.

A quick not about APIs. ROAR has OAI-PMH API, no client library, full data dump available. OpenDOAR has a fulled documented query API, no client library and full data dump available. When we were doing this work almost no one was using the APIs, they just download all data.

We found stakeholders, interviewees etc. noted some key limitations: content count stats are unreliable; not internationalised/multilingual- particularly problematic if a name is translated and is the same as but doesnt appear to be the same thing; limited revisions history; No clear relationships between repos, orgs, etc. And no policies/mechanisms for populating new fields (e.g. SWORD). So how can we take what we have and realise potential for registries? There is already good stuff going on… Neither of those registries automatically harvest data from repositories but that would help to make data more authoritative/reliable/up to date; automated; increased scope of data – and that makes updates so much easier for all.  And we can think about different kinds of quality control – no one was doing automated link checking or spell checking and those are pretty easy to do. And an option for human intervention was in OpenDOAR but not in ROAR, and that could be make available.

But we could also make them more useful for more things – graphical representaqtions of the registry; better APIs and Data (with standards compliance where relevent); versioning of repositories and record counts; more focus on policy tools.  And we could look to encourage overlaid services: repository content stats analysis; comparitive statistics and analytics; repository and OA rankings; text analysis for identifying holdings; error detection; multiple deposits. Getting all of that we start hitting that benchmarking objective.


Q1 – Owen Stephens) One of the projects I’m working on is CORE project from OU and we are harvesting repositories via OpenDOAR. We are producing stats about harvesting. Others do the same. It seems you are combining two things – benchmarking and repositories. We want OpenDOAR to be comprehensive, and we share your thoughts on need to automate and check much of that. But how do we make sure we don’t build both at the same time or separate things out so we address that need and do it properly?

A1) The review didn’t focus on structures of resulting applications so much. But we said there should be a good repository registry that allows overlay of other services – like the benchmarking services. CORE is an example of something you would build over the registry. We expect the registry to provide mechanism to connect up to these though. And I need to make an announcement: JISC, in the next few weeks, will be putting out an ITT to take forward some of this work. There will be a call out soon.

Q2 – Peter from OpenDOAR) We have been improving record quality in OpenDOAR. We’ve been removing some repositories that are no longer there – link checking doesn’t do it all. We also are starting to look at including those machine to machine interfaces. We are doing that automatically with help from Ian Stuart at EDINA. But we are very happy to have them sent in too – we’ll need that in some case

A2) you are right that link checkers are not perfect. More advanced checking services can be built on top of registries though.

Q3) I am also working on the CORE project. The collaboration with OpenDOAR where we reuse their data, it’s very useful. Because we are harvesting we can validate the repository and share that with OpenDOAR. The distinction between registries and harvesting is really about an ecosystem that can work very well.

Q4) Is there any way for repositories to register with schema.org to enable automatic discovery?

A4) We would envision something like that, that you could get all that data in a sitemap or similar.

A4 – Ian Stuart) If registering with Schema.org then why not register with OpenDOAR?

A4 – chair) Well with scheama.org you host the file, its just out on the web.

Q5) How about persistant URLs for repositories?

A5) You can do this. The Handle in DSpace is not a persistant URL for the repository.

Topic: Collabratorium Digitus Humanitas: Building a Collaborative DH Repository Framework
Speaker(s): Mark Leggott, Dean Irvine, Susan Brown, Doug Reside, Julia Flanders

I have put together a panel for today but they are in North America so I’ll bring them in virtually… I will introduce and then pass over to them here.

So… we all need a cute title and Collaboratory is a great word we’ve heard before. I’m using that title to describe a desire to create a common framework and/or set of interoperable tools providing a DH Scholars Workbench. We often create great creative tools but the idea is to combine and make best use of these in combination.

This is all based on Islandora. A Drupal+ Feora framework from UPEI. Flexible UI on top of Fedora and other apps. It’s deployed in over 100 institutions and that’s growing. The ultimate goal of those efforst is to release a Digital Humanities solutions packs with various tools integrated in, in a framework that would be of interest to scholarly DH context – images, video, TEI, etc.

OK so now my colleagues…

Dean is visiting professor in Yale, and also professor at Dalhousie University in Canada and part of a group that creates new versions of important modernism in canada prints. Dean: so this is the homepage for Modernist Commons. This is the ancillery site that goes with the Modernism in Canada project. One of our concerns is about long term preservation about digital data stored in the commons. What we have here is both the repository and a suite of editing tools. When you go into the commons you will find a number of collections – all test collections and samples from the last year or so. We have scans of a bilingual publication called Le Nigog, a magazine that was published in Canada. You can view images, mark-up, or you can view all of the different ways to organise and orchestrate the book object in a given collection. You can use an Internet Archive viewer or alternative views. The IA viewer frames things according to the second to last image in the object, so you might want to use an alternative. In this viewer you can look at the markup, entities, structures, RDF relations or whether you want to look at image annotations. The middle pane is a version of CWRC Writer that lets us do TEI and RDF markup. And you see the SharedCanvas tools provided with other open annotation group items. As you mark up a text you can create author authority files that can be used across collections/objects.

Next up Victoria Brown, her doctorate is on Victorian feminist literature. She currently researches collaborative systems, interface design, usability. Victoria: I’ll be talking more generally than Dean. The Canadian Writing Research Council is looking to do something pretty ambitios that only works in a collaborative DH environment. We have tools that can aim as big as we can. I want to focus on talking about a couple of things that define a DH Collaboratory. It needs to move beyond institutional repository model. To invoke persoective of librarian colleagues I want to address what makes us so weird… What’s different about us is that storing final DH materials is only part of the story, we want to find, amass, collect materials; to sort and organise them; to read, analyse and visualize. That means environments much be flexible, porous, really robust. Right now most of that work is on personal computers – we need to make these more scalable and interoperable. This will take a huge array of stakeholders buying into these projects. So a DH repository environment needs to be easy o manage, diverse and flexible. And some of these will only have a small amount of work and resources. In many projects small teanms of experts will be working with very little funding. So the CWRC Writer here shows you how you edit materials. On the right you see TEI markup. You can edit this and other aspects – entities, RDF open annotation mark up etc, notations allows you to construct triples from within the eidt. One of the ways to encourage interoperability is through use of common entities – connecting your work to the world of linked data. The idea is to increase consistency across projects with TEI markup and RDF means better metadata than the standard working in Word, publishing in HTML many use. So this is a flexible tool. Embedding this in a repository does raise questions about revisioning and archiving though. One of the challenges for repositories and DH is how we handle those ideas. Ultimately though we think this sort of tool can broaden participation in DH and collaboration in DH content. I think the converse challenge for DH is to work on more generalised environments to make sure that work can be interoperable. So we need to take something from solid and stable structure and move to the idea of shared materials – a porous silo maybe – where we can be specific to our work but share and collaborate with others.

The final speaker is Doug, he became first digital curator at NYPL. He’s currently editing music of the month blog at NYPL. Doug: the main thing we are doing is completely reconfiguring our repository to allow annotation of Fedora and take in a lot of audio nad video content. And particularly for large amounts of born digital collections. We’ve just started working with a company called BrightCove to share some of our materials. Actually we are hiring an engineer to design the interface for that – get in touch. We are also working on improved display interfaces. Right now it’s all about the idea of th egallery – the idea was that it would self-sustain through selling prints. We are moving to a model where you can still view those collections but also archival materials. We did a week long code sprint with DH developers to extend the Internet Archive book reader. We have since decided to move from that to New York Times backed reader – the NYT doc viewer with OCR and annotation there.


Q1) I was interested in what you said about the CWRC writer – you said you wnated to record every key stroke. Have you thought about SVN or GIT that do all that versioning stuff already.

A1 – Susan) They are great tools for version control and it would be fascinating to do that. But do you put your dev money into that or do you try to meet needs of greatest number of projects? But we would definitely look in that direction to look at challenges of versioning introduced in dynamic online production environments.


 July 11, 2012  Posted by at 12:27 pm LiveBlog, Updates Tagged with: , , , ,

Sorry, the comment form is closed at this time.