Jul 122012

Today we are liveblogging from the OR2012 conference at George Square Lecture Theatre (GSLT), George Square, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Kevin: I am delighted to introduce my colleague Peter Burnhill, Director of EDINA and Head of the Edinburgh University Data Library, who will be giving the conference summing up.
Peter: When I was asked to do this I realised I was doing the Clifford Lynch slot here! So… I am going to show you a Wordle. Our theme for this years conference was Local In for Global Out… I’m not sure if we did that but here is the summing up of all of the tweets from the event. Happily we see Data, open, repositories and challange are all prominent here. But Data is the big arrival. Data is now mainstream. If we look back on previous events we’ve heard about services around repositories… we got a bit obsessed with research articles, in the UK because of the REF, but data is important and great to see it being prominent. And we see jiscmrd here so Simon will be pleased he did come on his crutches [he has broken his leg].
I have to confess that I haven’t been part of the organising committee but my colleagues have. We had over 460 register from over 40 different nations so do all go to PEI. Edinburgh is a beautiful city but when you got here is was rather damp but it’s nicer now – go see those things. Edinburgh is a bit of a repository itself – we have David Hume, Peter Higgs and Harry Potter to boast – and that fits with local in for global out as I’m sure you’ve heard of two of them. And I’ve like to than John Howard, chair of the OR Steering Committe and our Host Organising Committee
Our opening keynote Cameron Neylon talked about repositories beyond academic walls and the idea of using them for turning good research outputs into good research outcomes. We are motivated to make sure we have secure access to content… as part of a more general rumbling with workshops before the formal start there was this notion of disruption. Not only the Digital Economy but also a sense of not being passive about that. We need to take command of the scholarly communication area that is our job – that cry to action from Cameron and we should heed that.
And there was talk of citation… LinkedIn, Academia.edu etc. is all about linking back to research to data. And that means having reliable identifiers. And trust is a key part of that. Publishers have trust, if repositories are to step up to that trust level you have to be sure that when you access that repository you get what it says it is. As a researcher you don’t use data without knowing what it is and where it came from. The respoitory world needs to think about that notion of assurance, not quality assurance exactly. And also that object may be interrogatable to say what it is and really help you reproduce that object.
Preservation and Provenance is also crucial,
Disaster recovery is also important.. When you fail, and you will, you need to know how you cope, really interesting to see this picked up in a number of sessions too.
I won’t  summarise everything but there were some themes…
We are beginning to deal with the idea on registries and how those can be leveaged for linking resources and identifiers. I don’t think solutions were found exactly but the conversations were very valuable.And we need to think about connectivity, as flagged by Cameron. And these places l,e twitter and Facebook… WE don’t own them but we need to be I them, to make sure that citations come back to us from here.And finally, we have been running a thing called repository fringe for the last four years, and then we won the big One. But we had a little trepidation as There afe a lot lf hou! And we had an uncondference strand. Ad i can say that UoE intends to do repository fringe in 2013.

We hope you enjoyed that unconference strand – an addition to complement the open repositories, not to take away from it but to add an extra flavour. We hope that the PEI folk will keep a bit f that flavour at OR and we will be running the fringe a wee bit later in the year, nearer the edinburgh fringe.

As I finish up I wanted to mention an organisation in IASSIST, librarians used to be about the demand side of services but things have shifted over time. We would encourage that those of us here lik up to groups like IASSIST (and we will suggest the same to them) and we can finds way to connect up, to commune together at PEI and to kshare experience. And so finally I think this is about the notion of connectivity. We have the technology, we have the opportunity to connect up more to our colleagues!

And with that I shall finish up!

Begin with an apology….

We seem to have the builders in. We have a small event coming up… The biggest festival in the world… Bt we didn’t realise that the builders would move in about the same week as you….what you haven’t seen yet is out 60x40ft upside down purple cow… If you are here a bit longer you may see it! We hope you enjoyed your time nonetheless

It’s a worrying thing hosting a conference like this… Lke hosting a party you worry if anyone will show up. But the feedback seems to have been good and and I have many thank yous. Firstly to all of those who reviewed papers. To our sponsors. To the staff here – catering, edinburgh first,nthe tech staff. Bt particularly to my colleagues on the local Host Orgnaising Committee: Stuart Macdonald, William Nixon, james toon,  andrew bevan – most persuasive committee member getting our sponsors on board, saly Macgregor, nicola osborne who has led our social media activity, and to Florance Kennedy, who has been using her experience of wrangling 1000 developers at FLOc a few years ago.

The Measure of success for any event like this is about the quality of conversation, of collaboration, of idea sharing, and that seems to have worked well and we’ve really enjoyed having you here. The conference doesn’t end now of course but changes shape.. And so we move onto the user groups!

 July 12, 2012  Posted by at 11:33 am LiveBlog, Updates Tagged with: ,  2 Responses »
Jul 122012

Today we are liveblogging from the OR2012 conference at George Square Lecture Theatre (GSLT), George Square, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Kevin Ashley is introducing us to this final session…

How many of you managed to get along to a Pecha Kucha Session? It looks like pretty much all of you, that’s fantastic! So you will have had a chance to see these fun super short presentations. Now as very few will have seen all of these we are awarding winners for each session. And I understand that the prizes are on their way to us but may not be at the podium when you come up. So… for the first session RF1, and in the spirit of the ceilidh, I believe it has gone to a pair: Theo Andrew and Peter Burnhill! For the second stream, strand RF3 it’s Peter Sefton – and Anna! For RF3 it’s Peter Van de Joss! And for RF4 it’s Norman Grey!

And now over to Mahendra Mahey for the Developer Challenge winners…

The Developer Challenge has been run by my project, DevCSI: Developer Community Supporting Innovation and we are funded by JISC, which is funded by UK Government. The project’s aims it about highlighting the potential, value and impact of the work developers do in UK Universities in the area of technical innovation, this is through sharing experience, training each other and often on volunteer basis. It’s about using tecnology in new ways, breaking out of silos. And running challenges… so onto the winners of the Developers Challenge at DevCSI this year.

The challenge this year was “to show us something new and cool in the use of repositories”. First of all I’d like to thank Alex Wade of Microsoft Research for sponsoring the Developer Challenge and he’ll be up presenting their special prize later. This year we really encouraged non developers to get involved to, but also to chat and discuss those ideas with developers. We had 28 ideas from splinter apps, repositories that blow bubble, SWORD buttons.. .and mini challenege appeared – Rob Sanderson from Los Alamos put out a mini idea! That’s still open for you to work on!

And so.. the final decisions… We will award the prizes and redo the winning pitches! I’d like to also thank our judges (full list on DevCSI site) and our audience who voted!

First of all honourable mentions:

Mark McGillivray and Richard Jones – getting academics close to repositories or Getting Researchers SWORDable.

Ben O’Steen and Cameron Neylon – Is this research readable

And now the Microsoft Research Prize and also the runners up for the main prize as they are the same team.

Alex: What we really loved was you guys came here with an idea, you shared it, you changed it, you worked collaboratively on it and

Keith Gilmerton and Linda Newman for their mobile audio idea.

Alex: they win a .Net Gadgeteer rapid prototyping kit with motherboard, joystick, monitor, and if you take to Julie Allison she’ll tell you how to make it blow bubbles!

Peter Sefton will award the main prize…

Peter: Patricks visualisation engine won as we’re sick of him entering the developer challenge

The winners and runners up will share £1000 of Amazon Vouchers and the winning entry – the team of one – will be funded to develop the idea – 2 days development time. Patrick: I’m looking for collaborators and also an institution that may want to test it get in touch.

Linda and Keith first

Linda: In Ohio we have a network of DSpace repositories including the Digital Archive of Literacy Narratives – all written in real peoples voices and using audio files, a better way to handle these would be a boon! We also have an Elliston Poetry Curator – he collects audio on analogue devices, digital would be better. And in the field we are increasingly using mobile technologies and the ability to upload audioj or video at the point of creation with transcript would greatly increse the volume of contribution

MATS – Mobile AudioVisual Transcription Service

Our idea is to create an app to deposit and transcript audio – and also video – and we used SWORDShare, an idea from last years conference, as we weren’t hugely experienced in mobile development. We’ve done some mock ups here. You record, transcribe and submit all from your phone. But based on what we saw in last years app you should be able to record in any app as an alternative too. Transcription is hugely important as that makes your file indexable. And it provides access for those with hearing disabilities, and those that want to preview/read the file when listening isn’t an option. So when you have uploaded your file you request your transcription. You have two options. Default is Microsoft Mavis – mechanical transcription. But you can also pick Amazon Mechanical Turk – human transcription, and you might want that if the audio quality was very poor or not in English.

MAVIS allows some additional functionality – subtitling, the ability to jump to a specific place in the file from a transcript etc. And a company called GreenButton offers a webservices API to MAVIS. We think that even if your transcription isn’t finished you can still submit to the repository as new version of SWORD supports updating. That’s our idea! We were pitching this idea but now we really want to build it! We want your ideas, feedback, tech skills, input!

And now Patrick McSweeney and DataEngine.

My friend Dave generated 1TB data in every data run and the uni wouldnt host that. We found a way to get that data down to 10 GB for visualisation. It was back ups on a home machine. It’s not a good preservation strategy. You should educate and inform people and build solutions that work for them!

See: State of the Onion. A problem you see all the time… most science is long tail, and support is very poor in that long tail. You have MATLAB and Excel and that’s about it. Dave had all this stuff, he had trouble managing his data and graphs. So the idea is to import data straight from Dave’s kit to the repository. For Dave the files were CSV. And many tools will export to it, its super basic unit of data sharing – not exciting but it’s simple and scientists understand it.

So, at ingest you give your data provenance and you share your URIs, and you can share the tools you use. And then you have tools for merging and manipulation. the file is pushed into storage form where you can run SQL processing. I implemented this in an EPrints repository – with 6 visualisation but you could add any number. You can go from source data, replay experiment, and get to visualisations. Although rerunning experiments might be boring you can also reuse the workflow with new similar data. You can create a visualisation of that new data and compare it with your original visualisation and know that the process has been entirely the same.

It’s been a hectic two days. It’s a picture (of two bikers on a mountain) but it’s also a metaphor. There are mountains to climb. This idea is a transitional idea. There are semantic solutions, there are LHC type ideas that will appear eventually but there are scientists at the long tail that want support now!

And finally… thank you everyone! I meant what I said last night, all who presented yesterday I will buy a drink! Find me!

I think 28 ideas is brilliant! The environment was huge fun, the developers lounge were a lovely space to work in.

And finally a plug… I’ve got a session at 4pm in the EPrints track and that’s a real demonstration of why the Developer Challenge works as the EPrints Bazaar, now live, busy, changing how we (or at least I) think about repositories started out at one of these Developer Challenges!

At the dinner someone noted that there are very few girls! Half our user base are women but hardly any women presented at the challenge, Ladies, please reprasent.

And also… Dave Mills exist. It is not a joke! He reckons he generated 78 GB of data – not a lot, you could probably get it on a memory stick! Please let your researchers have that space centrally! I drink with reseachers and you should too!

And Ben, Ben O’Steen had tech problems yesterday but he’s always here and is brilliant. isthisresearchreadable.org is live right now, rate a DOI for whether its working.

And that’s all I have to say.

And now over to Prince Edward Island – Proud Host of OR 2013

I’m John Eade, CEO of DiscoveryGarden and this is Mark Leggot. So, the first question I get is where are you? Well we are in Canada! We are tiny but we are there. Other common questions…

Can I walk from one end of the island to the other? Not in a day! And you wouldn’t enjoy it if you did

How many people live there? 145,000 much more than it was

Do Jellyfish sting? We have some of the warmest waters so bring your swimsuit to OR2013!

Can you fly there? Yes! Direct from Toronto, Montreal, Halifax, Ottawa,(via Air Canada and Westjet) and from New York City (via Delta). Book your flights early! And Air Canada will add flights if neccassary!

We will work diligently to get things up on line as early as possible to make sure you can book travel as soon as possible.

Alternatively you can drive – you won’t be landlocked – we are connected to mainland. Canada is connected to us. We have an 8 mile long bridge that took 2 and a half years to build and its 64 metres high – its the highest point in PEI and also the official rollercoaster!

We are a big tourism destination – agriculture, fishing, farming, software, aerospace, bioresources. We get 1 million tourists per year. That means we have way more things to do there than a place our size should – championship quality gold courses. Great restaurants and a culinary institute. We have live theatre and we are the home of Anne of Green Gables, that plucky redhead!

We may not have castles… but we have our own charms…!

Cue a short video…

Mark: free registration if you can tell me what the guy was doing?

Audience member: gathering oysters?

Mark: yes! See me later!

So come join us in Prince Edward Island. Drop by our booth in the Concourse in Appleton Tower concourse for another chance to win free registration to next years event. We’ve had lots of support locally and this shoudl be a great event!

 July 12, 2012  Posted by at 10:34 am LiveBlog, Updates Tagged with:  Comments Off on Developer’s Challenge, Pecha Kucha Winners and Invitation to OR2013 LiveBlog
Jul 122012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Digital Preservation Network, Saving the Scholarly Record Together
Speaker(s): Michele Kimpton, Robin Ruggaber

Michelle is CEO of DuraSpace. Myself and Robin are going to be talking about a new initiative in the US. This initiative wasn’t born out of grant funding but by university librarians and CIOs who wanted to think about making persistent access to scholarly materials and knew that something needed to be done at scale and now. Many of you will be well aware that libraries are being asked to preserve digital and born digital materials and there are not good solutions to do that in scale. Many of us have repositories in place. Typically there is an online or regular backup but these aren’t at preservation scale here.

So about a year ago a group of us met to talk about how we might be able to approach this problem. And from this D-P-N.org – Digital Preservation and Network – was born. DPN is not just a technical architecture. It’s an approach that requires replication of complete scholarly record access nodes with diverse architectures without single points of failure. It’s a fedration. And it is a community allowing this to work at mass scale.

At the core of DPN are a number of replicated nodes. There are minimum of three but up to five here. The role of the nodes is to have complete copies of content, full replications of each replicating nodes. This is a full content object store, not just a metadata node. And this model can work with multiple contributing nodes in different institutions – so those nodes replicate across architectures, geographic locations, institutions.

DPN Principle 1: Owned by the community

DPN Principle 2: Geographical diversity of nodes

DPN Principle 3: Diverse organisations – Uof Michigan, Stanford, San Diego, Academic Presrvation Trust, University of Virginia.

DPN Principle 4: Diverse Software Architectores – including iRODS, HATHI Trust, FedoraCommons, Standford Digital Library

DPN Principle 5: Diverse Political Environments – we’ve started in the US but the hope it to expand out to a more diverse global set of locations

So DPN will preserve scholarship for future generations, fund replicating ndes to ensure functional independence, audit ad verify content, provide a legal framework for holding succession rights – so if a node goes down this means the content will not be lost. And we have a diverse governance group taking responsibility for specific areas.

To date 54 partners and growing, about $1.5 million in funding – and this is not grant funding – and we now have a project manager in place.

Over to Robin…

Robim: Many of the partners in the APTrust have also been looking at DPN. APTrust ia a consortium committed to creation and management of an aggregated preservation repository and, now that DPN underway, to be a replicating node. APTrust was formed for reasons of community-building, economies of scale – things we could do together that we could not do agin, aggregated content, long term preservation, disaster recovery – particularly relevent given recent east coast storms.

The APTrust has several arms: Business and marketing strategy; governance policy and legal framework; preservation and collection framework; repository implementation plan – the technical side of APTrust and being a DPN node. So we had to bring together University librarians, technology liaisons, ingest/preservation. The APTrust services are the aggregation repository, the separate replicating node for DPN, and the access service – initially for administaration but also thinking about more services for the future.

There’s been a lot of confusion as APTrust and DPN started emerging at about the same time. And we are doing work with DPN. So we tend to think of the explanation here being about Winnowing of Content with researchers repository of files at the top, then local institutions repositories, then AP trust – preservation for our institutions that provide robustness for our content, and DPN is then for long term preservation. APTrust is preservation and access. DPN is about preservation only.

So the objectives of the initial phase of the APTrust is engaging partners, defining sustainable business model, hiring a project director, building the aggregation repository and setting up our DPN node. We have an advisory group for the project looking at governance. The service implementation is a phased approach building on experience, leveraging open soure – cloud storage, compute notes, DuraCloud all come into play, economies of scale, TRAC – we are using as a guideline for architecture. APTrust will be sitting at the end of legacy workflows for ingest, it will take that data in, ingest to DuraCloud services, synching to Fedora aggregation repository, and anything for long term preservation will also move to the APTrust DPN Noder with DuraCloud OS via cloudsync.

In terms of the interfaces there will be a single administrative interface which gives access to admin of DuraCloud, CloudSync and Fedora. Which will allow audit reports, functionality in each individual area etc. And that uses the API for each of those services. We will have a proof of that architecture at end of Q4 2012. Partners will feedback on that and we expect to deploy in 2013. Then we will be looking at disaster recovery access services, end-user acces, format migration services – considered a difficult issue so very interesting, best practices fro content types etc., coordinated collection development – across services, hosted repository services. Find out more at http://aptrust.org and http://d-p-n.org/


Q1) In Denmark we are building our national repository which is quite like DPN. Something in your preserntation: it seems that everything is fully replicated to all nodes. In our organisation services that want to preserve something they can enter a contract with another/a service and that’s an economic way to do things but it seems that this model is everthing for everyone.

A1 – Michelle) Right now the principle is everyone gets a copy in everything. We may eventually have specialist centres for video, or for books etc. Those will probably be primarily access services. We do have a diverse ecosystem – back ups across organisations in different ways. You can’t choose stuff in one or another node.

Q2) This looks a lot like LOCKSS – what is the main difference between DPN and a private LOCKSS network.

A2) LOCKSS is a technology for preservation but it’s a single architecture. It is great at what it does so it will probably be part of the nodes here – probably Stanford will use this. But part of the point is to have multiple architectural systems so that if there is an attack on one architecture just one component of the whole goes down.

Q3) I understand the goal is replication but what about format obsolescence – will there be format audit and conversion etc?

A3 – Michelle) I think who does this stuff, format emulation, translation etc. has yet to be decided. That may be at node level not network level.

Topic: ISO : Trustworthy Digital Repository Certification in Practice

Speaker(s): Matthew Kroll, David Minor, Bernie Reilly, Michael Witt

This is a panel session chaired by Michael Witt of Purdue University. This is about ISO 16363 and TRAC, the Trustworthy Repository Audit Checklist – how can a user trust that data is being stored corrrectly and securely, that it is what it says it is.

Matthew: I am a graduate research assistant working with Micheal Witt at Purdue. I’ve been preparing the Purdue Research Repository (PURR) for TRAC. We are a progressive repository with online workspace and data sharing platform, to user archiving and access, to preservation needs of Purdue University graduates, researchers and staff. So for today I will introduce you to ISO 16363 – this is the users guide that we are using to prepare ourselves, I’ll give an example of trustworthiness. So a neccassary and valid question to ask ourselves is “what is “trustworthiness” in this context?” – it’s a very vague concept and one that needs to grow as the digital preservation community and environment grows.

I’d like to offer 3 key qualities of trustworthiness (1) integrity, (2) sustainability, (3) support. And I think it’s important to map these across your organisations and across the three sections of ISO 16363. So, for example, integrity might be that the organisation has sufficient staff and funding to work effectively. Or for the repository it might be that you do fixity checks, procedures and practices to ensure successful migration or translation, similarly integrity in infrastructure may just be offsite backup. Similarly sustainability might be about staff training being adequate to meet changing demands. These are open to interpretation here but useful to think about.

In ISO 16363 there are 3 sections of criteria (109 criteria in all): (3) Organizational Infrastructure; (4) Digital Object management; (5) Infrastructure and Security Risk Management. There isn’t a one-to-one relationship in documentation here. One criteria might have multiple documents, a document might support multiple criteria.

Myself and Micheal created a PURR Gap Analysis Tool – we graded ourselves and brought in experts from the organisation in the appropriate areas and we gave them a pop quiz. And we had an outsider read these things. This had great benefit – being prepared means you don’t overrate yourself. And secondly doing it this way – as PURR was developing and deploying our process here – we gained a real understanding of the digital environment

David Minor, Chronopolic Program Manager, UC San Diego Libraries and San Dieo Supercomputer Center: We completed the Trac process this April. We did it through the CDL organisation. We wanted to give you an overview of what we did, what we learnt. So a bit about us first. Chronopolis is a digital preservation network based on geographic replication – UCSD/SDSC, NCAR, UMIACS. We were initially funded vid the Livrary of Congress NDIIPP program. We spun out into a different type of organisation recently, a FIFA service. Our management and finances are via UCSD. All nodes are independent entities here – interesting questions arise from this for auditors.

So, why do TRAC? Well we wanted to do validation of our work – and this was a last step in our NDIIPP process – an important follow on for development. We wanted to learn about gaps, things we could do better. We wanted to hear what others in the community had to say – not just those we had worked for and served but others. And finally it sounds cyncial but it was to bring in more business – to let us get out there and show what we could do particularly as we moved into FIFA service mode.

The process logistics were that we began in Summer 2010 and finished Winter 2011. We were a slightly different model. We were a self-audit that then went to auditors to follow up, ask questions, speak to customers. The auditors were three people who did a site visit. It’s a closed process except for that visit though. We had management, finances, metadata librarians, and data centre managers – security, system admin etc all involved – equiverlent of 3 FTE. We had discussed with users and customers. IN the end we had hundreds of pages of documentation – some writen by us, some log files etc.

Comments and issues raised by auditors were that we were strong on technology (we expected this as we’d been funded for that purpose) and spent time commenting on connections with participant data centres. They found we were less strong on business plan – we had good data on costs and plans but needed better projections for future adoption. And we had discussion of preservation actions – auditors asked if we were even doing preservation and what that might mean.

Our next steps and future plans based on this experience has been to implement recommendations working to better identify new users and communities, improve working with other networks. How do changes impact audit – we will “re-audit|” in 18-24 months – what if we change technologies? What is management changes? And finally we definitely have had people getting in touch specifically because of knowing we have been through TRAC. All of our audit and self-audit materials are on the web too so do take a look.

Bernie from the Centre for Research Libraries Global Resources Network: We do audits and certification of key repositories. We are one of the publishers of the TRAC checklist. We are a publisher not an author so I can say that it is a brilliant document! We also participated in development of recent ISO standard 16363. So, where do we get the standing to do audits, certification and involvement in standards. Well we are a specialist centre in

We started in UofChichargo, Northwestern etc. established in the 1949. We are a group of 167 universities in US, Canada and Hong Kong and we are about preserving key research information for humanities and social science. Almost all of our funding comes from the research community – also where are stakeholders and governance sit. And the CRL Certification program has the goal to support advanced research. We do audits of repositories and we do analysis and evaluations. We take part in information sharing and best practice. We aim to do landscape studies – recently been working on digital protest and documentation

Portico, Cronopolic, currently looking at PURR and PTAB test audits. The process is much as described by my colleagues. The repository self-audits, then we request documentation, then a site visit, then report is shared via the web. In the future we will be doing TRAC certification alongside ISO 16363 and we will really focus on Humanities and social science data. We continue to have the same mission as when we were founded in 1949, to enable the resiliance and durability of research information.


Q1 Askar, State University of Denmark) The finance and sustainability for organisations in TRAC… it seemed to be predicated on a single repository and that being the only mission. But national archives are more “too big to fail”. Questionning long term funding is almost insulting to managers…

A1) Certification is not just pass/fail. It’s about identifying potential weakness, flaws, points of failure for a repository. So for a national library they are too big to fail perhaps but the structure and support for the organisation may impact the future of the repository – cost volitility, decisions made over management and scope of content preserved. So for a national institution we look at finance for that – is it a line item in national budget. And that comes out in the order, the factors governing future developments and sustainability.

Topic: Stewardship and Long Term Preservation of Earth Science Data by the ESIP Federation
Speaker(s): Nancy J. Hoebelheinrich

I am principle of knowledge management at Knowledge Motifs in California. And I want to talk to you about preservation of earth science data by ESIP – Earth Science Informaion Partners. My background is in repositories and metadata and I am relatively new to earth sciences data and there are interesting similarities. We are also keen to build synergies with others so I thought it would be interesting to talk about this today.

The ESIP Federation is a knowledge network for science data and technology practitions – people who are building component for a science data infrastructure. It is distributed geographically, in terms of topic, interest. It’s about a community effort, free flowing ideas in a collaborative environment. It’s a membership organisation but you do not have to be a member to participate. It was started by NASA to support Earth Obervation data work. The idea was to not just rely on NASA for environmental resewarch data. They are interested in research, application in education etc. The areas of interest include climate, ecology, hydrometry, carbon management, etc. Members are of four types: Type 4 are large organisations and sponsors including NOAA and NASA. Type 1 are data centres – similar to libraries but considered separate. Type 2 are researchers and Type 3 are Application developers. There is a real cross sectoral grouping so really interesting discussion arises.

The type of things the group is working on are often in data informatics and data science. I’ll talk in more detail in a second but it’s important to note that organisations are cross functional as well – different stakeholders/focuses in each. We coordinate the community via In Person Meetings, ESIP Commons, Telecons/WebEx, Clusters, Working Groups and Committes and these all feed into making us interoperable. We are particularly focused on Professional development, outreach and collaboration. We have a number of active groups, committees and clusters.

Our data and informatics area is about collaborative activities in data preservation and stewardship, semantic web, etc. Data preservation and stewardship is very much about stewardship principles, ditation guidelines, provenance context and content standards, and linked data principles. Our Data Stewardship Principles are hat they are for data creators, intermediaries and data users. So this is about data management plans, open exchange of data, metadata and progress etc. Our data citation guidelines were accepted by ESIP Membership Assembley in January 2012. These are based on existing best practice from International Polar Year citation guidelines. And this ties into geospatial data standards and these will be used by tools like the new Thomson Reuters new Data Citation Index.

Our Provenance, Context and Content Standard are about thinking about the data you need about a data set to make it preservable into the long term. So this is about what you would want to collect and how you would collect that. Initially based on content from NASA and NOAA and discussions associated to them. It was developed and shared via the ESIP wiki. The initial version was in March 2011. latest version is June 2011 but this will be updated regularly. The categories are focused mostly on satellite remote setting – preflight/preopertional instrument descriptions etc. And these are based on Use cases – based on NASA work from 1998. What has happened as a result of that work is that NASA has come up with a specification for their data for earth sciences. They make  a distinction betweeen documentation and metadata, a bit differently from some others. Categories here in 8 areas – many technical but also rationale. And these categories help set baseline etc.

Another project we are working on is Identifiers for data objects. There was an abstract research project on use cases – unique identification, unique location, citable location, scientifically unique identification. They came up with categories and characterstics and questions to ask each ID schemes. The recommended IDs ended up being DOI for a citable locator and UUID for unique identifier but we wanted to test this. We are in the process of looking at this at the moment. Questions and results will be compared again.

And one more thing the group is doing is Semantic Web Cluster Activities – they are creating different ontologies for specific areas such as SWEET – an ontology for environmental data. And there are services built on top of those ontologies (Data Quality Screening Service on weather and climade data from space (AIRE) for instance) – both are available online. Lots of applications for this sort of data.

And finally we do education and outreach – data management training short courses. given that it’s important that researchers know how to manage their data we have come up with a short training courses based on the Khan Acadaemy model. That is being authored and developed by volunteers at the moment.

And we have associated activities and organisations – DataOne, DataConservancy, NSF’s Earth Cube. If you are interested to work with ESIP please get in touch. If you want to join our meeting in Madison in 2 weeks time there’s still time/room!


Q1 – Tom Kramer) It seems like ESIP is an eresearch community really – is there a move towards mega nodes or repository or is it still the Wild West?

A1) It’s still a bit like the Wild West! Lots going on but les focus on distribution and preservation, the focus is much more about making data is ingested and made available – where the repositories community was a few years ago. ESIP is interested in the data being open but not all scientists agree about that, so again maybe at the same point as this community a few years ago.

Q2 – Tom) So how do we get more ESIP folk without a background in libraries to OR2012?

A2) Well I’ll share me slides, we probably all know people in this area. I know there are organisations like EDINA here. etc.

Q3) [didn’t hear]

A3) EarthCube area to talk about making data available. A lot of those issues are being discussed. They are working out the common standard OGC, ISO, sharing ontologies but not nessaccarily preservation behind repositories. It’s sort of data centre by data centre.

Topic: Preservation in the Cloud: Three Ways (A DuraSpace Moderated Panel)
Speaker(s): Richard Rodgers, Mark Leggott, Simon Waddington, Michele Kimtpon, Carissa Smith

Michelle: DuraCloud was developed in the last few years. It’s a software but also a SAAS (Software As A Service) service. So we are going to talk about different usage etc.

Richard Rodgers, MIT: We at MIT libraries participated in several pilot processes in which DuraCloud was defined and refined. The use case here was to establish a geo distributed replication of the repository. We had content in our IR that was very heterogenous in type. We wanted to ensure system administration practices only  address HW or admin failues – other errors unsecured. Service should be automatic yet visible.We developed a set of DSpace tools geared towards collection and administration. DuraCloud provided a single convenient point of service interoperation. Basically it gives you an abstractiojn to multiple backend services. That’s great as it means that applications and protects against lock-in. Tools ad APIs for DSpace integration. High bandwidth acces to developers. Platform for preservation system and institution-friendly service terms.

Challenges and solutions here… It’s not clear how the repository system should create and manae the files yourself. Do all aspects need to have correllated archival units. So we decided to use AIPs – units of replication which packages items together, they gather loose files. There is repository managere involveement – admin UI, integration, batch tools. There is an issue of scale – big data files really don’t suit interactivity in the cloud, replication can be slow, queued not synchronous. And we had to design a system were any local error wouldn’t be replicated (e.g. deletion locally isn’t repeated in replication version). However deletion is forever – you can remove content. The code we did for the pilot has been refined some what and is available for DSpace as an add on – we hink it’s fairly widely used in the DSpace community.

Mark Leggott, University of PEI/DiscoveryGarden: I would echo the complicated issues you need to consider here. We had the same experience in terms of very responsive process with DuraSpace team. Just a quick bit of info on Islandora. It is a Drupal + Fedora framework from UPEI. Flexible UI and apps etc. We think of DuraCloud as a natural extension of what we do. The approach we have is to leverage DuraCloud and CloudSync. The idea is to maintain the context of individual objects and;/or complete collections. To enable a single button restore of damaged edits. And it integrate with standard or private DC. We have an initial release coming. There is a new component in the Manage tab in the Admin panel called “Vault”. It provides full access to DuraCloud and CloudSync services. It’s accessible through Islandora Admin Panel – you can manage settings. you can integrate it with your DuraSpace enabled service. Or you can do this via DiscoveryGarden where we manage DuraCloud on client’s behalf. And in maaging youe maerials you can access or restore at an item or collection level. You can sync to DuraCloud or restore from the cloud etc. You get reports on synching etc. And reports on matches or mismatches so that you can restrore data from the cloud as needed. And you can then manually check the object.

Our next steps are to provide tihhter integratione nad more UI functions, to move to automated recovery, to enable full Fedora/Collection restore, and to include support for private DuraCloud  instances.

Simon: I will be talking about the Kindura project funded by JISC which was a KCL, STFC and ? initiative. The problem is that storage of research outputs (data, documents) is quite ad hoc but it’s a changing language and UK funders can now require data for 10 years+ so it’s important. We wer elooking at hybrid cloud solutions – commercial cloud is very elastic, rapid deployments, transparent cost, but risky in terms of data sensitivity, data protection law, service availablily and loss. In house storage and cloud storage seem like the best way to gain the benefits but mitigate risks.

So Kindura was a proof of concept repository for research data combining commercial cloud and internal storage (iRODS). Based on Fedora Commons. DuraCloud provides a common stoarge intereface and we deployed from source code – we found Windows was best for this and have created some guidelines on this sort of set up. And we developed a storage management framework based on policies, legal and technical constraints as well as cost (including cost of transmitting data in/out of storage) We tried to implement something as flexible as possible. We wanted automated decisions for storage and migration. Content replicaion across storage providers for resiliance. Storage providers transparant to users.

The Kindura system is based on our Fedora Repository feeding Azure, iRODS and Castor (another use case for researchers to migrate to cheaper tape storage) as well as AWS and Rackspace, it also feeds DuraCloud. The repository is populated via web browser depositing into the management server and down into Fedora Respoitory AND DuraCloud.


Q1) For Richard – you were talking about deletion and how to deal with them
A1 – Richard) There are a couple of ways to gather logically delete items. So you can automate based on a policy for garbage collection – e.g. anything deleted and not restored within a year. But you can also  manually delete (you have to do it twice but you can’t mitigate against that).

Q2) Simon, I had a question. You integrated a rules engine and that’s quite interesting. It seems that Rules probably adds some significant flexibility.

A2 – Simon) We actually evaluated several different sorts of rules engines. Jules is easy, open source and for this set up it seemed quite logical to do this. It sits totally separate to DuraCloud set up at the moment but it seemed like a logical extension


 July 12, 2012  Posted by at 8:02 am LiveBlog, Updates Tagged with:  1 Response »
Jul 122012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Eating your own dog food: Building a repository with API-driven development
Speaker(s): Nick John Jackson, Joss Luke Winn

The team decided they wanted to build a wholly new RDM, with research data as a focus for the sake of building the best tool for that job. This repository was also designed to store data during research, not just after.

Old repositories work very well, but they assume the entry of a whole file (or a pointer), only retrievable in bulk and in oddly organized pieces. They have generally limited interface methods and capacities. These old repositories also focus on formats, not form (structure and content) unless there is fantastic metadata.

The team wanted to do something different, and built a great backend first. They were prepared to deal with raw data as raw data. The API was built first, not the UI. APIs are the important bit. And those APIs need to be built in a way that people will want to use them.

This is wear eating your own dog food comes in. The team used their own API to build the frontend of the system, and used their own documentation. Everything had to be done well because it was all used in house. Then, they pushed it out to some great users, and made them do what they wanted to do with the ‘minimum viable product’. It works, and you build from there.

Traditional repos have a database, application, users. They might tack an API on at the end for manual and bulk control, but it doesn’t even include all of the functionality of the website usually. That or you screen scrape, and that’s rough work. Instead, this repository builds an API and then interacts with that via the website.

Research tends to happen on a subset of any given data set, nobody wants that whole data set. So forget the containers that hold it all. Give researches shared, easily usable databases. APIs put stuff in and out automatically.

This was also made extensible from day one. Extensible and writeable by everybody to the very core. The team also encourages re-usable modularity. People do the same things to their data over and over – just share that bit of functionality at a low data level. And they rely on things to do things to get things done – in other words, there’s no sense in replicating other people’s work if it’s done well.

The team ended up building better stuff because it uses its own work – if it doesn’t do what it’s meant to, it annoys them and they have to fix it. All functionality is exposed so they can get their work done quick and easy. Consistent and clean error handling were baked in for the sake of their own sanity, but also for everybody else. Once it’s all good and easy for them, it will be easy for 3rd parties to use, whether or not they have a degree in repo magic. And security is forcibly implemented across the board. API-level authentication means that everything is safe and sound.

Improved visibility is another component. Database querying is very robust, and saves the users the trouble of hunting. Quantitative information is quick and easy because the API gives open access to all the data.

This can be scalable horizontally, to as many servers as needed. It doesn’t use server states.

There are some problems involved in eating your own dog food. It takes time to design a decent API first. You also end up doubling up some development, particularly for frontend post-API development. APIs also add overhead. But after some rejigging, it all works with thousands of points per second, and it’s humming nicely.

Q: Current challenges?

A: Resourcing the thing. Lots of cutting edge technology and dependence on cloud architecture. Even with money and demand, IT infrastructure aren’t keeping up just yet.

Q: How are you looking after external users? Is there a more discoverable way to use this thing?

A: The closest thing we have is continuous integration to build the API at multiple levels. A discovery description could be implemented.

Q: Can you talk about scalability? Limitations?

A: Researchers will sometimes not know how to store what they’ve got. They might put pieces of data on their own individual rows when they don’t need to be. That brings us closer to our limit. Scaling up is possible, and doing it beyond limits is possible, but it requires a server-understood format.

Q: Were there issues with developers changing schemas mysteriously? Is that a danger with MongoDB?

A: By using our own documentation, forcing ourselves to look at it when building and questioning. We’ve got a standard object with tracking fields, and  if a researcher starts to get adventurous with schemas it’s then on them.


Topic: Where does it go from here? The place of software in digital repositories
Speaker(s): Neil Chue Hong

Going to talk about the way that developers of software are getting overlapping concerns with the repository community. This isn’t software for implementing infrastructure, but software that will be stored in that infrastructure.

Software is pervasive in research now. It is in all elements of research.

The software sustainability institute does a number of things at strategic and tactical levels to help create best practices in research software development.

One question is the role of software in the longer term – five and ten years on? The differences between preservation and sustainability. The former holds onto things for use later on, while the latter keeps understanding in a particular domain. The understanding, the sustainability, is the more important part here.

Several purposes for sustaining and preserving software. For achieving legal compliances (architecture models ought to be kept for the life of a building). For creating heritage value (gaining an overall understanding of influences of a creator). For continued access to data (looking back, through the lens of the software). For software reuse (funders like this one).

There are several approaches. Preserving the technology, whether it’s physical hardware or an emulated environment. Migration from one piece of software to another over time while ensuring functionality, or transitioning to something that does similar. There’s also hibernation, just making sure it can be picked apart some day if need be.

Computational science itself needs to be studied to do a good job of this. Software carpentry teaches scientists basic programming to improve their science. One thing, using repositories, is an important skill. Teaching scientists the exploratory process of hacking together code is the fun part, so they should get to do it.

Re-something is the new black. Reuse, review, replay, rerun, repair. But also reward. How can people be rewarded for good software contributions, the ones that other people end up using. People get pats on the back, glowing blog posts, but really reward in software is in its infancy. That’s where repositories come in.

Rewarding good development often requires publication which requires mention of the developments. That ends up requiring a scientific breakthrough, not a developmental one. Software development is a big part of science and it should be viewed/treated as such.

Software is just data, sure, but along with the Beyond Impact team these guys have been looking at software in terms of preservation beyond just data. What needs to get kept in software and development? Workflows should, because they show the boundaries of using software in a study – the dependencies and outputs of the code. Looking at code on various levels is also important. On the library/software/suite level? The program or algorithm or function level. That decision is huge. The granularity of software needs to be considered.

Versioning is another question. It indicates change, allows sharing of software, and confers some sort of status. Which versions should go in which repositories, though? That decision is based on backup (github), sharing (DRYAD), archiving (DSpace). Different repositories do each.

One of the things being looked at in sustaining software are software metapapers. These are scholarly records including ‘standard’ publication, method, dataset and models, and software. This enables replay, reproduction, and reuse. It’s a pragmatic approach that bundles everything together, and peer review can scrutinize the metadata, not the software.

The Journal of Open Research Software allows for the submission of software metapapers. This leads to where the overlap in development and repositories occurred, and where it’s going.

The potential for confusion occurs when users are brought in and licensing occurs. It’s not CC BY, it’s OSI standard software licenses.

Researchers are developing more software than ever, and trying to do it better. They want to be rewarded for creating a complete scholarly record, which includes software. Infrastructure needs to enable that. And we still don’t know the best way to shift from one repository role to another when it comes to software – software repositories from backup to sharing to archival. The pieces between them need to be explored more.

Q: The inconsistency of licensing between software and data might create problems. Can you talk about that?

A: There is work being done on this, on licensing different parts of scholarly record. Looking at reward mechanisms and computability of licenses in data and software need to be explored – which ones are the same in spirit?


Topic: The UCLA Broadcast News Archive Makes News: A Transformative Approach to Using the News in Teaching, Research, and Publication
Speaker(s): Todd Grappone, Sharon Farb

UCLA has been developing an archive since the Watergate hearings. It was a series of broadcast television recordings for a while, but not it’s digital libraries of broadcast recordings. That content is being put into a searchable, browsable interface. It will be publicly available next year. It grows about a terabyte a month (150000+ programs and counting), which pushes the scope of infrastructure and legality.

It’s possible to do program-level metadata search. Facial recognition, OCR of text on screen, closed caption text, all searchable. And almost 10 billion images. This is a new way for the library to collect the news since papers are dying.

Why is this important? It’s about the mission of the university copyright department: public good, free expression, and the exchange of ideas. That’s critical to teaching and learning. The archive is a great way to fulfill that mission. This is quite different from the ideas of other Los Angeles organizations, the MPAA and RIAA.

The mission of higher education in general is about four principles. The advancement of knowledge through research, through teaching, and of preservation and diffusion of that knowledge.

About 100 news stations being captured so far. Primarily American. International collaborators are helping, too. Pulling all broadcast, under a schedule scheme with data. It’s encoded and analyzed, then pushed to low-latency storage in H.264 (250MB/hr). Metadata is captures automatically (timestamp, show, broadcast ID, duration, and full search by closed captioning). The user interface allows search and browse.

So, what is news? Definitions are really broad. Novelties, information, and a whole lot of other stuff. The scope of the project is equally broad. That means Comedy Central is in there – it’s part of the news record. Other people doing this work are getting no context, little metadata, less broadcasts. And it’s a big legal snafu that is slowly untangling.

Fortunately, this is more than just capturing the news. There’s lots of metadata – transformative levels of information. Higher education and libraries need these archives for the sake of knowledge and preservation.

Q: Contextual metadata is so hard to find, and knowing how to search is hard. How about explore? How about triangulating with textual news via that metadata you do have?

A: We’re pulling in everything we can. Some of the publishing from these archives use almost literally everything (court cases, Twitter, police data, CCTV, etc). We’re excited to bring it all together, and this linkage and exploration is the next thing.

Q: In terms of tech. development, how has this archive reflected trends in the moving image domain? Are you sharing and collaborating with the community?

A: An on-staff archivist is doing just that, but so far this is just for UCLA. It’s all standards-driven so far, and community discussion is the next step.


Topic: Variations on Video: Collaborating toward a robust, open system to provide access to library media collections
Speaker(s): Mark Notess, Jon W. Dunn, Claire Stewart

This project has roots in a project called Variations in 1996. It’s now in use at 20 different institutions, three versions. Variations on Video is a fresh start, coming from a background in media development. Everything is open source, working with existing technologies, and hopefully engaging with a very broad base of users and developers.

The needs that Variations on Video are trying to meet are archival preservation, access for all sorts of uses. Existing repositories aren’t designed for time-based media. Storage, streaming, transcoding, access and media control, and structure all need to be handled in new ways. Access control needs to be pretty sophisticated for copyright and sensitivity issues.

Existing solutions have been an insufficient fit. Variations on Video offers basic functionality that goes beyond them or does them better. File upload, transcoding, and descriptive metadata will let the repository stay clean. Navigation and structural metadata will allow users to find and actually use it all.

VoV is built on a Hydra framework, Opencast Matterhorn, and a streaming server that can serve up content to all sorts of devices.

PBCore was chosen for descriptive metadata, with an ‘Atomic’ content model: parent objects for intellectual descriptions, child objects for master files, children of these for derivatives. There’s ongoing investigation for annotation schemes.

Release 0 was this month (upload, simple metadata, conversion), and release one will come about in December 2012. Development will be funded through 2014.

Uses Backlight for discover, Strobe media player for now. Other media players with more capabilities are being considered.

Variations on Video is becoming AVALON (Audio Video Archives and Libraries Online).

Using the agile Scrum approach with a single team at the university for development. Other partners will install, test, provide feedback. All documentation, code, workflow is open, and there are regular public demos. Hopefully, as the software develops, additional community will get involved.

Q: Delivering to mobile devices?

A: Yes, the formats video will transcode into will be selectable, but most institutions will likely choose a mobile-appropriate format. The player will be able to deliver to any particular device (focusing on iOS and Android).

Q: Can your system cope with huge videos?

A: That’s the plan, but ingesting will take work. We anticipate working with very large stuff.

Q: How are you referencing files internally? Filenames? Checksums? Collisions of named entries?

A: Haven’t talked about identifiers yet. UUIDs generated would be best, since filenames are a fairly fragile method. Fedora is handling identifiers so far.

Q: Can URLs point to specific times or segments?

A: That is an aim, and the audio project already does that.

 July 12, 2012  Posted by at 7:59 am LiveBlog, Updates Tagged with:  Comments Off on P6A: Non-traditional content LiveBlog
Jul 112012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 1 (LT1), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Hi there, I’m Mahendra Mahey, I run the DevCSI project, my organisation is funded by JISC. This is the fifth Developer Challenge. This is the biggest to date! We had 28 ideas. We have 19 presentations, each gets 3 minutes to present! You all need a voting slip! At the end of all of the presentations we will bring up a table with all the entries. To vote write the number of your favourite pitch. If it’s a 6 or a 9 please underline to help us! We will take in the votes and collate them. The judges won’t see that. They will convene and pick their favourites and then we will see if they agree… there will then be a final judging process.

The overall prize and runner up shares £1000 in Amazon vouchers. The overall winner will be funded to develop the idea (depending on what’s logitically possible). And Microsoft research have a .Net gadgeteer prize for the best development featuring Microsoft technology. So we start with…

1 – Matt Taylor, University of Southampton – Splinter: Renegade Repositories on Demand

The idea is that you have a temporary offshoot of your repository, can be disposed or reabsorbed, ideal for conferences or workshops, reduces overhead, network of personal microrepositories – the idea is that you don’t have to make accounts for anyone temporarily using your repositoriy. It’s a network of personal microrepository, A lightweight standalone anotation system. Its independent of the main repository. Great for inexperienced users, particularly important if you are a high prestige university. And the idea is that it’s a pseudopersonal workspace – can be shared on the web but separate of your main repository. And it’s a simplified workflow – so if you make a splinter repository for an event you can use contextual information – conference date, location, etc. to populate metadata. Microrepository already in development and tech exists: RedFeather.ecs.soton.ac.uk. Demo at Bazaar workshop tomorrow. Reabsorption trivial using SWORD.

2 – Keith Gilmerton and Linda Newman – MATS: Mobile Audio Transcription and Submission

The idea is that you submit audio to repositories from phones. You set up once. You record audio. You select media for transcription, you add simple metadata You can review audio. Can pick from Microsoft Research’s MAVIS or Amazon’s Mechanical Turk. When submission back you get transcription and media to look at, can pick which of those two – either or both – you upload. And even if transcript not back its OK – new SWORD protocol does updates. And this is all possible using Android devices and code reused from one of last years challenges! Use cases – digital archive of literacy studies seek audio files, elliston poetry curator make analogue recordings , tablets in the field – Pompeii Archeaological Research Project would greatly increase submissions of data from the field.

3 – Joonas Kesaniemi and Kevin Van de Velde – Dusting off the mothballs introducing duster

The idea is to dust off time series here.  The only thing constant is change (Heraclitus 500BC). I want to get all the articles from AAlto university. It’s quite a new university but there used to be three universities that merged together. It would help to describe that the institution changed over time. Useful to have a temporal change model. Duster (aka Query expansion service) takes a data source that is a complex data model and then makes that available. Makes a simple Solr document for use via API. An example Kevin made – searching for one uni searches for all…

4 – Thomas Rosek, Jakub Jurkiewicz [sorry names too fast and not on screen] – Additional text for repository entries

In our repository we have keywords on the deposits – we can use intertext to explain keywords. Polish keywords you may not know them – but we can see that in English. And we can transliterate cyrillic. The idea is to build a system from blogs – connected like lego bricks. Build a blog for transliteration, for translating, for wikipedia, blog for geonames and mapping. And these would be connected to repository and all work together. And it would show how powerful

5 – Asger Askov Blekinge – SVN based repositories 

Many repositories have their own versioning systems but there are already well established versioning systems for software development that are better (SVN, GIT) so I propose we use SVN as the back end for Fedora.

Mass processing on the repository dowsn’t work well. Checkout the repo to a hadoop cluster, run the hadoop job, and commit the changed objects back. If we used standardised back end to access repository we could use Gource – software version control visualisation. I have developed a proof of concept that will be on Github in next few days to prove that you can do this, you can have a Fedora like interace on top of SVN repository.

6. Patrick McSweeney, University of Southampton – DataEngine

This is a problem we encountered, me and my friend Dabe Mills. For his PhD he had 1 GB of data, too much for the uni. Had to do his own workaround to visualise the data. Most of our science is in tier 3 where some data, but we need support! So the idea is that you put data into repository, allows you to show provenance, can manipulate data in the repository, merge into smaller CSV files, create a visualisation of your choice. You store intermediary files, data and the visualisations. You could do loads of visualisations. Important as first step on road to proper data science. Turns repository into tool that engages researchers from day one. And full data trail is there and is reproducable. And more interesting than that. You can take similar data, use same workflow and compare visualisation. And you can actually compare them. And I did loads in 2 days, imagine what I could do in another 2!

7. Petr Knoth from the Open University –  Cross-repository mobile application 

I would like to propose an application for searching across all repositories. You wouldn’t care about which repository it’s in, you would just get search it, get it, using these apps. And these would be provided for Apple and Google devices. Available now! How do you do this? You use APIs to aggregate – we can use applications like CORE, can use perhaps Microsoft Academic Search API. The idea of this mobile app is that it’s innovation – it’s a novel app. The vision is your papers are everywhere through syncing and sharing. It’s relevance to user problems: WYFIWYD: What you find is what you download. It’s cool. It’s usable. Its plausible for adoption/tech implementation.

8. Richard Jones and Mark MacGillivray, Cottage Labs – Sword it!

Mark: I am also a PhD student here at Edinburgh. From that perspective I know nothing of repositories… I don’t know… I don’t care… maybe I should… so how do we fix it. How do we make me be bothered?! How do we make it relevent.

Richard: We wrote Sword it code this week. It’s a jQuery plugin – one line of javascript in your header – to turn the page into a deposit button. Could go in repository, library website, your researchers page… If you made a GreaseMonkey script – we could but we haven’t – we could turn ANY page into a deposit! Same with Google results. Let us give you a quick example…

Mark: This example is running on a website. Couldn’t do on Informatics page as I forgot my login in true researcher style!

Richard: Pick a file. Scrapes metadata from file. Upload. And I can embed that on my webpage with same line of code and show off my publications!

9. Ben O Steen – isthisresearchreadable.org

Cameron Neylon came up to me yesterday saying that lots of researchers submit papers to repositories like PubMed but also to publishers… you get DOIs. But who can see your paper? How can you tell which libraries have access to your papers? I have built isthisresearchreadable.org. We can use CrossRef and a suitable size sample of DOIs to find out the bigger picture – I faked some sample numbers but CrossRef is down just now. Submit a DOI, see if it works, fill in links and submit. There you go.

10. Dave Tarrant – The Thing of Dreams: A time machine for linked data

This seemed less brave than kinect deposit! We typically publish data as triples… why aren’t people publishing this stuff when they could be… well because they are slightly lazy. Technology can solve problems I’ve created LDS3.org. It’s very Sword, very CRUD, very Amazon webs services… So in a browser… I can look at a standard Graphite RDF document. But that information is provided by this endpoint, gets annotated automatically. Adds date submitted and who submitted it. So, the cool stuff… well you can click view doc history… it’s just like Apple time machine that you can browse through time! And cooler yet you can restore it and browse through time. Techy but cool! But what else does this mean… we want to get to semantic web, final frontier.. how many countries have capital cities with an airport and a population over 2 million… on 6th June 2006. Can do it using Memento. Time travel for the web + time travel for data! The final frontier.

11. Les Carr – Boastr – marshalling evidence for reposting outcomes

I have found as a researcher I have to report on outcomes. There is technology missing. Last month a PhD student tweeted that he’d won a prize for a competition from the world bank – with link to World bank page and image of him winning prize, and competition page. We released press release, told EPSRC, they press released. Lots of dissemination, some of that should have been planned in advance. All published on the web. And it disappears super fast. It just dissapates… we need to capture that stuff for 2 years time when we report that stuff! It all gets lost! We want to capture imagination while it happens. We want to put stuff together. Path is a great app for stuff like Twitter has a great interface – who, what, where. Tie to sources of open data, maybe Microsoft Academic Live API. Capture and send to repositories! So that’s it: Boastr!

12. Juagr Adam Bakluha? – Fedora Object Locking

The idea is to allow multiple Fedora webapps working together to allow multiheaded fedora working we can do mass processing like: Fedora object store on a Hadoop File System, one fedora head, means bottlenecks, multiple heads mean multiple apps. Some shared stat between webapps. Add new rest methods – 3 lines in some jaxrs.xml. Add the decorator – 3 lines in Fedora.fcfg and you have Fedora Object locking

13. Graham Triggs – SHIELD

Before the proposal lets talk SWORD… its great, but just for deposit. With SWORD2 you can edit but you get edit iri and you need those, what if you lose them. What if you want to change content in the repository? So, SWORD could be more widely used if edit iris were discoverable. I want an ATOM feed. I want it to support authentication. Better replacement for OMI-PMH. But I want more. I want it to complete non archived items, non complete items, things you may have deposited before. Most importantly I want the edit iri! So I said I have a name…. I want a Simple Harvest Interface for Edit Link Discovery!

14. Jimmy Tang, DRI – Redundancy at the file and network level to protect data

I wanted to talk about redundancy at file and network level to protect data. One of the problems is that people with multi-terabyte archives like to protect it. Storage costs money. Replicating data is wasteful and expensive I think. LOCKSS/Replicating data can be wasteful. Replication means N times cost and money. My idea is to take an alternative approach… Possible solutions is using forward error correcting or erasure codes to a persistant layer – like setting up a RAID disc. You keep pieces of files and you can reconstruct it – move complexity from hardware to software world and save money with the efficiency. There are open source libraries to do this, most are mash ups. Should be possible!

15. Jose Martin – Machine and user-friendly policifying

I am proposing a way to embed data from SHERPA ROMEO webservices into records waiting to be reviewed in a repository. Last week I heard how SHERPA/ROMEO receives over 250K requests for data, he was looking for a script to make that efficient, a script to run on a daily or weekly basis. Besides this task is often fairly manual. Why not put machines to work instead… so we have an ePrints repository with 10 items to be reviewed. We download SHERPA/ROMEO information here. We have the colour code that give a hint about policy. Script would go over all items looking for ISSN matches and find colour code. and let us code those submissions – nice for repository manager and means the items are coded by policy ready to go. And updated policy info done in just one request for, say, 10 items. More efficient and happier! And retrieve journal title whilst at it.

16. Petr Knoth – Repository ANalytics

Idea to make repository managers lives very easy. They want to know what is being harvested and if everything is correct in their system. It’s good if someone can check from the outside. The idea is that analytics sit outside repository, lets them see metadata harvested, if it works OK and also provides stats on content – harvesting of full text PDF files. Very important. even though we have OMI-PMH there are huge discrepancies between the files. I am a repository manager I can see that everything is fine, that it has been carried out etc.  So we can see a problem with an end point. I propose we use this to automatically notify repository manager that something is wrong. Why do we count metadata not PDFs – latter are much more important. Want to produce other detailed full text stats, eg citation levels!

17. Steffan Godskesen – Current and complete CRIS with Metadata of excellent quality 

Researchers don’t want to do thinsg with metadata but librarians do care. In many cases metadata is already available from other sources and in your DI. So When we query the discovery iunterface cleverly we can extract metadata inject into CRIS, have librarians quality check it and obtain excellent CRIS. Can we do this? We have done this between our own DI (discovery system) and CRIS. And again when we changed CRIS, again when we changed DI. Why do again and again… to some extent we want help from DI and CRIS developers to help make these systems extract data more easily!

18. Julie Allison and Ben O’Steen – Visualising Repositories in the Real World

We want to use .Net Gadgeteer or Arduino to visualise repository activity, WHy? to demonstrate in the real world what happens in the repository world. Screens showing issues maybe. A physical guage for hits for hourse – great demo tool. A bell that ring when met deposits per day target. Or blowing bubbles for each deposit. Maybe 3D printing of deposited items? Maybe online Chronozoom, PivotViewer – explore content, JavaScript InfoVis – set of visualisation tools. Repository would be mine – York University. Using query interface to return creation date etc. Use APIs etc. So for example a JSPN animation of publications and networks and links between objects.

19. Ben O’Steen – Raid the repositories!

Lots of repositories with one managers, no developers. Raid them! VM that pulls them all in, pull in text mining, analysis, stats, enhancer etc. Data. Sell as a PR tool £20/month as a demo. Tools for reuse.

Applause meter in the room was split between Patrick MacSweeney  and Richard Jones & Mark MacGillivray’s presentation.

 July 11, 2012  Posted by at 4:03 pm LiveBlog, Updates Tagged with:  1 Response »
Jul 112012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Repositories and Microsoft Academic Search
Speaker(s): Alex D. Wade, Lee Dirks

MSResearch seeks out innovators from the worldwide academic community. Everything they produce is freely available, non-profit.

They produce research accelerators in the for of Layerscape (visualization, storytelling, sharing), DataUp (used to be called DataCuration for Excel), and Academic Search.

Layerscape provides desktop tools for geospatial data visualization. It’s an Excel add-in that creates live-updating earth-model visuals. It provides the tooling to create a tour/fly-through of the data a researcher is discussing. Finally, it allows people to share their tours online – they can be browsed, watched, commented on like movies. If you want to interact with the data you can download the tour with data and play with it.

DataUp aids scientific discovery by ensuring funding agency data management compliance and repository compliance of Excel data. It lets people go from spreadsheet data to repositories easily. This can be done through an add-in or via cloud service. The glue that sticks theses applications together is repository agnostic, with minimum requirements for ease of connection. It’s all open source, driven by DataOne and CDL. It is in closed beta now with a wide release later this summer.

Now, Academic Search. It started by bringing together several research projects in MSResearch. It’s a search engine for academic papers from the web, feeds, repositories. Part of the utility of it is a profile of information around each publication, possibly from several sources, coalesced together. As other full-text documents cite in, those can be shown in context. Keywords can be shown and linked to DOI, can be subscribed to for change alerts. These data profiles are generated automatically, and that can build automatic author profiles as well. Conferences and journals they’ve published in, associations, citation history, institution search.

The compare button lets users compare institutions by different publication topics – by the numbers, by keywords, and so on. Visualizations are also available to be played with. The Academic Map shows publications on a map.

Academic Search will also hopefully be used a bit more than as a search engine. It is a rich source of information that ranks journals, conferences, academics, all sortable in a multitude of ways.

Authors also have domain-specific H-Index numbers associated with them.

Anyone can edit author pages, submit new content, clean things up. Anyone can also embed real-time pulls of data from the site onto their own site.

With the Public API, and an API key, you can fetch information with an even broader pull. Example: give me all authors associated with University of Edinburgh, and all data associated with them (citations, ID number, publications, other others, etc). With a publication ID, a user could see all of the references included, or all of the documents that cite it.

Q: What protocol is pushing information into the repositories?

A: SWORD was being looked at, but I’m uncertain about the merit protocol right now. SWORD is in the spec, so it will be that eventually.

Q: Does Academic Search harvest from repositories worldwide?

A: We want to, but first we’re looking at aggregations (OCLC Oyster). We want to provide a self-service registration mechanism, plus scraping via Bing. Right now, it’s a cursory attempt, but we’re getting better.

Q: How is the domain hierarchy generated?

A: The Domain hierarchy is generated manually with ISI categories. It’s an area of debate: we want an automated system, but the challenge is that more dynamic systems make rank lists and comparison over time more difficult. It’s a manual list of categories (200 total, at the journal level).

Q: Should we be using a certain type of metadata in repos? OAIPMH?

A: We use OAIPMH now, but we’re working on analysis of all that now. It’s a long term conversation about the best match.


Topic: Enhancing and testing repository deposit interfaces
Speaker(s): Steve Hitchcock, David Tarrant, Les Carr

Institutional repositories are facing big challenges. How are they presenting a range of services to users? How is presentation of repositories being improved, made easier? The DepositMO project hopes to improve just that. It asks how we can reposition the deposit process in a workflow. SWORD and V2 enable this.

So, IRs are under pressure. The Finch report suggests a transition with clear policy direction toward open access. This will make institutional open access repositories for publication obsolete, but not for research data. Repositories are taking a bigger view of that, though. Even if publications are open access, they can still be part of IR stores.

DepositMO has been in Edinburgh before. It induced spontaneous applause. It was also at OR before, in 2010.

This talk was a borderline accepted talk, perhaps because there is not a statement included: few studies of user action with repositories.

There are many ways that users interact with repositories, which ought to be analyzed. SWORD for Facebook, for Word.

SWORD gives a great scope of use between the user and repository, especially with V2. V2 is native in many repositories now, partially because of DepositMO.

With convenient tools built into already used software, like Word, work can be saved into repositories as it is developed. Users can set up Watch Folders for adding data, either as a new record or an update to an older version if changed locally. The latter example is quite a bit like Dropbox or Skydrive, but repositories aren’t harddrives. They aren’t designed as storage devices. They are curation and presentation services. Depositing means presenting very soon. DepositMO is a bit of a hack to prevent presentation while iteratively adding to repository content. Save for later, effectively.

Real user tests of DepositMO have been done – set up some laptops running created services and inviting users to test in pairs. This wasn’t about download, installation, and setup, but actual use in a workflow. Is it useful in the first place? Can it fit into the process? Task completion and success rates of repository user tasks were collected as users did these things.

On average, Word and watch folder deposit tools improved deposit time amongst other things. However, these entries aren’t necessarily as well documented as is typically necessary. The overall summary suggests that while there is a wow-factor in terms of repository interaction, the anxiety level of users increases as the amount of information they have to deposit increases. Users sometimes had to retrace steps, or else put things in the wrong places as they worked. They needed some trail or metadata to locate deposit items and fix deposit errors.

There are cases for not adding metadata during initial entry, though, so low metadata might not be the worst thing.

Now it’s time to do more research, exploring the uses with real repositories. That project is called DepositMOre. Watch Folder, EasyChair one-click submission, and to an extent the word add in will be analyzed statistically as people actually deposit into real repositories. It’s time to accomodate new workflows, to accomodate new needs, and face down challenges of publishers offering open access.

Q: Have you looked into motivations for user deposit into repositories?

A: No, it was primarily a study of test users through partners in the project. The how and what of usage and action, but not the why. There was a wonder whether more data about the users would be useful. If more data was obtainable, the most interesting thing would be understanding user experience with repositories. But mandate motivation, no, not looking into that.

Q: You’ve identified a problem users have with depositing many things and tracking deposits. Did you identify a solution?

A: It’s more about dissuading people from reverting to previous environments and tools. There are more explicit metadata tools, and we could do a better job of showing trails of submission, so that will need to filter back in. Unlike cloud drives, losers use control of an object once they are submitted to a repository. So, suddenly something else is doing something, and the user it’s disconcerting.


Topic: OERPub API for Publishing Remixable Open Educational Resources (OER)
Speaker(s): Katherine Fletcher, Marvin Reimer

This talk is about a SWORD implementation and client. Most of this work has happened in the last year, very quick.

Remixable open education repositories target less academic and more multi-institution, open repos. Remixability lets users learn anywhere. It’s a ton of power. All these open resources can seed a developer community for authoring and creation, machine learning algorithms, and it all encourages lots of remixable creation.

Remixability can be hard to support, though. Connexions, and other organizations, had grand ambition but not a very large API. And you need an importer/editor that is easy to use. Something that can mash data up.

In looking at APIs needed for open education, discoverability is important, but making publishing easier is important, too. We need to close the loop so that we stop losing the remixed work externally. That’s where SWORD comes in. V2.

Why SWORD V2 for OER? It has support for workflow. The things being targeted are live edited objects, versioned. Those versions need to be permanent so that changes are nondestructive. Adapting, translating, deriving are great, but associating them with common objects helps tie it all together.

OERPub extends SWORD V2. It clarifies and adds specificity to metadata. Specificity is required for showing the difference between versions and derivatives, specifically. And documentation is improved. Default values, repository controlled and auto-generated values are all documented. Precedents have been made clear, that’s it.

OERPub also merges semantics header for PUT. It simplifies what’s going on. Also added a section on Transforms under packaging. If a repository will transform content, it has a space to explain its actions. It provides error handling improvements, particularly elaboration on things like transform and deposit fails.

This is the first tool to submit to Connexions from outside of Connexions.

Lessons learned? Specification detail was great. Good to model on top of and save work. Bug fixes also lead the project away from multiple metadata specifications – otherwise bugs will come up. Learned that you always need a deposit receipt, which is normally optional. Finally, auto-discovery – this takeaway suggests a protocol for accessing and editing public item URLs.

A client was built to work with this – a transform tool to remixable format in very clean HTML, fed into Connexions, and pushed to clients on various devices. A college chemistry textbook was already created using this client. And a developer sprint got three new developers fixing three bugs in a day – two hours to get started. This is really enabling people to get involved.

Many potential future uses are cropping up. And all this fits into curation and preservation – archival of academic outputs as an example.

Q: Instead of PUT, should you be using PATCH?

A: Clients aren’t likely to not know repositories, but it is potentially dangerous to ignore headers. Other solutions will be looked at.

Q: One lesson learned was to avoid multiple ways of specifying metadata. What ways?

A: DublinCore fields with attributes and added containers. That caused errors. XML was mixed in, but we had to eventually specify exactly which we wanted.

 July 11, 2012  Posted by at 2:31 pm LiveBlog, Updates Tagged with:  Comments Off on P5A: Deposit, Discovery and Re-use LiveBlog
Jul 112012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: ORCID update and why you should use ORCIDs in your repository
Speaker(s): Simeon Warner

I am speaking with my Cornell hat on and my ORCID hat on today. So this is a game of two halves. The first half is on ORCID and what it is. And the second half will be about the repository case and interfacing with ORCID.

So, the scholarly record is broken because there is no reliable attribution of authors and contributors is impossible without unique person-level identifiers. I have an unusual name so the issue is mild, but if you have a common name you are in real trouble. We want to find unique identities to person records across data sources and types and to enlist a huge range of stakeholders to do this.

So ORCID is an amazing opportunity that emerged a couple of years ago. Suddently publishers, achivists, etc. all started talking about the same issue. It is an international, interdisciplinary, open and not for profit organization. We have stakeholders that include research institutions, funding organizations, publishers and researchers. We want to create a registry of persistent unique identifyer fo all sorts of roles – not just authors – and all sorts of contributions. We have a clear scope and set of principles. We will create this registry and it will only work if it’s used very widely. The failure of previous systems have been because the scope hasn’t been wide enough. One of the features of research is that things move –  I was a physicist, now repositories, libraries… I don’t live in one space here. To create an identity you need som einformation to manage that. You need a name, an email, some other bits of information, and the option for users to update their profile with stuff that is useful for them. Privacy is an issue – of course. So we have  a principle in ORCID is opt-in. You can hide your record if you want. You can control what is displayed about you. And we have a set of open principles about how ORCID will interact with other systems and infrastructure.

So ORCID will disambiguate researchers and allow tracking. automate repository deposition, and other tasks that levage use of this sort of ID. We have 328 participan organizations, 50 of which have provided sponsorship. And that’s all over the world.

So to go through a research organization workflow: for an organisation it’s a record of what researchers have done in that institution. But you don’t want a huge raft of staff needed to do this. So the organisation registers with ORCID. At some stage ORCID looks for a record of a person and the organisation pulls out data on that person. Once that search is done on already held information. Identifiers can then be created ready for researchers to claim these.

So, granting bodies, in the US there is always a complaint and a worry about the buden of reporting. So what if we tied this up to an ORCID identity? Again the granting body registers with ORCID and then an ORCID::grant linking sent to PI or researcher for confirmation. Same idea again with the publisher. If you have granted the publisher the ability to do it you can let them add the final publication to your name, saving effort and creating a more accurate record.

So a whole set of workflows gives us a sort of vision for researchers as early as possible in the creation of research here. And in phase I system the researcher can self-claim a profile, delegate management and institutional record creation. Fine grained control of privacy settings. Data exchange into grant and manuscript submission system, authorised organisations/publications etc. So right now we have an API, a sandbox server, etc. Now working out launch partners and readying for launch. ORCID registry will launch in Q4 of 2012. Available now: ORCID identifier structure (coordinated with ISNI) will have a specific structure. Code, APIs, etc. available.

So why should you use ORCID in your repository?

Well we have various stakeholders in your repository – authors, academic community and the institutions themselves. Institutional authors want credit for their work, ORCID should and will increase the likelihood of authors publications being recognised. It opens the door to link to articles that wouldn’t be linked up – analyses of citations etc. Opens the door to more nuanced notions of attributions. And it saves efforts by allowing data reuse across institutions. For readers it offers better discovery and analysis tools. Valuable information for improving tools like Microsoft Academic search, better ways to measure research contributions etc. And institutions allows robust links between local and remote repositories, better track and measure use of publications.

And from an arXiv position we’ve looked for years for something to unify author details across our three repositories. We have a small good quality repositories but we need that link between the author and materials. And from UK/JISC perspective there is a report from JISC Research Identifier task force group that indicates the benefits of ORCID. I think for repositories ORCID helps make repositories count in a field we have to play in.

So, you wnat to integrate with ORCID. There are two tiers to the API right now, I’ll talk about both. All APIs return XML or JSON data. The tier 1 API is available to all for free, no access controls. With this you can ask a researcher for their ORCID ID and look at data made public. You could provide pop up in your repository deposit process to check for their ORCID ID. There is a competition between functionality and privacy here but presuming they have made their ID public this will be very useful.

Tier 2 API members will have access to an OAuth2 authentication between service and ORCID allow users to grant certain rights to a service. Access to both public and (if granted) protected data. Ability to add data (if granted). Really three steps to this process. Any member organisation in the process would get an ORCID ID in first stage of the process. Secondly if you have a user approaching the repository that user can login and grant data access to the client repository. The user can be redirected back to the repository along with an access permisssion. And if access is granted then the repository continues to have access to the user’s profile until this permission is revoked by the user (or ORCID). And data can be added to the users profile by the repository if it becomes available.

All code etc. on dev.orcid.org. Follow the project on Twitter @ORCID_Org.


Q1 – Ryan) You mentione dthat ORCID will send information to CrossRef, what about DataCite?

A1) I don’t think I said that. We import data from CrosRef, not an import the other way. I think that would be led by DOI owner, not ORCID. DOI is easy, someone has the right to a publication, people don’t work that way.

Q1) In that case I encourage you to work with DataCite.

A1) If it’s public on ORCID anyone can harvest it. And ORCID can harvest any DOI source.

Q2 – Natasha from Griffith University) An organisation is prompted to remove duplicates? How does that work?

A2) We are working on that. We are not ready to roll out bulk creation of identifiers for third party at th emoment. The initial creation will be by individuals and publications. We need to work out how best to do that. Researchers want this to be more efficient so we need to figure that question out.

Topic: How dinosaurs broke our system: challenges in building national researcher identifier services
Speaker(s): Amanda Hill

So I am going to talk about the wider identifier landscape that ORCID and others fits into. So on the one hand we have book-level data, it’s labour intensive, disambiguation first, authors not involved, open. And then we have publisher angle – automatic, disambiguation later, authors can edit, proprietary. In terms of current international activity we have ISNI as well as ORCID. ISNI is very library driven, disambiguation first, authors not involved, broad scope. ORCID is more publisher instigated, disambiguation later, authors can submit/edit, current researchers. ISNI is looking at fictional entities etc. as well as researchers etc. so somewhat different.

We had a Knowledge Exchange meeting on Digital author identifiers in March 2012 and both groups were encouraged and present, they are aware and working with each other to an extent. Both ISNI and ORCID will use of existing pools of data to populate them. There are a number of national author ID systems – in 2011 there was a JISC-funded survey to look at systems and their maturity. We did this via a survey to national organisations. The Lattes system in Brazil is very long term – its been going since 1999 – and very mature and very well populated but there is  a diverse landscape.

In terms of populating systems there is a mixture – some prepopulated, some manual, some authors edit themselves. In Japan there was an existing researcher identifiers, thesaurus of author names in Netherlands. In Norway they use human resources data for the same purpose. With more mature systems a national organisation generally has oversight – e.g. in Brazil, Norway, Netherlands. There is integration with research fields and organisations etc. It’s a bit different in the UK. The issue was identified in 2006 as part of call for proposals for the JISC-funded repositories and preservation programme. Mimas and British Library proposed a two year project to investigate requirements and build a prototype system. This project, the Names project, can seem dry but actually it’s a complex problem. Everyone has stories of ambiguation.

The initial plan was to use the British Library Zetoc service to create author IDs – journal article information from 1993 but it’s too vast, too international. And it’s only last names and initials, no institutional affiliation. So we scrapped that. And luckily the JISC Merit project used 2008 Research Assessment Exercise data to pre-populate the Names database. It worked well except for twin brothers with the same initials both writing on paleantology and often co-authoring papers… in name authority circles we call this the “Siveter problem” (the brothers surnames). We do have both in the system now.

Merit data covers around 20% of active UK researchers. And we are working to enhance records and create new ones with information from other sources. Working with institutional repositories, british library data sets (Zetoc), Direct input from researachers. With current EPrints the RDF is easy to grab so we’ve used that with Huddersfield data and it works well. And we have a Submission form on the website now so people can submit themselves. Now, an example of why this matters… I read the separatedbyacommonlanguage blog and she was stressing about the fact that her name appears in many forms and the REF process. This is an example of why identifiers matter and why names are not enough. And how strongly people feel about it.

Quality really matters here. Automatic matching can only achieve so much – it’s dependent on data source. And some poeple have multiple affiliations. There is no size fits all solution hre. We have colleagues at the British Library who perform manual check of results of matching new data sources – allows for separation/merging of records – they did similar on ISNI. At the moment people can contribute a record but cannot update it. In the long term we plan to allow poeple to contribute their own information.

So our ultimate aim is to have a high quality set of unique identifiers for UK researchers and research institutions. Available to other systems – national and international (e.g. Names records exported to ISNI in 2011). Business model wise we have looked at possible additional services – such as disambiguation of existing data sets, identification of external researchers. About a quarter of those we asked would be interested in this possibility and paying for such added value services.

There is an API for the Names data that allows for flexible searching. There is an EPrints plugin – based on the API – which was released last year. It allows repository users to choose form a list of Names identifiers – and to create a Names record if none exists.

So, what’s happening with names now? We are hopefully funded until the end of 2012. Simeon mentioned the JISC convened Researcher ID group – final meeting will take place in September. That report went out for consultation in June, the report of the consultants went to JISC earlier this week. So these final aspects will lead to recommendations. And we have been asked to produce an Options Appraisakl Report for Uk national researcher identifier service in December. And we are looking at improving data and adding new records via repositories search.

So Names is kind of a hybrid of library/publisher approaches. Automatic matching/disambiguation; human quality checks; data immediately available for re-use in other systems; and authors can contribute and will be able to edit. When Names set up ORCID was two years away, ISNI hadn’t started yet. Things are moving fast. The main challenges here are cultural and political rather than technical. National author/researcher ID services can be important parts of research infrastructure. It’s vital to get agreement and co-ordination at national level here.


Q1) I should have asked Simeon this but you may have some appreciation here. How are recently deceased authors being handled? You have data since 1993 – how do you pick up deceased authors.

A1) No, I don’t think that we would go back to check that.

Q1) These people will not be in ID systems but retrospective materials will be in repositories so hard to disambiguate these.

A1) It is important. Colleagues on Archives Hub are interestied in disambiguation of long dead people. Right now we are focusing on active resaerchers.

A2 – Simeon) Just wanted to add that ORCID has a similar approach to deceased authors.

Q2 – Lisa from University of Queensland) We have 1300 authors registered with author id – how do you marry national and ORCID ID?

A2) We can accomodate all relevant identifiers as needed, in theory ORCID ID would be one of these.

Q3) How do you integrate this system by Web of Science and other commercial databases?

A3) We haven’t yet but we can hold other identifiers so could do that in theory but it’s still a prototype system.

Q4) Could you elaborate on national id services vs. global services?

A4) When we looked across the world there was a lot of variation. It would depend on each countries requirements. I feel a national service can be more responsive to the need of that community. So in the UK we have the HE statisticas agency who want to identify those in universities for instance, ORCID may not be right for that purpose say. I think there are various ways we could be more flexible or responsible as a national system vs ORCID with such a range of stakeholders.

Topic: Creating Citable Data Identifiers
Speaker(s): Ryan Scherle, Mark Diggory

First of all thank you for sticking around to hear about identifiers! I’m not sure even I’m that excited about identifiers! So instead lets talk about what happened to me on Saturday. I was far away… it was 35 degrees hotter… I was at a little house on the beach. The Mimosa House. It’s at 807 South Virginia Dare Trail. Kill Devil Hills, NC USA. 27898. It isn’t a well known town but it was the place where the first Orville bros. flight tests took place at [gives exact geocordinators]. But I had a problem. My transmission [part number] in my van [engine number] and opened the vent to  a new spider and a deadly spider crawled out [latin name]. I’m fine but it occured to me that we use some really strange combinations of identifiers. And a lot of these are very unusable for humans – those geocoordinates are not designed for humans to read out loud in a presentation [or livebloggers to grab!].

When you want data used and reused we need to make identifiers human friendly. Repositories use identifiers… EPrints can use a 6 digit number and URL, not too bad, In Fedora there isn’t an imposed scheme. In this one there is a short accession number but it’s not very prominent, you have to dig around a long URL. Not really designed for humans (I’ll confess I helped come up with this one so my bad too). DSpace does impose a structure. It’s fairly short and easy to cite. If you are used to repositories. But if you look at Nature – a source scientists understand. They use DOIs. When scientists see a DOI they know what this is and how to cite this. So why don’t repositories do this?

So I’m not going to get controversial. I am going to suggest some principles for citable identifiers, you won’t all agree!

1 ) Use DOIs – they are very familiar to scientists and others. Scientists dont understand handles, purls or info URI. They understand DOI. And using it adds weight to your citation – it looks important. And loads of services and tools are compatible with DOIs. Currently EPrints and DSpace don’t support them, Fedora only with a lot of work.

2) Keep identifiers simple – complex identifiers are fine for machines but bad for humans. Despite our best intentions humans sometimes need to work with identifiers manually. So keep as short and sweet as possible. So do repositories support that? Yes all three do but you need the right policies set up.

3) Use syntax to illustrate relationships – this is the controversial bit. But hints in identifiers can really help the user. A tiny bit of semantics to an identifier is increadibly useful. e.f. http://dx.doi.org/10.5061/dryad.123ab/3. A few slashes here help humans look at higher level objects. Useful for human hacks and useful for stats. You can aggregate stats for higher level stuff. Could break in the future, probably wont! Again EPrints and DSpace don’t enable this. Fedora only with work.

4) When “meaning-bearing” content changes, create a versioned identifier – scientists are pretty picky. Some parts objects have meaning, some don’t. For some objects you might have an excel file. Scientists want that file to be entirely unchanged – and only with new URL. Scientists want datat to be invariant to enable reuse by machines, even a single bit makes a difference. Watch out for implicit abstractions – e.g. thumbnails of different images etc. This kind of process seems intuitive but it kinda flies in face of DOI system and conventions. A DOI for an article it resolves to a landing page that could change every day and contain any number of items. Could be with a different publisher. What the scientist cares about is the article of text itself, webpage not so much of an issues.

Contrast that with…

5) When “meaningless” content changes, retain the current identifier – descriptive metadata must be editable without creating a new identifier. Humans rearely care about metadata changes, especially for citation purposes. Again repositories dont handle this stuff so well. EPrints supports flexible versioning/relationships. DSpace has no support. Fedora has implicit versioning of all data and metadata – useful but too granular!

So to build a repository with all of these features we had a lot of work to do. We had previously been using DSpace so we had some work to do here. What we did was add a new DSpace identifier service. It allows us to handle DOI, and to extend to new identifiers in the future. It allows us granular control of when a new DOI is registered and it lets us send these to citation services as required. So our DSpace identifier system uses EZCite at CDL and then also to DataCite. The DataCite content service lets you look up DOIs, they are linked data compliant – you can see relationships in the metadata. You can export metadata in various formats for textual or machine processing purposes. And we added some data into our citations information. When you load a page in Dryad there is a clear “here’s how to cite this item” note as we really want people to cite our material.

In terms of versioning we have put this under the control of the user and that means that when you push a button a new object is created and goes through all the same creation processes – just a copy of the original. So we can also connect back to related files on the service. And we thus have versioning on files. We plan to do more on versioning on the file and track changes on these. We need to think about tracking information in the background without using new identifiers in the foreground. We are contributing much of this back to DSpace but we want to make sure that the wider DSpace community finds this useful, it meets their requirements.

So, how well has it worked? Well it’s been OK. Lots of community change needed around citing data identifiers. Last year we looked at 186 articles associated with Dryad deposits – 77% had “good” citations to the data. 2% had “bad” citations to the data. And 21% had no data citations at all. We are owrking with the community to raise awareness about that last issue. Looking at articles a lot of people cite data in the text of the article, sometimes in supplementary materials at the end. And a bad citation – they called their identifier an “accession number”.

So, how many of you disagree with me here? [some, not tons of people] Great! come see me at dinner! But no matter whether you agree or not do think about identifiers and humans and how they use them. And finally we are hiring developer and user interface posts at the moment, come talk to me!


Q1 – Rob Sanderson, Los Alamos Public Laboratories) I agree with (4) and (5) but DOIs? I disagree! They are familiar but things can change on a DOI, that’s not what you want!

A1) I maybe over simplified. When you resolve a DOI you get to an HTML landing page. There is content – in our case data files. Those data files we guarantee to be static for a given DOI. We do offer an extension to our DOI – you can add /bitstream to get the static bits. But that page does change and restyle from time to time.

Q2 – Robin Rice, Edinburgh University Data Library) We are thinking about whether to switch from handles for DOI but you can’t have a second DOI for a different location… What do you do if you can’t mind a new DOI for something?

A2) You can promote the existing DOI. I question that you can’t have more than one DOI though, you can have a DOI for each instance for each object.

Q2) Earlier it seemed that the DOI issuing agency wouldn’t allow that

A2)  We haven’t come across that issue yet

A2 – audience) I think the DOI agency would allow your sort of use.

 July 11, 2012  Posted by at 2:30 pm LiveBlog, Updates Tagged with:  Comments Off on P5B: Name and Data Identifiers LiveBlog
Jul 112012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: A Repository-based Architecture for Capturing Research Projects at the Smithsonian Institution
Speaker(s): Thorny Staples

I have recently returned to the Smithsonian. I got into repositories through lots of digital research projects. I should start off by saying that I’ll show you screenshots for a system that allows researchers to deposit data from the very first moment of research, it’s in their control until it goes off to curators later.

I’m sure most of you know of the Smithsonian. We were founded to be a research institute originally – museums were a result of that. We have 19 museums, 9 scientific research centers, 8 advances study centres, 22 libraries, 2 major archives and a zoo (Washington zoo). We focus on longterm baseline research, especially in biodiversity and environmental studies, lots of research in cultural heritage areas. And all of this, hundreds of researchers working around the world, has had no systematic data management of digital researvh content (except for SAO who work under contract for NASA).

So the problem is that we need to capture research information as it’s created and make it “durable” – it’s not about presevation but about making it durable. The Smithsonian is now requiring a data management plan for ALL projects of ANY time. This is supposed to say where they will put their digital information, or at least get them thinking about it. But we are seeing very complex arrays of numerous types of data. Capturing the full structure and context of the research content is neccasary. It’s a network model, it’s not a library model. We have to think network from the very beginning.

We have to depend on the researvchers to do much of the work, so we have to make it easy. They have to at least minimally describe their data but they have to do something. And if we want them to do it we must provide incentives. It’s not about making them curators. They will have a workspace, not an archive. It’s about a virtual research environment but a repository-enables VRE. Primary goal is to enhance their research capabilities, leaving trusted data as their legacy. So to deliver that we have to care about a content creation and management environment, an analysis environment and a dissemination environment. And we have to think about this as two repositories: there is the repository for the researcher, they are data owners, they set policies, they have control – crucial buy-in and crucial concept for them; And then we have to think about an interoperable gathering service – a place researcher content feeds into and also cross search/access to multiple repositories back in the other direction as these researchers work in international teams.

Key to the whole thinking is the concept of the web as the model. It’s a network of nodes that are units of content, connected by arcs that are relationships. I was attracted to Fedora because of the notion of a physical object and a way to create networks here. Increasingly content will not be sustainable as discrete packages. We will be maintaining our part of the formalized world-wide web of content. Some policies will mean we can’t share everything all the time but we have to enable that, that’s where things are going. Information objects should be ready to be linked, not copied, as policy permits. We may move things from one repository to another as data moves over to curatorial staff but we need to think of it that way.

My conceptual take here is that a data object is one unit of content – not one file. E.g. a book is one object no matter how many pages (all of which could be objects). By the way this is a prototype, this isn’t a working service, it’s a prototype to take forward. And the other idea that’s new is the “concept object”. This is an object with a metadata about the project as a whole then a series of concept objects for the components of that project. If I want to create a virtual exhibition I might build 10 concept objects for those paintings and then pull up those resources.

So if you come into a project you see a file structure idea. Theres an object at the top for the project as a whole. Your metadata overview, which you can edit, lets you define those concepts. The researcher controls every object and all definitions. The network is there, they are operating within it. You can link concepts to each other, it’s not a simple hierachy. And you can see connections already there. You can then ingest objects – right now we have about 8 concept types (e.g. “Research site, plot or area”). When you pick that you then pick which of several forms you want to use. When you click “edit” you can see the metadata editor in a simple web form prepopulated with existing record. And when you look at resources you can see any resources associated with that concept. You can upload resources without adding metadata but it will show in bright yellow to remind you to add metadata. And you can attach batches of resources – and these are offered depending where you are in the network.

And if I click in “exhibit” – a link on each concept – you can see a web version of the data. This takes advantage of the adminstrator screen but allows me to publish my work to the web. I can keep resources private if I want. I can make things public if I want. And when browsing this I can potentially download or view metadata – all those options defined by researcher’s setting of policies.


Q1 – Paul Stanhope from University of Lincoln) Is there any notion of concepts being bigger than the institution, being available to others

A1) We are building this as a prototype, as an idea. So I hope so. We are a good microcosm for most types of data – when the researcher picks that they pick metadata schemas behind the scenes. This think we built is local but it could be global, we’re building it in a way that could work that way. With the URIs othwe intstitutions can link their own resources etc.

Q2) Coming from a university, do you think there’s anything different about your institution? Is there a reason this works differently?

A2) One of the things about the Smithsonian is that all of our researchers are Federal employees and HAVE to make their data public after a year. That’s a big advantage. We have other problems – funding, the government – but policy says that the researchers have to

Q3 – Joseph Green from University College Dublin) How do you convey the idea of concept objects etc. to actual users – it looks like file structures.

A3) Well yes, kind of the idea. If they want to make messy structures they can (curators can fix). The only thing they need is a title for their concept structure. They do have a file system BUT they are building organising nodes here. And that web view is an incentive – it’ll look way better if they fill in their metadata. Thats the beginning… for tabular data objects for instance they will be required to do a “code book” to describe the variables. They can do this in a basic way or they can do better more detailed code book and it will look better on the web. We are trying to incentivise  at every level. And we have to be fine with ugly file structures and live with it.

Topic: Open Access Repository Registries: unrealised infrastructure?
Speaker(s): Richard Jones, Sheridan Brown, Emma Tonkin

I’m going to be talking about an Open Access Repositories project that we have been working on, funded by JISC, looking at what Open Access repositories are being used for and what their potential is via stakeholder interviews, via a detailed review of ROAR and OPENDOAR, and somerecommendations.

So if we thought about a perfect/ideal repository as a starting point… we asked out stakeholders what they would want. They would want it to be authoritative – the right name, the right URL; they want it to be reliable; automated; broad scope; curated; up-to-date. The idea of curation and the role of human intervention would be valuable although much of this would be automated. People particularly wanted the scope to be much wider. If a data set changes there are no clear ways to expand the registry and that’s an issue. But all of those terms are really about the core things you want to do – you all want to benchmark. You want to compare yourself to others and see how you’re doing. And in our sector and funders they want to see all repositories, what are the trends, how are we doing with Open Access. And potentially ranking repositories or universities (like Times HE rankings) etc.

But what are they ACTUALLY being used for right now? Well mainly use them for documenting their own existing repositories. Basic management info. Discovery. Contact info. Lookups for services – use registry for OAI-PMH endpoints. So that’s I think, it looks as if we’re falling a bit short! So, a bit of background on what OA repository registries there are. So we have OpenDOAR, ROAR (Registry of Open Access Repositories) – those are both very broad scope repositories, well known and well used. But there is also the Registry of Biological Repositories. There is re3data.org – all research data so it’s a content type specific repository registry. And, more esoterically, is the Ranking Web of World Repositories. Not clear if this is a registry or a service on a registry. And indeed that’s a good question… what services run on registries. So things like BASE search for OAI-PMH endpoints, very similar to this is Institutional Respositories Search based at Mimas in the UK. Repository 66 is a more novel idea – mashup with Google Maps to show repositories around the world. Then there is the Open Access Repository Junction a multideposit tool for discovery and use of Sword endpoints.

Looking specifically at OpenDOAR and ROAR. OpenDOAR is run at University at Nottingham (SHERPA) and it uses manual curation. Only lists OA and Full-text repositories. It’s been running since 2005. Whereas DOAR is principally Repository Manager added records. No manual curation. And lists both full-text and metadata only. Based at University of Southampton and running EPrints 3, inc. SNEEP elements etc. Interestingly both of these have policy addition as an added value service. Looking at the data here – and these are a wee bit out of date (2011). There seems to be big growth but some flattening out in OpenDOAR in 2011 – probably approaching full coverage. ROAR has a larger number of repositories due to difference in listing but quite similar to OpenDOAR (and ROAR harvests this too). And if we look at where repositories are both ROAR and OpenDOAR are highly international. Slightly more European bias in OpenDOAR perhaps. The coverage is fairly broad and even around the globe. When looking at content type OpenDOAR is good at classifying material into types, reflective of manual curation. We expect this to change over time, especially datasets. ROAR doesn’t really distinguish between content types and repository types – it would be interesting to see these separately. We also looked at what data you typically see about the repository in any record. Most have name, URL, location etc. OpenDOAR is more likely to include a description and contact details than is the case in ROAR. Interestingly the machine to machine interfaces are a different story. OpenDOAR didn’t have any RSS or SWORD endpoint information at all, ROAR had little. I know OpenDOAR are changing this soon. This field has been added on later in ROAR and no-one has come back to update this new technology, that needs addressing.

A quick not about APIs. ROAR has OAI-PMH API, no client library, full data dump available. OpenDOAR has a fulled documented query API, no client library and full data dump available. When we were doing this work almost no one was using the APIs, they just download all data.

We found stakeholders, interviewees etc. noted some key limitations: content count stats are unreliable; not internationalised/multilingual- particularly problematic if a name is translated and is the same as but doesnt appear to be the same thing; limited revisions history; No clear relationships between repos, orgs, etc. And no policies/mechanisms for populating new fields (e.g. SWORD). So how can we take what we have and realise potential for registries? There is already good stuff going on… Neither of those registries automatically harvest data from repositories but that would help to make data more authoritative/reliable/up to date; automated; increased scope of data – and that makes updates so much easier for all.  And we can think about different kinds of quality control – no one was doing automated link checking or spell checking and those are pretty easy to do. And an option for human intervention was in OpenDOAR but not in ROAR, and that could be make available.

But we could also make them more useful for more things – graphical representaqtions of the registry; better APIs and Data (with standards compliance where relevent); versioning of repositories and record counts; more focus on policy tools.  And we could look to encourage overlaid services: repository content stats analysis; comparitive statistics and analytics; repository and OA rankings; text analysis for identifying holdings; error detection; multiple deposits. Getting all of that we start hitting that benchmarking objective.


Q1 – Owen Stephens) One of the projects I’m working on is CORE project from OU and we are harvesting repositories via OpenDOAR. We are producing stats about harvesting. Others do the same. It seems you are combining two things – benchmarking and repositories. We want OpenDOAR to be comprehensive, and we share your thoughts on need to automate and check much of that. But how do we make sure we don’t build both at the same time or separate things out so we address that need and do it properly?

A1) The review didn’t focus on structures of resulting applications so much. But we said there should be a good repository registry that allows overlay of other services – like the benchmarking services. CORE is an example of something you would build over the registry. We expect the registry to provide mechanism to connect up to these though. And I need to make an announcement: JISC, in the next few weeks, will be putting out an ITT to take forward some of this work. There will be a call out soon.

Q2 – Peter from OpenDOAR) We have been improving record quality in OpenDOAR. We’ve been removing some repositories that are no longer there – link checking doesn’t do it all. We also are starting to look at including those machine to machine interfaces. We are doing that automatically with help from Ian Stuart at EDINA. But we are very happy to have them sent in too – we’ll need that in some case

A2) you are right that link checkers are not perfect. More advanced checking services can be built on top of registries though.

Q3) I am also working on the CORE project. The collaboration with OpenDOAR where we reuse their data, it’s very useful. Because we are harvesting we can validate the repository and share that with OpenDOAR. The distinction between registries and harvesting is really about an ecosystem that can work very well.

Q4) Is there any way for repositories to register with schema.org to enable automatic discovery?

A4) We would envision something like that, that you could get all that data in a sitemap or similar.

A4 – Ian Stuart) If registering with Schema.org then why not register with OpenDOAR?

A4 – chair) Well with scheama.org you host the file, its just out on the web.

Q5) How about persistant URLs for repositories?

A5) You can do this. The Handle in DSpace is not a persistant URL for the repository.

Topic: Collabratorium Digitus Humanitas: Building a Collaborative DH Repository Framework
Speaker(s): Mark Leggott, Dean Irvine, Susan Brown, Doug Reside, Julia Flanders

I have put together a panel for today but they are in North America so I’ll bring them in virtually… I will introduce and then pass over to them here.

So… we all need a cute title and Collaboratory is a great word we’ve heard before. I’m using that title to describe a desire to create a common framework and/or set of interoperable tools providing a DH Scholars Workbench. We often create great creative tools but the idea is to combine and make best use of these in combination.

This is all based on Islandora. A Drupal+ Feora framework from UPEI. Flexible UI on top of Fedora and other apps. It’s deployed in over 100 institutions and that’s growing. The ultimate goal of those efforst is to release a Digital Humanities solutions packs with various tools integrated in, in a framework that would be of interest to scholarly DH context – images, video, TEI, etc.

OK so now my colleagues…

Dean is visiting professor in Yale, and also professor at Dalhousie University in Canada and part of a group that creates new versions of important modernism in canada prints. Dean: so this is the homepage for Modernist Commons. This is the ancillery site that goes with the Modernism in Canada project. One of our concerns is about long term preservation about digital data stored in the commons. What we have here is both the repository and a suite of editing tools. When you go into the commons you will find a number of collections – all test collections and samples from the last year or so. We have scans of a bilingual publication called Le Nigog, a magazine that was published in Canada. You can view images, mark-up, or you can view all of the different ways to organise and orchestrate the book object in a given collection. You can use an Internet Archive viewer or alternative views. The IA viewer frames things according to the second to last image in the object, so you might want to use an alternative. In this viewer you can look at the markup, entities, structures, RDF relations or whether you want to look at image annotations. The middle pane is a version of CWRC Writer that lets us do TEI and RDF markup. And you see the SharedCanvas tools provided with other open annotation group items. As you mark up a text you can create author authority files that can be used across collections/objects.

Next up Victoria Brown, her doctorate is on Victorian feminist literature. She currently researches collaborative systems, interface design, usability. Victoria: I’ll be talking more generally than Dean. The Canadian Writing Research Council is looking to do something pretty ambitios that only works in a collaborative DH environment. We have tools that can aim as big as we can. I want to focus on talking about a couple of things that define a DH Collaboratory. It needs to move beyond institutional repository model. To invoke persoective of librarian colleagues I want to address what makes us so weird… What’s different about us is that storing final DH materials is only part of the story, we want to find, amass, collect materials; to sort and organise them; to read, analyse and visualize. That means environments much be flexible, porous, really robust. Right now most of that work is on personal computers – we need to make these more scalable and interoperable. This will take a huge array of stakeholders buying into these projects. So a DH repository environment needs to be easy o manage, diverse and flexible. And some of these will only have a small amount of work and resources. In many projects small teanms of experts will be working with very little funding. So the CWRC Writer here shows you how you edit materials. On the right you see TEI markup. You can edit this and other aspects – entities, RDF open annotation mark up etc, notations allows you to construct triples from within the eidt. One of the ways to encourage interoperability is through use of common entities – connecting your work to the world of linked data. The idea is to increase consistency across projects with TEI markup and RDF means better metadata than the standard working in Word, publishing in HTML many use. So this is a flexible tool. Embedding this in a repository does raise questions about revisioning and archiving though. One of the challenges for repositories and DH is how we handle those ideas. Ultimately though we think this sort of tool can broaden participation in DH and collaboration in DH content. I think the converse challenge for DH is to work on more generalised environments to make sure that work can be interoperable. So we need to take something from solid and stable structure and move to the idea of shared materials – a porous silo maybe – where we can be specific to our work but share and collaborate with others.

The final speaker is Doug, he became first digital curator at NYPL. He’s currently editing music of the month blog at NYPL. Doug: the main thing we are doing is completely reconfiguring our repository to allow annotation of Fedora and take in a lot of audio nad video content. And particularly for large amounts of born digital collections. We’ve just started working with a company called BrightCove to share some of our materials. Actually we are hiring an engineer to design the interface for that – get in touch. We are also working on improved display interfaces. Right now it’s all about the idea of th egallery – the idea was that it would self-sustain through selling prints. We are moving to a model where you can still view those collections but also archival materials. We did a week long code sprint with DH developers to extend the Internet Archive book reader. We have since decided to move from that to New York Times backed reader – the NYT doc viewer with OCR and annotation there.


Q1) I was interested in what you said about the CWRC writer – you said you wnated to record every key stroke. Have you thought about SVN or GIT that do all that versioning stuff already.

A1 – Susan) They are great tools for version control and it would be fascinating to do that. But do you put your dev money into that or do you try to meet needs of greatest number of projects? But we would definitely look in that direction to look at challenges of versioning introduced in dynamic online production environments.


 July 11, 2012  Posted by at 12:27 pm LiveBlog, Updates Tagged with: , , , ,  Comments Off on P4B: Shared Repository Services and Infrastructure LiveBlog
Jul 112012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Multivio, a flexible solution for in-browser access to digital content
Speaker(s): Miguel Moreira

Multivio is a generic browser and visualizer for digital objects, a presentation layer for document servers, and an add-on for other infrastructure. It’s main principle: when searching a document server, users are provided with immediate access to content. It’s origins lie in RERO and its digital library. In 2006, an internal survey showed desire for a service that eventually became Multivio – an adequate presentation layer for full-text, structure-rich files and show patrimonial (heritage) collections. It does all of this quickly and directly, as opposed to traditional solutions.

Multivio was developed because other solutions were not flexible enough. It is co-funded by RERO and the Electronic Library of Switzerland. Development took place between 2008-2011, with an official release in 2011.

Using Multivio is straightforward. Provide a URL to a file (PDF, image, sound, video, etc) or a combination of files. Then Multivio will investigate structure and content, and provide it to the user in a convenient searchable interface in browser.

Multivio, given content, shows in a window over a given page. It pulls in content very quickly and shows it off visually with JavaScript and HTML. No pre-indexing necessary.

Multivio is a full-featured HTML5 document viewer. It allows zoom, search, copy and paste. It also has an elegant way of handling large and multi-file documents, which can be shown together without downloading. It is low-bandwidth consumptive, and based on widely accepted web standards. All it requires is a modern browser client-side. Server-side, the role of Multivio is rendering, search and extraction. It uses Python and Poppler (for PDFs). The only other requirement is that remote content be fetched and stored on-server.

Multivio.org to check it out. For a public demonstrator, go to demo.multivio.org – usable with any web-accessible document.

The advantage of Multivio is performance, customization, access control. It only requires a Unix server running Python.

The CORE Portal is using Multivio now.

In the future, support for audio and video will be added and improved, along with authentication and access control. Calendar-based navigation of publications is coming as well.

Q: Do you do PDF file processing beforehand?

A: No, it is all done on the fly. Poppler is very effective at doing this. It wastes no time or bandwidth in grabbing what it needs.

Q: Do you do OCR processing? Can individual pages of a document be shared/navigated to directly?

A: No OCR processing. As for page-specific URLs, the client API allows for file URLs with page numbers. This isn’t being used for analysis of document usage yet, but that is very interesting.

Q: For multimedia, what experience do you have working with it?

A: We are starting to have and use that content. Prototypes are showing one video format so far – we must work on that. It’s a challenge, but we know it’s possible. We will rely on HTML5 and modern browsers, and if needed maybe fall back on Flash. Further investigation has to be done.

Q: More details on access control?

A: It’s on the todo list. Right now the solution is to install the Multivio server alongside protected documents. Multivio needs access rights, then it can restrict what it displays.

Q: How will this interact with usage metrics?

A: There’s an intention to work on this in the future. It’s important. We will still provide direct download, and do basic view analysis, but we hope to go much farther.


Topic: Biblio-transformation-engine: An open source framework and use cases in the digital libraries domain
Speaker(s): Kostas Stamatis, Nikolaos Konstantinou, Anastasia Manta, Christina Paschou, Nikos Houssos

This will be a backend talk. Sorry in advance. This is an open source framework that has been in development for 4-5 years. It facilitates digital transformations in library systems. It’s a solution to a common problem.

This tool has been used extensively so far. Digital transformations are a necessary reality in libraries, repositories, everything. You need to transform data to get into any publishing system or database, to migrate it or share it. Such processes need to constantly be evolving, so the framework provides systematic management of code that does all that. This will accelerate common transformation tasks.

The first step in a framework is creating an analysis, finding the abstractions that will represent common procedures. From that, the steps are retrieving data records, applying processing and changing any given records or field values, then finally generating the desired output. The less obvious finding is that there is a demand for incremental or selective data loading – breaking up the task, say.

The design goals demanded customisability, non-intrusiveness, ease of use, and the ability to integrate or extend for anyone who needs the Biblio-transformation-engine.

The components of the engine. The Data Loader itself, which retrieves data from sources according to its own spec. The Processing step transforms information with a filter, then modifier, then initializer. The output generator actually creates the desired product.

The FLOSS library was developed in Java (maven-based). FLOSS is available online in EU Public License – free to download, use, comment upon.

Use cases. One is generating linked open data in repository records, legacy cultural material records, CERIF information. Corresponding data loaders are reused. Filters and modifiers can be totally agnostic of RDF and input formats. JENA RDF generates triples. It also adds or generates appropriate identifiers/URI for entities.

Another is populating repositories from EndNote, RIS, Bibtex, UNIMARC. A third and fourth are feeding VOA3R and European aggregators.

In the future, the project hopes to support more data transformations, extend declarative specification of mapping for complex cases. Also some infrastructure to reuse Filter and Modifier implementations. Finally, the project would like to study user experience to sort out the little things and make life easier.

Q: You’re using CSL and JS – are you running JS on client or server side?

A: JS on server side. A modifier calls a JS server.

 July 11, 2012  Posted by at 12:23 pm LiveBlog, Updates Tagged with:  Comments Off on P4A: Accessing Digital Content LiveBlog
Jul 112012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Panel Discussion Proposal: “Effective Strategies for Open Source Collaboration” Panel Proposal “Effective Strategies for Open Source Collaboration”
Speaker(s): Tom Cramer, Jon William Butcher Dunn, Valorie Hollister, Jonathan Markow

This is a panel session, so it’s a little bit different. We’ve asked all of our DuraSpace experts here about collaborations they have been engaged in and then turn to you for your experiences of collaboration – what works and what doesn’t.

So, starting with Tom. I’m going to talk about 3 different Open Source technologies we are involved in. First of these is Blacklight which is in use in many places. It’s a faceted search application – it’s Ruby-on-Rails on solr. Originally developed at UVa around 2007, 1st adopted external to UVa in 2009. It’s had multiple installations, 10+ committer institutions etc.

Hydra is a framework for creating digital asset management apps to supplement Fedora. Started in 2008 in Hull, Stanford and Virginia with FedoraCommons. It’s institutionally-driven And developer-led.

And the last item is IIIF: International Image Interoperability Framework – I’ll be talking more on this later – an initiative by major research libraries across the world – a cooperative definition of APIs to enable cross-repository image collections. It’s a standards not technology project.

Lessons learned…

DO: Work from a common vision; be productive, welcoming and fun; engineer face-time is essential; get great contributors – they lead to more great contributors too!

DON’T: over-plan, over-govern; establish too many cross institution dependencies; get hooked on single sources of funding.

Now over to Jon. A few collaborations. First up Sakaibrary. Sakai is an eLearning/course management tool used by dozens of institutions. There was a collaborative project between Indiana University and University of Michigan Libraries to develop extensions to Sakai and facilitate use of library resources in teaching and learning. Top down initiative from university head librarians. Mellon funding 2006-2008 (http://sakaibrary.org).

The second project is Variations on Video. This one is a collaboration between Indiana University and Northwestern University Libraries – with additional partners for testing and feedback. This is a single cross institution team using AGILE Scrum approaches.

Lessons learned from these projects… Success factors: initial planning periods – shared values and vision being established – helped very much; good project leadership and relationships between leaders important; collaborative development model. Some challenges: Divergent timelines; electronic communication vs. face-to-face – very important to meet face to face; existing community culture; shifts in institutional priorities and sustainability.

Now over to Val, Director of Community Programs for DuraSpace. Part of my role is to encourage teams to collaborate and gain momentum within the DSpace community. We are keen to get more voices into the development process. We had DSpace Developer meeting on Monday and have made some initial tweaks, and continue to tweak, the programme. So what is the DSpace Community Advisory Team? Well we are a group of mostly repository managers/administrators. Developers wanted help/users wanted more input. Formed in Jan 2011, 5-7 active members. DCAT helps review/refine new feature requests – get new voices in there but also share advice, provide developer help. We had a real mission to assess feature requests, gauge interest, and enable discussion.

Some of the successes of DCAT. We have reviewed/gathered feedback on 15_ new feature requests – 3 were included in the last release. It really has broadened development discussion – developers and non-developers, inter/intra-institution. And it has been useful help/resource for developers – community survey by DCAT and provided recommendation on the survey. Feedback on feature implementation.

Challenges for us: no guarantee that a feature makes it in – despite everyone’s efforts features still might not make it in, because of resource limitations; continue to broaden discussion and broaden developer pool; DCAT could also be more helpful during the release process itself – to help with testing, working out bugs etc.

So the collaboration has been successful with discussion and features but continue to do better at this!

Now Jonathan is asking the panel: how important is governance in this process? How does decision making take place?

Tom: Different in different communities. And bottom up vs. top down makes a big difference. In bottom up it’s about developers working together, trusting each other, building the team but keeping code quality is challenging on a local and broader level for risk averse communities.

Jon: governance different between the two projects. In both cases we did have a project charter of sorts. for Sakaibrary it was more consensus based – good in some ways but maybe a bit less productive as a project as a result. In terms of prioritisation of features in the video project we are making use of the scrum concept really and the idea of product owners is very useful there. We try to involve whole team but product owner define priorities. When we expand to other institutions with their own interests we may have to explore other ways of doing things – we’ll need to learn from Hydra etc.

Val: I think DCAT is a wee bit different. Initially this was set up between developers and DCAT and that has been an ongoing conversation. Someone taking the lead on behalf of developers was useful. And for features DCAT members tend to take the lead on a particular request or other to lead analysis etc. of it.


Q1) In a team development effort there is great value to being able to pop into someone’s office and ask for help. And lots of decisions made for free – a discussion really quickly. When collaborative even a trivial decision can mean a 1 hr conference call. How do you deal with that.

A1 – Jon) In terms of the video project we take a couple of approaches – we use IRC channel and Microsoft Link for one-t0-one discussion as needed. We also have daily 15 min stand up meeting via telephone or video conference. And that agile approach with 2 week cycles means it’s not hugely costly to take the wrong approach or find we want to change something.

A1 – Tom) With conference calls we now feel if it takes an hour we shouldn’t make that decision. Move to IRC rather than email is a problem in different time zones. Email lets you really consider things through and that’s no bad thing.. one member of the Blacklight community is loquacious but often answers his own questions inside of an hour! you just learn how to work together.

A1 – Jonathan) We really live on Skype and that’s great. But I miss water cooler moments, tacit understandings that develop there. There’s no good substitute for that.


Topic: High North Research Documents – a new thematic and global service reusing all open sources
Speaker(s): Obiajulu Odu, Leif Longva

Our next speakers are from the University of Tromso. The High North Research Documents is a project we began about six months ago. You may think that you are high in the North but we are from far arctic Norway. This map gives a different perspective on the globe, on the north. We often think of the north as the North of America, of Asia etc. but the far north is really a region of it’s own.

The Norwegian government has emphasized the importance of northern areas and the north is also of interest on an international level – politically and strategically; environmental and climate change issues; resource utilization; the northern sea route to the Pacific. And our university, Tromso, is the northernmost university in the world and we are concerned with making sure we lead research in the north. And we are involved in many research projects but there can be access issues. The solution is Open Access research literature and we thought that it would be a great idea to look at the metadata to extract a set of documents concerned with High North research.

The whole world is available through aggregators like OAIster (OCLC) and BASE (University of Bielefeld) and they have been harvesting OA documents across the world. We don’t want to repeat that work. We contacted the guys a Bielefeldand they were very useful. We have been downloading their metadata local allowing us to do what we wanted to do to analyse the metadata.

Our hypothesis was if we selected a set of keywords and they are in the metadata then the thematic scope of the document can be identified. So we set up a set of filtering words (keywords) applied to the metadata of BASE records based on: geographic terms; species names; languages and folks (nations); other keywords. We have mainly looked for English and Norwegian words, but there is a bigger research world out there.

The quality of keywords is an issue – are their meanings unambiguous. Labrador for instance for us is about Northern Canada, it has a different meaning – farmer or peasant – in Spanish. Sami is a term for people but it is also a common given name in Turkey and Finland! So we have applied keywords filtering a selection of elements – so “sami AND language” or “sami AND people”. The filter process is applied only to selected metadata elements – title, description, subject. But it’s not perfect.

Looking at the model we have around 36 million documents from 2150 scholarly resources. These are filtered, extracted. And one subset of keywords go right into the High North Research Documents database. Another set of keywords we don’t trust as much so they go through a manual quality control first. Now over to my colleague Obiajulu.

Thank you Leif. We use a series of modules in the High North System model. The Documents service itself is DSpace. The import module gets metadata records and puts them in our MySQL database. After documents are imported we have the extraction module – applies the extraction criteria on the metadata. The Ingest module transforms metadata records relevant to the high north into DSpace XML format and imports them into a DSpace repository. And we have the option of addicting custom information – including use of facets.

Our Admin Module allows us to add, edit or display all filtering words (keywords). And it allows us to edit the status of a record or records – Blacklist/reject; approved; modified. So why do we use DSpace? Well we have used it for 8 or 9 years to date. It provides end use with both a regular search interface and faceted search/browsing. Our search and discovery interface is an extension of DSpace and it allows us to find out about any broken links in the system.

We are on High North RD v 1.1. 151,000 documents extracted from more than 50% of the sources appealing in BASE and from all over the world. Many different languages – even if we apply mainly English and Norwegian and Latin in the filtering process. Any subject but weight on the hard sciences. And we are developing the list of keywords as a priority so we have more and better keywords.

When we launched this we tried to get word out as far and wide as possible. Great feedback received so far. The data is really heterogeneous in quality, full text status etc. so feedback received has been great for finding any issues with access to full text documents.

Many use their repository for metadata only. That would be fine if we could identify where a record is metadata only. We could use the dc:rights but many people do not use this. How do we identify records without any full text documents. We need to weed out many non-OA records from High North RD – we only want OA documents, it’s not a bibliographic service we want to make. Looking at document types we have a large amount of text and articles/journals but also a lot of images (14-15% ish). The language distribution shows English. Much smaller percentage in French, Norwegian… and other languages.

So looking at the site (http://highnorth.uit.no/). It’s DSpace and everything in it is included in a single collection. So… if I search for pollution we see 2200 results and huge numbers of keywords that can be drilled down into. You can filter by document type, date, languages etc.

And if we look at an individual record we have a clear feedback button that lets users tell us what the problem is!


Q1) You mentioned checking quality of keywords you don’t trust, and that you have improvements coming to keywords. Are you quality checking the “trusted” keywords.

A1) When we have a problem record we can track back over the keywords and see if one of those is giving is giving us problems, we have to do that this way.

We believe this to be a rather new method, to use keywords in this way to filter content. We haven’t come across it before, it’s simple but interesting. We’d love to hear about any other similar system if there are any. And it would be applicable to any topic.

Topic: International Image Interoperability Framework: Promoting an Ecosystem of Open Repositories and Open Tools for Global Scholarship
Speaker(s): Tom Cramer

I’m going to talk about IIIF but my colleagues here can also answer questions on this project. I think it would be great to get the open repositories community involved in this process and objectives.

There are huge amounts of image resources on the web – books, manuscripts, scrolls, etc. Loads of images and yet really excellent image delivery is hard, it’s slow, it’s expensive, it’s often very disjointed and often it’s too ugly. If you look at bright spots: CDragon, Google Arts, or other places with annotation or transcription it’s amazing to see what they are doing vs. what we do. Its like page turners a few years ago – there were loads, all mediocre. Can we do better?! And we – repositories, software developers, users, funders – all suffer because of this stuff.

So consider…

… a paleographer who would like to compare scribal hands from manuscripts at two different repositories – very different marks and annotations.

— an art and architecture instructor trying to assemble a teaching collection of images from multiple sources..

… a humanities scholar who would like to annotate a high resolution image of an historical map – lots of good tools but not all near those good resources.

… a repository manager who would like to drop a newspaper viewer with deep zoom into her site with no development of customization required

… a funder who would like to underwrite digitization of scholarly resources and decouple content hosting and delivery.

We started last September a year long project to look at this – a group of 6 of the worlds leading libraries and Stanford. Last September we looked at the range of different image interfaces. Across our 7 sites there were 15 to 20 interfaces, including Oxford it was more like 40 or 50 interfaces. Oxford seems to have lots of legacy humanities interfaces – lovely but highly varied – hence the increase in numbers.

So we want specialised tools but less specialised environment. So we have been working on Parker on the web project – mediaeval manuscripts project with KCL and Stanford. the La Munda Le Rose is similar in type. Every one of these many repositories is a silo – no interoperability. Every one is a one-off – big overhead to code and keep. And every user is forced to cope – many UIs, little integration. no way to compare one resource with another. They are great for researchers who fed into the design but much less useful for others.

Our problem is we have confused the role of responsibilities of the stakeholders here. We have scholars who want to find, use, analyze, annotate. they want to mix and match, they want best of breed tools. We have toolers – build useful tools and apps – want users and resources. And we have the repositories who want to host, preserve and enrich records.

So for the Parker project we had various elements managed via APIs. We have a TPEN transcription tool. We sent TPEN hard drive full of Tiffs to work on. Dictionary of Old English, they couldn’t take a big file of TIFFs but we gave them access to the database. We also had our own app. So our data fed into three applications here and we could have taken the data on some round trips – adding annotations before being fed into database. And by taking those APIs into a Framework and up into an Ecosystem we could enable much more flexible solutions – ways to view resources in the same environment.

So we began some DMS Tech work. We pulled together technologists from a dozen institutions to look at best tools to use, best adaptations to make etc. and we came up with basic building blocks for ecosystem: image delivery API (speced and built); data model for medieval manuscripts (M3/SharedCanvas) – we anticipate people wanting to page through documents – for this type of manuscript the page order, flyleafs, inserts etc. are quite challenging; support for authentication and authorization – it would be great if everything was open and free but realistically it’s not; reference implementations of load balanced, performant Djatoka server – this seemed to be everyone’s page turning software solution of choice; interactive open source page turning and image viewing application; OAC-compatible tools for Annotation (Digital Mappaemundi) and transcription (T-PEN).

We began the project last October, some work already available. the DMS Index pulls data from remote repositories and you can explore in a common way as the data is structured in a common way. You can also click to access annotation tools in DM, or to transcribe the page from TPEN etc. So one lets you explore and interact with this diverse collection of resources.

At the third DMS meeting we started wondering if, if this makes sense for manuscripts, doesn’t this make sense for other image materials. IIIF basically takes the work of DMS and looks how we can bring these to the wider world of images. We’ve spent the least 8 or 9 months putting together the basic elements. So there is a Restful interface to pic up an image from a remote location. We have a draft version of the specification available for comment here: http://library.stanford.edu/iiif/image-api. What’s great is the possibility to bring in functionality on images into your environment that you don’t already offer but would like to. Please do comment into 0.9 proclamation you have until 4pm Saturday (Edinburgh time).

The thing about getting images into a common environment is that you need metadata. We want and need to focus on just what the key metadata needs to be – labels, title, sequence, attribution etc. Based on http://shared-canvas.org (synthesis of OAC (open annon. collab) and DMS.

from a software perspective we are not doing software development but we hope to ferment lots of software development. So we have thought of this in terms of tiers for sharing images. Lots of interest in Djatoka, IIIF Image API and then sets of tools for deep panning, zooming, rotating etc. And then moving into domain and modality specific apps. And so we have a wish list for what we want to see developed.

This was a one year planning effort – Sept 2011 – Aug 2012. We will probably do something at DOF as well. We have had three workshops. We are keen to work with those who want to expose their data in this sort of way. Just those organisations in the group have millions of items that could be in here.

So… What is the collective image base of the Open Repository community? What would it take to support IIIF APIs natively from the open repository platforms? What applications do you have that could benefit from IIIF? What use cases can you identify that could and should drive IIIF? What should IIIF do next? Please do let us know what we could do or what you would like us to do.

Useful links: IIIF: http://lib.stanford.edu/iiif; DMS Interop: http://lib.stanford.edu/dmm; Shared-canvas: http://shared-canvas.org.


Q1) Are any of those tools available, open source?

A2) T-Pen and DM are probably available. Both Open Source-y. Not sure if code distributed yet. Shared Canvas code is available but not easy to install.

Q2) What about Djatoka and improved non buggy version?

A2) There is a need for this. Any patches or improvements would be useful. There is a need and no-one has stepped up to the plate yet. We expect that as part of IIIF that we will publish. The national library of Norway rewrote some of the coding in C, which improved performance three-fold. They are happy to share this. It is probably open source but hard to find the code – theoretically open source.

And with that we are off to lunch…

 July 11, 2012  Posted by at 10:03 am LiveBlog, Updates Tagged with:  Comments Off on P3B: Open Source: Software and Frameworks LiveBlog