Sep 06 2012

As we approach publishing a final post of highlights from Open Repositories 2012 and move this website towards being an archive of this year’s event, we wanted to let you know how you can begin connecting with next year’s conference.

Open Repositories 2013 will be taking place on Prince Edward Island (PEI), Canada, and you may recall that in the very warm welcome the team gave at OR2012 they promised to have their website live very soon… Well, the OR2013 website is now live! Bookmark it now:

On the website already you’ll find some introductory information on the Island and highlights of what you’ll be able to enjoy during your conference stay. An OR2013 Crowdvine has also been set up so do go and sign up.

OR2013 have also launched their Twitter account: you can find them as @openrepos2013 and they are using the hashtag #OR2013 to get the conversation around next year’s conference started.

So, over the next few months you can not only look forward to some updates from the OR2012 team but you can also look forward to hearing much more about OR2013 from the Prince Edward Island team and start planning your ideas, papers, etc.





September 6, 2012. Posted at 9:45 am in Updates.
Aug 20 2012

It has now been a month since we gathered in Edinburgh for Open Repositories 2012 and we are delighted to report that there has been plenty of new content and reflection about the conference appearing since then.

Well over 90 blog posts and reports on the conference are now out there – you have been absolutely brilliant over the last few weeks sharing your reports, reflections and thoughts on how to take forward the fantastic ideas shared by speakers, posters, fellow delegates. We are sure there are more posts to come (since it has taken us a while to update this blog and we’re sure we’re not the only ones still thinking about talks, ideas, discussions had) so do let us know as you add any reports or write ups of your own. For now here are a few more highlights we wanted to share while everything is still fairly fresh – look at the bottom of this post for links to a more thorough collection of posts.

Firstly we have noticed lots of you sharing links to your slides on SlideShare. We will be making sure all of the programme content, slides and videos are connected up here on the website but for now we are making sure we gather these links to your shared presentations. For instance Todd Grappone and Sharon Farb at UCLA have shared their slides on the broadcast news archival work. This ambitious project is one to keep an eye out for, especially when it opens to the public in the future.

Research data has featured prominently in many of your write ups as it was a major theme of this year’s Open Repositories:

Leyla Williams blogged a summary of the conference for the Center For Digital Research and Scholarship, with particular attention paid to research data and public access to hives of content.

Meanwhile Leslie Johnston of the Library of Congress gave a talk on big data, and also wrote up a great post on the significance of data in a repository setting where publications were once the center focus.

Tyrannosaurus and Shark in National Museum

Some people say open access policy has no teeth… (‘OR2012 012′ by wr_or2012, 22-07-12)

In addition to delegates and attendees who have been sharing their experiences, some of our workshop facilitators have been sharing rich reflections on their workshops. For example Angus Whyte of the Digital Curation Centre further developed the idea of research data in repositories, and wrote up the conference workshop on the subject.

Most of you will have seen some of the Developer Challenge Show & Tell sessions and we are delighted that the DevCSI team have shared their videos of OR2012. They are a great collection of Developer Challenge presentations and short interview recordings, like this clip of Peter Sefton, chair of the judges:

We are also starting to see some really interesting posts about how OR2012 ideas and talks can be operationalised. For instance Simon Hodson of JISC has posted a whole series of excellent OR2012 write ups and reflections at the JISC Managing Research Data blog.

And we have also started to see publications based on the conference appearing. Steph Taylor has written about OR2012 for Ariadne (Issue 69) as an example to frame her advice on getting the most from a conference – it’s a super article and should prove handy for planning your trip to OR2013 on Prince Edward Island. OR2012 has also featured very prominently in the latest issue of Digital Repository Federation Monthly, which includes 10 Japanese attendees’ reports of the conference – huge thanks to @nish_ku for bringing this to our attention.

The Digital Repository Federation article is far from the only non-English write up we’ve had – so far we have spotted write ups of the conference in German, Finnish, Polish, more posts in Japanese and this fantastic series of images of the conference dinner from the Czech Klíštěcí šuplátko photo blog. We know our language skills can’t match up to the incredible diversity of languages spoken by OR2012 delegates so we would really like you to let us know if we’ve missed any of the write ups, reports, or reflections shared, particularly if they have been shared in another language.

As we have shared a number of write ups that draw on major conference themes it seems appropriate to close this post with the video of Peter Burnhill of EDINA delivering the closing session this year and wrapping everything up. It’s worth re-watching and, like all of the OR2012 videos, you can watch, share and comment on this on YouTube:

YouTube Preview Image

And finally….

We have several OR2012 conference bags left to give away. These are the perfect size for a laptop and papers, which makes them fantastic for meetings, but they are also great for looking stylish and well-travelled around the office or for transporting your craft kit to coffee shops and meet ups. We will be posting these remaining bags out with a few bonus edible Scottish treats, so comment here or tweet with #or2012bags quickly to secure one of our last three bags!

Where to find even more highlights…

  • Images can be found on Flickr, and highlights are gathered on our Pinterest board.
  • We have several gatherings of useful links which you can find on Delicious: write ups (blog posts, reports, etc.) of OR2012, useful resources shared in presentations and via Twitter, and OR2012 presentations.
  • Videos are on YouTube.
  • We have gathered tweets with Storify for browsing and exploring (please note this archive is updated once a week).
  • If you want to analyse or browse the text of all tweets you can access the full spreadsheet containing thousands of #OR2012 tweets on Google Docs. Please ignore colour codings – these are being used to remove unwanted content (tweets intended for other hashtags) and to ensure we capture all links to useful resources shared.
August 20, 2012. Posted at 1:31 pm in Updates.
Jul 25 2012

We’ve had some time to sort our videos out and get things organised. You may have already found our YouTube channel, but here’s a handful of useful links to get you browsing through what was said and done inside each session.

First, an apology. The recording for one of Tuesday’s talks, Research Data Management and Infrastructure (or P1A), didn’t end up working. Our AV team is still trying to salvage it, but for now we’re going to say there won’t be a video for that session. Fortunately, we’ve got a liveblog of P1A up, so you can refresh your memory on the subject or see what went on behind closed doors there. Also, session P3A on the same topic does have a video, which is embedded below.

YouTube Preview Image

Now on to the good stuff. We’re putting together playlists for each day of talks and for the Pecha Kucha sessions. We’ve also posted a bunch of new individual Pecha Kucha videos for your convenience. Check out the second RepoFringe Pecha Kucha session (RF5) below. If you just want to see the winner, Norman Grey’s first up.

YouTube Preview Image

At 65 uploaded videos and almost 2000 views so far, we think there’s something for pretty much all Open Repositories folk to enjoy!

July 25, 2012. Posted at 8:00 am in Updates.
Jul 18 2012

OR2012 has wrapped up, tweets are now just slowly fluttering in, and blog posts are popping up like new database entries in springtime. We wanted to gather together a sampling of the best stuff we’ve come across since last week and put it all in plain sight. We know you guys eat broken links and buried content for breakfast, but we figured this could be your pre-meal cup of coffee. …or something. Anyway, here’s what we’ve got.

Keita Bando was active throughout the conference. Here's a shot taken at the drinks and poster session. Click through to see the rest of Keita's lovely photos

Natasha Simons was one of our volunteer bloggers, and she did a fantastic job of it. Mixing summary, analysis, and flair into each post makes each and every one a pleasure to read. Here’s one on arriving in Edinburgh and hearing about the ‘Building a National Network’ workshop, one on conference day 2 (and haggis balls), and one with a sporran full of identifiers chat.

Rob Hilliker immortalized some of the software archiving workshop whiteboard notes for us. Linked to his Twitter post, which leads to a few more pictures and his epic stream of OR2012 tweets

Nick Sheppard, another of our volunteer bloggers, wrote up his reflections of the first two days of the conference on the train ride home. He was keen to write it, and you should be keen to read it. Trust us.

Owen Stephens put together some notes and commentary on repository services, and especially on ResourceSync for folks that are into that sort of thing.

We’re also pleased that discussing the Anthologizr project inspired an Edinburgh University MSc student to focus on that work for his e-Learning dissertation.

An amazing bit of #OR2012 activity analytics by Martin Hawksey using Carrot2. Click through for full details on how it was made.

The JISC MRD folks took superb notes about the session on institutional perspectives in research data management and infrastructure.

Brian Kelly weighed in on Cameron Neylon’s opening plenary and the significance of connectedness, with particular focus on social media platforms. His site is always worth a browse, so keep tabs on it. View the plenary below.
YouTube Preview Image

The DevCSI developer challenge was quite a lively segment of the conference, no matter which side of the mic you were on. Stuart Lewis drummed up excitement about the collaboration between developers and managers that the challenge aimed for this year, and the result was more than we could have hoped for. The number of submissions was higher than ever. Check out the competition show and tell and read about the winners.

A mockup of Clang! It was the runner-up project in the DevCSI developer challenge. Click through for a post about the idea

That’s what we’ve gathered so far, but it isn’t enough to do you all justice. That’s why we want you to comment, write in, tweet, and photograph everything you think we missed. We need slide decks, papers, pictures, and everything else. Speakers, if you haven’t passed on slides to session chairs, don’t be shy. And everybody else, drop us a line. We’ll be sure to include whatever you’ve got.

"Coder we can believe in." Click through for Adam Field's first tweet of the image

All this work isn’t just for the website. Everything we gather up will be going into a repository of open repository conference content. What can we say, we’re pretty single-minded when it comes to keeping it all open access for you lot. Get sending, and we’ll share more soon.

July 18, 2012. Posted at 11:02 am in Updates.
Jul 13 2012

As the conference draws to a close we wanted to thank all of you that came along or followed the event online, and we wanted to fill you in on what will be happening around the conference after the in-person part of Open Repositories 2012.

In the next few weeks we will be going through the over 4000 tweets and the fantastic photos, blog posts, presentations, conference materials and commentary that you have been producing throughout the conference and we’ll be summarising all that right here, linking to your blogs and reports, and highlighting where you can access all of the official conference content.

Here are eight ways to keep in touch:

  1. Fill in our survey – tell us what you liked, what we could have done better… we value all of your feedback on the event whether you were here in person or via reading our blogs, tweets, seeing videos etc:
  2. Stick with us on Twitter – we will continue sharing blog posts, updates, and conference-related news via the #or2012 tag and the @OpenRepos2012 account. And you should start following the new @ORConference Twitter account which will keep you in touch with Open Repositories throughout the year! Remember to reply, comment, retweet!
  3. Blog with us – we did our best to liveblog from the parallel strands but we would love to hear what you thought of these and other sessions – did you go to or run a fantastic workshop? Was there something incredibly useful from the user group you’d like to see shared more widely? We would love your contributions to the blog or to hear about where you’ve been writing up the event – just drop us an email or leave a comment here!
  4. Keep an eye on the OR2012 YouTube channel – you will find over 40 videos of the parallel sessions (excluding P1A unfortunately, our AV team have been unable to correct a corrupt file of that recording) there already and Pecha Kucha sessions will be appearing over the next few weeks.
  5. Share your pictures – if you haven’t already joined our Flickr group please do get in touch – we’d love to see more of your pictures of the event!
  6. Pin with us! – We have begun the process of gathering our favourite images and videos from OR2012 on Pinterest. We would love to add your highlights, your favourite parts of the event so do let us know what you’d like to see appear!
  7. Connect on CrowdVine! Now that you’ve had a chance to meet and chat it’s a great time to use the OR2012 CrowdVine to stay in touch, make further connections, and discuss your thoughts on the event. For instance there’s already a great thread on “highlights and things you’ll take home”.
  8. And finally… Look out for emails about Open Repositories 2013. If you’ve let us know your email address via the feedback form we’ll be in touch. You can also join the Open Repositories Google Group and stay in touch that way. Or you can simply drop us a note to and we’ll make sure we add you to our list for staying in touch.

We really enjoyed Open Repositories 2012 and really hope you did too!

July 13, 2012. Posted at 4:04 pm in Updates.
Jul 12 2012

Today we are liveblogging from the OR2012 conference at George Square Lecture Theatre (GSLT), George Square, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Kevin: I am delighted to introduce my colleague Peter Burnhill, Director of EDINA and Head of the Edinburgh University Data Library, who will be giving the conference summing up.
Peter: When I was asked to do this I realised I was doing the Clifford Lynch slot here! So… I am going to show you a Wordle. Our theme for this year’s conference was Local In for Global Out… I’m not sure if we did that but here is the summing up of all of the tweets from the event. Happily we see data, open, repositories and challenge are all prominent here. But data is the big arrival. Data is now mainstream. If we look back on previous events we’ve heard about services around repositories… we got a bit obsessed with research articles, in the UK because of the REF, but data is important and it is great to see it being prominent. And we see jiscmrd here so Simon will be pleased he did come on his crutches [he has broken his leg].
I have to confess that I haven’t been part of the organising committee but my colleagues have. We had over 460 register from over 40 different nations so do all go to PEI. Edinburgh is a beautiful city but when you got here it was rather damp; it’s nicer now – go see those things. Edinburgh is a bit of a repository itself – we have David Hume, Peter Higgs and Harry Potter to boast – and that fits with local in for global out as I’m sure you’ve heard of two of them. And I’d like to thank John Howard, chair of the OR Steering Committee, and our Host Organising Committee.
Our opening keynote Cameron Neylon talked about repositories beyond academic walls and the idea of using them for turning good research outputs into good research outcomes. We are motivated to make sure we have secure access to content… As part of a more general rumbling, with workshops before the formal start, there was this notion of disruption. Not only the Digital Economy but also a sense of not being passive about that. We need to take command of the scholarly communication area; that is our job. That was the call to action from Cameron and we should heed it.
And there was talk of citation… LinkedIn, etc. It is all about linking research back to data. And that means having reliable identifiers. And trust is a key part of that. Publishers have trust; if repositories are to step up to that trust level you have to be sure that when you access a repository you get what it says it is. As a researcher you don’t use data without knowing what it is and where it came from. The repository world needs to think about that notion of assurance, not quality assurance exactly. And that object may also be interrogable, to say what it is and really help you reproduce that object.
Preservation and provenance are also crucial.
Disaster recovery is also important. When you fail, and you will, you need to know how you cope; really interesting to see this picked up in a number of sessions too.
I won’t summarise everything but there were some themes…
We are beginning to deal with the idea of registries and how those can be leveraged for linking resources and identifiers. I don’t think solutions were found exactly but the conversations were very valuable. And we need to think about connectivity, as flagged by Cameron. And these places like Twitter and Facebook… we don’t own them but we need to be in them, to make sure that citations come back to us from there. And finally, we have been running a thing called Repository Fringe for the last four years, and then we won the big one. But we had a little trepidation as there are a lot of you! And we had an unconference strand. And I can say that UoE intends to do Repository Fringe in 2013.

We hope you enjoyed that unconference strand – an addition to complement Open Repositories, not to take away from it but to add an extra flavour. We hope that the PEI folk will keep a bit of that flavour at OR and we will be running the Fringe a wee bit later in the year, nearer the Edinburgh Fringe.

As I finish up I wanted to mention an organisation, IASSIST. Librarians used to be about the demand side of services but things have shifted over time. We would encourage those of us here to link up to groups like IASSIST (and we will suggest the same to them) so that we can find ways to connect up, to commune together at PEI and to share experience. And so finally I think this is about the notion of connectivity. We have the technology, we have the opportunity to connect up more to our colleagues!

And with that I shall finish up!

Begin with an apology….

We seem to have the builders in. We have a small event coming up… the biggest festival in the world… But we didn’t realise that the builders would move in about the same week as you… What you haven’t seen yet is our 60x40ft upside down purple cow… If you are here a bit longer you may see it! We hope you enjoyed your time nonetheless.

It’s a worrying thing hosting a conference like this… Like hosting a party you worry if anyone will show up. But the feedback seems to have been good and I have many thank yous. Firstly to all of those who reviewed papers. To our sponsors. To the staff here – catering, Edinburgh First, the tech staff. But particularly to my colleagues on the local Host Organising Committee: Stuart Macdonald, William Nixon, James Toon, Andrew Bevan – most persuasive committee member getting our sponsors on board, Sally Macgregor, Nicola Osborne who has led our social media activity, and to Florance Kennedy, who has been using her experience of wrangling 1000 developers at FLoC a few years ago.

The measure of success for any event like this is about the quality of conversation, of collaboration, of idea sharing, and that seems to have worked well and we’ve really enjoyed having you here. The conference doesn’t end now of course but changes shape… And so we move on to the user groups!

July 12, 2012. Posted at 11:33 am in LiveBlog, Updates.
Jul 12 2012

Today we are liveblogging from the OR2012 conference at George Square Lecture Theatre (GSLT), George Square, part of the University of Edinburgh. Find out more by looking at the full program.

Kevin Ashley is introducing us to this final session…

How many of you managed to get along to a Pecha Kucha Session? It looks like pretty much all of you, that’s fantastic! So you will have had a chance to see these fun super short presentations. Now as very few will have seen all of these we are awarding winners for each session. And I understand that the prizes are on their way to us but may not be at the podium when you come up. So… for the first session RF1, and in the spirit of the ceilidh, I believe it has gone to a pair: Theo Andrew and Peter Burnhill! For the second stream, strand RF3 it’s Peter Sefton – and Anna! For RF3 it’s Peter Van de Joss! And for RF4 it’s Norman Grey!

And now over to Mahendra Mahey for the Developer Challenge winners…

The Developer Challenge has been run by my project, DevCSI: Developer Community Supporting Innovation, and we are funded by JISC, which is funded by UK Government. The project’s aim is to highlight the potential, value and impact of the work developers do in UK universities in the area of technical innovation; this is through sharing experience and training each other, often on a volunteer basis. It’s about using technology in new ways, breaking out of silos. And running challenges… so on to the winners of the Developer Challenge at DevCSI this year.

The challenge this year was “to show us something new and cool in the use of repositories”. First of all I’d like to thank Alex Wade of Microsoft Research for sponsoring the Developer Challenge and he’ll be up presenting their special prize later. This year we really encouraged non-developers to get involved too, but also to chat and discuss those ideas with developers. We had 28 ideas, from splinter apps to repositories that blow bubbles to SWORD buttons… and a mini challenge appeared – Rob Sanderson from Los Alamos put out a mini idea! That’s still open for you to work on!

And so… the final decisions… We will award the prizes and redo the winning pitches! I’d like to also thank our judges (full list on DevCSI site) and our audience who voted!

First of all honourable mentions:

Mark McGillivray and Richard Jones – getting academics close to repositories or Getting Researchers SWORDable.

Ben O’Steen and Cameron Neylon – Is this research readable

And now the Microsoft Research Prize and also the runners up for the main prize as they are the same team.

Alex: What we really loved was you guys came here with an idea, you shared it, you changed it, you worked collaboratively on it and

Keith Gilmerton and Linda Newman for their mobile audio idea.

Alex: They win a .Net Gadgeteer rapid prototyping kit with motherboard, joystick, monitor, and if you talk to Julie Allison she’ll tell you how to make it blow bubbles!

Peter Sefton will award the main prize…

Peter: Patrick’s visualisation engine won as we’re sick of him entering the developer challenge.

The winners and runners up will share £1000 of Amazon Vouchers and the winning entry – the team of one – will be funded to develop the idea – 2 days development time. Patrick: I’m looking for collaborators and also an institution that may want to test it, so get in touch.

Linda and Keith first

Linda: In Ohio we have a network of DSpace repositories including the Digital Archive of Literacy Narratives – all written in real people’s voices and using audio files, so a better way to handle these would be a boon! We also have an Elliston Poetry Curator – he collects audio on analogue devices, digital would be better. And in the field we are increasingly using mobile technologies, and the ability to upload audio or video at the point of creation with a transcript would greatly increase the volume of contribution.

MATS – Mobile AudioVisual Transcription Service

Our idea is to create an app to deposit and transcribe audio – and also video – and we used SWORDShare, an idea from last year’s conference, as we weren’t hugely experienced in mobile development. We’ve done some mock ups here. You record, transcribe and submit all from your phone. But based on what we saw in last year’s app you should be able to record in any app as an alternative too. Transcription is hugely important as that makes your file indexable. And it provides access for those with hearing disabilities, and those that want to preview/read the file when listening isn’t an option. So when you have uploaded your file you request your transcription. You have two options. Default is Microsoft MAVIS – mechanical transcription. But you can also pick Amazon Mechanical Turk – human transcription, and you might want that if the audio quality was very poor or not in English.

MAVIS allows some additional functionality – subtitling, the ability to jump to a specific place in the file from a transcript etc. And a company called GreenButton offers a web services API to MAVIS. We think that even if your transcription isn’t finished you can still submit to the repository as the new version of SWORD supports updating. That’s our idea! We were pitching this idea but now we really want to build it! We want your ideas, feedback, tech skills, input!

And now Patrick McSweeney and DataEngine.

My friend Dave generated 1TB of data in every data run and the uni wouldn’t host that. We found a way to get that data down to 10 GB for visualisation. It was backed up on a home machine. It’s not a good preservation strategy. You should educate and inform people and build solutions that work for them!

See: State of the Onion. A problem you see all the time… most science is long tail, and support is very poor in that long tail. You have MATLAB and Excel and that’s about it. Dave had all this stuff, he had trouble managing his data and graphs. So the idea is to import data straight from Dave’s kit to the repository. For Dave the files were CSV. And many tools will export to it; it’s a super basic unit of data sharing – not exciting but it’s simple and scientists understand it.

So, at ingest you give your data provenance and you share your URIs, and you can share the tools you use. And then you have tools for merging and manipulation. The file is pushed into a storage form where you can run SQL processing. I implemented this in an EPrints repository – with 6 visualisations but you could add any number. You can go from source data, replay the experiment, and get to visualisations. Although rerunning experiments might be boring you can also reuse the workflow with new similar data. You can create a visualisation of that new data and compare it with your original visualisation and know that the process has been entirely the same.
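To make the "CSV in, SQL out" step above a little more concrete, here is a minimal sketch in Python. It is not Patrick's EPrints implementation, which the talk did not show; the table name, column names and sample readings are all invented for illustration, and SQLite simply stands in for whatever storage form DataEngine actually uses.

```python
import csv
import io
import sqlite3


def ingest_csv(conn, table, csv_text):
    """Load CSV text into a new SQLite table; the header row names the columns.

    Illustrative only: identifiers are interpolated into the SQL, so this
    is not safe for untrusted table or column names.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()
    return header


def mean_by_group(conn, table, group_col, value_col):
    """Run one repeatable SQL step: the mean of value_col per group_col."""
    rows = conn.execute(
        f'SELECT "{group_col}", AVG(CAST("{value_col}" AS REAL)) '
        f'FROM "{table}" GROUP BY "{group_col}" ORDER BY "{group_col}"'
    )
    return {group: mean for group, mean in rows}


# Hypothetical sample data standing in for an instrument's CSV export.
SAMPLE = "run,reading\nA,1.0\nA,3.0\nB,10.0\n"

conn = sqlite3.connect(":memory:")
columns = ingest_csv(conn, "experiment", SAMPLE)
summary = mean_by_group(conn, "experiment", "run", "reading")
print(summary)
```

Because the processing step is just a stored query, the same function can be pointed at a new table of similar data, which is the "reuse the workflow and compare" idea from the talk.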

It’s been a hectic two days. It’s a picture (of two bikers on a mountain) but it’s also a metaphor. There are mountains to climb. This idea is a transitional idea. There are semantic solutions, there are LHC type ideas that will appear eventually but there are scientists at the long tail that want support now!

And finally… thank you everyone! I meant what I said last night, all who presented yesterday I will buy a drink! Find me!

I think 28 ideas is brilliant! The environment was huge fun, the developers’ lounge was a lovely space to work in.

And finally a plug… I’ve got a session at 4pm in the EPrints track and that’s a real demonstration of why the Developer Challenge works as the EPrints Bazaar, now live, busy, changing how we (or at least I) think about repositories started out at one of these Developer Challenges!

At the dinner someone noted that there are very few girls! Half our user base are women but hardly any women presented at the challenge. Ladies, please represent.

And also… Dave Mills exists. It is not a joke! He reckons he generated 78 GB of data – not a lot, you could probably get it on a memory stick! Please let your researchers have that space centrally! I drink with researchers and you should too!

And Ben, Ben O’Steen had tech problems yesterday but he’s always here and is brilliant. is live right now, rate a DOI for whether its working.

And that’s all I have to say.

And now over to Prince Edward Island – Proud Host of OR 2013

I’m John Eade, CEO of DiscoveryGarden and this is Mark Leggot. So, the first question I get is where are you? Well we are in Canada! We are tiny but we are there. Other common questions…

Can I walk from one end of the island to the other? Not in a day! And you wouldn’t enjoy it if you did

How many people live there? 145,000, much more than it was

Do Jellyfish sting? We have some of the warmest waters so bring your swimsuit to OR2013!

Can you fly there? Yes! Direct from Toronto, Montreal, Halifax, Ottawa (via Air Canada and Westjet) and from New York City (via Delta). Book your flights early! And Air Canada will add flights if necessary!

We will work diligently to get things up on line as early as possible to make sure you can book travel as soon as possible.

Alternatively you can drive – you won’t be landlocked – we are connected to the mainland. Canada is connected to us. We have an 8 mile long bridge that took 2 and a half years to build and it’s 64 metres high – it’s the highest point in PEI and also the official rollercoaster!

We are a big tourism destination – agriculture, fishing, farming, software, aerospace, bioresources. We get 1 million tourists per year. That means we have way more things to do there than a place our size should – championship quality golf courses. Great restaurants and a culinary institute. We have live theatre and we are the home of Anne of Green Gables, that plucky redhead!

We may not have castles… but we have our own charms…!

Cue a short video…

Mark: free registration if you can tell me what the guy was doing?

Audience member: gathering oysters?

Mark: yes! See me later!

So come join us in Prince Edward Island. Drop by our booth in the Appleton Tower concourse for another chance to win free registration to next year’s event. We’ve had lots of support locally and this should be a great event!

July 12, 2012. Posted at 10:34 am in LiveBlog, Updates.
Jul 12 2012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Digital Preservation Network, Saving the Scholarly Record Together
Speaker(s): Michele Kimpton, Robin Ruggaber

Michele is CEO of DuraSpace. Myself and Robin are going to be talking about a new initiative in the US. This initiative wasn’t born out of grant funding but by university librarians and CIOs who wanted to think about making persistent access to scholarly materials and knew that something needed to be done at scale and now. Many of you will be well aware that libraries are being asked to preserve digital and born digital materials and there are not good solutions to do that at scale. Many of us have repositories in place. Typically there is an online or regular backup but these aren’t at preservation scale.

So about a year ago a group of us met to talk about how we might be able to approach this problem. And from this the Digital Preservation Network (DPN) was born. DPN is not just a technical architecture. It’s an approach that requires replication of the complete scholarly record across nodes with diverse architectures without single points of failure. It’s a federation. And it is a community allowing this to work at mass scale.

At the core of DPN are a number of replicated nodes. There are a minimum of three but up to five here. The role of the nodes is to have complete copies of content, full replicas of each of the other replicating nodes. This is a full content object store, not just a metadata node. And this model can work with multiple contributing nodes in different institutions – so those nodes replicate across architectures, geographic locations, institutions.

DPN Principle 1: Owned by the community

DPN Principle 2: Geographical diversity of nodes

DPN Principle 3: Diverse organisations – U of Michigan, Stanford, San Diego, Academic Preservation Trust, University of Virginia.

DPN Principle 4: Diverse Software Architectures – including iRODS, HathiTrust, Fedora Commons, Stanford Digital Library

DPN Principle 5: Diverse Political Environments – we’ve started in the US but the hope is to expand out to a more diverse global set of locations

So DPN will preserve scholarship for future generations, fund replicating nodes to ensure functional independence, audit and verify content, provide a legal framework for holding succession rights – so if a node goes down this means the content will not be lost. And we have a diverse governance group taking responsibility for specific areas.
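
As a rough illustration of what "audit and verify content" across replicating nodes could look like, here is a minimal sketch. It is not DPN's actual mechanism, which the talk does not describe; the node names, the use of SHA-256 and the majority-vote rule are all assumptions made for the example.

```python
import hashlib
from collections import Counter

MIN_REPLICAS = 3  # the talk mentions a minimum of three replicating nodes


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


def audit_object(copies: dict) -> dict:
    """Compare the copies of one object held by several nodes.

    `copies` maps a (hypothetical) node name to the bytes that node holds.
    The digest reported by most nodes is treated as the reference copy.
    """
    digests = {node: sha256_hex(blob) for node, blob in copies.items()}
    reference, agreeing = Counter(digests.values()).most_common(1)[0]
    suspect = sorted(node for node, d in digests.items() if d != reference)
    return {
        "agreeing": agreeing,
        "suspect_nodes": suspect,
        "meets_minimum": agreeing >= MIN_REPLICAS,
    }


report = audit_object({
    "node-a": b"scholarly record",
    "node-b": b"scholarly record",
    "node-c": b"scholarly record",
    "node-d": b"scholarly rec0rd",  # simulated corruption on one node
})
print(report)
```

In a real network each node would compute its digest locally and only the digests would be exchanged, so full copies never need to travel just for an audit.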

To date 54 partners and growing, about $1.5 million in funding – and this is not grant funding – and we now have a project manager in place.

Over to Robin…

Robin: Many of the partners in the APTrust have also been looking at DPN. APTrust is a consortium committed to creation and management of an aggregated preservation repository and, now that DPN is underway, to being a replicating node. APTrust was formed for reasons of community-building, economies of scale – things we could do together that we could not do alone, aggregated content, long term preservation, disaster recovery – particularly relevant given recent east coast storms.

The APTrust has several arms: Business and marketing strategy; governance policy and legal framework; preservation and collection framework; repository implementation plan – the technical side of APTrust and being a DPN node. So we had to bring together University librarians, technology liaisons, ingest/preservation. The APTrust services are the aggregation repository, the separate replicating node for DPN, and the access service – initially for administration but also thinking about more services for the future.

There’s been a lot of confusion as APTrust and DPN started emerging at about the same time. And we are doing work with DPN. So we tend to think of the explanation here being about Winnowing of Content with researchers’ repositories of files at the top, then local institutional repositories, then APTrust – preservation for our institutions that provides robustness for our content, and DPN is then for long term preservation. APTrust is preservation and access. DPN is about preservation only.

So the objectives of the initial phase of the APTrust are engaging partners, defining a sustainable business model, hiring a project director, building the aggregation repository and setting up our DPN node. We have an advisory group for the project looking at governance. The service implementation is a phased approach building on experience, leveraging open source – cloud storage, compute nodes, DuraCloud all come into play, economies of scale, TRAC – we are using as a guideline for architecture. APTrust will be sitting at the end of legacy workflows for ingest, it will take that data in, ingest to DuraCloud services, syncing to the Fedora aggregation repository, and anything for long term preservation will also move to the APTrust DPN Node with DuraCloud OS via CloudSync.

In terms of the interfaces there will be a single administrative interface which gives access to admin of DuraCloud, CloudSync and Fedora. Which will allow audit reports, functionality in each individual area etc. And that uses the API for each of those services. We will have a proof of that architecture at end of Q4 2012. Partners will feedback on that and we expect to deploy in 2013. Then we will be looking at disaster recovery access services, end-user access, format migration services – considered a difficult issue so very interesting, best practices for content types etc., coordinated collection development – across services, hosted repository services. Find out more at and


Q1) In Denmark we are building our national repository which is quite like DPN. Something in your presentation: it seems that everything is fully replicated to all nodes. In our organisation services that want to preserve something can enter a contract with another service and that’s an economic way to do things but it seems that this model is everything for everyone.

A1 – Michele) Right now the principle is everyone gets a copy of everything. We may eventually have specialist centres for video, or for books etc. Those will probably be primarily access services. We do have a diverse ecosystem – back ups across organisations in different ways. You can’t choose stuff in one or another node.

Q2) This looks a lot like LOCKSS – what is the main difference between DPN and a private LOCKSS network.

A2) LOCKSS is a technology for preservation but it’s a single architecture. It is great at what it does so it will probably be part of the nodes here – probably Stanford will use this. But part of the point is to have multiple architectural systems so that if there is an attack on one architecture just one component of the whole goes down.

Q3) I understand the goal is replication but what about format obsolescence – will there be format audit and conversion etc?

A3 – Michele) I think who does this stuff, format emulation, translation etc. has yet to be decided. That may be at node level not network level.

Topic: ISO 16363: Trustworthy Digital Repository Certification in Practice

Speaker(s): Matthew Kroll, David Minor, Bernie Reilly, Michael Witt

This is a panel session chaired by Michael Witt of Purdue University. This is about ISO 16363 and TRAC, the Trustworthy Repository Audit Checklist – how can a user trust that data is being stored correctly and securely, that it is what it says it is.

Matthew: I am a graduate research assistant working with Michael Witt at Purdue. I’ve been preparing the Purdue Research Repository (PURR) for TRAC. We are a progressive repository with online workspace and data sharing platform, to user archiving and access, to preservation needs of Purdue University graduates, researchers and staff. So for today I will introduce you to ISO 16363 – this is the user’s guide that we are using to prepare ourselves, I’ll give an example of trustworthiness. So a necessary and valid question to ask ourselves is “what is “trustworthiness” in this context?” – it’s a very vague concept and one that needs to grow as the digital preservation community and environment grows.

I’d like to offer 3 key qualities of trustworthiness (1) integrity, (2) sustainability, (3) support. And I think it’s important to map these across your organisations and across the three sections of ISO 16363. So, for example, integrity might be that the organisation has sufficient staff and funding to work effectively. Or for the repository it might be that you do fixity checks, procedures and practices to ensure successful migration or translation, similarly integrity in infrastructure may just be offsite backup. Similarly sustainability might be about staff training being adequate to meet changing demands. These are open to interpretation here but useful to think about.

In ISO 16363 there are 3 sections of criteria (109 criteria in all): (3) Organizational Infrastructure; (4) Digital Object management; (5) Infrastructure and Security Risk Management. There isn’t a one-to-one relationship in documentation here. One criterion might have multiple documents, a document might support multiple criteria.

Michael and I created a PURR Gap Analysis Tool – we graded ourselves and brought in experts from the organisation in the appropriate areas and we gave them a pop quiz. And we had an outsider read these things. This had great benefit – being prepared means you don’t overrate yourself. And secondly doing it this way – as PURR was developing and deploying our process here – we gained a real understanding of the digital environment.
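
The many-to-many relationship between criteria and documents lends itself to a simple gap check. The sketch below is not the PURR Gap Analysis Tool (the talk does not show how it works); the criterion identifiers and document names are made up for illustration, and the only idea it demonstrates is inverting a document-to-criteria map to find criteria with no supporting evidence.

```python
from collections import defaultdict

# Illustrative criterion identifiers only; the talk cites 109 criteria in all.
CRITERIA = ["3.1.1", "3.3.1", "4.1.1", "4.2.1", "5.1.1"]

# Hypothetical documents and the criteria each one is offered as evidence for.
DOCUMENTS = {
    "mission-statement.pdf": ["3.1.1"],
    "preservation-policy.pdf": ["3.1.1", "3.3.1", "4.1.1"],
    "backup-procedure.txt": ["5.1.1"],
}


def evidence_by_criterion(documents):
    """Invert the document -> criteria map into criterion -> documents."""
    index = defaultdict(list)
    for doc, supported in documents.items():
        for criterion in supported:
            index[criterion].append(doc)
    return index


def find_gaps(criteria, documents):
    """Return the criteria that no document currently supports."""
    index = evidence_by_criterion(documents)
    return [c for c in criteria if not index.get(c)]


gaps = find_gaps(CRITERIA, DOCUMENTS)
evidence = evidence_by_criterion(DOCUMENTS)
print("Criteria without evidence:", gaps)
```

One document can count towards several criteria and one criterion can draw on several documents, which is why the index is built from the documents rather than stored per criterion.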

David Minor, Chronopolis Program Manager, UC San Diego Libraries and San Diego Supercomputer Center: We completed the TRAC process this April. We did it through the CDL organisation. We wanted to give you an overview of what we did, what we learnt. So a bit about us first. Chronopolis is a digital preservation network based on geographic replication – UCSD/SDSC, NCAR, UMIACS. We were initially funded via the Library of Congress NDIIPP program. We spun out into a different type of organisation recently, a fee-for-service. Our management and finances are via UCSD. All nodes are independent entities here – interesting questions arise from this for auditors.

So, why do TRAC? Well we wanted to do validation of our work – and this was a last step in our NDIIPP process – an important follow on for development. We wanted to learn about gaps, things we could do better. We wanted to hear what others in the community had to say – not just those we had worked for and served but others. And finally it sounds cynical but it was to bring in more business – to let us get out there and show what we could do particularly as we moved into fee-for-service mode.

The process logistics were that we began in Summer 2010 and finished Winter 2011. We were a slightly different model. We were a self-audit that then went to auditors to follow up, ask questions, speak to customers. The auditors were three people who did a site visit. It’s a closed process except for that visit though. We had management, finances, metadata librarians, and data centre managers – security, system admin etc all involved – equivalent of 3 FTE. We had discussed with users and customers. In the end we had hundreds of pages of documentation – some written by us, some log files etc.

Comments and issues raised by auditors were that we were strong on technology (we expected this as we’d been funded for that purpose) and spent time commenting on connections with participant data centres. They found we were less strong on business plan – we had good data on costs and plans but needed better projections for future adoption. And we had discussion of preservation actions – auditors asked if we were even doing preservation and what that might mean.

Our next steps and future plans based on this experience have been to implement recommendations, working to better identify new users and communities, improve working with other networks. How do changes impact audit – we will “re-audit” in 18-24 months – what if we change technologies? What if management changes? And finally we definitely have had people getting in touch specifically because of knowing we have been through TRAC. All of our audit and self-audit materials are on the web too so do take a look.

Bernie from the Centre for Research Libraries Global Resources Network: We do audits and certification of key repositories. We are one of the publishers of the TRAC checklist. We are a publisher not an author so I can say that it is a brilliant document! We also participated in development of recent ISO standard 16363. So, where do we get the standing to do audits, certification and involvement in standards. Well we are a specialist centre in

We started with U of Chicago, Northwestern etc., established in 1949. We are a group of 167 universities in US, Canada and Hong Kong and we are about preserving key research information for humanities and social science. Almost all of our funding comes from the research community – also where our stakeholders and governance sit. And the CRL Certification program has the goal to support advanced research. We do audits of repositories and we do analysis and evaluations. We take part in information sharing and best practice. We aim to do landscape studies – recently been working on digital protest and documentation.

We have audited Portico and Chronopolis, and are currently looking at PURR and PTAB test audits. The process is much as described by my colleagues. The repository self-audits, then we request documentation, then a site visit, then report is shared via the web. In the future we will be doing TRAC certification alongside ISO 16363 and we will really focus on Humanities and social science data. We continue to have the same mission as when we were founded in 1949, to enable the resilience and durability of research information.


Q1 Askar, State University of Denmark) The finance and sustainability for organisations in TRAC… it seemed to be predicated on a single repository and that being the only mission. But national archives are more “too big to fail”. Questioning long term funding is almost insulting to managers…

A1) Certification is not just pass/fail. It’s about identifying potential weakness, flaws, points of failure for a repository. So for a national library they are too big to fail perhaps but the structure and support for the organisation may impact the future of the repository – cost volatility, decisions made over management and scope of content preserved. So for a national institution we look at finance for that – is it a line item in national budget. And that comes out in the order, the factors governing future developments and sustainability.

Topic: Stewardship and Long Term Preservation of Earth Science Data by the ESIP Federation
Speaker(s): Nancy J. Hoebelheinrich

I am principal of knowledge management at Knowledge Motifs in California. And I want to talk to you about preservation of earth science data by ESIP – Earth Science Information Partners. My background is in repositories and metadata and I am relatively new to earth sciences data and there are interesting similarities. We are also keen to build synergies with others so I thought it would be interesting to talk about this today.

The ESIP Federation is a knowledge network for science data and technology practitioners – people who are building components for a science data infrastructure. It is distributed geographically, in terms of topic, interest. It’s about a community effort, free flowing ideas in a collaborative environment. It’s a membership organisation but you do not have to be a member to participate. It was started by NASA to support Earth Observation data work. The idea was to not just rely on NASA for environmental research data. They are interested in research, application in education etc. The areas of interest include climate, ecology, hydrometry, carbon management, etc. Members are of four types: Type 4 are large organisations and sponsors including NOAA and NASA. Type 1 are data centres – similar to libraries but considered separate. Type 2 are researchers and Type 3 are Application developers. There is a real cross sectoral grouping so really interesting discussion arises.

The type of things the group is working on are often in data informatics and data science. I’ll talk in more detail in a second but it’s important to note that organisations are cross functional as well – different stakeholders/focuses in each. We coordinate the community via In Person Meetings, ESIP Commons, Telecons/WebEx, Clusters, Working Groups and Committees and these all feed into making us interoperable. We are particularly focused on Professional development, outreach and collaboration. We have a number of active groups, committees and clusters.

Our data and informatics area is about collaborative activities in data preservation and stewardship, semantic web, etc. Data preservation and stewardship is very much about stewardship principles, citation guidelines, provenance context and content standards, and linked data principles. Our Data Stewardship Principles are for data creators, intermediaries and data users. So this is about data management plans, open exchange of data, metadata and progress etc. Our data citation guidelines were accepted by the ESIP Membership Assembly in January 2012. These are based on existing best practice from International Polar Year citation guidelines. And this ties into geospatial data standards and these will be used by tools like the new Thomson Reuters Data Citation Index.

Our Provenance, Context and Content Standard is about thinking about the data you need about a data set to make it preservable into the long term. So this is about what you would want to collect and how you would collect that. Initially based on content from NASA and NOAA and discussions associated to them. It was developed and shared via the ESIP wiki. The initial version was in March 2011. The latest version is June 2011 but this will be updated regularly. The categories are focused mostly on satellite remote sensing – preflight/preoperational instrument descriptions etc. And these are based on Use cases – based on NASA work from 1998. What has happened as a result of that work is that NASA has come up with a specification for their data for earth sciences. They make a distinction between documentation and metadata, a bit differently from some others. Categories here in 8 areas – many technical but also rationale. And these categories help set baseline etc.

Another project we are working on is Identifiers for data objects. There was an abstract research project on use cases – unique identification, unique location, citable location, scientifically unique identification. They came up with categories and characteristics and questions to ask of each ID scheme. The recommended IDs ended up being DOI for a citable locator and UUID for unique identifier but we wanted to test this. We are in the process of looking at this at the moment. Questions and results will be compared again.
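
To make the split between the two recommended identifier types concrete, here is a small sketch. It is an illustration only, not ESIP's test setup: the `DataObject` class, the dataset title and the DOI string are invented for the example. The UUID comes from Python's standard `uuid` module and the locator uses the public doi.org resolver prefix.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataObject:
    """A data set carrying both kinds of identifier (hypothetical model)."""
    title: str
    # Unique identifier: generated locally, no registry needed.
    object_id: uuid.UUID = field(default_factory=uuid.uuid4)
    # Citable locator: assigned by a registration agency, so optional here.
    doi: Optional[str] = None

    def citable_url(self) -> Optional[str]:
        """Return a resolvable link for citation, if a DOI has been assigned."""
        if self.doi is None:
            return None
        return "https://doi.org/" + self.doi


# The DOI below is a made-up example value, not a registered identifier.
granule = DataObject("Sea surface temperature, 2011", doi="10.1234/example.sst.2011")
draft = DataObject("Unpublished working copy")

print(granule.object_id, granule.citable_url())
print(draft.object_id, draft.citable_url())
```

The point of keeping both is that the UUID exists from the moment the object is created, while the DOI only appears once someone decides the object should be citable.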

And one more thing the group is doing is Semantic Web Cluster Activities – they are creating different ontologies for specific areas such as SWEET – an ontology for environmental data. And there are services built on top of those ontologies (Data Quality Screening Service on weather and climate data from space (AIRE) for instance) – both are available online. Lots of applications for this sort of data.

And finally we do education and outreach – data management training short courses. Given that it’s important that researchers know how to manage their data we have come up with short training courses based on the Khan Academy model. That is being authored and developed by volunteers at the moment.

And we have associated activities and organisations – DataOne, DataConservancy, NSF’s Earth Cube. If you are interested to work with ESIP please get in touch. If you want to join our meeting in Madison in 2 weeks time there’s still time/room!


Q1 – Tom Kramer) It seems like ESIP is an eresearch community really – is there a move towards mega nodes or repository or is it still the Wild West?

A1) It’s still a bit like the Wild West! Lots going on but less focus on distribution and preservation, the focus is much more about making sure data is ingested and made available – where the repositories community was a few years ago. ESIP is interested in the data being open but not all scientists agree about that, so again maybe at the same point as this community a few years ago.

Q2 – Tom) So how do we get more ESIP folk without a background in libraries to OR2012?

A2) Well I’ll share my slides, we probably all know people in this area. I know there are organisations like EDINA here etc.

Q3) [didn't hear]

A3) EarthCube area to talk about making data available. A lot of those issues are being discussed. They are working out the common standard OGC, ISO, sharing ontologies but not necessarily preservation behind repositories. It’s sort of data centre by data centre.

Topic: Preservation in the Cloud: Three Ways (A DuraSpace Moderated Panel)
Speaker(s): Richard Rodgers, Mark Leggott, Simon Waddington, Michele Kimpton, Carissa Smith

Michele: DuraCloud was developed in the last few years. It’s a software but also a SAAS (Software As A Service) service. So we are going to talk about different usage etc.

Richard Rodgers, MIT: We at MIT libraries participated in several pilot processes in which DuraCloud was defined and refined. The use case here was to establish a geo distributed replication of the repository. We had content in our IR that was very heterogeneous in type. We wanted to ensure system administration practices only address HW or admin failures – other errors unsecured. Service should be automatic yet visible. We developed a set of DSpace tools geared towards collection and administration. DuraCloud provided a single convenient point of service interoperation. Basically it gives you an abstraction over multiple backend services. That’s great for applications and protects against lock-in. Tools and APIs for DSpace integration. High bandwidth access to developers. Platform for preservation system and institution-friendly service terms.

Challenges and solutions here… It’s not clear how the repository system should create and manage the files yourself. Do all aspects need to have correlated archival units? So we decided to use AIPs – units of replication which package items together, they gather loose files. There is repository manager involvement – admin UI, integration, batch tools. There is an issue of scale – big data files really don’t suit interactivity in the cloud, replication can be slow, queued not synchronous. And we had to design a system where any local error wouldn’t be replicated (e.g. deletion locally isn’t repeated in replication version). However deletion is forever – you can remove content. The code we did for the pilot has been refined somewhat and is available for DSpace as an add on – we think it’s fairly widely used in the DSpace community.

Mark Leggott, University of PEI/DiscoveryGarden: I would echo the complicated issues you need to consider here. We had the same experience in terms of very responsive process with DuraSpace team. Just a quick bit of info on Islandora. It is a Drupal + Fedora framework from UPEI. Flexible UI and apps etc. We think of DuraCloud as a natural extension of what we do. The approach we have is to leverage DuraCloud and CloudSync. The idea is to maintain the context of individual objects and/or complete collections. To enable a single button restore of damaged edits. And it integrates with standard or private DC. We have an initial release coming. There is a new component in the Manage tab in the Admin panel called “Vault”. It provides full access to DuraCloud and CloudSync services. It’s accessible through the Islandora Admin Panel – you can manage settings. You can integrate it with your DuraSpace enabled service. Or you can do this via DiscoveryGarden where we manage DuraCloud on the client’s behalf. And in managing your materials you can access or restore at an item or collection level. You can sync to DuraCloud or restore from the cloud etc. You get reports on syncing etc. And reports on matches or mismatches so that you can restore data from the cloud as needed. And you can then manually check the object.

Our next steps are to provide tighter integration and more UI functions, to move to automated recovery, to enable full Fedora/Collection restore, and to include support for private DuraCloud instances.

Simon: I will be talking about the Kindura project funded by JISC which was a KCL, STFC and ? initiative. The problem is that storage of research outputs (data, documents) is quite ad hoc but it’s a changing language and UK funders can now require data to be kept for 10 years+ so it’s important. We were looking at hybrid cloud solutions – commercial cloud is very elastic, rapid deployments, transparent cost, but risky in terms of data sensitivity, data protection law, service availability and loss. Combining in house storage and cloud storage seems like the best way to gain the benefits but mitigate risks.

So Kindura was a proof of concept repository for research data combining commercial cloud and internal storage (iRODS). Based on Fedora Commons. DuraCloud provides a common storage interface and we deployed from source code – we found Windows was best for this and have created some guidelines on this sort of set up. And we developed a storage management framework based on policies, legal and technical constraints as well as cost (including cost of transmitting data in/out of storage). We tried to implement something as flexible as possible. We wanted automated decisions for storage and migration. Content replication across storage providers for resilience. Storage providers transparent to users.

The Kindura system is based on our Fedora Repository feeding Azure, iRODS and Castor (another use case for researchers to migrate to cheaper tape storage) as well as AWS and Rackspace, it also feeds DuraCloud. The repository is populated via web browser depositing into the management server and down into Fedora Repository AND DuraCloud.
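
A policy- and cost-driven storage decision of the kind described here can be sketched in a few lines. This is not Kindura's framework or its rules engine; the provider names, prices and the single "sensitive data stays in house" rule are invented for the example, and the cost model (storage over time plus one upload and one retrieval) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    internal: bool        # True for in-house storage
    store_cost: float     # cost per GB per month (made-up units)
    transfer_cost: float  # cost per GB moved in or out


# Hypothetical providers and prices, for illustration only.
PROVIDERS = [
    Provider("in-house-disk", True, 0.10, 0.00),
    Provider("in-house-tape", True, 0.02, 0.00),
    Provider("cloud-a", False, 0.05, 0.08),
    Provider("cloud-b", False, 0.04, 0.12),
]


def estimated_cost(provider, size_gb, months):
    """Storage cost over the period plus one upload and one later retrieval."""
    return size_gb * (provider.store_cost * months + 2 * provider.transfer_cost)


def choose_storage(size_gb, months, sensitive, copies=2):
    """Pick the cheapest providers allowed by policy, one per replica."""
    # Policy constraint: sensitive data may only go to internal storage.
    allowed = [p for p in PROVIDERS if p.internal or not sensitive]
    if len(allowed) < copies:
        raise ValueError("not enough permitted providers for the requested copies")
    ranked = sorted(allowed, key=lambda p: estimated_cost(p, size_gb, months))
    return [p.name for p in ranked[:copies]]


print(choose_storage(100, 12, sensitive=True))
print(choose_storage(100, 12, sensitive=False))
```

Because transfer cost is part of the estimate, a provider with cheap storage but expensive egress can lose out to one that looks pricier per GB.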


Q1) For Richard – you were talking about deletion and how to deal with them
A1 – Richard) There are a couple of ways to gather logically deleted items. So you can automate based on a policy for garbage collection – e.g. anything deleted and not restored within a year. But you can also manually delete (you have to do it twice but you can’t mitigate against that).
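
The garbage-collection policy mentioned in this answer could be expressed roughly as follows. This is a sketch of the idea only, not the DSpace add-on's code; the item structure and field names are assumptions, and "a year" is taken as 365 days.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=365)  # "not restored within a year"


def purge_candidates(items, now):
    """Return ids of items that were logically deleted over a year ago.

    Each item is a dict with an 'id' and a 'deleted_at' timestamp, where
    'deleted_at' is None for items that are live or have been restored.
    """
    return [
        item["id"]
        for item in items
        if item["deleted_at"] is not None and now - item["deleted_at"] > RETENTION
    ]


# Hypothetical replica state on the day of the talk.
replica = [
    {"id": "item-a", "deleted_at": datetime(2011, 1, 1)},  # deleted long ago
    {"id": "item-b", "deleted_at": datetime(2012, 6, 1)},  # still restorable
    {"id": "item-c", "deleted_at": None},                  # live item
]
to_purge = purge_candidates(replica, now=datetime(2012, 7, 12))
print(to_purge)
```

Keeping the logical deletion separate from the physical purge is what lets a mistaken local delete be undone before it becomes permanent.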

Q2) Simon, I had a question. You integrated a rules engine and that’s quite interesting. It seems that Rules probably adds some significant flexibility.

A2 – Simon) We actually evaluated several different sorts of rules engines. Jules is easy, open source and for this set up it seemed quite logical to do this. It sits totally separate to DuraCloud set up at the moment but it seemed like a logical extension


 July 12, 2012  Posted by at 8:02 am LiveBlog, Updates Tagged with:  1 Response »
Jul 122012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Eating your own dog food: Building a repository with API-driven development
Speaker(s): Nick John Jackson, Joss Luke Winn

The team decided they wanted to build a wholly new RDM, with research data as a focus for the sake of building the best tool for that job. This repository was also designed to store data during research, not just after.

Old repositories work very well, but they assume the entry of a whole file (or a pointer), only retrievable in bulk and in oddly organized pieces. They have generally limited interface methods and capacities. These old repositories also focus on formats, not form (structure and content) unless there is fantastic metadata.

The team wanted to do something different, and built a great backend first. They were prepared to deal with raw data as raw data. The API was built first, not the UI. APIs are the important bit. And those APIs need to be built in a way that people will want to use them.

This is where eating your own dog food comes in. The team used their own API to build the frontend of the system, and used their own documentation. Everything had to be done well because it was all used in house. Then, they pushed it out to some great users, and made them do what they wanted to do with the ‘minimum viable product’. It works, and you build from there.

Traditional repos have a database, application, users. They might tack an API on at the end for manual and bulk control, but it doesn’t even include all of the functionality of the website usually. That or you screen scrape, and that’s rough work. Instead, this repository builds an API and then interacts with that via the website.

Research tends to happen on a subset of any given data set, nobody wants that whole data set. So forget the containers that hold it all. Give researchers shared, easily usable databases. APIs put stuff in and out automatically.

This was also made extensible from day one. Extensible and writeable by everybody to the very core. The team also encourages re-usable modularity. People do the same things to their data over and over – just share that bit of functionality at a low data level. And they rely on things to do things to get things done – in other words, there’s no sense in replicating other people’s work if it’s done well.

The team ended up building better stuff because it uses its own work – if it doesn’t do what it’s meant to, it annoys them and they have to fix it. All functionality is exposed so they can get their work done quick and easy. Consistent and clean error handling were baked in for the sake of their own sanity, but also for everybody else. Once it’s all good and easy for them, it will be easy for 3rd parties to use, whether or not they have a degree in repo magic. And security is forcibly implemented across the board. API-level authentication means that everything is safe and sound.
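
The pattern described here (frontend built only on the public API, authentication at the API level, one consistent error shape) can be shown with a toy example. This is not the team's code; the `Api` class, the `/rows` path, the token check and the response format are all invented for the sketch, and a real service would sit behind HTTP rather than a method call.

```python
class Api:
    """A toy in-process API standing in for an HTTP service (hypothetical)."""

    def __init__(self, tokens):
        self._tokens = set(tokens)
        self._rows = []

    def handle(self, method, path, token, body=None):
        # Authentication is enforced here, so every client goes through it.
        if token not in self._tokens:
            return self._error(401, "invalid or missing token")
        if path != "/rows":
            return self._error(404, "unknown path: " + path)
        if method == "GET":
            return {"status": 200, "data": list(self._rows)}
        if method == "POST":
            if not isinstance(body, dict):
                return self._error(400, "body must be an object")
            self._rows.append(body)
            return {"status": 201, "data": body}
        return self._error(405, "unsupported method: " + method)

    @staticmethod
    def _error(status, message):
        # Every failure uses the same shape, which keeps client code simple.
        return {"status": status, "error": message}


def render_row_count(api, token):
    """'Frontend' code: it has no private back door, only the API."""
    response = api.handle("GET", "/rows", token)
    if "error" in response:
        return "Error " + str(response["status"]) + ": " + response["error"]
    return str(len(response["data"])) + " row(s) stored"


api = Api({"secret"})
api.handle("POST", "/rows", "secret", {"temperature": 21.5})
print(render_row_count(api, "secret"))
print(render_row_count(api, "wrong-token"))
```

Since the frontend hits the same checks and error messages as any third party, rough edges in the API show up to its own developers first.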

Improved visibility is another component. Database querying is very robust, and saves the users the trouble of hunting. Quantitative information is quick and easy because the API gives open access to all the data.

This can be scalable horizontally, to as many servers as needed. It doesn’t use server states.

There are some problems involved in eating your own dog food. It takes time to design a decent API first. You also end up doubling up some development, particularly for frontend post-API development. APIs also add overhead. But after some rejigging, it all works with thousands of points per second, and it’s humming nicely.

Q: Current challenges?

A: Resourcing the thing. Lots of cutting edge technology and dependence on cloud architecture. Even with money and demand, IT infrastructure isn’t keeping up just yet.

Q: How are you looking after external users? Is there a more discoverable way to use this thing?

A: The closest thing we have is continuous integration to build the API at multiple levels. A discovery description could be implemented.

Q: Can you talk about scalability? Limitations?

A: Researchers will sometimes not know how to store what they’ve got. They might put pieces of data on their own individual rows when they don’t need to be. That brings us closer to our limit. Scaling up is possible, and doing it beyond limits is possible, but it requires a server-understood format.

Q: Were there issues with developers changing schemas mysteriously? Is that a danger with MongoDB?

A: We manage that by using our own documentation, and forcing ourselves to look at it when building and questioning. We’ve got a standard object with tracking fields, and if a researcher starts to get adventurous with schemas it’s then on them.
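
The "standard object with tracking fields" idea can be sketched in a few lines of Python. This is only a guess at the shape: the field names (`created_by`, `created_at`, `schema_version`, `data`) and the function name are assumptions for illustration, not the team's actual schema.

```python
from datetime import datetime, timezone

# Hypothetical tracking fields that every stored document carries.
TRACKING_FIELDS = ("created_by", "created_at", "schema_version")


def wrap_record(payload, user, schema_version=1):
    """Wrap free-form researcher data in a standard envelope."""
    return {
        "created_by": user,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "schema_version": schema_version,
        # Researcher-controlled part: its internal schema is their responsibility.
        "data": dict(payload),
    }
```

Keeping the researcher's payload under its own key means the tracking fields stay consistent however adventurous the payload becomes.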


Topic: Where does it go from here? The place of software in digital repositories
Speaker(s): Neil Chue Hong

Going to talk about the way that developers of software are getting overlapping concerns with the repository community. This isn’t software for implementing infrastructure, but software that will be stored in that infrastructure.

Software is pervasive in research now. It is in all elements of research.

The Software Sustainability Institute does a number of things at strategic and tactical levels to help create best practices in research software development.

One question is the role of software in the longer term – five and ten years on? The differences between preservation and sustainability. The former holds onto things for use later on, while the latter keeps understanding in a particular domain. The understanding, the sustainability, is the more important part here.

Several purposes for sustaining and preserving software. For achieving legal compliance (architecture models ought to be kept for the life of a building). For creating heritage value (gaining an overall understanding of influences of a creator). For continued access to data (looking back, through the lens of the software). For software reuse (funders like this one).

There are several approaches. Preserving the technology, whether it’s physical hardware or an emulated environment. Migration from one piece of software to another over time while ensuring functionality, or transitioning to something that does similar. There’s also hibernation, just making sure it can be picked apart some day if need be.

Computational science itself needs to be studied to do a good job of this. Software Carpentry teaches scientists basic programming to improve their science. One thing, using repositories, is an important skill. Teaching scientists the exploratory process of hacking together code is the fun part, so they should get to do it.

Re-something is the new black. Reuse, review, replay, rerun, repair. But also reward. How can people be rewarded for good software contributions, the ones that other people end up using. People get pats on the back, glowing blog posts, but really reward in software is in its infancy. That’s where repositories come in.

Rewarding good development often requires publication which requires mention of the developments. That ends up requiring a scientific breakthrough, not a developmental one. Software development is a big part of science and it should be viewed/treated as such.

Software is just data, sure, but along with the Beyond Impact team these guys have been looking at software in terms of preservation beyond just data. What needs to get kept in software and development? Workflows should, because they show the boundaries of using software in a study – the dependencies and outputs of the code. Looking at code on various levels is also important. On the library/software/suite level? The program or algorithm or function level. That decision is huge. The granularity of software needs to be considered.

Versioning is another question. It indicates change, allows sharing of software, and confers some sort of status. Which versions should go in which repositories, though? That decision is based on backup (GitHub), sharing (DRYAD), archiving (DSpace). Different repositories do each.

One of the things being looked at in sustaining software is software metapapers. These are scholarly records including ‘standard’ publication, method, dataset and models, and software. This enables replay, reproduction, and reuse. It’s a pragmatic approach that bundles everything together, and peer review can scrutinize the metadata, not the software.

The Journal of Open Research Software allows for the submission of software metapapers. This leads to where the overlap in development and repositories occurred, and where it’s going.

The potential for confusion occurs when users are brought in and licensing occurs. It’s not CC BY, it’s OSI standard software licenses.

Researchers are developing more software than ever, and trying to do it better. They want to be rewarded for creating a complete scholarly record, which includes software. Infrastructure needs to enable that. And we still don’t know the best way to shift from one repository role to another when it comes to software – software repositories from backup to sharing to archival. The pieces between them need to be explored more.

Q: The inconsistency of licensing between software and data might create problems. Can you talk about that?

A: There is work being done on this, on licensing different parts of the scholarly record. Reward mechanisms and compatibility of licenses in data and software need to be explored – which ones are the same in spirit?


Topic: The UCLA Broadcast News Archive Makes News: A Transformative Approach to Using the News in Teaching, Research, and Publication
Speaker(s): Todd Grappone, Sharon Farb

UCLA has been developing an archive since the Watergate hearings. It was a series of broadcast television recordings for a while, but now it’s digital libraries of broadcast recordings. That content is being put into a searchable, browsable interface. It will be publicly available next year. It grows about a terabyte a month (150000+ programs and counting), which pushes the scope of infrastructure and legality.

It’s possible to do program-level metadata search. Facial recognition, OCR of text on screen, closed caption text, all searchable. And almost 10 billion images. This is a new way for the library to collect the news since papers are dying.

Why is this important? It’s about the mission of the university copyright department: public good, free expression, and the exchange of ideas. That’s critical to teaching and learning. The archive is a great way to fulfill that mission. This is quite different from the ideas of other Los Angeles organizations, the MPAA and RIAA.

The mission of higher education in general is about four principles. The advancement of knowledge through research, through teaching, and of preservation and diffusion of that knowledge.

About 100 news stations being captured so far. Primarily American. International collaborators are helping, too. Pulling all broadcast, under a schedule scheme with data. It’s encoded and analyzed, then pushed to low-latency storage in H.264 (250MB/hr). Metadata is captured automatically (timestamp, show, broadcast ID, duration, and full search by closed captioning). The user interface allows search and browse.
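
The figures quoted in this talk (250MB per hour of H.264 video, growth of about a terabyte a month, about 100 stations) can be put side by side with some quick arithmetic. The sketch below assumes 1 TB = 1,000,000 MB, a 30-day month, and that all of the monthly growth is encoded video; none of those assumptions were stated by the speakers, and the figures may well refer to different points in the archive's history.

```python
MB_PER_HOUR = 250                 # H.264 rate quoted in the talk
GROWTH_MB_PER_MONTH = 1_000_000   # "about a terabyte a month", assuming 1 TB = 10^6 MB
STATIONS = 100                    # "about 100 news stations"
DAYS_PER_MONTH = 30               # assumption

hours_per_month = GROWTH_MB_PER_MONTH / MB_PER_HOUR
hours_per_day = hours_per_month / DAYS_PER_MONTH
hours_per_station_per_day = hours_per_day / STATIONS

print(hours_per_month, round(hours_per_day, 1), round(hours_per_station_per_day, 2))
```

Under those assumptions the growth corresponds to roughly 4,000 hours of video a month, which suggests selected programmes per station rather than round-the-clock capture.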

So, what is news? Definitions are really broad. Novelties, information, and a whole lot of other stuff. The scope of the project is equally broad. That means Comedy Central is in there – it’s part of the news record. Other people doing this work are getting no context, little metadata, fewer broadcasts. And it’s a big legal snafu that is slowly untangling.

Fortunately, this is more than just capturing the news. There’s lots of metadata – transformative levels of information. Higher education and libraries need these archives for the sake of knowledge and preservation.

Q: Contextual metadata is so hard to find, and knowing how to search is hard. How about explore? How about triangulating with textual news via that metadata you do have?

A: We’re pulling in everything we can. Some of the publishing from these archives uses almost literally everything (court cases, Twitter, police data, CCTV, etc). We’re excited to bring it all together, and this linkage and exploration is the next thing.

Q: In terms of tech. development, how has this archive reflected trends in the moving image domain? Are you sharing and collaborating with the community?

A: An on-staff archivist is doing just that, but so far this is just for UCLA. It’s all standards-driven so far, and community discussion is the next step.


Topic: Variations on Video: Collaborating toward a robust, open system to provide access to library media collections
Speaker(s): Mark Notess, Jon W. Dunn, Claire Stewart

This project has roots in a project called Variations in 1996. It’s now in use at 20 different institutions, three versions. Variations on Video is a fresh start, coming from a background in media development. Everything is open source, working with existing technologies, and hopefully engaging with a very broad base of users and developers.

The needs that Variations on Video is trying to meet are archival preservation and access for all sorts of uses. Existing repositories aren’t designed for time-based media. Storage, streaming, transcoding, access and media control, and structure all need to be handled in new ways. Access control needs to be pretty sophisticated for copyright and sensitivity issues.

Existing solutions have been an insufficient fit. Variations on Video offers basic functionality that goes beyond them or does them better. File upload, transcoding, and descriptive metadata will let the repository stay clean. Navigation and structural metadata will allow users to find and actually use it all.

VoV is built on a Hydra framework, Opencast Matterhorn, and a streaming server that can serve up content to all sorts of devices.

PBCore was chosen for descriptive metadata, with an ‘Atomic’ content model: parent objects for intellectual descriptions, child objects for master files, children of these for derivatives. There’s ongoing investigation for annotation schemes.
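
The three-level ‘Atomic’ content model described above (intellectual object, master files, derivatives) maps naturally onto nested records. The sketch below is purely illustrative: the class and field names are invented here and do not reflect the project's actual Fedora or PBCore structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Derivative:
    filename: str
    media_format: str  # e.g. a streaming or mobile-friendly encoding


@dataclass
class MasterFile:
    filename: str
    derivatives: List[Derivative] = field(default_factory=list)


@dataclass
class MediaObject:
    title: str  # stands in for the full PBCore descriptive record
    masters: List[MasterFile] = field(default_factory=list)

    def all_derivatives(self):
        """Flatten the tree: every derivative of every master file."""
        return [d for m in self.masters for d in m.derivatives]
```

For example, `MediaObject("Interview", [MasterFile("m.mov", [Derivative("m.mp4", "H.264")])])` describes one work with one master and one derivative.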

Release 0 was this month (upload, simple metadata, conversion), and release one will come about in December 2012. Development will be funded through 2014.

Uses Blacklight for discovery, Strobe media player for now. Other media players with more capabilities are being considered.

Variations on Video is becoming AVALON (Audio Video Archives and Libraries Online).

Using the agile Scrum approach with a single team at the university for development. Other partners will install, test, provide feedback. All documentation, code, workflow is open, and there are regular public demos. Hopefully, as the software develops, additional community will get involved.

Q: Delivering to mobile devices?

A: Yes, the formats video will transcode into will be selectable, but most institutions will likely choose a mobile-appropriate format. The player will be able to deliver to any particular device (focusing on iOS and Android).

Q: Can your system cope with huge videos?

A: That’s the plan, but ingesting will take work. We anticipate working with very large stuff.

Q: How are you referencing files internally? Filenames? Checksums? Collisions of named entries?

A: Haven’t talked about identifiers yet. UUIDs generated would be best, since filenames are a fairly fragile method. Fedora is handling identifiers so far.

Q: Can URLs point to specific times or segments?

A: That is an aim, and the audio project already does that.
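
The speakers did not say which URL scheme would be used. One existing convention for addressing a time range is the temporal dimension of W3C Media Fragments, where `#t=start,end` (in seconds) is appended to the media URL. A minimal sketch, with a hypothetical helper name and example URL:

```python
def time_fragment_url(base_url, start_s, end_s=None):
    """Append a Media Fragments temporal range (#t=start[,end]) to a URL."""
    if start_s < 0:
        raise ValueError("start must not be negative")
    if end_s is not None and end_s <= start_s:
        raise ValueError("end must be after start")
    fragment = f"t={start_s}" if end_s is None else f"t={start_s},{end_s}"
    return f"{base_url}#{fragment}"
```

For instance, `time_fragment_url("https://example.org/media/1", 90, 120)` points at the segment from 1:30 to 2:00.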

July 12, 2012. Posted at 7:59 am in LiveBlog, Updates.
Jul 11 2012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 1 (LT1), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Hi there, I’m Mahendra Mahey, I run the DevCSI project, my organisation is funded by JISC. This is the fifth Developer Challenge. This is the biggest to date! We had 28 ideas. We have 19 presentations, each gets 3 minutes to present! You all need a voting slip! At the end of all of the presentations we will bring up a table with all the entries. To vote write the number of your favourite pitch. If it’s a 6 or a 9 please underline to help us! We will take in the votes and collate them. The judges won’t see that. They will convene and pick their favourites and then we will see if they agree… there will then be a final judging process.

The overall prize and runner up share £1000 in Amazon vouchers. The overall winner will be funded to develop the idea (depending on what’s logistically possible). And Microsoft Research have a .NET Gadgeteer prize for the best development featuring Microsoft technology. So we start with…

1 – Matt Taylor, University of Southampton – Splinter: Renegade Repositories on Demand

The idea is that you have a temporary offshoot of your repository that can be disposed of or reabsorbed, ideal for conferences or workshops. It reduces overhead – the idea is that you don’t have to make accounts for anyone temporarily using your repository. It’s a network of personal microrepositories, a lightweight standalone annotation system. It’s independent of the main repository. Great for inexperienced users, particularly important if you are a high prestige university. And the idea is that it’s a pseudopersonal workspace – can be shared on the web but separate from your main repository. And it’s a simplified workflow – so if you make a splinter repository for an event you can use contextual information – conference date, location, etc. – to populate metadata. Microrepository already in development and tech exists: demo at Bazaar workshop tomorrow. Reabsorption trivial using SWORD.

2 – Keith Gilmerton and Linda Newman – MATS: Mobile Audio Transcription and Submission

The idea is that you submit audio to repositories from phones. You set up once. You record audio. You select media for transcription, you add simple metadata. You can review audio. Can pick from Microsoft Research’s MAVIS or Amazon’s Mechanical Turk. When the submission comes back you get transcription and media to look at, and can pick which of those two – either or both – you upload. And even if the transcript is not back it’s OK – the new SWORD protocol does updates. And this is all possible using Android devices and code reused from one of last year’s challenges! Use cases – digital archive of literacy studies seeks audio files, Elliston poetry curator makes analogue recordings, tablets in the field – Pompeii Archaeological Research Project would greatly increase submissions of data from the field.

3 – Joonas Kesaniemi and Kevin Van de Velde – Dusting off the mothballs introducing duster

The idea is to dust off time series here. The only thing constant is change (Heraclitus 500BC). I want to get all the articles from Aalto University. It’s quite a new university but there used to be three universities that merged together. It would help to describe how the institution changed over time. Useful to have a temporal change model. Duster (aka Query expansion service) takes a data source that is a complex data model and then makes that available. Makes a simple Solr document for use via API. An example Kevin made – searching for one uni searches for all…
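
The "searching for one uni searches for all" behaviour amounts to expanding a name into every institution linked to it by merger. The sketch below is an illustration of that idea only, not the Duster code; the lookup table and function name are hypothetical, though the three predecessor institutions are the ones that merged to form Aalto University.

```python
# predecessor -> successor
SUCCEEDED_BY = {
    "Helsinki University of Technology": "Aalto University",
    "Helsinki School of Economics": "Aalto University",
    "University of Art and Design Helsinki": "Aalto University",
}


def expand_institution(name):
    """Return the name plus every institution linked to it by merger."""
    # Walk forward to the most recent successor.
    root = name
    while root in SUCCEEDED_BY:
        root = SUCCEEDED_BY[root]
    # Then collect everything that eventually merged into it.
    related = {root}
    changed = True
    while changed:
        changed = False
        for old, new in SUCCEEDED_BY.items():
            if new in related and old not in related:
                related.add(old)
                changed = True
    return sorted(related)
```

A search service could then OR together the expanded names, so a query for any one of the four returns records attributed to all of them.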

4 – Thomas Rosek, Jakub Jurkiewicz [sorry names too fast and not on screen] – Additional text for repository entries

In our repository we have keywords on the deposits – we can use intertext to explain keywords. Polish keywords you may not know – but we can see them in English. And we can transliterate Cyrillic. The idea is to build a system from blocks – connected like Lego bricks. Build a block for transliteration, for translating, for Wikipedia, a block for geonames and mapping. And these would be connected to the repository and all work together. And it would show how powerful…

5 – Asger Askov Blekinge – SVN based repositories 

Many repositories have their own versioning systems but there are already well established versioning systems for software development that are better (SVN, Git) so I propose we use SVN as the back end for Fedora.

Mass processing on the repository doesn’t work well. Check out the repo to a Hadoop cluster, run the Hadoop job, and commit the changed objects back. If we used a standardised back end to access the repository we could use Gource – software version control visualisation. I have developed a proof of concept that will be on GitHub in the next few days to prove that you can do this, you can have a Fedora-like interface on top of an SVN repository.

6. Patrick McSweeney, University of Southampton – DataEngine

This is a problem we encountered, me and my friend Dabe Mills. For his PhD he had 1 GB of data, too much for the uni. Had to do his own workaround to visualise the data. Most of our science is in tier 3 where some data, but we need support! So the idea is that you put data into the repository, which allows you to show provenance, manipulate data in the repository, merge into smaller CSV files, and create a visualisation of your choice. You store intermediary files, data and the visualisations. You could do loads of visualisations. Important as a first step on the road to proper data science. Turns the repository into a tool that engages researchers from day one. And the full data trail is there and is reproducible. And more interesting than that. You can take similar data, use the same workflow and compare visualisations. And you can actually compare them. And I did loads in 2 days, imagine what I could do in another 2!

7. Petr Knoth from the Open University – Cross-repository mobile application

I would like to propose an application for searching across all repositories. You wouldn’t care about which repository it’s in, you would just search it, get it, using these apps. And these would be provided for Apple and Google devices. Available now! How do you do this? You use APIs to aggregate – we can use applications like CORE, can use perhaps Microsoft Academic Search API. The idea of this mobile app is that it’s innovation – it’s a novel app. The vision is your papers are everywhere through syncing and sharing. Its relevance to user problems: WYFIWYD: What you find is what you download. It’s cool. It’s usable. It’s plausible for adoption/tech implementation.

8. Richard Jones and Mark MacGillivray, Cottage Labs – Sword it!

Mark: I am also a PhD student here at Edinburgh. From that perspective I know nothing of repositories… I don’t know… I don’t care… maybe I should… so how do we fix it? How do we make me be bothered?! How do we make it relevant?

Richard: We wrote the Sword it code this week. It’s a jQuery plugin – one line of JavaScript in your header – to turn the page into a deposit button. Could go in a repository, library website, your researcher page… If you made a GreaseMonkey script – we could but we haven’t – we could turn ANY page into a deposit! Same with Google results. Let us give you a quick example…

Mark: This example is running on a website. Couldn’t do on Informatics page as I forgot my login in true researcher style!

Richard: Pick a file. Scrapes metadata from file. Upload. And I can embed that on my webpage with same line of code and show off my publications!

9. Ben O’Steen –

Cameron Neylon came up to me yesterday saying that lots of researchers submit papers to repositories like PubMed but also to publishers… you get DOIs. But who can see your paper? How can you tell which libraries have access to your papers? I have built… We can use CrossRef and a suitably sized sample of DOIs to find out the bigger picture – I faked some sample numbers as CrossRef is down just now. Submit a DOI, see if it works, fill in links and submit. There you go.

10. Dave Tarrant – The Thing of Dreams: A time machine for linked data

This seemed less brave than Kinect deposit! We typically publish data as triples… why aren’t people publishing this stuff when they could be… well because they are slightly lazy. Technology can solve problems I’ve created. It’s very SWORD, very CRUD, very Amazon Web Services… So in a browser… I can look at a standard Graphite RDF document. But that information is provided by this endpoint, gets annotated automatically. Adds date submitted and who submitted it. So, the cool stuff… well you can click view doc history… it’s just like Apple Time Machine, you can browse through time! And cooler yet you can restore it and browse through time. Techy but cool! But what else does this mean… we want to get to the semantic web, the final frontier… how many countries have capital cities with an airport and a population over 2 million… on 6th June 2006. Can do it using Memento. Time travel for the web + time travel for data! The final frontier.
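
Memento works by datetime negotiation: a client asks for a resource as it was at a given moment by sending an `Accept-Datetime` request header. How the demo in this pitch wired that up was not shown, so the sketch below only builds the header for the date mentioned; the function name is made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(when):
    """Build the Accept-Datetime header for a timezone-aware datetime."""
    if when.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Memento uses the HTTP date format, always expressed in GMT.
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": stamp}
```

Sending these headers to a Memento-aware endpoint would request the version of the data that was current on that date.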

11. Les Carr – Boastr – marshalling evidence for reporting outcomes

I have found as a researcher I have to report on outcomes. There is technology missing. Last month a PhD student tweeted that he’d won a prize for a competition from the World Bank – with link to World Bank page and image of him winning prize, and competition page. We released a press release, told EPSRC, they press released. Lots of dissemination, some of that should have been planned in advance. All published on the web. And it disappears super fast. It just dissipates… we need to capture that stuff for 2 years’ time when we report that stuff! It all gets lost! We want to capture imagination while it happens. We want to put stuff together. Path is a great app for stuff like Twitter has a great interface – who, what, where. Tie to sources of open data, maybe Microsoft Academic Live API. Capture and send to repositories! So that’s it: Boastr!

12. Juagr Adam Bakluha? – Fedora Object Locking

The idea is to allow multiple Fedora webapps to work together, multiheaded Fedora, so we can do mass processing: Fedora object store on a Hadoop File System. One Fedora head means bottlenecks, multiple heads mean multiple apps. Some shared state between webapps. Add new REST methods – 3 lines in some jaxrs.xml. Add the decorator – 3 lines in Fedora.fcfg and you have Fedora Object locking.

13. Graham Triggs – SHIELD

Before the proposal let’s talk SWORD… it’s great, but just for deposit. With SWORD2 you can edit but you get an edit IRI and you need those, what if you lose them? What if you want to change content in the repository? So, SWORD could be more widely used if edit IRIs were discoverable. I want an Atom feed. I want it to support authentication. A better replacement for OAI-PMH. But I want more. I want it to include non-archived items, incomplete items, things you may have deposited before. Most importantly I want the edit IRI! So I said I have a name… I want a Simple Harvest Interface for Edit Link Discovery!
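
To make the pitch concrete: if a repository exposed an Atom feed whose entries carried `rel="edit"` links, a client could rediscover its edit IRIs with a few lines of parsing. The feed below is entirely made up for illustration (SHIELD was a proposal, so there is no real endpoint to quote), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed: one entry exposes an edit link, the other does not.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>urn:example:item-1</id>
    <title>Draft paper</title>
    <link rel="edit" href="https://repo.example.org/sword/edit/1"/>
  </entry>
  <entry>
    <id>urn:example:item-2</id>
    <title>Dataset</title>
    <link rel="alternate" href="https://repo.example.org/item/2"/>
  </entry>
</feed>"""


def edit_links(feed_xml):
    """Map each entry id to its rel="edit" link, skipping entries without one."""
    links = {}
    for entry in ET.fromstring(feed_xml).iter(f"{ATOM}entry"):
        for link in entry.iter(f"{ATOM}link"):
            if link.get("rel") == "edit":
                links[entry.findtext(f"{ATOM}id")] = link.get("href")
    return links
```

A real service would also need the authentication the pitch asks for, which this sketch leaves out.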

14. Jimmy Tang, DRI – Redundancy at the file and network level to protect data

I wanted to talk about redundancy at file and network level to protect data. One of the problems is that people with multi-terabyte archives like to protect them. Storage costs money. Replicating data, as in LOCKSS, is wasteful and expensive I think: replication means N times the cost. My idea is to take an alternative approach… A possible solution is using forward error correcting or erasure codes in a persistent layer – like setting up a RAID disc. You keep pieces of files and you can reconstruct them – move complexity from the hardware to the software world and save money with the efficiency. There are open source libraries to do this, most are mash ups. Should be possible!
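
The simplest member of the erasure-code family is a single XOR parity piece, the same trick RAID 5 uses. The sketch below is an illustration of the principle rather than what the speaker proposed to build: it splits a file into k pieces plus one parity piece, so any one lost piece can be rebuilt at a storage cost of (k+1)/k instead of a full extra copy. Function names are invented here.

```python
from functools import reduce


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(data, k):
    """Split data into k equal chunks (zero-padded) plus one XOR parity chunk."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    return chunks + [reduce(xor_bytes, chunks)]


def recover(pieces):
    """Rebuild the single missing piece (marked None) by XOR-ing the others."""
    missing = pieces.index(None)
    present = [p for p in pieces if p is not None]
    restored = list(pieces)
    restored[missing] = reduce(xor_bytes, present)
    return restored
```

A single parity piece survives only one loss; codes such as Reed-Solomon generalise the idea to tolerate several missing pieces.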

15. Jose Martin – Machine and user-friendly policifying

I am proposing a way to embed data from SHERPA/RoMEO web services into records waiting to be reviewed in a repository. Last week I heard how SHERPA/RoMEO receives over 250K requests for data, he was looking for a script to make that efficient, a script to run on a daily or weekly basis. Besides, this task is often fairly manual. Why not put machines to work instead… so we have an EPrints repository with 10 items to be reviewed. We download SHERPA/RoMEO information here. We have the colour code that gives a hint about policy. The script would go over all items looking for ISSN matches and find the colour code, and let us code those submissions – nice for the repository manager and means the items are coded by policy ready to go. And updated policy info done in just one request for, say, 10 items. More efficient and happier! And retrieve the journal title whilst at it.
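
One way to cut down on repeated policy lookups is to query each distinct ISSN once and reuse the answer for every item in the review queue. The sketch below assumes a caller-supplied `lookup_colour` function standing in for the actual policy service call, which is deliberately not modelled here; the item structure is also an assumption.

```python
def colour_items(items, lookup_colour):
    """Attach a policy colour to each item, looking up each distinct ISSN once."""
    cache = {}
    coded = []
    for item in items:
        issn = item.get("issn")
        if issn is not None and issn not in cache:
            cache[issn] = lookup_colour(issn)  # one lookup per distinct ISSN
        # Items without an ISSN get a colour of None and need manual review.
        coded.append({**item, "colour": cache.get(issn)})
    return coded
```

Run from a daily or weekly job, this would leave the queue pre-coded by policy; a true batch request, as the pitch suggests, would reduce the traffic further.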

16. Petr Knoth – Repository Analytics

Idea to make repository managers’ lives very easy. They want to know what is being harvested and if everything is correct in their system. It’s good if someone can check from the outside. The idea is that analytics sit outside the repository, let them see metadata harvested, if it works OK, and also provide stats on content – harvesting of full text PDF files. Very important. Even though we have OAI-PMH there are huge discrepancies between the files. As a repository manager I can see that everything is fine, that it has been carried out etc. So we can see a problem with an end point. I propose we use this to automatically notify the repository manager that something is wrong. Why do we count metadata not PDFs – the latter are much more important. Want to produce other detailed full text stats, eg citation levels!

17. Steffan Godskesen – Current and complete CRIS with Metadata of excellent quality 

Researchers don’t want to do things with metadata but librarians do care. In many cases metadata is already available from other sources and in your DI. So when we query the discovery interface cleverly we can extract metadata, inject it into the CRIS, have librarians quality check it and obtain an excellent CRIS. Can we do this? We have done this between our own DI (discovery system) and CRIS. And again when we changed CRIS, again when we changed DI. Why do it again and again… to some extent we want help from DI and CRIS developers to make these systems extract data more easily!

18. Julie Allison and Ben O’Steen – Visualising Repositories in the Real World

We want to use .NET Gadgeteer or Arduino to visualise repository activity. Why? To demonstrate in the real world what happens in the repository world. Screens showing issues maybe. A physical gauge for hits per hour – great demo tool. A bell that rings when we meet the deposits-per-day target. Or blowing bubbles for each deposit. Maybe 3D printing of deposited items? Maybe online: Chronozoom, PivotViewer – explore content, JavaScript InfoVis – set of visualisation tools. Repository would be mine – York University. Using query interface to return creation date etc. Use APIs etc. So for example a JSPN animation of publications and networks and links between objects.

19. Ben O’Steen – Raid the repositories!

Lots of repositories with one manager, no developers. Raid them! VM that pulls them all in, pull in text mining, analysis, stats, enhancer etc. Data. Sell as a PR tool £20/month as a demo. Tools for reuse.

Applause meter in the room was split between Patrick McSweeney’s and Richard Jones & Mark MacGillivray’s presentations.

July 11, 2012. Posted at 4:03 pm in LiveBlog, Updates.