Nicola Osborne

I am Digital Education Manager and Service Manager at EDINA, a role I share with my colleague Lorna Campbell. I was previously Social Media Officer for EDINA working across all projects and services. I am interested in the opportunities within teaching and learning for film, video, sound and all forms of multimedia, as well as social media, crowdsourcing and related new technologies.

Sep 062012

As we approach publishing a final post of highlights from Open Repositories 2012 and move this website towards being an archive of this year’s event we wanted to let you know how you can begin connecting with next year’s conference.

Open Repositories 2013 will be taking place on Prince Edward Island (PEI), Canada and you may recall that in the very warm welcome the team gave at OR2012 they promised to have their website live very soon… Well the OR2013 website is now live! Bookmark it now:

On the website already you’ll find some introductory information on the Island and highlights of what you’ll be able to enjoy during your conference stay. An OR2013 Crowdvine has also been set up so do go and sign up.

OR2013 have also launched their Twitter account: you can find them as @openrepos2013 and they are using the hashtag #OR2013 to get the conversation around next year’s conference started.

So, over the next few months you can not only look forward to some updates from the OR2012 team but you can also look forward to hearing much more about OR2013 from the Prince Edward Island team and start planning your ideas, papers, etc.





 September 6, 2012  Posted by at 9:45 am Updates Tagged with: , , ,  Comments Off on OR2013 Website Launched!
Aug 202012

It has now been a month since we gathered in Edinburgh for Open Repositories 2012 and we are delighted to report that there has been plenty of new content and reflection about the conference appearing since then.

Well over 90 blog posts and reports on the conference are now out there – you have been absolutely brilliant over the last few weeks sharing your reports, reflections and thoughts on how to take forward the fantastic ideas shared by speakers, posters, fellow delegates. We are sure there are more posts to come (since it has taken us a while to update this blog and we’re sure we’re not the only ones still thinking about talks, ideas, discussions had) so do let us know as you add any reports or write ups of your own. For now here are a few more highlights we wanted to share while everything is still fairly fresh – look at the bottom of this post for links to a more thorough collection of posts.

Firstly we have noticed lots of you sharing links to your slides on SlideShare. We will be making sure all of the programme content, slides and videos are connected up here on the website but for now we are making sure we gather these links to your shared presentations. For instance Todd Grappone and Sharon Farb at UCLA have shared their slides on the broadcast news archival work. This ambitious project is one to keep an eye out for, especially when it opens to the public in the future.

Research data has featured prominently in many of your write ups as it was a major theme of this year’s Open Repositories:

Leyla Williams blogged up a summary on the conference for the Center For Digital Research and Scholarship, with particular attention paid to research data and public access to hives of content.

Meanwhile Leslie Johnston of the Library of Congress gave a talk on big data, and also wrote up a great post on the significance of data in a repository setting where publications were once the center focus.

Tyrannosaurus and Shark in National Museum

Some people say open access policy has no teeth… (‘OR2012 012’ by wr_or2012, 22-07-12)

In addition to delegates and attendees who have been sharing their experiences some of our workshop facilitators have been sharing rich reflections on their workshops. For example Angus Whyte of the Digital Curation Centre further developed the idea of research data in repositories, and wrote up the conference workshop on the subject

Most of you will have seen some of the Developer Challenge Show & Tell sessions and we are delighted that the DevCSI team have shared their videos of OR2012 and they are a great collection of Developer Challenge presentations and short interview recordings, like this clip of Peter Sefton, chair of the judges:

We are also starting to see some really interesting posts about how OR2012 ideas and talks can be operationalised. For instance Simon Hodson of JISC has posted a whole series of excellent OR2012 write ups and reflections at the JISC Managing Research Data blog.

And we have also started to see publications based on the conference appearing.  Steph Taylor has written about OR2012 for Ariadne (Issue 69) as an example to frame her advice from getting the most from a conference – it’s a super article and should prove handy for planning your trip to OR2013 on Prince Edward Island. OR2012 has also featured very prominently in the latest issue of Digital Repository Federation Monthly, which includes 10 Japanese attendees’ reports of the conference – huge thanks to @nish_ku for bringing this to our attention.

The Digital Repository Federation article is far from the only non-English write up we’ve had – so far we have spotted write ups of the conference in GermanFinnishPolish, more posts in Japanese and this fantastic series of images of the conference dinner from the Czech Klíštěcí šuplátko photo blog. We know our language skills can’t match up to the incredible diversity of languages spoken by OR2012 delegates so we would really you to let us know if we’ve missed any of the write ups, reports, or reflections shared, particularly if they have been shared in another language.

As we have shared a number of write ups that draw on major conference themes it seems appropriate to close this post with the video of Peter Burnhill of EDINA delivering the closing session this year and wrapping everything up. It’s worth re-watching and, like all of the OR2012 videos, you can watch, share and comment on this on YouTube:

YouTube Preview Image

And finally….

We have several OR2012 conference bags left to give away. These are the perfect size for a laptop and papers which makes them fantastic for meetings but they are also great for looking stylish and well-travelled around the office or for transporting your craft kit to coffee shops and meet ups. We will be posting these remaining bags out with a few bonus edible Scottish treats so make sure you comment here or tweet with #or2012bags quickly to make sure you secure one of our last three remaining bags!

Where to find even more highlights…

  • Images can be found on Flickr, Highlights are gathered on our Pinterest board.
  • We have several gatherings of useful links which you can find on Delicious: write ups (blog posts, reports, etc.) of OR2012, useful resources shared in presentations and via Twitter, and OR2012 presentations.
  • Videos are on YouTube.
  • We have gathered tweets with Storify for browsing and exploring (please note this archive is updated once a week).
  • If you want to analyse or browse the text of all tweets you can access the full spreadsheet containing thousands of #OR2012 tweets on Google Docs. Please ignore colour codings – these are being used to remove unwanted content (tweets intended for other hashtags) and to ensure we capture all links to useful resources shared.
 August 20, 2012  Posted by at 1:31 pm Updates Tagged with: , , ,  Comments Off on Another Round of Highlights
Jul 132012

As the conference draws to a close we wanted to thank all of you that came along or followed the event online, and we wantnd to fill you in on what would be happening around the conference after the in-person part of Open Repositories 2012.

In the next few weeks we will be going through the over 4000 tweets and the fantastic photos, blog posts, presentations, conference materials and commentary that you have been producing throughout the conference and we’ll be summarising all that right here, linking to your blogs and reports, and highlighting where you can access all of the official conference content.

Here are eight ways to keep in touch:

  1. Fill in our survey – tell us what you liked, what we could have done better… we value all of your feedback on the event whether you were here in person or via reading our blogs, tweets, seeing videos etc:
  2. Stick with us on Twitter – we will continue sharing blog posts, updates, and conference-related new via the #or2012 tag and the @OpenRepos2012 account. And you should start following the new @ORConference Twitter account which will keep you in touch with Open Repositories throughout the year! Remember to reply, comment, retweet!
  3. Blog with us – we did our best to liveblog from the parallel strands but we would love to hear what you thought of these and other sessions – did you go to or run a fantastic workshop? Was there something increadibly useful from the user group you’d like to see shared more widely? We would love your contributions to the blog or to hear about where you’ve been writing up the event – just drop us an email or leave a comment here!
  4. Keep an eye on the OR2012 YouTube channel – you will find over 40 videos of the parallel sessions (excluding P1A unfortunately, our AV team have been unable to correct a corrupt file of that recording) there already and Pecha Kucha sessions will be appearing over the next few weeks.
  5. Share your pictures – if you haven’t already joined our Flickr group please do get in touch – we’d love to see more of your pictures of the event!
  6. Pin with us! – We have begun the process of gathering our favourite images and videos from OR2012 on Pinterest. We would love to add your highlights, your favourite parts of the event so do let us know what you’d like to see appear!
  7. Connect on CrowdVine! Now that you’ve had a chance to meet and chat it’s a great time to use the OR2012 CrowdVine to stay in touch, make further connection, discuss your thoughts on the event. For instance there’s already a great thread on “highlights and things you’ll take home“.
  8. And finally… Look out for emails about Open Repositories 2013. If you’ve let us know your email address via the feedback form we’ll be in touch. You can also join the Open Repositories Google Group and stay in touch that way. Or you can simply drop us a note to and we’ll make sure we add you to our list for staying in touch.

We really enjoyed Open Repositories 2012 and really hope you did too!

 July 13, 2012  Posted by at 4:04 pm Updates Tagged with: ,  Comments Off on What to expect from OR2012 over the next few weeks
Jul 122012

Today we are liveblogging from the OR2012 conference at George Square Lecture Theatre (GSLT), George Square, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Kevin: I am delighted to introduce my colleague Peter Burnhill, Director of EDINA and Head of the Edinburgh University Data Library, who will be giving the conference summing up.
Peter: When I was asked to do this I realised I was doing the Clifford Lynch slot here! So… I am going to show you a Wordle. Our theme for this years conference was Local In for Global Out… I’m not sure if we did that but here is the summing up of all of the tweets from the event. Happily we see Data, open, repositories and challange are all prominent here. But Data is the big arrival. Data is now mainstream. If we look back on previous events we’ve heard about services around repositories… we got a bit obsessed with research articles, in the UK because of the REF, but data is important and great to see it being prominent. And we see jiscmrd here so Simon will be pleased he did come on his crutches [he has broken his leg].
I have to confess that I haven’t been part of the organising committee but my colleagues have. We had over 460 register from over 40 different nations so do all go to PEI. Edinburgh is a beautiful city but when you got here is was rather damp but it’s nicer now – go see those things. Edinburgh is a bit of a repository itself – we have David Hume, Peter Higgs and Harry Potter to boast – and that fits with local in for global out as I’m sure you’ve heard of two of them. And I’ve like to than John Howard, chair of the OR Steering Committe and our Host Organising Committee
Our opening keynote Cameron Neylon talked about repositories beyond academic walls and the idea of using them for turning good research outputs into good research outcomes. We are motivated to make sure we have secure access to content… as part of a more general rumbling with workshops before the formal start there was this notion of disruption. Not only the Digital Economy but also a sense of not being passive about that. We need to take command of the scholarly communication area that is our job – that cry to action from Cameron and we should heed that.
And there was talk of citation… LinkedIn, etc. is all about linking back to research to data. And that means having reliable identifiers. And trust is a key part of that. Publishers have trust, if repositories are to step up to that trust level you have to be sure that when you access that repository you get what it says it is. As a researcher you don’t use data without knowing what it is and where it came from. The respoitory world needs to think about that notion of assurance, not quality assurance exactly. And also that object may be interrogatable to say what it is and really help you reproduce that object.
Preservation and Provenance is also crucial,
Disaster recovery is also important.. When you fail, and you will, you need to know how you cope, really interesting to see this picked up in a number of sessions too.
I won’t  summarise everything but there were some themes…
We are beginning to deal with the idea on registries and how those can be leveaged for linking resources and identifiers. I don’t think solutions were found exactly but the conversations were very valuable.And we need to think about connectivity, as flagged by Cameron. And these places l,e twitter and Facebook… WE don’t own them but we need to be I them, to make sure that citations come back to us from here.And finally, we have been running a thing called repository fringe for the last four years, and then we won the big One. But we had a little trepidation as There afe a lot lf hou! And we had an uncondference strand. Ad i can say that UoE intends to do repository fringe in 2013.

We hope you enjoyed that unconference strand – an addition to complement the open repositories, not to take away from it but to add an extra flavour. We hope that the PEI folk will keep a bit f that flavour at OR and we will be running the fringe a wee bit later in the year, nearer the edinburgh fringe.

As I finish up I wanted to mention an organisation in IASSIST, librarians used to be about the demand side of services but things have shifted over time. We would encourage that those of us here lik up to groups like IASSIST (and we will suggest the same to them) and we can finds way to connect up, to commune together at PEI and to kshare experience. And so finally I think this is about the notion of connectivity. We have the technology, we have the opportunity to connect up more to our colleagues!

And with that I shall finish up!

Begin with an apology….

We seem to have the builders in. We have a small event coming up… The biggest festival in the world… Bt we didn’t realise that the builders would move in about the same week as you….what you haven’t seen yet is out 60x40ft upside down purple cow… If you are here a bit longer you may see it! We hope you enjoyed your time nonetheless

It’s a worrying thing hosting a conference like this… Lke hosting a party you worry if anyone will show up. But the feedback seems to have been good and and I have many thank yous. Firstly to all of those who reviewed papers. To our sponsors. To the staff here – catering, edinburgh first,nthe tech staff. Bt particularly to my colleagues on the local Host Orgnaising Committee: Stuart Macdonald, William Nixon, james toon,  andrew bevan – most persuasive committee member getting our sponsors on board, saly Macgregor, nicola osborne who has led our social media activity, and to Florance Kennedy, who has been using her experience of wrangling 1000 developers at FLOc a few years ago.

The Measure of success for any event like this is about the quality of conversation, of collaboration, of idea sharing, and that seems to have worked well and we’ve really enjoyed having you here. The conference doesn’t end now of course but changes shape.. And so we move onto the user groups!

 July 12, 2012  Posted by at 11:33 am LiveBlog, Updates Tagged with: ,  2 Responses »
Jul 112012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: ORCID update and why you should use ORCIDs in your repository
Speaker(s): Simeon Warner

I am speaking with my Cornell hat on and my ORCID hat on today. So this is a game of two halves. The first half is on ORCID and what it is. And the second half will be about the repository case and interfacing with ORCID.

So, the scholarly record is broken because there is no reliable attribution of authors and contributors is impossible without unique person-level identifiers. I have an unusual name so the issue is mild, but if you have a common name you are in real trouble. We want to find unique identities to person records across data sources and types and to enlist a huge range of stakeholders to do this.

So ORCID is an amazing opportunity that emerged a couple of years ago. Suddently publishers, achivists, etc. all started talking about the same issue. It is an international, interdisciplinary, open and not for profit organization. We have stakeholders that include research institutions, funding organizations, publishers and researchers. We want to create a registry of persistent unique identifyer fo all sorts of roles – not just authors – and all sorts of contributions. We have a clear scope and set of principles. We will create this registry and it will only work if it’s used very widely. The failure of previous systems have been because the scope hasn’t been wide enough. One of the features of research is that things move –  I was a physicist, now repositories, libraries… I don’t live in one space here. To create an identity you need som einformation to manage that. You need a name, an email, some other bits of information, and the option for users to update their profile with stuff that is useful for them. Privacy is an issue – of course. So we have  a principle in ORCID is opt-in. You can hide your record if you want. You can control what is displayed about you. And we have a set of open principles about how ORCID will interact with other systems and infrastructure.

So ORCID will disambiguate researchers and allow tracking. automate repository deposition, and other tasks that levage use of this sort of ID. We have 328 participan organizations, 50 of which have provided sponsorship. And that’s all over the world.

So to go through a research organization workflow: for an organisation it’s a record of what researchers have done in that institution. But you don’t want a huge raft of staff needed to do this. So the organisation registers with ORCID. At some stage ORCID looks for a record of a person and the organisation pulls out data on that person. Once that search is done on already held information. Identifiers can then be created ready for researchers to claim these.

So, granting bodies, in the US there is always a complaint and a worry about the buden of reporting. So what if we tied this up to an ORCID identity? Again the granting body registers with ORCID and then an ORCID::grant linking sent to PI or researcher for confirmation. Same idea again with the publisher. If you have granted the publisher the ability to do it you can let them add the final publication to your name, saving effort and creating a more accurate record.

So a whole set of workflows gives us a sort of vision for researchers as early as possible in the creation of research here. And in phase I system the researcher can self-claim a profile, delegate management and institutional record creation. Fine grained control of privacy settings. Data exchange into grant and manuscript submission system, authorised organisations/publications etc. So right now we have an API, a sandbox server, etc. Now working out launch partners and readying for launch. ORCID registry will launch in Q4 of 2012. Available now: ORCID identifier structure (coordinated with ISNI) will have a specific structure. Code, APIs, etc. available.

So why should you use ORCID in your repository?

Well we have various stakeholders in your repository – authors, academic community and the institutions themselves. Institutional authors want credit for their work, ORCID should and will increase the likelihood of authors publications being recognised. It opens the door to link to articles that wouldn’t be linked up – analyses of citations etc. Opens the door to more nuanced notions of attributions. And it saves efforts by allowing data reuse across institutions. For readers it offers better discovery and analysis tools. Valuable information for improving tools like Microsoft Academic search, better ways to measure research contributions etc. And institutions allows robust links between local and remote repositories, better track and measure use of publications.

And from an arXiv position we’ve looked for years for something to unify author details across our three repositories. We have a small good quality repositories but we need that link between the author and materials. And from UK/JISC perspective there is a report from JISC Research Identifier task force group that indicates the benefits of ORCID. I think for repositories ORCID helps make repositories count in a field we have to play in.

So, you wnat to integrate with ORCID. There are two tiers to the API right now, I’ll talk about both. All APIs return XML or JSON data. The tier 1 API is available to all for free, no access controls. With this you can ask a researcher for their ORCID ID and look at data made public. You could provide pop up in your repository deposit process to check for their ORCID ID. There is a competition between functionality and privacy here but presuming they have made their ID public this will be very useful.

Tier 2 API members will have access to an OAuth2 authentication between service and ORCID allow users to grant certain rights to a service. Access to both public and (if granted) protected data. Ability to add data (if granted). Really three steps to this process. Any member organisation in the process would get an ORCID ID in first stage of the process. Secondly if you have a user approaching the repository that user can login and grant data access to the client repository. The user can be redirected back to the repository along with an access permisssion. And if access is granted then the repository continues to have access to the user’s profile until this permission is revoked by the user (or ORCID). And data can be added to the users profile by the repository if it becomes available.

All code etc. on Follow the project on Twitter @ORCID_Org.


Q1 – Ryan) You mentione dthat ORCID will send information to CrossRef, what about DataCite?

A1) I don’t think I said that. We import data from CrosRef, not an import the other way. I think that would be led by DOI owner, not ORCID. DOI is easy, someone has the right to a publication, people don’t work that way.

Q1) In that case I encourage you to work with DataCite.

A1) If it’s public on ORCID anyone can harvest it. And ORCID can harvest any DOI source.

Q2 – Natasha from Griffith University) An organisation is prompted to remove duplicates? How does that work?

A2) We are working on that. We are not ready to roll out bulk creation of identifiers for third party at th emoment. The initial creation will be by individuals and publications. We need to work out how best to do that. Researchers want this to be more efficient so we need to figure that question out.

Topic: How dinosaurs broke our system: challenges in building national researcher identifier services
Speaker(s): Amanda Hill

So I am going to talk about the wider identifier landscape that ORCID and others fits into. So on the one hand we have book-level data, it’s labour intensive, disambiguation first, authors not involved, open. And then we have publisher angle – automatic, disambiguation later, authors can edit, proprietary. In terms of current international activity we have ISNI as well as ORCID. ISNI is very library driven, disambiguation first, authors not involved, broad scope. ORCID is more publisher instigated, disambiguation later, authors can submit/edit, current researchers. ISNI is looking at fictional entities etc. as well as researchers etc. so somewhat different.

We had a Knowledge Exchange meeting on Digital author identifiers in March 2012 and both groups were encouraged and present, they are aware and working with each other to an extent. Both ISNI and ORCID will use of existing pools of data to populate them. There are a number of national author ID systems – in 2011 there was a JISC-funded survey to look at systems and their maturity. We did this via a survey to national organisations. The Lattes system in Brazil is very long term – its been going since 1999 – and very mature and very well populated but there is  a diverse landscape.

In terms of populating systems there is a mixture – some prepopulated, some manual, some authors edit themselves. In Japan there was an existing researcher identifiers, thesaurus of author names in Netherlands. In Norway they use human resources data for the same purpose. With more mature systems a national organisation generally has oversight – e.g. in Brazil, Norway, Netherlands. There is integration with research fields and organisations etc. It’s a bit different in the UK. The issue was identified in 2006 as part of call for proposals for the JISC-funded repositories and preservation programme. Mimas and British Library proposed a two year project to investigate requirements and build a prototype system. This project, the Names project, can seem dry but actually it’s a complex problem. Everyone has stories of ambiguation.

The initial plan was to use the British Library Zetoc service to create author IDs – journal article information from 1993 but it’s too vast, too international. And it’s only last names and initials, no institutional affiliation. So we scrapped that. And luckily the JISC Merit project used 2008 Research Assessment Exercise data to pre-populate the Names database. It worked well except for twin brothers with the same initials both writing on paleantology and often co-authoring papers… in name authority circles we call this the “Siveter problem” (the brothers surnames). We do have both in the system now.

Merit data covers around 20% of active UK researchers. And we are working to enhance records and create new ones with information from other sources. Working with institutional repositories, british library data sets (Zetoc), Direct input from researachers. With current EPrints the RDF is easy to grab so we’ve used that with Huddersfield data and it works well. And we have a Submission form on the website now so people can submit themselves. Now, an example of why this matters… I read the separatedbyacommonlanguage blog and she was stressing about the fact that her name appears in many forms and the REF process. This is an example of why identifiers matter and why names are not enough. And how strongly people feel about it.

Quality really matters here. Automatic matching can only achieve so much – it’s dependent on data source. And some poeple have multiple affiliations. There is no size fits all solution hre. We have colleagues at the British Library who perform manual check of results of matching new data sources – allows for separation/merging of records – they did similar on ISNI. At the moment people can contribute a record but cannot update it. In the long term we plan to allow poeple to contribute their own information.

So our ultimate aim is to have a high quality set of unique identifiers for UK researchers and research institutions. Available to other systems – national and international (e.g. Names records exported to ISNI in 2011). Business model wise we have looked at possible additional services – such as disambiguation of existing data sets, identification of external researchers. About a quarter of those we asked would be interested in this possibility and paying for such added value services.

There is an API for the Names data that allows for flexible searching. There is an EPrints plugin – based on the API – which was released last year. It allows repository users to choose form a list of Names identifiers – and to create a Names record if none exists.

So, what’s happening with names now? We are hopefully funded until the end of 2012. Simeon mentioned the JISC convened Researcher ID group – final meeting will take place in September. That report went out for consultation in June, the report of the consultants went to JISC earlier this week. So these final aspects will lead to recommendations. And we have been asked to produce an Options Appraisakl Report for Uk national researcher identifier service in December. And we are looking at improving data and adding new records via repositories search.

So Names is kind of a hybrid of library/publisher approaches. Automatic matching/disambiguation; human quality checks; data immediately available for re-use in other systems; and authors can contribute and will be able to edit. When Names set up ORCID was two years away, ISNI hadn’t started yet. Things are moving fast. The main challenges here are cultural and political rather than technical. National author/researcher ID services can be important parts of research infrastructure. It’s vital to get agreement and co-ordination at national level here.


Q1) I should have asked Simeon this but you may have some appreciation here. How are recently deceased authors being handled? You have data since 1993 – how do you pick up deceased authors.

A1) No, I don’t think that we would go back to check that.

Q1) These people will not be in ID systems but retrospective materials will be in repositories so hard to disambiguate these.

A1) It is important. Colleagues on Archives Hub are interestied in disambiguation of long dead people. Right now we are focusing on active resaerchers.

A2 – Simeon) Just wanted to add that ORCID has a similar approach to deceased authors.

Q2 – Lisa from University of Queensland) We have 1300 authors registered with author id – how do you marry national and ORCID ID?

A2) We can accomodate all relevant identifiers as needed, in theory ORCID ID would be one of these.

Q3) How do you integrate this system by Web of Science and other commercial databases?

A3) We haven’t yet but we can hold other identifiers so could do that in theory but it’s still a prototype system.

Q4) Could you elaborate on national id services vs. global services?

A4) When we looked across the world there was a lot of variation. It would depend on each countries requirements. I feel a national service can be more responsive to the need of that community. So in the UK we have the HE statisticas agency who want to identify those in universities for instance, ORCID may not be right for that purpose say. I think there are various ways we could be more flexible or responsible as a national system vs ORCID with such a range of stakeholders.

Topic: Creating Citable Data Identifiers
Speaker(s): Ryan Scherle, Mark Diggory

First of all thank you for sticking around to hear about identifiers! I’m not sure even I’m that excited about identifiers! So instead lets talk about what happened to me on Saturday. I was far away… it was 35 degrees hotter… I was at a little house on the beach. The Mimosa House. It’s at 807 South Virginia Dare Trail. Kill Devil Hills, NC USA. 27898. It isn’t a well known town but it was the place where the first Orville bros. flight tests took place at [gives exact geocordinators]. But I had a problem. My transmission [part number] in my van [engine number] and opened the vent to  a new spider and a deadly spider crawled out [latin name]. I’m fine but it occured to me that we use some really strange combinations of identifiers. And a lot of these are very unusable for humans – those geocoordinates are not designed for humans to read out loud in a presentation [or livebloggers to grab!].

When you want data used and reused we need to make identifiers human friendly. Repositories use identifiers… EPrints can use a 6 digit number and URL, not too bad, In Fedora there isn’t an imposed scheme. In this one there is a short accession number but it’s not very prominent, you have to dig around a long URL. Not really designed for humans (I’ll confess I helped come up with this one so my bad too). DSpace does impose a structure. It’s fairly short and easy to cite. If you are used to repositories. But if you look at Nature – a source scientists understand. They use DOIs. When scientists see a DOI they know what this is and how to cite this. So why don’t repositories do this?

So I’m not going to get controversial. I am going to suggest some principles for citable identifiers, you won’t all agree!

1 ) Use DOIs – they are very familiar to scientists and others. Scientists dont understand handles, purls or info URI. They understand DOI. And using it adds weight to your citation – it looks important. And loads of services and tools are compatible with DOIs. Currently EPrints and DSpace don’t support them, Fedora only with a lot of work.

2) Keep identifiers simple – complex identifiers are fine for machines but bad for humans. Despite our best intentions humans sometimes need to work with identifiers manually. So keep as short and sweet as possible. So do repositories support that? Yes all three do but you need the right policies set up.

3) Use syntax to illustrate relationships – this is the controversial bit. But hints in identifiers can really help the user. A tiny bit of semantics to an identifier is increadibly useful. e.f. A few slashes here help humans look at higher level objects. Useful for human hacks and useful for stats. You can aggregate stats for higher level stuff. Could break in the future, probably wont! Again EPrints and DSpace don’t enable this. Fedora only with work.

4) When “meaning-bearing” content changes, create a versioned identifier – scientists are pretty picky. Some parts objects have meaning, some don’t. For some objects you might have an excel file. Scientists want that file to be entirely unchanged – and only with new URL. Scientists want datat to be invariant to enable reuse by machines, even a single bit makes a difference. Watch out for implicit abstractions – e.g. thumbnails of different images etc. This kind of process seems intuitive but it kinda flies in face of DOI system and conventions. A DOI for an article it resolves to a landing page that could change every day and contain any number of items. Could be with a different publisher. What the scientist cares about is the article of text itself, webpage not so much of an issues.

Contrast that with…

5) When “meaningless” content changes, retain the current identifier – descriptive metadata must be editable without creating a new identifier. Humans rearely care about metadata changes, especially for citation purposes. Again repositories dont handle this stuff so well. EPrints supports flexible versioning/relationships. DSpace has no support. Fedora has implicit versioning of all data and metadata – useful but too granular!

So to build a repository with all of these features we had a lot of work to do. We had previously been using DSpace so we had some work to do here. What we did was add a new DSpace identifier service. It allows us to handle DOI, and to extend to new identifiers in the future. It allows us granular control of when a new DOI is registered and it lets us send these to citation services as required. So our DSpace identifier system uses EZCite at CDL and then also to DataCite. The DataCite content service lets you look up DOIs, they are linked data compliant – you can see relationships in the metadata. You can export metadata in various formats for textual or machine processing purposes. And we added some data into our citations information. When you load a page in Dryad there is a clear “here’s how to cite this item” note as we really want people to cite our material.

In terms of versioning we have put this under the control of the user and that means that when you push a button a new object is created and goes through all the same creation processes – just a copy of the original. So we can also connect back to related files on the service. And we thus have versioning on files. We plan to do more on versioning on the file and track changes on these. We need to think about tracking information in the background without using new identifiers in the foreground. We are contributing much of this back to DSpace but we want to make sure that the wider DSpace community finds this useful, it meets their requirements.

So, how well has it worked? Well it’s been OK. Lots of community change needed around citing data identifiers. Last year we looked at 186 articles associated with Dryad deposits – 77% had “good” citations to the data. 2% had “bad” citations to the data. And 21% had no data citations at all. We are owrking with the community to raise awareness about that last issue. Looking at articles a lot of people cite data in the text of the article, sometimes in supplementary materials at the end. And a bad citation – they called their identifier an “accession number”.

So, how many of you disagree with me here? [some, not tons of people] Great! come see me at dinner! But no matter whether you agree or not do think about identifiers and humans and how they use them. And finally we are hiring developer and user interface posts at the moment, come talk to me!


Q1 – Rob Sanderson, Los Alamos Public Laboratories) I agree with (4) and (5) but DOIs? I disagree! They are familiar but things can change on a DOI, that’s not what you want!

A1) I maybe over simplified. When you resolve a DOI you get to an HTML landing page. There is content – in our case data files. Those data files we guarantee to be static for a given DOI. We do offer an extension to our DOI – you can add /bitstream to get the static bits. But that page does change and restyle from time to time.

Q2 – Robin Rice, Edinburgh University Data Library) We are thinking about whether to switch from handles for DOI but you can’t have a second DOI for a different location… What do you do if you can’t mind a new DOI for something?

A2) You can promote the existing DOI. I question that you can’t have more than one DOI though, you can have a DOI for each instance for each object.

Q2) Earlier it seemed that the DOI issuing agency wouldn’t allow that

A2)  We haven’t come across that issue yet

A2 – audience) I think the DOI agency would allow your sort of use.

 July 11, 2012  Posted by at 2:30 pm LiveBlog, Updates Tagged with:  Comments Off on P5B: Name and Data Identifiers LiveBlog
Jul 112012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: A Repository-based Architecture for Capturing Research Projects at the Smithsonian Institution
Speaker(s): Thorny Staples

I have recently returned to the Smithsonian. I got into repositories through lots of digital research projects. I should start off by saying that I’ll show you screenshots for a system that allows researchers to deposit data from the very first moment of research, it’s in their control until it goes off to curators later.

I’m sure most of you know of the Smithsonian. We were founded to be a research institute originally – museums were a result of that. We have 19 museums, 9 scientific research centers, 8 advances study centres, 22 libraries, 2 major archives and a zoo (Washington zoo). We focus on longterm baseline research, especially in biodiversity and environmental studies, lots of research in cultural heritage areas. And all of this, hundreds of researchers working around the world, has had no systematic data management of digital researvh content (except for SAO who work under contract for NASA).

So the problem is that we need to capture research information as it’s created and make it “durable” – it’s not about presevation but about making it durable. The Smithsonian is now requiring a data management plan for ALL projects of ANY time. This is supposed to say where they will put their digital information, or at least get them thinking about it. But we are seeing very complex arrays of numerous types of data. Capturing the full structure and context of the research content is neccasary. It’s a network model, it’s not a library model. We have to think network from the very beginning.

We have to depend on the researvchers to do much of the work, so we have to make it easy. They have to at least minimally describe their data but they have to do something. And if we want them to do it we must provide incentives. It’s not about making them curators. They will have a workspace, not an archive. It’s about a virtual research environment but a repository-enables VRE. Primary goal is to enhance their research capabilities, leaving trusted data as their legacy. So to deliver that we have to care about a content creation and management environment, an analysis environment and a dissemination environment. And we have to think about this as two repositories: there is the repository for the researcher, they are data owners, they set policies, they have control – crucial buy-in and crucial concept for them; And then we have to think about an interoperable gathering service – a place researcher content feeds into and also cross search/access to multiple repositories back in the other direction as these researchers work in international teams.

Key to the whole thinking is the concept of the web as the model. It’s a network of nodes that are units of content, connected by arcs that are relationships. I was attracted to Fedora because of the notion of a physical object and a way to create networks here. Increasingly content will not be sustainable as discrete packages. We will be maintaining our part of the formalized world-wide web of content. Some policies will mean we can’t share everything all the time but we have to enable that, that’s where things are going. Information objects should be ready to be linked, not copied, as policy permits. We may move things from one repository to another as data moves over to curatorial staff but we need to think of it that way.

My conceptual take here is that a data object is one unit of content – not one file. E.g. a book is one object no matter how many pages (all of which could be objects). By the way this is a prototype, this isn’t a working service, it’s a prototype to take forward. And the other idea that’s new is the “concept object”. This is an object with a metadata about the project as a whole then a series of concept objects for the components of that project. If I want to create a virtual exhibition I might build 10 concept objects for those paintings and then pull up those resources.

So if you come into a project you see a file structure idea. Theres an object at the top for the project as a whole. Your metadata overview, which you can edit, lets you define those concepts. The researcher controls every object and all definitions. The network is there, they are operating within it. You can link concepts to each other, it’s not a simple hierachy. And you can see connections already there. You can then ingest objects – right now we have about 8 concept types (e.g. “Research site, plot or area”). When you pick that you then pick which of several forms you want to use. When you click “edit” you can see the metadata editor in a simple web form prepopulated with existing record. And when you look at resources you can see any resources associated with that concept. You can upload resources without adding metadata but it will show in bright yellow to remind you to add metadata. And you can attach batches of resources – and these are offered depending where you are in the network.

And if I click in “exhibit” – a link on each concept – you can see a web version of the data. This takes advantage of the adminstrator screen but allows me to publish my work to the web. I can keep resources private if I want. I can make things public if I want. And when browsing this I can potentially download or view metadata – all those options defined by researcher’s setting of policies.


Q1 – Paul Stanhope from University of Lincoln) Is there any notion of concepts being bigger than the institution, being available to others

A1) We are building this as a prototype, as an idea. So I hope so. We are a good microcosm for most types of data – when the researcher picks that they pick metadata schemas behind the scenes. This think we built is local but it could be global, we’re building it in a way that could work that way. With the URIs othwe intstitutions can link their own resources etc.

Q2) Coming from a university, do you think there’s anything different about your institution? Is there a reason this works differently?

A2) One of the things about the Smithsonian is that all of our researchers are Federal employees and HAVE to make their data public after a year. That’s a big advantage. We have other problems – funding, the government – but policy says that the researchers have to

Q3 – Joseph Green from University College Dublin) How do you convey the idea of concept objects etc. to actual users – it looks like file structures.

A3) Well yes, kind of the idea. If they want to make messy structures they can (curators can fix). The only thing they need is a title for their concept structure. They do have a file system BUT they are building organising nodes here. And that web view is an incentive – it’ll look way better if they fill in their metadata. Thats the beginning… for tabular data objects for instance they will be required to do a “code book” to describe the variables. They can do this in a basic way or they can do better more detailed code book and it will look better on the web. We are trying to incentivise  at every level. And we have to be fine with ugly file structures and live with it.

Topic: Open Access Repository Registries: unrealised infrastructure?
Speaker(s): Richard Jones, Sheridan Brown, Emma Tonkin

I’m going to be talking about an Open Access Repositories project that we have been working on, funded by JISC, looking at what Open Access repositories are being used for and what their potential is via stakeholder interviews, via a detailed review of ROAR and OPENDOAR, and somerecommendations.

So if we thought about a perfect/ideal repository as a starting point… we asked out stakeholders what they would want. They would want it to be authoritative – the right name, the right URL; they want it to be reliable; automated; broad scope; curated; up-to-date. The idea of curation and the role of human intervention would be valuable although much of this would be automated. People particularly wanted the scope to be much wider. If a data set changes there are no clear ways to expand the registry and that’s an issue. But all of those terms are really about the core things you want to do – you all want to benchmark. You want to compare yourself to others and see how you’re doing. And in our sector and funders they want to see all repositories, what are the trends, how are we doing with Open Access. And potentially ranking repositories or universities (like Times HE rankings) etc.

But what are they ACTUALLY being used for right now? Well mainly use them for documenting their own existing repositories. Basic management info. Discovery. Contact info. Lookups for services – use registry for OAI-PMH endpoints. So that’s I think, it looks as if we’re falling a bit short! So, a bit of background on what OA repository registries there are. So we have OpenDOAR, ROAR (Registry of Open Access Repositories) – those are both very broad scope repositories, well known and well used. But there is also the Registry of Biological Repositories. There is – all research data so it’s a content type specific repository registry. And, more esoterically, is the Ranking Web of World Repositories. Not clear if this is a registry or a service on a registry. And indeed that’s a good question… what services run on registries. So things like BASE search for OAI-PMH endpoints, very similar to this is Institutional Respositories Search based at Mimas in the UK. Repository 66 is a more novel idea – mashup with Google Maps to show repositories around the world. Then there is the Open Access Repository Junction a multideposit tool for discovery and use of Sword endpoints.

Looking specifically at OpenDOAR and ROAR. OpenDOAR is run at University at Nottingham (SHERPA) and it uses manual curation. Only lists OA and Full-text repositories. It’s been running since 2005. Whereas DOAR is principally Repository Manager added records. No manual curation. And lists both full-text and metadata only. Based at University of Southampton and running EPrints 3, inc. SNEEP elements etc. Interestingly both of these have policy addition as an added value service. Looking at the data here – and these are a wee bit out of date (2011). There seems to be big growth but some flattening out in OpenDOAR in 2011 – probably approaching full coverage. ROAR has a larger number of repositories due to difference in listing but quite similar to OpenDOAR (and ROAR harvests this too). And if we look at where repositories are both ROAR and OpenDOAR are highly international. Slightly more European bias in OpenDOAR perhaps. The coverage is fairly broad and even around the globe. When looking at content type OpenDOAR is good at classifying material into types, reflective of manual curation. We expect this to change over time, especially datasets. ROAR doesn’t really distinguish between content types and repository types – it would be interesting to see these separately. We also looked at what data you typically see about the repository in any record. Most have name, URL, location etc. OpenDOAR is more likely to include a description and contact details than is the case in ROAR. Interestingly the machine to machine interfaces are a different story. OpenDOAR didn’t have any RSS or SWORD endpoint information at all, ROAR had little. I know OpenDOAR are changing this soon. This field has been added on later in ROAR and no-one has come back to update this new technology, that needs addressing.

A quick not about APIs. ROAR has OAI-PMH API, no client library, full data dump available. OpenDOAR has a fulled documented query API, no client library and full data dump available. When we were doing this work almost no one was using the APIs, they just download all data.

We found stakeholders, interviewees etc. noted some key limitations: content count stats are unreliable; not internationalised/multilingual- particularly problematic if a name is translated and is the same as but doesnt appear to be the same thing; limited revisions history; No clear relationships between repos, orgs, etc. And no policies/mechanisms for populating new fields (e.g. SWORD). So how can we take what we have and realise potential for registries? There is already good stuff going on… Neither of those registries automatically harvest data from repositories but that would help to make data more authoritative/reliable/up to date; automated; increased scope of data – and that makes updates so much easier for all.  And we can think about different kinds of quality control – no one was doing automated link checking or spell checking and those are pretty easy to do. And an option for human intervention was in OpenDOAR but not in ROAR, and that could be make available.

But we could also make them more useful for more things – graphical representaqtions of the registry; better APIs and Data (with standards compliance where relevent); versioning of repositories and record counts; more focus on policy tools.  And we could look to encourage overlaid services: repository content stats analysis; comparitive statistics and analytics; repository and OA rankings; text analysis for identifying holdings; error detection; multiple deposits. Getting all of that we start hitting that benchmarking objective.


Q1 – Owen Stephens) One of the projects I’m working on is CORE project from OU and we are harvesting repositories via OpenDOAR. We are producing stats about harvesting. Others do the same. It seems you are combining two things – benchmarking and repositories. We want OpenDOAR to be comprehensive, and we share your thoughts on need to automate and check much of that. But how do we make sure we don’t build both at the same time or separate things out so we address that need and do it properly?

A1) The review didn’t focus on structures of resulting applications so much. But we said there should be a good repository registry that allows overlay of other services – like the benchmarking services. CORE is an example of something you would build over the registry. We expect the registry to provide mechanism to connect up to these though. And I need to make an announcement: JISC, in the next few weeks, will be putting out an ITT to take forward some of this work. There will be a call out soon.

Q2 – Peter from OpenDOAR) We have been improving record quality in OpenDOAR. We’ve been removing some repositories that are no longer there – link checking doesn’t do it all. We also are starting to look at including those machine to machine interfaces. We are doing that automatically with help from Ian Stuart at EDINA. But we are very happy to have them sent in too – we’ll need that in some case

A2) you are right that link checkers are not perfect. More advanced checking services can be built on top of registries though.

Q3) I am also working on the CORE project. The collaboration with OpenDOAR where we reuse their data, it’s very useful. Because we are harvesting we can validate the repository and share that with OpenDOAR. The distinction between registries and harvesting is really about an ecosystem that can work very well.

Q4) Is there any way for repositories to register with to enable automatic discovery?

A4) We would envision something like that, that you could get all that data in a sitemap or similar.

A4 – Ian Stuart) If registering with then why not register with OpenDOAR?

A4 – chair) Well with you host the file, its just out on the web.

Q5) How about persistant URLs for repositories?

A5) You can do this. The Handle in DSpace is not a persistant URL for the repository.

Topic: Collabratorium Digitus Humanitas: Building a Collaborative DH Repository Framework
Speaker(s): Mark Leggott, Dean Irvine, Susan Brown, Doug Reside, Julia Flanders

I have put together a panel for today but they are in North America so I’ll bring them in virtually… I will introduce and then pass over to them here.

So… we all need a cute title and Collaboratory is a great word we’ve heard before. I’m using that title to describe a desire to create a common framework and/or set of interoperable tools providing a DH Scholars Workbench. We often create great creative tools but the idea is to combine and make best use of these in combination.

This is all based on Islandora. A Drupal+ Feora framework from UPEI. Flexible UI on top of Fedora and other apps. It’s deployed in over 100 institutions and that’s growing. The ultimate goal of those efforst is to release a Digital Humanities solutions packs with various tools integrated in, in a framework that would be of interest to scholarly DH context – images, video, TEI, etc.

OK so now my colleagues…

Dean is visiting professor in Yale, and also professor at Dalhousie University in Canada and part of a group that creates new versions of important modernism in canada prints. Dean: so this is the homepage for Modernist Commons. This is the ancillery site that goes with the Modernism in Canada project. One of our concerns is about long term preservation about digital data stored in the commons. What we have here is both the repository and a suite of editing tools. When you go into the commons you will find a number of collections – all test collections and samples from the last year or so. We have scans of a bilingual publication called Le Nigog, a magazine that was published in Canada. You can view images, mark-up, or you can view all of the different ways to organise and orchestrate the book object in a given collection. You can use an Internet Archive viewer or alternative views. The IA viewer frames things according to the second to last image in the object, so you might want to use an alternative. In this viewer you can look at the markup, entities, structures, RDF relations or whether you want to look at image annotations. The middle pane is a version of CWRC Writer that lets us do TEI and RDF markup. And you see the SharedCanvas tools provided with other open annotation group items. As you mark up a text you can create author authority files that can be used across collections/objects.

Next up Victoria Brown, her doctorate is on Victorian feminist literature. She currently researches collaborative systems, interface design, usability. Victoria: I’ll be talking more generally than Dean. The Canadian Writing Research Council is looking to do something pretty ambitios that only works in a collaborative DH environment. We have tools that can aim as big as we can. I want to focus on talking about a couple of things that define a DH Collaboratory. It needs to move beyond institutional repository model. To invoke persoective of librarian colleagues I want to address what makes us so weird… What’s different about us is that storing final DH materials is only part of the story, we want to find, amass, collect materials; to sort and organise them; to read, analyse and visualize. That means environments much be flexible, porous, really robust. Right now most of that work is on personal computers – we need to make these more scalable and interoperable. This will take a huge array of stakeholders buying into these projects. So a DH repository environment needs to be easy o manage, diverse and flexible. And some of these will only have a small amount of work and resources. In many projects small teanms of experts will be working with very little funding. So the CWRC Writer here shows you how you edit materials. On the right you see TEI markup. You can edit this and other aspects – entities, RDF open annotation mark up etc, notations allows you to construct triples from within the eidt. One of the ways to encourage interoperability is through use of common entities – connecting your work to the world of linked data. The idea is to increase consistency across projects with TEI markup and RDF means better metadata than the standard working in Word, publishing in HTML many use. So this is a flexible tool. Embedding this in a repository does raise questions about revisioning and archiving though. One of the challenges for repositories and DH is how we handle those ideas. Ultimately though we think this sort of tool can broaden participation in DH and collaboration in DH content. I think the converse challenge for DH is to work on more generalised environments to make sure that work can be interoperable. So we need to take something from solid and stable structure and move to the idea of shared materials – a porous silo maybe – where we can be specific to our work but share and collaborate with others.

The final speaker is Doug, he became first digital curator at NYPL. He’s currently editing music of the month blog at NYPL. Doug: the main thing we are doing is completely reconfiguring our repository to allow annotation of Fedora and take in a lot of audio nad video content. And particularly for large amounts of born digital collections. We’ve just started working with a company called BrightCove to share some of our materials. Actually we are hiring an engineer to design the interface for that – get in touch. We are also working on improved display interfaces. Right now it’s all about the idea of th egallery – the idea was that it would self-sustain through selling prints. We are moving to a model where you can still view those collections but also archival materials. We did a week long code sprint with DH developers to extend the Internet Archive book reader. We have since decided to move from that to New York Times backed reader – the NYT doc viewer with OCR and annotation there.


Q1) I was interested in what you said about the CWRC writer – you said you wnated to record every key stroke. Have you thought about SVN or GIT that do all that versioning stuff already.

A1 – Susan) They are great tools for version control and it would be fascinating to do that. But do you put your dev money into that or do you try to meet needs of greatest number of projects? But we would definitely look in that direction to look at challenges of versioning introduced in dynamic online production environments.


 July 11, 2012  Posted by at 12:27 pm LiveBlog, Updates Tagged with: , , , ,  Comments Off on P4B: Shared Repository Services and Infrastructure LiveBlog
Jul 112012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Panel Discussion Proposal: “Effective Strategies for Open Source Collaboration” Panel Proposal “Effective Strategies for Open Source Collaboration”
Speaker(s): Tom Cramer, Jon William Butcher Dunn, Valorie Hollister, Jonathan Markow

This is a panel session, so it’s a little bit different. We’ve asked all of our DuraSpace experts here about collaborations they have been engaged in and then turn to you for your experiences of collaboration – what works and what doesn’t.

So, starting with Tom. I’m going to talk about 3 different Open Source technologies we are involved in. First of these is Blacklight which is in use in many places. It’s a faceted search application – it’s Ruby-on-Rails on solr. Originally developed at UVa around 2007, 1st adopted external to UVa in 2009. It’s had multiple installations, 10+ committer institutions etc.

Hydra is a framework for creating digital asset management apps to supplement Fedora. Started in 2008 in Hull, Stanford and Virginia with FedoraCommons. It’s institutionally-driven And developer-led.

And the last item is IIIF: International Image Interoperability Framework – I’ll be talking more on this later – an initiative by major research libraries across the world – a cooperative definition of APIs to enable cross-repository image collections. It’s a standards not technology project.

Lessons learned…

DO: Work from a common vision; be productive, welcoming and fun; engineer face-time is essential; get great contributors – they lead to more great contributors too!

DON’T: over-plan, over-govern; establish too many cross institution dependencies; get hooked on single sources of funding.

Now over to Jon. A few collaborations. First up Sakaibrary. Sakai is an eLearning/course management tool used by dozens of institutions. There was a collaborative project between Indiana University and University of Michigan Libraries to develop extensions to Sakai and facilitate use of library resources in teaching and learning. Top down initiative from university head librarians. Mellon funding 2006-2008 (

The second project is Variations on Video. This one is a collaboration between Indiana University and Northwestern University Libraries – with additional partners for testing and feedback. This is a single cross institution team using AGILE Scrum approaches.

Lessons learned from these projects… Success factors: initial planning periods – shared values and vision being established – helped very much; good project leadership and relationships between leaders important; collaborative development model. Some challenges: Divergent timelines; electronic communication vs. face-to-face – very important to meet face to face; existing community culture; shifts in institutional priorities and sustainability.

Now over to Val, Director of Community Programs for DuraSpace. Part of my role is to encourage teams to collaborate and gain momentum within the DSpace community. We are keen to get more voices into the development process. We had DSpace Developer meeting on Monday and have made some initial tweaks, and continue to tweak, the programme. So what is the DSpace Community Advisory Team? Well we are a group of mostly repository managers/administrators. Developers wanted help/users wanted more input. Formed in Jan 2011, 5-7 active members. DCAT helps review/refine new feature requests – get new voices in there but also share advice, provide developer help. We had a real mission to assess feature requests, gauge interest, and enable discussion.

Some of the successes of DCAT. We have reviewed/gathered feedback on 15_ new feature requests – 3 were included in the last release. It really has broadened development discussion – developers and non-developers, inter/intra-institution. And it has been useful help/resource for developers – community survey by DCAT and provided recommendation on the survey. Feedback on feature implementation.

Challenges for us: no guarantee that a feature makes it in – despite everyone’s efforts features still might not make it in, because of resource limitations; continue to broaden discussion and broaden developer pool; DCAT could also be more helpful during the release process itself – to help with testing, working out bugs etc.

So the collaboration has been successful with discussion and features but continue to do better at this!

Now Jonathan is asking the panel: how important is governance in this process? How does decision making take place?

Tom: Different in different communities. And bottom up vs. top down makes a big difference. In bottom up it’s about developers working together, trusting each other, building the team but keeping code quality is challenging on a local and broader level for risk averse communities.

Jon: governance different between the two projects. In both cases we did have a project charter of sorts. for Sakaibrary it was more consensus based – good in some ways but maybe a bit less productive as a project as a result. In terms of prioritisation of features in the video project we are making use of the scrum concept really and the idea of product owners is very useful there. We try to involve whole team but product owner define priorities. When we expand to other institutions with their own interests we may have to explore other ways of doing things – we’ll need to learn from Hydra etc.

Val: I think DCAT is a wee bit different. Initially this was set up between developers and DCAT and that has been an ongoing conversation. Someone taking the lead on behalf of developers was useful. And for features DCAT members tend to take the lead on a particular request or other to lead analysis etc. of it.


Q1) In a team development effort there is great value to being able to pop into someone’s office and ask for help. And lots of decisions made for free – a discussion really quickly. When collaborative even a trivial decision can mean a 1 hr conference call. How do you deal with that.

A1 – Jon) In terms of the video project we take a couple of approaches – we use IRC channel and Microsoft Link for one-t0-one discussion as needed. We also have daily 15 min stand up meeting via telephone or video conference. And that agile approach with 2 week cycles means it’s not hugely costly to take the wrong approach or find we want to change something.

A1 – Tom) With conference calls we now feel if it takes an hour we shouldn’t make that decision. Move to IRC rather than email is a problem in different time zones. Email lets you really consider things through and that’s no bad thing.. one member of the Blacklight community is loquacious but often answers his own questions inside of an hour! you just learn how to work together.

A1 – Jonathan) We really live on Skype and that’s great. But I miss water cooler moments, tacit understandings that develop there. There’s no good substitute for that.


Topic: High North Research Documents – a new thematic and global service reusing all open sources
Speaker(s): Obiajulu Odu, Leif Longva

Our next speakers are from the University of Tromso. The High North Research Documents is a project we began about six months ago. You may think that you are high in the North but we are from far arctic Norway. This map gives a different perspective on the globe, on the north. We often think of the north as the North of America, of Asia etc. but the far north is really a region of it’s own.

The Norwegian government has emphasized the importance of northern areas and the north is also of interest on an international level – politically and strategically; environmental and climate change issues; resource utilization; the northern sea route to the Pacific. And our university, Tromso, is the northernmost university in the world and we are concerned with making sure we lead research in the north. And we are involved in many research projects but there can be access issues. The solution is Open Access research literature and we thought that it would be a great idea to look at the metadata to extract a set of documents concerned with High North research.

The whole world is available through aggregators like OAIster (OCLC) and BASE (University of Bielefeld) and they have been harvesting OA documents across the world. We don’t want to repeat that work. We contacted the guys a Bielefeldand they were very useful. We have been downloading their metadata local allowing us to do what we wanted to do to analyse the metadata.

Our hypothesis was if we selected a set of keywords and they are in the metadata then the thematic scope of the document can be identified. So we set up a set of filtering words (keywords) applied to the metadata of BASE records based on: geographic terms; species names; languages and folks (nations); other keywords. We have mainly looked for English and Norwegian words, but there is a bigger research world out there.

The quality of keywords is an issue – are their meanings unambiguous. Labrador for instance for us is about Northern Canada, it has a different meaning – farmer or peasant – in Spanish. Sami is a term for people but it is also a common given name in Turkey and Finland! So we have applied keywords filtering a selection of elements – so “sami AND language” or “sami AND people”. The filter process is applied only to selected metadata elements – title, description, subject. But it’s not perfect.

Looking at the model we have around 36 million documents from 2150 scholarly resources. These are filtered, extracted. And one subset of keywords go right into the High North Research Documents database. Another set of keywords we don’t trust as much so they go through a manual quality control first. Now over to my colleague Obiajulu.

Thank you Leif. We use a series of modules in the High North System model. The Documents service itself is DSpace. The import module gets metadata records and puts them in our MySQL database. After documents are imported we have the extraction module – applies the extraction criteria on the metadata. The Ingest module transforms metadata records relevant to the high north into DSpace XML format and imports them into a DSpace repository. And we have the option of addicting custom information – including use of facets.

Our Admin Module allows us to add, edit or display all filtering words (keywords). And it allows us to edit the status of a record or records – Blacklist/reject; approved; modified. So why do we use DSpace? Well we have used it for 8 or 9 years to date. It provides end use with both a regular search interface and faceted search/browsing. Our search and discovery interface is an extension of DSpace and it allows us to find out about any broken links in the system.

We are on High North RD v 1.1. 151,000 documents extracted from more than 50% of the sources appealing in BASE and from all over the world. Many different languages – even if we apply mainly English and Norwegian and Latin in the filtering process. Any subject but weight on the hard sciences. And we are developing the list of keywords as a priority so we have more and better keywords.

When we launched this we tried to get word out as far and wide as possible. Great feedback received so far. The data is really heterogeneous in quality, full text status etc. so feedback received has been great for finding any issues with access to full text documents.

Many use their repository for metadata only. That would be fine if we could identify where a record is metadata only. We could use the dc:rights but many people do not use this. How do we identify records without any full text documents. We need to weed out many non-OA records from High North RD – we only want OA documents, it’s not a bibliographic service we want to make. Looking at document types we have a large amount of text and articles/journals but also a lot of images (14-15% ish). The language distribution shows English. Much smaller percentage in French, Norwegian… and other languages.

So looking at the site ( It’s DSpace and everything in it is included in a single collection. So… if I search for pollution we see 2200 results and huge numbers of keywords that can be drilled down into. You can filter by document type, date, languages etc.

And if we look at an individual record we have a clear feedback button that lets users tell us what the problem is!


Q1) You mentioned checking quality of keywords you don’t trust, and that you have improvements coming to keywords. Are you quality checking the “trusted” keywords.

A1) When we have a problem record we can track back over the keywords and see if one of those is giving is giving us problems, we have to do that this way.

We believe this to be a rather new method, to use keywords in this way to filter content. We haven’t come across it before, it’s simple but interesting. We’d love to hear about any other similar system if there are any. And it would be applicable to any topic.

Topic: International Image Interoperability Framework: Promoting an Ecosystem of Open Repositories and Open Tools for Global Scholarship
Speaker(s): Tom Cramer

I’m going to talk about IIIF but my colleagues here can also answer questions on this project. I think it would be great to get the open repositories community involved in this process and objectives.

There are huge amounts of image resources on the web – books, manuscripts, scrolls, etc. Loads of images and yet really excellent image delivery is hard, it’s slow, it’s expensive, it’s often very disjointed and often it’s too ugly. If you look at bright spots: CDragon, Google Arts, or other places with annotation or transcription it’s amazing to see what they are doing vs. what we do. Its like page turners a few years ago – there were loads, all mediocre. Can we do better?! And we – repositories, software developers, users, funders – all suffer because of this stuff.

So consider…

… a paleographer who would like to compare scribal hands from manuscripts at two different repositories – very different marks and annotations.

— an art and architecture instructor trying to assemble a teaching collection of images from multiple sources..

… a humanities scholar who would like to annotate a high resolution image of an historical map – lots of good tools but not all near those good resources.

… a repository manager who would like to drop a newspaper viewer with deep zoom into her site with no development of customization required

… a funder who would like to underwrite digitization of scholarly resources and decouple content hosting and delivery.

We started last September a year long project to look at this – a group of 6 of the worlds leading libraries and Stanford. Last September we looked at the range of different image interfaces. Across our 7 sites there were 15 to 20 interfaces, including Oxford it was more like 40 or 50 interfaces. Oxford seems to have lots of legacy humanities interfaces – lovely but highly varied – hence the increase in numbers.

So we want specialised tools but less specialised environment. So we have been working on Parker on the web project – mediaeval manuscripts project with KCL and Stanford. the La Munda Le Rose is similar in type. Every one of these many repositories is a silo – no interoperability. Every one is a one-off – big overhead to code and keep. And every user is forced to cope – many UIs, little integration. no way to compare one resource with another. They are great for researchers who fed into the design but much less useful for others.

Our problem is we have confused the role of responsibilities of the stakeholders here. We have scholars who want to find, use, analyze, annotate. they want to mix and match, they want best of breed tools. We have toolers – build useful tools and apps – want users and resources. And we have the repositories who want to host, preserve and enrich records.

So for the Parker project we had various elements managed via APIs. We have a TPEN transcription tool. We sent TPEN hard drive full of Tiffs to work on. Dictionary of Old English, they couldn’t take a big file of TIFFs but we gave them access to the database. We also had our own app. So our data fed into three applications here and we could have taken the data on some round trips – adding annotations before being fed into database. And by taking those APIs into a Framework and up into an Ecosystem we could enable much more flexible solutions – ways to view resources in the same environment.

So we began some DMS Tech work. We pulled together technologists from a dozen institutions to look at best tools to use, best adaptations to make etc. and we came up with basic building blocks for ecosystem: image delivery API (speced and built); data model for medieval manuscripts (M3/SharedCanvas) – we anticipate people wanting to page through documents – for this type of manuscript the page order, flyleafs, inserts etc. are quite challenging; support for authentication and authorization – it would be great if everything was open and free but realistically it’s not; reference implementations of load balanced, performant Djatoka server – this seemed to be everyone’s page turning software solution of choice; interactive open source page turning and image viewing application; OAC-compatible tools for Annotation (Digital Mappaemundi) and transcription (T-PEN).

We began the project last October, some work already available. the DMS Index pulls data from remote repositories and you can explore in a common way as the data is structured in a common way. You can also click to access annotation tools in DM, or to transcribe the page from TPEN etc. So one lets you explore and interact with this diverse collection of resources.

At the third DMS meeting we started wondering if, if this makes sense for manuscripts, doesn’t this make sense for other image materials. IIIF basically takes the work of DMS and looks how we can bring these to the wider world of images. We’ve spent the least 8 or 9 months putting together the basic elements. So there is a Restful interface to pic up an image from a remote location. We have a draft version of the specification available for comment here: What’s great is the possibility to bring in functionality on images into your environment that you don’t already offer but would like to. Please do comment into 0.9 proclamation you have until 4pm Saturday (Edinburgh time).

The thing about getting images into a common environment is that you need metadata. We want and need to focus on just what the key metadata needs to be – labels, title, sequence, attribution etc. Based on (synthesis of OAC (open annon. collab) and DMS.

from a software perspective we are not doing software development but we hope to ferment lots of software development. So we have thought of this in terms of tiers for sharing images. Lots of interest in Djatoka, IIIF Image API and then sets of tools for deep panning, zooming, rotating etc. And then moving into domain and modality specific apps. And so we have a wish list for what we want to see developed.

This was a one year planning effort – Sept 2011 – Aug 2012. We will probably do something at DOF as well. We have had three workshops. We are keen to work with those who want to expose their data in this sort of way. Just those organisations in the group have millions of items that could be in here.

So… What is the collective image base of the Open Repository community? What would it take to support IIIF APIs natively from the open repository platforms? What applications do you have that could benefit from IIIF? What use cases can you identify that could and should drive IIIF? What should IIIF do next? Please do let us know what we could do or what you would like us to do.

Useful links: IIIF:; DMS Interop:; Shared-canvas:


Q1) Are any of those tools available, open source?

A2) T-Pen and DM are probably available. Both Open Source-y. Not sure if code distributed yet. Shared Canvas code is available but not easy to install.

Q2) What about Djatoka and improved non buggy version?

A2) There is a need for this. Any patches or improvements would be useful. There is a need and no-one has stepped up to the plate yet. We expect that as part of IIIF that we will publish. The national library of Norway rewrote some of the coding in C, which improved performance three-fold. They are happy to share this. It is probably open source but hard to find the code – theoretically open source.

And with that we are off to lunch…

 July 11, 2012  Posted by at 10:03 am LiveBlog, Updates Tagged with:  Comments Off on P3B: Open Source: Software and Frameworks LiveBlog
Jul 112012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Built to Scale?
Speaker(s): Edwin Shin

I’m going to talk about a project I recently worked on with a really high volume read vs. write. 250 million records – largest blacklight solr application. Only took a couple of days with Solr to index these but for reasonable query performance thresholds things get more complex. The records were staged in a relational database (Postgres). And around 1KB/record (bibliographic journal data). There are some great documented examples here that helped us. And we had a good environment – 3 servers each with 12 physical cores, 100GB RAM. We just moved all that data from Postgres – 80Gig compressed took a long time. The rate of ingest of the first 10K records if constant, suggested that all 250 million could be achieved in under a day but performance really slowed down after the first

Assign 32 GB of heap to JVM – we found RAM has more impact than CPU. Switch to Java 7. Adding documents in batches of 1000. We stopped forcing commits and only did this every 1 million documents. So in the end we indexed, to the level we wanted, 250 million documents in 2.5 days. We were pretty happy with that.

Querying – we were working with 5 facets (Format, Journal, Author, Year and Keywords) and 7 queryable fields. Worst Case was just under a minute. Too sow. So we optimised querying by running optimize after index. Added newSearcher and firstSearcher event handlers. But it was still slow. We started looking at sharing. 12 shards across 2 servers: 3 Tomcat instances per server, each Tomcat with 2 shards. This means splitting the index across machines and Solr is good at letting you do that and search all the shards at once. So our worst case query dropped from 77 seconds to 8 seconds.

But that’s till too slow. We noticed that filterCache wasn’t being used much, it needed to be bigger. Each shard had about 3 million unique keyword terms cached. We hadn’t changed default level at 512. We bumped it to about 40,000. We removed facets with large number of unique terms (e.g. keywords). The Worst case queries were now down to less than 2 seconds.

The general theme is that there was no one big thing we did or could do, it was about looking at the data we were dealing with and making the right measures for our set up.

We recently set up a Hydra installation, again with a huge volume of data to read. We needed to set up ingest/update queues with variable “worker” threads. It became clear that Fedora was the bottleneck for ingest. Fedora objects created programmatically rather than by FOXML documents – making it slower. The latter would have been fast but would have caused problems down the road, less flexibility etc. Solr performed well and wasn’t a bottleneck. But we got errors and data corruption in Fedora when we had 12-15 concurrent worker threads. What was pretty troublesome was that we could semi replicate this in staging for ingesting. But we couldn’t get a test case and never could get to the bottom of this. So we worked around it… and we decided to “shard” a standalone Fedora repository. It’s not natively supported so you have to do it separately. Sharding is handled by ActiveFedora using a simple hashing algorithm to shard things. We started with just 2 shards and use an algorithm much as Fedora uses internally for distributing files. We get on average a pretty even distribution across the Fedora repositories. This more or less doubled ingest performance without any negative impact.

project at the end of last year. A project with 20 million digital objects. 10-39 read transactions per second 24/7. High availability required, no downtime for reads. no more than 24 hours downtime for writes. Very challenging set up.

So, the traditional approach for high up time is the Fedora Journaling Module that allows you to ingest once to many “follower” installations. Journaling is proven, it’s a simple and straightforward design. Every follower is a full redundant node. But that’s also a weakness. Every follower is a full, redundant node – huge amounts of data and computationally expensive processes that happen on EVERY node, that’s expensive in terms of time, storage, traffic. And this approach assumes a Fedora-centric architecture. If you have a complex set up with other components this is more problematic still.

So we modeled the journaling and looked at what else we could do. So we set up the ingest that was replicated but then fed out to a Fedora shared file system and was fed into nodes but not doing FULL journaling.

But backups, upgrade and disaster recovery. But with 20 million digital objects. The classic argument for Fedora is that you can always rebuild. In a disaster that could take months here, though. But we found that most users used new materials – items from the last year – so we did some working around to make that disaster recovery process faster.

Overall the general moral of the story is you can only make these types of improvements if you really know

Q1) What was the garbage collector you mentioned?

A1) G1 Garbage collector that comes with Java 7

Q2) Have you played with the chaos monkey idea? Netflix copies to all its servers and it randomly stops machines to train the programming team to deal with that issue. It’s a neat idea.

A2) I haven’t played with it yet, I’ve yet to meet a client who would let me play with that but it is a neat idea.

Topic: Inter-repository Linking of Research Objects with Webtracks
Speaker(s): Shirley Ying Crompton, Brian Matthews, Cameron Neylon, Simon Coles

Shirley from STFC – Science and Technology Facilities Council. We run large facilities for researchers. We run manage a huge amount of data every year and my group runs the e-Infrastructure for these facilities – including the ICAT Data Catalogues, E-publications archive and Petabyte Data Store. We also contribute to data management, data preservation etc.

Webtracks is a joint programme between STFC and University of Southampton. This is a Web-scale link TRACKing for research data and publications. Science on the web increasingly involves the use of diverse data sources and services plus objects. And this ranges from raw data from experiments through to contextual information, lab books, derived data, research outputs such as publications, protein models etc. When data moves from research facility to home institution to web materials areas we lose that whole picture of the research process.

Linked data allows us to connect up all of these diverse areas. If we allow repositories to communication then we can capture the relationship between research resources in context. It will allow different types f resources to be linked within a discipline – linking a formal publication to on-line blog posts and commentary. Annotation can be added to facilitate intelligence linking. It allows researchers to annotate their own work with materials outside their own institution.

Related protocols here: Trackback (tracking distributed blog conversations, with fixed semantics), Semantic Pingback (RPC protocol using P2P).

In webtracks we took a two pronged approach: inter-repository communications pool and a Reslet Framework. The InteRCom protocol that allows repositories to connect and describe their relationship (cito: isCitedby). InteRCom is a two stage protocol like Trackback, first harvesting of resources and metadata. Then pinging process to post the link request. The architecture is based on the Restlet Framework (with a data layer access, app-spec config (security, encoding, tunneling), resource wrapper. This has to connect to many different institutional policies – whitelisting, pingback (and checking if this is a genuine request), etc. Lastly you have to implement the resource cloud to expose the appropriate links.

Webtracks uses a Resource Info Model. A repository connected to a resource, to a link and each link has subject, predicate and object. The link can be updated and tracked automatically using HTTP. We have two exemplars being used with WebTracks the ICAT investigation resource – DOI landing page, and HTML rep with RDFa – so a machine and human readable version. The other exemplar is EPubs set up much like ICAT.

InterCom Citation Linking – we can see on the ICAT DOI landing page linking to Epubs expression links pae. That ICAT DOI also links to ICAT Investigation links page and in turn that links to Epubs expression page. And that expression page feeds back into the Epubs Expression links page.

Using the Smart Research Framework we have integrated services to automate prescriptive research workflow – that attempt to preemptively catch all of the elements that make up the research project, including policy information, to allow the researcher to concentrate on their core work. That process will be triggered at STFC and will capture citation links in the process.

To summarise, Webtracks provides a simple but effective mechanism to facilitate propagation of citation links to provide a linked web of data. It links diverse types of digital research objects. To restore context to dispersed digital research outputs. No constraints on link semantics and metadata. It’s P2P, does not rely on centralised service. And it’s a highly flexible approach.

Topic: ResourceSync: Web-based Resource Synchronization
Speaker(s): Simeon Warner, Todd Carpenter, Bernhard Haslhofer, Martin Klein, Nettie Legace, Carl Lagoze, Peter Murray, Michael L. Nelson, Robert Sanderson, Herbert Van de Sompel

Simeon is going to be talk about resource synchronization. We are a big team here and have funding from the Sloan Foundation and from JISC. I’m going to be talk about discussions we’ve been having. We have been working on the ResourceSync project, looking at replication of web material… it sounds simple but…

So… synchronization of what? Well web resources – things with a URI that can be dereferenced and are cache-able. Hidden in that is something about support for different representations, for content negotiations. No dependency on underlying OS, technologies etc. For small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources). We want this to be properly scalable to large resources or large collections of resources. And then there is the factor of change – is it a slow change (weeks/month) for an institutional repository maybe, or very quickly (seconds) – like a set of linked data URIs, and where there needs to be latency there. And we want this to work on/via/native to the web.

Why do this? Well because lots of projects are doing  synchronization but do so case by case. The project teams are involved in these projects. Lots of us have experience with OAI-PMH, it’s widely used in repository but XML metadata only and web technologies have moved on hugely since 1999. But there are loads of use cases here with very different needs. We had lots of discussion and decided that some use cases were not but some were in scope. That out of scope for now list is: bidirectional syncronisation; destination-defined selective synchronization (query); special understanding of complex objects; bulk URI migration; Diffs (hooks?) – we understand this will be important for large objects but there is no way to do this without needing to know media types; intra-operation event tracking; content tracking.

So a use case: DBPedia Live duplication. 20 million entries updated once per second. We need push technology, we can be polling this all the time.

Another use case: arXiv mirroring  1 million article versions. about 800 created each day and updated at 8pm US eastern time. metadata and full text for each article. Accuracy very important. want low barrier for others to use. Works but currently use rsync and that’s specific to one authentication regime.

Terminology here:

  • Resource – inject to be synchronizes, a web resource
  • Source – system with the original or master resources
  • Destination – where synchronised to
  • Pull
  • Push
  • Metadata – information about resources such as URI, modification time, checksum etc. Not to be confused with metadata that ARE resources.

We believe there are 3 basic needs to meet for syncronisation. (1) baseline synchronisation – a destination must be able to perform an initial load or catch-up with a source (to avoid out of band setup, provide discovery). (2) Incremental synchronization – destination must have some way to keep up-to-date with changes at a source (subject to some latency; minimal; create/update/delete). (3) Audit – it should be possible to determine whether a destination is synchronised with a source (subject to some latency; want efficiency –> HTTP HEAD).

So two approaches here. We can get an inventory of resources then copy one by one via HTTP GET. Or we can get a dump of the data and extract metadata.  For auditing we could do new Baseline synchronization and compare but likely to be very inefficient. Can optimize by adding getting an inventory and compare copy with destination – using timestamp, digest etc. smartly, a latency issue here again to consider.  And then we can think about Incremental Synchronisation. The simplest method would be audit then copy all new/updated resources plus removal of deleted. Optimize this by changing communication – exchange ChangeSet listing only updates; Resource Transfer – exchange dumps for ChangeSets of even diffs; and Change Memory.

We decided for simplicity to use Pull but some applications may need Push. And we wanted to think about the simplest idea of all: SiteMaps as an inventory. So we have a framework based on Sitemaps. On level 0, the base level. Publish a sitemap and someone can grab all of your resources. A simple feed of URL and last modification date lets us track changes. Sitemap format was designed to allow extension. It’s deliberately simple and extensible. There is an issue about size. The structure is for a list of resources that handles up to 2.5 billion resources before further extension required. Should we try to make this looks like RDF we expect? We think no but map Sitemap RDF to RDA.

At the next level we look at a ChangeSet. This time we reuse Sitemap format but include information only for change events over a certain period. To get a sense of how this looks we tried this with ArXiv. Baseline synchronisation and Audit: 2.3 million resources (300GB); 46 sitemaps and 1 sitemapindex (50k resources/sitemap).

But what if I want Push application that will be quicker? We are trying out XMPP (as used by Twitter etc.) and lots of experience and libraries to work with for this standard. So this model is about rapid notification of change events via XMPP Push. They trialed his at LiveDBpedia. LANL Research Library ran a significant scale experiment of LiveDBPedia database from Los Alamos to two remote sites using XMPP to push notifications.

One thing we haven’t got to is dumps. Two thought so far… Zip file with Sitemap – simple and widely used format  – but custom solution. The other possibility is WARC – Web ARCiving format. Designed for just this purpose but not widely used. We may end up doing both.

Real soon now a rather extended and concrete version of what I’ve said will be made available. First draft of sitemap-based spec is coming July 2012. We will then publicize and want your feedback, revision and experiments etc in September 2012. And hopefully we will have a final specification in August.


Q1) Wouldn’t you need to make a huge index file for a site like ArXiv?

A1) Depends on what you do. I have a program to index ArXiv on my own machine and it takes an hour but it’s a simplified process. I tested the “dumb” way. I’d do it differently on the server. But ArXiv is in a Fedora repository so you already have that list of metadata to find changes.

Q2) I was wondering as you were going over the SiteMap XML… have you considered what to do for multiple representations of the same thing?

A2) It gets really complex. We concluded that multiple representations with same URI is out of scope really.

Q3) Can I make a comment – we will be soon publishing use cases probably on a Wiki and that will probably be on GitHub and I would ask people to look at that and give us feedback.

 July 11, 2012  Posted by at 8:26 am LiveBlog, Updates Tagged with:  Comments Off on P2A: Repository Services LiveBlog
Jul 102012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Moving from a scientific data collection system to an open data repository
Speaker(s): Michael David Wilson, Tom Griffin, Brian Matthews, Alistair Mills, Sri Nagella, Arif Shaon, Erica Yang

I am here presenting on behalf of myself and my colleagues from the Science and Technology Facilities Council. We run facilities ranging from CERN Large Hadron Collider to the Rutherfod Appleton Laboratory. I will be talking about the ISIS Facility, which is based at Rutherford. People put in their scientific sample and that crystal goes into the facility and then it may examine that crystal for anything from maybe an hour to a few days. The facility produces 2 to 120 files per experiment in several formats including NeXus, RAW (no, not that one, a Rutherford Appleton format). In 2009 we had run 834 experiments, 0.5 million files, 0.5Tb of data. But that’s just one facility. We have petabytes of data across our facilities.

We want to maximise the value of STFC data, as Cameron indicated in his talk earlier it’s about showing the value to the taxpayer.

  1. Researchers want to access their own data
  2. Other researchers validate published results
  3. Meta-studies incorporating data – reuse or new subsets of data can expand the use of the original intent for data
  4. Set experimental parameters and test new computational models/theories
  5. User for new science not yet considered – we have satellites but the oldest climate data we have is on river depth, collected 6 times a day. Its 17th century data but it has huge 21st century climate usefulness. Science can involve uses of data that is radically different than original envisioned
  6. Defend patents on innovations derived from science – biological data, drug related data etc. is relevant here.
  7. Evidence based policy making – we know they want this data but what the impact of that is maybe arguable.

That one at the top of the list (1) is the one we started with when we began collecting data. We started collecting about 1984. The Web came along about 1994 1995 and by 1998 researchers could access their own data on the web – they could find the data set they had produced using an experiment number. It wasn’t useful for others but it was useful for them. And the infrastructure reflected this. It was very simple. We have instrument PCs as the data acquisition system, there was a distributed file system and server, delivery and the user.


Moving to reason (2) we want people to validate the published results. We have the raw data from the experiment. We have calibrated data – that’s the basis for any form of scientific analysis. That data is owned by the facility and preserved by the facility. But the researchers do the data analysis at their own institution. The publisher may eventually share some derived data. We want to hold all of that data, the original data, the calibration data, and the derived data. So when do we publish data? We have less than 1% commercial data so that’s not an issue. But we have data policies (different science, difference facilities, different policy) around PhD period largely so we have a 3 year data embargo. It’s generally accepted by most of our users now but a few years ago were not happy with that. We do keep a record of who accesses data. And we embargo metadata as well as data as if it’s known, say, that a drug company supports a particular research group or university a competitor may start copying the line of inquiry even on the basis of the metadata… don’t think this is just about corporates though… In 2004 a research group in California arranged a meeting about a possible new planet, some researchers in Spain looked at the data they’d been using and reasoning that that research team had found a planet announced that THEY had found a planet. It’s not just big corporations; academics are really competitive!


But when we make the data available we make it easy to discover that data and reward it. For any data published we create a Data DOI that enables Google to find the page but also in the UK HEFCE have said that the open access research dataset use will be allowed in new REF. And data will also be going into the citation index that is used in the assessment of research centres.


So on our diagram of the infrastructure we now have metadata and Data DOI added.


Onto (3) and (4). In our data we include schedule and proposal – who, funder, what etc. that goes with that data. Except about 5% don’t do what they proposed so mostly that job is easily done but sometimes it can be problematic. We add publications data and analysis data – we can do this as we are providing the funding, facility and tools they are using. The data can be searched via Datacity. Our in-house TopCat system allows in-house browsing as well. And we’ve added new elements to infrastructure here.


Looking at (5), (6) and (7) new science, patents, policy. We are trying to find socio-economic impact into the process. We have adopted a commercial product called Tesella Safety Depositr Box with Fixity checks. We have a data format migration. And we have our own long term storage as well.


So that infrastructure looks more complex still. But this is working. We are meeting our preservation objectives. We are meeting the timescale of objectives (short, medium, long). Designated communities, additional information, security requirements are met. We can structure a business case using these arguments.



Q1) Being a repository major I was interested to hear that over the last few years 80% of researchers had gone from unhappy at sharing data to most now being happy. What made the difference?

A1) The driver was the funding implications of data citations. The barrier was distrust in others using or misinterpreting their data but our data policies helped to ameliorate that.

Topic: Postgraduate Research Data: a New Type of Challenge for Repositories?
Speaker(s): Jill Evans, Gareth Cole, Hannah Lloyd-Jones

I am going to be talking about Open Exeter project. This was funded under the Managing Research Data programme and was working as a pilot biosciences research project but we are expanding this to other departments. We created a survey for researchers to comment on Post Graduates by Research (PGRs) and researchers. We have created several different Research Data Management plans, some specifically targeted at PGRs. We have taken a very open approach to what might be data, and that is informed by that survey.

We currently have three repositories – ERIC, EDA, DCO – but we plan to merge these so that research is in the same place from data to publications.  We will be doing this with DSpace 1.8.2 and Oracle 11g database system. We are using Sword2 and testing various types of upload at the moment.

The current situation is that thesis deposit is mandatory for PGRs but not deposit of data. There is no clear guidance or strategy for this nor a central data store for this. But there is no clear strategy for deposit for large size files and deposits of this kind are growing. But why archive PGR data? Well enhanced discoverability is important especially for early career researchers, raised research profile/portfolio is also good for the institution. There is also an ability to validate findings if queried – good for institution and individual.  And this allows funder compliance – expected for a number of funders including the Wellcom Trust. And the availability of data on open access allows fuller exploitation of data and enables future funding opportunities.

Currently there is very varied practice. One issue is problem of loss of data – this has impact on their own work but increasingly PGRs are part of research groups so lacking access can be hugely problematic. Lack of visibility – limits potential for reuse, lack of recognition. And Inaccessibility can mean duplication of effort and inaccessibility can block research that might build on their work.

The solution will be to support deposit of big data alongside thesis. It will be a simple deposit. And a long term curation process will take place that is file agnostic and provides persistent IDS. Awareness raising and training will take place and we hope to embed cultural change in the research community. This will be supported by policy and guidance as well as a holistic support network.

The policy is currently in draft and mandates deposit if required by funder; encourages in other cases. We hope the policy will be ratified by 2013. There are various issues that need to addressed though:

  • When should data be deposited
  • Who checks data integrity
  • IP/Confidentiality issues
  • Who pays for the time taken to clean and package the data? This may not be covered by funders and may delay their studies but one solution may be ongoing assessment of data throughout the PGR process.
  • Service costs and sustainability.

Find out more here



Q1, Anthony from Mont Ash) How would you motivate researchers to assess and cleanse data regularly?

A1) That will be about training. I don’t think we’ll be able to check individual cases though.

Q2, Anna Shadboldt, University of NZ) Given what we’re doing across the work with data mandates is there a reason

A2) We wanted to follow where the funders are starting to mandate deposit but all students funded by the university will also have to deposit data so that will have wider reach. In terms of self-funded students we didn’t think that was achievable.

Q3) Rob Stevenson, Los Alamos Labs) Any plans about different versions of data?

A3) Not yet resolved but at the moment we use handles. But we are looking into DOIs. The DOI system is working with the Handle system so that Handle will be able to deal with DOI. But versioning is really important to a lot of our potential depositors.

Q4 Simon Hodson from JISC) You described this as applying to PG students generally. Have you worked on a wider policy to wider research communities? Have there been any differences with supervisors or research groups approach this?

A4) We have a mandate for researchers across the university. We developed a PGR policy separately as they face different issues. In general supervisors are very pro preserving student data as reuse and use as this problem within research projects has arisen before. We have seen PGRS are generally pro this, researchers it tends to vary greatly by discipline.

More information:, project team: and draft policies are at and

Topic: Big Data Challenges in Repository Development
Speaker(s): Leslie Johnston, Library of Congress

A lot of people have asked why we are at this sort of event, we don’t have a repository, we don’t have researchers, we don’t fund research. Well we actually do have a repository of a sort. We are meant to store and preserve the cultural output of the entire USA. We like to talk about our collections as big data. We have to develop new types of data that are very different to our old service model. We have learned that we have no way of knowing how our collections will be used. We talked about “collections” or “content” or “items” or “files”. But recently we have started to talk about and think about our materials as data. We have Big Data in libraries, archives and museums.

We first looked into this via Digging into Data Challenge through the National Endowment for the Arts and Humanities. This was one of the first introductions to our community, the libraries, archives and museums community, that research are interested in data – including bulk corpora – in their research.

So, what constitutes Big Data? Well the definition is very fluid and a moving target. We have a huge amount of data – 10-20TB per week per collection. We still have collections but what we also have is big data, which requires us to rethink the infrastructure that is needed to support Big Data services. We are used to mediating the researchers experience so the idea that they will use data without us knowing perhaps is radically different.

My first case study is our web archives. We try to collect what is on the web but it’s about heavily curated content around big events, around specific topics etc. When we started this in 2000 we thought researchers would be browsing to see how websites used to look. That’s not the case. People want to data mine the whole collection and look for trends = say for elections for instance. This is 360TB right now, billions of files. How do we curate and catalogue these? And how do we make them accessible? We also have an issue that we cannot archive without permission so we have had to get permission for all of these and in some cases the pages are only available on a terminal in the library.

Our next case study is our historic newspapers collections. We have worked with 25 states to bring in 5 million page images from historic newspapers all available with OCR. This content is well understood in terms of ingest. It’s four image files and an OCR file and a METS file and a MEDs file. But we’ve also made data available as an API. You can download all of those files and images if you want.

Case Study – Twitter. The twitter archive has tens of billions (21 billions) files in it. We are still somewhat under press archive. We received 2006-2010 archive this year. We are just now working with it. We have had over 300 research requests already in the two years since this was announced. This is a huge scale of research requests. This collection grows by tens of millions of items per hour. This is a tech and infrastructure challenge but also a social and training challenge. And under the terms of the gift researchers will have to come into the library, we cannot put this on the open web.

Case study – Viewshare. A lot of this is based on the SIMILE toolkit from MIT. This is a web tool to upload and share visualisations of metadata. It’s on sourceforge – all open access. Or the site itself: Any data shared is available as a visualisation but also, if depositor allows, the raw data. What does that mean for us?

We are working with lots of other projects, which could be use cases. Electronic journal articles for instance – 100GB with 1 million files. How about born-digital broadcast television? We have a lot of things to grapple with?

Can each of our organisations support real-time querying of billions of full text items? Should we provide the tools?

We thought we understood ingest at scale until we did it. Like many universities access is one thing, actual delivery is enough. And then there are fixities and check sums, validating against specifications. We killed a number of services attempting to do this. We are now trying three separate possibilities: our current kit, on better kit and on amazon cloud services. About ingest AND indexing. Indexing is crucial to making things available. How much processing should we do on this stuff? We are certainly not about to catalogue tweets! But expectations of researchers and librarians are about catalogues. This is a full text collection, and it will never be catalogued. It may be one record for the whole collection. We will do some chunking by time and in their native JSON. I can’t promise when or how this stuff will be happening.

With other collections we are doing more. But what happens if one file is corrupted? Does that take away from the whole collection? We have tried several tools for analysis – BigInsights and Greenplum. Neither is right yet though. We will be making files discoverable but we can’t handle the download traffic… we share the same core web and infrastructure as and etc. Can our staff handle these new duties or do we leave researchers to fend for themselves? We are mainly thinking about unmediated access for data of this type? We have custodial issues here? Who owns Twitter – it crosses all linguistic and cultural boundaries.


Q1) What is the issue with visiting these collections in person?

A1) With the web archives you can come in and use them. Some agreements allow take away of that data, some can only be used on-site. Some machines with analytics can be used. We don’t control access to research based on collections however.

Q2) You mentioned the Twitter collection. And you are talking about self-service collections. And people say stupid stuff there

A2) We only get tweets, we get username, we know user relations but we don’t get profile information or their graph. We don’t get most of the personal information. I’ve been asked if we will remove bad language – no. Twitter for us is like diaries, letters, news reporting, citizen journalism etc. We don’t want to filter this. There was a court case decided last week in New York that said that Twitter could be subpoenaed to give over a users tweets – we are looking at implications for us. But as we have 2006-10 archive this is less likely to be of interest. And we have a six month embargo on all tweets and any deleted tweets or deleted accounts won’t be making available. That’s an issue for us actually; this will be a permanently redacted archive in some ways.

Topic: Towards a Scalable Long-term Preservation Repository for Scientific Research Datasets
Speaker(s): Arif Shaon, Simon Lambert, Erica Yang, Catherine Jones, Brian Matthews, Tom Griffin

This is very much a follow up to Micheals talk earlier as I am also at the Science and Technologies Facilities Council. The pitch here is that we re interested in the long-term preservation of scientific data. Lots going on here and it’s a complex area thanks to the complex dependencies of digital objects also needing preservation to enable reusability and the large volumes of digital objects that need scalable preservation solutions. And Scientific data adds further complexity – unique requirements to preserve the original context (e.g. processed data, final publications, etc.). And may involve preservation of software and other tools etc.

As Michael said we provide large scale scientific facilities to UK Science. And those experiments running on STFC facilities generate large volumes of data that needs effective and sustainable preservation with contextual data. There is significant investment here – billions of €’s involved – and we have a huge community of usage here as well. We have 30K+ user visitors each year in Europe.

We have a fairly well established STFC scientific workflow. Being central facilities we have lots of control here. And you’ve seen our infrastructure for this. But what are the aims of the long term preservation programme? Well we want to keep data safe – the bits that are retrievable and the same as the original. We want to keep data usable – that which can be understood and reused at a later date. And we have three emerging themes in our work:

  • Data Preservation Policy – what is the value in keeping data
  • Data preservation Analysis – what are the issues and costs involved
  • Data Preservation Infrastructure – what tools do we use

But there are some key data preservation challenges:

  • Data Volume – for instance single run of ISIS experiment could be files of 1.2GB in size. An experiment typically has 100s of runs – files of 100+GB in total size. ISIS is a good test bed as these sizes are relatively small.
  • Data Complexity- scientific HDF data format (NeXus), structural and semantic diversity in files
  • Data Compatibility – 20 years of data archives here.

We are trialing a system that is proprietary and commercial and manages integrity and format verification; designed within library and archive context; turns a data storage service in to a data archive service. But there are some issues. There is limited scalability – not happy with files over several GBs. There is no support for syntactic and semantic validation of data. No support for linking data to its context (e.g. process description, publications). There is no support for effective preservation planning (tools like Plato).


We are doing this in the context of a project called SCAPE – Scalable Preservation Environments – an EC FP7 project with 16 partners (Feb 2011-Jan 2015) and it’s a follow on from the PLANETS project. We are looking at facilitating compute-intensive preservation processes that involve large (multi-TB) data sets. We are developing cloud-based preservation solutions using Apache Hadoop. For us the key products from the project for us will be a scalable platform for performing preservation operations (with potential format conversion), to enable automatic preservation processes. So our new infrastructure will add further context into our preservation service, a watch service will also alert us to necessary preservations over time. We will be storing workflows, policies and what we call PNMs for particular datasets. The tricky areas for us are the cloud based execution platform and the preservation platform.


The cloud-based workflow execution platform will be with Apache Hadoop and workflows may range from ingest operations etc. We are considering using Taverna for workflows. The PNM is Preservation Network Models (PNM) a technique developed by the CASPAR project and to formally represent the outputs of preservation planning. These models should help us control policies, workflows, and what happens with preservation watch.

Finally this is sort of the workflows we are looking at to control this. The process we might do for a particular file. Ingest via JOVE type. Then we check semantic integrity of the file. Then we build our AIP (archive in package) construction etc.

So at the moment we are in the design stage of this work but there are further refinements and assessment to come. And we have potential issues to overcome – including how Taverna might work with the system.

But we know that a scalable preservation infrastructure is needed for STFC’s large volumes of scientific data.


Q1) We run the Australian Synchotron so this was quite interesting for me. When you run the data will that data automatically be preserved? Our one is shipped to a data centre and can then be accessed as wanted.

A1) For ISIS the data volumes are relatively low so we would probably routinely store and preserve data. For Synchotron the data volumes are much larger so that’s rather difference. Although the existing work on crystallography may help us with identifying what can or cannot be preserved.

Q2) Where do you store your data? In Hadoop or somewhere else? Do you see Hadoop as a feasible long term data solution?

A2) I think we will be mainly storing in our own data systems. We see it as a tool to compute really.

Q3) What is software in data centre to store that much data?

A4) We have a variety of solutions. Our own home grown system is use. We use CASTA, the CERN system. We have a number of different ones as new ones emerge. Backup really depends on your data customer. If they are prepared to pay for extra copies you can do that. That’s a risk analysis. CERN has a couple of copies around the world. Others may be prepared to take the risk of data loss rather than pay for storage.

Topic: DTC Archive: using data repositories to fight against diffuse pollution
Speaker(s): Mark Hedges, Richard Gartner, Mike Haft, Hardy Schwamm

The Demonstration Test Catchment Project is funded by Defra and runs from Jan 2011 and Dec 2014. It’s a collaboration between the Freshwater Biological Association and KCL (Centre of eResearch) and builds upon previous JISC-funded research. To understand the project you need to understand the background to the data.

Diffuse Pollution is the release of polluting agent that may not have immediate effect but may have long term cumulative impact. Examples of diffuse pollution includes run off from roads, discharges of fertilisers in farms etc. What is Catchment? Well typically this is the catchment area of a particular body of water draining into a particular point. And the final aspect is the Water Framework Directive. This is a legal instruction for EU member states that must be implemented through national legislation within a prescribed time-scale. This framework impacts on water quality and so this stretches beyond academia and eResearch.

The project is investigating how the impact of diffuse pollution can be reduced through on-farm mitigation methods (changes to reduce pollution) and those have to be cost effective and maintain food production capacity. There are 3 catchment areas in England for tests to demonstrate three different environment types.

So how does the project work? Well roughly speaking we monitor various environmental markers; we try out mitigation measures, and then analyze changes in baseline readings. And it’s our job to curate that data and make it available and usable by various different stakeholders. So these measurements come in various forms – bankside water quality monitoring systems etc.

So the DTC archive project is being developed. We need that data to be useful to researchers, land managers, farmers, etc. So we have to create the data archive, but also the querying, browsing, visualizing, analysing and other interactions. There need to be integrated views across diverse data that suits their need. Most of the data is numerical – spreadsheets, databases, CSV files. Some of this is sensor data (automated, telemetry) and some are manual samples or analysis. The Sensor data are more regular, more risk of inconsistencies in manual data. There is also data on species/ecological data. Also geo-data. Also less highly structured information such as time series images, video, stakeholder surveys, unstructured documents etc.

Typically you need data from various objects etc. So checking levels of potassium you need data from of points in sensor data as well as contextual data from adjacent farms. So looking at data we see spreadsheets of sensor data, weather data, and land usage data as a map of usage for instance that might all be needed.

Some challenges around this data. The datasets are diverse in terms of structure, there are different degrees of structuring – both highly structured and highly unstructured combined here. And another challenge for us is INSPIRE with the intent of creating a European Spatial Data Infrastructure for improved sharing of spatial information and improve environmental policy. It includes various standards for geospatial data (e.g. Gemini2 and GML – Geography Markup Language) and it builds on various ISO standards (ISO 19100 series).

The generic data model is based around ISO 19156 concerned with observation and measurements. The model facilitates the sharing of observations across communities and includes metadata/contextual information and the people responsible for measurement. And this allows multiple data representations. The generic data model implemented in several ways for different purposes. For archival representation (based on library/archival standards), data representation for data integration (“atomic” representation as triples), and various derived forms.

In the IslanDora repository we create a data and metadata METS files and MADS files and MODs are there. That relationship to library standards is a reflection of the fact that this archive sits within a bigger more bibliographic type archive. The crucial thing here is ensuring consistency across data components for conceptual entities etc. So to do this we are using MADS a Metadata Archiving Description Standard that helps explain the structure and format of the files and links to vocabulary terms and table search. The approach we are taking is to break data out to RDF based model. This approach has been chosen because of simplicity of data model and flexibility of that data model.

Most of this work is in the future really but based on that earlier JISC work – breaking data out of tables and assembling in triples. Something that is clear form an example data set – where we see collection method, actor, dataset, tarn, site, locating, and a multiple observation sets each with observations, all as a network of elements. So to do this we need common vocabularies – we need columns, concepts, entities mapped to formal vocabularies. Mappings defined as archive objects. We have automated, computer-assisted and manual approaches here. The latter require domain experience and mark up of text.

Architecturally we have diverse data as archival data in islandora. Then mapped and broken into RDF triples and then mapped again out to browsing, visualisation, search, analysis for particular types of access or visualisation. That break up may be a bit perverse. We think of it as breaking into atoms and recombining it again.

The initial aim is to meet needs of specific sets of stakeholders, we haven’t thought about the wider world but this data and research may be of interest to other types of researchers and broader publics in the future.

At the moment we are in the early stages. Datasets are already being generated in large quantities. There is some prototype functionality. We are looking next at ingest and modeling of data. Find out more here:


Q1) This sounds very complex and specific. How much of this work is reusable by other disciplines?

A1) If it works then I think the general method could be applicable to other disciplines. But the specifics are very much for this use case but the methodology would be transferrable.

Q2) Can you track use of this data?

A2) We think so, we can explain more about this

Q3) It strikes me that these sorts of complex collections of greatly varying data is a common type of data in many disciplines so I would imagine the approach is very reusable. But the Linked Data approach is more time consuming and expensive so could you explain cost benefit of this?

A3) We are being funded to deliver this for a specific community. Moving to the end of the project converting the software to another area would be costly – developing vocabularies say. It’s not just about taking and reusing this work, that’s difficult, it’s about the general structure.

And with that this session is drawing to a close with thank you from our chair Elin Strangeland.

 July 10, 2012  Posted by at 2:42 pm LiveBlog, Updates Tagged with:  1 Response »
Jul 102012

Today we are liveblogging from the OR2012 conference at George Square Lecture Theatre (GSLT), George Square, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Kevin is introducing the Minute Madness by reminding us that all posters will be being shown at our drinks reception this evening so these very short introductions will be to entice you to visit their stand. Les Carr is chairing the madness and will buy drinks for any presentation under 45 seconds as an incentive for speed!

Our first speaker in the room is poster #105 DataONE (Observation Network for Earth) – we just heard the reasoning for why we need this, there are thousands of repositories that need to be linked together. DataONE does this, integrating data and tools for earth observation data. Tools researchers use already like Excel, like SAAS etc.

#100 is a mystery!

#109 on Metadata Analyser Portal – checks metadata quality, checks s for depositor, for repository manager, and we want to build ranking based on quality of metadata. Come to my poster and discuss this with me!

#112 on Open Access Publishing in the Social Science – one of the leading repositories in Germany. I want to talk about the roll ethos kind of repository can take, how we can ensure quality of publications.

#114 Open Access Directory – its hard to check open access status of data. Come chat to us at our poster, more importantly look at our website

#121 Design and development of LISIR for Scholarly Publications of Karnataka State – looking at how universities in Edinburgh have been using this technology to deposit in DSpace

#136 Can LinkedIn and enhance access to Open Repositories – how do we get our research out? It’s all about links and connectedness, the commercial publishers encourage this, why don’t you? Come tell me?

#149 Sharing experiences and expertise in the professional development of promoting OA and IRs between repository communities in Japan and the UK

#? Another mystery

#160 Making Data repositories visible – building a register of research data repositories. We want to encourage sharing and reuse of research data. We have research work planned on this, come talk to me about it!

#161 another mystery

#207 Metadata Database for upper atmosphere for using DSpace – a metadata repository talk geospatial data. We have solved the issue of cross searching for this metadata repository – come find out more!

#209 Revealing presence of amateurs at an institutional repository by analysing queries at Search engines – I think it is difficult to segment repository users into different groupings but it’s importance, they have different needs. Come see me to find out how we have overcome this.

#223 Integrating Fedora into the Semantic Web using Apache Stanbol – we are trying to graph the web and come along to find out more about using semantic web without losing durability of data

#224 Using CKAN – storing data for re-use – as used in The public hub lets you share data, your code, your files – you get an API for your data and stats. You can use ours or download and run your own.

#251 Developing Value-added services facilitating the outreach of institutional repositories at Chinese Academy of Sciences – maybe you don’t get good opportunities to visit China but we will share our experience – come see our poster



#263 The RSP Embedding Guide – there was once a sad dusty library and no one spoke to it. Sometimes people would throw it an article and it would be happy… but then quickly sad again. Then one day the repository manager found the RSP embedding guide and you could find out all about the happy ending at our poster!

#268 Duracloud poster proposal  – digital preservation is important but not all institutions are able to deliver this. We have built DuraCloud a web based solution. Our poster will debunk the myths of the cloud – duracloud and other cloud services – for checking data integrity

#271 SafeArchive – automated policy based auditing and provisions of replicated content  – there are many good tools in this space such as DuraCloud, such as local systems such as LOCKSS, what it’s difficult to do with these tools is to show a relationship between replication services and policy, SafeArchive does that

#274 current and future effects of social media based metrics an open access and IRs – my open data archive provides an open access repository and it is a social media based OA repository. One of the smallest repositories, but well known on social media. I want to discuss any metrics come see my poster!

#275 Adapting a spoken language data model for a fedora repository – this data type is hard to process and expensive to produce so we need repositories and data models that works with this. Annotations of video and audio, metadata specific to this etc. will all be at my poster!

#276 All about Hot Topics the duraspace community webinar series – this is a web seminar series addressing issues bubbling up from the community, Talk to me about the series and perhaps how you can get involved.

#277 A handshake system for Japanese Academic Societies and Insti8tutional repositories – we work as something like JISC or JANET and we recent started a repository hosting service called Jairo Cloud. We have tried to make a handshake for academic society repositories – I’ll explain how at my poster!

#278 create attract deposit – We at the New Bulgarian University have a poster on how we have increased deposit into our institutional repositories. We use web 2.0 to increase our deposits from 0.7% to 2% in just a year! Come and find out what we did and how we promote these materials.

#279 Engage – using data about research clusters to enhance collaboration – funded in part under JISC business dev strand come see us to find out more and tell us your experiences

#281 CSIC Bridge – linking digital.CSIC to Institutional CRIS – we have used homegrown software and other external tools to automate ingestion. I’ll talk about pros and cons and also integration with DSPACE IR and how we are using CRIS rather than DSPACE deposit tool


#283 JAIRO Cloud – national infrastructure for institutional repositories in JAPAN – I am a tech person without much money. In Japan there are 800 universities and 600 are a bit like me in that regard so the national institute of informatics has begun to share a cloud repositories, 17 are already open to the public. Come find out more

#284 The CORE Family – Connected Repositories is the project. Like William Wallace we are fighting for freedom in terms of open access. We are providing access to millions of resources. But hopefully we won’t end up in the same way: hung, drawn and quartered!

#285 Enhancing DSPace to Synchronise with sources having distinct updating patterns – I am presenting Lume a repository aggregating work from several different data sources and how we are enabling provision of embedded videos

#286 Cellar – the project for common access to EU information – 43 million file in 23 languages, delivered in multiple formats including JSON and SPARQL

#288 Moving DSpace to a fully feature CRIS System – come see how we have been doing this, adaptations made etc.

#291 Makerere University’s dynamic experience of setting up, content collection and use of an institutional repository running on DSpace – we have been doing this for 5 years, come find out about our taking this to the next level.


#294 History DMP: managing data for historical research – we have very active history researchers and got funding to work with those historians to gather and curate data through data management plans created with historians, we have 3 case studies, and we enhanced our repository for these results.

#295 NSF DMP content analysis: what are researchers saying about repositories – find out what crazy things researchers have been saying

#296 Making DSpace Content Accessible via Drupal – we recently moved to Drupal and as departments migrated we wanted to deposit publications etc. into the repository and they were fine with that but wanted it to look just like the website. So come find out how we did this via DSPace REST interface

#297 Databib – an online bibliography of research data repositories – perfect for researchers, libraries, repository managers etc. Please stop by the poster or site to make sure your repository is represented. All our metadata is available via CC0

#298 Making it work – operationalizing digital repositories: moving from strategic direction to integrated core library service – we stared out like a garage band with just our moms and boyfriends hanging out. But like better garage band we’ve gotten better and high level researchers now want to jam with us. Come find out what how we moved from garage band to centre stage!

#299 Publishing system as the new tool for the open access promotion at the university – we migrated over to an open journals system, come find out more about this.

#300 The CARPET project – an information platform on ePublishing technology for users, developers and providers – match matching these groups and technologies. Please come to our poster and ask me how e can help you

#301 Proactive personalized self archiving – we have written an application for outside repositories that allow users to submit metadata and data into repositories

#302 DataFlow project – DataStage – personalised file management – this is a love story of DataBank and DataStage, They were made for each other but didn’t know it! We are an open access project so this is an open relationship between these two components

#303 Databank – a restful web interface for repositories – come see us!

#304 Repositories at Warwick – how we refreshed our marketing for the repositories and how we used the “highlight your research” strapline. We launched the service late last year and in first 10 months we saw a nearly 50% increase in deposit. Come find out about our process and end project

#306 University of Prince Edward Islands VRE Service – this poster is a chronological narrative/fairytale tracing the repository process at PEI and Islandora itself. If you are a small institution trying to make your repository work come speak to me!

#307 CRUD (the good kind) at Northwestern University Library – a Drupal based system, a Hydra based deposit system and a Fedora Repository. It’s fun stuff, come talk

#309 Client side interfaces for content re-use framework based on OAI-PMH – an extension of OAI-PMH with image via JSON. Should be brand new framework, a very beautiful framework. Come see me.

#311 Agricultural repositories in India – darn, our presenter isn’t here

#312 If you love them, set them free: developing digital archives collections at Salford – we have been working with our local community to share and make collections available. We’ve worked hard to make our stuff more discoverable and easier to enjoy

#315 At the centre – a story first (with props!). One of the first journeys to St Andrews was by a monk to move bones of St Andrews for safekeeping. Today researchers are still inspired to come to St Andrews… our poster explains how research@StAndrews has led to all sorts of adventures and encounters.


#319 Introducing Islandora Stack and Architecture 0 Islandora is open access repository software that connects to Drupal (used by everyone from NASA To Playboy). Come find out more about Islandora, about recent updates, or about Prince Edward Island where we will be hosting OR2013

#320 Implementing preservation services in an Islandora Framework – various approaches will be discussed notably Duracloud

#324 Use of shared central virtual open access agriculture and aquaculture repository (VOA3R) – an open source portal that harvests OA scientific literature from different institutional repositories and it embarks a social network. This project funded by EU Framework7 and technology is reusable and open source.

#325 Integrating an institutional CRIS with an OA IR – find out how we are using text harvesting with Symplectic elements to create a repository full of high quality open metadata

#326 SAS OJS: overlaying an Open Journals service onto an Institutional Repositories – SAS OJS better known as “sausages” – find out about our pilot with legal researchers at University of London

#328 Putting the Repository First: Implementing a Repository to RIS EWorkflow for QUT ePrints – we’ve made the repository the only deposit for metadata for research publications

#329 Implementing an enhanced repository statistics system for QUT ePrints – so important but our researchers wanted to collate statistics at author, research group, school, faculty and home repository level (as well as article level) – my poster talks about how we implemented this and how it has gone down.

#287 Open AIRE: supporting open science in Europe – a pitch with a poem that I can’t do justice to here. But we talk about supporting open science in Europe, training… add Continental Chic to your OR2012!

#197 Open AIRE Data Infrastructure Services: On Interlinking European Institutional Repositories, Dataset Archives and CRIS systems – how we did the technical work here and how it can be reused by you!


 July 10, 2012  Posted by at 1:08 pm LiveBlog, Updates Tagged with:  1 Response »