Jul 11, 2012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Repositories and Microsoft Academic Search
Speaker(s): Alex D. Wade, Lee Dirks

MSResearch seeks out innovators from the worldwide academic community. Everything they produce is freely available and non-profit.

They produce research accelerators in the form of Layerscape (visualization, storytelling, sharing), DataUp (formerly called DataCuration for Excel), and Academic Search.

Layerscape provides desktop tools for geospatial data visualization. It’s an Excel add-in that creates live-updating earth-model visuals. It provides the tooling to create a tour/fly-through of the data a researcher is discussing. Finally, it allows people to share their tours online – they can be browsed, watched, commented on like movies. If you want to interact with the data you can download the tour with data and play with it.

DataUp aids scientific discovery by ensuring funding agency data management compliance and repository compliance of Excel data. It lets people go from spreadsheet data to repositories easily. This can be done through an add-in or via a cloud service. The glue that sticks these applications together is repository agnostic, with minimum requirements for ease of connection. It’s all open source, driven by DataONE and CDL. It is in closed beta now with a wide release later this summer.

Now, Academic Search. It started by bringing together several research projects in MSResearch. It’s a search engine for academic papers from the web, feeds, and repositories. Part of its utility is a profile of information around each publication, possibly from several sources, coalesced together. As other full-text documents cite a paper, those citations can be shown in context. Keywords can be shown, linked to DOIs, and subscribed to for change alerts. These data profiles are generated automatically, and that can build automatic author profiles as well: conferences and journals they’ve published in, associations, citation history, institution search.

The compare button lets users compare institutions by different publication topics – by the numbers, by keywords, and so on. Visualizations are also available to be played with. The Academic Map shows publications on a map.

Academic Search will also, hopefully, be used as a bit more than a search engine. It is a rich source of information that ranks journals, conferences, and academics, all sortable in a multitude of ways.

Authors also have domain-specific H-Index numbers associated with them.

Anyone can edit author pages, submit new content, clean things up. Anyone can also embed real-time pulls of data from the site onto their own site.

With the Public API, and an API key, you can fetch information with an even broader pull. Example: give me all authors associated with the University of Edinburgh, and all data associated with them (citations, ID number, publications, co-authors, etc.). With a publication ID, a user could see all of the references it includes, or all of the documents that cite it.
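As a rough illustration of the kind of query just described, here is a sketch of building an Academic Search request URL. The endpoint and parameter names are assumptions modelled on the JSON API of that era, not verified documentation:

```python
from urllib.parse import urlencode

# Hypothetical sketch of an Academic Search API call: the endpoint and
# parameter names below are assumptions, not a documented interface.
BASE = "http://academic.research.microsoft.com/json.svc/search"

def author_query_url(app_id, institution, start=1, end=20):
    """Build a query URL for all authors affiliated with an institution."""
    params = {
        "AppId": app_id,            # the API key mentioned in the talk
        "AuthorQuery": institution,
        "ResultObjects": "Author",  # ask for author records, not papers
        "StartIdx": start,
        "EndIdx": end,
    }
    return BASE + "?" + urlencode(params)

url = author_query_url("MY-KEY", "University of Edinburgh")
```

The same pattern would apply to publication-ID queries for references and citations.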

Q: What protocol is pushing information into the repositories?

A: SWORD was being looked at, but I’m uncertain about the current protocol right now. SWORD is in the spec, so it will be that eventually.

Q: Does Academic Search harvest from repositories worldwide?

A: We want to, but first we’re looking at aggregations (OCLC OAIster). We want to provide a self-service registration mechanism, plus scraping via Bing. Right now it’s a cursory attempt, but we’re getting better.

Q: How is the domain hierarchy generated?

A: The domain hierarchy is generated manually with ISI categories. It’s an area of debate: we want an automated system, but the challenge is that more dynamic systems make ranked lists and comparisons over time more difficult. It’s a manual list of categories (200 total, at the journal level).

Q: Should we be using a certain type of metadata in repos? OAI-PMH?

A: We use OAI-PMH now, but we’re still analyzing all of that. It’s a long-term conversation about the best match.
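For context, an OAI-PMH harvest request is just a base URL plus a couple of query parameters. A minimal sketch (the repository URL is illustrative):

```python
from urllib.parse import urlencode

# Minimal sketch of building an OAI-PMH ListRecords request, the
# protocol the answer above says is used today. The base URL is an
# illustrative example, not a specific endorsed endpoint.
def list_records_url(base_url, metadata_prefix="oai_dc", resumption=None):
    params = {"verb": "ListRecords"}
    if resumption:
        # per the OAI-PMH spec, resumptionToken is an exclusive argument
        params["resumptionToken"] = resumption
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)

url = list_records_url("https://repository.example.ac.uk/oai/request")
```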


Topic: Enhancing and testing repository deposit interfaces
Speaker(s): Steve Hitchcock, David Tarrant, Les Carr

Institutional repositories are facing big challenges. How are they presenting a range of services to users? How is the presentation of repositories being improved and made easier? The DepositMO project hopes to improve just that. It asks how we can reposition the deposit process in a workflow. SWORD, particularly v2, enables this.

So, IRs are under pressure. The Finch report suggests a transition with clear policy direction toward open access. This will make institutional open access repositories for publication obsolete, but not for research data. Repositories are taking a bigger view of that, though. Even if publications are open access, they can still be part of IR stores.

DepositMO has been in Edinburgh before. It induced spontaneous applause. It was also at OR before, in 2010.

This talk was borderline for acceptance, perhaps because it omits one statement: there have been few studies of user action with repositories.

There are many ways that users interact with repositories, which ought to be analyzed. SWORD for Facebook, for Word.

SWORD gives a great scope of use between the user and repository, especially with V2. V2 is native in many repositories now, partially because of DepositMO.

With convenient tools built into already-used software, like Word, work can be saved into repositories as it is developed. Users can set up Watch Folders for adding data, either as a new record or an update to an older version if changed locally. The latter example is quite a bit like Dropbox or SkyDrive, but repositories aren’t hard drives. They aren’t designed as storage devices. They are curation and presentation services: to deposit is to present, very soon afterwards. DepositMO is a bit of a hack to prevent presentation while iteratively adding to repository content. Save for later, effectively.
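A minimal sketch of the Watch Folder idea: poll a directory and report files that are new or locally changed, which a SWORD client could then deposit as new records or as updates to existing ones. The function is illustrative, not DepositMO’s actual code:

```python
import os

# Sketch of the Watch Folder polling step. `seen` maps file paths to
# the modification time at the last scan; anything new or changed is
# returned so a deposit client (not shown) can act on it.
def scan(folder, seen):
    """Return files in `folder` that are new or changed since last scan."""
    changed = []
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        mtime = os.path.getmtime(path)
        if seen.get(path) != mtime:   # new file, or a local update
            seen[path] = mtime
            changed.append(path)
    return changed
```

A real client would call `scan` on a timer and decide per file whether to create a new record or version an existing one.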

Real user tests of DepositMO have been done – some laptops were set up running the new services, and users were invited to test in pairs. This wasn’t about download, installation, and setup, but actual use in a workflow. Is it useful in the first place? Can it fit into the process? Task completion and success rates for repository user tasks were collected as users did these things.

On average, Word and watch folder deposit tools improved deposit time amongst other things. However, these entries aren’t necessarily as well documented as is typically necessary. The overall summary suggests that while there is a wow-factor in terms of repository interaction, the anxiety level of users increases as the amount of information they have to deposit increases. Users sometimes had to retrace steps, or else put things in the wrong places as they worked. They needed some trail or metadata to locate deposit items and fix deposit errors.

There are cases for not adding metadata during initial entry, though, so low metadata might not be the worst thing.

Now it’s time to do more research, exploring the uses with real repositories. That project is called DepositMOre. Watch Folder, EasyChair one-click submission, and to an extent the Word add-in will be analyzed statistically as people actually deposit into real repositories. It’s time to accommodate new workflows and new needs, and to face down the challenge of publishers offering open access.

Q: Have you looked into motivations for user deposit into repositories?

A: No, it was primarily a study of test users through partners in the project – the how and what of usage and action, but not the why. We did wonder whether more data about the users would be useful. If more data were obtainable, the most interesting thing would be understanding user experience with repositories. But mandate motivation – no, we’re not looking into that.

Q: You’ve identified a problem users have with depositing many things and tracking deposits. Did you identify a solution?

A: It’s more about dissuading people from reverting to previous environments and tools. There are more explicit metadata tools, and we could do a better job of showing trails of submission, so that will need to filter back in. Unlike cloud drives, users lose control of an object once it is submitted to a repository. So, suddenly something else is doing something, and for the user that’s disconcerting.


Topic: OERPub API for Publishing Remixable Open Educational Resources (OER)
Speaker(s): Katherine Fletcher, Marvin Reimer

This talk is about a SWORD implementation and client. Most of this work has happened in the last year, very quick.

Remixable open education repositories target less academic and more multi-institution, open repos. Remixability lets users learn anywhere. It’s a ton of power. All these open resources can seed a developer community for authoring and creation, machine learning algorithms, and it all encourages lots of remixable creation.

Remixability can be hard to support, though. Connexions, and other organizations, had grand ambition but not a very large API. And you need an importer/editor that is easy to use. Something that can mash data up.

In looking at APIs needed for open education, discoverability is important, but making publishing easier is important, too. We need to close the loop so that we stop losing the remixed work externally. That’s where SWORD comes in. V2.

Why SWORD V2 for OER? It has support for workflow. The things being targeted are live-edited, versioned objects. Those versions need to be permanent so that changes are nondestructive. Adapting, translating, and deriving are great, but associating them with common objects helps tie it all together.

OERPub extends SWORD V2. It clarifies and adds specificity to metadata; specificity is required precisely for showing the difference between versions and derivatives. And documentation is improved. Default values, repository-controlled values, and auto-generated values are all documented, and precedence between them has been made clear.

OERPub also merges the semantics headers for PUT, simplifying what’s going on. A section on Transforms was also added under packaging: if a repository will transform content, it has a space to explain its actions. And it provides error handling improvements, particularly elaboration on things like transform and deposit failures.
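To make the header discussion concrete, here is a sketch of the headers a SWORD v2/OERPub deposit might carry; the In-Progress header is what lets a client keep working on a record before it is published. The values and helper are illustrative:

```python
# Sketch of SWORD v2 deposit headers. The packaging URI shown is the
# standard SimpleZip identifier; treating it as the default here is an
# assumption for illustration.
def deposit_headers(mimetype, packaging, in_progress=True):
    """Headers for a SWORD v2 create/update request."""
    return {
        "Content-Type": mimetype,
        "Packaging": packaging,
        # "true" keeps the item editable (not yet published/presented)
        "In-Progress": "true" if in_progress else "false",
    }

h = deposit_headers("application/zip",
                    "http://purl.org/net/sword/package/SimpleZip")
```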

This is the first tool to submit to Connexions from outside of Connexions.

Lessons learned? The specification detail was great – good to model on top of, and it saved work. Bug fixes also led the project away from multiple metadata specifications – otherwise bugs will come up. We learned that you always need a deposit receipt, which is normally optional. Finally, auto-discovery – this takeaway suggests a protocol for accessing and editing public item URLs.

A client was built to work with this – a transform tool to remixable format in very clean HTML, fed into Connexions, and pushed to clients on various devices. A college chemistry textbook was already created using this client. And a developer sprint got three new developers fixing three bugs in a day – two hours to get started. This is really enabling people to get involved.

Many potential future uses are cropping up. And all this fits into curation and preservation – archival of academic outputs as an example.

Q: Instead of PUT, should you be using PATCH?

A: Clients are likely to know their repositories, but it is potentially dangerous to ignore headers. Other solutions will be looked at.

Q: One lesson learned was to avoid multiple ways of specifying metadata. What ways?

A: Dublin Core fields with attributes and added containers. That caused errors. XML was mixed in, but we eventually had to specify exactly which we wanted.

July 11, 2012, 2:31 pm – LiveBlog, Updates – P5A: Deposit, Discovery and Re-use LiveBlog
Jul 11, 2012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: ORCID update and why you should use ORCIDs in your repository
Speaker(s): Simeon Warner

I am speaking with my Cornell hat on and my ORCID hat on today. So this is a game of two halves. The first half is on ORCID and what it is. And the second half will be about the repository case and interfacing with ORCID.

So, the scholarly record is broken: reliable attribution of authors and contributors is impossible without unique person-level identifiers. I have an unusual name so the issue is mild for me, but if you have a common name you are in real trouble. We want to link unique identities to person records across data sources and types, and to enlist a huge range of stakeholders to do this.

So ORCID is an amazing opportunity that emerged a couple of years ago. Suddenly publishers, archivists, etc. all started talking about the same issue. It is an international, interdisciplinary, open and not-for-profit organization. We have stakeholders that include research institutions, funding organizations, publishers and researchers. We want to create a registry of persistent unique identifiers for all sorts of roles – not just authors – and all sorts of contributions. We have a clear scope and set of principles. We will create this registry, and it will only work if it’s used very widely. Previous systems have failed because their scope wasn’t wide enough. One of the features of research is that things move – I was a physicist, now repositories, libraries… I don’t live in one space here. To create an identity you need some information to manage it. You need a name, an email, some other bits of information, and the option for users to update their profile with stuff that is useful for them. Privacy is an issue – of course. So a core principle in ORCID is opt-in. You can hide your record if you want. You can control what is displayed about you. And we have a set of open principles about how ORCID will interact with other systems and infrastructure.

So ORCID will disambiguate researchers and allow tracking, automate repository deposition, and support other tasks that leverage this sort of ID. We have 328 participant organizations, 50 of which have provided sponsorship. And that’s all over the world.

So to go through a research organization workflow: for an organisation it’s a record of what researchers have done in that institution. But you don’t want a huge raft of staff needed to do this. So the organisation registers with ORCID. At some stage ORCID looks for a record of a person, and the organisation pulls out data on that person – a search done on already-held information. Identifiers can then be created, ready for researchers to claim.

So, granting bodies: in the US there is always a complaint and a worry about the burden of reporting. So what if we tied this up to an ORCID identity? Again the granting body registers with ORCID, and then an ORCID-to-grant linking is sent to the PI or researcher for confirmation. Same idea again with the publisher. If you have granted the publisher the ability to do it, you can let them add the final publication to your name, saving effort and creating a more accurate record.

So a whole set of workflows gives us a sort of vision for capturing researchers as early as possible in the creation of research. In the phase 1 system the researcher can self-claim a profile, delegate management, and there is institutional record creation; fine-grained control of privacy settings; and data exchange into grant and manuscript submission systems, authorised organisations/publishers etc. So right now we have an API, a sandbox server, etc. We are now working out launch partners and readying for launch. The ORCID registry will launch in Q4 of 2012. Available now: the ORCID identifier structure (coordinated with ISNI), plus code, APIs, etc.

So why should you use ORCID in your repository?

Well, we have various stakeholders in your repository – authors, the academic community and the institutions themselves. Institutional authors want credit for their work; ORCID should and will increase the likelihood of authors’ publications being recognised. It opens the door to linking articles that wouldn’t otherwise be linked up – analyses of citations etc. – and to more nuanced notions of attribution. And it saves effort by allowing data reuse across institutions. For readers it offers better discovery and analysis tools: valuable information for improving tools like Microsoft Academic Search, better ways to measure research contributions, etc. And for institutions it allows robust links between local and remote repositories, to better track and measure use of publications.

And from an arXiv position, we’ve looked for years for something to unify author details across our three repositories. We have small, good-quality repositories, but we need that link between the author and materials. And from a UK/JISC perspective there is a report from the JISC Researcher Identifier task force group that indicates the benefits of ORCID. I think ORCID helps make repositories count in a field we have to play in.

So, you want to integrate with ORCID. There are two tiers to the API right now; I’ll talk about both. All APIs return XML or JSON data. The tier 1 API is available to all for free, with no access controls. With this you can ask a researcher for their ORCID ID and look at the data they have made public. You could provide a pop-up in your repository deposit process to check for their ORCID ID. There is a tension between functionality and privacy here, but presuming they have made their ID public this will be very useful.
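A Tier 1 lookup is just an unauthenticated GET against the public API. A sketch, using the early ORCID API’s bio path (treat the exact path as an assumption, not current documentation):

```python
from urllib.parse import quote

# Sketch of a Tier 1 (public, no access controls) ORCID lookup URL.
# The /orcid-bio suffix reflects the early API and is illustrative.
def public_profile_url(orcid_id):
    """URL for the public profile data of a given ORCID iD."""
    return "https://pub.orcid.org/%s/orcid-bio" % quote(orcid_id)

# 0000-0002-1825-0097 is ORCID's well-known example identifier
url = public_profile_url("0000-0002-1825-0097")
```

A repository deposit form could fetch this during submission to confirm the ID the author typed actually resolves.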

Tier 2 API members will have access to an OAuth2 authentication between the service and ORCID, allowing users to grant certain rights to a service: access to both public and (if granted) protected data, and the ability to add data (if granted). There are really three steps to this process. Any member organisation gets an ORCID client ID in the first stage. Secondly, a user approaching the repository can log in and grant data access to the client repository. The user is then redirected back to the repository along with an access permission. And if access is granted, the repository continues to have access to the user’s profile until this permission is revoked by the user (or ORCID). And data can be added to the user’s profile by the repository if it becomes available.
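The second step described here – sending the user to ORCID to grant access – is a standard OAuth2 authorization-code redirect. A sketch; the endpoint and scope string are assumptions based on the talk, not a spec quote:

```python
from urllib.parse import urlencode

# Sketch of building the OAuth2 authorization URL for a Tier 2 flow.
# Client ID, redirect URI, and the scope string are all illustrative.
def authorize_url(client_id, redirect_uri,
                  scope="/orcid-profile/read-limited"):
    """URL to send the user to so they can grant the repository access."""
    params = {
        "client_id": client_id,
        "response_type": "code",   # ORCID redirects back with ?code=...
        "scope": scope,
        "redirect_uri": redirect_uri,
    }
    return "https://orcid.org/oauth/authorize?" + urlencode(params)

u = authorize_url("APP-EXAMPLE", "https://repo.example.ac.uk/callback")
```

After the redirect, the repository would exchange the returned code for an access token, which stays valid until the user (or ORCID) revokes it.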

All code etc. on dev.orcid.org. Follow the project on Twitter @ORCID_Org.


Q1 – Ryan) You mentioned that ORCID will send information to CrossRef; what about DataCite?

A1) I don’t think I said that. We import data from CrossRef, not an export the other way. I think that would be led by the DOI owner, not ORCID. DOI is easy – someone has the rights to a publication; people don’t work that way.

Q1) In that case I encourage you to work with DataCite.

A1) If it’s public on ORCID anyone can harvest it. And ORCID can harvest any DOI source.

Q2 – Natasha from Griffith University) An organisation is prompted to remove duplicates? How does that work?

A2) We are working on that. We are not ready to roll out bulk creation of identifiers for third parties at the moment. The initial creation will be by individuals and publications. We need to work out how best to do that. Researchers want this to be more efficient, so we need to figure that question out.

Topic: How dinosaurs broke our system: challenges in building national researcher identifier services
Speaker(s): Amanda Hill

So I am going to talk about the wider identifier landscape that ORCID and others fit into. On the one hand we have the library approach to book-level data: it’s labour intensive, disambiguation-first, authors not involved, open. And then we have the publisher angle: automatic, disambiguation-later, authors can edit, proprietary. In terms of current international activity we have ISNI as well as ORCID. ISNI is very library-driven: disambiguation-first, authors not involved, broad scope. ORCID is more publisher-instigated: disambiguation-later, authors can submit/edit, current researchers. ISNI is looking at fictional entities etc. as well as researchers, so it is somewhat different.

We had a Knowledge Exchange meeting on digital author identifiers in March 2012, and both groups were present and encouraged; they are aware of and working with each other to an extent. Both ISNI and ORCID will use existing pools of data to populate their registries. There are a number of national author ID systems – in 2011 there was a JISC-funded survey to look at systems and their maturity. We did this via a survey to national organisations. The Lattes system in Brazil is very long term – it’s been going since 1999 – and very mature and very well populated, but there is a diverse landscape.

In terms of populating systems there is a mixture – some are prepopulated, some manual, some let authors edit themselves. In Japan there was an existing set of researcher identifiers; in the Netherlands, a thesaurus of author names. In Norway they use human resources data for the same purpose. With more mature systems a national organisation generally has oversight – e.g. in Brazil, Norway, the Netherlands – and there is integration with research fields and organisations etc. It’s a bit different in the UK. The issue was identified in 2006 as part of a call for proposals for the JISC-funded repositories and preservation programme. Mimas and the British Library proposed a two-year project to investigate requirements and build a prototype system. This project, the Names project, can seem dry, but actually it’s a complex problem. Everyone has stories of ambiguity.

The initial plan was to use the British Library Zetoc service to create author IDs – journal article information from 1993 onwards – but it’s too vast, too international. And it’s only last names and initials, with no institutional affiliation. So we scrapped that. Luckily the JISC Merit project used 2008 Research Assessment Exercise data to pre-populate the Names database. It worked well, except for twin brothers with the same initials, both writing on palaeontology and often co-authoring papers… in name authority circles we call this the “Siveter problem” (the brothers’ surname). We do have both in the system now.

Merit data covers around 20% of active UK researchers. And we are working to enhance records and create new ones with information from other sources: institutional repositories, British Library data sets (Zetoc), and direct input from researchers. With current EPrints the RDF is easy to grab, so we’ve used that with Huddersfield data and it works well. And we have a submission form on the website now so people can submit themselves. Now, an example of why this matters… I read the separatedbyacommonlanguage blog, and the author was stressing about the fact that her name appears in many forms, and about the REF process. This is an example of why identifiers matter, why names are not enough, and how strongly people feel about it.

Quality really matters here. Automatic matching can only achieve so much – it’s dependent on the data source, and some people have multiple affiliations. There is no one-size-fits-all solution here. We have colleagues at the British Library who perform a manual check of the results of matching new data sources – this allows for separation/merging of records – and they did similar work on ISNI. At the moment people can contribute a record but cannot update it. In the long term we plan to allow people to contribute their own information.

So our ultimate aim is to have a high-quality set of unique identifiers for UK researchers and research institutions, available to other systems, national and international (e.g. Names records were exported to ISNI in 2011). Business-model-wise, we have looked at possible additional services – such as disambiguation of existing data sets and identification of external researchers. About a quarter of those we asked would be interested in this possibility and in paying for such added-value services.

There is an API for the Names data that allows for flexible searching. There is an EPrints plugin – based on the API – which was released last year. It allows repository users to choose from a list of Names identifiers – and to create a Names record if none exists.
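A sketch of what a flexible search against the Names API might look like from a plugin like the EPrints one; the endpoint and parameter names here are assumptions for illustration, not the documented interface:

```python
from urllib.parse import urlencode

# Hypothetical sketch of a Names API search URL, of the sort an
# EPrints plugin might build to offer a pick-list of identifiers.
# Endpoint and parameters are assumptions, not documentation.
def names_search_url(family, given=None):
    """Build a search URL for a researcher name."""
    q = "%s, %s" % (family, given) if given else family
    params = {"q": q, "format": "json"}
    return "http://names.mimas.ac.uk/search?" + urlencode(params)

u = names_search_url("Siveter", "D.")
```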

So, what’s happening with Names now? We are funded until the end of 2012. Simeon mentioned the JISC-convened Researcher ID group – its final meeting will take place in September. That report went out for consultation in June, and the report of the consultants went to JISC earlier this week. So these final aspects will lead to recommendations. We have been asked to produce an Options Appraisal Report for a UK national researcher identifier service in December. And we are looking at improving data and adding new records via repository searches.

So Names is kind of a hybrid of the library and publisher approaches: automatic matching/disambiguation; human quality checks; data immediately available for re-use in other systems; and authors can contribute and will be able to edit. When Names was set up, ORCID was two years away and ISNI hadn’t started yet. Things are moving fast. The main challenges here are cultural and political rather than technical. National author/researcher ID services can be important parts of research infrastructure. It’s vital to get agreement and co-ordination at a national level here.


Q1) I should have asked Simeon this, but you may have some appreciation here. How are recently deceased authors being handled? You have data since 1993 – how do you pick up deceased authors?

A1) No, I don’t think that we would go back to check that.

Q1) These people will not be in ID systems but retrospective materials will be in repositories so hard to disambiguate these.

A1) It is important. Colleagues on the Archives Hub are interested in disambiguation of long-dead people. Right now we are focusing on active researchers.

A2 – Simeon) Just wanted to add that ORCID has a similar approach to deceased authors.

Q2 – Lisa from University of Queensland) We have 1,300 authors registered with author IDs – how do you marry a national ID and an ORCID ID?

A2) We can accommodate all relevant identifiers as needed; in theory an ORCID ID would be one of these.

Q3) How do you integrate this system with Web of Science and other commercial databases?

A3) We haven’t yet but we can hold other identifiers so could do that in theory but it’s still a prototype system.

Q4) Could you elaborate on national id services vs. global services?

A4) When we looked across the world there was a lot of variation. It would depend on each country’s requirements. I feel a national service can be more responsive to the needs of that community. So in the UK we have the HE statistics agency, who want to identify those in universities, for instance; ORCID may not be right for that purpose, say. I think there are various ways we could be more flexible or responsive as a national system vs. ORCID with its range of stakeholders.

Topic: Creating Citable Data Identifiers
Speaker(s): Ryan Scherle, Mark Diggory

First of all, thank you for sticking around to hear about identifiers! I’m not sure even I’m that excited about identifiers! So instead let’s talk about what happened to me on Saturday. I was far away… it was 35 degrees hotter… I was at a little house on the beach, the Mimosa House, at 807 South Virginia Dare Trail, Kill Devil Hills, NC, USA, 27898. It isn’t a well-known town, but it was the place where the Wright brothers’ first flight tests took place at [gives exact geocoordinates]. But I had a problem. My transmission [part number] in my van [engine number] – I opened the vent and a deadly spider crawled out [Latin name]. I’m fine, but it occurred to me that we use some really strange combinations of identifiers. And a lot of these are very unusable for humans – those geocoordinates are not designed for humans to read out loud in a presentation [or for livebloggers to grab!].

When we want data used and reused, we need to make identifiers human-friendly. Repositories use identifiers… EPrints can use a 6-digit number and URL – not too bad. In Fedora there isn’t an imposed scheme; in this example there is a short accession number, but it’s not very prominent – you have to dig around a long URL. Not really designed for humans (I’ll confess I helped come up with this one, so my bad too). DSpace does impose a structure. It’s fairly short and easy to cite, if you are used to repositories. But look at Nature – a source scientists understand. They use DOIs. When scientists see a DOI they know what it is and how to cite it. So why don’t repositories do this?

So now I’m going to get controversial. I am going to suggest some principles for citable identifiers; you won’t all agree!

1) Use DOIs – they are very familiar to scientists and others. Scientists don’t understand handles, PURLs or info URIs. They understand DOIs. And using one adds weight to your citation – it looks important. And loads of services and tools are compatible with DOIs. Currently EPrints and DSpace don’t support them; Fedora only with a lot of work.

2) Keep identifiers simple – complex identifiers are fine for machines but bad for humans. Despite our best intentions, humans sometimes need to work with identifiers manually, so keep them as short and sweet as possible. Do repositories support this? Yes, all three do, but you need the right policies set up.

3) Use syntax to illustrate relationships – this is the controversial bit. But hints in identifiers can really help the user; a tiny bit of semantics in an identifier is incredibly useful, e.g. http://dx.doi.org/10.5061/dryad.123ab/3. A few slashes here help humans look at higher-level objects. Useful for human hacks and useful for stats: you can aggregate stats for higher-level objects. It could break in the future, but it probably won’t! Again, EPrints and DSpace don’t enable this; Fedora only with work.
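The hierarchy encoded by those slashes is trivially machine-walkable, which is what makes stats aggregation easy. A small sketch using the example DOI above:

```python
# A Dryad-style DOI like 10.5061/dryad.123ab/3 encodes its own
# hierarchy: the last segment is a file, the level above is the data
# package. Walking up the levels is a one-liner.
def doi_ancestors(doi):
    """Return the DOI and each higher-level object it belongs to."""
    parts = doi.split("/")
    # keep at least prefix/suffix (e.g. 10.5061/dryad.123ab)
    return ["/".join(parts[:i]) for i in range(len(parts), 1, -1)]

levels = doi_ancestors("10.5061/dryad.123ab/3")
# levels[0] is the file-level DOI, levels[1] the data package
```

A stats system could credit a download of the file DOI to every level in `levels` at once.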

4) When “meaning-bearing” content changes, create a versioned identifier – scientists are pretty picky. Some parts of objects have meaning, some don’t. For some objects you might have an Excel file. Scientists want that file to be entirely unchanged, and given a new URL only when it does change. Scientists want data to be invariant to enable reuse by machines; even a single bit makes a difference. Watch out for implicit abstractions – e.g. thumbnails of different images etc. This kind of process seems intuitive, but it kind of flies in the face of the DOI system and conventions. A DOI for an article resolves to a landing page that could change every day and contain any number of items; it could move to a different publisher. What the scientist cares about is the article text itself; the webpage is not so much of an issue.

Contrast that with…

5) When “meaningless” content changes, retain the current identifier – descriptive metadata must be editable without creating a new identifier. Humans rarely care about metadata changes, especially for citation purposes. Again, repositories don’t handle this stuff so well. EPrints supports flexible versioning/relationships. DSpace has no support. Fedora has implicit versioning of all data and metadata – useful, but too granular!

So to build a repository with all of these features we had a lot of work to do. We had previously been using DSpace, so we had some work to do here. What we did was add a new DSpace identifier service. It allows us to handle DOIs and to extend to new identifiers in the future. It gives us granular control over when a new DOI is registered, and it lets us send these to citation services as required. So our DSpace identifier system uses EZID at CDL and then also DataCite. The DataCite content service lets you look up DOIs; they are linked-data compliant – you can see relationships in the metadata. You can export metadata in various formats for textual or machine processing purposes. And we added some data into our citation information: when you load a page in Dryad there is a clear “here’s how to cite this item” note, as we really want people to cite our material.
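Content negotiation against the DataCite content service amounts to varying the Accept header on a DOI request. A sketch of assembling such a request (not executed here; the media types shown are the commonly advertised ones, so treat them as assumptions):

```python
# Sketch of a content-negotiated DOI metadata request. Resolving the
# same DOI with different Accept headers yields different formats,
# e.g. DataCite XML or BibTeX, per the DataCite content service.
def metadata_request(doi, accept="application/x-datacite+xml"):
    """Describe an HTTP GET that asks for DOI metadata in a given format."""
    return {
        "url": "https://doi.org/" + doi,
        # e.g. "application/x-bibtex" for BibTeX output
        "headers": {"Accept": accept},
    }

req = metadata_request("10.5061/dryad.123ab")
```

An HTTP client would issue `GET req["url"]` with `req["headers"]` and parse whatever format came back.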

In terms of versioning we have put this under the control of the user: when you push a button a new object is created and goes through all the same creation processes – just a copy of the original. So we can also connect back to related files on the service, and we thus have versioning on files. We plan to do more on file versioning and track changes on these. We need to think about tracking information in the background without using new identifiers in the foreground. We are contributing much of this back to DSpace, but we want to make sure that the wider DSpace community finds it useful and that it meets their requirements.

So, how well has it worked? Well, it’s been OK. Lots of community change is needed around citing data identifiers. Last year we looked at 186 articles associated with Dryad deposits – 77% had “good” citations to the data, 2% had “bad” citations to the data, and 21% had no data citations at all. We are working with the community to raise awareness about that last issue. Looking at articles, a lot of people cite data in the text of the article, sometimes in supplementary materials at the end. And a bad citation? They called their identifier an “accession number”.

So, how many of you disagree with me here? [some, not tons of people] Great! Come see me at dinner! But no matter whether you agree or not, do think about identifiers and humans and how they use them. And finally, we are hiring for developer and user interface posts at the moment – come talk to me!


Q1 – Rob Sanderson, Los Alamos National Laboratory) I agree with (4) and (5), but DOIs? I disagree! They are familiar, but things can change behind a DOI – that’s not what you want!

A1) I maybe over simplified. When you resolve a DOI you get to an HTML landing page. There is content – in our case data files. Those data files we guarantee to be static for a given DOI. We do offer an extension to our DOI – you can add /bitstream to get the static bits. But that page does change and restyle from time to time.

Q2 – Robin Rice, Edinburgh University Data Library) We are thinking about whether to switch from handles to DOIs, but you can’t have a second DOI for a different location… What do you do if you can’t mint a new DOI for something?

A2) You can promote the existing DOI. I question that you can’t have more than one DOI though – you can have a DOI for each instance of each object.

Q2) Earlier it seemed that the DOI issuing agency wouldn’t allow that

A2)  We haven’t come across that issue yet

A2 – audience) I think the DOI agency would allow your sort of use.

July 11, 2012 – Posted at 2:30 pm in LiveBlog, Updates – Comments Off on P5B: Name and Data Identifiers LiveBlog
Jul 11 2012

Lots of people have been asking about download access to presentations and posters. Sorry we didn’t provide information about this earlier – I confess it was partly because we had missed a simple setting in our conference submission system that provides public access to downloads.

The good news is that many presentations and posters are now available directly from the detailed pages in the conference agenda; this isn’t a permanent home for the content, but it will be available to access for at least the rest of this calendar year. At present, presentations are only there if presenters uploaded them beforehand (similarly for posters). We’re also trying to collect presentations as they are given, but we can’t guarantee a 100% success rate. If you are a presenter or poster author, please do take a minute or so to upload your presentation whilst you are here. We are only making content available after each session has run, so there’s no risk of any surprises being unleashed early.

Given all that we’ve heard at OR2012 about the advantages of putting content online and making it open and the value of putting content in repositories, it would be downright contrary of any of you not to comply and place your content in the staging post for the conference repository. Do it now if only for the brief glow of self-satisfaction that it will give.


July 11, 2012 – Posted at 1:59 pm in Updates – 2 Responses
Jul 11 2012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.


Topic: A Repository-based Architecture for Capturing Research Projects at the Smithsonian Institution
Speaker(s): Thorny Staples

I have recently returned to the Smithsonian. I got into repositories through lots of digital research projects. I should start off by saying that I’ll show you screenshots for a system that allows researchers to deposit data from the very first moment of research, it’s in their control until it goes off to curators later.

I’m sure most of you know of the Smithsonian. We were founded to be a research institute originally – the museums were a result of that. We have 19 museums, 9 scientific research centers, 8 advanced study centres, 22 libraries, 2 major archives and a zoo (the zoo in Washington). We focus on long-term baseline research, especially in biodiversity and environmental studies, and lots of research in cultural heritage areas. And all of this – hundreds of researchers working around the world – has had no systematic data management of digital research content (except for SAO, who work under contract for NASA).

So the problem is that we need to capture research information as it’s created and make it “durable” – it’s not about preservation but about making it durable. The Smithsonian is now requiring a data management plan for ALL projects of ANY kind. This is supposed to say where they will put their digital information, or at least get them thinking about it. But we are seeing very complex arrays of numerous types of data. Capturing the full structure and context of the research content is necessary. It’s a network model, not a library model. We have to think network from the very beginning.

We have to depend on the researchers to do much of the work, so we have to make it easy. They have to at least minimally describe their data, but they have to do something. And if we want them to do it we must provide incentives. It’s not about making them curators. They will have a workspace, not an archive. It’s about a virtual research environment – but a repository-enabled VRE. The primary goal is to enhance their research capabilities, leaving trusted data as their legacy. So to deliver that we have to care about a content creation and management environment, an analysis environment and a dissemination environment. And we have to think about this as two repositories: there is the repository for the researcher – they are data owners, they set policies, they have control, which is crucial buy-in and a crucial concept for them; and then we have to think about an interoperable gathering service – a place researcher content feeds into, and also cross-search/access to multiple repositories back in the other direction, as these researchers work in international teams.

Key to the whole thinking is the concept of the web as the model. It’s a network of nodes that are units of content, connected by arcs that are relationships. I was attracted to Fedora because of the notion of a physical object and a way to create networks here. Increasingly content will not be sustainable as discrete packages. We will be maintaining our part of the formalized world-wide web of content. Some policies will mean we can’t share everything all the time but we have to enable that, that’s where things are going. Information objects should be ready to be linked, not copied, as policy permits. We may move things from one repository to another as data moves over to curatorial staff but we need to think of it that way.

My conceptual take here is that a data object is one unit of content – not one file. E.g. a book is one object no matter how many pages (all of which could be objects). By the way, this is a prototype, not a working service – a prototype to take forward. And the other idea that’s new is the “concept object”. This is an object with metadata about the project as a whole, then a series of concept objects for the components of that project. If I want to create a virtual exhibition I might build 10 concept objects for those paintings and then pull up those resources.
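The “network of nodes connected by relationships” idea can be sketched as a simple graph of typed objects – a toy illustration only, not the Smithsonian prototype’s actual data model:

```python
# Each object is a node; "concept" objects carry project structure and
# metadata, while "resource" objects (files) hang off them via links (arcs).
objects = {
    "project:exhibit": {"type": "concept", "title": "Virtual exhibition",
                        "links": ["concept:painting-1"]},
    "concept:painting-1": {"type": "concept", "title": "Painting 1",
                           "links": ["res:image-1"]},
    "res:image-1": {"type": "resource", "title": "TIFF scan", "links": []},
}

def reachable(start: str) -> set:
    """Walk the network of nodes and arcs from a starting object."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(objects[node]["links"])
    return seen
```

Because the links are arcs rather than containment, the same resource object can be linked from several concepts without being copied – which is the point of the network model over the library model.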

So if you come into a project you see a file-structure idea. There’s an object at the top for the project as a whole. Your metadata overview, which you can edit, lets you define those concepts. The researcher controls every object and all definitions. The network is there; they are operating within it. You can link concepts to each other – it’s not a simple hierarchy. And you can see connections already there. You can then ingest objects – right now we have about 8 concept types (e.g. “Research site, plot or area”). When you pick that you then pick which of several forms you want to use. When you click “edit” you can see the metadata editor in a simple web form pre-populated with the existing record. And when you look at resources you can see any resources associated with that concept. You can upload resources without adding metadata, but they will show in bright yellow to remind you to add metadata. And you can attach batches of resources – these are offered depending on where you are in the network.

And if I click “exhibit” – a link on each concept – you can see a web version of the data. This takes advantage of the administrator screen but allows me to publish my work to the web. I can keep resources private if I want. I can make things public if I want. And when browsing this I can potentially download or view metadata – all those options defined by the researcher’s setting of policies.


Q1 – Paul Stanhope from University of Lincoln) Is there any notion of concepts being bigger than the institution, being available to others?

A1) We are building this as a prototype, as an idea. So I hope so. We are a good microcosm for most types of data – when the researcher picks that, they pick metadata schemas behind the scenes. This thing we built is local, but it could be global; we’re building it in a way that could work that way. With the URIs other institutions can link their own resources etc.

Q2) Coming from a university, do you think there’s anything different about your institution? Is there a reason this works differently?

A2) One of the things about the Smithsonian is that all of our researchers are Federal employees and HAVE to make their data public after a year. That’s a big advantage. We have other problems – funding, the government – but policy says that the researchers have to.

Q3 – Joseph Green from University College Dublin) How do you convey the idea of concept objects etc. to actual users – it looks like file structures.

A3) Well yes, that's kind of the idea. If they want to make messy structures they can (curators can fix them). The only thing they need is a title for their concept structure. They do have a file system BUT they are building organising nodes here. And that web view is an incentive – it’ll look way better if they fill in their metadata. That’s the beginning… for tabular data objects, for instance, they will be required to do a “code book” to describe the variables. They can do this in a basic way, or they can do a better, more detailed code book and it will look better on the web. We are trying to incentivise at every level. And we have to be fine with ugly file structures and live with it.

Topic: Open Access Repository Registries: unrealised infrastructure?
Speaker(s): Richard Jones, Sheridan Brown, Emma Tonkin

I’m going to be talking about an Open Access Repositories project that we have been working on, funded by JISC, looking at what Open Access repository registries are being used for and what their potential is – via stakeholder interviews, a detailed review of ROAR and OpenDOAR, and some recommendations.

So if we thought about a perfect/ideal registry as a starting point… we asked our stakeholders what they would want. They would want it to be authoritative – the right name, the right URL; they want it to be reliable; automated; broad in scope; curated; up-to-date. The idea of curation and the role of human intervention would be valuable, although much of this would be automated. People particularly wanted the scope to be much wider. If a data set changes there are no clear ways to expand the registry, and that’s an issue. But all of those terms are really about the core things you want to do – you all want to benchmark. You want to compare yourself to others and see how you’re doing. And in our sector, funders want to see all repositories – what are the trends, how are we doing with Open Access – and potentially ranking repositories or universities (like the Times HE rankings) etc.

But what are they ACTUALLY being used for right now? Well, people mainly use them for documenting their own existing repositories. Basic management info. Discovery. Contact info. Lookups for services – using the registry for OAI-PMH endpoints. So it looks as if we’re falling a bit short! So, a bit of background on what OA repository registries there are. We have OpenDOAR and ROAR (Registry of Open Access Repositories) – both very broad-scope registries, well known and well used. But there is also the Registry of Biological Repositories. There is re3data.org – all research data, so it’s a content-type-specific repository registry. And, more esoterically, there is the Ranking Web of World Repositories. It's not clear if this is a registry or a service on a registry. And indeed that’s a good question… what services run on registries? Things like BASE search for OAI-PMH endpoints; very similar is Institutional Repositories Search, based at Mimas in the UK. Repository 66 is a more novel idea – a mashup with Google Maps to show repositories around the world. Then there is the Open Access Repository Junction, a multi-deposit tool for discovery and use of SWORD endpoints.

Looking specifically at OpenDOAR and ROAR: OpenDOAR is run at the University of Nottingham (SHERPA) and uses manual curation. It only lists OA and full-text repositories, and has been running since 2005. ROAR, by contrast, principally has repository-manager-added records, with no manual curation, and lists both full-text and metadata-only repositories. It is based at the University of Southampton and runs EPrints 3, inc. SNEEP elements etc. Interestingly, both have policy addition as an added-value service. Looking at the data here – and these are a wee bit out of date (2011) – there seems to be big growth but some flattening out in OpenDOAR in 2011, probably as it approaches full coverage. ROAR has a larger number of repositories due to the difference in listing criteria, but is quite similar to OpenDOAR (and ROAR harvests this too). And if we look at where repositories are, both ROAR and OpenDOAR are highly international – a slightly more European bias in OpenDOAR perhaps, but coverage is fairly broad and even around the globe. Looking at content type, OpenDOAR is good at classifying material into types, reflective of manual curation. We expect this to change over time, especially for datasets. ROAR doesn’t really distinguish between content types and repository types – it would be interesting to see these separately. We also looked at what data you typically see about the repository in any record. Most have name, URL, location etc. OpenDOAR is more likely to include a description and contact details than ROAR. Interestingly, the machine-to-machine interfaces are a different story. OpenDOAR didn’t have any RSS or SWORD endpoint information at all; ROAR had little. I know OpenDOAR are changing this soon. This field was added later in ROAR and no one has come back to update this new technology – that needs addressing.

A quick note about APIs. ROAR has an OAI-PMH API, no client library, and a full data dump available. OpenDOAR has a fully documented query API, no client library, and a full data dump available. When we were doing this work almost no one was using the APIs; they just download all the data.
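For a flavour of what consuming such an API involves: OAI-PMH is plain XML over HTTP, so a registry (or a service built on one) can parse an endpoint's `Identify` response with nothing but the standard library. This sketch parses a canned sample response rather than hitting a live endpoint; the repository details are made up.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# A trimmed Identify response, such as a registry might fetch from each
# listed OAI-PMH endpoint (sample data, not from a live repository):
sample = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2012-07-11T12:00:00Z</responseDate>
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <baseURL>http://repo.example.org/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
  </Identify>
</OAI-PMH>"""

def identify_info(xml_text: str) -> dict:
    """Extract the Identify fields a registry would record for an endpoint."""
    root = ET.fromstring(xml_text)
    ident = root.find(OAI + "Identify")
    return {child.tag.replace(OAI, ""): child.text for child in ident}
```

Harvesting these responses periodically is exactly the kind of automation that would keep registry records authoritative and up to date.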

We found stakeholders and interviewees noted some key limitations: content count stats are unreliable; the registries are not internationalised/multilingual – particularly problematic if a name is translated and is the same thing but doesn’t appear to be; limited revision history; no clear relationships between repositories, organisations, etc.; and no policies/mechanisms for populating new fields (e.g. SWORD). So how can we take what we have and realise the potential of registries? There is already good stuff going on… Neither of those registries automatically harvests data from repositories, but that would help to make data more authoritative, reliable, up to date, automated and broader in scope – and that makes updates so much easier for all. And we can think about different kinds of quality control – no one was doing automated link checking or spell checking, and those are pretty easy to do. An option for human intervention was in OpenDOAR but not in ROAR, and that could be made available.

But we could also make them more useful for more things – graphical representations of the registry; better APIs and data (with standards compliance where relevant); versioning of repositories and record counts; more focus on policy tools. And we could look to encourage overlaid services: repository content stats analysis; comparative statistics and analytics; repository and OA rankings; text analysis for identifying holdings; error detection; multiple deposits. With all of that we start hitting that benchmarking objective.


Q1 – Owen Stephens) One of the projects I’m working on is CORE project from OU and we are harvesting repositories via OpenDOAR. We are producing stats about harvesting. Others do the same. It seems you are combining two things – benchmarking and repositories. We want OpenDOAR to be comprehensive, and we share your thoughts on need to automate and check much of that. But how do we make sure we don’t build both at the same time or separate things out so we address that need and do it properly?

A1) The review didn’t focus on structures of resulting applications so much. But we said there should be a good repository registry that allows overlay of other services – like the benchmarking services. CORE is an example of something you would build over the registry. We expect the registry to provide mechanism to connect up to these though. And I need to make an announcement: JISC, in the next few weeks, will be putting out an ITT to take forward some of this work. There will be a call out soon.

Q2 – Peter from OpenDOAR) We have been improving record quality in OpenDOAR. We’ve been removing some repositories that are no longer there – link checking doesn’t do it all. We are also starting to look at including those machine-to-machine interfaces. We are doing that automatically with help from Ian Stuart at EDINA. But we are very happy to have them sent in too – we’ll need that in some cases.

A2) you are right that link checkers are not perfect. More advanced checking services can be built on top of registries though.

Q3) I am also working on the CORE project. The collaboration with OpenDOAR, where we reuse their data, is very useful. Because we are harvesting we can validate the repository and share that with OpenDOAR. The distinction between registries and harvesting is really about an ecosystem that can work very well.

Q4) Is there any way for repositories to register with schema.org to enable automatic discovery?

A4) We would envision something like that, that you could get all that data in a sitemap or similar.

A4 – Ian Stuart) If registering with Schema.org then why not register with OpenDOAR?

A4 – chair) Well, with schema.org you host the file; it's just out on the web.

Q5) How about persistent URLs for repositories?

A5) You can do this. The Handle in DSpace is not a persistent URL for the repository.

Topic: Collabratorium Digitus Humanitas: Building a Collaborative DH Repository Framework
Speaker(s): Mark Leggott, Dean Irvine, Susan Brown, Doug Reside, Julia Flanders

I have put together a panel for today but they are in North America so I’ll bring them in virtually… I will introduce and then pass over to them here.

So… we all need a cute title and Collaboratory is a great word we’ve heard before. I’m using that title to describe a desire to create a common framework and/or set of interoperable tools providing a DH Scholars Workbench. We often create great creative tools but the idea is to combine and make best use of these in combination.

This is all based on Islandora, a Drupal + Fedora framework from UPEI – a flexible UI on top of Fedora and other apps. It’s deployed in over 100 institutions and that’s growing. The ultimate goal of these efforts is to release a Digital Humanities solution pack with various tools integrated in, in a framework that would be of interest in a scholarly DH context – images, video, TEI, etc.

OK so now my colleagues…

Dean is a visiting professor at Yale, and also a professor at Dalhousie University in Canada, and part of a group that creates new editions of important works of modernism in Canada. Dean: so this is the homepage for Modernist Commons. This is the ancillary site that goes with the Modernism in Canada project. One of our concerns is long-term preservation of digital data stored in the commons. What we have here is both the repository and a suite of editing tools. When you go into the commons you will find a number of collections – all test collections and samples from the last year or so. We have scans of a bilingual publication called Le Nigog, a magazine that was published in Canada. You can view images or mark-up, or you can view all of the different ways to organise and orchestrate the book object in a given collection. You can use an Internet Archive viewer or alternative views. The IA viewer frames things according to the second-to-last image in the object, so you might want to use an alternative. In this viewer you can look at the markup, entities, structures, RDF relations, or image annotations. The middle pane is a version of CWRC-Writer that lets us do TEI and RDF markup. And you see the SharedCanvas tools provided with other Open Annotation group items. As you mark up a text you can create author authority files that can be used across collections/objects.

Next up is Susan Brown; her doctorate is on Victorian feminist literature, and she currently researches collaborative systems, interface design and usability. Susan: I’ll be talking more generally than Dean. The Canadian Writing Research Collaboratory is looking to do something pretty ambitious that only works in a collaborative DH environment. We have tools that can aim as big as we can. I want to focus on a couple of things that define a DH collaboratory. It needs to move beyond the institutional repository model. To invoke the perspective of librarian colleagues, I want to address what makes us so weird… What’s different about us is that storing final DH materials is only part of the story: we want to find, amass and collect materials; to sort and organise them; to read, analyse and visualize. That means environments must be flexible, porous, really robust. Right now most of that work is on personal computers – we need to make these more scalable and interoperable. This will take a huge array of stakeholders buying into these projects. So a DH repository environment needs to be easy to manage, diverse and flexible. Some of these projects will only have a small amount of work and resources; in many, small teams of experts will be working with very little funding. So the CWRC-Writer here shows you how you edit materials. On the right you see TEI markup. You can edit this and other aspects – entities, RDF Open Annotation markup etc.; annotation allows you to construct triples from within the editor. One of the ways to encourage interoperability is through the use of common entities – connecting your work to the world of linked data. The idea is that increasing consistency across projects with TEI markup and RDF means better metadata than the standard “work in Word, publish in HTML” approach many use. So this is a flexible tool. Embedding it in a repository does raise questions about revisioning and archiving though. One of the challenges for repositories and DH is how we handle those ideas. Ultimately, though, we think this sort of tool can broaden participation in DH and collaboration on DH content. I think the converse challenge for DH is to work on more generalised environments to make sure that work can be interoperable. So we need to take something from a solid and stable structure and move to the idea of shared materials – a porous silo maybe – where we can be specific to our work but share and collaborate with others.

The final speaker is Doug, who became the first digital curator at NYPL. He’s currently editing the Music of the Month blog at NYPL. Doug: the main thing we are doing is completely reconfiguring our repository to allow annotation of Fedora and take in a lot of audio and video content, particularly large amounts of born-digital collections. We’ve just started working with a company called Brightcove to share some of our materials. Actually, we are hiring an engineer to design the interface for that – get in touch. We are also working on improved display interfaces. Right now it’s all about the idea of the gallery – the idea was that it would self-sustain through selling prints. We are moving to a model where you can still view those collections but also archival materials. We did a week-long code sprint with DH developers to extend the Internet Archive book reader. We have since decided to move from that to the New York Times-backed reader – the NYT doc viewer with OCR and annotation.


Q1) I was interested in what you said about the CWRC-Writer – you said you wanted to record every keystroke. Have you thought about SVN or Git, which do all that versioning stuff already?

A1 – Susan) They are great tools for version control and it would be fascinating to do that. But do you put your dev money into that, or do you try to meet the needs of the greatest number of projects? We would definitely look in that direction, though, to look at the challenges of versioning introduced in dynamic online production environments.


July 11, 2012 – Posted at 12:27 pm in LiveBlog, Updates – Comments Off on P4B: Shared Repository Services and Infrastructure LiveBlog
Jul 11 2012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.


Topic: Multivio, a flexible solution for in-browser access to digital content
Speaker(s): Miguel Moreira

Multivio is a generic browser and visualizer for digital objects, a presentation layer for document servers, and an add-on for other infrastructure. Its main principle: when searching a document server, users are provided with immediate access to content. Its origins lie in RERO and its digital library. In 2006, an internal survey showed a desire for a service that eventually became Multivio – an adequate presentation layer for full-text, structure-rich files and for showing patrimonial (heritage) collections. It does all of this quickly and directly, as opposed to traditional solutions.

Multivio was developed because other solutions were not flexible enough. It is co-funded by RERO and the Electronic Library of Switzerland. Development took place between 2008-2011, with an official release in 2011.

Using Multivio is straightforward. Provide a URL to a file (PDF, image, sound, video, etc) or a combination of files. Then Multivio will investigate structure and content, and provide it to the user in a convenient searchable interface in browser.

Multivio, given content, shows in a window over a given page. It pulls in content very quickly and shows it off visually with JavaScript and HTML. No pre-indexing necessary.

Multivio is a full-featured HTML5 document viewer. It allows zoom, search, copy and paste. It also has an elegant way of handling large and multi-file documents, which can be shown together without downloading. It consumes little bandwidth and is based on widely accepted web standards. All it requires is a modern browser client-side. Server-side, the role of Multivio is rendering, search and extraction. It uses Python and Poppler (for PDFs). The only other requirement is that remote content be fetched and stored on-server.
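For a sense of what that server side might look like: Poppler ships command-line tools (`pdftoppm`, `pdftotext`) that can render or extract a single page on the fly. The sketch below only builds the command lines rather than running them – the exact invocation Multivio uses is an assumption, not taken from its source.

```python
def render_page_cmd(pdf_path: str, page: int, dpi: int = 96) -> list:
    """Build a pdftoppm command that renders a single page to PNG,
    the sort of on-the-fly rasterisation a viewer backend performs."""
    return ["pdftoppm", "-png", "-r", str(dpi),
            "-f", str(page), "-l", str(page), pdf_path]

def extract_text_cmd(pdf_path: str, page: int) -> list:
    """Build a pdftotext command extracting one page's text to stdout,
    as needed for server-side search without pre-indexing."""
    return ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"]
```

Either command list could then be handed to `subprocess.run(...)`; because each call touches only the requested page range, no pre-indexing of the whole document is needed.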

Multivio.org to check it out. For a public demonstrator, go to demo.multivio.org – usable with any web-accessible document.

The advantage of Multivio is performance, customization, access control. It only requires a Unix server running Python.

The CORE Portal is using Multivio now.

In the future, support for audio and video will be added and improved, along with authentication and access control. Calendar-based navigation of publications is coming as well.

Q: Do you do PDF file processing beforehand?

A: No, it is all done on the fly. Poppler is very effective at doing this. It wastes no time or bandwidth in grabbing what it needs.

Q: Do you do OCR processing? Can individual pages of a document be shared/navigated to directly?

A: No OCR processing. As for page-specific URLs, the client API allows for file URLs with page numbers. This isn’t being used for analysis of document usage yet, but that is very interesting.

Q: For multimedia, what experience do you have working with it?

A: We are starting to have and use that content. Prototypes are showing one video format so far – we must work on that. It’s a challenge, but we know it’s possible. We will rely on HTML5 and modern browsers, and if needed maybe fall back on Flash. Further investigation has to be done.

Q: More details on access control?

A: It’s on the todo list. Right now the solution is to install the Multivio server alongside protected documents. Multivio needs access rights, then it can restrict what it displays.

Q: How will this interact with usage metrics?

A: There’s an intention to work on this in the future. It’s important. We will still provide direct download, and do basic view analysis, but we hope to go much farther.


Topic: Biblio-transformation-engine: An open source framework and use cases in the digital libraries domain
Speaker(s): Kostas Stamatis, Nikolaos Konstantinou, Anastasia Manta, Christina Paschou, Nikos Houssos

This will be a backend talk. Sorry in advance. This is an open source framework that has been in development for 4-5 years. It facilitates digital transformations in library systems. It’s a solution to a common problem.

This tool has been used extensively so far. Digital transformations are a necessary reality in libraries, repositories, everything. You need to transform data to get into any publishing system or database, to migrate it or share it. Such processes need to constantly be evolving, so the framework provides systematic management of code that does all that. This will accelerate common transformation tasks.

The first step in a framework is creating an analysis, finding the abstractions that will represent common procedures. From that, the steps are retrieving data records, applying processing and changing any given records or field values, then finally generating the desired output. The less obvious finding is that there is a demand for incremental or selective data loading – breaking up the task, say.

The design goals demanded customisability, non-intrusiveness, ease of use, and the ability to integrate or extend for anyone who needs the Biblio-transformation-engine.

The components of the engine: the Data Loader, which retrieves data from sources according to its own spec; the Processing step, which transforms information with filters, then modifiers, then initializers; and the Output Generator, which actually creates the desired product.
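As a rough illustration of that pipeline (a Python sketch with invented names – the actual framework is a Java API), a record flows through filters, then modifiers, then an output generator:

```python
# Hypothetical sketch of the filter -> modifier -> generator pipeline.
def process(records, filters, modifiers):
    """Processing step: drop records failing any filter, then apply modifiers."""
    out = []
    for rec in records:
        if all(f(rec) for f in filters):
            for m in modifiers:
                rec = m(rec)
            out.append(rec)
    return out

def generate_output(records, render):
    """Output generator: serialize each processed record."""
    return [render(r) for r in records]

# Example run: keep English records, title-case titles, emit display strings.
records = [{"title": "open repositories", "lang": "en"},
           {"title": "arkiv", "lang": "no"}]
is_english = lambda r: r["lang"] == "en"
titlecase = lambda r: {**r, "title": r["title"].title()}
result = generate_output(process(records, [is_english], [titlecase]),
                         lambda r: f'{r["title"]} ({r["lang"]})')
# result == ["Open Repositories (en)"]
```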

The library was developed in Java (Maven-based) as FLOSS. It is available online under the EU Public License – free to download, use and comment upon.

Use cases. One is generating linked open data from repository records, legacy cultural material records and CERIF information. Corresponding data loaders are reused. Filters and modifiers can be totally agnostic of RDF and input formats; Jena generates the RDF triples. The engine also adds or generates appropriate identifiers/URIs for entities.
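The talk shows no code, but the idea of a format-agnostic step ending in triple generation might look something like this sketch (plain N-Triples strings; the real engine uses Jena in Java, and the URIs and record fields here are illustrative):

```python
# Hypothetical sketch: turn a flat metadata record into N-Triples statements.
def record_to_triples(rec, base="http://example.org/record/"):
    subj = f"<{base}{rec['id']}>"
    triples = []
    if "title" in rec:
        triples.append(f'{subj} <http://purl.org/dc/terms/title> "{rec["title"]}" .')
    if "creator" in rec:
        triples.append(f'{subj} <http://purl.org/dc/terms/creator> "{rec["creator"]}" .')
    return triples

lines = record_to_triples({"id": "42", "title": "Sample record",
                           "creator": "Houssos, N."})
# one subject-predicate-object statement per line
```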

Another is populating repositories from EndNote, RIS, Bibtex, UNIMARC. A third and fourth are feeding VOA3R and European aggregators.

In the future, the project hopes to support more data transformations and to extend the declarative specification of mappings for complex cases, plus some infrastructure for reusing Filter and Modifier implementations. Finally, the project would like to study user experience to sort out the little things and make life easier.

Q: You’re using CSL and JS – are you running JS on client or server side?

A: JS on server side. A modifier calls a JS server.

 July 11, 2012, 12:23 pm – LiveBlog, Updates – P4A: Accessing Digital Content LiveBlog
Jul 11 2012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.


Topic: Panel Discussion: "Effective Strategies for Open Source Collaboration"
Speaker(s): Tom Cramer, Jon William Butcher Dunn, Valorie Hollister, Jonathan Markow

This is a panel session, so it’s a little bit different. We’ve asked all of our DuraSpace experts here about collaborations they have been engaged in and then turn to you for your experiences of collaboration – what works and what doesn’t.

So, starting with Tom. I'm going to talk about 3 different open source technologies we are involved in. First of these is Blacklight, which is in use in many places. It's a faceted search application – Ruby on Rails on Solr. Originally developed at UVa around 2007, first adopted outside UVa in 2009. It's had multiple installations, 10+ committer institutions, etc.

Hydra is a framework for creating digital asset management apps to supplement Fedora. It started in 2008 with Hull, Stanford and Virginia plus FedoraCommons. It's institutionally-driven and developer-led.

And the last item is IIIF: International Image Interoperability Framework – I'll be talking more on this later – an initiative by major research libraries across the world, a cooperative definition of APIs to enable cross-repository image collections. It's a standards project, not a technology project.

Lessons learned…

DO: Work from a common vision; be productive, welcoming and fun; engineer face-time is essential; get great contributors – they lead to more great contributors too!

DON’T: over-plan, over-govern; establish too many cross institution dependencies; get hooked on single sources of funding.

Now over to Jon. A few collaborations. First up Sakaibrary. Sakai is an eLearning/course management tool used by dozens of institutions. There was a collaborative project between Indiana University and University of Michigan Libraries to develop extensions to Sakai and facilitate use of library resources in teaching and learning. Top down initiative from university head librarians. Mellon funding 2006-2008 (http://sakaibrary.org).

The second project is Variations on Video. This one is a collaboration between Indiana University and Northwestern University Libraries – with additional partners for testing and feedback. It is a single cross-institution team using Agile Scrum approaches.

Lessons learned from these projects… Success factors: initial planning periods – shared values and vision being established – helped very much; good project leadership and relationships between leaders important; collaborative development model. Some challenges: Divergent timelines; electronic communication vs. face-to-face – very important to meet face to face; existing community culture; shifts in institutional priorities and sustainability.

Now over to Val, Director of Community Programs for DuraSpace. Part of my role is to encourage teams to collaborate and gain momentum within the DSpace community. We are keen to get more voices into the development process. We had DSpace Developer meeting on Monday and have made some initial tweaks, and continue to tweak, the programme. So what is the DSpace Community Advisory Team? Well we are a group of mostly repository managers/administrators. Developers wanted help/users wanted more input. Formed in Jan 2011, 5-7 active members. DCAT helps review/refine new feature requests – get new voices in there but also share advice, provide developer help. We had a real mission to assess feature requests, gauge interest, and enable discussion.

Some of the successes of DCAT. We have reviewed/gathered feedback on 15+ new feature requests – 3 were included in the last release. It really has broadened development discussion – developers and non-developers, inter/intra-institution. And it has been a useful resource for developers – DCAT ran a community survey and provided recommendations on it, plus feedback on feature implementation.

Challenges for us: no guarantee that a feature makes it in – despite everyone’s efforts features still might not make it in, because of resource limitations; continue to broaden discussion and broaden developer pool; DCAT could also be more helpful during the release process itself – to help with testing, working out bugs etc.

So the collaboration has been successful with discussion and features, but we will continue to try to do better!

Now Jonathan is asking the panel: how important is governance in this process? How does decision making take place?

Tom: Different in different communities, and bottom-up vs. top-down makes a big difference. In bottom-up it's about developers working together, trusting each other and building the team, but maintaining code quality is challenging at both the local and the broader level for risk-averse communities.

Jon: governance differed between the two projects. In both cases we did have a project charter of sorts. For Sakaibrary it was more consensus-based – good in some ways but maybe a bit less productive as a project as a result. In terms of prioritisation of features in the video project, we are making use of the Scrum concept, and the idea of product owners is very useful there. We try to involve the whole team, but the product owner defines priorities. When we expand to other institutions with their own interests we may have to explore other ways of doing things – we'll need to learn from Hydra etc.

Val: I think DCAT is a wee bit different. Initially this was set up between developers and DCAT and that has been an ongoing conversation. Someone taking the lead on behalf of developers was useful. And for features DCAT members tend to take the lead on a particular request or other to lead analysis etc. of it.


Q1) In a team development effort there is great value in being able to pop into someone's office and ask for help, and lots of decisions get made for free – a quick discussion. When collaborating, even a trivial decision can mean a 1 hr conference call. How do you deal with that?

A1 – Jon) In terms of the video project we take a couple of approaches – we use an IRC channel and Microsoft Lync for one-to-one discussion as needed. We also have a daily 15 min stand-up meeting via telephone or video conference. And the agile approach with 2 week cycles means it's not hugely costly to take the wrong approach or find we want to change something.

A1 – Tom) With conference calls we now feel that if it takes an hour we shouldn't make that decision. Moving to IRC rather than email is a problem across time zones. Email lets you really think things through, and that's no bad thing. One member of the Blacklight community is loquacious but often answers his own questions inside of an hour! You just learn how to work together.

A1 – Jonathan) We really live on Skype and that’s great. But I miss water cooler moments, tacit understandings that develop there. There’s no good substitute for that.


Topic: High North Research Documents – a new thematic and global service reusing all open sources
Speaker(s): Obiajulu Odu, Leif Longva

Our next speakers are from the University of Tromso. High North Research Documents is a project we began about six months ago. You may think that you are high in the north, but we are from far arctic Norway. This map gives a different perspective on the globe, on the north. We often think of the north as the north of America, of Asia etc., but the far north is really a region of its own.

The Norwegian government has emphasized the importance of northern areas and the north is also of interest on an international level – politically and strategically; environmental and climate change issues; resource utilization; the northern sea route to the Pacific. And our university, Tromso, is the northernmost university in the world and we are concerned with making sure we lead research in the north. And we are involved in many research projects but there can be access issues. The solution is Open Access research literature and we thought that it would be a great idea to look at the metadata to extract a set of documents concerned with High North research.

The whole world is available through aggregators like OAIster (OCLC) and BASE (University of Bielefeld), which have been harvesting OA documents across the world. We don't want to repeat that work. We contacted the guys at Bielefeld and they were very helpful. We have been downloading their metadata locally, allowing us to analyse it as we wished.

Our hypothesis was that if we selected a set of keywords and they appear in the metadata, then the thematic scope of the document can be identified. So we set up a set of filtering words (keywords) applied to the metadata of BASE records, based on: geographic terms; species names; languages and peoples (nations); other keywords. We have mainly looked for English and Norwegian words, but there is a bigger research world out there.

The quality of keywords is an issue – are their meanings unambiguous? Labrador, for instance, is for us about northern Canada, but it has a different meaning – farmer or peasant – in Spanish. Sami is a term for a people, but it is also a common given name in Turkey and Finland! So we have applied keyword filtering on combinations of elements – e.g. "sami AND language" or "sami AND people". The filter process is applied only to selected metadata elements – title, description, subject. But it's not perfect.
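A minimal sketch of that compound-keyword idea (the rule format and records are invented for illustration): a record matches if all terms of any one rule appear in the selected metadata elements.

```python
# Hypothetical filtering rules: each inner list is an AND-combination.
RULES = [["svalbard"], ["sami", "language"], ["sami", "people"]]
ELEMENTS = ("title", "description", "subject")

def matches(record, rules=RULES):
    """True if any rule's terms all appear in the selected elements."""
    text = " ".join(record.get(e, "") for e in ELEMENTS).lower()
    return any(all(term in text for term in rule) for rule in rules)

hit = {"title": "Sami language revitalisation", "subject": "linguistics"}
miss = {"title": "Sami: a Turkish given name", "description": "onomastics"}
# matches(hit) -> True; matches(miss) -> False
```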

Looking at the model, we have around 36 million documents from 2150 scholarly resources. These are filtered and extracted. Documents matching one subset of keywords go right into the High North Research Documents database. Another set of keywords we don't trust as much, so those go through manual quality control first. Now over to my colleague Obiajulu.

Thank you Leif. We use a series of modules in the High North system model. The Documents service itself is DSpace. The Import module gets metadata records and puts them in our MySQL database. After documents are imported we have the Extraction module, which applies the extraction criteria to the metadata. The Ingest module transforms metadata records relevant to the high north into DSpace XML format and imports them into a DSpace repository. And we have the option of adding custom information – including use of facets.
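For the ingest step, DSpace's import format wraps each metadata value in a dcvalue element; a sketch of generating it (the field values here are illustrative, and the project's actual transformation is more involved):

```python
import xml.etree.ElementTree as ET

def to_dspace_xml(pairs):
    """Render (element, value) metadata pairs as a dublin_core.xml document."""
    root = ET.Element("dublin_core")
    for element, value in pairs:
        dcv = ET.SubElement(root, "dcvalue", element=element, qualifier="none")
        dcv.text = value
    return ET.tostring(root, encoding="unicode")

xml_doc = to_dspace_xml([("title", "Arctic sea ice extent"), ("language", "en")])
```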

Our Admin module allows us to add, edit or display all filtering words (keywords). And it allows us to edit the status of a record or records – blacklisted/rejected; approved; modified. So why do we use DSpace? Well, we have used it for 8 or 9 years to date. It provides end users with both a regular search interface and faceted search/browsing. Our search and discovery interface is an extension of DSpace, and it allows us to find out about any broken links in the system.

We are on High North RD v1.1: 151,000 documents extracted from more than 50% of the sources appearing in BASE, and from all over the world. Many different languages – even if we apply mainly English, Norwegian and Latin in the filtering process. Any subject, but weighted toward the hard sciences. And we are developing the list of keywords as a priority so we have more and better keywords.

When we launched this we tried to get word out as far and wide as possible. Great feedback received so far. The data is really heterogeneous in quality, full text status etc. so feedback received has been great for finding any issues with access to full text documents.

Many use their repository for metadata only. That would be fine if we could identify where a record is metadata only. We could use dc:rights, but many people do not use this. How do we identify records without any full-text documents? We need to weed out many non-OA records from High North RD – we only want OA documents; it's not a bibliographic service we want to make. Looking at document types, we have a large amount of text and articles/journals but also a lot of images (14-15% ish). The language distribution is dominated by English, with much smaller percentages in French, Norwegian… and other languages.
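One conceivable heuristic for the metadata-only problem (not the project's actual method; the field names and extensions are assumptions) is to flag records whose identifiers never point at a full-text file:

```python
# Flag records that look metadata-only: no open-access rights statement
# and no identifier ending in a known full-text extension.
FULLTEXT_EXTS = (".pdf", ".ps", ".doc", ".txt")

def looks_metadata_only(record):
    if "openAccess" in record.get("rights", ""):
        return False
    ids = record.get("identifier", [])
    return not any(u.lower().endswith(FULLTEXT_EXTS) for u in ids)

meta_only = {"identifier": ["http://example.org/record/9"], "rights": ""}
with_text = {"identifier": ["http://example.org/files/9.pdf"]}
# looks_metadata_only(meta_only) -> True; looks_metadata_only(with_text) -> False
```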

So looking at the site (http://highnorth.uit.no/). It’s DSpace and everything in it is included in a single collection. So… if I search for pollution we see 2200 results and huge numbers of keywords that can be drilled down into. You can filter by document type, date, languages etc.

And if we look at an individual record we have a clear feedback button that lets users tell us what the problem is!


Q1) You mentioned checking the quality of keywords you don't trust, and that you have improvements coming to keywords. Are you quality checking the "trusted" keywords?

A1) When we have a problem record we can track back over the keywords and see if one of them is giving us problems; we have to do it that way.

We believe this to be a rather new method, to use keywords in this way to filter content. We haven’t come across it before, it’s simple but interesting. We’d love to hear about any other similar system if there are any. And it would be applicable to any topic.

Topic: International Image Interoperability Framework: Promoting an Ecosystem of Open Repositories and Open Tools for Global Scholarship
Speaker(s): Tom Cramer

I’m going to talk about IIIF but my colleagues here can also answer questions on this project. I think it would be great to get the open repositories community involved in this process and objectives.

There are huge amounts of image resources on the web – books, manuscripts, scrolls, etc. Loads of images, and yet really excellent image delivery is hard: it's slow, it's expensive, it's often very disjointed, and often it's just ugly. If you look at the bright spots – Seadragon, Google Art Project, or other places with annotation or transcription – it's amazing to see what they are doing vs. what we do. It's like page turners a few years ago – there were loads, all mediocre. Can we do better?! And we – repositories, software developers, users, funders – all suffer because of this stuff.

So consider…

… a paleographer who would like to compare scribal hands from manuscripts at two different repositories – very different marks and annotations.

… an art and architecture instructor trying to assemble a teaching collection of images from multiple sources…

… a humanities scholar who would like to annotate a high resolution image of an historical map – lots of good tools but not all near those good resources.

… a repository manager who would like to drop a newspaper viewer with deep zoom into her site with no development or customization required…

… a funder who would like to underwrite digitization of scholarly resources and decouple content hosting and delivery.

We started last September a year-long project to look at this – a group of 6 of the world's leading libraries plus Stanford. Last September we looked at the range of different image interfaces. Across our 7 sites there were 15 to 20 interfaces each; including Oxford it was more like 40 or 50. Oxford has lots of legacy humanities interfaces – lovely but highly varied – hence the increase in numbers.

So we want specialised tools but a less specialised environment. We have been working on the Parker on the Web project – a medieval manuscripts project with KCL and Stanford; the Roman de la Rose project is similar in type. Every one of these many repositories is a silo – no interoperability. Every one is a one-off – big overhead to code and keep. And every user is forced to cope – many UIs, little integration, no way to compare one resource with another. They are great for the researchers who fed into the design but much less useful for others.

Our problem is that we have confused the roles and responsibilities of the stakeholders here. We have scholars who want to find, use, analyze and annotate; they want to mix and match, and they want best-of-breed tools. We have toolers, who build useful tools and apps and want users and resources. And we have the repositories, who want to host, preserve and enrich records.

So for the Parker project we had various elements managed via APIs. We have the T-PEN transcription tool; we sent T-PEN a hard drive full of TIFFs to work on. The Dictionary of Old English couldn't take a big file of TIFFs, but we gave them access to the database. We also had our own app. So our data fed into three applications here, and we could have taken the data on some round trips – adding annotations before being fed back into the database. And by taking those APIs into a framework, and up into an ecosystem, we could enable much more flexible solutions – ways to view resources in the same environment.

So we began some DMS tech work. We pulled together technologists from a dozen institutions to look at the best tools to use, the best adaptations to make etc., and we came up with basic building blocks for the ecosystem: an image delivery API (spec'd and built); a data model for medieval manuscripts (M3/SharedCanvas) – we anticipate people wanting to page through documents, and for this type of manuscript the page order, flyleaves, inserts etc. are quite challenging; support for authentication and authorization – it would be great if everything was open and free, but realistically it's not; reference implementations of a load-balanced, performant Djatoka server – this seemed to be everyone's image server of choice; an interactive open source page turning and image viewing application; and OAC-compatible tools for annotation (Digital Mappaemundi) and transcription (T-PEN).

We began the project last October, and some work is already available. The DMS Index pulls data from remote repositories and you can explore it in a common way, as the data is structured in a common way. You can also click through to annotation tools in DM, or to transcribe the page in T-PEN etc. So one index lets you explore and interact with this diverse collection of resources.

At the third DMS meeting we started wondering: if this makes sense for manuscripts, doesn't it make sense for other image materials too? IIIF basically takes the work of DMS and looks at how we can bring it to the wider world of images. We've spent the last 8 or 9 months putting together the basic elements. So there is a RESTful interface to pick up an image from a remote location. We have a draft version of the specification available for comment here: http://library.stanford.edu/iiif/image-api. What's great is the possibility to bring functionality on images into your environment that you don't already offer but would like to. Please do comment on the 0.9 draft – you have until 4pm Saturday (Edinburgh time).
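For reference, the draft Image API addresses images through a parameterised URL, roughly {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}. A small builder sketch (parameter values are illustrative, and details may differ from the final spec):

```python
def iiif_image_url(base, identifier, region="full", size="full",
                   rotation=0, quality="native", fmt="jpg"):
    """Build an image request URL in the draft IIIF pattern."""
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# Request a 300x200px region at x=100, y=100, scaled to 150px wide, rotated 90°.
url = iiif_image_url("http://example.org/iiif", "page1",
                     region="100,100,300,200", size="150,", rotation=90)
# -> http://example.org/iiif/page1/100,100,300,200/150,/90/native.jpg
```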

The thing about getting images into a common environment is that you need metadata. We want and need to focus on just what the key metadata needs to be – labels, title, sequence, attribution etc. This is based on http://shared-canvas.org – a synthesis of OAC (Open Annotation Collaboration) and DMS.

From a software perspective we are not doing software development ourselves, but we hope to foster lots of it. We have thought of this in terms of tiers for sharing images: lots of interest in Djatoka and the IIIF Image API, then sets of tools for deep panning, zooming, rotating etc., and then moving up into domain- and modality-specific apps. And so we have a wish list for what we want to see developed.

This was a one year planning effort – Sept 2011 – Aug 2012. We will probably do something at DOF as well. We have had three workshops. We are keen to work with those who want to expose their data in this sort of way. Just those organisations in the group have millions of items that could be in here.

So… What is the collective image base of the Open Repository community? What would it take to support IIIF APIs natively from the open repository platforms? What applications do you have that could benefit from IIIF? What use cases can you identify that could and should drive IIIF? What should IIIF do next? Please do let us know what we could do or what you would like us to do.

Useful links: IIIF: http://lib.stanford.edu/iiif; DMS Interop: http://lib.stanford.edu/dmm; Shared-canvas: http://shared-canvas.org.


Q1) Are any of those tools available, open source?

A1) T-PEN and DM are probably available, both open-source-y. Not sure if the code is distributed yet. Shared Canvas code is available but not easy to install.

Q2) What about Djatoka and an improved, non-buggy version?

A2) There is a need for this; any patches or improvements would be useful, and no one has stepped up to the plate yet. We expect that as part of IIIF we will publish something. The National Library of Norway rewrote some of the code in C, which improved performance three-fold, and they are happy to share this. It is theoretically open source, but the code is hard to find.

And with that we are off to lunch…

 July 11, 2012, 10:03 am – LiveBlog, Updates – P3B: Open Source: Software and Frameworks LiveBlog
Jul 11 2012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.


Topic: Griffith’s Research Data Evolution Journey: Enabling data capture, management, aggregation, discovery and reuse.
Speaker(s): Natasha Simons, Joanne Morris

Griffith’s research group has grown and developed into a project-driven organization over the last year.

Griffith University is young, founded in 1971. Five campuses, international student body. People might not end up studying when they go here, the beaches are just too nice. That said, the university is very active in research.

The Research Hub is a metadata store solution based on VIVO, which pulls from various databases and stores data plus relationships. It can export that information and has researcher profile systems built in, showcasing outputs and data of authors.

Developing the hub was driven by a global push to manage large volumes of research data worldwide, thanks to improvements in technology. Improving accessibility is a key feature, especially for making Griffith a world class research institution.

Done with funding from the Australian National Data Service (ANDS). Their discovery portal pulls together Australian research and makes it available to the public. The MetaData Exchange Hub, which Research Hub is built upon, collects appropriate metadata and provides custom feeds from CMS in a standard format. This improves discovery in a sustainable way.

Where to get data, and information about it? Griffith already had a database and publications repository, as well as research management databases and a meta-directory for other services. The meta-directory made some private information stores accessible in an indirect way.

Sources go into the hub. From the hub, there is interaction with persistent IDs of authors and personnel. The hub pushes out to discovery environments of governmental and educational institutions and beyond.

Uses the ISO 2146 standard as its Registry Interchange Format. Four different kinds of objects: collection, party, activity, service. Each of these objects can be related to the others.
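Those four object classes and their cross-links can be sketched like this (class and relation names are illustrative, not the actual ISO 2146/RIF-CS schema):

```python
from dataclasses import dataclass, field

@dataclass
class RegistryObject:
    key: str
    kind: str                      # "collection", "party", "activity" or "service"
    related: list = field(default_factory=list)

    def relate(self, other, relation):
        """Record a typed link to another registry object."""
        self.related.append((relation, other.key))

dataset = RegistryObject("griffith/ds1", "collection")
author = RegistryObject("griffith/p1", "party")
project = RegistryObject("griffith/a1", "activity")
dataset.relate(author, "hasCollector")
dataset.relate(project, "isOutputOf")
# dataset.related == [("hasCollector", "griffith/p1"), ("isOutputOf", "griffith/a1")]
```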

Uses VIVO – a semantic web, RDF triple-store approach – to gather and share.

The research hub is a one-stop shop for all Griffith research activity, an open source and international solution. Enter data once and use it everywhere. It is automated, aggregates multiple sources, preserves.

Challenges: early versions of VIVO were not final products, making Griffith a guinea pig. Getting private information out, or scraping public information from it, took some hacks and workarounds. There are some latency issues. Self-editing in the hub is preventing proper presentation – two versions of a piece of content can then exist.

In the future, phase 2 will allow further export and visualization support. It will also begin to track citations in a data citation project. This will help show the value of the hub.

Q: Usefulness of scraped metadata? What is the feedback on usage?

A: It is a fairly rich bit of information, though statistics have not yet been collected on actual use. Data citation is not being used fully yet, so statistics are lacking thus far. It seems that cross-discipline researchers are using the data more.

Q: In terms of persistent IDs, can you explain what standard you are using, how?

A: VIVO has its own standard, as does ANDS. Other identifiers are supported, ones that are used at a national level. The DSpace repository uses handles, and DOIs are being minted.

Q: It seems that people are risk-averse in the UK when it comes to publication and citation. What returns are sought in Australia?

A: Tracking citations is further down the track for Griffith. Until automation can be figured out, the citation and sourcing problem will be a struggle.


Topic: Building an institutional research data management infrastructure
Speaker(s): Sally Rumsey

It’s all about collaboration with everyone.

The need for public data repositories seems to be coming from a different direction than library repos and data management originally did. At Oxford, many projects are underway, but this particular project takes bits from each and adds more for a core research data management infrastructure.

The demand comes from disappearing data, response times, citation tracking, etc.

The EPSRC is leading the way pushing for solutions to these demands.

The DaMaRo project has four strands: research data management policy (which Edinburgh is ahead of the curve on) – something in development; training, support and guidance – not just for students and researchers, but for support staff reskilling to bridge university communities; technical development and maintenance; and sustainability – the business plan post-funding.

Data governance is guiding the action. DataStage (depositing via SWORD to DataBank) and ViDaaS both feed toward the Oxford DataFinder. DataFinder will be the hub of the whole infrastructure for the university: it holds metadata, relates content, assigns IDs, and pulls from regional DataFinders and Colwiz.

The plan is to integrate with federated data banks and share with everyone. DataFinder is just the store for metadata – a convenient search and discovery tool. DataFinder is built on DataBank, so they are totally compatible and import-export functionality is seamless. DataFinder is metadata agnostic – no sense in being picky.

DataFinder can hold records for just about anything, including non-digital objects (papers, specimens, etc). Populating with all of that will require some manual entry, also pulling in existing data. There will be a minimum metadata set about a given object: it will be as small as possible for the sake of the researcher and for cross-discipline functionality. Optional information fields will also be included, and contextually require particular fields of metadata depending on funding bodies and disciplines.
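A required-plus-optional scheme like that reduces, at its simplest, to a set check (the field names here are guesses for illustration, not DataFinder's actual schema):

```python
# Minimal core fields every deposit must supply; the rest are optional.
REQUIRED = {"title", "creator", "date"}
OPTIONAL = {"description", "funder", "discipline", "format"}

def missing_required(record):
    """Return the required fields a deposit record still lacks."""
    return sorted(REQUIRED - set(record))

rec = {"title": "Beetle specimens", "creator": "J. Morris"}
# missing_required(rec) -> ["date"]
```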

DaMaRo is not a cure-all, it is a foundation for research management in the future. By the end of March, some set-in-stone goals should have been met: it has to be good enough, not perfect. Just enough services, metadata, and the capacity to provide immediate and flexible needs.

Q: Why is content format not a metadata requirement?

A: Hopefully it will be automatically picked up like the size of the content file. It could be asked for, for preservation purposes, but automatic is ideal.

Q: What is the status of the code for DataFinder?

A: Really new. Look around in the Autumn for something to play with.

Q: Did you weigh up the pros and cons of making researchers give more, better data? Is there a balance between ease of use and quality?

A: People don't like to do this stuff manually. There is encouragement to provide as much information as possible, but mandating fields doesn't necessarily work across disciplines. Certain disciplines have their own complex metadata stores as well, but starting with low barriers will give a realistic view of what people are willing to do.

Q: Will the cost of running this be passed on to researchers?

A: That seems inevitable. As part of research funding, the cost of sharing outputs will have to be put in. This will be included in researcher training, hopefully, but unfunded research poses a problem as well. How will that all be funded? Who knows. Most institutions will have to handle this problem.


Topic: Institutional Infrastructure for Research Data Management
Speaker(s): Anthony Errol Beitz, Paul Bonnington, Steve Androulakis, Simon Yu, David Saint

Why do researchers care about data more? We know they have an onslaught of data they access and share, whether it’s their own or others. There are legal obligations of privacy, of fulfilling grant requirements that infrastructure can help with. And access to data will change the way researchers propose and work in general.

Research institutions can get additional funding and attract better people with good data stores. And they can escape legal risk.

Researchers work in an interpretive mode, focusing on outcomes. They are open ended and thrive on ambiguity, and they are very responsive. Things are always shifting and so are they. Researchers will be loyal to their research community, if not their institution or their ICTs.

Universities that support researchers with IT divide that responsibility into administration, education, and research. Each works toward different ends but all need to work in the same space. Continuity is a general priority. IT groups work in an analytical mode, not an interpretive one, so there is a bit of a clash.

Data is growing at an exponential rate, and budgets are not. How to keep up is a big worry.

Data management planning fits into the design stage of research, while research data management covers everything from experimenting to publishing. Repositories and portals handle exposure.

Research data management is in its early days. Researchers are still using physical media to move their data around. That isn’t just inefficient, it’s dangerous. Providing one RDM solution for every field, every institution, will not work. It takes a good cultural fit, a community solution and not an institutional solution – it comes back to loyalty. That means universities need to adopt tools, adapt them, and develop from there. Creating unique tools is not sustainable, or even initially possible. Developing new things is expensive and it breaks the collaboration cycle of academic communities.

There are many deployment considerations to take into account: hosting options, ethical or legal or security obligations must be met. This just means institutions need to be flexible.

RDM platforms will help researchers capture, share and publish. Data capture infrastructure feeds the whole RDM platform, which is built on storage infrastructure foundations. Support infrastructure, the forward-facing aspect, puts all of the information into discovery services so it can be used in a meaningful way for everyone. Monash University is building infrastructure only when it fits into the data management planning that has already been laid down. Go through the checklist, not just for the sake of acting.

Uptake of interoperable effective RDMs in Australia, and particularly Victoria, is quite high because everything is as easy and functional as possible.

Ensuring fit-for-purpose at Monash University. The technical aspect of a given solution is not pushed – early engagement leads to development. So they adopt Agile software development methodologies. This is a product: treat the researcher like a customer with lots of demands.

Promoting good adoption means creating a sense of ownership in the community. Let ‘them’ support it, raise awareness, and find funding for it. This ensures sustainability, and it’s been quite effective so far.

Supporting eResearch services requires a different psyche. Bespoke systems addressing unique needs, with unique support (no vendors or huge communities – so eResearch support groups in general). And many groups need to be engaged.

Again, adopt first, then adapt, and develop your own solution as a last resort.

Q: You’re plugged into specific federated repositories, not a university-specific one. Are there ever cases where there is no disciplinary solution?

A: Yes, but then academics go to a national repository instead of an institutional one. Monash can host data and disseminate it to whatever repository. It doesn’t publicize, it exposes data to research portals.

Q: There’s a primary data policy at Monash – at what granularity? Only publication-associated data or…?

A: It’s pretty much all data, without discrimination. Researchers and institutions will choose, but as of now all is OK. Storage is a non-issue so far.


 July 11, 2012  Posted by at 9:58 am LiveBlog, Updates Tagged with:  Comments Off on P3A: Research Data Management and Infrastructure LiveBlog
Jul 11 2012


Topic: Built to Scale?
Speaker(s): Edwin Shin

I’m going to talk about a project I recently worked on with a really high volume of reads vs. writes: 250 million records – the largest Blacklight Solr application. It only took a couple of days to index these with Solr, but reaching reasonable query performance thresholds is more complex. The records were staged in a relational database (Postgres), at around 1KB/record (bibliographic journal data). There are some great documented examples that helped us. And we had a good environment – 3 servers, each with 12 physical cores and 100GB RAM. Moving all that data out of Postgres – 80GB compressed – took a long time. The rate of ingest of the first 10K records, if constant, suggested that all 250 million could be achieved in under a day, but performance really slowed down after the first…

We assigned 32GB of heap to the JVM – we found RAM has more impact than CPU. We switched to Java 7. We added documents in batches of 1000, stopped forcing commits, and only committed every 1 million documents. So in the end we indexed, to the level we wanted, 250 million documents in 2.5 days. We were pretty happy with that.
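The batching-and-deferred-commit pattern described here can be sketched as below. This is a minimal illustration, not the team's actual code: the Solr URL and record fields are hypothetical, and the HTTP posting is injectable so the batching logic stands on its own.

```python
import itertools
import json
import urllib.request

SOLR_UPDATE = "http://localhost:8983/solr/records/update"  # illustrative endpoint
BATCH_SIZE = 1000
COMMIT_EVERY = 1_000_000

def batches(records, size=BATCH_SIZE):
    """Yield successive lists of `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def index_all(records, post=None):
    """Send records in batches of 1000, committing only every COMMIT_EVERY docs.
    `post` is injectable so the logic can be exercised without a live Solr."""
    if post is None:
        def post(docs, commit):
            url = SOLR_UPDATE + ("?commit=true" if commit else "")
            req = urllib.request.Request(
                url, json.dumps(docs).encode(),
                {"Content-Type": "application/json"})
            urllib.request.urlopen(req)
    sent = 0
    for chunk in batches(records):
        sent += len(chunk)
        post(chunk, commit=(sent % COMMIT_EVERY == 0))
    return sent
```

The point of the sketch is the shape of the loop: commits are decoupled from batches, so Solr is not forced to flush and reopen searchers a quarter of a million times.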

Querying – we were working with 5 facets (Format, Journal, Author, Year and Keywords) and 7 queryable fields. The worst case was just under a minute. Too slow. So we optimised querying by running optimize after indexing, and added newSearcher and firstSearcher event handlers. But it was still slow. We started looking at sharding: 12 shards across 2 servers – 3 Tomcat instances per server, each Tomcat with 2 shards. This means splitting the index across machines, and Solr is good at letting you do that and search all the shards at once. Our worst case query dropped from 77 seconds to 8 seconds.
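Searching all shards at once works through Solr's standard `shards` request parameter, which fans a query out and merges the results. A hedged sketch of building such a request follows; the host names and core names are invented, and the real deployment (3 Tomcats per server, 2 shards each) would list its own twelve endpoints.

```python
import urllib.parse

# illustrative layout: 2 servers x 3 Tomcat ports x 2 cores = 12 shards
SHARDS = [f"server{s}:{8080 + t}/solr/shard{s}_{t}_{c}"
          for s in (1, 2) for t in (0, 1, 2) for c in (0, 1)]

def distributed_query(q, facet_fields):
    """Build a Solr /select query that fans out across all shards via the
    `shards` parameter and facets on the given fields."""
    params = [("q", q), ("shards", ",".join(SHARDS)), ("facet", "true")]
    params += [("facet.field", f) for f in facet_fields]
    return "/select?" + urllib.parse.urlencode(params)
```

Each shard answers for its slice of the index and the coordinating node merges facet counts, which is why the worst-case time fell roughly in proportion to the shard count.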

But that’s still too slow. We noticed that the filterCache wasn’t being used much; it needed to be bigger. Each shard had about 3 million unique keyword terms cached, and we hadn’t changed the default size of 512. We bumped it to about 40,000, and removed facets with a large number of unique terms (e.g. keywords). Worst case queries were now down to less than 2 seconds.

The general theme is that there was no one big thing we did or could do, it was about looking at the data we were dealing with and making the right measures for our set up.

We recently set up a Hydra installation, again with a huge volume of data. We needed to set up ingest/update queues with a variable number of “worker” threads. It became clear that Fedora was the bottleneck for ingest. Fedora objects were created programmatically rather than by FOXML documents – making it slower. The latter would have been fast but would have caused problems down the road, less flexibility etc. Solr performed well and wasn’t a bottleneck. But we got errors and data corruption in Fedora when we had 12-15 concurrent worker threads. What was pretty troublesome was that we could semi-replicate this in staging, but we couldn’t get a test case and never got to the bottom of it. So we worked around it, and decided to “shard” a standalone Fedora repository. It’s not natively supported, so you have to do it separately. Sharding is handled by ActiveFedora using a simple hashing algorithm, much as Fedora uses internally for distributing files. We started with just 2 shards and get, on average, a pretty even distribution across the Fedora repositories. This more or less doubled ingest performance without any negative impact.
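The simple hashing scheme can be illustrated as follows. This is a sketch of the idea only – the shard URLs are hypothetical and the hash function is an assumption, not necessarily what ActiveFedora uses – but any stable hash of the PID gives the same property: every client deterministically picks the same repository for a given object, and objects spread roughly evenly.

```python
import hashlib

FEDORA_SHARDS = ["https://fedora1.example.org/fedora",
                 "https://fedora2.example.org/fedora"]  # illustrative URLs

def shard_for(pid: str) -> str:
    """Pick the repository for an object by hashing its PID, so the
    assignment is deterministic and roughly uniform across shards."""
    digest = hashlib.md5(pid.encode("utf-8")).hexdigest()
    return FEDORA_SHARDS[int(digest, 16) % len(FEDORA_SHARDS)]
```

Adding a shard changes the modulus and so would reassign most objects; that is fine for a write-once design but worth noting before growing the shard list.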

We worked on another project at the end of last year, with 20 million digital objects and 10-39 read transactions per second, 24/7. High availability was required: no downtime for reads, and no more than 24 hours downtime for writes. A very challenging set up.

So, the traditional approach for high uptime is the Fedora Journaling Module, which allows you to ingest once to many “follower” installations. Journaling is proven; it’s a simple and straightforward design. Every follower is a full, redundant node. But that’s also a weakness: huge amounts of data and computationally expensive processes that happen on EVERY node, which is expensive in terms of time, storage and traffic. And this approach assumes a Fedora-centric architecture. If you have a complex set up with other components this is more problematic still.

So we modeled the journaling and looked at what else we could do. We set up an ingest that was replicated, fed out to a Fedora shared file system and into nodes, but not doing FULL journaling.

But what about backups, upgrades and disaster recovery? The classic argument for Fedora is that you can always rebuild; with 20 million digital objects, in a disaster that could take months. But we found that most users used new materials – items from the last year – so we did some work to make that disaster recovery process faster.

Overall the general moral of the story is that you can only make these types of improvements if you really know your data and your system.

Q1) What was the garbage collector you mentioned?

A1) G1 Garbage collector that comes with Java 7

Q2) Have you played with the chaos monkey idea? Netflix copies to all its servers and it randomly stops machines to train the programming team to deal with that issue. It’s a neat idea.

A2) I haven’t played with it yet, I’ve yet to meet a client who would let me play with that but it is a neat idea.

Topic: Inter-repository Linking of Research Objects with Webtracks
Speaker(s): Shirley Ying Crompton, Brian Matthews, Cameron Neylon, Simon Coles

Shirley is from STFC – the Science and Technology Facilities Council. We run large facilities for researchers, manage a huge amount of data every year, and my group runs the e-Infrastructure for these facilities – including the ICAT Data Catalogues, e-publications archive and Petabyte Data Store. We also contribute to data management, data preservation etc.

Webtracks is a joint programme between STFC and the University of Southampton: Web-scale link TRACKing for research data and publications. Science on the web increasingly involves the use of diverse data sources, services and objects, ranging from raw data from experiments through to contextual information, lab books, derived data, and research outputs such as publications, protein models etc. When data moves from research facility to home institution to the web, we lose the whole picture of the research process.

Linked data allows us to connect up all of these diverse areas. If we allow repositories to communicate then we can capture the relationship between research resources in context. It will allow different types of resources to be linked within a discipline – linking a formal publication to on-line blog posts and commentary. Annotations can be added to facilitate intelligent linking. It allows researchers to annotate their own work with materials outside their own institution.

Related protocols here: Trackback (tracking distributed blog conversations, with fixed semantics) and Semantic Pingback (an RPC-based, peer-to-peer protocol).

In Webtracks we took a two-pronged approach: an inter-repository communication protocol and a Restlet framework. The InteRCom protocol allows repositories to connect and describe their relationship (e.g. cito:isCitedBy). InteRCom is a two-stage protocol like Trackback: first harvesting of resources and metadata, then a pinging process to post the link request. The architecture is based on the Restlet Framework (with data layer access; app-specific config for security, encoding and tunneling; and a resource wrapper). This has to accommodate many different institutional policies – whitelisting, pingback (and checking whether a request is genuine), etc. Lastly you have to implement the resource cloud to expose the appropriate links.
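The two-stage shape of the protocol can be sketched abstractly: a pinging repository posts a subject-predicate-object link request, and the receiving repository applies its policy (such as whitelisting) before recording it. The field names, hosts and whitelist below are all illustrative, not the actual InteRCom wire format.

```python
from urllib.parse import urlparse

# illustrative whitelist of hosts this node will accept pings from
WHITELIST = {"icat.example.org", "epubs.example.org"}

def make_link_request(subject_uri, object_uri, predicate="cito:isCitedBy"):
    """Stage two of the protocol: the body of a link-request ping,
    expressing one subject-predicate-object citation link."""
    return {"subject": subject_uri, "predicate": predicate, "object": object_uri}

def accept_ping(link):
    """A policy check a receiving repository might apply: only record
    links whose subject lives on a whitelisted host."""
    return urlparse(link["subject"]).netloc in WHITELIST
```

Keeping the predicate as an explicit field is what leaves link semantics unconstrained, as the talk emphasises: cito:isCitedBy is just one possible value.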

Webtracks uses a Resource Info Model: a repository connected to a resource, connected to a link, and each link has a subject, predicate and object. The link can be updated and tracked automatically using HTTP. We have two exemplars using Webtracks: the ICAT investigation resource – a DOI landing page and an HTML representation with RDFa, so machine- and human-readable versions – and EPubs, set up much like ICAT.

InteRCom citation linking – we can see the ICAT DOI landing page linking to the EPubs expression links page. That ICAT DOI also links to the ICAT investigation links page, which in turn links to the EPubs expression page. And that expression page feeds back into the EPubs expression links page.

Using the Smart Research Framework we have integrated services to automate prescriptive research workflows – attempting to preemptively capture all of the elements that make up the research project, including policy information, to allow the researcher to concentrate on their core work. That process will be triggered at STFC and will capture citation links along the way.

To summarise, Webtracks provides a simple but effective mechanism to facilitate propagation of citation links to provide a linked web of data. It links diverse types of digital research objects. To restore context to dispersed digital research outputs. No constraints on link semantics and metadata. It’s P2P, does not rely on centralised service. And it’s a highly flexible approach.

Topic: ResourceSync: Web-based Resource Synchronization
Speaker(s): Simeon Warner, Todd Carpenter, Bernhard Haslhofer, Martin Klein, Nettie Legace, Carl Lagoze, Peter Murray, Michael L. Nelson, Robert Sanderson, Herbert Van de Sompel

Simeon is going to talk about resource synchronization. We are a big team and have funding from the Sloan Foundation and from JISC. I’m going to talk about discussions we’ve been having. We have been working on the ResourceSync project, looking at replication of web material… it sounds simple but…

So… synchronization of what? Well, web resources – things with a URI that can be dereferenced and are cacheable. Hidden in that is something about support for different representations and content negotiation. No dependency on underlying OS, technologies etc. From small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources). We want this to be properly scalable to large resources or large collections of resources. And then there is the factor of change – is it slow change (weeks/months), for an institutional repository maybe, or very quick (seconds), like a set of linked data URIs – and what latency is acceptable. And we want this to work on/via/native to the web.

Why do this? Because lots of projects are doing synchronization, but case by case – the project teams are involved in these projects. Lots of us have experience with OAI-PMH; it’s widely used in repositories, but it’s XML metadata only and web technologies have moved on hugely since 1999. There are loads of use cases here with very different needs. We had lots of discussion and decided that some use cases were in scope and some not. The out-of-scope-for-now list is: bidirectional synchronization; destination-defined selective synchronization (query); special understanding of complex objects; bulk URI migration; diffs (hooks?) – we understand this will be important for large objects, but there is no way to do it without knowing media types; intra-operation event tracking; content tracking.

So a use case: DBpedia Live duplication. 20 million entries, updated once per second. We need push technology; we can’t be polling this all the time.

Another use case: arXiv mirroring. 1 million article versions, about 800 created each day, updated at 8pm US Eastern time. Metadata and full text for each article. Accuracy is very important, and we want a low barrier for others to use it. It works, but currently uses rsync, and that’s specific to one authentication regime.

Terminology here:

  • Resource – an object to be synchronized, a web resource
  • Source – the system with the original or master resources
  • Destination – the system being synchronized to
  • Pull
  • Push
  • Metadata – information about resources such as URI, modification time, checksum etc. Not to be confused with metadata that ARE resources.

We believe there are 3 basic needs to meet for synchronization: (1) baseline synchronization – a destination must be able to perform an initial load or catch-up with a source (avoiding out-of-band setup; provide discovery); (2) incremental synchronization – a destination must have some way to keep up to date with changes at a source (subject to some latency; minimally create/update/delete); (3) audit – it should be possible to determine whether a destination is synchronized with a source (subject to some latency; we want efficiency –> HTTP HEAD).

So two approaches here. We can get an inventory of resources and then copy them one by one via HTTP GET, or we can get a dump of the data and extract metadata. For auditing we could do a new baseline synchronization and compare, but that is likely to be very inefficient; we can optimize by getting an inventory and comparing it with the destination copy – using timestamp, digest etc. smartly, with a latency issue to consider again. And then we can think about incremental synchronization. The simplest method would be to audit, then copy all new/updated resources and remove deleted ones. Optimize this by changing communication – exchange ChangeSets listing only updates; resource transfer – exchange dumps for ChangeSets or even diffs; and a change memory.
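The inventory-comparison audit can be sketched in a few lines. This is an illustration of the idea, not ResourceSync code: each inventory is assumed to map a URI to its (lastmod, digest) metadata, and the comparison yields exactly the create/update/delete actions an incremental sync needs.

```python
def audit(source, destination):
    """Compare two {uri: (lastmod, digest)} inventories and return the
    (creates, updates, deletes) needed to bring the destination up to date."""
    creates = sorted(u for u in source if u not in destination)
    deletes = sorted(u for u in destination if u not in source)
    updates = sorted(u for u in source
                     if u in destination and source[u] != destination[u])
    return creates, updates, deletes
```

Comparing digests catches changed content even when timestamps are unreliable, at the cost of having to compute them; comparing timestamps alone is the cheap approximation.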

We decided for simplicity to use pull, but some applications may need push. And we wanted to start with the simplest idea of all: Sitemaps as an inventory. So we have a framework based on Sitemaps. Level 0 is the base level: publish a sitemap and someone can grab all of your resources. A simple feed of URL and last modification date lets us track changes. The Sitemap format was designed to allow extension; it’s deliberately simple and extensible. There is an issue about size: the structure handles up to 2.5 billion resources before further extension is required. Should we try to make this look like the RDF we expect? We think not, but we can map the Sitemap structure to RDF.
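A level-0 inventory is just a standard Sitemap carrying URL and last-modified date. As a rough sketch (the resource URIs below are invented), generating one needs nothing beyond the stock Sitemap namespace:

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def inventory_sitemap(resources):
    """Serialize [(uri, lastmod)] pairs as a plain Sitemap – the level-0
    inventory a destination can fetch to grab or audit every resource."""
    ET.register_namespace("", NS)
    urlset = ET.Element("{%s}urlset" % NS)
    for uri, lastmod in resources:
        url = ET.SubElement(urlset, "{%s}url" % NS)
        ET.SubElement(url, "{%s}loc" % NS).text = uri
        ET.SubElement(url, "{%s}lastmod" % NS).text = lastmod
    return ET.tostring(urlset, encoding="unicode")
```

Because the format caps each sitemap at 50,000 URLs, a large collection publishes many sitemaps plus one sitemapindex, which is exactly the 46-sitemap arrangement described for arXiv below.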

At the next level we look at a ChangeSet. This time we reuse the Sitemap format but include information only for change events over a certain period. To get a sense of how this looks we tried it with arXiv. Baseline synchronization and audit: 2.3 million resources (300GB); 46 sitemaps and 1 sitemapindex (50k resources/sitemap).

But what if I want a push application that will be quicker? We are trying out XMPP (as used by Twitter etc.) – there is lots of experience and there are libraries to work with for this standard. So this model is about rapid notification of change events via XMPP push. LANL Research Library ran a significant-scale experiment replicating the DBpedia Live database from Los Alamos to two remote sites, using XMPP to push notifications.

One thing we haven’t got to is dumps. Two thoughts so far: a Zip file with a Sitemap – a simple and widely used format, but a custom solution – or WARC, the Web ARChiving format, designed for just this purpose but not widely used. We may end up doing both.

Real soon now, a rather extended and concrete version of what I’ve said will be made available. A first draft of the sitemap-based spec is coming in July 2012. We will then publicize it and want your feedback, revisions, experiments etc. in September 2012. And hopefully we will have a final specification in August.


Q1) Wouldn’t you need to make a huge index file for a site like ArXiv?

A1) It depends on what you do. I have a program to index arXiv on my own machine and it takes an hour, but it’s a simplified process – I tested the “dumb” way. I’d do it differently on the server. But arXiv is in a Fedora repository, so you already have that list of metadata to find changes.

Q2) I was wondering as you were going over the SiteMap XML… have you considered what to do for multiple representations of the same thing?

A2) It gets really complex. We concluded that multiple representations with same URI is out of scope really.

Q3) Can I make a comment – we will soon be publishing use cases, probably on a wiki, probably on GitHub, and I would ask people to look at that and give us feedback.

 July 11, 2012  Posted by at 8:26 am LiveBlog, Updates Tagged with:  Comments Off on P2A: Repository Services LiveBlog
Jul 11 2012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.


Topic: Augmenting open repositories with a social functions ontology
Speaker(s): Jakub Jurkiewicz, Wojtek Sylwestrzak

The project began with ontologies, motivated by the SYNAT project. The project requires specific social functions, and particular ontologies are used to analyze them as completely as possible.

This particular project started in 2001, aiming to create a platform for integrated digital libraries from a custom platform hosting the Virtual Library of Science. BWMeta was used so that different versions of the metadata schema could be put into place.

The Virtual Library has 9.2 million articles, mostly full text from journals, but also traditional library content. That traditional content creates problems with search because it is not all digitized.

SYNAT brings together 16 leading Polish research institutions, and the platform aims to manage all of this data in a way that users can interact with well – all ultimately using BWMeta 2.

In Poland an open mandate initiative requires the project to have the capacity to host open licensed data, and allow authors to publish their works (papers, data, etc). Support for restricted access content is also included, with a ‘moving wall’ for embargoed works – content is stored in the repository, and it will switch from closed to open access after a pre-decided time.

Social functions of SYNAT…

Users can discuss the resources, organize amongst themselves into smaller groups, share reading lists, follow the activities of other users or organizations (published content, comments, conferences, etc). This is all part of the original project aims and goals.

Analysis of social functions was based upon some prior work, for efficiency. Bibliographic elements would use Dublin Core and BIBO, with Friend of a Friend (FOAF) as well. All of the different objects (users, metadata fields, and so on) have been mapped to particular ontologies.

People on the platform can be users or authors. The assessment makes particular note of the fact that persons and users are different – there are likely more users than people involved with the platform. Also, each can be connected to specific works and people, or not, depending on user preference. People will have published objects as well as unofficial posts (forum threads, comments). Published objects can be related to events based on whether they were published in or because of said event.

So, objects include user profiles, published objects, forum activity and posts, groups, events. These are all related to one another using predicates (of, by, with). This model then satisfies the requirements of the project aims and goals.
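The reuse of existing vocabularies can be pictured with a tiny triple-building sketch. The URIs, names and exact predicates below are illustrative (the talk names Dublin Core, BIBO and FOAF but not its precise term choices); the point is that the model emits plain subject-predicate-object triples drawn from established ontologies rather than minting new terms.

```python
# vocabulary prefixes reused rather than invented (illustrative subset)
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "bibo": "http://purl.org/ontology/bibo/",
}

def person_triples(person_uri, name, published):
    """Emit (subject, predicate, object) triples tying a person to their
    published objects: FOAF for the person, Dublin Core for authorship."""
    triples = [(person_uri, "rdf:type", "foaf:Person"),
               (person_uri, "foaf:name", name)]
    triples += [(work, "dcterms:creator", person_uri) for work in published]
    return triples
```

An export along these lines is what makes the easy RDF serialization mentioned below possible even though triples are not the storage format.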

It is important to point out how previous work was reused from existing ontologies. It simplifies the analysis process and makes it more precise because of reiteration. There is also easy RDF export from the final system for interoperability, though RDF is not used for storage in the database.

In the future, implementation of these social analysis functions will be done.

Q: Has ontology work been shared with the content suppliers for whatever purpose? Do they think it will add value?

A: They aren’t disinterested, but it isn’t something they are interested in for themselves. They are glad it is offered as part of the service.


Topic: Microblogging Macrochallenges for Repositories
Speaker(s): Leslie Carr, Adam Field

This all comes about from sociological work on the London riots: analysis done with interviews and, most importantly, with videos posted to YouTube and shared on Twitter by passersby. People took these videos offline quite quickly out of fear of retribution – this meant going back to gather data was difficult.

Les is running a web science and social media course now, with a lot of emphasis on Twitter. It provides a good understanding of group feeling, given the constraints of Twitter. Why not extend repositories to make Twitter useful in that area?

The team built a harvester, which connects to the Twitter Search API for now. No need for authentication and no “real” API access per se, but it works all right. You can only go back 1500 tweets per search, but that has been enough. The Search API is hit every 20 minutes. This was to be preserved for the sake of the system itself, but other people came out to share their own harvested tweets. There are coding benefits and persistent resource benefits.
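A 20-minute polling harvester of this sort can be sketched as below. This is a hedged illustration, not the EPrints harvester itself: it targets the unauthenticated v1 Search endpoint of the era (long since retired), and the fetch step is injectable so the incremental logic stands alone.

```python
import json
import urllib.parse
import urllib.request

# the unauthenticated v1 Search endpoint of the era (now retired)
SEARCH = "http://search.twitter.com/search.json"

def search_url(query, since_id=None, rpp=100):
    """Build one search call; since_id keeps each poll incremental."""
    params = {"q": query, "rpp": rpp}
    if since_id is not None:
        params["since_id"] = since_id
    return SEARCH + "?" + urllib.parse.urlencode(params)

def harvest_once(query, since_id, store, fetch=None):
    """One poll: fetch new tweets, store each, return the new high-water
    mark. `fetch` is injectable so the logic can be tested offline."""
    if fetch is None:
        fetch = lambda url: json.load(urllib.request.urlopen(url))
    results = fetch(search_url(query, since_id)).get("results", [])
    for tweet in results:
        store(tweet)
    return max([t["id"] for t in results], default=since_id)
```

A scheduler would call `harvest_once` every 20 minutes, carrying the returned `since_id` forward so only new tweets are fetched each time.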

Each tweet would be a document living under an EPrint, those documents themselves XML files rendered into HTML. Unfortunately, doing this did not scale well: thousands of XML and HTML files under one EPrint. When the system checked file sizes, it would take 30 minutes to render the information – which might as well be broken.

Tweets are quite structured data beyond just the text inside. Stored separately, the other fields make a very rich database. Treating tweets as first class objects in relation to a TweetStream makes them even more valuable.

Live demo on screen of EPrints analysing OR2012 tweets: who’s been talking and how much have they said? Which hashtags, which people are discussed? What links are shared? What frequency of tweets per day? All exportable as JSON, CSV, HTML.

There are limitations of the repository – EPrints is designed for publications, not millions of little objects.

Problems with harvesting. URLs are shortened with wrappers, now t.co. The system has trouble resolving all of these redirects, but where a link ends up is enormously important. Following a link takes, on average, 1 second – a huge cost with so much content. MySQL processing has also created some limitations, but those have largely been worked around; this took a great deal of optimization and a complex understanding of the backend. A third problem was the “Twilight problem”: popular topics will spike to over 1500 tweets per harvest, so a lot is missed. This could be overcome with the Streaming API, but there are real-time issues with using that. Quite complex.
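Unwrapping shortener chains means following each redirect to its final destination, with a hop limit so loops cannot hang the harvester. A minimal sketch (not the actual EPrints code; the chain-following step is injectable for offline testing):

```python
import urllib.error
import urllib.request

class _NoRedirect(urllib.request.HTTPRedirectHandler):
    # make urllib surface 3xx responses instead of silently following them
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def http_location(url):
    """Return the Location a URL redirects to, or None if it doesn't."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        opener.open(urllib.request.Request(url, method="HEAD"))
    except urllib.error.HTTPError as err:
        if err.code in (301, 302, 303, 307, 308):
            return err.headers.get("Location")
        raise
    return None

def resolve(url, follow=http_location, max_hops=5):
    """Unwrap chains of shorteners (t.co wrapping bit.ly, etc.); give up
    after max_hops so redirect loops can't spin forever."""
    for _ in range(max_hops):
        nxt = follow(url)
        if nxt is None:
            return url
        url = nxt
    return url
```

Using HEAD requests avoids downloading page bodies, but each hop is still a round trip, which is where the roughly one second per link comes from.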

The future. Dealing with URL wrappers. Dealing with the unboundedness of the data – there is so much that optimizations will not be able to keep up. A new strategy for the magnitude problem has to be puzzled out. Potential archival of other content: YouTube videos and comments, Google results over time.

This Twitter harvester is available online for EPrints – lightweight harvesting for nontechnical people.

There are large scale programs for this already, some people need smaller and more accessible tools (masters and doctoral students).

Q: Why are you still using EPrints? It seems like there are a lot of hacks, and you would have been better off using a small application directly over MySQL.

A: EPrints is a mature preservation platform. Easy processing now is not the best thing for the long term. Repositories are supposed to do that, so challenges should be met to overcome that.


Topic: Beyond Bibliographic Metadata: Augmenting the HKU IR
Speaker(s): David Palmer

At the University of Hong Kong (HKU), more knowledge exchange was desired to enable more discovery; then, theoretically, innovation and education indicators would improve. The Office of Knowledge Exchange chose the institutional repository, built on DSpace and developed with CILEA. The common comment on this work after it was first implemented was that part of the picture was missing.

Getting past thin and dirty metadata was a goal, along with augmenting metadata in general: profiles, patents, projects.

Publication data is pushed from HKU’s central research database, the Research Output System, filled in by authors or assistants. It needs much better metadata. They are now trying to get users to work with EndNote, DOIs and ISBNs so that cleaner metadata comes in.

Bibliographic rectification via merges or splits of author profiles, with a user API and robots. This has worked quite well.

Search and scrape of the database starts with numbers (DOI, ISBN, etc.), then searches for strings (title, author). Each entry pulls citations and impact factors if available.
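The identifier-first cascade can be sketched generically. This is an assumption-laden illustration – the field order and the `sources` shape are invented, not HKU's implementation – but it captures the idea of trying strong identifiers before falling back to fuzzier string searches.

```python
def lookup(record, sources):
    """Resolve a publication against external databases: try strong
    identifiers first (DOI, ISBN), then fall back to string searches.
    `sources` maps a field name to a search function returning a hit or None."""
    for field in ("doi", "isbn", "title", "author"):
        value = record.get(field)
        if value and field in sources:
            hit = sources[field](value)
            if hit is not None:
                return hit
    return None
```

Ordering matters: a DOI match is unambiguous, whereas a title search may return the wrong paper, so string fields are only consulted when no identifier resolves.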

Lots of people involved in making this all work, and work well.

Author profiles include achievements, grants, “cited as” (for alternate names) and external metrics via web scrape. They also include prizes (making architects happy) and supervised theses, with titles and links (making educators happy).

Previously, theses and dissertations (which are very popular) were stored in three separate silos. Now they all integrate with this system for better interactivity of content for tracking, jumping between items.

Grants and projects are tracked and displayed, too. This shows what is available to be applied for, or what has been done already – publications resulting from. Patent records included, with histories and appropriate sharing of information based on application status, publication and granting. Links to published, granted patents and abstracts in whichever countries they exist.

With all of this data, other things can be shown: articles with fastest rate of receiving citation, most citied, who publishes the most. Internal metrics show off locations of site views, views over time, and more. Visualizations are improving, so users can see charts (webs of coauthors for an author, for example) and graphs and things. The data is all in one place from other silos, which is great because on-the-fly charts would be otherwise impossible.

Has all of this increased visibility? Anecdotally, absolutely. People’s reputations are improving. Metrics show improvement as well. The hub is stickier – more pages per visit, more time per page, because everything is hyperlinked.

This work, done with CILEA, is going to be given back to the community as DSpace CRIS modules. Everything – for the sake of knowledge exchange. Mutual benefits will result, in terms of interoperability and co-development.

Q: Is there an API to access data your system has ground out of other sources?

A: A web service is in the works. The Office of Dentistry is scraping this data manually already, so it’s doable.

 July 11, 2012  Posted by at 7:55 am LiveBlog, Updates Tagged with:  Comments Off on P2B: Augmented Content LiveBlog
Jul 10 2012



Topic: The Development of a Socio-technical infrastructure to support Open Access Publishing though Institutional Repositories
Speaker(s): Andrew David Dorward, Peter Burnhill, Terry Sloan


Trying to create an infrastructure before the open access revolution happens – sooner rather than later, it seems – so the team is trying to create a template for the UK and Europe: RepNet.

RepNet aims to manage the human interaction that helps make good data happen. This is an attempt to justify the investment that JISC has made into open access and dissemination.

RepNet will use a suite of services that enable cost-effective repositories to share what they have.

First, they mapped the funders, researchers, publishers, institutions to see where publications are made.

RepNet hopes to sit between open access and research information management by differentiating between various types of open access, between standards…

Through conversations with all the stakeholders, they’ve put together a catalog of every service and component that would go into a suite for running such a repository.

Funders’, subject, and institutional repositories will all sit upon the RepNet infrastructure. This will offer service support, helpdesk and technical support, and a service directory catalogue for anyone hoping to switch to open access. All of this will then utilize various innovations, hosting, services to get to users.

RepNet also has a testing infrastructure.

RepNet is past the preparation stages now, and moving into implementation of a wave one offering that integrates everything. The next iteration will take what wave one teaches the team and improve the offering further.

Deposit tools, benchmarking, aggregation and registry are already available, and wave two will bring together more and bigger services to do these things with repositories.

The component catalogue is getting quite comprehensive, with JISC helping to bring in and assess new ideas all the time.

RepNet is being based on the information context of today – policy and mandates, plus the strong desire for open access.

The UK is a great country to be in for Open Access; there is quite a bit of political support for moving in this direction.

If the market is to be truly transparent, gold open access payment mechanisms will have to be handled. This is something new that RepNet is working on figuring out.

The focus now is on optimizing wave one components: a very comprehensive set of tools and funder-publisher policies, working with deposit and analysis tools to make everything easily accessible. REPUK, CORE, IRS, OpenDOAR, ROAR, and NAMES2 are all components being looked at, which wave two will build upon.

ITIL is being used as the language for turning strategies and ideas into projects.

There is also a sustainability plan, submitted by SIPG members: subscriptions, contributions, payment for commercial use. Further JISC underpinning is being considered as well.

Part of RepNet will be a constant assessment of services: when one needs to be retired, it will move back into the innovation zone and included again when there’s a demand.

RepNet provides an excellent service to support green and to further investigate gold open access. It will give us a great way of assessing repositories, better integration, and less human-intensive management of repositories.

The aim now is to move to data-driven infrastructure, letting different projects speak to each other through reporting mechanisms. This will make it more integrated and, ultimately, more useful.

Wave two will focus on micro services.

The sustainability plan will hopefully be put in place before 2013.

Q: Academics can see how all this works. Are there plans for making these sorts of information and services available to the public?

A: It’s all about integration with common search tools. There’s a vast gap between what has surfaced because of professional search tools, and what something like Google finds via its own crawler. It’s also important to make deposit accessible to everyone else, or at least to think beyond the academic lockdown instead of just focusing on the expert community.


Topic: Repository communities in OpenAIRE: Experiences in building up an Open Access Infrastructure for European research
Speaker(s): Najla Rettberg, Birgit Schmidt

OpenAIRE is rooted in the interests of the European Commission to make an impact on the community. Knowledge is the currency of the research community, and open access needs to be its infrastructure.

The hope is that something stronger than the green mandate comes about as the European Commission talks more about this.

OpenAIRE infrastructure aims to use publication and open data infrastructure to release research data. It involves 27 EU countries and 40 partners to pilot open access and measure impacts.

OpenDOAR, usage data, and EC funding have all fed the growth of the project. The result is a way to ingest publications, to search and browse them, to see statistics linked to the content and assess impact metrics.

Three parts, technical, networking, service. Networking brings together all the partners, stakeholders, open access champions. They run a helpdesk, build the network by finding new users, researchers, publications. This community of practice is very diverse. With everyone together there is an opportunity to link activities and find ways to improve the case for OA.

OpenAIRE provides access to national OA experts, researchers and project coordinators, managers.

Research administrators can consider a few things in their workflows to follow the open mandate and OpenAIRE shows them statistics about their open access data as they do so.

Everyone is invited to participate by registering with OpenDOAR and following OpenAIRE guidelines. OpenAIRE offers a toolkit to project officers to get going. As of now there are about 10,000 open access publications in the repository from 5000.

Part 2: OpenAIRE Phase 2

OpenAIRE phase 2 will link to other publications and funding outside of FP7, shifting from pilot to service for users.

300 OA publication repositories are being added, along with new data repositories and an orphan data repository. Don’t forget CRIS, OpenDOAR, ResearchID. All of this will go into the information space, using text mining to clean things up and make it all searchable.

Now there are OpenAIRE guidelines being built for data providers. These look at how to connect metadata to research data, and how to export it for use externally. It isn’t so much prescriptive as exploratory. With these in hand, other countries and organizations with less developed OA might be able to improve their own data offerings.

The scope of OpenAIREs work is wide. Most fields and types of data welcome, so keep in touch.

OpenAIRE is building a prototype for ‘enhanced’ publications, letting users play with the data within. This will be cross-discipline, and can be exported to other data infrastructures. Also working on ways to represent enhanced publication more visually and accessibly.

What connects data to the publication? OpenAIRE is on the boundary exploring that question.

The repository landscape is very diverse, but so are the tools for bringing data and repositories together. OpenAIRE is aware of data initiatives, stakeholder AND researcher interests. OpenAIRE is running some workshops and will be at the poster sessions throughout the conference.

Q: CORE has done a lot of text mining work already. Have you spoken to them?

A: There has been discussion with CORE about repositories, but not text mining. OpenAIRE is working with several groups.

Q: You want to develop text mining services. In this area, with linking repos and content, CORE offers an API for finding those links and for reclassification. You aim to develop these services by 2013, are you aiming to use other tools to do this, and are you happy to do so?

A: The technical folks are here for that very purpose, so keep an eye out for the OpenAIRE engineers.



Topic: Enhancing repositories and their value: RCAAP repository services
Speaker(s): Clara Parente Boavida, Eloy Rodrigues, José Carvalho

RCAAP comes from a Portuguese national initiative perspective. It is a national initiative to promote open access for the sake of visibility, accessibility, dissemination of Portuguese work.

The project started in 2008 as a repository hosting service, moving forward to validation and Brazilian cooperation, then a statistics tool.

Overseen by the FCCN.

Learn more at projecto.rcaap.pt if you are a repo manager or journal publisher.

The strategy for SARI, a part of RCAAP, is creating a custom DSpace (DSpace++) for Portuguese academic users. It offers hosting, support, design services, and autonomous administration, all for free. 26 repositories use this service, and because of the level of customisability offered, SARI repositories all have their own unique look and feel.

Another service in RCAAP is Common Repository, for people who do not produce a lot of content but want it to be openly accessible in a shared area. 13 institutions are using this slimmed down repository tool.

The RCAAP search portal enables users to search all open repositories in Portugal, plus participating organizations in Brazil: 447,934 documents from 42 resources, updated daily. Users can search by source, category, and keywords.

Aggregated information is all OAI-PMH compliant. Further, it is an SRU provider. PDF and DOC files are full text searchable. Integration via various other tools.
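As a toy illustration of what OAI-PMH aggregation involves (a sketch, not RCAAP's actual harvester; the sample response and titles are invented), an aggregator fetches XML from each repository's OAI-PMH endpoint and pulls out the Dublin Core fields:

```python
# Sketch of the parsing step an OAI-PMH aggregator performs. A real
# harvester would fetch the XML from an endpoint such as
# https://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc;
# here the response is an embedded, hypothetical sample.
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core namespace

SAMPLE_RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <dc xmlns="http://purl.org/dc/elements/1.1/">
          <title>Open Access in Portugal</title>
          <creator>Boavida, Clara</creator>
        </dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def extract_titles(xml_text):
    """Return the dc:title of every harvested record."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.iter(DC + "title")]

print(extract_titles(SAMPLE_RESPONSE))  # ['Open Access in Portugal']
```

A daily harvest, as described above, would repeat this per repository (tracking each source's `resumptionToken` for paging) and merge the results into the portal's index.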

A given entry can be shared, imported to reference managers. Author CVs and all related metadata are connected.

The RCAAP Validator uses the DRIVER Guidelines to assess a URL based on validation options. A report is then emailed, including statistics for openness and errors when checked against the DRIVER Guidelines. Errors are described. Assessment checks aggregation rules, queue validation, XML parsing and definitions, and also confirms that files in the repository all actually exist. Three types of metadata validation are done: existence of each element (title, author, date, etc.), taxonomies (ISO and DRIVER types), and structure of metadata content (DRIVER prefixes, proper date formatting). Checking all of this ensures a good search experience.
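The three metadata checks can be sketched as follows. This is a minimal illustration over dict-based records, with an invented `validate` helper; the real validator works on harvested OAI-PMH XML against the full DRIVER Guidelines:

```python
# Minimal sketch of the three checks described above: element
# existence, DRIVER type vocabulary, and date formatting.
import re

REQUIRED = {"title", "creator", "date", "type"}
# DRIVER vocabulary terms look like "info:eu-repo/semantics/article"
DRIVER_TYPE = re.compile(r"^info:eu-repo/semantics/\w+$")
ISO_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")  # YYYY[-MM[-DD]]

def validate(record):
    """Return a list of human-readable errors, empty if valid."""
    errors = []
    for field in REQUIRED - record.keys():
        errors.append(f"missing element: {field}")
    if "date" in record and not ISO_DATE.match(record["date"]):
        errors.append(f"bad date format: {record['date']}")
    if "type" in record and not DRIVER_TYPE.match(record["type"]):
        errors.append(f"type not in DRIVER vocabulary: {record['type']}")
    return errors

good = {"title": "A Study", "creator": "Silva, A.",
        "date": "2012-07-10", "type": "info:eu-repo/semantics/article"}
bad = {"title": "A Study", "date": "10/07/2012"}
print(validate(good))  # []
print(validate(bad))   # three errors: missing creator and type, bad date
```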

DSpace add-ons enable the included repositories and infrastructure to ensure openness and compliance with standards.

Add-ons include OAIextended, Minho Stats, Request Copy (for restricted access content), a sharing bar, Degois (for Portuguese researchers), Usage Statistics, OpenAIRE projects, Authority Control (auto-documentation of workflow activities), Document Type, and Portuguese Help.

Another service is SCEUR-IR. It aggregates usage statistics and allows the creation and subscription of graphic information.

The Journal Hosting Service uses the same strategy as the repository hosting service mentioned earlier, and allows total autonomy to publishers.

In Portugal and in Brazil, open access is very successful. The factors for that success are interoperability guidelines, community help and advocacy, and integration with research systems.

Q: What sort of numbers does a medium sized university in Portugal see for downloads, use?

A: Thousands of downloads/hits per day. Bigger universities will see 5000 or more.

Q: JISC is mandating metadata profiles and analyzing with Driver guidelines, and is looking to make a validator tool. Is the validator tool open source?

A: No, but information exchange is always welcome.

Q: Further on the validator, you are actually checking every record for every file – whether accessible or missing?

A: That is an option. When checking, the user sets which repository format they are using, then runs a check against that profile. An attempt is made to download each file; if the download begins within two seconds, the entry is marked available. Failure to start the download is marked as an error.
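The availability check described in the answer amounts to attempting each download with a short timeout. A minimal sketch, with an injectable fetcher so the logic can be exercised without a network (the URLs and `fake_fetch` stub are invented; a real check would pass something like `urllib.request.urlopen`):

```python
# Mark a file "available" if its download starts within 2 seconds,
# "error" on timeout or any failure to connect.
import socket

TIMEOUT_SECONDS = 2.0

def check_file(url, fetch):
    """Classify a repository file by trying to start a download."""
    try:
        response = fetch(url, timeout=TIMEOUT_SECONDS)
        response.close()  # we only need the download to begin
        return "available"
    except (socket.timeout, OSError):
        return "error"

# Hypothetical stub standing in for a repository file server.
class _FakeResponse:
    def close(self):
        pass

def fake_fetch(url, timeout):
    if url.endswith("missing.pdf"):
        raise OSError("connection refused")
    return _FakeResponse()

print(check_file("http://repo.example.pt/ok.pdf", fake_fetch))       # available
print(check_file("http://repo.example.pt/missing.pdf", fake_fetch))  # error
```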


Topic: Shared and not shared: Providing repository services on a national level
Speaker(s): Jyrki Ilva

The national library of Finland, an independent institute within the University of Helsinki, provides many services to the library network in Finland.

While there are 48 organizations with institutional repositories, only 10 public instances exist. Most of these repositories use the national library service.

The National Library provides its services to about 75% of Finnish repositories. They are not the only centralized service provider, but they are in a minority. The ‘do it yourself’ mentality has taken root, but with instancing of repositories anyone can and should have one. The same work does not need to be done over and over again in every organization.

‘Do it yourself’ does not always make sense. It is more expensive and often not as well executed as using other services. In Finland, there is much more sharing going on than in other countries.

Many countries started OA repositories in the 90s and 00s. The National Library started the idea of the digital object management system in 2003 – the first attempt at a proprietary software platform, which did not work as planned. The National Library chose DSpace instead, starting in 2006.

One of the challenges was trying to make one giant DSpace instance for all organizations. A single shared instance could not serve everyone's needs, and so the idea was not sustainable. Local managers were concerned with the threat of requirements and demands in the national repository system.

Fortunately, the National Library was chosen to be the service provider at the Rectors’ Conference for Finnish Universities of Applied Sciences in 2007.

Work is divided between customer organizations and the National Library. Curation and publication done locally. National Library develops and maintains the technical system.

Theseus, a multi-institutional repository, has seen much success. 25 universities, tens of thousands of entries.

Doria, another multi-institute platform, is technology neutral and allows more autonomy of its participant communities. This freedom allows for customizability, but less quality metadata and increased confusion amongst users.

Separate repository instances are also provided at extra cost to customers. Some organizations just prefer their own instance. TamPub and Julkari are two examples.

Selling repository services comes down to defining strong practical needs amongst customer organizations. Little marketing has been done – customers have a demand and they find a supply. Long-term access, persistent addresses for content have been selling points. While not trying to make a profit, covering costs with a coherent pricing scheme is necessary. That said, many customers have relatively small needs, and so services must be kept affordable when necessary. National Library is also considering consultation as a service.

Negotiating user contracts is time consuming, though, and balancing customer projects requires a constant assessment of development in infrastructure in general.

Some Finnish universities will continue to host their own repositories, but cooperation benefits everything: technical and policy development can be improved.

Measuring success can be done on various levels. Existence of the repository is a start. Download metrics are another option. Impact assessment should be done. Looking at measures for success, National Library is still struggling with research data and self-archived publications. They’ve had success with dissertations, heritage materials, journals, but there is always room to improve.

Q: What is the relationship between the national library and non-associated repositories? Is meta-data being recorded? Metrics?

A: In most cases, organizations are reporting to an umbrella body, which keeps track of everything.

Q: Any details on the coherent pricing system?

A: Pricing has so far been based on the size of a given organization, how much data and how many hits they will have.

Q: Are you very service oriented, and is this something that the National Library does in general, or are repositories a special case?

A: Partly a special case, because funding is not guaranteed. It wasn’t so much intended as demanded for the sake of sustainability.


Topic: The World Bank Open Knowledge Repository: Open Access with global development impact
Speaker(s): Lieven Droogmans, Tom Breineder, Matthew Howells, Carlos Rossel

Publishers are not usually the pushers of open access, but the World Bank isn’t like many other organizations.

Why do this? The World Bank is trying to reach out and inform people of what the organization actually does, what its mission is: relieving poverty.

World Bank funds projects in the developing world, whether pragmatic solutions or research for the sake of outcomes.

When the World Bank changes direction, it goes slowly but it really changes. The Access to Information Policy is wide and distinct within the organization. In particular, there is a focus on open data, research, and knowledge. This launched in 2010 with the objective of ensuring that as much Bank data as possible was accessible, for the sake of transparency and further reach.

The Open Access Policy has been adopted as of July 1st. It is an internal mandate for staff to deposit all research into the repository, and it also applies to external research funded by the Bank. This content goes into the OKR under a Creative Commons Attribution (CC BY) license. Externally published documents are made available as soon as possible, under a more restrictive Creative Commons license.

World Bank wants to join the Open Access community and lead other IGOs to do the same.

There are benefits externally and internally. Externally, policymakers and researchers gain data. Internally, authors and inter-department staff can access information that they did not easily have before.

World Bank was, until a few years ago, a very traditional publisher, but the desire that the Bank has to free its information for reuse has caused a shift.

That’s the why, and here’s the how.

Content included? Books, working papers, journal articles internally and externally. In the future, the Bank aims to include author profiles with incentivized submission, and the recovery of lost or orphaned or unshared legacy materials. Making the submission process easy and showing usage stats are also forthcoming. The latter will make entries more useful and visually appealing. Finally, there will be integration with Bank systems for interoperability of data.

This is all being deployed on IBM Websphere, integrated with Documentum, Siteminder, and the Bank’s website.

Along with the Open Development Agenda, the Bank is also contributing open source code, including facet search code.

Facet search allows results to be refined by metadata category (author, topic, date, etc). Each facet will show result counts.

The World Bank uses a taxonomy of items that includes the full map of the category hierarchy. To get the facet search filter counts right, the World Bank checks entries against each other when showing results and returns the proper number of items for a given filter selection.
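The idea of counting results against a category hierarchy can be illustrated with a toy sketch (the taxonomy, documents, and helpers here are all invented, not the Bank's): each document is counted once for every ancestor of its subject tags, so a broad facet like "Economics" reflects documents tagged with its sub-topics.

```python
# Toy facet counting over a subject hierarchy: roll each document's
# subject tags up to every ancestor category, counting each document
# at most once per category.
from collections import Counter

# child -> parent; None marks a top-level category
TAXONOMY = {
    "Economics": None,
    "Trade": "Economics",
    "Poverty": "Economics",
    "Health": None,
}

def ancestors(category):
    """Yield the category and all of its ancestors, root last."""
    while category is not None:
        yield category
        category = TAXONOMY[category]

def facet_counts(documents):
    """Count documents per category, including rolled-up ancestors."""
    counts = Counter()
    for doc in documents:
        seen = set()
        for tag in doc["subjects"]:
            seen.update(ancestors(tag))
        counts.update(seen)  # each doc counted once per category
    return counts

docs = [
    {"title": "Tariffs", "subjects": ["Trade"]},
    {"title": "Clinics", "subjects": ["Health"]},
    {"title": "Aid", "subjects": ["Poverty", "Health"]},
]
print(sorted(facet_counts(docs).items()))
# [('Economics', 2), ('Health', 2), ('Poverty', 1), ('Trade', 1)]
```

Note that "Aid" is tagged with two subjects yet counts only once toward "Economics" and once toward "Health", which is what keeps the per-filter result counts honest.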

The remaining problem is supporting drill-down in the search infrastructure: a multi-indexed browse across the entire hierarchy of a subject.


Q: Is it possible to showcase developing country research?

A: Yes. World Bank is looking for content in Africa right now, and attempting to gain access. This outreach is coming along slowly but surely.

 July 10, 2012  Posted by at 5:02 pm LiveBlog, Updates Tagged with:  Comments Off on P1B: Shared Repository Services and Infrastructure LiveBlog