Jul 11 2012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: ORCID update and why you should use ORCIDs in your repository
Speaker(s): Simeon Warner

I am speaking with my Cornell hat on and my ORCID hat on today. So this is a game of two halves. The first half is on ORCID and what it is. And the second half will be about the repository case and interfacing with ORCID.

So, the scholarly record is broken because reliable attribution of authors and contributors is impossible without unique person-level identifiers. I have an unusual name so the issue is mild, but if you have a common name you are in real trouble. We want to attach unique identities to person records across data sources and types and to enlist a huge range of stakeholders to do this.

So ORCID is an amazing opportunity that emerged a couple of years ago. Suddenly publishers, archivists, etc. all started talking about the same issue. It is an international, interdisciplinary, open and not-for-profit organization. We have stakeholders that include research institutions, funding organizations, publishers and researchers. We want to create a registry of persistent unique identifiers for all sorts of roles – not just authors – and all sorts of contributions. We have a clear scope and set of principles. We will create this registry and it will only work if it’s used very widely. Previous systems have failed because their scope wasn’t wide enough. One of the features of research is that things move – I was a physicist, now repositories, libraries… I don’t live in one space here. To create an identity you need some information to manage that. You need a name, an email, some other bits of information, and the option for users to update their profile with stuff that is useful for them. Privacy is an issue – of course. So we have a principle that ORCID is opt-in. You can hide your record if you want. You can control what is displayed about you. And we have a set of open principles about how ORCID will interact with other systems and infrastructure.

So ORCID will disambiguate researchers and allow tracking, automate repository deposition, and support other tasks that leverage this sort of ID. We have 328 participant organizations, 50 of which have provided sponsorship. And that’s all over the world.

So to go through a research organization workflow: for an organisation it’s a record of what researchers have done in that institution. But you don’t want a huge raft of staff needed to do this. So the organisation registers with ORCID. At some stage ORCID looks for a record of a person and the organisation pulls out data on that person. That search is done on already held information. Identifiers can then be created, ready for researchers to claim.

So, granting bodies: in the US there is always a complaint and a worry about the burden of reporting. So what if we tied this up to an ORCID identity? Again the granting body registers with ORCID and then an ORCID::grant link is sent to the PI or researcher for confirmation. Same idea again with the publisher. If you have granted the publisher the ability to do it you can let them add the final publication to your name, saving effort and creating a more accurate record.

So a whole set of workflows gives us a sort of vision for researchers as early as possible in the creation of research here. And in the phase I system the researcher can self-claim a profile, delegate management and institutional record creation. There is fine grained control of privacy settings. Data exchange into grant and manuscript submission systems, authorised organisations/publications etc. So right now we have an API, a sandbox server, etc. We are now working out launch partners and readying for launch. The ORCID registry will launch in Q4 of 2012. Available now: the ORCID identifier structure (coordinated with ISNI). Code, APIs, etc. are available.
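The talk did not spell out the identifier structure. As a rough sketch, assuming the format ORCID has documented publicly (16 characters in four hyphenated groups, the last being a check character computed with ISO 7064 MOD 11-2, which is also the scheme ISNI uses), a repository could sanity-check an ID typed into a form along these lines. The function names are illustrative and not part of any ORCID library.

```python
import re

# Shape only: four groups of four, last character may be the check letter X.
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_check_character(first_15_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the 15 base digits."""
    total = 0
    for digit in first_15_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(candidate: str) -> bool:
    """True if the string has the ORCID shape and a matching check character."""
    candidate = candidate.strip().upper()
    if not ORCID_PATTERN.match(candidate):
        return False
    digits = candidate.replace("-", "")
    return orcid_check_character(digits[:15]) == digits[15]
```

A check like this only catches typos. It says nothing about whether the ID has actually been issued or who it belongs to.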

So why should you use ORCID in your repository?

Well we have various stakeholders in your repository – authors, the academic community and the institutions themselves. Institutional authors want credit for their work, and ORCID should and will increase the likelihood of authors’ publications being recognised. It opens the door to linking articles that wouldn’t otherwise be linked up – analyses of citations etc. It opens the door to more nuanced notions of attribution. And it saves effort by allowing data reuse across institutions. For readers it offers better discovery and analysis tools. Valuable information for improving tools like Microsoft Academic Search, better ways to measure research contributions etc. And for institutions it allows robust links between local and remote repositories, and better tracking and measurement of the use of publications.

And from an arXiv position we’ve looked for years for something to unify author details across our three repositories. We have small, good quality repositories but we need that link between the author and materials. And from a UK/JISC perspective there is a report from the JISC Research Identifier task force group that indicates the benefits of ORCID. I think for repositories ORCID helps make repositories count in a field we have to play in.

So, you want to integrate with ORCID. There are two tiers to the API right now, I’ll talk about both. All APIs return XML or JSON data. The tier 1 API is available to all for free, with no access controls. With this you can ask a researcher for their ORCID ID and look at data made public. You could provide a pop-up in your repository deposit process to check for their ORCID ID. There is a competition between functionality and privacy here but presuming they have made their ID public this will be very useful.
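To make the tier 1 idea concrete, here is a minimal sketch of a deposit-form lookup. Everything specific in it is an assumption for illustration: the base URL is a placeholder, the `name` field is a made-up JSON shape, and `fetch` stands in for whatever HTTP client the repository already uses. The real endpoints and response format are whatever dev.orcid.org documents.

```python
import json
from typing import Callable, Optional

# Placeholder base URL (assumption): not the real ORCID endpoint.
PUBLIC_API_BASE = "https://orcid-public.example/v1"


def public_profile_url(orcid_id: str) -> str:
    """Build the (hypothetical) public lookup URL for one ORCID ID."""
    return f"{PUBLIC_API_BASE}/{orcid_id}"


def lookup_public_name(orcid_id: str, fetch: Callable[[str], str]) -> Optional[str]:
    """Return the public display name, or None if nothing usable is public.

    `fetch` takes a URL and returns the response body as JSON text.
    The "name" key is a hypothetical field used only for this sketch.
    """
    try:
        record = json.loads(fetch(public_profile_url(orcid_id)))
    except ValueError:
        # Body was not valid JSON; treat it as "no public data".
        return None
    if not isinstance(record, dict):
        return None
    name = record.get("name")
    return name if isinstance(name, str) and name else None
```

Returning `None` for a hidden or missing record keeps the deposit form usable for researchers who have chosen not to make their data public.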

Tier 2 API: members will have access to OAuth2 authentication between their service and ORCID, allowing users to grant certain rights to a service. Access to both public and (if granted) protected data. Ability to add data (if granted). There are really three steps to this process. First, any member organisation in the process would get an ORCID ID. Secondly, if you have a user approaching the repository, that user can log in and grant data access to the client repository. The user can then be redirected back to the repository along with an access permission. And if access is granted then the repository continues to have access to the user’s profile until this permission is revoked by the user (or ORCID). And data can be added to the user’s profile by the repository if it becomes available.
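The redirect-and-return step described above is the standard OAuth2 authorization code pattern, and can be sketched as follows. The query parameters (`response_type`, `client_id`, `redirect_uri`, `scope`, `state`) are the standard OAuth2 ones. The endpoint URL, client ID and scope name are placeholders, since the talk did not give the real values.

```python
import secrets
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Placeholder endpoint (assumption): the real one comes from ORCID's documentation.
AUTHORIZE_ENDPOINT = "https://orcid.example/oauth/authorize"


def new_state() -> str:
    """Random value stored in the user's session to tie the callback to the request."""
    return secrets.token_urlsafe(16)


def build_authorization_url(client_id: str, redirect_uri: str, scope: str, state: str) -> str:
    """URL the repository sends the user to in order to log in and grant access."""
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    })
    return f"{AUTHORIZE_ENDPOINT}?{query}"


def extract_authorization_code(callback_url: str, expected_state: str) -> Optional[str]:
    """Read the code from the redirect back to the repository.

    Raises ValueError on a state mismatch; returns None if the user declined.
    """
    params = parse_qs(urlparse(callback_url).query)
    if params.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch: callback does not match our request")
    if "error" in params:
        return None
    codes = params.get("code")
    return codes[0] if codes else None
```

The repository would then exchange the returned code for an access token in a server-to-server call, which is omitted here.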

All code etc. on dev.orcid.org. Follow the project on Twitter @ORCID_Org.


Q1 – Ryan) You mentioned that ORCID will send information to CrossRef, what about DataCite?

A1) I don’t think I said that. We import data from CrossRef, not the other way round. I think that would be led by the DOI owner, not ORCID. DOI is easy, someone has the right to a publication, people don’t work that way.

Q1) In that case I encourage you to work with DataCite.

A1) If it’s public on ORCID anyone can harvest it. And ORCID can harvest any DOI source.

Q2 – Natasha from Griffith University) An organisation is prompted to remove duplicates? How does that work?

A2) We are working on that. We are not ready to roll out bulk creation of identifiers for third parties at the moment. The initial creation will be by individuals and publications. We need to work out how best to do that. Researchers want this to be more efficient so we need to figure that question out.

Topic: How dinosaurs broke our system: challenges in building national researcher identifier services
Speaker(s): Amanda Hill

So I am going to talk about the wider identifier landscape that ORCID and others fit into. So on the one hand we have book-level data: it’s labour intensive, disambiguation first, authors not involved, open. And then we have the publisher angle – automatic, disambiguation later, authors can edit, proprietary. In terms of current international activity we have ISNI as well as ORCID. ISNI is very library driven, disambiguation first, authors not involved, broad scope. ORCID is more publisher instigated, disambiguation later, authors can submit/edit, current researchers. ISNI is looking at fictional entities etc. as well as researchers, so it is somewhat different.

We had a Knowledge Exchange meeting on Digital Author Identifiers in March 2012 and both groups were encouraged and present; they are aware of and working with each other to an extent. Both ISNI and ORCID will make use of existing pools of data to populate them. There are a number of national author ID systems – in 2011 there was a JISC-funded survey to look at systems and their maturity. We did this via a survey to national organisations. The Lattes system in Brazil is very long term – it’s been going since 1999 – and very mature and very well populated, but there is a diverse landscape.

In terms of populating systems there is a mixture – some prepopulated, some manual, some authors edit themselves. In Japan there were existing researcher identifiers, and there was a thesaurus of author names in the Netherlands. In Norway they use human resources data for the same purpose. With more mature systems a national organisation generally has oversight – e.g. in Brazil, Norway, Netherlands. There is integration with research fields and organisations etc. It’s a bit different in the UK. The issue was identified in 2006 as part of a call for proposals for the JISC-funded repositories and preservation programme. Mimas and the British Library proposed a two year project to investigate requirements and build a prototype system. This project, the Names project, can seem dry but actually it’s a complex problem. Everyone has stories of ambiguity.

The initial plan was to use the British Library Zetoc service to create author IDs – journal article information from 1993 – but it’s too vast, too international. And it’s only last names and initials, no institutional affiliation. So we scrapped that. And luckily the JISC Merit project used 2008 Research Assessment Exercise data to pre-populate the Names database. It worked well except for twin brothers with the same initials both writing on palaeontology and often co-authoring papers… in name authority circles we call this the “Siveter problem” (the brothers’ surname). We do have both in the system now.
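The limitation of surname-plus-initials data can be shown with a toy example. The records below are invented for illustration (they are not real Names or Zetoc data, and `names:0001` is a made-up identifier format): two different people share a surname and initials, so grouping by name merges their papers, while grouping by a person identifier keeps them apart.

```python
from collections import defaultdict

# Invented records: two different people with the same surname and initials.
records = [
    {"person_id": "names:0001", "surname": "Example", "forenames": "Alan Brian", "title": "Paper one"},
    {"person_id": "names:0002", "surname": "Example", "forenames": "Adam Bruce", "title": "Paper two"},
    {"person_id": "names:0001", "surname": "Example", "forenames": "Alan Brian", "title": "Paper three"},
]


def name_key(record):
    """Reduce a record to (surname, initials), as in initials-only citation data."""
    initials = "".join(part[0] for part in record["forenames"].split())
    return (record["surname"].lower(), initials.upper())


def group_titles(items, key):
    """Group paper titles under whatever key function is supplied."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item["title"])
    return dict(groups)


by_name = group_titles(records, name_key)                  # one merged "author"
by_id = group_titles(records, lambda r: r["person_id"])    # two distinct people
```

Here `by_name` has a single entry holding all three papers, while `by_id` correctly splits them between the two people.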

Merit data covers around 20% of active UK researchers. And we are working to enhance records and create new ones with information from other sources: working with institutional repositories, British Library data sets (Zetoc), and direct input from researchers. With current EPrints the RDF is easy to grab so we’ve used that with Huddersfield data and it works well. And we have a submission form on the website now so people can submit themselves. Now, an example of why this matters… I read the separatedbyacommonlanguage blog and she was stressing about the fact that her name appears in many forms and the REF process. This is an example of why identifiers matter and why names are not enough. And how strongly people feel about it.

Quality really matters here. Automatic matching can only achieve so much – it’s dependent on the data source. And some people have multiple affiliations. There is no one-size-fits-all solution here. We have colleagues at the British Library who perform manual checks of the results of matching new data sources – this allows for separation/merging of records – and they did similar on ISNI. At the moment people can contribute a record but cannot update it. In the long term we plan to allow people to contribute their own information.

So our ultimate aim is to have a high quality set of unique identifiers for UK researchers and research institutions. Available to other systems – national and international (e.g. Names records exported to ISNI in 2011). Business model wise we have looked at possible additional services – such as disambiguation of existing data sets, identification of external researchers. About a quarter of those we asked would be interested in this possibility and paying for such added value services.

There is an API for the Names data that allows for flexible searching. There is an EPrints plugin – based on the API – which was released last year. It allows repository users to choose from a list of Names identifiers – and to create a Names record if none exists.

So, what’s happening with Names now? We are hopefully funded until the end of 2012. Simeon mentioned the JISC convened Researcher ID group – the final meeting will take place in September. That report went out for consultation in June, and the report of the consultants went to JISC earlier this week. So these final aspects will lead to recommendations. And we have been asked to produce an Options Appraisal Report for a UK national researcher identifier service in December. And we are looking at improving data and adding new records via repositories search.

So Names is kind of a hybrid of library/publisher approaches. Automatic matching/disambiguation; human quality checks; data immediately available for re-use in other systems; and authors can contribute and will be able to edit. When Names was set up ORCID was two years away and ISNI hadn’t started yet. Things are moving fast. The main challenges here are cultural and political rather than technical. National author/researcher ID services can be important parts of research infrastructure. It’s vital to get agreement and co-ordination at national level here.


Q1) I should have asked Simeon this but you may have some appreciation here. How are recently deceased authors being handled? You have data since 1993 – how do you pick up deceased authors?

A1) No, I don’t think that we would go back to check that.

Q1) These people will not be in ID systems but retrospective materials will be in repositories so hard to disambiguate these.

A1) It is important. Colleagues on Archives Hub are interested in disambiguation of long dead people. Right now we are focusing on active researchers.

A2 – Simeon) Just wanted to add that ORCID has a similar approach to deceased authors.

Q2 – Lisa from University of Queensland) We have 1300 authors registered with author id – how do you marry national and ORCID ID?

A2) We can accommodate all relevant identifiers as needed, in theory ORCID ID would be one of these.

Q3) How do you integrate this system with Web of Science and other commercial databases?

A3) We haven’t yet but we can hold other identifiers so could do that in theory but it’s still a prototype system.

Q4) Could you elaborate on national id services vs. global services?

A4) When we looked across the world there was a lot of variation. It would depend on each country’s requirements. I feel a national service can be more responsive to the needs of that community. So in the UK we have the HE statistics agency who want to identify those in universities for instance, and ORCID may not be right for that purpose, say. I think there are various ways we could be more flexible or responsive as a national system vs ORCID with such a range of stakeholders.

Topic: Creating Citable Data Identifiers
Speaker(s): Ryan Scherle, Mark Diggory

First of all thank you for sticking around to hear about identifiers! I’m not sure even I’m that excited about identifiers! So instead let’s talk about what happened to me on Saturday. I was far away… it was 35 degrees hotter… I was at a little house on the beach. The Mimosa House. It’s at 807 South Virginia Dare Trail. Kill Devil Hills, NC USA. 27898. It isn’t a well known town but it was the place where the first Wright brothers flight tests took place at [gives exact geocoordinates]. But I had a problem. There was the transmission [part number] in my van [engine number], and I opened the vent and a deadly spider crawled out [latin name]. I’m fine but it occurred to me that we use some really strange combinations of identifiers. And a lot of these are very unusable for humans – those geocoordinates are not designed for humans to read out loud in a presentation [or livebloggers to grab!].

When you want data used and reused you need to make identifiers human friendly. Repositories use identifiers… EPrints can use a 6 digit number and URL, not too bad. In Fedora there isn’t an imposed scheme. In this one there is a short accession number but it’s not very prominent, you have to dig around a long URL. Not really designed for humans (I’ll confess I helped come up with this one so my bad too). DSpace does impose a structure. It’s fairly short and easy to cite, if you are used to repositories. But if you look at Nature – a source scientists understand – they use DOIs. When scientists see a DOI they know what this is and how to cite this. So why don’t repositories do this?

So I’m now going to get controversial. I am going to suggest some principles for citable identifiers, you won’t all agree!

1) Use DOIs – they are very familiar to scientists and others. Scientists don’t understand handles, PURLs or info URIs. They understand DOIs. And using one adds weight to your citation – it looks important. And loads of services and tools are compatible with DOIs. Currently EPrints and DSpace don’t support them, Fedora only with a lot of work.

2) Keep identifiers simple – complex identifiers are fine for machines but bad for humans. Despite our best intentions humans sometimes need to work with identifiers manually. So keep as short and sweet as possible. So do repositories support that? Yes all three do but you need the right policies set up.

3) Use syntax to illustrate relationships – this is the controversial bit. But hints in identifiers can really help the user. A tiny bit of semantics in an identifier is incredibly useful, e.g. http://dx.doi.org/10.5061/dryad.123ab/3. A few slashes here help humans look at higher level objects. Useful for human hacks and useful for stats. You can aggregate stats for higher level stuff. Could break in the future, probably won’t! Again EPrints and DSpace don’t enable this. Fedora only with work.

4) When “meaning-bearing” content changes, create a versioned identifier – scientists are pretty picky. Some parts of objects have meaning, some don’t. For some objects you might have an Excel file. Scientists want that file to be entirely unchanged, with any change appearing only under a new URL. Scientists want data to be invariant to enable reuse by machines; even a single bit makes a difference. Watch out for implicit abstractions – e.g. thumbnails of different images etc. This kind of process seems intuitive but it kind of flies in the face of the DOI system and conventions. A DOI for an article resolves to a landing page that could change every day and contain any number of items. It could be with a different publisher. What the scientist cares about is the text of the article itself; the webpage is not so much of an issue.

Contrast that with…

5) When “meaningless” content changes, retain the current identifier – descriptive metadata must be editable without creating a new identifier. Humans rarely care about metadata changes, especially for citation purposes. Again repositories don’t handle this stuff so well. EPrints supports flexible versioning/relationships. DSpace has no support. Fedora has implicit versioning of all data and metadata – useful but too granular!
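Principle 3 can be illustrated with a small sketch that uses the slash structure to roll file-level statistics up to the parent object. It assumes, based only on the example DOI in the talk, that a trailing `/<number>` marks a file within a data package. The download counts are invented.

```python
import re
from collections import Counter

# Assumption from the talk's example: ".../3" is file 3 within a package.
FILE_SUFFIX = re.compile(r"/\d+$")


def package_doi(doi: str) -> str:
    """Strip a trailing /<number> component to get the higher-level object."""
    return FILE_SUFFIX.sub("", doi)


def downloads_per_package(downloads_by_doi: dict) -> dict:
    """Aggregate per-file download counts under their parent package DOI."""
    totals = Counter()
    for doi, count in downloads_by_doi.items():
        totals[package_doi(doi)] += count
    return dict(totals)


# Invented counts for illustration.
stats = downloads_per_package({
    "10.5061/dryad.123ab/1": 4,
    "10.5061/dryad.123ab/3": 6,
    "10.5061/dryad.123ab": 2,
})
```

This is exactly the kind of "human hack" that could break if the suffix convention ever changed, which is the trade-off the speaker acknowledges.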

So to build a repository with all of these features we had a lot of work to do. We had previously been using DSpace so we had some work to do here. What we did was add a new DSpace identifier service. It allows us to handle DOIs, and to extend to new identifiers in the future. It gives us granular control of when a new DOI is registered and it lets us send these to citation services as required. So our DSpace identifier system registers with EZID at CDL and then also with DataCite. The DataCite content service lets you look up DOIs, and they are linked data compliant – you can see relationships in the metadata. You can export metadata in various formats for textual or machine processing purposes. And we added some data into our citation information. When you load a page in Dryad there is a clear “here’s how to cite this item” note as we really want people to cite our material.
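Exporting metadata in different formats is typically done through HTTP content negotiation on the DOI itself. The sketch below only builds the request and does not send it. It assumes the `https://doi.org/` resolver and the `application/x-bibtex` media type, which to our knowledge are supported for DataCite DOIs today. The service described in the talk may have used a different host in 2012, and the helper name is our own.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumed resolver; the talk itself uses the older http://dx.doi.org/ form.
RESOLVER = "https://doi.org/"


def metadata_request(doi: str, media_type: str = "application/x-bibtex") -> Request:
    """Build a request asking the resolver for metadata instead of the landing page."""
    return Request(RESOLVER + quote(doi, safe="/"), headers={"Accept": media_type})


# The DOI from the talk's example, used here purely as a placeholder.
request = metadata_request("10.5061/dryad.123ab")
```

Passing the request to `urllib.request.urlopen` would perform the lookup. Swapping the media type (for example to `application/vnd.citationstyles.csl+json`) asks for a different export format for the same identifier.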

In terms of versioning we have put this under the control of the user, and that means that when you push a button a new object is created and goes through all the same creation processes – it is just a copy of the original. So we can also connect back to related files on the service. And we thus have versioning on files. We plan to do more on versioning of the file and to track changes on these. We need to think about tracking information in the background without using new identifiers in the foreground. We are contributing much of this back to DSpace but we want to make sure that the wider DSpace community finds this useful and that it meets their requirements.
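The rule behind principles 4 and 5 (new identifier when the data changes, same identifier when only metadata changes) can be sketched as below. This is not the DSpace implementation described in the talk. The class, the `.2` version suffix and the use of a SHA-256 fingerprint to detect changed bits are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DataObject:
    base_id: str                      # hypothetical identifier, not a real DOI
    content: bytes                    # the "meaning-bearing" part
    metadata: dict = field(default_factory=dict)
    version: int = 1

    @property
    def identifier(self) -> str:
        """Version 1 keeps the bare ID; later versions get a numeric suffix."""
        return self.base_id if self.version == 1 else f"{self.base_id}.{self.version}"

    def update_metadata(self, **changes) -> None:
        """Descriptive metadata is editable without touching the identifier."""
        self.metadata.update(changes)

    def update_content(self, new_content: bytes) -> bool:
        """Bump the version only if the bytes differ, even by a single bit."""
        old = hashlib.sha256(self.content).digest()
        new = hashlib.sha256(new_content).digest()
        if old == new:
            return False
        self.content = new_content
        self.version += 1
        return True
```

For example, correcting a typo in the title leaves `identifier` untouched, while replacing the data file with one changed cell yields a new versioned identifier.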

So, how well has it worked? Well it’s been OK. Lots of community change is needed around citing data identifiers. Last year we looked at 186 articles associated with Dryad deposits – 77% had “good” citations to the data. 2% had “bad” citations to the data. And 21% had no data citations at all. We are working with the community to raise awareness about that last issue. Looking at articles, a lot of people cite data in the text of the article, sometimes in supplementary materials at the end. And a bad citation – they called their identifier an “accession number”.

So, how many of you disagree with me here? [some, not tons of people] Great! come see me at dinner! But no matter whether you agree or not do think about identifiers and humans and how they use them. And finally we are hiring developer and user interface posts at the moment, come talk to me!


Q1 – Rob Sanderson, Los Alamos National Laboratory) I agree with (4) and (5) but DOIs? I disagree! They are familiar but things can change on a DOI, that’s not what you want!

A1) I may have oversimplified. When you resolve a DOI you get to an HTML landing page. There is content – in our case data files. Those data files we guarantee to be static for a given DOI. We do offer an extension to our DOI – you can add /bitstream to get the static bits. But that page does change and restyle from time to time.

Q2 – Robin Rice, Edinburgh University Data Library) We are thinking about whether to switch from handles to DOIs but you can’t have a second DOI for a different location… What do you do if you can’t mint a new DOI for something?

A2) You can promote the existing DOI. I question that you can’t have more than one DOI though, you can have a DOI for each instance for each object.

Q2) Earlier it seemed that the DOI issuing agency wouldn’t allow that.

A2) We haven’t come across that issue yet.

A2 – audience) I think the DOI agency would allow your sort of use.

Posted July 11, 2012 at 2:30 pm in LiveBlog, Updates
