Jul 11 2012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Built to Scale?
Speaker(s): Edwin Shin

I’m going to talk about a project I recently worked on with a really high ratio of reads to writes: 250 million records – the largest Blacklight Solr application. It only took a couple of days to index these with Solr, but for reasonable query performance thresholds things get more complex. The records were staged in a relational database (Postgres), at around 1KB/record (bibliographic journal data). There are some great documented examples out there that helped us. And we had a good environment – 3 servers, each with 12 physical cores and 100GB RAM. We moved all that data from Postgres – 80GB compressed – which took a long time. The rate of ingest of the first 10K records, if constant, suggested that all 250 million could be indexed in under a day, but performance really slowed down after the first 10K.

We assigned 32GB of heap to the JVM – we found RAM has more impact than CPU – and switched to Java 7, adding documents in batches of 1000. We stopped forcing commits and only committed every 1 million documents. In the end we indexed 250 million documents, to the level we wanted, in 2.5 days. We were pretty happy with that.
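The batching-and-deferred-commit pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the batch size, commit interval and the shape of the Solr client (an `add`/`commit` interface, as in e.g. pysolr) are assumptions.

```python
def batches(records, size=1000):
    """Yield successive lists of up to `size` records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def index_all(solr, records, batch_size=1000, commit_every=1_000_000):
    """Add documents in batches, committing only every `commit_every` docs
    instead of forcing a commit per batch."""
    sent = 0
    for batch in batches(records, batch_size):
        solr.add(batch, commit=False)   # no per-batch commit
        sent += len(batch)
        if sent % commit_every == 0:
            solr.commit()
    solr.commit()                       # final commit picks up the remainder
```

The point of the deferred commit is that Solr commits are expensive (new searchers, cache warming), so amortising one commit over a million adds is a large part of the 2.5-day figure.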

Querying – we were working with 5 facets (Format, Journal, Author, Year and Keywords) and 7 queryable fields. The worst case was just over a minute. Too slow. So we optimised querying by running optimize after indexing, and added newSearcher and firstSearcher event handlers. But it was still slow. We started looking at sharding: 12 shards across 2 servers – 3 Tomcat instances per server, each Tomcat with 2 shards. This means splitting the index across machines, and Solr is good at letting you do that and search all the shards at once. Our worst case query dropped from 77 seconds to 8 seconds.

But that’s still too slow. We noticed that the filterCache wasn’t being used much – it needed to be bigger. Each shard had about 3 million unique keyword terms cached and we hadn’t changed the default size of 512, so we bumped it to about 40,000. We also removed facets with a large number of unique terms (e.g. keywords). Worst case queries were now down to less than 2 seconds.
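For context, a filterCache change of this kind lives in solrconfig.xml. The snippet below is illustrative only – the cache class and sizes are assumptions to show the shape of the setting, not the project's actual configuration:

```xml
<!-- solrconfig.xml: enlarge the filterCache well beyond the default.
     Sizes here are illustrative; tune against your own hit-rate stats. -->
<filterCache class="solr.FastLRUCache"
             size="40000"
             initialSize="40000"
             autowarmCount="4096"/>
```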

The general theme is that there was no one big thing we did or could do, it was about looking at the data we were dealing with and making the right measures for our set up.

We recently set up a Hydra installation, again with a huge volume of data to ingest. We needed to set up ingest/update queues with a variable number of “worker” threads. It became clear that Fedora was the bottleneck for ingest. Fedora objects were created programmatically rather than by FOXML documents – making it slower. The latter would have been faster but would have caused problems down the road – less flexibility etc. Solr performed well and wasn’t a bottleneck. But we got errors and data corruption in Fedora when we had 12–15 concurrent worker threads. What was pretty troublesome was that we could only semi-replicate this in staging, couldn’t get a test case, and never got to the bottom of it. So we worked around it… and decided to “shard” a standalone Fedora repository. It’s not natively supported so you have to do it separately. Sharding is handled by ActiveFedora, using a simple hashing algorithm to shard things. We started with just 2 shards, using an algorithm much like the one Fedora uses internally for distributing files. We get, on average, a pretty even distribution across the Fedora repositories. This more or less doubled ingest performance without any negative impact.
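Hash-based shard selection of the kind described can be sketched in a few lines. This is a hypothetical stand-in for what ActiveFedora does, not its actual implementation; the PID format and shard URLs are invented for illustration:

```python
import hashlib

def shard_for(pid, shards):
    """Choose a repository shard for a Fedora PID by hashing it.

    Hashing the identifier (rather than, say, round-robin) means the
    same object always maps to the same shard, with a roughly even
    spread across shards on average.
    """
    digest = hashlib.md5(pid.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]
```

Because the mapping is deterministic, both the ingest workers and later readers can locate an object without any central lookup table, which is what lets two independent Fedora instances behave like one repository.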

We worked on another project at the end of last year, with 20 million digital objects and 10–39 read transactions per second, 24/7. High availability was required: no downtime for reads, and no more than 24 hours of downtime for writes. A very challenging set up.

So, the traditional approach for high uptime is the Fedora Journaling Module, which lets you ingest once to many “follower” installations. Journaling is proven; it’s a simple and straightforward design, and every follower is a fully redundant node. But that’s also a weakness: every follower being a full, redundant node means huge amounts of data and computationally expensive processes happen on EVERY node – expensive in terms of time, storage and traffic. And this approach assumes a Fedora-centric architecture; if you have a complex set up with other components this is more problematic still.

So we modeled the journaling and looked at what else we could do. We set up an ingest that was replicated but then fed out to a Fedora shared file system and into the nodes, without doing FULL journaling.

But what about backups, upgrades and disaster recovery with 20 million digital objects? The classic argument for Fedora is that you can always rebuild – in a disaster that could take months here, though. But we found that most users used new materials – items from the last year – so we did some work to make that disaster recovery process faster.

Overall, the general moral of the story is that you can only make these types of improvements if you really know the data and system you are dealing with.

Q1) What was the garbage collector you mentioned?

A1) The G1 garbage collector that comes with Java 7.

Q2) Have you played with the chaos monkey idea? Netflix copies to all its servers and it randomly stops machines to train the programming team to deal with that issue. It’s a neat idea.

A2) I haven’t played with it yet, I’ve yet to meet a client who would let me play with that but it is a neat idea.

Topic: Inter-repository Linking of Research Objects with Webtracks
Speaker(s): Shirley Ying Crompton, Brian Matthews, Cameron Neylon, Simon Coles

Shirley is from STFC – the Science and Technology Facilities Council. We run large facilities for researchers, managing a huge amount of data every year, and my group runs the e-Infrastructure for these facilities – including the ICAT Data Catalogues, the E-publications archive and the Petabyte Data Store. We also contribute to data management, data preservation etc.

Webtracks is a joint programme between STFC and the University of Southampton: Web-scale link TRACKing for research data and publications. Science on the web increasingly involves the use of diverse data sources, services and objects – ranging from raw data from experiments through to contextual information, lab books, derived data, and research outputs such as publications and protein models. When data moves from research facility to home institution to web materials areas, we lose the whole picture of the research process.

Linked data allows us to connect up all of these diverse areas. If we allow repositories to communicate then we can capture the relationships between research resources in context. It will allow different types of resources to be linked within a discipline – linking a formal publication to on-line blog posts and commentary. Annotations can be added to facilitate intelligent linking, and researchers can annotate their own work with materials outside their own institution.

Related protocols here: Trackback (tracking distributed blog conversations, with fixed semantics) and Semantic Pingback (an RPC-based protocol, peer to peer).

In Webtracks we took a two-pronged approach: an inter-repository communication protocol and a Restlet-based framework. The InteRCom protocol allows repositories to connect and describe their relationship (e.g. cito:isCitedBy). InteRCom is a two-stage protocol like Trackback: first harvesting of resources and metadata, then a pinging process to post the link request. The architecture is based on the Restlet Framework, with a data access layer, app-specific configuration (security, encoding, tunneling) and a resource wrapper. This has to accommodate many different institutional policies – whitelisting, pingback (and checking whether a request is genuine), etc. Lastly you have to implement the resource cloud to expose the appropriate links.
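The shape of a link-request "ping" in this two-stage exchange might look something like the sketch below. The field names and payload structure are assumptions for illustration – the real wire format is defined by the Webtracks project – but the essential content is the typed link itself: subject, predicate, object.

```python
def build_link_ping(source_uri, target_uri, predicate="cito:isCitedBy"):
    """Build the body of a hypothetical link-request ping asserting a
    typed link between two repository resources.

    In the two-stage flow, a repository would first harvest the target's
    metadata to discover its pingback endpoint, then POST something like
    this to it.
    """
    return {
        "subject": source_uri,      # the resource making the assertion
        "predicate": predicate,     # link semantics, e.g. CiTO terms
        "object": target_uri,       # the resource being linked to
    }
```

Keeping the predicate open (rather than hard-coding citation) is what gives the "no constraints on link semantics" property mentioned in the summary.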

Webtracks uses a Resource Info Model: a repository is connected to a resource, which is connected to a link, and each link has a subject, predicate and object. Links can be updated and tracked automatically using HTTP. We have two exemplars being used with Webtracks. One is the ICAT investigation resource – a DOI landing page plus an HTML representation with RDFa, so both machine and human readable versions. The other is EPubs, set up much like ICAT.

InteRCom citation linking – we can see the ICAT DOI landing page linking to the EPubs expression links page. That ICAT DOI also links to the ICAT investigation links page, which in turn links to the EPubs expression page, and that expression page feeds back into the EPubs expression links page.

Using the Smart Research Framework we have integrated services to automate a prescriptive research workflow – one that attempts to preemptively capture all of the elements that make up the research project, including policy information, to allow the researcher to concentrate on their core work. That process will be triggered at STFC and will capture citation links along the way.

To summarise, Webtracks provides a simple but effective mechanism to facilitate the propagation of citation links, providing a linked web of data. It links diverse types of digital research objects, restoring context to dispersed digital research outputs. There are no constraints on link semantics and metadata. It’s peer to peer, does not rely on a centralised service, and is a highly flexible approach.

Topic: ResourceSync: Web-based Resource Synchronization
Speaker(s): Simeon Warner, Todd Carpenter, Bernhard Haslhofer, Martin Klein, Nettie Legace, Carl Lagoze, Peter Murray, Michael L. Nelson, Robert Sanderson, Herbert Van de Sompel

Simeon is going to talk about resource synchronization. We are a big team and have funding from the Sloan Foundation and from JISC. I’m going to talk about the discussions we’ve been having. We have been working on the ResourceSync project, looking at replication of web material… it sounds simple but…

So… synchronization of what? Well, web resources – things with a URI that can be dereferenced and are cache-able. Hidden in that is something about support for different representations and content negotiation. No dependency on underlying OS, technologies etc. From small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources) – we want this to be properly scalable to large resources or large collections of resources. And then there is the factor of change: is it slow change (weeks/months), as for an institutional repository maybe, or very quick (seconds), like a set of linked data URIs – and what latency is acceptable there? And we want this to work on/via/native to the web.

Why do this? Well, because lots of projects are doing synchronization, but case by case – the project team members have been involved in many of them. Lots of us have experience with OAI-PMH; it’s widely used in repositories, but it is XML metadata only and web technologies have moved on hugely since 1999. There are loads of use cases here with very different needs. We had lots of discussion and decided that some use cases were in scope and some were not. The out-of-scope-for-now list: bidirectional synchronization; destination-defined selective synchronization (query); special understanding of complex objects; bulk URI migration; diffs (hooks?) – we understand this will be important for large objects but there is no way to do it without knowing media types; intra-operation event tracking; content tracking.

So, a use case: DBpedia Live duplication. 20 million entries, updated once per second. We need push technology – we can’t be polling this all the time.

Another use case: arXiv mirroring. 1 million article versions, about 800 created each day, updated at 8pm US Eastern time; metadata and full text for each article. Accuracy is very important, and we want a low barrier for others to use it. It works, but currently uses rsync, and that’s specific to one authentication regime.

Terminology here:

  • Resource – object to be synchronized; a web resource
  • Source – system with the original or master resources
  • Destination – system the resources are synchronized to
  • Pull
  • Push
  • Metadata – information about resources such as URI, modification time, checksum etc. Not to be confused with metadata that ARE resources.

We believe there are 3 basic needs to meet for synchronization. (1) Baseline synchronization – a destination must be able to perform an initial load or catch-up with a source (avoiding out-of-band setup; providing discovery). (2) Incremental synchronization – a destination must have some way to keep up to date with changes at a source (subject to some latency; minimal create/update/delete). (3) Audit – it should be possible to determine whether a destination is synchronized with a source (subject to some latency; we want efficiency –> HTTP HEAD).

So, two approaches here. We can get an inventory of resources then copy them one by one via HTTP GET, or we can get a dump of the data and extract the metadata. For auditing we could do a new baseline synchronization and compare, but that is likely to be very inefficient. We can optimize by getting an inventory and comparing the copy with the destination – using timestamps, digests etc. smartly; there is a latency issue to consider here again. And then we can think about incremental synchronization. The simplest method would be: audit, then copy all new/updated resources, plus remove deleted ones. We can optimize this by changing the communication – exchanging a ChangeSet listing only updates; resource transfer – exchanging dumps for ChangeSets or even diffs; and a change memory.

We decided, for simplicity, to use pull, but some applications may need push. And we wanted to start with the simplest idea of all: Sitemaps as an inventory. So we have a framework based on Sitemaps. Level 0 is the base level: publish a sitemap and someone can grab all of your resources. A simple feed of URLs and last modification dates lets us track changes. The Sitemap format was designed to allow extension; it’s deliberately simple and extensible. There is an issue about size: the structure handles up to 2.5 billion resources before further extension is required. Should we try to make this look like the RDF we expect? We think no, but we can map the Sitemap format to RDF.
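The Level-0 idea – a plain sitemap of URLs and lastmod dates as an inventory – can be sketched as below. This is an illustrative reading of the approach, not ResourceSync code; the element names follow the standard sitemaps.org schema, and comparing lastmod strings is a simplification (ISO dates compare correctly as strings).

```python
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Return {loc: lastmod} for every <url> entry in a sitemap."""
    root = ET.fromstring(xml_text)
    inventory = {}
    for url in root.iter(SM + "url"):
        loc = url.findtext(SM + "loc")
        lastmod = url.findtext(SM + "lastmod")
        inventory[loc] = lastmod
    return inventory

def needs_update(source, destination):
    """Audit step: URIs the destination must (re)fetch because they are
    new at the source or have a newer modification date there."""
    return [loc for loc, mod in source.items()
            if loc not in destination
            or (mod or "") > (destination.get(loc) or "")]
```

The same comparison serves both audit (is anything out of date?) and baseline/incremental sync (fetch exactly what `needs_update` returns via HTTP GET).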

At the next level we look at a ChangeSet. This time we reuse the Sitemap format but include information only for change events over a certain period. To get a sense of how this looks we tried it with arXiv. Baseline synchronization and audit: 2.3 million resources (300GB); 46 sitemaps and 1 sitemap index (50k resources/sitemap).

But what if I want a push application that will be quicker? We are trying out XMPP (as used by Twitter etc.), a standard with lots of experience and libraries behind it. This model is about rapid notification of change events via XMPP push. This was trialled with DBpedia Live: the LANL Research Library ran a significant-scale experiment replicating the DBpedia Live database from Los Alamos to two remote sites, using XMPP to push notifications.

One thing we haven’t got to is dumps. Two thoughts so far: a zip file with a Sitemap – simple and a widely used format, but a custom solution – or WARC, the Web ARChive format, designed for just this purpose but not widely used. We may end up doing both.

Real soon now, a rather extended and concrete version of what I’ve said will be made available. The first draft of the sitemap-based spec is coming in July 2012. We will then publicize it and want your feedback, revisions, experiments etc. in September 2012. And hopefully we will have a final specification in August.


Q1) Wouldn’t you need to make a huge index file for a site like ArXiv?

A1) It depends on what you do. I have a program to index arXiv on my own machine and it takes an hour, but it’s a simplified process – I tested the “dumb” way. I’d do it differently on the server. But arXiv is in a Fedora repository, so you already have that list of metadata to find changes.

Q2) I was wondering as you were going over the SiteMap XML… have you considered what to do for multiple representations of the same thing?

A2) It gets really complex. We concluded that multiple representations with the same URI are out of scope really.

Q3) Can I make a comment – we will soon be publishing use cases, probably on a wiki, probably on GitHub, and I would ask people to look at that and give us feedback.

 July 11, 2012, 8:26 am – LiveBlog, Updates – P2A: Repository Services LiveBlog
Jul 11 2012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Augmenting open repositories with a social functions ontology
Speaker(s): Jakub Jurkiewicz, Wojtek Sylwestrzak

The project began with ontologies, motivated by the SYNAT project, which requires specific social functions; particular ontologies are used to analyze those functions as completely as possible.

This particular project started in 2001, aiming to create a platform for integrated digital libraries from a custom platform hosting the Virtual Library of Science. BWMeta was used so that different versions of the metadata schema could be put into place.

The Virtual Library has 9.2 million articles, mostly full text from journals, but also traditional library content. That traditional content creates problems with search because it is not all digitized.

SYNAT brings together 16 leading Polish research institutions, and the platform aims to manage all of this data in a way that users can interact with well – all ultimately using BWMeta 2.

In Poland an open mandate initiative requires the project to have the capacity to host open licensed data, and allow authors to publish their works (papers, data, etc). Support for restricted access content is also included, with a ‘moving wall’ for embargoed works – content is stored in the repository, and it will switch from closed to open access after a pre-decided time.

Social functions of SYNAT…

Users can discuss the resources, organize amongst themselves into smaller groups, share reading lists, follow the activities of other users or organizations (published content, comments, conferences, etc). This is all part of the original project aims and goals.

Analysis of the social functions was based upon some prior work, for efficiency. Bibliographic elements use Dublin Core and BIBO, with FOAF (Friend of a Friend) as well. All of the different objects (users, metadata fields, and so on) have been mapped to particular ontologies.

People on the platform can be users or authors. The analysis makes particular note of the fact that persons and users are different – there are likely more users than people involved with the platform. Each can be connected to specific works and people, or not, depending on user preference. People will have published objects as well as unofficial posts (forum threads, comments). Published objects can be related to events based on whether they were published in or because of said event.

So, objects include user profiles, published objects, forum activity and posts, groups, events. These are all related to one another using predicates (of, by, with). This model then satisfies the requirements of the project aims and goals.
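As a rough illustration of this mapping, the sketch below expresses a user's relation to a published object as RDF-style triples. The namespaces (Dublin Core terms, FOAF) are the ones named above, but the exact predicates chosen here are hypothetical – SYNAT's real vocabulary may differ:

```python
# Well-known namespace prefixes (real URIs; predicate choices are illustrative).
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

def describe_user_post(user_uri, post_uri, title):
    """Return (subject, predicate, object) triples linking a user to a
    published object, reusing existing ontologies rather than inventing
    new terms."""
    return [
        (post_uri, DCTERMS + "title", title),
        (post_uri, DCTERMS + "creator", user_uri),
        (user_uri, RDF + "type", FOAF + "Person"),
    ]
```

Modelling everything as triples like this is also what makes the "easy RDF export" mentioned below straightforward, even though the platform does not store RDF internally.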

It is important to point out how previous work from existing ontologies was reused: it simplifies the analysis process and makes it more precise through reiteration. There is also easy RDF export from the final system for interoperability, though RDF is not used for storage in the database.

In the future, these social analysis functions will be implemented.

Q: Has ontology work been shared with the content suppliers for whatever purpose? Do they think it will add value?

A: They aren’t disinterested, but it isn’t something they are interested in for themselves. They are glad it is offered as part of the service.


Topic: Microblogging Macrochallenges for Repositories
Speaker(s): Leslie Carr, Adam Field

This all came about from sociological work on the London riots. Analysis had been done with interviews and, most importantly, videos posted to YouTube and shared on Twitter by passersby. People took these videos offline quite quickly out of fear of retribution – which meant going back to gather data was difficult.

Les is running a web science and social media course now, with a lot of emphasis on Twitter. It provides a good understanding of group feeling, given the constraints of Twitter. Why not extend repositories to make Twitter useful in that area?

The team built a harvester, which connects to the Twitter search API for now. No authentication needed, no “real” API per se, but it works alright. You can only go back 1500 tweets per search, but that has been enough. The search API is hit every 20 minutes. This was originally to preserve tweets for the system itself, but other people came forward to share their own harvested tweets. There are coding benefits and persistent resource benefits.
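The poll-and-merge step of such a harvester can be sketched as below. This is a hedged illustration: endpoint details are omitted, and `search` stands in for any search client returning tweet dicts with an `id` field – the real harvester's interface will differ.

```python
def harvest_once(search, query, store, since_id=0):
    """One polling pass: fetch tweets newer than `since_id` for `query`
    and merge them into `store` (a dict keyed by tweet id, so repeated
    polls never duplicate a tweet). Returns the new high-water mark to
    pass as `since_id` next time."""
    for tweet in search(query, since_id=since_id):
        store.setdefault(tweet["id"], tweet)
    return max(store, default=since_id)
```

Running this every 20 minutes with the returned high-water mark keeps each poll small; the 1500-results-per-search cap is exactly why a spike bigger than 1500 tweets between polls (the "Twilight problem" below) loses data.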

Tweets were initially documents living under an EPrint, those documents being XML files rendered into HTML. Unfortunately, this did not scale well: thousands of XML and HTML files under one EPrint. When the system checked file sizes, it would take 30 minutes to render the information – which might as well be broken.

Tweets are quite structured data beyond just the text inside. Stored separately, the other fields make a very rich database. Treating tweets as first class objects in relation to a TweetStream makes them even more valuable.

Live demo on screen of EPrints analysing OR2012 tweets: who has been talking and how much have they said? Which hashtags and which people are discussed? What links are shared? What is the frequency of tweets per day? All exportable as JSON, CSV and HTML.

There are limitations of the repository – EPrints is designed for publications, not millions of little objects.

Problems with harvesting: URLs are shortened with wrappers, now t.co. The system has trouble resolving all of these redirects, but where a link ends up is enormously important. Following a link takes, on average, 1 second – a huge cost with so much content. MySQL processing has also created some limitations, but those have been largely worked around; this took a great deal of optimization and a complex understanding of the backend. A third problem was the “Twilight problem”: popular topics will spike to over 1500 tweets per harvest, and so a lot is missed. This could be overcome with the streaming API, but there are real-time issues with using that. Quite complex.
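The redirect-chasing that makes shortened URLs expensive boils down to following Location headers hop by hop. The sketch below keeps the hop logic separate from the network call (which is injected), purely for illustration – the real harvester's resolver is not shown in the talk:

```python
def resolve_redirects(url, fetch_location, max_hops=10):
    """Follow shortener redirects until a final URL is reached.

    `fetch_location(url)` should return the Location header of a
    redirect response, or None when the URL no longer redirects.
    Bounded hops and loop detection guard against broken chains.
    """
    seen = set()
    for _ in range(max_hops):
        if url in seen:          # redirect loop
            break
        seen.add(url)
        next_url = fetch_location(url)
        if next_url is None:     # final destination reached
            break
        url = next_url
    return url
```

Each hop is a network round trip, which is where the ~1 second per link comes from; caching resolved URLs is the obvious mitigation.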

The future: dealing with URL wrappers; dealing with the unboundedness of the data – there is so much that optimizations will not be able to keep up, so a new strategy for the magnitude problem has to be puzzled out; and potential archival of other content – YouTube videos and comments, Google results over time.

This Twitter harvester is available online for EPrints – lightweight harvesting for nontechnical people.

There are large-scale programs for this already, but some people need smaller and more accessible tools (masters and doctoral students).

Q: Why are you still using EPrints? It seems like there are a lot of hacks, and you would have been better off using a small application directly over MySQL.

A: EPrints is a mature preservation platform. Easy processing now is not the best thing for the long term; repositories are supposed to handle that, so these challenges should be met rather than avoided.


Topic: Beyond Bibliographic Metadata: Augmenting the HKU IR
Speaker(s): David Palmer

At universities in Hong Kong, more knowledge exchange was desired to enable more discovery; then, theoretically, innovation and education indicators would improve. The knowledge exchange office chose the institutional repository, built on DSpace and developed with CILEA. The common comment after this work was first implemented was that part of the picture was missing.

Getting past thin and dirty metadata was a goal, along with augmenting metadata in general: profiles, patents, projects.

Publication data is pushed from HKU’s central research database, the Research Output System, filled in by authors or their assistants. It needs much better metadata, so they are now trying to get users to work with EndNote, DOIs and ISBNs so that cleaner metadata comes in.

Bibliographic rectification happens via merges or splits of author profiles, with a user API and robots. This has worked quite well.

Search and scrape of the database starts with numbers (DOI, ISBN, etc.), then searches for strings (title, author). Each entry pulls citations and impact factors if available.

Lots of people involved in making this all work, and work well.

Author profiles include achievements, grants, “cited as” (for alternate names) and external metrics via web scrape. They also include prizes (making architects happy) and supervision titles with links (making educators happy).

Previously, theses and dissertations (which are very popular) were stored in three separate silos. Now they all integrate with this system for better interactivity – tracking content and jumping between items.

Grants and projects are tracked and displayed too: what is available to be applied for, what has been done already, and the publications resulting from it. Patent records are included, with histories and appropriate sharing of information based on application status, publication and granting, plus links to published and granted patents and abstracts in whichever countries they exist.

With all of this data, other things can be shown: articles with the fastest rate of receiving citations, the most cited, who publishes the most. Internal metrics show locations of site views, views over time, and more. Visualizations are improving, so users can see charts (webs of co-authors for an author, for example) and graphs. The data is all in one place, pulled from the other silos, which is great because on-the-fly charts would otherwise be impossible.

Has all of this increased visibility? Anecdotally, absolutely. People’s reputations are improving. Metrics show improvement as well. The hub is stickier – more pages per visit, more time per page, because everything is hyperlinked.

This work, done with CILEA, is going to be given back to the community as DSpace CRIS modules – everything, for the sake of knowledge exchange. Mutual benefits will result, in terms of interoperability and co-development.

Q: Is there an API to access data your system has ground out of other sources?

A: A web service is in the works. The office of dentistry is scraping this data manually already, so it’s doable.

 July 11, 2012, 7:55 am – LiveBlog, Updates – P2B: Augmented Content LiveBlog
Jul 10 2012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.


Topic: The Development of a Socio-technical Infrastructure to Support Open Access Publishing through Institutional Repositories
Speaker(s): Andrew David Dorward, Peter Burnhill, Terry Sloan


The aim is to create an infrastructure before the open access revolution happens – sooner rather than later, it seems – so the team is building a template for the UK and Europe: RepNet.

RepNet aims to manage the human interaction that helps make good data happen. This is an attempt to justify the investment that JISC has made into open access and dissemination.

RepNet will use a suite of services that enable cost effective repositories to share what they have.

First, they mapped the funders, researchers, publishers, institutions to see where publications are made.

RepNet hopes to sit between open access and research information management, differentiating between various types of open access and between standards.

Through conversations with all the stakeholders, they’ve put together a catalog of every service and component that would go into a suite for running such a repository.

Funders’, subject, and institutional repositories will all sit upon the RepNet infrastructure. This will offer service support, helpdesk and technical support, and a service directory catalogue for anyone hoping to switch to open access. All of this will then utilize various innovations, hosting, services to get to users.

RepNet also has a testing infrastructure.

RepNet is past the preparation stages now, and moving into implementation of a wave one offering that integrates everything. The next iteration will take what wave one teaches the team and improve the offering further.

Deposit tools, benchmarking, aggregation and registry are already available, and wave two will bring together more and bigger services to do these things with repositories.

The component catalogue is getting quite comprehensive, with JISC helping to bring in and assess new ideas all the time.

RepNet is being based on the information context of today – policy and mandates, plus the strong desire for open access.

The UK is a great country to be in for Open Access, there’s quite a bit of political support in favor of moving in this direction.

If the market is to be truly transparent, gold open access payment mechanisms will have to be handled. This is something new that RepNet is working on figuring out.

The focus now is on optimizing wave one components – a very comprehensive set of tools, with funder-publisher policies working alongside deposit and analysis tools to make everything easily accessible. RepUK, CORE, IRS, OpenDOAR, ROAR and Names2 are all components being looked at, which wave two will build on.

ITIL is being used as the language for turning strategies and ideas into projects.

There is also a sustainability plan, submitted by SIPG members: subscriptions, contributions, payment for commercial use. Further JISC underpinning is being considered as well.

Part of RepNet will be a constant assessment of services: when one needs to be retired, it will move back into the innovation zone and included again when there’s a demand.

RepNet provides an excellent service to support green and to further investigate gold open access. It will give us a great way of assessing repositories, better integration, and less human-intensive management of repositories.

The aim now is to move to data-driven infrastructure, letting different projects speak to each other through reporting mechanisms. This will make it more integrated and, ultimately, more useful.

Wave two will focus on micro services.

The sustainability plan will hopefully be put in place before 2013.

Q: Academics can see how all this works. Are there plans for making these sorts of information and services available to the public?

A: It’s all about integration with common search tools. There’s a vast gap between what has surfaced because of professional search tools and what something like Google finds via its own crawler. It’s also important to make deposit accessible to everyone else, or at least to think beyond the academic lockdown instead of just focusing on the expert community.


Topic: Repository communities in OpenAIRE: Experiences in building up an Open Access Infrastructure for European research
Speaker(s): Najla Rettberg, Birgit Schmidt

OpenAIRE is rooted in the interests of the European Commission to make an impact on the community. Knowledge is the currency of the research community, and open access needs to be its infrastructure.

The hope is that something stronger than the green mandate comes about as the European Commission talks more about this.

OpenAIRE infrastructure aims to use publication and open data infrastructure to release research data. It involves 27 EU countries and 40 partners to pilot open access and measure impacts.

OpenDOAR, usage data, and EC funding have all fed the growth of the project. The result is a way to ingest publications, to search and browse them, to see statistics linked to the content and assess impact metrics.

Three parts: technical, networking, service. Networking brings together all the partners, stakeholders and open access champions. They run a helpdesk, and build the network by finding new users, researchers, publications. This community of practice is very diverse. With everyone together there is an opportunity to link activities and find ways to improve the case for OA.

OpenAIRE provides access to national OA experts, researchers and project coordinators, managers.

Research administrators can consider a few things in their workflows to follow the open mandate and OpenAIRE shows them statistics about their open access data as they do so.

Everyone is invited to participate by registering with OpenDOAR and following OpenAIRE guidelines. OpenAIRE offers a toolkit to project officers to get going. As of now there are about 10,000 open access publications in the repository from 5000.

Part 2: OpenAIRE Phase 2

OpenAIRE phase 2 will link to other publications and funding outside of FP7, shifting from pilot to service for users.

300 OA publication repositories are being added, along with new data repositories and an orphan data repository. Don’t forget CRIS, OpenDOAR, ResearcherID. All of this will go into the information space, using text mining to clean things up and make it all searchable.

Now there are OpenAIRE guidelines being built for data providers. These look at how to connect metadata to research data, and how to export it for use externally. It isn’t so much prescriptive as exploratory. With these in hand, other countries and organizations with less developed OA might be able to improve their own data offerings.

The scope of OpenAIRE’s work is wide. Most fields and types of data are welcome, so keep in touch.

OpenAIRE is building a prototype for ‘enhanced’ publications, letting users play with the data within. This will be cross-discipline, and can be exported to other data infrastructures. Also working on ways to represent enhanced publication more visually and accessibly.

What connects data to the publication? OpenAIRE is on the boundary exploring that question.

The repository landscape is very diverse, but so are the tools for bringing data and repositories together. OpenAIRE is aware of data initiatives, stakeholder AND researcher interests. OpenAIRE is running some workshops and will be at the poster sessions throughout the conference.

Q: CORE has done a lot of text mining work already. Have you spoken to them?

A: There has been discussion with CORE about repositories, but not text mining. OpenAIRE is working with several groups.

Q: You want to develop text mining services. In this area, with linking repos and content, CORE offers an API for finding those links and for reclassification. You aim to develop these services by 2013, are you aiming to use other tools to do this, and are you happy to do so?

A: The technical folks are here for that very purpose, so keep an eye out for the OpenAIRE engineers.



Topic: Enhancing repositories and their value: RCAAP repository services
Speaker(s): Clara Parente Boavida, Eloy Rodrigues, José Carvalho

RCAAP is a Portuguese national initiative to promote open access for the sake of visibility, accessibility and dissemination of Portuguese research.

The project started in 2008 as a repository hosting service, moving forward to validation and Brazilian cooperation, then a statistics tool.

Overseen by the FCCN.

Learn more at projecto.rcaap.pt if you are a repo manager or journal publisher.

The strategy for SARI, a part of RCAAP, is creating a custom DSpace (“DSpace++”) for Portuguese academic users. It offers hosting, support, design services and autonomous administration, all for free. 26 repositories use this service, and because of the level of customisability offered, SARI repositories all have their own unique look and feel.

Another service in RCAAP is Common Repository, for people who do not produce a lot of content but want it to be openly accessible in a shared area. 13 institutions are using this slimmed down repository tool.

RCAAP search portal enables users to search all open repositories in Portugal, and participating organizations in Brazil. 447934 documents from 42 resources, updated daily. Users can search by source, category, keywords.

Aggregated information is all OAI-PMH compliant. Further, it is an SRU provider. PDF and DOC files are full text searchable. Integration via various other tools.
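The OAI-PMH responses exposed by an aggregator like this are plain XML and easy to consume. A minimal sketch of parsing a `ListRecords` response – the identifier and record content below are invented, and a canned response stands in for a live harvest:

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# A tiny, hand-made ListRecords response standing in for a real
# harvest from an aggregator such as the RCAAP portal.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.pt:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Open Access in Portugal</dc:title>
          <dc:creator>Silva, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    out = []
    for rec in root.iter(OAI + "record"):
        ident = rec.find(OAI + "header/" + OAI + "identifier").text
        title = rec.find(".//" + DC + "title").text
        out.append((ident, title))
    return out

print(parse_records(SAMPLE))  # [('oai:example.pt:1', 'Open Access in Portugal')]
```

A real harvester would page through `resumptionToken`s, but the per-record parsing looks the same.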

A given entry can be shared, imported to reference managers. Author CVs and all related metadata are connected.

The RCAAP Validator uses the DRIVER Guidelines to assess a URL based on validation options. A report will then be emailed, including statistics for openness and errors when checked against the DRIVER Guidelines. Errors are described. Assessment checks aggregation rules, queue validation, XML parsing and definitions, and also confirms that files in the repository all actually exist. Three types of metadata validation are done: existence of each element (title, author, date, etc.), taxonomies (ISO and DRIVER types), and structure of metadata content (DRIVER prefixes, proper date formatting). Checking all of this ensures a good search experience.
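The element and date checks described can be sketched in a few lines. This is not RCAAP's actual validator; the required-element list and the date rule below are simplified assumptions based on the description above:

```python
import re

# Assumed required Dublin Core elements (a simplification).
REQUIRED = ("title", "creator", "date", "type")

# DRIVER-style date check: YYYY, YYYY-MM or YYYY-MM-DD.
DATE_RE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")

def validate(record):
    """Return a list of human-readable errors for one metadata record."""
    errors = []
    for field in REQUIRED:
        if not record.get(field):
            errors.append("missing element: %s" % field)
    date = record.get("date")
    if date and not DATE_RE.match(date):
        errors.append("badly formatted date: %r" % date)
    return errors

good = {"title": "A Thesis", "creator": "Costa, M.",
        "date": "2012-07-10", "type": "doctoralThesis"}
bad = {"title": "A Thesis", "date": "10/07/2012"}
print(validate(good))  # []
print(validate(bad))
```

A full validator would also check the controlled vocabularies and file availability mentioned above; the shape of the report stays the same.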

Dspace add-ons are enabling included repositories and infrastructure to ensure openness and compliance with standards.

Add-ons include OAIextended, Minho Stats, Request Copy (for restricted access content), sharing bar, Degois (for Portuguese researchers), Usage Statistics, OpenAIRE projects Authority Control (auto-documentation of workflow activities), Document Type, and Portuguese Help.

Another service is SCEUR-IR. It aggregates usage statistics and allows the creation and subscription of graphic information.

The Journal Hosting Service uses the same strategy as the repository hosting service mentioned earlier, and allows total autonomy to publishers.

In Portugal and in Brazil, open access is very successful. The factors for that success are interoperability guidelines, community help and advocacy, and integration with research systems.

Q: What sort of numbers does a medium sized university in Portugal see for downloads, use?

A: Thousands of downloads/hits per day. Bigger universities will see 5000 or more.

Q: JISC is mandating metadata profiles and analyzing with Driver guidelines, and is looking to make a validator tool. Is the validator tool open source?

A: No, but information exchange is always welcome.

Q: Further on the validator, you are actually checking every record for every file – whether accessible or missing?

A: That is an option. When checking, the user would set which repository format they are using, and then run a check against that profile. Once an attempt to download has been made, after 2sec of attempted download, the entry will be marked available. Failure to start the download would be marked as an error.


Topic: Shared and not shared: Providing repository services on a national level
Speaker(s): Jyrki Ilva

The national library of Finland, an independent institute within the University of Helsinki, provides many services to the library network in Finland.

While there are 48 organizations with institutional repositories, only 10 public instances exist. Most of these repositories use the national library service.

The National Library provides its services to about 75% of Finnish repositories. Centralized service providers like this are still in a minority; the ‘do it yourself’ mentality has taken root. But with hosted instances anyone can have a repository, and the same work does not need to be done over and over again in every organization.

‘Do it yourself’ does not always make sense. It is more expensive and often not as well executed as using other services. In Finland, there is much more sharing going on than in other countries.

Many countries started OA repositories in the 90s and 00s. The National Library started the idea of the digital object management system in 2003 – the first attempt at a proprietary software platform, which did not work as planned. The National Library chose DSpace instead, starting in 2006.

One of the challenges was trying to make one giant DSpace instance for all organizations. A single instance was not flexible enough, and so the idea was not sustainable. Local managers were concerned with the threat of requirements and demands in the national repository system.

Fortunately, the National Library was chosen to be the service provider at the Rectors’ Conference for Finnish Universities of Applied Sciences in 2007.

Work is divided between customer organizations and the National Library. Curation and publication done locally. National Library develops and maintains the technical system.

Theseus, a multi-institutional repository, has seen much success. 25 universities, tens of thousands of entries.

Doria, another multi-institute platform, is technology neutral and allows more autonomy of its participant communities. This freedom allows for customizability, but less quality metadata and increased confusion amongst users.

Separate repository instances are also provided at extra cost to customers. Some organizations just prefer their own instance. TamPub and Julkari are two examples.

Selling repository services comes down to defining strong practical needs amongst customer organizations. Little marketing has been done – customers have a demand and they find a supply. Long-term access, persistent addresses for content have been selling points. While not trying to make a profit, covering costs with a coherent pricing scheme is necessary. That said, many customers have relatively small needs, and so services must be kept affordable when necessary. National Library is also considering consultation as a service.

Negotiating user contracts is time consuming, though, and balancing customer projects requires a constant assessment of development in infrastructure in general.

Some Finnish universities will continue to host their own repositories, but cooperation benefits everything: technical and policy development can be improved.

Measuring success can be done on various levels. Existence of the repository is a start. Download metrics are another option. Impact assessment should be done. Looking at measures for success, National Library is still struggling with research data and self-archived publications. They’ve had success with dissertations, heritage materials, journals, but there is always room to improve.

Q: What is the relationship between the national library and non-associated repositories? Is meta-data being recorded? Metrics?

A: In most cases, organizations are reporting to an umbrella body, which keeps track of everything.

Q: Any details on the coherent pricing system?

A: Pricing has so far been based on the size of a given organization, how much data and how many hits they will have.

Q: Are you very service oriented, and is this something that the National Library does in general, or are repositories a special case?

A: Partly a special case, because funding is not guaranteed. It wasn’t so much intended as demanded for the sake of sustainability.


Topic: The World Bank Open Knowledge Repository: Open Access with global development impact
Speaker(s): Lieven Droogmans, Tom Breineder, Matthew Howells, Carlos Rossel

Publishers are not usually the pushers of open access, but the World Bank isn’t like many other organizations.

Why do this? The World Bank is trying to reach out and inform people of what the organization actually does, what its mission is: relieving poverty.

World Bank funds projects in the developing world, whether pragmatic solutions or research for the sake of outcomes.

When the World Bank changes direction, it goes slow but it really changes. The Access to Information Policy is wide and distinct within the organization. In particular, there is a focus on open data, research, and knowledge. This launched in 2010 with the objective of ensuring as much Bank data was as accessible as possible, for the sake of transparency and further reach.

The Open Access Policy has been adopted, as of July 1st. It is an internal mandate for staff to deposit all research into the repository. This also applies to external research funded by the Bank. This data goes into the OKR under a Creative Commons Attribution licence (CC BY). Externally published documents are made available as soon as possible, under a more restrictive Creative Commons licence.

World Bank wants to join the Open Access community and lead other IGOs to do the same.

There are benefits externally and internally. Externally, policymakers and researchers gain data. Internally, authors and inter-department staff can access information that they did not easily have before.

World Bank was, until a few years ago, a very traditional publisher, but the desire that the Bank has to free its information for reuse has caused a shift.

That’s the why, and here’s the how.

Content included? Books, working papers, journal articles internally and externally. In the future, the Bank aims to include author profiles with incentivized submission, and the recovery of lost or orphaned or unshared legacy materials. Making the submission process easy and showing usage stats are also forthcoming. The latter will make entries more useful and visually appealing. Finally, there will be integration with Bank systems for interoperability of data.

This is all being deployed on IBM Websphere, integrated with Documentum, Siteminder, and the Bank’s website.

Along with the Open Development Agenda, the Bank is also contributing open source code, including facet search code.

Facet search allows results to be refined by metadata category (author, topic, date, etc). Each facet will show result counts.

The World Bank uses a taxonomy of items that includes the full map of the category hierarchy. To get the facet search filter counts right, entries are checked against each other when showing results, returning the proper number of items related to a given filter selection.
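One way to get hierarchy-aware facet counts right, per the description above, is to roll each document's leaf topics up to their ancestors and count each ancestor at most once per document. A sketch with an invented toy taxonomy – not the Bank's actual code or categories:

```python
from collections import Counter

# Maps a leaf topic to its full ancestor chain, so counts roll up
# the category hierarchy. Entirely invented for illustration.
TAXONOMY = {
    "Microfinance": ["Economics", "Finance", "Microfinance"],
    "Banking": ["Economics", "Finance", "Banking"],
    "Malaria": ["Health", "Disease", "Malaria"],
}

DOCS = [
    {"id": 1, "topics": ["Microfinance"]},
    {"id": 2, "topics": ["Microfinance", "Malaria"]},
    {"id": 3, "topics": ["Banking"]},
]

def facet_counts(docs):
    """Count, per facet value, how many documents fall under it."""
    counts = Counter()
    for doc in docs:
        seen = set()
        for leaf in doc["topics"]:
            # Collect ancestors first so each is counted once per document.
            seen.update(TAXONOMY[leaf])
        counts.update(seen)
    return counts

c = facet_counts(DOCS)
print(c["Economics"], c["Microfinance"], c["Health"])  # 3 2 1
```

The deduplication per document is what keeps a parent category's count equal to the number of matching documents rather than the number of matching leaf tags.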

The remaining problem is supporting drill-down in the search infrastructure: showing a multi-indexed browse of the entire hierarchy of a subject.


Q: Is it possible to showcase developing country research?

A: Yes. World Bank is looking for content in Africa right now, and attempting to gain access. This outreach is coming along slowly but surely.

Posted July 10, 2012 at 5:02 pm in LiveBlog, Updates – P1B: Shared Repository Services and Infrastructure LiveBlog
Jul 10 2012


Topic: Moving from a scientific data collection system to an open data repository
Speaker(s): Michael David Wilson, Tom Griffin, Brian Matthews, Alistair Mills, Sri Nagella, Arif Shaon, Erica Yang

I am here presenting on behalf of myself and my colleagues from the Science and Technology Facilities Council. We run facilities ranging from the CERN Large Hadron Collider to the Rutherford Appleton Laboratory. I will be talking about the ISIS Facility, which is based at Rutherford. People put in their scientific sample, that crystal goes into the facility, and it may be examined for anything from an hour to a few days. The facility produces 2 to 120 files per experiment in several formats including NeXus and RAW (no, not that one – a Rutherford Appleton format). In 2009 we ran 834 experiments, producing 0.5 million files and 0.5TB of data. But that’s just one facility. We have petabytes of data across our facilities.

We want to maximise the value of STFC data, as Cameron indicated in his talk earlier it’s about showing the value to the taxpayer.

  1. Researchers want to access their own data
  2. Other researchers validate published results
  3. Meta-studies incorporating data – reuse or new subsets of data can expand use beyond the original intent for the data
  4. Set experimental parameters and test new computational models/theories
  5. Use for new science not yet considered – we have satellites but the oldest climate data we have is on river depth, collected 6 times a day. It’s 17th century data but it has huge 21st century climate usefulness. Science can involve uses of data that are radically different from those originally envisioned
  6. Defend patents on innovations derived from science – biological data, drug related data etc. is relevant here.
  7. Evidence based policy making – we know they want this data but what the impact of that is may be arguable.

That one at the top of the list (1) is the one we started with when we began collecting data. We started collecting about 1984. The Web came along around 1994–1995, and by 1998 researchers could access their own data on the web – they could find the data set they had produced using an experiment number. It wasn’t useful for others but it was useful for them. And the infrastructure reflected this. It was very simple: instrument PCs as the data acquisition system, a distributed file system and server, delivery and the user.


Moving to reason (2), we want people to validate the published results. We have the raw data from the experiment. We have calibrated data – that’s the basis for any form of scientific analysis. That data is owned by the facility and preserved by the facility. But the researchers do the data analysis at their own institution. The publisher may eventually share some derived data. We want to hold all of that data: the original data, the calibration data, and the derived data. So when do we publish data? We have less than 1% commercial data so that’s not an issue. But we have data policies (different science, different facilities, different policy), shaped largely around the PhD period, so we have a 3 year data embargo. It’s generally accepted by most of our users now, though a few years ago some were not happy with that. We do keep a record of who accesses data. And we embargo metadata as well as data: if it’s known, say, that a drug company supports a particular research group or university, a competitor may start copying the line of inquiry even on the basis of the metadata… don’t think this is just about corporates though… In 2004 a research group in California arranged a meeting about a possible new planet; some researchers in Spain looked at the data they’d been using and, reasoning that that research team had found a planet, announced that THEY had found a planet. It’s not just big corporations; academics are really competitive!


But when we make the data available we make it easy to discover that data and reward it. For any data published we create a Data DOI that enables Google to find the page; and in the UK, HEFCE has said that open access research dataset use will be allowed in the new REF. Data will also be going into the citation index that is used in the assessment of research centres.


So on our diagram of the infrastructure we now have metadata and Data DOI added.


Onto (3) and (4). In our data we include schedule and proposal – who, funder, what etc. – that go with that data. About 5% don’t do what they proposed, so mostly that job is easily done but sometimes it can be problematic. We add publications data and analysis data – we can do this as we are providing the funding, facility and tools they are using. The data can be searched via DataCite. Our in-house TopCat system allows in-house browsing as well. And we’ve added new elements to the infrastructure here.


Looking at (5), (6) and (7): new science, patents, policy. We are trying to build socio-economic impact into the process. We have adopted a commercial product called Tessella Safety Deposit Box, with fixity checks. We have data format migration. And we have our own long term storage as well.
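Fixity checking of the kind mentioned here usually comes down to recomputing a digest recorded at ingest and comparing. The product named is commercial, so this is a generic stand-in sketch using SHA-256:

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so large facility datasets never load fully into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def fixity_ok(path, recorded_digest):
    """Fixity check: does the file still match the digest taken at ingest?"""
    return sha256_of(path) == recorded_digest

# Demonstration with a throwaway file.
fd, path = tempfile.mkstemp()
os.write(fd, b"neutron scattering run 834")
os.close(fd)
digest = sha256_of(path)
print(fixity_ok(path, digest))  # True
os.remove(path)
```

In a real archive the recorded digests would themselves be stored and periodically re-verified on a schedule.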


So that infrastructure looks more complex still. But this is working. We are meeting our preservation objectives. We are meeting the timescale of objectives (short, medium, long). Designated communities, additional information, security requirements are met. We can structure a business case using these arguments.



Q1) As a repository manager I was interested to hear that over the last few years 80% of researchers had gone from unhappy at sharing data to most now being happy. What made the difference?

A1) The driver was the funding implications of data citations. The barrier was distrust in others using or misinterpreting their data but our data policies helped to ameliorate that.

Topic: Postgraduate Research Data: a New Type of Challenge for Repositories?
Speaker(s): Jill Evans, Gareth Cole, Hannah Lloyd-Jones

I am going to be talking about the Open Exeter project. This was funded under the Managing Research Data programme as a pilot with biosciences research, but we are expanding this to other departments. We created a survey for Postgraduates by Research (PGRs) and researchers to comment. We have created several different Research Data Management plans, some specifically targeted at PGRs. We have taken a very open approach to what might be data, and that is informed by that survey.

We currently have three repositories – ERIC, EDA, DCO – but we plan to merge these so that research is in the same place from data to publications.  We will be doing this with DSpace 1.8.2 and Oracle 11g database system. We are using Sword2 and testing various types of upload at the moment.
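A SWORDv2 binary deposit of the kind being tested is essentially an HTTP POST with packaging and checksum headers. A sketch that builds (but does not send) such a request – the endpoint URL and filename are invented, and a real deposit would also need authentication:

```python
import hashlib
import urllib.request

def build_sword_deposit(collection_url, package_bytes, filename):
    """Build (but do not send) a SWORDv2 binary deposit request."""
    md5 = hashlib.md5(package_bytes).hexdigest()
    headers = {
        "Content-Type": "application/zip",
        # SimpleZip is one of the packaging identifiers in the SWORDv2 profile.
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
        "Content-Disposition": "filename=%s" % filename,
        "Content-MD5": md5,
    }
    return urllib.request.Request(collection_url, data=package_bytes,
                                  headers=headers, method="POST")

req = build_sword_deposit("https://repo.example.ac.uk/sword2/collection/theses",
                          b"fake-zip-bytes", "thesis-data.zip")
print(req.get_method(), req.get_header("Packaging"))
```

The server's Atom response would then carry the item's persistent identifier back to the depositor.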

The current situation is that thesis deposit is mandatory for PGRs, but deposit of data is not. There is no clear guidance or strategy for this, nor a central data store. Deposits of large files are growing, but there is no clear strategy for them either. So why archive PGR data? Enhanced discoverability is important, especially for early career researchers; a raised research profile/portfolio is also good for the institution. There is also the ability to validate findings if queried – good for institution and individual. And this allows funder compliance – expected by a number of funders including the Wellcome Trust. And the availability of data on open access allows fuller exploitation of data and enables future funding opportunities.

Currently there is very varied practice. One issue is the problem of data loss – this has an impact on their own work, but increasingly PGRs are part of research groups, so lacking access can be hugely problematic. Lack of visibility limits the potential for reuse and recognition. And inaccessibility can mean duplication of effort and can block research that might build on their work.

The solution will be to support deposit of big data alongside the thesis. It will be a simple deposit. And a long term curation process will take place that is file agnostic and provides persistent IDs. Awareness raising and training will take place and we hope to embed cultural change in the research community. This will be supported by policy and guidance as well as a holistic support network.

The policy is currently in draft and mandates deposit if required by the funder; it encourages it in other cases. We hope the policy will be ratified by 2013. There are various issues that need to be addressed though:

  • When should data be deposited
  • Who checks data integrity
  • IP/Confidentiality issues
  • Who pays for the time taken to clean and package the data? This may not be covered by funders and may delay their studies but one solution may be ongoing assessment of data throughout the PGR process.
  • Service costs and sustainability.

Find out more here



Q1, Anthony from Monash) How would you motivate researchers to assess and cleanse data regularly?

A1) That will be about training. I don’t think we’ll be able to check individual cases though.

Q2, Anna Shadboldt, University of NZ) Given what we’re doing across the work with data mandates is there a reason

A2) We wanted to follow where the funders are starting to mandate deposit but all students funded by the university will also have to deposit data so that will have wider reach. In terms of self-funded students we didn’t think that was achievable.

Q3, Rob Stevenson, Los Alamos Labs) Any plans about different versions of data?

A3) Not yet resolved but at the moment we use handles. But we are looking into DOIs. The DOI system is working with the Handle system so that Handle will be able to deal with DOI. But versioning is really important to a lot of our potential depositors.

Q4 Simon Hodson from JISC) You described this as applying to PG students generally. Have you worked on a wider policy to wider research communities? Have there been any differences with supervisors or research groups approach this?

A4) We have a mandate for researchers across the university. We developed a PGR policy separately as they face different issues. In general supervisors are very pro preserving student data for reuse, as this problem within research projects has arisen before. We have seen PGRs are generally pro this; with researchers it tends to vary greatly by discipline.

More information: http://ex.ac.uk/bQ, project team: http://ex.ac.uk/dp and draft policies are at http://ex.ac.uk/dq and http://ex.ac.uk/dr

Topic: Big Data Challenges in Repository Development
Speaker(s): Leslie Johnston, Library of Congress

A lot of people have asked why we are at this sort of event: we don’t have a repository, we don’t have researchers, we don’t fund research. Well, we actually do have a repository of a sort. We are meant to store and preserve the cultural output of the entire USA. We like to talk about our collections as big data. We have to develop new types of services that are very different from our old service model. We have learned that we have no way of knowing how our collections will be used. We talked about “collections” or “content” or “items” or “files”. But recently we have started to talk about and think about our materials as data. We have Big Data in libraries, archives and museums.

We first looked into this via the Digging into Data Challenge through the National Endowment for the Humanities. This was one of the first introductions to our community – the libraries, archives and museums community – that researchers are interested in data, including bulk corpora, in their research.

So, what constitutes Big Data? The definition is very fluid and a moving target. We have a huge amount of data – 10-20TB per week per collection. We still have collections, but what we also have is big data, which requires us to rethink the infrastructure that is needed to support Big Data services. We are used to mediating the researcher’s experience, so the idea that they will use data without us knowing is perhaps radically different.

My first case study is our web archives. We try to collect what is on the web, but it’s heavily curated content around big events, specific topics etc. When we started this in 2000 we thought researchers would be browsing to see how websites used to look. That’s not the case. People want to data mine the whole collection and look for trends – say for elections, for instance. This is 360TB right now, billions of files. How do we curate and catalogue these? And how do we make them accessible? We also have an issue that we cannot archive without permission, so we have had to get permission for all of these, and in some cases the pages are only available on a terminal in the library.

Our next case study is our historic newspapers collection. We have worked with 25 states to bring in 5 million page images from historic newspapers, all available with OCR. This content is well understood in terms of ingest: it’s four image files, an OCR file, a METS file and a MODS file. But we’ve also made the data available as an API. You can download all of those files and images if you want.
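The newspaper API mentioned here (Chronicling America) exposes a JSON search endpoint. A sketch that builds a query URL without fetching it – treat the exact parameter names as an assumption based on the public API documentation:

```python
from urllib.parse import urlencode

# Chronicling America page-search endpoint.
BASE = "https://chroniclingamerica.loc.gov/search/pages/results/"

def build_query(terms, state=None, page=1):
    """Build (but do not fetch) a page-search URL returning JSON."""
    params = {"andtext": terms, "format": "json", "page": page}
    if state:
        params["state"] = state
    return BASE + "?" + urlencode(params)

url = build_query("suffrage", state="Virginia")
print(url)
```

Fetching that URL would return a JSON payload with OCR text and links to the page image files described above.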

Case study – Twitter. The Twitter archive has tens of billions of files in it (21 billion). We are still working through the archive; we received the 2006-2010 archive this year and are just now working with it. We have had over 300 research requests already in the two years since this was announced. This is a huge scale of research requests. This collection grows by tens of millions of items per hour. This is a tech and infrastructure challenge but also a social and training challenge. And under the terms of the gift researchers will have to come into the library; we cannot put this on the open web.

Case study – Viewshare. A lot of this is based on the SIMILE toolkit from MIT. This is a web tool to upload and share visualisations of metadata. It’s on SourceForge – all open source. Or the site itself: http://viewshare.org/. Any data shared is available as a visualisation but also, if the depositor allows, as raw data. What does that mean for us?

We are working with lots of other projects which could be use cases. Electronic journal articles, for instance – 100GB with 1 million files. How about born-digital broadcast television? We have a lot of things to grapple with.

Can each of our organisations support real-time querying of billions of full text items? Should we provide the tools?

We thought we understood ingest at scale until we did it. Like many universities, access is one thing, actual delivery is another. And then there are fixities and checksums, validating against specifications. We killed a number of services attempting to do this. We are now trying three separate possibilities: our current kit, better kit, and Amazon cloud services. It’s about ingest AND indexing. Indexing is crucial to making things available. How much processing should we do on this stuff? We are certainly not about to catalogue tweets! But the expectations of researchers and librarians are about catalogues. This is a full text collection, and it will never be catalogued. It may be one record for the whole collection. We will do some chunking by time and in their native JSON. I can’t promise when or how this stuff will be happening.
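The "chunking by time and in their native JSON" idea can be sketched very simply: group JSON-per-line tweets into month-sized chunks. The records below are invented stand-ins with far fewer fields than real tweets:

```python
import json
from collections import defaultdict

# A few stand-in tweets in (simplified) native JSON; real Twitter
# records carry many more fields.
LINES = [
    '{"id": 1, "created_at": "2009-03-01T12:00:00", "text": "hello"}',
    '{"id": 2, "created_at": "2009-03-01T18:30:00", "text": "again"}',
    '{"id": 3, "created_at": "2009-04-02T09:15:00", "text": "spring"}',
]

def chunk_by_month(lines):
    """Group JSON-per-line tweets into per-month chunks, keyed by 'YYYY-MM'."""
    chunks = defaultdict(list)
    for line in lines:
        tweet = json.loads(line)
        month = tweet["created_at"][:7]  # "YYYY-MM" prefix of the timestamp
        chunks[month].append(tweet["id"])
    return dict(chunks)

print(chunk_by_month(LINES))  # {'2009-03': [1, 2], '2009-04': [3]}
```

At tens of millions of items per hour the same grouping would be done in a streaming fashion, writing each month's chunk out as its own file rather than holding everything in memory.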

With other collections we are doing more. But what happens if one file is corrupted? Does that take away from the whole collection? We have tried several tools for analysis – BigInsights and Greenplum. Neither is right yet though. We will be making files discoverable, but we can’t handle the download traffic… we share the same core web infrastructure as loc.gov and congress.gov etc. Can our staff handle these new duties or do we leave researchers to fend for themselves? We are mainly thinking about unmediated access for data of this type. We have custodial issues here: who owns Twitter? It crosses all linguistic and cultural boundaries.


Q1) What is the issue with visiting these collections in person?

A1) With the web archives you can come in and use them. Some agreements allow take away of that data, some can only be used on-site. Some machines with analytics can be used. We don’t control access to research based on collections however.

Q2) You mentioned the Twitter collection, and you are talking about self-service collections. And people say stupid stuff there.

A2) We only get the tweets: we get the username, we know user relations, but we don't get profile information or their graph. We don't get most of the personal information. I've been asked if we will remove bad language – no. Twitter for us is like diaries, letters, news reporting, citizen journalism etc. We don't want to filter this. There was a court case decided last week in New York that said that Twitter could be subpoenaed to hand over a user's tweets – we are looking at the implications for us. But as we have the 2006–10 archive this is less likely to be of interest. And we have a six month embargo on all tweets, and any deleted tweets or deleted accounts won't be made available. That's an issue for us actually; this will be a permanently redacted archive in some ways.

Topic: Towards a Scalable Long-term Preservation Repository for Scientific Research Datasets
Speaker(s): Arif Shaon, Simon Lambert, Erica Yang, Catherine Jones, Brian Matthews, Tom Griffin

This is very much a follow-up to Michael's talk earlier, as I am also at the Science and Technology Facilities Council. The pitch here is that we are interested in the long-term preservation of scientific data. There is a lot going on here and it's a complex area, thanks to the complex dependencies of digital objects also needing preservation to enable reusability, and the large volumes of digital objects that need scalable preservation solutions. And scientific data adds further complexity – unique requirements to preserve the original context (e.g. processed data, final publications, etc.), and possibly the preservation of software and other tools.

As Michael said, we provide large scale scientific facilities to UK science. The experiments running on STFC facilities generate large volumes of data that need effective and sustainable preservation along with contextual data. There is significant investment here – billions of euros involved – and we have a huge community of usage here as well, with 30K+ user visits each year in Europe.

We have a fairly well established STFC scientific workflow. Being central facilities we have lots of control here. And you’ve seen our infrastructure for this. But what are the aims of the long term preservation programme? Well we want to keep data safe – the bits that are retrievable and the same as the original. We want to keep data usable – that which can be understood and reused at a later date. And we have three emerging themes in our work:

  • Data Preservation Policy – what is the value in keeping data
  • Data Preservation Analysis – what are the issues and costs involved
  • Data Preservation Infrastructure – what tools do we use

But there are some key data preservation challenges:

  • Data Volume – for instance a single run of an ISIS experiment could produce files of 1.2GB in size. An experiment typically has 100s of runs – files of 100+GB in total size. ISIS is a good test bed as these sizes are relatively small.
  • Data Complexity – scientific HDF data format (NeXus), structural and semantic diversity in files
  • Data Compatibility – 20 years of data archives here.

We are trialling a system that is proprietary and commercial and manages integrity and format verification; designed within a library and archive context, it turns a data storage service into a data archive service. But there are some issues. There is limited scalability – it is not happy with files over several GBs. There is no support for syntactic and semantic validation of data. No support for linking data to its context (e.g. process descriptions, publications). And there is no support for effective preservation planning (with tools like Plato).


We are doing this in the context of a project called SCAPE – Scalable Preservation Environments – an EC FP7 project with 16 partners (Feb 2011–Jan 2015), a follow-on from the PLANETS project. We are looking at facilitating compute-intensive preservation processes that involve large (multi-TB) data sets, and we are developing cloud-based preservation solutions using Apache Hadoop. For us the key products from the project will be a scalable platform for performing preservation operations (with potential format conversion), to enable automatic preservation processes. So our new infrastructure will add further context into our preservation service, and a watch service will alert us to necessary preservation actions over time. We will be storing workflows, policies and what we call PNMs for particular datasets. The tricky areas for us are the cloud-based execution platform and the preservation platform.


The cloud-based workflow execution platform will be built on Apache Hadoop, and workflows may range across ingest operations etc. We are considering using Taverna for workflows. PNM stands for Preservation Network Models, a technique developed by the CASPAR project to formally represent the outputs of preservation planning. These models should help us control policies, workflows, and what happens with preservation watch.

Finally, this is the sort of workflow we are looking at to control this – the process we might run for a particular file. Ingest with format validation via a JHOVE-type tool. Then we check the semantic integrity of the file. Then we build our AIP (Archival Information Package) and so on.
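A minimal sketch of that per-file workflow, under heavy assumptions: the format and semantic checks below are hypothetical stand-ins (in practice a tool such as JHOVE would do format validation, and a NeXus-aware checker the semantics), and the AIP is reduced to a simple zip package rather than a full OAIS-style structure:

```python
import json
import zipfile
from pathlib import Path

def validate_format(path):
    # Stand-in for a JHOVE-style format check; here we only confirm non-emptiness.
    return Path(path).stat().st_size > 0

def check_semantics(path):
    # Stand-in for semantic integrity checks against, e.g., the NeXus specification.
    return True

def build_aip(data_file, metadata, out_path):
    """Package the data file plus its metadata into a simple zip-based AIP."""
    with zipfile.ZipFile(out_path, "w") as aip:
        aip.write(data_file, arcname=Path(data_file).name)
        aip.writestr("metadata.json", json.dumps(metadata))
    return out_path

def ingest(data_file, metadata, out_path):
    """Run the workflow steps in order, failing fast on any check."""
    if not validate_format(data_file):
        raise ValueError("format validation failed")
    if not check_semantics(data_file):
        raise ValueError("semantic check failed")
    return build_aip(data_file, metadata, out_path)
```

The point of structuring it as discrete steps is that each one can later be swapped for a real tool, or distributed as a Hadoop task, without changing the overall flow.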

So at the moment we are in the design stage of this work but there are further refinements and assessment to come. And we have potential issues to overcome – including how Taverna might work with the system.

But we know that a scalable preservation infrastructure is needed for STFC’s large volumes of scientific data.


Q1) We run the Australian Synchrotron, so this was quite interesting for me. When you run the experiment, will that data automatically be preserved? Ours is shipped to a data centre and can then be accessed as wanted.

A1) For ISIS the data volumes are relatively low, so we would probably routinely store and preserve data. For the Synchrotron the data volumes are much larger, so that's rather different. Although the existing work on crystallography may help us with identifying what can or cannot be preserved.

Q2) Where do you store your data? In Hadoop or somewhere else? Do you see Hadoop as a feasible long term data solution?

A2) I think we will be mainly storing in our own data systems. We see it as a tool to compute really.

Q3) What software do you use in the data centre to store that much data?

A3) We have a variety of solutions. Our own home-grown system is in use. We use CASTOR, the CERN system. We have a number of different ones as new ones emerge. Backup really depends on your data customer. If they are prepared to pay for extra copies you can do that – that's a risk analysis. CERN has a couple of copies around the world. Others may be prepared to take the risk of data loss rather than pay for storage.

Topic: DTC Archive: using data repositories to fight against diffuse pollution
Speaker(s): Mark Hedges, Richard Gartner, Mike Haft, Hardy Schwamm

The Demonstration Test Catchment project is funded by Defra and runs from Jan 2011 to Dec 2014. It's a collaboration between the Freshwater Biological Association and KCL (Centre for e-Research) and builds upon previous JISC-funded research. To understand the project you need to understand the background to the data.

Diffuse pollution is the release of a polluting agent that may not have an immediate effect but may have a long-term cumulative impact. Examples of diffuse pollution include run-off from roads, discharges of fertilisers from farms etc. What is a catchment? Typically this is the area draining into a particular body of water at a particular point. And the final aspect is the Water Framework Directive. This is a legal instruction for EU member states that must be implemented through national legislation within a prescribed time-scale. This framework impacts on water quality, and so this stretches beyond academia and eResearch.

The project is investigating how the impact of diffuse pollution can be reduced through on-farm mitigation methods (changes to reduce pollution) and those have to be cost effective and maintain food production capacity. There are 3 catchment areas in England for tests to demonstrate three different environment types.

So how does the project work? Well roughly speaking we monitor various environmental markers; we try out mitigation measures, and then analyze changes in baseline readings. And it’s our job to curate that data and make it available and usable by various different stakeholders. So these measurements come in various forms – bankside water quality monitoring systems etc.

So the DTC archive project is being developed. We need that data to be useful to researchers, land managers, farmers, etc. So we have to create the data archive, but also the querying, browsing, visualising, analysing and other interactions. There need to be integrated views across diverse data that suit their needs. Most of the data is numerical – spreadsheets, databases, CSV files. Some of this is sensor data (automated, telemetry) and some are manual samples or analysis. The sensor data are more regular; there is more risk of inconsistencies in the manual data. There is also species/ecological data, and geo-data. Also less highly structured information such as time series images, video, stakeholder surveys, unstructured documents etc.

Typically you need data from various objects. So to check levels of potassium you need data from a number of points in the sensor data as well as contextual data from adjacent farms. So looking at the data we might need spreadsheets of sensor data, weather data, and land usage data as a map of usage, for instance.
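As a hedged sketch of that kind of integrated view – all field names here are invented for illustration, not the DTC schema – joining sensor readings with contextual records about adjacent farms, keyed on the monitoring site, might look like:

```python
def integrate(readings, farm_context):
    """Enrich sensor readings with farm context.

    readings: list of dicts, e.g. {"site": ..., "potassium": ...}
    farm_context: dict mapping site -> contextual fields, e.g. {"land_use": ...}
    Returns readings merged with the context for their site (empty if none known).
    """
    merged = []
    for r in readings:
        ctx = farm_context.get(r["site"], {})
        merged.append({**r, **ctx})  # context fields are added alongside the reading
    return merged
```

In the real system this joining happens over much more heterogeneous sources (maps, weather feeds, manual samples), but the core operation is this kind of key-based enrichment.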

Some challenges around this data: the datasets are diverse in terms of structure, with different degrees of structuring – both highly structured and highly unstructured data combined here. Another challenge for us is INSPIRE, with its intent of creating a European Spatial Data Infrastructure for improved sharing of spatial information and improved environmental policy. It includes various standards for geospatial data (e.g. Gemini2 and GML – Geography Markup Language) and it builds on various ISO standards (the ISO 19100 series).

The generic data model is based around ISO 19156, concerned with observations and measurements. The model facilitates the sharing of observations across communities and includes metadata/contextual information and the people responsible for measurement. And this allows multiple data representations. The generic data model is implemented in several ways for different purposes: an archival representation (based on library/archival standards), a data representation for data integration (an "atomic" representation as triples), and various derived forms.

In the Islandora repository we create data and metadata objects – METS files, MADS files and MODS files. That relationship to library standards is a reflection of the fact that this archive sits within a bigger, more bibliographic type of archive. The crucial thing here is ensuring consistency across data components for conceptual entities etc. So to do this we are using MADS, the Metadata Authority Description Schema, which helps explain the structure and format of the files and links to vocabulary terms and table search. The approach we are taking is to break data out to an RDF-based model. This approach has been chosen because of the simplicity and flexibility of that data model.

Most of this work is in the future really, but it is based on that earlier JISC work – breaking data out of tables and assembling it as triples. Something that is clear from an example data set – where we see collection method, actor, dataset, tarn, site, location, and multiple observation sets each with observations, all as a network of elements. So to do this we need common vocabularies – we need columns, concepts, entities mapped to formal vocabularies. Mappings are defined as archive objects. We have automated, computer-assisted and manual approaches here. The latter require domain experience and mark-up of text.
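A toy version of "breaking data out of tables into triples": each cell of a row becomes a (subject, predicate, object) statement, with column names mapped to vocabulary terms. The vocabulary URIs below are invented placeholders, not the project's actual vocabularies, and a real implementation would likely use an RDF library such as rdflib rather than bare tuples:

```python
# Illustrative column-to-vocabulary mapping; these URIs are made up for the sketch.
VOCAB = {
    "site": "http://example.org/vocab/site",
    "potassium": "http://example.org/vocab/potassiumLevel",
}

def rows_to_triples(rows, base_uri="http://example.org/obs/"):
    """Atomise tabular rows into (subject, predicate, object) triples.

    Each row becomes one observation subject; each cell becomes one triple,
    with the column name resolved through the vocabulary mapping."""
    triples = []
    for i, row in enumerate(rows):
        subject = f"{base_uri}{i}"
        for column, value in row.items():
            predicate = VOCAB.get(column, f"http://example.org/vocab/{column}")
            triples.append((subject, predicate, value))
    return triples
```

The appeal of this model, as the talk notes, is its simplicity and flexibility: once everything is atomised, the same triples can be recombined into whatever integrated views a stakeholder needs.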

Architecturally we have diverse data as archival data in Islandora. It is then mapped and broken into RDF triples, and then mapped again out to browsing, visualisation, search and analysis for particular types of access or visualisation. That break-up may seem a bit perverse. We think of it as breaking the data into atoms and recombining it again.

The initial aim is to meet needs of specific sets of stakeholders, we haven’t thought about the wider world but this data and research may be of interest to other types of researchers and broader publics in the future.

At the moment we are in the early stages. Datasets are already being generated in large quantities. There is some prototype functionality. We are looking next at ingest and modeling of data. Find out more here: http://dtcarchive.org/


Q1) This sounds very complex and specific. How much of this work is reusable by other disciplines?

A1) If it works then I think the general method could be applicable to other disciplines. But the specifics are very much for this use case but the methodology would be transferrable.

Q2) Can you track use of this data?

A2) We think so, we can explain more about this

Q3) It strikes me that these sorts of complex collections of greatly varying data is a common type of data in many disciplines so I would imagine the approach is very reusable. But the Linked Data approach is more time consuming and expensive so could you explain cost benefit of this?

A3) We are being funded to deliver this for a specific community. Moving to the end of the project converting the software to another area would be costly – developing vocabularies say. It’s not just about taking and reusing this work, that’s difficult, it’s about the general structure.

And with that this session is drawing to a close, with a thank you from our chair Elin Stangeland.

July 10, 2012, 2:42 pm – Posted in LiveBlog, Updates
Jul 10 2012

Today we are liveblogging from the OR2012 conference at George Square Lecture Theatre (GSLT), George Square, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Kevin is introducing the Minute Madness by reminding us that all posters will be being shown at our drinks reception this evening so these very short introductions will be to entice you to visit their stand. Les Carr is chairing the madness and will buy drinks for any presentation under 45 seconds as an incentive for speed!

Our first speaker in the room is poster #105 DataONE (Observation Network for Earth) – we just heard the reasoning for why we need this: there are thousands of repositories that need to be linked together. DataONE does this, integrating data and tools for earth observation data – tools researchers use already like Excel, like SAAS etc.

#100 is a mystery!

#109 on Metadata Analyser Portal – it checks metadata quality for the depositor and for the repository manager, and we want to build a ranking based on quality of metadata. Come to my poster and discuss this with me!

#112 on Open Access Publishing in the Social Sciences – one of the leading repositories in Germany. I want to talk about the role this kind of repository can take, and how we can ensure quality of publications.

#114 Open Access Directory – it's hard to check the open access status of data. Come chat to us at our poster and, more importantly, look at our website oad.simmons.edu.

#121 Design and development of LISIR for Scholarly Publications of Karnataka State – looking at how universities in Edinburgh have been using this technology to deposit in DSpace

#136 Can LinkedIn and Academia.edu enhance access to Open Repositories – how do we get our research out? It's all about links and connectedness; the commercial publishers encourage this, why don't you? Come tell me!

#149 Sharing experiences and expertise in the professional development of promoting OA and IRs between repository communities in Japan and the UK

#? Another mystery

#160 Making Data repositories visible – building a register of research data repositories. We want to encourage sharing and reuse of research data. We have research work planned on this, come talk to me about it!

#161 another mystery

#207 Metadata Database for the upper atmosphere using DSpace – a metadata repository for geospatial data. We have solved the issue of cross-searching for this metadata repository – come find out more!

#209 Revealing presence of amateurs at an institutional repository by analysing queries at search engines – I think it is difficult to segment repository users into different groupings, but it's important; they have different needs. Come see me to find out how we have overcome this.

#223 Integrating Fedora into the Semantic Web using Apache Stanbol – we are trying to graph the web and come along to find out more about using semantic web without losing durability of data

#224 Using CKAN – storing data for re-use – as used in data.gov.uk. The public hub lets you share data, your code, your files – you get an API for your data and stats. You can use ours or download and run your own.

#251 Developing Value-added services facilitating the outreach of institutional repositories at Chinese Academy of Sciences – maybe you don’t get good opportunities to visit China but we will share our experience – come see our poster



#263 The RSP Embedding Guide – there was once a sad dusty library and no one spoke to it. Sometimes people would throw it an article and it would be happy… but then quickly sad again. Then one day the repository manager found the RSP embedding guide and you could find out all about the happy ending at our poster!

#268 DuraCloud poster proposal – digital preservation is important but not all institutions are able to deliver it. We have built DuraCloud, a web-based solution. Our poster will debunk the myths of the cloud – DuraCloud and other cloud services – for checking data integrity.

#271 SafeArchive – automated policy-based auditing and provisioning of replicated content – there are many good tools in this space, such as DuraCloud and local systems such as LOCKSS; what is difficult to do with these tools is to show a relationship between replication services and policy. SafeArchive does that.

#274 Current and future effects of social media based metrics on open access and IRs – my open data archive provides an open access repository and it is a social media based OA repository. One of the smallest repositories, but well known on social media. I want to discuss any metrics – come see my poster!

#275 Adapting a spoken language data model for a Fedora repository – this data type is hard to process and expensive to produce, so we need repositories and data models that work with it. Annotations of video and audio, metadata specific to this etc. will all be at my poster!

#276 All about Hot Topics, the DuraSpace community webinar series – this is a web seminar series addressing issues bubbling up from the community. Talk to me about the series and perhaps how you can get involved.

#277 A handshake system for Japanese Academic Societies and Institutional Repositories – we work as something like JISC or JANET and we recently started a repository hosting service called JAIRO Cloud. We have tried to make a handshake for academic society repositories – I'll explain how at my poster!

#278 Create, Attract, Deposit – we at the New Bulgarian University have a poster on how we have increased deposit into our institutional repositories. We used web 2.0 to increase our deposits from 0.7% to 2% in just a year! Come and find out what we did and how we promote these materials.

#279 Engage – using data about research clusters to enhance collaboration – funded in part under JISC business dev strand come see us to find out more and tell us your experiences

#281 CSIC Bridge – linking digital.CSIC to an institutional CRIS – we have used homegrown software and other external tools to automate ingestion. I'll talk about pros and cons, and also integration with the DSpace IR and how we are using the CRIS rather than the DSpace deposit tool.


#283 JAIRO Cloud – national infrastructure for institutional repositories in Japan – I am a tech person without much money. In Japan there are 800 universities and 600 are a bit like me in that regard, so the National Institute of Informatics has begun to offer shared cloud repositories; 17 are already open to the public. Come find out more.

#284 The CORE Family – COnnecting REpositories is the project. Like William Wallace we are fighting for freedom, in terms of open access. We are providing access to millions of resources. But hopefully we won't end up the same way: hung, drawn and quartered!

#285 Enhancing DSpace to synchronise with sources having distinct updating patterns – I am presenting Lume, a repository aggregating work from several different data sources, and how we are enabling provision of embedded videos.

#286 Cellar – the project for common access to EU information – 43 million files in 23 languages, delivered in multiple formats including JSON and SPARQL.

#288 Moving DSpace to a fully featured CRIS system – come see how we have been doing this, the adaptations made etc.

#291 Makerere University’s dynamic experience of setting up, content collection and use of an institutional repository running on DSpace – we have been doing this for 5 years, come find out about our taking this to the next level.


#294 History DMP: managing data for historical research – we have very active history researchers and got funding to work with those historians to gather and curate data through data management plans created with historians, we have 3 case studies, and we enhanced our repository for these results.

#295 NSF DMP content analysis: what are researchers saying about repositories – find out what crazy things researchers have been saying

#296 Making DSpace Content Accessible via Drupal – we recently moved to Drupal, and as departments migrated we wanted to deposit publications etc. into the repository. They were fine with that but wanted it to look just like the website. So come find out how we did this via the DSpace REST interface.

#297 Databib – an online bibliography of research data repositories – perfect for researchers, libraries, repository managers etc. Please stop by the poster or site to make sure your repository is represented. All our metadata is available via CC0

#298 Making it work – operationalizing digital repositories: moving from strategic direction to integrated core library service – we started out like a garage band with just our moms and boyfriends hanging out. But like a good garage band we've gotten better, and high level researchers now want to jam with us. Come find out how we moved from garage band to centre stage!

#299 Publishing system as the new tool for the open access promotion at the university – we migrated over to an open journals system, come find out more about this.

#300 The CARPET project – an information platform on ePublishing technology for users, developers and providers – matchmaking between these groups and technologies. Please come to our poster and ask me how we can help you.

#301 Proactive personalized self-archiving – we have written an application outside repositories that allows users to submit metadata and data into repositories.

#302 DataFlow project – DataStage – personalised file management – this is a love story of DataBank and DataStage. They were made for each other but didn't know it! We are an open access project, so this is an open relationship between these two components.

#303 Databank – a restful web interface for repositories – come see us!

#304 Repositories at Warwick – how we refreshed our marketing for the repositories and how we used the "highlight your research" strapline. We launched the service late last year and in the first 10 months we saw a nearly 50% increase in deposits. Come find out about our process and end product.

#306 University of Prince Edward Island's VRE Service – this poster is a chronological narrative/fairytale tracing the repository process at PEI and Islandora itself. If you are a small institution trying to make your repository work, come speak to me!

#307 CRUD (the good kind) at Northwestern University Library – a Drupal based system, a Hydra based deposit system and a Fedora Repository. It’s fun stuff, come talk

#309 Client side interfaces for a content re-use framework based on OAI-PMH – an extension of OAI-PMH serving images via JSON. A brand new framework, a very beautiful framework. Come see me.

#311 Agricultural repositories in India – darn, our presenter isn’t here

#312 If you love them, set them free: developing digital archives collections at Salford – we have been working with our local community to share and make collections available. We’ve worked hard to make our stuff more discoverable and easier to enjoy

#315 At the centre – a story first (with props!). One of the first journeys to St Andrews was by a monk to move bones of St Andrews for safekeeping. Today researchers are still inspired to come to St Andrews… our poster explains how research@StAndrews has led to all sorts of adventures and encounters.


#319 Introducing the Islandora Stack and Architecture – Islandora is open access repository software that connects to Drupal (used by everyone from NASA to Playboy). Come find out more about Islandora, about recent updates, or about Prince Edward Island where we will be hosting OR2013.

#320 Implementing preservation services in an Islandora Framework – various approaches will be discussed notably Duracloud

#324 Use of a shared central virtual open access agriculture and aquaculture repository (VOA3R) – an open source portal that harvests OA scientific literature from different institutional repositories, and it embeds a social network. This project is funded by the EU Framework 7 programme and the technology is reusable and open source.

#325 Integrating an institutional CRIS with an OA IR – find out how we are using text harvesting with Symplectic elements to create a repository full of high quality open metadata

#326 SAS OJS: overlaying an Open Journals service onto an Institutional Repository – SAS OJS, better known as "sausages" – find out about our pilot with legal researchers at the University of London.

#328 Putting the Repository First: Implementing a Repository to RIS EWorkflow for QUT ePrints – we’ve made the repository the only deposit for metadata for research publications

#329 Implementing an enhanced repository statistics system for QUT ePrints – so important but our researchers wanted to collate statistics at author, research group, school, faculty and home repository level (as well as article level) – my poster talks about how we implemented this and how it has gone down.

#287 OpenAIRE: supporting open science in Europe – a pitch with a poem that I can't do justice to here. But we talk about supporting open science in Europe, training… add Continental Chic to your OR2012!

#197 OpenAIRE Data Infrastructure Services: On Interlinking European Institutional Repositories, Dataset Archives and CRIS systems – how we did the technical work here and how it can be reused by you!


July 10, 2012, 1:08 pm – Posted in LiveBlog, Updates
Jul 10 2012

Today we are liveblogging from the OR2012 conference at George Square Lecture Theatre (GSLT), George Square, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

John Howard, Chair of the Open Repositories Steering Committee is introducing our opening keynote. I know there are lots of people who have been to Open Repositories previously. I think it’s fair to say that OR is a conference for people with the hearts and minds to make open repositories work: it’s for developers, suppliers, for everyone involved in the expanding ecosystem of repositories. It’s a learning opportunity, a networking opportunity, a way to step out of our day to day roles and make some new connections and gather new ideas.

This is our 7th OR conference. You will hear more about OR2013 in the closing plenary.

OR was started by people very much like yourselves who are passionate about repositories and wanted to share ideas and experience on an international scale. I want to thank the current OR Steering Group. The Steering Committee steers the direction of the year and selects the programme chair; this year Kevin Ashley is the programme chair, and he has been engaging in fortnightly calls with ourselves and the co-chairs of the local Host Organising Group: Stuart Macdonald and William Nixon. Thank you also to our User Group chairs: John Dunn, Robin Rice and William Nixon.

If you want to stay in touch with us throughout the year I would ask you to join our Google Group and follow our new twitter account: @ORConference. And if you have ideas about a logo for OR please do let us know.

Now to Kevin Ashley, Programme Chair of OR2012.

Welcome on behalf of my own organisation, the DCC, to EDINA and the University of Edinburgh! This year we have more attendees, more sponsorship and more sunshine than ever before… one of those may be untrue!

This year we have tried to bring the spirit of the fringe to Open Repositories as we have, for many years, run the annual Repository Fringe event. So please join in the fringier aspects of the programme! And connected to that I want to remind you that ideas for the Developer Challenge must be in by 4pm today so submit them soon!

Cameron has been an advocate and activist for open research for years, and has just taken up a new role as Director of Openness at the Public Library of Science (PLoS).

Cameron Neylon

I am going to talk about what we have learned about open repositories from a high-level overview.

Please do use, film, tweet, share, or reuse anything I share today in whatever ways are useful to you.

So, what is the challenge that we face in making research effective for the people who fund it? For researchers there is an incredible level of frustration about being unable to deal with the demands of funders, stakeholders, and colleagues to make the most of what we do. We often feel like we are shouting at each other – more broadcasting than understanding of issues.

There is a sense that something is missing. We have all these tools to do something new, but in terms of delivering what we can create and convert from the money we receive and the resources we have access to… there is something missing in this delivery pipeline that's not getting us to where we should be.

So I’m going to structure this talk around a sort of 3-2-1 pattern. 3 things to change, wrapped in 2 conceptual changes, around one central principle which, for me, is the useful way to bring these issues and thoughts together.

Let's start from my background. I'm now working for PLoS. I have been involved in open access advocacy for 7 years, but I've also been interested in open things – open data, open science etc. – for years.

I could make a public good argument for open research but that’s not really the environment we are in today. We need more hardnosed and pragmatic reasons to approach these problems. I shall take the tie-wearing approach. A business case. What do we have to deliver as a business, as a service provider, for the people who fund our salaries, our work?

I want to talk about quality of service, value for money, sustainability.

And if I’m shaping a business case then who are we serving? Who is the customer? Who are we marketing to?

You might think it’s policy makers… but they just funnel money through to us. Yes, it’s important to make arguments to government, but it’s much MORE important to make the case to taxpayers, that wider public that includes us. There is a real public appreciation of research and the time it takes to deliver; we saw that last week in the Higgs boson announcement (albeit in Comic Sans). So the customer is the global public. And they want outcomes: not research outputs, but how effectively we translate those outputs into meaningful outcomes.

So why are we having this conversation, why is it happening and why is it happening now? Well, we are going through the biggest change in communications technology since the printing press, perhaps since writing. Our ability to communicate has changed SO radically that we are in a totally different world than 20 years ago. Networks qualitatively change what we can do and achieve.

Most of you can remember a time without mobile phones. 20 years ago if I’d shown up and wanted to meet for a drink it would have been difficult or impossible. Email wasn’t useful back then either as so few people had it. When you start with nodes and start joining up the network… for a long time little changes. You just let people communicate in the same way they did before… right up until everyone has access to a mobile phone. Or everyone has email. You move from a network that is merely better connected to a network that can be traversed in new ways. For chemists this is a cooperative phase transition, where the network crystallises out of solution.
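The "phase transition" Cameron describes here is the classic giant-component threshold in random networks: below an average of roughly one link per node, adding connections changes little, then a single cluster suddenly spans most of the network. A minimal stdlib-Python sketch (not from the talk; names and parameters are illustrative) makes the jump visible:

```python
import random

def largest_component_fraction(n, avg_degree, seed=0):
    """Build a random network of n nodes with the given average degree
    and return the fraction of nodes in its largest connected component,
    tracked with a simple union-find."""
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # avg_degree = 2 * edges / nodes, so add n * avg_degree / 2 random edges
    for _ in range(int(n * avg_degree / 2)):
        ra, rb = find(rng.randrange(n)), find(rng.randrange(n))
        if ra != rb:
            parent[ra] = rb  # merge the two components

    sizes = {}
    for node in range(n):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n

# Below one link per node the largest cluster stays tiny; above it,
# a giant component abruptly covers most of the network.
for deg in (0.5, 1.0, 2.0, 4.0):
    frac = largest_component_fraction(5000, deg)
    print(f"avg degree {deg}: largest component = {frac:.2f}")
```

Running this shows the qualitative jump: at average degree 0.5 the largest cluster is a sliver of the network, while by degree 4 nearly every node is reachable from every other, which is the "everyone has a phone" moment in the talk.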

But that’s a really big concept. So let’s look at Tim Gowers, a mathematician and a blogger. He wondered if there was a new way to do academic maths. He posted a problem on the web, a hard one. He said he didn’t intend to solve the problem but wanted to involve as many people as possible in commenting on his approach. He expected the problem to take 6 to 9 months. Six weeks later he felt the problem was solved, along with a much larger problem being approached in a new way. And it wasn’t as he expected: a large group of mathematicians discussing the issue on a WordPress blog were able to think through approaches and solve together a problem that one of the world’s greatest mathematicians had not been able to solve alone. It allowed things to be done that were not possible before. A qualitative change in research capacity, mediated by a pretty ropey system in which conversations could take place.

I want to talk now about Galaxy Zoo. Astronomy is very much driven by the idea of testing hypotheses, and that means looking through huge amounts of data. That’s a problem because you can only do about 100 sets of data a week. But you need about 10,000 classifications of galaxies to reach a level where you can publish a paper. Even a PhD student can only do 50,000 in the course of his or her studies. And there is a further problem: lots of people look at the same data, so this is hugely inefficient. But that data is from the Sloan Digital Sky Survey, an open data source of sky data. And there were a million sets of data. Computers don’t classify this stuff effectively. So what did they do? Well, they took that data, put it on the website, and created a simple mechanism. Those million galaxies were checked 5 times over by 300,000 people in 6 months. That’s qualitatively different.

In both cases the change is because of scale, because of connectivity and mobility of data, and critically because of the efficient transfer of information. Galaxy Zoo could push high resolution images and data could be pushed back by users.

So the question as service providers has to be “how do we get some of that?” How do we make networks? And how do we deliver them so they are the right shape, the right size, the right connectivity for the right problems.

We need

1. Connectivity

2. Low friction

3. Demand-side filters

The first two of those are easy. We have the web. And really easy transfer of digital and even physical objects (as long as the metadata is good) is fast and efficient. But that last issue is hard, as our current approach is based on limiting and controlling access. And if you are doing that with research then you are delivering something that no one wants.

So how do we think about that? How can we reconfigure the way we do things?

So, this is a paper by Gunther Eysenbach (2011) JMIR 13(4):e123. It’s quite a controversial paper, but it starts from the principle that letting people know about research will increase how much it is used. If you can connect those who can and want to use research to that research, it makes sense. But it’s naive to think of connecting up just the research network; that overlooks the 400 million people on Twitter. Someone there will make the connection between the two networks, and more than one person. This is a serendipity engine. And you can do new things you haven’t thought of before, expand into new areas of research, and connect with people who do not work in the research process but are interested in it; there are more of them than researchers at this scale.

The problem is that as we let people know that this research exists, those connections drop and the effect fades out, because of access barriers to that research, to those publications. Each time you break that network you lose potential outcomes, you lose value, you fail to optimise the network. You guys know this. You know that open research and collaboration leads to more and better research…

But the problem is that we are used to thinking commercially. The analogy is that we take our car to be serviced, and then we rent it back. The problem is that the garage can say that any loaning or renting out of the car breaches the contract; they have to find new ways to make money out of new opportunities. But if we turn that on its head, if you pay upfront in kind or in cash, then the service provider’s interests can be aligned with those of the researcher or the public, provided the service provider gives access to the most people possible.

When we talk about publications we need to talk about first copy access. But we can also look at recent research in Denmark on the economic cost to small businesses of not having access to research: equivalent to maybe £700m in the UK, let alone saved costs to government etc.

So we talked about those three aspects.

1) Scale the network to make things available. This is being addressed as the old publication model ends. This service industry, ways of making content sharable and discoverable, is a great service to be in.

2) We need to think about filtering at the demand side of the system. We are used to peer review as the filter. But that filtering is a friction if it’s on the supplier side. Whether peer review works or not, it can’t always be the right filter, certainly not the right filter for everyone. The thing you don’t share because the results aren’t useful, I need in order to understand the methodology; the results you don’t share because they don’t support your argument, I need because I want the data; and that garbage paper you wrote, I need so I can learn how to do things better myself.

We need filters that we control to deal with the issue of filter failure. As a reader or user I want a way to discover what I want, for the purpose I want, at that point in time. Ideally I want to know about this stuff ahead of time. I think this is the biggest opportunity: to make everything available in a way that progresses research. This is what you do!

So what does this mean in practical terms?

Well, we were at a stage of putting things into the repository; we’ve moved beyond that to thinking about using things in repositories and understanding that use. We need to optimise the repository. What are the barriers? What is the friction? Licensing: just sort it out. Make open the default. But we also still have lots of broken connections; how can we connect them up? How can we aggregate data on usage and citation? What is the diversity in your repositories? How can we connect things to the wider graph and systems? How can we support social discovery? And how can we enable annotation and link it across resources? Annotation is a link; it probably won’t come from depositors. Mostly it will come from fairly random people on the web!

And the other big shift is to think about quality assurance. Badge it, make high-quality stuff clear, but share everything. Just badge and certify the good stuff. That saves you filtering it all down and allows all sorts of usage.

So repositories must be open, they must be accessible, and they have to be open to incoming connections from the global networks.

We are judged on research outcomes, usage when the right person finds it. And in that context a new connection could be more valuable than a new resource. This is a change to our way of thinking. We have to build those networks.

So again, 3 areas to deliver… scale and connectivity of the networks, reduced friction, and demand-side filters.

1. The old model of giving away our intellectual property to pay for the printing of it is dead! We need mechanisms, maybe through repositories, to make sure research is as effective as possible.

2. Filter on demand side, probably even automated

And that’s wrapped in one central idea. Think at the scale of networks. Assume that hundreds of thousands of people are looking at your work or want to. Assume that you cannot predict the most important use of your data. How you apply limited resources to engage with the fact that we are operating at a whole new scale is crucial.

We can’t build a system on the old truths. We could build a system on today’s truths but it wouldn’t last long. The only thing we can do for the future is to build for things we don’t expect, to be ahead of trends. Innovators don’t follow markets. They build them. When we provide services for the general public as innovators we need to build for the future. The network and its infrastructures and its systems and capacity are our future.


Q1 – Brian Kelly) You talked about building, not following, markets. We have Twitter etc. Should we build the open one?

A1) That’s a really good question. I’ve always been against a Facebook for Science or a GitHub for Science… the best Facebook is Facebook, the best GitHub is GitHub. But that was a world where the web was more open. There could come a point where it is worth our while to build our own tools for connectivity.

We are probably a long way away from needing a new Twitter. But we need to be looking out for that. Twitter has 400 million people; whatever we build will have fewer. It has to get much worse to be worth shifting, but we should argue against it getting a lot worse.

Q2 – Les Carr) You talked about the web, about systems. But the web is a socio-technical framework full of people with their own agendas. The web is a disruptive technology, how do we create disruptive academic communities that will make a real difference rather than playing it softly as we have been?

A2) Part of the answer is getting in people’s faces more. And making opportunities for that. For me what will drive that is the way the government is monitoring outcomes and use of outcomes. There is pressure on researchers to do that but we are not used to that. The place to be disruptive is at the point of maximum pain. That’s coming soon for EPSRC funded research. It might be coming soon with implications of Finch report impacts on UK publications. Pick the point carefully but in next 6 to 24 months there will be the right pain point to be disruptive and to show that we can ease that pain. The time of sitting back and facilitating researchers has probably passed.

Q3 – Dave Tarrant) The New World Journal answer is to charge for journals. So how can we connect this community together rather than still have the serials model where we charge for the good stuff and have other stuff out there?

A3) That issue of silo-ing is important, but it is solved for me by proper licensing, where people can pull content together in any form they want for free. That problem hopefully goes away. There are still technical barriers but they can be overcome. But that only works if the content is properly licensed. The other problem is we don’t want to exchange a problem with access on the read side for problems on the write side. Publishing, formatting and distributing research costs money. Anyone funding research really needs to ensure those costs are part of that funding. We have a lot of thinking to do around the transitional process. One way would be to shift the peer review process. If we could flip or change that model, that would bring costs down. Contributions in kind should be considered; I’m not exactly sure what that looks like, but we need to think about it. And those of us on the publications side have to facilitate this. At PLoS when you publish we say how much this journal costs to publish in, and ask what you can afford to pay, even if nothing. But that’s not a long-term solution for all; that becomes charity. We need to remember that a paper is just research and a publication is just a repository. If it’s not worth the cost of publication and the IR is the solution, then so be it. Publishers worry about value for money, otherwise you wouldn’t see embargoes. Open access publishers are not threatened by questions about the value they add to the deposited copy.

Q4) I think that model works for research papers, for software too. But for data? That’s much more complex?
A4) I was in New Zealand last week and my default CC0 answer isn’t possible under their copyright law. There is licensing and there are legal instruments. We want stuff to be interoperable, and open licences allow the most reuse possible. The principle should be maximum interoperability with the most open licence you can. So, adopting Susan Morrison’s work, there are some licences to suggest for different sorts of objects. I hope that in CC version 4 the licence can deal properly internationally with data. Another problem here is that licensing is used for social signalling. Many people do not use licences as a legal instrument; there is a social signalling element that has gotten tied up with legal instruments. I hope we can resolve that in the long term by thinking about transfer across the network and use of research, because it’s in people’s own best interests to see their stuff used. The end game has to be to say: please use this as much as possible, in as many ways as possible, and I’d like to hear about it. That’s what I’d like to see and I think that should solve our problem.

And Kevin is closing the keynote with a big thank you to Cameron.

July 10, 2012, posted at 11:48 am in LiveBlog, Updates
Jul 09 2012

Welcome to the LiveBlog area for Open Repositories 2012!

Throughout the conference we will be recording sessions, tweeting and posting live blog updates from Keynotes, Parallel Sessions, the Repository Fringe strand and our fabulous events. All of these posts will appear on the main OR2012 page but can also be found here in the LiveBlog category as well. We hope to also have some guest bloggers covering some of the workshops and user groups either live or after the conference has finished.

If you are interested in being one of those bloggers please get in touch (nicola.osborne@ed.ac.uk) and we’ll set you up with access to post!

What to Expect

Our glamorous team of bloggers will be posted in various venues around the conference adding live updates to the blog here. Sometimes these will be summaries of what is being said, sometimes near verbatim accounts (it depends on various factors, particularly the speaker’s talking speed!). Our bloggers will also be tweeting and keeping an eye on the conference hashtag #or2012 for key information, updates, and questions from those reading the blog from their own desks away from the conference. Most of the content at this year’s event will be videoed and made available during or shortly after the event. And our blogging team will also be taking pictures and sharing them via Flickr (where you can also share your images of the event) though some of these may take a bit longer to reach the blog.

If you have a question, comment, or need some information then feel free to say hello to our bloggers and they may be able to help you, or at least put you in touch with other organisers who can assist. Please bear in mind our bloggers will be very busy throughout the week so if it takes them a while to approve a comment or reply to a tweet just give them a wee friendly nudge.

As a general rule we use the EDINA Social Media Guidelines to help us in our work – feel free to take a look particularly if you’d like to join us in the blogging.

Meet the Team

So that you can spot them at the event here are the team who will be blogging, tweeting, videoing and generally helping share Open Repositories 2012 with you…

Nicola Osborne is the Social Media Officer for EDINA and is leading our Social Media activity and amplification this year. She has refined her liveblogging skills through years of covering the Repository Fringe events!

Ask about: Twitter, videoing, how to become an OR2012 blogger, etc.

Superpower: Speedy liveblogging and image taking in parallel.

Contact: @suchprettyeyes, nicola.osborne@ed.ac.uk or via @OpenRepos2012

Zack O’Leary is a PhD student at the University of Edinburgh and is assisting EDINA with social media throughout the summer.

Ask about: Twitter, liveblogs, mobile, burritos.

Superpower: Playing the QWERTY keyboard, he can create a symphony of social media.

Contact: @zaleary


Nick Sheppard is Repository Developer at Leeds Metropolitan University and Technical Officer for the UK Council of Research Repositories (UKCoRR). Nick considers himself a Shambrarian and blogs on technical and cultural aspects of repository development, research management and Open Educational Resources (OER) for both his institution and UKCoRR.

Ask about: Twitter, integrated research management (IR/CRIS), OER, Jorum

Superpower: Superfast, highly accurate rock-music assisted metadata creation

Contact: @mrnick, n.e.sheppard@leedsmet.ac.uk

Blogs: http://repositorynews.wordpress.com/ and http://ukcorr.org/activity/blog/

Kirsty Pitkin is an event amplifier, who tears around the country covering a wide range of fascinating events. She works closely with the DevCSI project and will be blogging about the Developer Challenge throughout OR2012. If you’re taking part in the challenge, make sure you tell her about your cool idea!

Ask about:  The DevCSI Developer Challenge
Superpower:  Managing multiple social media channels at once.
Contact: @devcsi, http://devcsi.ukoln.ac.uk, @eventamplifier, http://eventamplifier.wordpress.com


Steph Taylor is a researcher, consultant and trainer based in Manchester. Her interests lie in digital libraries, repositories, research data management and social media (read more on her Crowdvine page). She’ll be updating us on her superpowers shortly.

Natasha Simons is a Senior Project Manager in eResearch Services, Scholarly Information and Research, at Griffith University, Australia. She manages a number of projects focussed on building eResearch infrastructure, and will be updating her own blog here: http://natashajsimons.blogspot.com.au/. She’s currently en route to the UK but will update us on her superpowers shortly.

Join the Team!

We welcome and encourage your input before, during and after Open Repositories 2012. If you would like to be one of our livebloggers there is still time – email nicola.osborne@ed.ac.uk and we’ll get you set up. Otherwise feel free to tweet, post on Crowdvine, share your images on Flickr, comment here on the blog – or just enjoy the conference whether online or here in person!

If you will be liveblogging or writing up Open Repositories 2012 somewhere else on the web just let us know and we’ll link to your write-up from our OR2012 Buzz page.

July 9, 2012, posted at 10:12 am in LiveBlog