Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh.

Topic: Moving from a scientific data collection system to an open data repository
Speaker(s): Michael David Wilson, Tom Griffin, Brian Matthews, Alistair Mills, Sri Nagella, Arif Shaon, Erica Yang

I am here presenting on behalf of myself and my colleagues from the Science and Technology Facilities Council. We run facilities ranging from the CERN Large Hadron Collider to the Rutherford Appleton Laboratory. I will be talking about the ISIS Facility, which is based at Rutherford. Researchers put their scientific sample, say a crystal, into the facility, which may examine it for anything from an hour to a few days. The facility produces 2 to 120 files per experiment in several formats including NeXus and RAW (no, not that one, a Rutherford Appleton format). In 2009 we ran 834 experiments, producing 0.5 million files and 0.5TB of data. But that’s just one facility. We have petabytes of data across our facilities.

We want to maximise the value of STFC data. As Cameron indicated in his talk earlier, it’s about showing the value to the taxpayer.

  1. Researchers want to access their own data
  2. Other researchers validate published results
  3. Meta-studies incorporating data – reuse or new subsets of data can extend its use beyond the original intent
  4. Set experimental parameters and test new computational models/theories
  5. Use for new science not yet considered – we have satellites but the oldest climate data we have is on river depth, collected 6 times a day. It’s 17th century data but it has huge 21st century climate usefulness. Science can involve uses of data that are radically different from those originally envisioned
  6. Defend patents on innovations derived from science – biological data, drug related data etc. is relevant here.
  7. Evidence based policy making – we know they want this data but what the impact of that is may be arguable.

That one at the top of the list (1) is the one we started with when we began collecting data. We started collecting about 1984. The Web came along about 1994-1995 and by 1998 researchers could access their own data on the web – they could find the data set they had produced using an experiment number. It wasn’t useful for others but it was useful for them. And the infrastructure reflected this. It was very simple. We had instrument PCs as the data acquisition system, then a distributed file system and server, then delivery to the user.


Moving to reason (2), we want people to validate the published results. We have the raw data from the experiment. We have calibrated data – that’s the basis for any form of scientific analysis. That data is owned by the facility and preserved by the facility. But the researchers do the data analysis at their own institution. The publisher may eventually share some derived data. We want to hold all of that data: the original data, the calibration data, and the derived data. So when do we publish data? We have less than 1% commercial data so that’s not an issue. But we have data policies (different science, different facilities, different policy), largely built around the PhD period, so we have a 3 year data embargo. It’s generally accepted by most of our users now but a few years ago they were not happy with that. We do keep a record of who accesses data. And we embargo metadata as well as data: if it’s known, say, that a drug company supports a particular research group or university, a competitor may start copying the line of inquiry even on the basis of the metadata… Don’t think this is just about corporates though… In 2004 a research group in California arranged a meeting about a possible new planet; some researchers in Spain looked at the data they’d been using and, reasoning that the research team had found a planet, announced that THEY had found a planet. It’s not just big corporations; academics are really competitive!
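
As an editorial aside, the embargo rule described in this talk (a 3 year embargo, with a record kept of who accesses data) can be sketched in a few lines of Python. This is only an illustration and not STFC’s implementation: the function names, the in-memory access log, the owner flag and the leap day handling are all assumptions made for the example.

```python
from datetime import date

EMBARGO_YEARS = 3  # the talk mentions a 3 year data embargo

# Hypothetical in-memory record of who asked for which dataset.
access_log = []


def release_date(collected: date, embargo_years: int = EMBARGO_YEARS) -> date:
    """Return the first day on which data and metadata leave the embargo."""
    try:
        return collected.replace(year=collected.year + embargo_years)
    except ValueError:
        # Collected on 29 February and the target year is not a leap year:
        # assume release on 1 March (a choice made for this sketch).
        return date(collected.year + embargo_years, 3, 1)


def is_open(collected: date, today: date) -> bool:
    """True once the embargo period has passed."""
    return today >= release_date(collected)


def request_access(user: str, dataset_id: str, collected: date,
                   today: date, is_owner: bool = False) -> bool:
    """Decide on access and log the request either way."""
    allowed = is_owner or is_open(collected, today)
    access_log.append((today.isoformat(), user, dataset_id, allowed))
    return allowed
```

The original researchers can always reach their own data, everyone else waits for the release date, and every request is logged whether or not it succeeds.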


But when we make the data available we make it easy to discover that data and reward it. For any data published we create a Data DOI that enables Google to find the page, and in the UK HEFCE have also said that use of open access research datasets will be allowed in the new REF. And data will also be going into the citation index that is used in the assessment of research centres.


So on our diagram of the infrastructure we now have metadata and Data DOI added.


Onto (3) and (4). In our data we include the schedule and proposal – who, funder, what etc. – that goes with that data. About 5% don’t do what they proposed, so mostly that job is easily done but sometimes it can be problematic. We add publications data and analysis data – we can do this as we are providing the funding, facility and tools they are using. The data can be searched via DataCite. Our in-house TopCat system allows in-house browsing as well. And we’ve added new elements to the infrastructure here.


Looking at (5), (6) and (7): new science, patents, policy. We are trying to build socio-economic impact into the process. We have adopted a commercial product called Tessella Safety Deposit Box, with fixity checks. We have data format migration. And we have our own long term storage as well.
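
For readers unfamiliar with fixity checks, here is a minimal Python sketch of the general idea: recompute each file’s checksum and compare it with the value recorded at ingest. It does not describe how the commercial product mentioned above works; the manifest layout, the choice of SHA-256 and the function names are assumptions for illustration only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_check(manifest: dict[str, str], root: Path) -> dict[str, str]:
    """Compare files under root with their recorded checksums.

    manifest maps a relative file name to the hex digest stored at ingest.
    Returns "ok", "corrupt" or "missing" for each file name.
    """
    report = {}
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file():
            report[name] = "missing"
        elif sha256_of(target) == expected:
            report[name] = "ok"
        else:
            report[name] = "corrupt"
    return report
```

Run on a schedule, a report like this flags silent corruption early enough for a file to be restored from another copy.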


So that infrastructure looks more complex still. But this is working. We are meeting our preservation objectives. We are meeting the timescale of objectives (short, medium, long). Designated communities, additional information, security requirements are met. We can structure a business case using these arguments.



Q1) Being a repository manager I was interested to hear that over the last few years 80% of researchers had gone from unhappy at sharing data to most now being happy. What made the difference?

A1) The driver was the funding implications of data citations. The barrier was distrust in others using or misinterpreting their data but our data policies helped to ameliorate that.

Topic: Postgraduate Research Data: a New Type of Challenge for Repositories?
Speaker(s): Jill Evans, Gareth Cole, Hannah Lloyd-Jones

I am going to be talking about the Open Exeter project. This was funded under the Managing Research Data programme and started as a pilot biosciences research project, but we are expanding this to other departments. We ran a survey of Postgraduates by Research (PGRs) and researchers. We have created several different Research Data Management plans, some specifically targeted at PGRs. We have taken a very open approach to what might be data, and that is informed by that survey.

We currently have three repositories – ERIC, EDA, DCO – but we plan to merge these so that research is in the same place, from data to publications. We will be doing this with DSpace 1.8.2 and the Oracle 11g database system. We are using SWORD2 and testing various types of upload at the moment.

The current situation is that thesis deposit is mandatory for PGRs but deposit of data is not. There is no clear guidance or strategy for this, nor a central data store. Nor is there a clear strategy for deposit of large files, and deposits of this kind are growing. But why archive PGR data? Well, enhanced discoverability is important, especially for early career researchers, and a raised research profile/portfolio is also good for the institution. There is also an ability to validate findings if queried – good for institution and individual. And this allows funder compliance – expected by a number of funders including the Wellcome Trust. And the availability of data on open access allows fuller exploitation of data and enables future funding opportunities.

Currently there is very varied practice. One issue is the problem of data loss – this has an impact on their own work but increasingly PGRs are part of research groups so lacking access can be hugely problematic. Lack of visibility limits the potential for reuse and means a lack of recognition. And inaccessibility can mean duplication of effort and can block research that might build on their work.

The solution will be to support deposit of big data alongside the thesis. It will be a simple deposit. And a long term curation process will take place that is file agnostic and provides persistent IDs. Awareness raising and training will take place and we hope to embed cultural change in the research community. This will be supported by policy and guidance as well as a holistic support network.

The policy is currently in draft and mandates deposit if required by the funder; it encourages deposit in other cases. We hope the policy will be ratified by 2013. There are various issues that need to be addressed though:

  • When should data be deposited
  • Who checks data integrity
  • IP/Confidentiality issues
  • Who pays for the time taken to clean and package the data? This may not be covered by funders and may delay their studies but one solution may be ongoing assessment of data throughout the PGR process.
  • Service costs and sustainability.




Q1, Anthony from Monash) How would you motivate researchers to assess and cleanse data regularly?

A1) That will be about training. I don’t think we’ll be able to check individual cases though.

Q2, Anna Shadboldt, University of NZ) Given what we’re doing across the work with data mandates is there a reason

A2) We wanted to follow where the funders are starting to mandate deposit but all students funded by the university will also have to deposit data so that will have wider reach. In terms of self-funded students we didn’t think that was achievable.

Q3, Rob Stevenson, Los Alamos Labs) Any plans about different versions of data?

A3) Not yet resolved but at the moment we use handles. But we are looking into DOIs. The DOI system is working with the Handle system so that Handle will be able to deal with DOI. But versioning is really important to a lot of our potential depositors.

Q4, Simon Hodson from JISC) You described this as applying to PG students generally. Have you worked on a wider policy for wider research communities? Have there been any differences in how supervisors or research groups approach this?

A4) We have a mandate for researchers across the university. We developed a PGR policy separately as they face different issues. In general supervisors are very pro preserving student data for use and reuse, as this problem has arisen within research projects before. We have seen that PGRs are generally pro this; for researchers it tends to vary greatly by discipline.

More information: http://ex.ac.uk/bQ, project team: http://ex.ac.uk/dp and draft policies are at http://ex.ac.uk/dq and http://ex.ac.uk/dr

Topic: Big Data Challenges in Repository Development
Speaker(s): Leslie Johnston, Library of Congress

A lot of people have asked why we are at this sort of event, we don’t have a repository, we don’t have researchers, we don’t fund research. Well we actually do have a repository of a sort. We are meant to store and preserve the cultural output of the entire USA. We like to talk about our collections as big data. We have to develop new types of data that are very different to our old service model. We have learned that we have no way of knowing how our collections will be used. We talked about “collections” or “content” or “items” or “files”. But recently we have started to talk about and think about our materials as data. We have Big Data in libraries, archives and museums.

We first looked into this via the Digging into Data Challenge through the National Endowment for the Humanities. This was one of the first introductions to our community, the libraries, archives and museums community, that researchers are interested in data – including bulk corpora – in their research.

So, what constitutes Big Data? Well, the definition is very fluid and a moving target. We have a huge amount of data – 10-20TB per week per collection. We still have collections but what we also have is big data, which requires us to rethink the infrastructure that is needed to support Big Data services. We are used to mediating the researcher’s experience, so the idea that they will use data without us knowing is perhaps radically different.

My first case study is our web archives. We try to collect what is on the web but it’s about heavily curated content around big events, around specific topics etc. When we started this in 2000 we thought researchers would be browsing to see how websites used to look. That’s not the case. People want to data mine the whole collection and look for trends, say for elections for instance. This is 360TB right now, billions of files. How do we curate and catalogue these? And how do we make them accessible? We also have an issue that we cannot archive without permission, so we have had to get permission for all of these and in some cases the pages are only available on a terminal in the library.

Our next case study is our historic newspapers collections. We have worked with 25 states to bring in 5 million page images from historic newspapers, all available with OCR. This content is well understood in terms of ingest. It’s four image files and an OCR file and a METS file and a MODS file. But we’ve also made the data available via an API. You can download all of those files and images if you want.

Case Study – Twitter. The Twitter archive has tens of billions (21 billion) files in it. We are still somewhat under press archive. We received the 2006-2010 archive this year. We are just now working with it. We have had over 300 research requests already in the two years since this was announced. This is a huge scale of research requests. This collection grows by tens of millions of items per hour. This is a tech and infrastructure challenge but also a social and training challenge. And under the terms of the gift researchers will have to come into the library; we cannot put this on the open web.

Case study – Viewshare. A lot of this is based on the SIMILE toolkit from MIT. This is a web tool to upload and share visualisations of metadata. It’s on SourceForge – all open access. Or see the site itself: http://viewshare.org/. Any data shared is available as a visualisation but also, if the depositor allows, as the raw data. What does that mean for us?

We are working with lots of other projects, which could be use cases. Electronic journal articles for instance – 100GB with 1 million files. How about born-digital broadcast television? We have a lot of things to grapple with.

Can each of our organisations support real-time querying of billions of full text items? Should we provide the tools?

We thought we understood ingest at scale until we did it. Like many universities, access is one thing, actual delivery is another. And then there are fixity checks and checksums, validating against specifications. We killed a number of services attempting to do this. We are now trying three separate possibilities: our current kit, better kit, and Amazon cloud services. This is about ingest AND indexing. Indexing is crucial to making things available. How much processing should we do on this stuff? We are certainly not about to catalogue tweets! But expectations of researchers and librarians are about catalogues. This is a full text collection, and it will never be catalogued. It may be one record for the whole collection. We will do some chunking by time, keeping the tweets in their native JSON. I can’t promise when or how this stuff will be happening.
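
As an editorial illustration of "chunking by time" while keeping records in their native JSON, here is a small Python sketch that groups JSON lines into hourly buckets without rewriting them. It is not the Library of Congress approach, which the speaker did not detail. The field name created_at_epoch is hypothetical (real tweet JSON uses different fields), and the hourly bucket size is also an assumption.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def chunk_key(record: dict) -> str:
    """Return an hourly bucket label such as '2010-03-01T14' (UTC).

    "created_at_epoch" is a hypothetical field holding seconds since 1970.
    """
    stamp = datetime.fromtimestamp(record["created_at_epoch"], tz=timezone.utc)
    return stamp.strftime("%Y-%m-%dT%H")


def chunk_by_hour(lines):
    """Group JSON lines by hour, storing each original line unchanged."""
    chunks = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)  # parsed only to read the timestamp
        chunks[chunk_key(record)].append(line)
    return dict(chunks)
```

Each bucket could then be written out as one file, giving a collection that is navigable by time even though no individual item is catalogued.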

With other collections we are doing more. But what happens if one file is corrupted? Does that take away from the whole collection? We have tried several tools for analysis – BigInsights and Greenplum. Neither is right yet though. We will be making files discoverable but we can’t handle the download traffic… we share the same core web infrastructure as loc.gov and congress.gov etc. Can our staff handle these new duties or do we leave researchers to fend for themselves? We are mainly thinking about unmediated access for data of this type. We have custodial issues here. Who owns Twitter? It crosses all linguistic and cultural boundaries.


Q1) What is the issue with visiting these collections in person?

A1) With the web archives you can come in and use them. Some agreements allow take away of that data, some can only be used on-site. Some machines with analytics can be used. We don’t control access to research based on collections however.

Q2) You mentioned the Twitter collection. And you are talking about self-service collections. And people say stupid stuff there

A2) We only get tweets, we get the username, we know user relations but we don’t get profile information or their graph. We don’t get most of the personal information. I’ve been asked if we will remove bad language – no. Twitter for us is like diaries, letters, news reporting, citizen journalism etc. We don’t want to filter this. There was a court case decided last week in New York that said that Twitter could be subpoenaed to give over a user’s tweets – we are looking at the implications for us. But as we have the 2006-10 archive this is less likely to be of interest. And we have a six month embargo on all tweets, and any deleted tweets or deleted accounts won’t be made available. That’s an issue for us actually; this will be a permanently redacted archive in some ways.

Topic: Towards a Scalable Long-term Preservation Repository for Scientific Research Datasets
Speaker(s): Arif Shaon, Simon Lambert, Erica Yang, Catherine Jones, Brian Matthews, Tom Griffin

This is very much a follow up to Michael’s talk earlier as I am also at the Science and Technology Facilities Council. The pitch here is that we’re interested in the long-term preservation of scientific data. There is lots going on here and it’s a complex area, thanks to the complex dependencies of digital objects, which also need preserving to enable reusability, and the large volumes of digital objects that need scalable preservation solutions. And scientific data adds further complexity – unique requirements to preserve the original context (e.g. processed data, final publications, etc.). It may also involve preservation of software and other tools etc.

As Michael said we provide large scale scientific facilities to UK Science. And those experiments running on STFC facilities generate large volumes of data that needs effective and sustainable preservation with contextual data. There is significant investment here – billions of €’s involved – and we have a huge community of usage here as well. We have 30K+ user visitors each year in Europe.

We have a fairly well established STFC scientific workflow. Being central facilities we have lots of control here. And you’ve seen our infrastructure for this. But what are the aims of the long term preservation programme? Well we want to keep data safe – the bits that are retrievable and the same as the original. We want to keep data usable – that which can be understood and reused at a later date. And we have three emerging themes in our work:

  • Data Preservation Policy – what is the value in keeping data
  • Data preservation Analysis – what are the issues and costs involved
  • Data Preservation Infrastructure – what tools do we use

But there are some key data preservation challenges:

  • Data Volume – for instance a single run of an ISIS experiment could produce files of 1.2GB in size. An experiment typically has 100s of runs – files of 100+GB in total size. ISIS is a good test bed as these sizes are relatively small.
  • Data Complexity – scientific HDF data format (NeXus), structural and semantic diversity in files
  • Data Compatibility – 20 years of data archives here.

We are trialing a system that is proprietary and commercial and manages integrity and format verification; it was designed within a library and archive context and turns a data storage service into a data archive service. But there are some issues. There is limited scalability – it is not happy with files over several GBs. There is no support for syntactic and semantic validation of data. There is no support for linking data to its context (e.g. process description, publications). There is no support for effective preservation planning (tools like Plato).


We are doing this in the context of a project called SCAPE – Scalable Preservation Environments – an EC FP7 project with 16 partners (Feb 2011-Jan 2015) and it’s a follow on from the PLANETS project. We are looking at facilitating compute-intensive preservation processes that involve large (multi-TB) data sets. We are developing cloud-based preservation solutions using Apache Hadoop. For us the key product from the project will be a scalable platform for performing preservation operations (with potential format conversion), to enable automatic preservation processes. So our new infrastructure will add further context into our preservation service, and a watch service will also alert us to necessary preservation actions over time. We will be storing workflows, policies and what we call PNMs for particular datasets. The tricky areas for us are the cloud based execution platform and the preservation platform.


The cloud-based workflow execution platform will be built with Apache Hadoop and workflows may range from ingest operations etc. We are considering using Taverna for workflows. PNM stands for Preservation Network Models, a technique developed by the CASPAR project to formally represent the outputs of preservation planning. These models should help us control policies, workflows, and what happens with preservation watch.

Finally, this is the sort of workflow we are looking at to control this: the process we might run for a particular file. Ingest via a JHOVE type check. Then we check the semantic integrity of the file. Then we do our AIP (Archival Information Package) construction etc.
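
To make the shape of that workflow concrete, here is an editorial Python sketch of a three step pipeline: format identification, a semantic check, then package construction. It is not the SCAPE or STFC design. The format step merely stands in for a JHOVE style tool by testing for the HDF5 file signature (assuming HDF5 based NeXus files), and the required experiment_number field, the class and the function names are all hypothetical.

```python
from dataclasses import dataclass, field

# The 8 byte signature that begins an HDF5 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


@dataclass
class Candidate:
    """A file on its way into the archive (hypothetical structure)."""
    name: str
    payload: bytes
    context: dict = field(default_factory=dict)
    log: list = field(default_factory=list)


def identify_format(item: Candidate) -> Candidate:
    """Stand-in for a format validation tool: check the HDF5 signature only."""
    if not item.payload.startswith(HDF5_SIGNATURE):
        raise ValueError(f"{item.name}: not an HDF5 file")
    item.log.append("format identified: HDF5")
    return item


def check_semantics(item: Candidate) -> Candidate:
    """Hypothetical rule: the file must be linked to an experiment number."""
    if "experiment_number" not in item.context:
        raise ValueError(f"{item.name}: missing experiment number")
    item.log.append("semantic check passed")
    return item


def build_aip(item: Candidate) -> dict:
    """Bundle the file description, its context and the provenance log."""
    item.log.append("AIP built")
    return {
        "name": item.name,
        "size": len(item.payload),
        "context": dict(item.context),
        "provenance": list(item.log),
    }


def preserve(item: Candidate) -> dict:
    """Run the checks in order; any failure stops the file being packaged."""
    for step in (identify_format, check_semantics):
        item = step(item)
    return build_aip(item)
```

Keeping each step as a separate function mirrors how a workflow engine would chain them, and the provenance log records which checks a package actually passed.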

So at the moment we are in the design stage of this work but there are further refinements and assessment to come. And we have potential issues to overcome – including how Taverna might work with the system.

But we know that a scalable preservation infrastructure is needed for STFC’s large volumes of scientific data.


Q1) We run the Australian Synchrotron so this was quite interesting for me. When you run the data will that data automatically be preserved? Ours is shipped to a data centre and can then be accessed as wanted.

A1) For ISIS the data volumes are relatively low so we would probably routinely store and preserve data. For a synchrotron the data volumes are much larger so that’s rather different. Although the existing work on crystallography may help us with identifying what can or cannot be preserved.

Q2) Where do you store your data? In Hadoop or somewhere else? Do you see Hadoop as a feasible long term data solution?

A2) I think we will be mainly storing in our own data systems. We see it as a tool to compute really.

Q3) What software do you use in the data centre to store that much data?

A3) We have a variety of solutions. Our own home grown system is in use. We use CASTOR, the CERN system. We have a number of different ones as new ones emerge. Backup really depends on your data customer. If they are prepared to pay for extra copies you can do that. That’s a risk analysis. CERN has a couple of copies around the world. Others may be prepared to take the risk of data loss rather than pay for storage.

Topic: DTC Archive: using data repositories to fight against diffuse pollution
Speaker(s): Mark Hedges, Richard Gartner, Mike Haft, Hardy Schwamm

The Demonstration Test Catchment Project is funded by Defra and runs from Jan 2011 to Dec 2014. It’s a collaboration between the Freshwater Biological Association and KCL (Centre for eResearch) and builds upon previous JISC-funded research. To understand the project you need to understand the background to the data.

Diffuse Pollution is the release of a polluting agent that may not have an immediate effect but may have a long term cumulative impact. Examples of diffuse pollution include run off from roads, discharges of fertilisers on farms etc. What is a Catchment? Well, typically this is the catchment area of a particular body of water draining into a particular point. And the final aspect is the Water Framework Directive. This is a legal instruction for EU member states that must be implemented through national legislation within a prescribed time-scale. This framework impacts on water quality and so this stretches beyond academia and eResearch.

The project is investigating how the impact of diffuse pollution can be reduced through on-farm mitigation methods (changes to reduce pollution) and those have to be cost effective and maintain food production capacity. There are 3 catchment areas in England for tests to demonstrate three different environment types.

So how does the project work? Well roughly speaking we monitor various environmental markers; we try out mitigation measures, and then analyze changes in baseline readings. And it’s our job to curate that data and make it available and usable by various different stakeholders. So these measurements come in various forms – bankside water quality monitoring systems etc.

So the DTC archive project is being developed. We need that data to be useful to researchers, land managers, farmers, etc. So we have to create the data archive, but also the querying, browsing, visualizing, analysing and other interactions. There need to be integrated views across diverse data that suit their needs. Most of the data is numerical – spreadsheets, databases, CSV files. Some of this is sensor data (automated, telemetry) and some is manual samples or analysis. The sensor data are more regular; there is more risk of inconsistencies in manual data. There is also species/ecological data. Also geo-data. Also less highly structured information such as time series images, video, stakeholder surveys, unstructured documents etc.

Typically you need data from various objects etc. So when checking levels of potassium you need data from various points in the sensor data as well as contextual data from adjacent farms. So looking at the data we see spreadsheets of sensor data, weather data, and land usage data (as a map of usage for instance) that might all be needed.

There are some challenges around this data. The datasets are diverse in terms of structure and there are different degrees of structuring – both highly structured and highly unstructured data are combined here. And another challenge for us is INSPIRE, with its intent of creating a European Spatial Data Infrastructure for improved sharing of spatial information and improved environmental policy. It includes various standards for geospatial data (e.g. Gemini2 and GML – Geography Markup Language) and it builds on various ISO standards (ISO 19100 series).

The generic data model is based around ISO 19156, concerned with observations and measurements. The model facilitates the sharing of observations across communities and includes metadata/contextual information and the people responsible for measurement. And this allows multiple data representations. The generic data model is implemented in several ways for different purposes: an archival representation (based on library/archival standards), a data representation for data integration (“atomic” representation as triples), and various derived forms.

In the Islandora repository we create data files alongside metadata files: METS, MADS and MODS. That relationship to library standards is a reflection of the fact that this archive sits within a bigger, more bibliographic type of archive. The crucial thing here is ensuring consistency across data components for conceptual entities etc. So to do this we are using MADS, the Metadata Authority Description Schema, which helps explain the structure and format of the files and links to vocabulary terms and table search. The approach we are taking is to break data out to an RDF based model. This approach has been chosen because of the simplicity and flexibility of that data model.

Most of this work is in the future really but it is based on that earlier JISC work – breaking data out of tables and assembling it in triples. This is clear from an example data set – where we see collection method, actor, dataset, tarn, site, location, and multiple observation sets each with observations, all as a network of elements. So to do this we need common vocabularies – we need columns, concepts and entities mapped to formal vocabularies. Mappings are defined as archive objects. We have automated, computer-assisted and manual approaches here. The latter require domain experience and mark up of text.
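
As an editorial illustration of "breaking data out of tables and assembling it in triples", here is a small Python sketch that turns CSV rows into subject, predicate, object tuples using a column to vocabulary mapping. It is not the DTC Archive code. The example.org namespace, the column names and the vocabulary terms are all hypothetical, and a real system would use agreed formal vocabularies and an RDF library rather than plain tuples.

```python
import csv
import io

BASE = "http://example.org/dtc/"  # hypothetical namespace for this sketch

# Hypothetical mapping from spreadsheet columns to vocabulary terms.
# Columns without a mapping (for example free text notes) are left out.
COLUMN_TO_PREDICATE = {
    "site": BASE + "vocab/site",
    "potassium_mg_l": BASE + "vocab/potassiumConcentration",
    "method": BASE + "vocab/collectionMethod",
}


def rows_to_triples(csv_text: str, id_column: str = "observation_id"):
    """Break each table row into (subject, predicate, object) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = BASE + "observation/" + row[id_column]
        for column, value in row.items():
            predicate = COLUMN_TO_PREDICATE.get(column)
            if predicate and value:  # skip unmapped columns and empty cells
                triples.append((subject, predicate, value))
    return triples
```

Because every cell becomes its own statement, sensor readings, manual samples and contextual data can be recombined later without agreeing on one table layout, which is the flexibility the speaker describes.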

Architecturally we have diverse data as archival data in Islandora. Then it is mapped and broken into RDF triples, and then mapped again out to browsing, visualisation, search and analysis for particular types of access or visualisation. That break up may be a bit perverse. We think of it as breaking into atoms and recombining it again.

The initial aim is to meet needs of specific sets of stakeholders, we haven’t thought about the wider world but this data and research may be of interest to other types of researchers and broader publics in the future.

At the moment we are in the early stages. Datasets are already being generated in large quantities. There is some prototype functionality. We are looking next at ingest and modeling of data. Find out more here: http://dtcarchive.org/


Q1) This sounds very complex and specific. How much of this work is reusable by other disciplines?

A1) If it works then I think the general method could be applicable to other disciplines. The specifics are very much for this use case but the methodology would be transferable.

Q2) Can you track use of this data?

A2) We think so, we can explain more about this.

Q3) It strikes me that these sorts of complex collections of greatly varying data is a common type of data in many disciplines so I would imagine the approach is very reusable. But the Linked Data approach is more time consuming and expensive so could you explain cost benefit of this?

A3) We are being funded to deliver this for a specific community. Moving to the end of the project converting the software to another area would be costly – developing vocabularies say. It’s not just about taking and reusing this work, that’s difficult, it’s about the general structure.

And with that this session is drawing to a close with a thank you from our chair Elin Strangeland.

Today we are liveblogging from the OR2012 conference at George Square Lecture Theatre (GSLT), George Square, part of the University of Edinburgh.

Kevin is introducing the Minute Madness by reminding us that all posters will be being shown at our drinks reception this evening so these very short introductions will be to entice you to visit their stand. Les Carr is chairing the madness and will buy drinks for any presentation under 45 seconds as an incentive for speed!

Our first speaker in the room is poster #105 DataONE (Observation Network for Earth) – we just heard the reasoning for why we need this, there are thousands of repositories that need to be linked together. DataONE does this, integrating data and tools for earth observation data. Tools researchers use already, like Excel, like SAS etc.

#100 is a mystery!

#109 on Metadata Analyser Portal – checks metadata quality, checks for depositor, for repository manager, and we want to build a ranking based on quality of metadata. Come to my poster and discuss this with me!

#112 on Open Access Publishing in the Social Science – one of the leading repositories in Germany. I want to talk about the role this kind of repository can take, and how we can ensure quality of publications.

#114 Open Access Directory – it’s hard to check the open access status of data. Come chat to us at our poster, and more importantly look at our website oad.simmons.edu.

#121 Design and development of LISIR for Scholarly Publications of Karnataka State – looking at how universities in Edinburgh have been using this technology to deposit in DSpace

#136 Can LinkedIn and Academia.edu enhance access to Open Repositories – how do we get our research out? It’s all about links and connectedness, the commercial publishers encourage this, why don’t you? Come tell me!

#149 Sharing experiences and expertise in the professional development of promoting OA and IRs between repository communities in Japan and the UK

#? Another mystery

#160 Making Data repositories visible – building a register of research data repositories. We want to encourage sharing and reuse of research data. We have research work planned on this, come talk to me about it!

#161 another mystery

#207 Metadata Database for upper atmosphere using DSpace – a metadata repository for geospatial data. We have solved the issue of cross searching for this metadata repository – come find out more!

#209 Revealing presence of amateurs at an institutional repository by analysing queries at search engines – I think it is difficult to segment repository users into different groupings but it’s important, as they have different needs. Come see me to find out how we have overcome this.

#223 Integrating Fedora into the Semantic Web using Apache Stanbol – we are trying to graph the web; come along to find out more about using the semantic web without losing durability of data.

#224 Using CKAN – storing data for re-use – as used in data.gov.uk. The public hub lets you share data, your code, your files – you get an API for your data and stats. You can use ours or download and run your own.

#251 Developing Value-added services facilitating the outreach of institutional repositories at Chinese Academy of Sciences – maybe you don’t get good opportunities to visit China but we will share our experience – come see our poster



#263 The RSP Embedding Guide – there was once a sad dusty library and no one spoke to it. Sometimes people would throw it an article and it would be happy… but then quickly sad again. Then one day the repository manager found the RSP embedding guide and you could find out all about the happy ending at our poster!

#268 DuraCloud poster proposal – digital preservation is important but not all institutions are able to deliver this. We have built DuraCloud, a web based solution. Our poster will debunk the myths of the cloud – DuraCloud and other cloud services – for checking data integrity.

#271 SafeArchive – automated policy based auditing and provisions of replicated content – there are many good tools in this space such as DuraCloud, and local systems such as LOCKSS. What is difficult to do with these tools is to show a relationship between replication services and policy. SafeArchive does that.

#274 Current and future effects of social media based metrics on open access and IRs – my open data archive provides an open access repository and it is a social media based OA repository. One of the smallest repositories, but well known on social media. I want to discuss any metrics, so come see my poster!

#275 Adapting a spoken language data model for a Fedora repository – this data type is hard to process and expensive to produce so we need repositories and data models that work with this. Annotations of video and audio, metadata specific to this etc. will all be at my poster!

#276 All about Hot Topics, the DuraSpace community webinar series – this is a web seminar series addressing issues bubbling up from the community. Talk to me about the series and perhaps how you can get involved.

#277 A handshake system for Japanese Academic Societies and Institutional repositories – we work as something like JISC or JANET and we recently started a repository hosting service called JAIRO Cloud. We have tried to make a handshake for academic society repositories – I’ll explain how at my poster!

#278 create attract deposit – We at the New Bulgarian University have a poster on how we have increased deposit into our institutional repositories. We use web 2.0 to increase our deposits from 0.7% to 2% in just a year! Come and find out what we did and how we promote these materials.

#279 Engage – using data about research clusters to enhance collaboration – funded in part under the JISC business dev strand. Come see us to find out more and tell us your experiences.

#281 CSIC Bridge – linking digital.CSIC to Institutional CRIS – we have used homegrown software and other external tools to automate ingestion. I’ll talk about pros and cons and also integration with the DSpace IR and how we are using CRIS rather than the DSpace deposit tool.


#283 JAIRO Cloud – national infrastructure for institutional repositories in Japan – I am a tech person without much money. In Japan there are 800 universities and 600 are a bit like me in that regard, so the National Institute of Informatics has begun to share cloud repositories; 17 are already open to the public. Come find out more.

#284 The CORE Family – Connected Repositories is the project. Like William Wallace we are fighting for freedom in terms of open access. We are providing access to millions of resources. But hopefully we won’t end up in the same way: hung, drawn and quartered!

#285 Enhancing DSpace to Synchronise with sources having distinct updating patterns – I am presenting Lume, a repository aggregating work from several different data sources, and how we are enabling provision of embedded videos.

#286 Cellar – the project for common access to EU information – 43 million files in 23 languages, delivered in multiple formats including JSON and SPARQL.

#288 Moving DSpace to a fully featured CRIS System – come see how we have been doing this, adaptations made etc.

#291 Makerere University’s dynamic experience of setting up, content collection and use of an institutional repository running on DSpace – we have been doing this for 5 years, come find out about our taking this to the next level.


#294 History DMP: managing data for historical research – we have very active history researchers and got funding to work with those historians to gather and curate data through data management plans created with historians, we have 3 case studies, and we enhanced our repository for these results.

#295 NSF DMP content analysis: what are researchers saying about repositories – find out what crazy things researchers have been saying

#296 Making DSpace Content Accessible via Drupal – we recently moved to Drupal and as departments migrated we wanted to deposit publications etc. into the repository and they were fine with that but wanted it to look just like the website. So come find out how we did this via the DSpace REST interface.

#297 Databib – an online bibliography of research data repositories – perfect for researchers, libraries, repository managers etc. Please stop by the poster or site to make sure your repository is represented. All our metadata is available via CC0

#298 Making it work – operationalizing digital repositories: moving from strategic direction to integrated core library service – we started out like a garage band with just our moms and boyfriends hanging out. But like the better garage bands we’ve gotten better and high level researchers now want to jam with us. Come find out how we moved from garage band to centre stage!

#299 Publishing system as the new tool for the open access promotion at the university – we migrated over to an open journals system, come find out more about this.

#300 The CARPET project – an information platform on ePublishing technology for users, developers and providers – matchmaking these groups and technologies. Please come to our poster and ask me how we can help you.

#301 Proactive personalized self archiving – we have written an application for outside repositories that allows users to submit metadata and data into repositories.

#302 DataFlow project – DataStage – personalised file management – this is a love story of DataBank and DataStage. They were made for each other but didn’t know it! We are an open access project so this is an open relationship between these two components.

#303 Databank – a restful web interface for repositories – come see us!

#304 Repositories at Warwick – how we refreshed our marketing for the repositories and how we used the “highlight your research” strapline. We launched the service late last year and in first 10 months we saw a nearly 50% increase in deposit. Come find out about our process and end project

#306 University of Prince Edward Island’s VRE Service – this poster is a chronological narrative/fairytale tracing the repository process at PEI and Islandora itself. If you are a small institution trying to make your repository work come speak to me!

#307 CRUD (the good kind) at Northwestern University Library – a Drupal based system, a Hydra based deposit system and a Fedora Repository. It’s fun stuff, come talk

#309 Client side interfaces for content re-use framework based on OAI-PMH – an extension of OAI-PMH with images via JSON. It should be a brand new framework, a very beautiful framework. Come see me.

#311 Agricultural repositories in India – darn, our presenter isn’t here

#312 If you love them, set them free: developing digital archives collections at Salford – we have been working with our local community to share and make collections available. We’ve worked hard to make our stuff more discoverable and easier to enjoy

#315 At the centre – a story first (with props!). One of the first journeys to St Andrews was by a monk moving the bones of St Andrew for safekeeping. Today researchers are still inspired to come to St Andrews… our poster explains how research@StAndrews has led to all sorts of adventures and encounters.


#319 Introducing Islandora Stack and Architecture – Islandora is open access repository software that connects to Drupal (used by everyone from NASA to Playboy). Come find out more about Islandora, about recent updates, or about Prince Edward Island where we will be hosting OR2013.

#320 Implementing preservation services in an Islandora Framework – various approaches will be discussed notably Duracloud

#324 Use of shared central virtual open access agriculture and aquaculture repository (VOA3R) – an open source portal that harvests OA scientific literature from different institutional repositories and embeds a social network. This project is funded by EU Framework 7 and the technology is reusable and open source.

#325 Integrating an institutional CRIS with an OA IR – find out how we are using text harvesting with Symplectic elements to create a repository full of high quality open metadata

#326 SAS OJS: overlaying an Open Journals service onto an Institutional Repository – SAS OJS, better known as “sausages” – find out about our pilot with legal researchers at the University of London.

#328 Putting the Repository First: Implementing a Repository to RIS EWorkflow for QUT ePrints – we’ve made the repository the only deposit for metadata for research publications

#329 Implementing an enhanced repository statistics system for QUT ePrints – so important but our researchers wanted to collate statistics at author, research group, school, faculty and home repository level (as well as article level) – my poster talks about how we implemented this and how it has gone down.

#287 OpenAIRE: supporting open science in Europe – a pitch with a poem that I can’t do justice to here. But we talk about supporting open science in Europe, training… add Continental Chic to your OR2012!

#197 OpenAIRE Data Infrastructure Services: On Interlinking European Institutional Repositories, Dataset Archives and CRIS systems – how we did the technical work here and how it can be reused by you!


Today we are liveblogging from the OR2012 conference at George Square Lecture Theatre (GSLT), George Square, part of the University of Edinburgh.

John Howard, Chair of the Open Repositories Steering Committee is introducing our opening keynote. I know there are lots of people who have been to Open Repositories previously. I think it’s fair to say that OR is a conference for people with the hearts and minds to make open repositories work: it’s for developers, suppliers, for everyone involved in the expanding ecosystem of repositories. It’s a learning opportunity, a networking opportunity, a way to step out of our day to day roles and make some new connections and gather new ideas.

This is our 7th OR conference. You will hear more about OR2013 in the closing plenary.

OR was started by people very much like yourselves who are passionate about repositories and wanted to share ideas and experience on an international scale. I wanted to thank the current OR steering group. The Steering Committee steer the direction of the year and select the programme chair. This year Kevin Ashley is the programme chair and he has been engaging in fortnightly calls with ourselves and the co-chairs of the local Host Organising Group: Stuart Macdonald and William Nixon. Thank you also to our User Group chairs: John Dunn, Robin Rice and William Nixon.

If you want to stay in touch with us throughout the year I would ask you to join our Google Group and follow our new Twitter account: @ORConference. And if you have ideas about a logo for OR please do let us know.

Now to Kevin Ashley, Programme Chair of OR2012.

Welcome on behalf of my own organisation the DCC, to EDINA and Edinburgh University! This year we have more attendees, more sponsorship and more sunshine… one of those may be untrue!

This year we have tried to bring the spirit of the fringe to Open Repositories as we have, for many years, run the annual Repository Fringe event. So please join in the fringier aspects of the programme! And connected to that I want to remind you that ideas for the Developer Challenge must be in by 4pm today so submit them soon!

Cameron has been an advocate and activist for open research for years. Cameron has just taken up a new role as Director of Openness at Public Library of Science (PLoS).

Cameron Neylon

I am going to talk about what we have learned about open repositories from a high overview.

Please do film, tweet, share and use anything I share today in whatever ways are useful to you.

So, what is the challenge that we face in making research effective for the people who fund it? For researchers we have this incredible level of frustration about being unable to deal with demands of funders, of stakeholders, of colleagues to make the most of what we do. We often feel like we are shouting at each other, more broadcasting than understanding of issues.

There is a sense that something is missing. We have all these tools to do something new, but in terms of delivering what we can create and convert from the money we receive, the resources we have access to… there is something missing in this delivery pipeline that’s not getting us to where we should be.

So I’m going to structure this talk around a sort of 3-2-1 pattern. 3 things to change, wrapped in 2 conceptual changes, around one central principle which, for me, is the useful way to bring these issues and thoughts together.

Let’s start from my background. I’m now working for PLoS and I have been involved in open access advocacy for 7 years but I’ve also been interested in open things, open data, open science etc. for years.

I could make a public good argument for open research but that’s not really the environment we are in today. We need more hardnosed and pragmatic reasons to approach these problems. I shall take the tie-wearing approach. A business case. What do we have to deliver as a business, as a service provider, for the people who fund our salaries, our work?

I want to talk about quality of service, value for money, sustainability.

And if I’m shaping a business case then who are we serving? Who is the customer? Who are we marketing to?

You might think it’s policy makers… but they just funnel money through to us. Yes it’s important to make arguments to government but it’s much MORE important to make the case to taxpayers – that wider public that includes us. There is something sophisticated about research and the amount of time it takes to deliver. There is an appreciation of research and the time it takes to do. We saw that in the case of the Higgs Boson announcement last week (albeit in Comic Sans). So the customer is the global public. And they want outcomes. It’s not research outputs but how we effectively translate that to meaningful outcomes.

So why are we having this conversation, why is it happening and why is it happening now? Well we are going through the biggest change in technology since the printing press, perhaps since writing. Our ability to communicate has changed SO radically that we are in a totally different world than 20 years ago. Networks qualitatively change what we can do and achieve.

Most of you can remember a time without mobile phones. 20 years ago if I’d shown up and wanted to meet for a drink it would have been difficult or impossible. Email wasn’t useful back then either as so few people had it. When you start with nodes and start joining up the network… for a long time little changes. You just let people communicate in the same way you did before… right up until everyone has access to a mobile phone. Or everyone has email. You move from a network that is better connected to a network that can be traversed in new ways. For chemists this is a cooperative phase transition, where the network crystallises out from a solution.

But that’s a really big concept. So let’s look at Tim Gowers, a mathematician and a blogger. He wondered if there was a new way to do academic math. He posted a problem on the web, a hard one. He said he didn’t intend to solve the problem but he wanted to involve as many people as possible in commenting on his approach. He expected the problem to take 6 to 9 months. And 6 weeks later he felt his problem was solved, along with a much larger problem being approached in a new way. And it wasn’t as he expected. What happened was that a large group of mathematicians discussing the issue on a WordPress blog were able to think through approaches and solve together a problem one of the world’s greatest mathematicians had not been able to solve alone. It allowed things to be done that were not possible before. A qualitative change in research capacity mediated by a pretty ropey system in which conversations can take place.

I want to talk now about GalaxyZoo. Astronomy is very much driven by the idea of testing hypotheses and that means looking through huge amounts of data. That’s a problem because you can only do about 100 sets of data a week. But you need about 10,000 classifications of galaxies to reach a level where you can publish a paper. Even a PhD student can only do 50,000 in the course of his or her studies. And there is a further problem. Lots of people look at the same data as well so this is hugely inefficient. But that data is from the Sloan Digital Sky Survey – an open data source of sky data. And there were a million sets of data. Computers don’t classify this stuff effectively. So what did they do? Well they took that data, they put it on the website, and they created a simple mechanism. Those million galaxies were checked 5 times over by 300,000 people in 6 months. That’s qualitatively different.
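To put those GalaxyZoo figures side by side, here is a rough back-of-envelope sketch. The numbers are the ones quoted in the talk as liveblogged, so treat them as approximate; the variable names are purely illustrative.

```python
# Figures as quoted in the talk (liveblogged, so approximate).
galaxies = 1_000_000          # Sloan Digital Sky Survey objects put on the site
checks_per_galaxy = 5         # each galaxy was checked five times over
volunteers = 300_000          # people who took part over six months
phd_lifetime_output = 50_000  # classifications one PhD student can manage

# Total work done by the crowd, and how it compares to working alone.
total_classifications = galaxies * checks_per_galaxy
per_volunteer = total_classifications / volunteers
phd_equivalents = total_classifications / phd_lifetime_output

print(f"Total classifications: {total_classifications:,}")
print(f"Average per volunteer: {per_volunteer:.1f}")
print(f"Equivalent PhD studentships: {phd_equivalents:.0f}")
```

On these figures each volunteer only needed to contribute a handful of classifications on average, which is the point about scale: a small ask per person, multiplied across a very large network.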

In both cases the change is because of scale, because of connectivity and mobility of data, and critically because of the efficient transfer of information. Galaxy Zoo could push high resolution images and data could be pushed back by users.

So the question as service providers has to be “how do we get some of that?” How do we make networks? And how do we deliver them so they are the right shape, the right size, the right connectivity for the right problems?

We need

1. Connectivity

2. Low friction

3. Demand side filters

The first two of those are easy. We have the web. And transfer of digital and even physical objects (as long as the metadata is good) is fast and efficient. But that last issue is hard as our current approach is based on limiting and controlling access. And if you are doing that with research then you are delivering something that no one wants.

So how do we think about that? How can we reconfigure the way we do things?

So, this is a paper by Gunther Eysenbach (2011) JMIR 4:e123, it’s quite a controversial paper, but it starts from the principle that letting people know about research will increase how much it is used. If you can connect those who can and want to use research to that research it makes sense. But it’s a naive way to think about this. Connecting up just the research network overlooks the 400 million people on Twitter. There will be someone who will make that connection and help connect those two connections. More than one person. This is a serendipity engine. And you can do new things you haven’t thought of before, expand into new areas of research, you can connect people who do not work in that research process and are interested in that, there are more of them than researchers at this scale.

The problem is that as we let people know that this research exists those connections drop, the effect fades out, because of barriers to accessing that research, those publications. Each time you break that network you lose potential outcomes, you lose value, you fail to optimise the network here. You guys know this. You know that open research and collaboration leads to more and better research…

But the problem is that we are used to thinking commercially. The analogy is we take our car to be serviced, and then we rent it back. The problem is that the garage has the ability to say any loaning of a car or renting out breaches the contract. They have to find new ways to make money out of new opportunities. But if we turn that on its head, if you pay upfront in kind or in cash, then the service providers’ interests can be aligned to those of the researcher or the public – if the service provider provides access to the most people possible.

When we talk about publications we need to talk about first copy access. But we can look at recent research in Denmark about economic cost to small business of not having access to research equivalent to maybe £700m in the UK. Let alone saved costs to government etc.

So we talked about those three aspects.

1) Scale the network to make things available. This is being addressed as the old publication model ends. This service industry, ways of making content sharable and discoverable, is a great service to be in.

2) We need to think about filtering at the demand side of the system. We are used to peer review as the filter. But that filtering is a friction if it’s on the supplier side. Whether peer review works or not it can’t always be the right filter, certainly not the right filter for everyone. The thing that you don’t share because the results aren’t useful I need to understand methodology, those results you don’t share as it doesn’t support your argument I need because I want the data, and that garbage paper you wrote I need to learn from myself how to do things better.

We need filters that we control to deal with the issue of filter failure. As a reader or user I want a way to discover what I want, for the purpose I want, at that point in time. Ideally I want to know about this stuff ahead of time. I think this is the biggest opportunity to make everything available in a way that progresses research. This is what you do!

So what does this mean in practical terms?

Well we were at a stage of thinking about putting things into the repository; we’ve moved beyond that to thinking about using things in repositories and understanding that use. We need to optimise that repository. What are the barriers? What is the friction? Licensing. Just sort it out. Make open the default. But we also still have lots of broken connections, how can we connect them up? How can we aggregate data on usage and citation? What is the diversity in your repositories? How can we connect things to the wider graph and systems? How can we support social discovery? And how can we enable annotation and link this across resources? Annotation is a link; it probably won’t come from depositors. Mostly it will come from fairly random people on the web!

And the other big shift is to think about quality assurance. Badge it, make high quality stuff clear. But share everything. Just badge and certify the good stuff. It saves you filtering it all down and allows all sorts of usage.
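One way to picture “share everything, badge the good stuff, filter on the demand side” is the minimal sketch below. It is an illustration only, not anything shown in the talk: the `Item` class, the badge names and the `demand_side_filter` helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """A deposited object; badges certify quality without hiding anything."""
    title: str
    kind: str                      # e.g. "paper", "dataset", "negative-result"
    badges: set = field(default_factory=set)


# The repository shares everything; nothing is removed on the supply side.
repository = [
    Item("Peer-reviewed article", "paper", {"peer-reviewed"}),
    Item("Raw instrument data", "dataset"),
    Item("Experiment that did not work", "negative-result"),
]


def demand_side_filter(items, wanted):
    """Return the items one particular reader wants right now."""
    return [item for item in items if wanted(item)]


# Two readers apply two different filters to the same open collection.
certified_only = demand_side_filter(
    repository, lambda item: "peer-reviewed" in item.badges
)
methods_hunter = demand_side_filter(
    repository, lambda item: item.kind != "paper"
)

print([item.title for item in certified_only])
print([item.title for item in methods_hunter])
```

The design point is that the filter is an argument supplied by the reader rather than a gate applied at deposit, so the unbadged data and the failed experiment stay discoverable for the person who needs them.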

So repositories must be open, they must be accessible, and they have to be open to incoming connections from the global networks.

We are judged on research outcomes, usage when the right person finds it. And in that context a new connection could be more valuable than a new resource. This is a change to our way of thinking. We have to build those networks.

So again 3 areas to deliver… Scale and connectivity of the networks, reduced friction, and demand side filters.

1. The old model of giving away our intellectual property to pay for printing of it is dead! We need mechanisms, maybe through repositories, to make sure research is effective as possible

2. Filter on demand side, probably even automated

And that’s wrapped in one central idea. Think at the scale of networks. Assume that hundreds of thousands of people are looking at your work or want to. Assume that you cannot predict the most important use of your data. How you apply limited resources to engage with the fact that we are operating at a whole new scale is crucial.

We can’t build a system on the old truths. We could build a system on today’s truths but it wouldn’t last long. The only thing we can do for the future is to build for things we don’t expect, to be ahead of trends. Innovators don’t follow markets. They build them. When we provide services for the general public as innovators we need to build for the future. The network and its infrastructures and its systems and capacity are our future.


Q1 – Brian Kelly) You talked about building, not following, markets. We have twitter etc. Should we build the open one?

A1) That’s a really good question. I’ve always been against a Facebook for Science or Github for Science… the best Facebook is Facebook, the best Github is Github. But that was a world where the web was more open. There could be a point where it is worth our while to build tools for connectivity.

We are probably a long way away from needing a new Twitter. But we need to be looking out for that. Twitter has 400 million people; whatever we build will have fewer. It has to get much worse to be worth shifting but we should argue against it getting a lot worse.

Q2 – Les Carr) You talked about the web, about systems. But the web is a socio-technical framework full of people with their own agendas. The web is a disruptive technology, how do we create disruptive academic communities that will make a real difference rather than playing it softly as we have been?

A2) Part of the answer is getting in people’s faces more. And making opportunities for that. For me what will drive that is the way the government is monitoring outcomes and use of outcomes. There is pressure on researchers to do that but we are not used to that. The place to be disruptive is at the point of maximum pain. That’s coming soon for EPSRC funded research. It might be coming soon with implications of Finch report impacts on UK publications. Pick the point carefully but in next 6 to 24 months there will be the right pain point to be disruptive and to show that we can ease that pain. The time of sitting back and facilitating researchers has probably passed.

Q3 – Dave Tarrant) The New World Journal answer is to charge for journals. So how can we connect up this community together rather than still have the serials model where we charge for good stuff and have other stuff out there.

A3) That issue of silo-ing is important but that issue is solved for me by proper licensing, where people can pull content together in any form they want for free. That problem hopefully goes away. There are still technical barriers but they can be overcome. But that only works if the content is properly licensed. The other problem is we don’t want to exchange a problem with access on the read side for problems on the write side. Publishing, formatting and distributing research costs money. Anyone funding research really needs to ensure those costs are part of that funding. We have a lot of thinking to do around the transitional process. One way would be to shift the peer review process. If we could flip or change that model that would bring costs down. Contributions in kind should be considered – not exactly sure what that is but we need to think about it. And those of us on the publications side have to facilitate this. At PLoS when you publish we say how much this journal costs to publish in, what can you afford to pay, even if nothing. But that’s not long term as a solution for all. That becomes charity. We need to remember that a paper is just research and a publication is just a repository. If it’s not worth the cost of publication and the IR is the solution then so be it. Publishers worry about value for money – otherwise you wouldn’t see embargoes. Open access publishers are not threatened about the value they add to the deposited copy.

Q4) I think that model works for research papers, for software too. But for data? That’s much more complex?
A4) I was in New Zealand last week and my default CC0 answer isn’t possible with copyright law. There is licensing and there are legal instruments. We want stuff to be interoperable and open licences allow the most reuse possible. The principle should be for maximum interoperability with the most open license you can. So adopting Susan Morrison’s work there are some licenses to suggest for different sorts of objects. I hope in CC version 4 that the licence can deal properly internationally with data. Another problem here is that licensing is used for social signaling. Many people do not use them as a legal instrument. There is a social signaling element that has gotten tied up with legal instruments. I hope we can resolve that in the long term by thinking about transfer across network and use of research because it’s in their own best interests to see stuff used. The end game has to say please use this as much as possible, in as many ways as possible and I’d like to hear about it. I think that’s what I’d like to see and I think that should solve our problem.

And Kevin is closing the keynote with a big thank you to Cameron.

There are two fantastic ways you can use your smartphone at Open Repositories this week!

Microsoft Research have put together a great little Windows Phone 7 app for frequent conference goers and throwers: My Conference. It pulls in conference information and puts it into a convenient, dare I say attractive, interface. Browse events, tunnel deeper to learn about the delegates, and read what they’ve submitted to earn a slot at the conference. You can also see their other publications via Microsoft Academic Search. And what would any conference app be without a quick game of Guess Who? Here’s a little walkthrough and showoff video using the OR2012 programme.


You’ll probably see a bunch of pixelated little squares around the conference’s paper programme. You should put those QR codes to use, linking to equivalent pages online and freeing all that information from your printout. You can also use our custom map and read abstracts this way. Watch the video to find out how.

As Day Two workshops get underway we thought we’d take a look through the tweets and updates from Day One and share the highlights. This is the first set of updates, we’ll be adding to it from the tweets and comments throughout the day and after the event so do add links and comments here and come back to take a look…

After registration we realised it had taken a wee while for people to spot the 8GB memory stick hidden inside their delegate badges. Thankfully @williamjnixon was on hand with an explanation of how to use them – “The 8gb flash drive just swivels out of the badge”. Yes, it is really that easy 😉

The DSpace Committers meeting seemed to go really well – lots of interesting stuff raised according to those we’ve been chatting to this morning. If you were along and would like to share your notes/thoughts just let us know!

The Islandora workshop introduced Islandora to lots of folk who didn’t previously know much about it. And @jjtuttle was impressed to see “the #DiscGarden #Islandora video solution pack does reencoding using ffmpeg to generate access files. We want that.”

The Open Access Index workshop described establishing a way to “measure openness of research. What factors should it consider?” (via @openscience). To gather responses they have set up a survey here – do fill it in.

The DCC workshop, Institutional Repositories & Data – Roles and Responsibilities, highlighted that Research Data Management is “a relay race, pass the baton at key point in the cycle” (via @wrap_ed). @informnivore tweeted Jared Lyle’s take on data curation challenges: “formats, metadata, privacy, and training”. The ICPSR work attracted lots of interest. The full results of the recent ICPSR study will be available here (via @sjDCC) and a handy tip from the workshop: “ICPSR has an anonymizer tool for social science data”. One of the more interesting questions raised here was “what the role of funding agencies in data preservation & curation?” (via @informnivore). Breakouts included researcher workflows, institutional responsibilities, and IR limitations (via @pcastromartin). Concerns over the latter included “limited qualifications of library staff to deal with domain-specific metadata”. Apparently it “took us an hour or so but ‘the’ question has come up. So ‘what is research data?’” (via @wrap_ed).

At the text mining workshop @CriticalSteph reported back so regularly she got banned by Twitter for the day! But before the ban she shared news of Argo, which has “the aim is to be a community resource of a complete framework of text mining”.

The Repositories Support Project workshop: Building a national network kicked off with an overview from Balviar Notay of JISC’s work with repositories that left the crowd wanting more and particularly interested in the JISC Elevator and UK RepositoryNet+. And @llordllama wondered “Does the JISC elevator sound like the one in Are You Being Served? That would be neat.” Jackie Wickham spoke on RSP but also on Sherpa as “Congratulations go to Bill Hubbard, now a very recent dad which has trumped attending #or2012”! (via @williamjnixon). OpenDOAR, Sherpa/Romeo and Juliet were all well recognised by the crowd, even those from overseas and “66% of publishers listed on Romeo allow some form of repository archiving. That figure’s been stable for half a decade” (via @llordllama). And one audience member suggested that “adding ‘Article Processing Charges’ (APCs) to Romeo would be a v useful addition”. A great fact from Jackie’s talk on RSP (the website for which was launched in 2006): “UK only second to US in number of repositories” – “on OpenDOAR the 9.5% of the institutional repositories are from the UK” (via @RepoSupport). “RSP has been busy with over 1300 delegates to from from 200+ organisations to events and 90+ consultancy visits” and have an embeddedness self-assessment tool (via @williamjnixon and @nancypontika). Marie Cairney talked about “the evolution of Enlighten from its antecedents in JISC FAIR and DAEDALUS”. Now @uofglibrary has two separate repositories: one for published papers (Enlighten) and a second for theses. This led to discussion of deposit policy and of the Glasgow publications policy, a “mix of metadata, full text and use of address”. See also: Building a national network – Nick Sheppard’s excellent liveblog of the RSP session: http://ukcorr.org/2012/07/09/building-a-national-network/

And finally…

The DevCSI Developer Challenge is still looking for your fantastic ideas! Add them here or go and say hello in the Developer Lounge on the 1st floor of Appleton Tower (just near the lifts).

Recommended by the Tweetosphere

Get the picture?


Welcome to the LiveBlog area for Open Repositories 2012!

Throughout the conference we will be recording sessions, tweeting and posting live blog updates from Keynotes, Parallel Sessions, the Repository Fringe strand and our fabulous events. All of these posts will appear on the main OR2012 page but can also be found here in the LiveBlog category as well. We hope to also have some guest bloggers covering some of the workshops and user groups either live or after the conference has finished.

If you are interested in being one of those bloggers please get in touch (nicola.osborne@ed.ac.uk) and we’ll set you up with access to post!

What to Expect

Our glamorous team of bloggers will be posted in various venues around the conference adding live updates to the blog here. Sometimes these will be summaries of what is being said, sometimes near verbatim accounts (it depends on various factors, particularly the speaker’s talking speed!). Our bloggers will also be tweeting and keeping an eye on the conference hashtag #or2012 for key information, updates, and questions from those reading the blog from their own desks away from the conference. Most of the content at this year’s event will be videoed and made available during or shortly after the event. And our blogging team will also be taking pictures and sharing them via Flickr (where you can also share your images of the event) though some of these may take a bit longer to reach the blog.

If you have a question, comment, or need some information then feel free to say hello to our bloggers and they may be able to help you, or at least put you in touch with other organisers who can assist. Please bear in mind our bloggers will be very busy throughout the week so if it takes them a while to approve a comment or reply to a tweet just give them a wee friendly nudge.

As a general rule we use the EDINA Social Media Guidelines to help us in our work – feel free to take a look particularly if you’d like to join us in the blogging.

Meet the Team

So that you can spot them at the event here are the team who will be blogging, tweeting, videoing and generally helping share Open Repositories 2012 with you…

Nicola Osborne is the Social Media Officer for EDINA and is leading our Social Media activity and amplification this year. She has refined her liveblogging skills through years of covering the Repository Fringe events!

Ask about: Twitter, videoing, how to become an OR2012 blogger, etc.

Superpower: Speedy liveblogging and image taking in parallel.

Contact: @suchprettyeyes, nicola.osborne@ed.ac.uk or via @OpenRepos2012

Zack O’Leary is a PhD student at the University of Edinburgh and is assisting EDINA with social media throughout the summer.

Ask about: Twitter, liveblogs, mobile, burritos.

Superpower: Playing the QWERTY keyboard, he can create a symphony of social media.

Contact: @zaleary

Nick Sheppard is Repository Developer at Leeds Metropolitan University and Technical Officer for the UK Council of Research Repositories (UKCoRR). Nick considers himself a Shambrarian and blogs on technical and cultural aspects of repository development, research management and Open Educational Resources (OER) for both his institution and UKCoRR.

Ask about: Twitter, integrated research management (IR/CRIS), OER, Jorum

Superpower: Superfast, highly accurate rock-music assisted metadata creation

Contact: @mrnick, n.e.sheppard@leedsmet.ac.uk

Blogs: http://repositorynews.wordpress.com/, http://ukcorr.org/activity/blog/

Kirsty Pitkin is an event amplifier, who tears around the country covering a wide range of fascinating events. She works closely with the DevCSI project and will be blogging about the Developer Challenge throughout OR 2012. If you’re taking part in the challenge, make sure you tell her about your cool idea!

Ask about:  The DevCSI Developer Challenge
Superpower:  Managing multiple social media channels at once.
Contact: @devcsi, http://devcsi.ukoln.ac.uk, @eventamplifier, http://eventamplifier.wordpress.com


Steph Taylor is a researcher, consultant and trainer based in Manchester. Her interests lie in digital libraries, repositories, research data management and social media (read more on her Crowdvine page). She’ll be updating us on her superpowers shortly.

Natasha Simons is a Senior Project Manager in eResearch Services, Scholarly Information and Research, at Griffith University, Australia. She manages a number of projects focussed on building eResearch infrastructure. She’ll be updating her own blog here: http://natashajsimons.blogspot.com.au/. She’s currently en route to the UK but will update us on her superpowers shortly.

Join the Team!

We welcome and encourage your input before, during and after Open Repositories 2012. If you would like to be one of our livebloggers there is still time – email nicola.osborne@ed.ac.uk and we’ll get you set up. Otherwise feel free to tweet, post on Crowdvine, share your images on Flickr, comment here on the blog – or just enjoy the conference whether online or here in person!

If you will be liveblogging or writing up Open Repositories 2012 somewhere else on the web just let us know and we’ll link to your write-up from our OR2012 Buzz page.

If you are the author of a poster at this year’s conference, make sure you pick up your free poster stickers from the registration desk. We’re trialling a simple way to help poster authors have discussions with delegates about their posters even when they aren’t standing next to them. You can place the poster stickers on your clothing, your conference bag, laptop or anywhere else you think it will attract attention. 4 stickers have the full title of your poster on them as well as an image of it; 3 just have the poster itself. One contains identifying information to make it easy for the registration staff to hand out the right stickers to the right people.

If you are a delegate and see someone wearing a poster sticker, why not ask them about it?

Stickers were only printed for those who uploaded their posters to the conference system by early July. If you didn’t do that, we’re sorry but you won’t have any stickers waiting for you.

The wait is nearly over! We are just a weekend away from the start of Open Repositories 2012 and a warm welcome from us all here in Edinburgh awaits. There are over 430 delegates coming to OR2012. It’s a packed programme and we know it will be a busy (and exhilarating) week.

We have arranged 14 workshops in total on Monday and Tuesday morning, many of which are now fully booked. We would be grateful if you could check your booking and, if you are no longer able to attend, notify us or cancel the booking yourself as there are waiting lists for several workshops now.

There will also be an opportunity for the repository user group communities (DSpace, Fedora and EPrints) to share their latest developments and work. These user group sessions are open to all delegates and offer an excellent opportunity to find out more.

This year we have introduced a third “Repository Fringe” strand, based on the very successful “Repository Fringe” hosted here at the University of Edinburgh. This includes a wide range of Pecha Kucha presentations which promise to be lively. There’s also an opportunity to contribute to the Open Access Index project and to learn more from Ipsos Mori about online survey tools.

Registration will take place in Appleton Tower (see campus map). The registration desk will open on Monday 9 July at 8.30 am and will be available throughout the conference. A separate conference office will also be available to deal with any further enquiries you may have.

For the latest details please refer to the online programme.

A printed programme and delegate list will be provided upon registration.

We have organised wall-to-wall sunshine for the week starting Monday 9 July but have yet to identify a delivery mechanism. We shall work on this over the weekend!! In the meantime check the forecast here and it may be prudent to pack an umbrella, it is summer after all!

Travel and Accommodation
Details about travel and the conference accommodation can be found on the conference website – click on the Registration link in the menu above to access the relevant pages.

If you haven’t booked accommodation yet please refer to our Accommodation information.  Note: The cost of accommodation is NOT included in the registration fee.

The conference will take place on the George Square campus at the University of Edinburgh. Opening and closing sessions will take place in George Square Lecture Theatre. All other sessions will take place in Appleton Tower (see central area maps [PDF] and also our OR2012 Google Map of the venues).

For further information about the conference location please refer to: http://or2012.ed.ac.uk/location/

Speakers and Session Chairs
If you are a speaker and haven’t yet sent us your presentations please do, it will really assist the smooth running of the conference. Further guidance about timings, set-up etc is available elsewhere on the conference site for speakers and session chairs.

Network Access and Eduroam
There is wifi throughout the George Square campus through two routes. Users of either wifi option should be aware of the University of Edinburgh Computing Regulations.

Eduroam is available and accessible throughout the buildings so Eduroam users should be able to login with their usual details. You may need to set this up at your own institution before arriving.

Free University of Edinburgh wifi guest accounts will be available for OR2012 – please ask at the registration desk for more information and your guest login details.

If you have a mobile device like a tablet or smartphone, guidance is available from the University of Edinburgh.

Social Media and Recording
We will be recording, blogging, tweeting, using Crowdvine and other exciting social media tools throughout Open Repositories this year. We hope that you’ll join in the fun, so if you are curious about any of these tools but haven’t used them before, we’d like to help you get started. We have put together a Beginners Guide to Social Media for OR2012.

Have a look and please do leave a question or comment – or email them to: Nicola.Osborne@ed.ac.uk. We are also looking for live bloggers, so contact Nicola to volunteer and be part of our social media mix.

#OR2012 is the perfect time to take the plunge with Twitter and we recommend following the conference’s official Twitter account @OpenRepos2012 for all the latest news and breaking action.

Lunches and snacks
Lunches, coffee, teas and snacks will be provided to all delegates each day at the conference. There will be coffee breaks available during the workshops.

Open Repositories 2012 would not be possible without the sponsors, supporters, collaborators and organisers that enable us to make this both a highly useful and very enjoyable event, and we would like to take the opportunity to thank them. Find out more about our sponsors here. Many of the sponsors will be exhibitors in the concourse in Appleton Tower during the week, so drop by and say hello.

Social Events
The social events are included in your registration fee.

There will be a Drinks Reception in the Playfair Library on the evening of Tuesday 10 July (6pm – 8pm). This will be opened by the Depute Lord Provost of Edinburgh, Deidre Brock, with a reply from Professor Jeff Haywood, Vice-Principal of Knowledge Management & Chief Information Officer, University of Edinburgh.

Please note that canapés will be served at the Drinks Reception and as such delegates are advised to make their own dinner arrangements. There is a wide range of restaurants to suit all tastes and budgets in the vicinity.

On Wednesday evening (11 July) there will be a conference dinner and a Ceilidh at the National Museum of Scotland. Drinks will be served at 7pm with dinner at 8pm and dancing until just before midnight, if you can stand the pace! We are delighted that Dr John Howard, Chair of the Steering Committee, has agreed to be our Master of Ceremonies and to announce the winners of the Developer’s Challenge, after their Show and Tell earlier in the evening.

An invitation for dinner will be in your registration pack. If you don’t plan on coming to the conference dinner we would appreciate it if you handed the dinner invitation back to us at registration. This will assist us with numbers.

Anything else? Need help?
Contact us by e-mail at or2012@ed.ac.uk or on Twitter using #or2012info and we will do our best to help. We look forward to seeing you in Edinburgh.

Kevin Ashley
Chair of OR2012 Programme Committee

Uncle Sam I Want You Poster

(Original image by DonkeyHotey, Flickr, 28-04-11. Painting by James Montgomery Flagg, via the Library of Congress)

The developer challenge isn’t just for developers anymore. It doesn’t matter if you speak Perl or Ruby or if you bash your Fedora, so long as you speak repository. We want curators, managers, and users of every sort to join. It takes all kinds to make great new toys, so you should consider signing up and pitching an idea. If metadata gets you going, or if you revel in getting your hands dirty with big data sets, there’s no better place to be this Tuesday night than the developer challenge at OR2012.

Show us something new and cool in the world of Open Repositories

That’s the pitch, and we want to see what you’ve got. It’s going to take a collaboration between code ninjas, database wizards, and SWORD-wielding users to take home the prize. We know there are all sorts of innovations bouncing around amongst the array of attendees, and we want to showcase the best of the best.

You don’t even have to worry about making it work yet, though it certainly wouldn’t hurt. Just refine your idea a bit and get ready to talk about it. On Tuesday all of the challengers will get together and shout it out, airing their plans and giving each other feedback. Then you’ve got just under a day to make any finishing touches before presenting to an audience and judging panel on Wednesday night at 5:00pm.

To the victor go the spoils

Funding, vouchers, widgets, and the attention of the entire conference on Thursday morning are all up for grabs. Not too shabby. So head over to the DevCSI challenge page to iron out the details, then submit your idea in the comments of the entry page before Tuesday the 10th.

Need some inspiration? We’ve got just the thing – here are a few prize winners from OpenRepo DevCSI challenges in 2009 and 2011.


The National Museum of Scotland, one of the finest Victorian buildings in Scotland, will host our conference dinner and ceilidh. The very late setting sun will shine through the glass ceiling of the recently refurbished Grand Gallery and provide a magnificent space for the social event of the conference.

The Grand Gallery at night

After a wine reception and some fine dining we will have a chance to dance to the music of the Wullie Fraser Band.

Ceilidh dancing is energetic fun! You’ll get hot no matter what you wear, so you might as well dress up to mark the occasion. Somewhere between comfortable and positively dashing ought to do it – just no trainers. Skirts and kilts aren’t required outright, but only because our legal team said so.

Scottish dancers in competition

We promise you won't have to do this ('Scottish dancers in competition' by Gordon E. Robertson. Wikimedia Commons. 30-07-11)

As long as you can count to eight you’ll be fine: each dance is a set of repeated steps. As OR2012 delegates your expert attention to detail will definitely come in handy. Sometimes only two people dance together, and sometimes four, six or eight make up a ‘set’ where each couple gets a chance to dance the pattern.

Some more experienced folks will be there to show you how it’s done, and the Wullie Fraser Band will “call” us through each dance before the music starts. Don’t worry about the moves, though. The initial confusion is half the fun, and eventually the patterns will fall into place.

Intertwined ceilidh dancers

See? Easy ('Ceilidh 7' by Barney Moss. Flickr. 03-09-11)

Note that if you are really going to throw yourself into the “birling” with your partner, the best hold is to grasp their right elbow with your right hand, and clasp left hands above. This will save many bruised inside elbows and keep you spinning on your feet.

You can always learn a bit more about the patterns in advance if you’re really keen.

Ceilidh dancers coming together

And by the end of the night... ('Ceilidh 4' by Barney Moss. Flickr. 03-09-11)

It’s sure to be good food and good fun, all in a fantastic venue that the OR2012 delegates will get all to themselves until the sun finally goes down around midnight. Be sure to join us from 7:30pm on Wednesday 11th July.

