Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.
If you are following the event online please add your comment to this post or use the #or2012 hashtag.
This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.
Topic: Moving from a scientific data collection system to an open data repository
Speaker(s): Michael David Wilson, Tom Griffin, Brian Matthews, Alistair Mills, Sri Nagella, Arif Shaon, Erica Yang
I am here presenting on behalf of myself and my colleagues from the Science and Technology Facilities Council. We run facilities ranging from the CERN Large Hadron Collider to the Rutherford Appleton Laboratory. I will be talking about the ISIS Facility, which is based at Rutherford. People put in their scientific sample, say a crystal, and that crystal goes into the facility, which may examine it for anything from an hour to a few days. The facility produces 2 to 120 files per experiment in several formats including NeXus and RAW (no, not that one, a Rutherford Appleton format). In 2009 we ran 834 experiments, producing 0.5 million files and 0.5TB of data. But that’s just one facility. We have petabytes of data across our facilities.
We want to maximise the value of STFC data, as Cameron indicated in his talk earlier it’s about showing the value to the taxpayer.
- Researchers want to access their own data
- Other researchers validate published results
- Meta-studies incorporating data – reuse or new subsets of data can expand use beyond the original intent for the data
- Set experimental parameters and test new computational models/theories
- Use for new science not yet considered – we have satellites but the oldest climate data we have is on river depth, collected 6 times a day. It’s 17th century data but it has huge 21st century climate usefulness. Science can involve uses of data that are radically different from those originally envisioned
- Defend patents on innovations derived from science – biological data, drug related data etc. is relevant here.
- Evidence based policy making – we know they want this data but what the impact of that is may be arguable.
That one at the top of the list (1) is the one we started with when we began collecting data. We started collecting about 1984. The Web came along about 1994-1995 and by 1998 researchers could access their own data on the web – they could find the data set they had produced using an experiment number. It wasn’t useful for others but it was useful for them. And the infrastructure reflected this. It was very simple: instrument PCs as the data acquisition system, a distributed file system and server, delivery, and the user.
Moving to reason (2), we want people to validate the published results. We have the raw data from the experiment. We have calibrated data – that’s the basis for any form of scientific analysis. That data is owned by the facility and preserved by the facility. But the researchers do the data analysis at their own institution. The publisher may eventually share some derived data. We want to hold all of that data: the original data, the calibration data, and the derived data. So when do we publish data? We have less than 1% commercial data so that’s not an issue. But we have data policies (different science, different facilities, different policy), largely built around the PhD period, so we have a 3 year data embargo. It’s generally accepted by most of our users now but a few years ago they were not happy with that. We do keep a record of who accesses data. And we embargo metadata as well as data because if it’s known, say, that a drug company supports a particular research group or university, a competitor may start copying the line of inquiry even on the basis of the metadata… don’t think this is just about corporates though… In 2004 a research group in California arranged a meeting about a possible new planet; some researchers in Spain looked at the data they’d been using and, reasoning that that research team had found a planet, announced that THEY had found a planet. It’s not just big corporations; academics are really competitive!
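As a rough illustration of the embargo rule described above, here is a minimal Python sketch. The three-year period comes from the talk; the function names and the handling of 29 February are assumptions made for this example, not a description of STFC's actual system.

```python
from datetime import date

EMBARGO_YEARS = 3  # the three-year embargo mentioned in the talk


def release_date(collected_on: date, embargo_years: int = EMBARGO_YEARS) -> date:
    """Return the first day on which a dataset leaves embargo."""
    try:
        return collected_on.replace(year=collected_on.year + embargo_years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to 1 March
        # (an assumption of this sketch, not a documented policy).
        return date(collected_on.year + embargo_years, 3, 1)


def is_public(collected_on: date, today: date) -> bool:
    """Data and metadata are both withheld until the embargo has passed."""
    return today >= release_date(collected_on)
```

Because the talk says metadata is embargoed along with the data, a single check like `is_public` would gate both the files and their catalogue records.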
But when we make the data available we make it easy to discover that data and reward it. For any data published we create a Data DOI that enables Google to find the page and, in the UK, HEFCE have said that open access research datasets will be allowed in the new REF. And data will also be going into the citation index that is used in the assessment of research centres.
So on our diagram of the infrastructure we now have metadata and Data DOI added.
Onto (3) and (4). In our data we include schedule and proposal – who, funder, what etc. – that goes with that data. About 5% don’t do what they proposed, so mostly that job is easily done but sometimes it can be problematic. We add publications data and analysis data – we can do this as we are providing the funding, facility and tools they are using. The data can be searched via DataCite. Our in-house TopCat system allows in-house browsing as well. And we’ve added new elements to infrastructure here.
Looking at (5), (6) and (7): new science, patents, policy. We are trying to build socio-economic impact into the process. We have adopted a commercial product called Tessella Safety Deposit Box with fixity checks. We have data format migration. And we have our own long term storage as well.
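For readers unfamiliar with fixity checks, the idea is to record a checksum at ingest and recompute it later to confirm the bits have not changed. The sketch below uses SHA-256 from the Python standard library. The choice of algorithm and the function names are assumptions for illustration and say nothing about how the commercial product does it.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large data files never sit in memory whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: Path, recorded_checksum: str) -> bool:
    """Return True if the file still matches the checksum recorded at ingest."""
    return sha256_of(path) == recorded_checksum
```

A scheduled job would call `fixity_ok` for each stored file and flag any mismatch for repair from another copy.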
So that infrastructure looks more complex still. But this is working. We are meeting our preservation objectives. We are meeting the timescale of objectives (short, medium, long). Designated communities, additional information, security requirements are met. We can structure a business case using these arguments.
Q1) Being a repository manager I was interested to hear that over the last few years 80% of researchers had gone from unhappy at sharing data to most now being happy. What made the difference?
A1) The driver was the funding implications of data citations. The barrier was distrust in others using or misinterpreting their data but our data policies helped to ameliorate that.
Topic: Postgraduate Research Data: a New Type of Challenge for Repositories?
Speaker(s): Jill Evans, Gareth Cole, Hannah Lloyd-Jones
I am going to be talking about the Open Exeter project. This was funded under the Managing Research Data programme and was working as a pilot biosciences research project but we are expanding this to other departments. We created a survey for researchers to comment on Post Graduates by Research (PGRs) and researchers. We have created several different Research Data Management plans, some specifically targeted at PGRs. We have taken a very open approach to what might be data, and that is informed by that survey.
We currently have three repositories – ERIC, EDA, DCO – but we plan to merge these so that research is in the same place from data to publications. We will be doing this with DSpace 1.8.2 and Oracle 11g database system. We are using Sword2 and testing various types of upload at the moment.
The current situation is that thesis deposit is mandatory for PGRs but not deposit of data. There is no clear guidance or strategy for this nor a central data store. Nor is there a clear strategy for deposit of large files, and deposits of this kind are growing. But why archive PGR data? Well enhanced discoverability is important, especially for early career researchers; a raised research profile/portfolio is also good for the institution. There is also an ability to validate findings if queried – good for institution and individual. And this allows funder compliance – expected for a number of funders including the Wellcome Trust. And the availability of data on open access allows fuller exploitation of data and enables future funding opportunities.
Currently there is very varied practice. One issue is the problem of loss of data – this has an impact on their own work but increasingly PGRs are part of research groups so lacking access can be hugely problematic. Lack of visibility limits potential for reuse and means a lack of recognition. And inaccessibility can mean duplication of effort and can block research that might build on their work.
The solution will be to support deposit of big data alongside the thesis. It will be a simple deposit. And a long term curation process will take place that is file agnostic and provides persistent IDs. Awareness raising and training will take place and we hope to embed cultural change in the research community. This will be supported by policy and guidance as well as a holistic support network.
The policy is currently in draft and mandates deposit if required by funder; encourages in other cases. We hope the policy will be ratified by 2013. There are various issues that need to be addressed though:
- When should data be deposited
- Who checks data integrity
- IP/Confidentiality issues
- Who pays for the time taken to clean and package the data? This may not be covered by funders and may delay their studies but one solution may be ongoing assessment of data throughout the PGR process.
- Service costs and sustainability.
Q1, Anthony from Monash) How would you motivate researchers to assess and cleanse data regularly?
A1) That will be about training. I don’t think we’ll be able to check individual cases though.
Q2, Anna Shadboldt, University of NZ) Given what we’re doing across the work with data mandates is there a reason
A2) We wanted to follow where the funders are starting to mandate deposit but all students funded by the university will also have to deposit data so that will have wider reach. In terms of self-funded students we didn’t think that was achievable.
Q3, Rob Stevenson, Los Alamos Labs) Any plans about different versions of data?
A3) Not yet resolved but at the moment we use handles. But we are looking into DOIs. The DOI system is working with the Handle system so that Handle will be able to deal with DOI. But versioning is really important to a lot of our potential depositors.
Q4, Simon Hodson from JISC) You described this as applying to PG students generally. Have you worked on a wider policy for wider research communities? Have there been any differences in how supervisors or research groups approach this?
A4) We have a mandate for researchers across the university. We developed a PGR policy separately as they face different issues. In general supervisors are very pro preserving student data for use and reuse, as this problem has arisen before within research projects. We have seen that PGRs are generally pro this; among researchers it tends to vary greatly by discipline.
More information: http://ex.ac.uk/bQ, project team: http://ex.ac.uk/dp and draft policies are at http://ex.ac.uk/dq and http://ex.ac.uk/dr
Topic: Big Data Challenges in Repository Development
Speaker(s): Leslie Johnston, Library of Congress
A lot of people have asked why we are at this sort of event: we don’t have a repository, we don’t have researchers, we don’t fund research. Well we actually do have a repository of a sort. We are meant to store and preserve the cultural output of the entire USA. We like to talk about our collections as big data. We have to develop new types of data services that are very different to our old service model. We have learned that we have no way of knowing how our collections will be used. We used to talk about “collections” or “content” or “items” or “files”. But recently we have started to talk about and think about our materials as data. We have Big Data in libraries, archives and museums.
We first looked into this via Digging into Data Challenge through the National Endowment for the Arts and Humanities. This was one of the first introductions to our community, the libraries, archives and museums community, that researchers are interested in data – including bulk corpora – in their research.
So, what constitutes Big Data? Well the definition is very fluid and a moving target. We have a huge amount of data – 10-20TB per week per collection. We still have collections but what we also have is big data, which requires us to rethink the infrastructure that is needed to support Big Data services. We are used to mediating the researcher’s experience so the idea that they will use data without us knowing perhaps is radically different.
My first case study is our web archives. We try to collect what is on the web but it’s about heavily curated content around big events, around specific topics etc. When we started this in 2000 we thought researchers would be browsing to see how websites used to look. That’s not the case. People want to data mine the whole collection and look for trends, say around elections. This is 360TB right now, billions of files. How do we curate and catalogue these? And how do we make them accessible? We also have an issue that we cannot archive without permission so we have had to get permission for all of these and in some cases the pages are only available on a terminal in the library.
Our next case study is our historic newspapers collections. We have worked with 25 states to bring in 5 million page images from historic newspapers, all available with OCR. This content is well understood in terms of ingest. It’s four image files and an OCR file and a METS file and a MODS file. But we’ve also made data available via an API. You can download all of those files and images if you want.
Case Study – Twitter. The Twitter archive has tens of billions (21 billion) files in it. We are still somewhat under press archive. We received the 2006-2010 archive this year. We are just now working with it. We have had over 300 research requests already in the two years since this was announced. This is a huge scale of research requests. This collection grows by tens of millions of items per hour. This is a tech and infrastructure challenge but also a social and training challenge. And under the terms of the gift researchers will have to come into the library; we cannot put this on the open web.
Case study – Viewshare. A lot of this is based on the SIMILE toolkit from MIT. This is a web tool to upload and share visualisations of metadata. It’s on SourceForge – all open access. Or the site itself: http://viewshare.org/. Any data shared is available as a visualisation but also, if the depositor allows, as the raw data. What does that mean for us?
We are working with lots of other projects, which could be use cases. Electronic journal articles for instance – 100GB with 1 million files. How about born-digital broadcast television? We have a lot of things to grapple with.
Can each of our organisations support real-time querying of billions of full text items? Should we provide the tools?
We thought we understood ingest at scale until we did it. As at many universities, access is one thing, actual delivery is another. And then there are fixity checks and checksums, and validating against specifications. We killed a number of services attempting to do this. We are now trying three separate possibilities: our current kit, better kit, and Amazon cloud services. This is about ingest AND indexing. Indexing is crucial to making things available. How much processing should we do on this stuff? We are certainly not about to catalogue tweets! But expectations of researchers and librarians are about catalogues. This is a full text collection, and it will never be catalogued. It may be one record for the whole collection. We will do some chunking by time, with the tweets kept in their native JSON. I can’t promise when or how this stuff will be happening.
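To make "chunking by time" concrete, here is a minimal Python sketch that groups JSON records into hourly buckets while leaving each record as its original JSON text. The `created_at` field name and its ISO 8601 format are hypothetical choices for this example. The talk does not describe the Library's record layout or chunk size.

```python
import json
from collections import defaultdict
from datetime import datetime


def chunk_by_hour(json_lines):
    """Group JSON records by the hour in their timestamp, keeping the raw JSON text."""
    chunks = defaultdict(list)
    for line in json_lines:
        record = json.loads(line)
        # "created_at" in ISO 8601 form is a hypothetical field for this sketch.
        stamp = datetime.fromisoformat(record["created_at"])
        chunks[stamp.strftime("%Y-%m-%dT%H")].append(line)
    return dict(chunks)
```

Each bucket could then be written out as one file per hour, which gives a coarse index by time without cataloguing individual items.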
With other collections we are doing more. But what happens if one file is corrupted? Does that take away from the whole collection? We have tried several tools for analysis – BigInsights and Greenplum. Neither is right yet though. We will be making files discoverable but we can’t handle the download traffic… we share the same core web infrastructure as loc.gov and congress.gov etc. Can our staff handle these new duties or do we leave researchers to fend for themselves? We are mainly thinking about unmediated access for data of this type. We have custodial issues here. Who owns Twitter – it crosses all linguistic and cultural boundaries.
Q1) What is the issue with visiting these collections in person?
A1) With the web archives you can come in and use them. Some agreements allow take away of that data, some can only be used on-site. Some machines with analytics can be used. We don’t control access to research based on collections however.
Q2) You mentioned the Twitter collection. And you are talking about self-service collections. And people say stupid stuff there.
A2) We only get tweets, we get username, we know user relations but we don’t get profile information or their graph. We don’t get most of the personal information. I’ve been asked if we will remove bad language – no. Twitter for us is like diaries, letters, news reporting, citizen journalism etc. We don’t want to filter this. There was a court case decided last week in New York that said that Twitter could be subpoenaed to give over a user’s tweets – we are looking at implications for us. But as we have the 2006-10 archive this is less likely to be of interest. And we have a six month embargo on all tweets, and any deleted tweets or deleted accounts won’t be made available. That’s an issue for us actually; this will be a permanently redacted archive in some ways.
Topic: Towards a Scalable Long-term Preservation Repository for Scientific Research Datasets
Speaker(s): Arif Shaon, Simon Lambert, Erica Yang, Catherine Jones, Brian Matthews, Tom Griffin
This is very much a follow up to Michael’s talk earlier as I am also at the Science and Technology Facilities Council. The pitch here is that we are interested in the long-term preservation of scientific data. Lots going on here and it’s a complex area, thanks to the complex dependencies of digital objects that also need preserving to enable reusability, and the large volumes of digital objects that need scalable preservation solutions. And scientific data adds further complexity – unique requirements to preserve the original context (e.g. processed data, final publications, etc.). And it may involve preservation of software and other tools etc.
As Michael said we provide large scale scientific facilities to UK science. And those experiments running on STFC facilities generate large volumes of data that need effective and sustainable preservation along with contextual data. There is significant investment here – billions of euros involved – and we have a huge community of usage here as well. We have 30K+ user visitors each year in Europe.
We have a fairly well established STFC scientific workflow. Being central facilities we have lots of control here. And you’ve seen our infrastructure for this. But what are the aims of the long term preservation programme? Well we want to keep data safe – the bits that are retrievable and the same as the original. We want to keep data usable – that which can be understood and reused at a later date. And we have three emerging themes in our work:
- Data Preservation Policy – what is the value in keeping data
- Data preservation Analysis – what are the issues and costs involved
- Data Preservation Infrastructure – what tools do we use
But there are some key data preservation challenges:
- Data Volume – for instance a single run of an ISIS experiment could produce files of 1.2GB in size. An experiment typically has 100s of runs – files of 100+GB in total size. ISIS is a good test bed as these sizes are relatively small.
- Data Complexity – scientific HDF data format (NeXus), structural and semantic diversity in files
- Data Compatibility – 20 years of data archives here.
We are trialing a system that is proprietary and commercial and manages integrity and format verification; it is designed within a library and archive context and turns a data storage service into a data archive service. But there are some issues. There is limited scalability – it is not happy with files over several GBs. There is no support for syntactic and semantic validation of data. No support for linking data to its context (e.g. process description, publications). There is no support for effective preservation planning (tools like Plato).
We are doing this in the context of a project called SCAPE – Scalable Preservation Environments – an EC FP7 project with 16 partners (Feb 2011-Jan 2015), and it’s a follow on from the PLANETS project. We are looking at facilitating compute-intensive preservation processes that involve large (multi-TB) data sets. We are developing cloud-based preservation solutions using Apache Hadoop. For us the key products from the project will be a scalable platform for performing preservation operations (with potential format conversion) and for enabling automatic preservation processes. So our new infrastructure will add further context into our preservation service; a watch service will also alert us to necessary preservation actions over time. We will be storing workflows, policies and what we call PNMs for particular datasets. The tricky areas for us are the cloud based execution platform and the preservation platform.
The cloud-based workflow execution platform will be built on Apache Hadoop, and workflows may cover ingest operations etc. We are considering using Taverna for workflows. PNM stands for Preservation Network Models, a technique developed by the CASPAR project to formally represent the outputs of preservation planning. These models should help us control policies, workflows, and what happens with preservation watch.
Finally, this is the sort of workflow we are looking at to control this: the process we might run for a particular file. Ingest with a JHOVE-type check. Then we check the semantic integrity of the file. Then we do our AIP (Archival Information Package) construction etc.
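The three steps in that workflow (a format check at ingest, a semantic integrity check, then package construction) can be sketched as a small Python pipeline. Everything here is illustrative. The format check only tests for the HDF5 file signature, on the assumption that the NeXus files in question are HDF5-based. The semantic check is a placeholder rule, and the package is a plain dictionary rather than a real archival package format.

```python
from dataclasses import dataclass, field
from typing import List

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # the 8-byte signature that opens an HDF5 file


@dataclass
class IngestItem:
    name: str
    payload: bytes
    log: List[str] = field(default_factory=list)


def check_format(item: IngestItem) -> IngestItem:
    # Stand-in for a real format validation tool: signature test only.
    if not item.payload.startswith(HDF5_SIGNATURE):
        raise ValueError(f"{item.name}: not an HDF5 file")
    item.log.append("format check passed")
    return item


def check_semantics(item: IngestItem) -> IngestItem:
    # Placeholder rule (an assumption): there must be content after the signature.
    if len(item.payload) <= len(HDF5_SIGNATURE):
        raise ValueError(f"{item.name}: file has no content")
    item.log.append("semantic check passed")
    return item


def build_package(item: IngestItem) -> dict:
    # A toy package: the file's identity plus a record of the checks it passed.
    item.log.append("package built")
    return {"name": item.name, "size_bytes": len(item.payload), "provenance": list(item.log)}


def ingest(item: IngestItem) -> dict:
    """Run the checks in order; any failure stops the file before packaging."""
    for step in (check_format, check_semantics):
        item = step(item)
    return build_package(item)
```

Keeping each step as a separate function mirrors how a workflow engine would treat them as independent, replaceable stages.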
So at the moment we are in the design stage of this work but there are further refinements and assessment to come. And we have potential issues to overcome – including how Taverna might work with the system.
But we know that a scalable preservation infrastructure is needed for STFC’s large volumes of scientific data.
Q1) We run the Australian Synchrotron so this was quite interesting for me. When you run the data will that data automatically be preserved? Our one is shipped to a data centre and can then be accessed as wanted.
A1) For ISIS the data volumes are relatively low so we would probably routinely store and preserve data. For the Synchrotron the data volumes are much larger so that’s rather different. Although the existing work on crystallography may help us with identifying what can or cannot be preserved.
Q2) Where do you store your data? In Hadoop or somewhere else? Do you see Hadoop as a feasible long term data solution?
A2) I think we will be mainly storing in our own data systems. We see it as a tool to compute really.
Q3) What software is in the data centre to store that much data?
A3) We have a variety of solutions. Our own home grown system is in use. We use CASTOR, the CERN system. We have a number of different ones as new ones emerge. Backup really depends on your data customer. If they are prepared to pay for extra copies you can do that. That’s a risk analysis. CERN has a couple of copies around the world. Others may be prepared to take the risk of data loss rather than pay for storage.
Topic: DTC Archive: using data repositories to fight against diffuse pollution
Speaker(s): Mark Hedges, Richard Gartner, Mike Haft, Hardy Schwamm
The Demonstration Test Catchment Project is funded by Defra and runs from Jan 2011 to Dec 2014. It’s a collaboration between the Freshwater Biological Association and KCL (Centre of eResearch) and builds upon previous JISC-funded research. To understand the project you need to understand the background to the data.
Diffuse pollution is the release of a polluting agent that may not have an immediate effect but may have a long term cumulative impact. Examples of diffuse pollution include run off from roads, discharges of fertilisers on farms etc. What is a catchment? Well typically this is the catchment area of a particular body of water draining into a particular point. And the final aspect is the Water Framework Directive. This is a legal instruction for EU member states that must be implemented through national legislation within a prescribed time-scale. This framework impacts on water quality and so this stretches beyond academia and eResearch.
The project is investigating how the impact of diffuse pollution can be reduced through on-farm mitigation methods (changes to reduce pollution) and those have to be cost effective and maintain food production capacity. There are 3 catchment areas in England for tests to demonstrate three different environment types.
So how does the project work? Well roughly speaking we monitor various environmental markers; we try out mitigation measures, and then analyze changes in baseline readings. And it’s our job to curate that data and make it available and usable by various different stakeholders. So these measurements come in various forms – bankside water quality monitoring systems etc.
So the DTC archive project is being developed. We need that data to be useful to researchers, land managers, farmers, etc. So we have to create the data archive, but also the querying, browsing, visualizing, analysing and other interactions. There need to be integrated views across diverse data that suit their needs. Most of the data is numerical – spreadsheets, databases, CSV files. Some of this is sensor data (automated, telemetry) and some are manual samples or analysis. The sensor data are more regular; there is more risk of inconsistencies in manual data. There is also data on species/ecological data. Also geo-data. Also less highly structured information such as time series images, video, stakeholder surveys, unstructured documents etc.
Typically you need data from various objects etc. So when checking levels of potassium you need data from a number of points in the sensor data as well as contextual data from adjacent farms. So looking at data we see spreadsheets of sensor data, weather data, and land usage data as a map of usage, for instance, that might all be needed.
Some challenges around this data. The datasets are diverse in terms of structure, there are different degrees of structuring – both highly structured and highly unstructured combined here. And another challenge for us is INSPIRE with the intent of creating a European Spatial Data Infrastructure for improved sharing of spatial information and improve environmental policy. It includes various standards for geospatial data (e.g. Gemini2 and GML – Geography Markup Language) and it builds on various ISO standards (ISO 19100 series).
The generic data model is based around ISO 19156, which is concerned with observations and measurements. The model facilitates the sharing of observations across communities and includes metadata/contextual information and the people responsible for measurement. And this allows multiple data representations. The generic data model is implemented in several ways for different purposes: an archival representation (based on library/archival standards), a data representation for data integration (“atomic” representation as triples), and various derived forms.
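As a rough picture of what an observation record in this style carries, here is a minimal Python sketch. The core concepts (feature of interest, observed property, procedure, phenomenon time, result) follow ISO 19156. The `unit` and `responsible_party` fields, the class itself, and the `as_row` helper are simplifications assumed for this example rather than the project's actual model.

```python
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    feature_of_interest: str   # what was observed, e.g. a sampling site
    observed_property: str     # which property, e.g. potassium concentration
    procedure: str             # how it was obtained: sensor or manual sample
    phenomenon_time: datetime  # when the value applies
    result: float              # the measured value
    unit: str                  # unit of the result (simplified; an assumption here)
    responsible_party: str     # who is responsible for the measurement

    def as_row(self) -> dict:
        """One derived representation: a flat row suitable for CSV export."""
        row = asdict(self)
        row["phenomenon_time"] = self.phenomenon_time.isoformat()
        return row
```

The same object could equally be serialised to an archival format or to triples, which is the sense in which one model supports multiple representations.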
In the Islandora repository we create data and metadata files: METS files, MADS files and MODS are there. That relationship to library standards is a reflection of the fact that this archive sits within a bigger, more bibliographic type of archive. The crucial thing here is ensuring consistency across data components for conceptual entities etc. So to do this we are using MADS, the Metadata Authority Description Schema, which helps explain the structure and format of the files and links to vocabulary terms and table search. The approach we are taking is to break data out into an RDF based model. This approach has been chosen because of the simplicity and flexibility of that data model.
Most of this work is in the future really but it is based on that earlier JISC work – breaking data out of tables and assembling it in triples. This is clear from an example data set – where we see collection method, actor, dataset, tarn, site, location, and multiple observation sets each with observations, all as a network of elements. So to do this we need common vocabularies – we need columns, concepts, entities mapped to formal vocabularies. Mappings are defined as archive objects. We have automated, computer-assisted and manual approaches here. The latter require domain experience and mark up of text.
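The idea of mapping table columns to vocabulary terms and breaking each row into triples can be sketched in a few lines of Python. The column names, the `ex:` vocabulary terms and the subject identifier below are all invented for illustration. The project's real vocabularies and mappings are not given in the talk.

```python
# Hypothetical mapping from spreadsheet column names to vocabulary terms.
COLUMN_VOCAB = {
    "site": "ex:site",
    "sampled_at": "ex:sampledAt",
    "potassium_mg_l": "ex:potassiumConcentration",
}


def row_to_triples(subject, row, vocab):
    """Turn one table row into (subject, predicate, object) triples."""
    triples = []
    for column, value in row.items():
        predicate = vocab.get(column)
        if predicate is None:
            # Unmapped columns are skipped here; in practice they would be
            # queued for computer-assisted or manual mapping.
            continue
        triples.append((subject, predicate, value))
    return triples
```

For example, a row with `site`, `sampled_at`, `potassium_mg_l` and an unmapped `notes` column yields three triples, with `notes` left out until someone maps it.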
Architecturally we have diverse data as archival data in Islandora. This is then mapped and broken into RDF triples, and then mapped again out to browsing, visualisation, search and analysis for particular types of access or visualisation. That break up may seem a bit perverse. We think of it as breaking the data into atoms and recombining it again.
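The "recombining" half of that picture can be sketched as follows: given a pool of triples, rebuild per-observation records and pull out a time series for one site and one property. The predicate names (`ex:site`, `ex:sampledAt`) are hypothetical, and plain Python tuples stand in for a real triple store.

```python
from collections import defaultdict


def triples_to_records(triples):
    """Regroup (subject, predicate, object) triples into one dict per subject."""
    records = defaultdict(dict)
    for subject, predicate, obj in triples:
        records[subject][predicate] = obj
    return dict(records)


def series_for(triples, site, prop):
    """Return sorted (time, value) pairs for one site and one observed property."""
    points = [
        (record["ex:sampledAt"], record[prop])
        for record in triples_to_records(triples).values()
        if record.get("ex:site") == site and prop in record and "ex:sampledAt" in record
    ]
    return sorted(points)
```

A visualisation or search front end would sit on top of views like `series_for`, each one a different recombination of the same atoms.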
The initial aim is to meet needs of specific sets of stakeholders, we haven’t thought about the wider world but this data and research may be of interest to other types of researchers and broader publics in the future.
At the moment we are in the early stages. Datasets are already being generated in large quantities. There is some prototype functionality. We are looking next at ingest and modeling of data. Find out more here: http://dtcarchive.org/
Q1) This sounds very complex and specific. How much of this work is reusable by other disciplines?
A1) If it works then I think the general method could be applicable to other disciplines. The specifics are very much for this use case but the methodology would be transferable.
Q2) Can you track use of this data?
A2) We think so; we can explain more about this.
Q3) It strikes me that these sorts of complex collections of greatly varying data are common in many disciplines so I would imagine the approach is very reusable. But the Linked Data approach is more time consuming and expensive so could you explain the cost benefit of this?
A3) We are being funded to deliver this for a specific community. Moving to the end of the project converting the software to another area would be costly – developing vocabularies say. It’s not just about taking and reusing this work, that’s difficult, it’s about the general structure.
And with that this session is drawing to a close with a thank you from our chair Elin Strangeland.