Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.
If you are following the event online please add your comment to this post or use the #or2012 hashtag.
This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.
Topic: Eating your own dog food: Building a repository with API-driven development
Speaker(s): Nick John Jackson, Joss Luke Winn
The team decided they wanted to build a wholly new RDM, with research data as a focus for the sake of building the best tool for that job. This repository was also designed to store data during research, not just after.
Old repositories work very well, but they assume the entry of a whole file (or a pointer), only retrievable in bulk and in oddly organized pieces. They have generally limited interface methods and capacities. These old repositories also focus on formats, not form (structure and content) unless there is fantastic metadata.
The team wanted to do something different, and built a great backend first. They were prepared to deal with raw data as raw data. The API was built first, not the UI. APIs are the important bit. And those APIs need to be built in a way that people will want to use them.
This is where eating your own dog food comes in. The team used their own API to build the frontend of the system, and used their own documentation. Everything had to be done well because it was all used in house. Then, they pushed it out to some great users, and made them do what they wanted to do with the ‘minimum viable product’. It works, and you build from there.
Traditional repos have a database, application, users. They might tack an API on at the end for manual and bulk control, but it doesn’t even include all of the functionality of the website usually. That or you screen scrape, and that’s rough work. Instead, this repository builds an API and then interacts with that via the website.
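A minimal sketch of that arrangement, not the team's actual code: the class names, method names, and sample data below are all hypothetical. The point it illustrates is that the website holds no data of its own and reaches storage only through the same API a third party would use.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetAPI:
    """Hypothetical API layer: the only code that touches storage."""
    _rows: dict = field(default_factory=dict)

    def put_row(self, dataset: str, row: dict) -> int:
        rows = self._rows.setdefault(dataset, [])
        rows.append(row)
        return len(rows) - 1  # index of the newly stored row

    def query(self, dataset: str, **filters) -> list:
        # Return only the matching subset, never the whole container.
        return [
            r for r in self._rows.get(dataset, [])
            if all(r.get(k) == v for k, v in filters.items())
        ]


class Website:
    """Hypothetical frontend: holds no data and calls only the public API."""

    def __init__(self, api: DatasetAPI):
        self.api = api

    def render_subset(self, dataset: str, **filters) -> str:
        rows = self.api.query(dataset, **filters)
        return "\n".join(str(r) for r in rows)


api = DatasetAPI()
api.put_row("temps", {"site": "A", "celsius": 11})
api.put_row("temps", {"site": "B", "celsius": 14})
page = Website(api).render_subset("temps", site="B")
```

Because the frontend has no private route to the data, any gap in the API shows up immediately as a gap in the website.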
Research tends to happen on a subset of any given data set; nobody wants the whole data set. So forget the containers that hold it all. Give researchers shared, easily usable databases. APIs put stuff in and out automatically.
This was also made extensible from day one. Extensible and writeable by everybody to the very core. The team also encourages re-usable modularity. People do the same things to their data over and over – just share that bit of functionality at a low data level. And they rely on things to do things to get things done – in other words, there’s no sense in replicating other people’s work if it’s done well.
The team ended up building better stuff because it uses its own work – if it doesn’t do what it’s meant to, it annoys them and they have to fix it. All functionality is exposed so they can get their work done quickly and easily. Consistent and clean error handling was baked in for the sake of their own sanity, but also for everybody else. Once it’s all good and easy for them, it will be easy for 3rd parties to use, whether or not they have a degree in repo magic. And security is forcibly implemented across the board. API-level authentication means that everything is safe and sound.
Improved visibility is another component. Database querying is very robust, and saves the users the trouble of hunting. Quantitative information is quick and easy because the API gives open access to all the data.
This can be scalable horizontally, to as many servers as needed. It doesn’t use server states.
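One common way to avoid server state is to make each request carry everything needed to authenticate it. The talk did not say how this repository does it, so the sketch below is an assumption throughout: the shared secret, the token format, and the function names are hypothetical. It uses an HMAC-signed token so that any server holding the key can verify a request without looking up a session.

```python
import hashlib
import hmac

# Assumption: every server in the pool is configured with the same key.
SECRET = b"shared-secret-for-illustration"


def issue_token(user: str) -> str:
    """Sign the user name so the token can be verified anywhere."""
    sig = hmac.new(SECRET, user.encode(), hashlib.sha256).hexdigest()
    return f"{user}:{sig}"


def handle_request(token: str):
    """Return the authenticated user, or None. No session store is consulted."""
    user, _, sig = token.rpartition(":")
    expected = hmac.new(SECRET, user.encode(), hashlib.sha256).hexdigest()
    return user if hmac.compare_digest(sig, expected) else None


token = issue_token("researcher1")
verified_user = handle_request(token)
```

Since `handle_request` depends only on the token and the shared key, adding another server means starting another copy of the same code.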
There are some problems involved in eating your own dog food. It takes time to design a decent API first. You also end up doubling up some development, particularly for frontend post-API development. APIs also add overhead. But after some rejigging, it all works with thousands of points per second, and it’s humming nicely.
Q: Current challenges?
A: Resourcing the thing. Lots of cutting edge technology and dependence on cloud architecture. Even with money and demand, IT infrastructure isn’t keeping up just yet.
Q: How are you looking after external users? Is there a more discoverable way to use this thing?
A: The closest thing we have is continuous integration to build the API at multiple levels. A discovery description could be implemented.
Q: Can you talk about scalability? Limitations?
A: Researchers will sometimes not know how to store what they’ve got. They might put pieces of data on their own individual rows when they don’t need to be. That brings us closer to our limit. Scaling up is possible, and doing it beyond limits is possible, but it requires a server-understood format.
Q: Were there issues with developers changing schemas mysteriously? Is that a danger with MongoDB?
A: We guard against that by using our own documentation, forcing ourselves to look at it when building and questioning. We’ve got a standard object with tracking fields, and if a researcher starts to get adventurous with schemas it’s then on them.
Topic: Where does it go from here? The place of software in digital repositories
Speaker(s): Neil Chue Hong
Going to talk about the way that developers of software are getting overlapping concerns with the repository community. This isn’t software for implementing infrastructure, but software that will be stored in that infrastructure.
Software is pervasive in research now. It is in all elements of research.
The Software Sustainability Institute does a number of things at strategic and tactical levels to help create best practices in research software development.
One question is the role of software in the longer term – five and ten years on? The differences between preservation and sustainability. The former holds onto things for use later on, while the latter keeps understanding in a particular domain. The understanding, the sustainability, is the more important part here.
Several purposes for sustaining and preserving software. For achieving legal compliances (architecture models ought to be kept for the life of a building). For creating heritage value (gaining an overall understanding of influences of a creator). For continued access to data (looking back, through the lens of the software). For software reuse (funders like this one).
There are several approaches. Preserving the technology, whether it’s physical hardware or an emulated environment. Migration from one piece of software to another over time while ensuring functionality, or transitioning to something that does similar. There’s also hibernation, just making sure it can be picked apart some day if need be.
Computational science itself needs to be studied to do a good job of this. Software Carpentry teaches scientists basic programming to improve their science. One of those skills, using repositories, is an important one. Teaching scientists the exploratory process of hacking together code is the fun part, so they should get to do it.
Re-something is the new black. Reuse, review, replay, rerun, repair. But also reward. How can people be rewarded for good software contributions, the ones that other people end up using. People get pats on the back, glowing blog posts, but really reward in software is in its infancy. That’s where repositories come in.
Rewarding good development often requires publication which requires mention of the developments. That ends up requiring a scientific breakthrough, not a developmental one. Software development is a big part of science and it should be viewed/treated as such.
Software is just data, sure, but along with the Beyond Impact team these guys have been looking at software in terms of preservation beyond just data. What needs to get kept in software and development? Workflows should, because they show the boundaries of using software in a study – the dependencies and outputs of the code. Looking at code on various levels is also important. On the library/software/suite level? The program or algorithm or function level. That decision is huge. The granularity of software needs to be considered.
Versioning is another question. It indicates change, allows sharing of software, and confers some sort of status. Which versions should go in which repositories, though? That decision is based on backup (GitHub), sharing (Dryad), archiving (DSpace). Different repositories do each.
One of the things being looked at in sustaining software is the software metapaper. These are scholarly records including ‘standard’ publication, method, dataset and models, and software. This enables replay, reproduction, and reuse. It’s a pragmatic approach that bundles everything together, and peer review can scrutinize the metadata, not the software.
The Journal of Open Research Software allows for the submission of software metapapers. This leads to where the overlap in development and repositories occurred, and where it’s going.
The potential for confusion occurs when users are brought in and licensing occurs. It’s not CC BY, it’s OSI standard software licenses.
Researchers are developing more software than ever, and trying to do it better. They want to be rewarded for creating a complete scholarly record, which includes software. Infrastructure needs to enable that. And we still don’t know the best way to shift from one repository role to another when it comes to software – software repositories from backup to sharing to archival. The pieces between them need to be explored more.
Q: The inconsistency of licensing between software and data might create problems. Can you talk about that?
A: There is work being done on this, on licensing different parts of the scholarly record. Reward mechanisms and the compatibility of licenses in data and software need to be explored – which ones are the same in spirit?
Topic: The UCLA Broadcast News Archive Makes News: A Transformative Approach to Using the News in Teaching, Research, and Publication
Speaker(s): Todd Grappone, Sharon Farb
UCLA has been developing an archive since the Watergate hearings. It was a series of broadcast television recordings for a while, but now it’s digital libraries of broadcast recordings. That content is being put into a searchable, browsable interface. It will be publicly available next year. It grows about a terabyte a month (150000+ programs and counting), which pushes the scope of infrastructure and legality.
It’s possible to do program-level metadata search. Facial recognition, OCR of text on screen, closed caption text, all searchable. And almost 10 billion images. This is a new way for the library to collect the news since papers are dying.
Why is this important? It’s about the mission of the university copyright department: public good, free expression, and the exchange of ideas. That’s critical to teaching and learning. The archive is a great way to fulfill that mission. This is quite different from the ideas of other Los Angeles organizations, the MPAA and RIAA.
The mission of higher education in general is about four principles. The advancement of knowledge through research, through teaching, and of preservation and diffusion of that knowledge.
About 100 news stations being captured so far. Primarily American. International collaborators are helping, too. Pulling all broadcast, under a schedule scheme with data. It’s encoded and analyzed, then pushed to low-latency storage in H.264 (250MB/hr). Metadata is captured automatically (timestamp, show, broadcast ID, duration, and full search by closed captioning). The user interface allows search and browse.
So, what is news? Definitions are really broad. Novelties, information, and a whole lot of other stuff. The scope of the project is equally broad. That means Comedy Central is in there – it’s part of the news record. Other people doing this work are getting no context, little metadata, fewer broadcasts. And it’s a big legal snafu that is slowly untangling.
Fortunately, this is more than just capturing the news. There’s lots of metadata – transformative levels of information. Higher education and libraries need these archives for the sake of knowledge and preservation.
Q: Contextual metadata is so hard to find, and knowing how to search is hard. How about explore? How about triangulating with textual news via that metadata you do have?
A: We’re pulling in everything we can. Some of the publishing from these archives uses almost literally everything (court cases, Twitter, police data, CCTV, etc). We’re excited to bring it all together, and this linkage and exploration is the next thing.
Q: In terms of tech. development, how has this archive reflected trends in the moving image domain? Are you sharing and collaborating with the community?
A: An on-staff archivist is doing just that, but so far this is just for UCLA. It’s all standards-driven so far, and community discussion is the next step.
Topic: Variations on Video: Collaborating toward a robust, open system to provide access to library media collections
Speaker(s): Mark Notess, Jon W. Dunn, Claire Stewart
This project has roots in a project called Variations in 1996. It’s now in use at 20 different institutions, three versions. Variations on Video is a fresh start, coming from a background in media development. Everything is open source, working with existing technologies, and hopefully engaging with a very broad base of users and developers.
The needs that Variations on Video is trying to meet are archival preservation and access for all sorts of uses. Existing repositories aren’t designed for time-based media. Storage, streaming, transcoding, access and media control, and structure all need to be handled in new ways. Access control needs to be pretty sophisticated for copyright and sensitivity issues.
Existing solutions have been an insufficient fit. Variations on Video offers basic functionality that goes beyond them or does them better. File upload, transcoding, and descriptive metadata will let the repository stay clean. Navigation and structural metadata will allow users to find and actually use it all.
VoV is built on a Hydra framework, Opencast Matterhorn, and a streaming server that can serve up content to all sorts of devices.
PBCore was chosen for descriptive metadata, with an ‘Atomic’ content model: parent objects for intellectual descriptions, child objects for master files, children of these for derivatives. There’s ongoing investigation for annotation schemes.
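The three-level ‘Atomic’ model can be pictured as nested records. The sketch below is only an illustration of the parent/child/grandchild shape described above: the class and field names are hypothetical and are not taken from the project's code or from PBCore itself.

```python
from dataclasses import dataclass, field


@dataclass
class Derivative:
    """Grandchild: one encoding produced from a master file."""
    format: str


@dataclass
class MasterFile:
    """Child: one master file, owning its derivatives."""
    filename: str
    derivatives: list = field(default_factory=list)


@dataclass
class MediaObject:
    """Parent: the intellectual description (title stands in for the full record)."""
    title: str
    masters: list = field(default_factory=list)

    def all_derivatives(self) -> list:
        # Walk parent -> masters -> derivatives.
        return [d for m in self.masters for d in m.derivatives]


interview = MediaObject(
    title="Oral history interview",
    masters=[MasterFile("tape1.mov", [Derivative("h264"), Derivative("webm")])],
)
formats = [d.format for d in interview.all_derivatives()]
```

Keeping description, masters, and derivatives as separate objects means a derivative can be regenerated or replaced without touching the descriptive record.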
Release 0 was this month (upload, simple metadata, conversion), and release one will come about in December 2012. Development will be funded through 2014.
Uses Blacklight for discovery, Strobe media player for now. Other media players with more capabilities are being considered.
Variations on Video is becoming AVALON (Audio Video Archives and Libraries Online).
Using the agile Scrum approach with a single team at the university for development. Other partners will install, test, provide feedback. All documentation, code, workflow is open, and there are regular public demos. Hopefully, as the software develops, additional community will get involved.
Q: Delivering to mobile devices?
A: Yes, the formats video will transcode into will be selectable, but most institutions will likely choose a mobile-appropriate format. The player will be able to deliver to any particular device (focusing on iOS and Android).
Q: Can your system cope with huge videos?
A: That’s the plan, but ingesting will take work. We anticipate working with very large stuff.
Q: How are you referencing files internally? Filenames? Checksums? Collisions of named entries?
A: Haven’t talked about identifiers yet. UUIDs generated would be best, since filenames are a fairly fragile method. Fedora is handling identifiers so far.
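To illustrate why generated identifiers are less fragile than filenames, here is a small sketch using Python's standard `uuid` and `hashlib` modules. It is not how Fedora assigns identifiers; the `register` function and the catalog layout are hypothetical.

```python
import hashlib
import uuid


def register(catalog: dict, filename: str, content: bytes) -> str:
    """Store a file record under a random UUID and return that identifier."""
    identifier = str(uuid.uuid4())
    catalog[identifier] = {
        "filename": filename,  # kept as metadata only, never used as the key
        "sha256": hashlib.sha256(content).hexdigest(),  # detects changed content
    }
    return identifier


catalog = {}
first = register(catalog, "interview.mov", b"first recording")
second = register(catalog, "interview.mov", b"second recording")
```

Two uploads with the same filename get separate records, and the checksums show that their contents differ.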
Q: Can URLs point to specific times or segments?
A: That is an aim, and the audio project already does that.
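One existing convention for addressing a time range in a URL is the temporal fragment from the W3C Media Fragments URI specification, written as `#t=start,end` in seconds. The speakers did not say which scheme they will use, so this is only a sketch of that convention; the function name and the example URL are hypothetical.

```python
from typing import Optional


def segment_url(base_url: str, start: int, end: Optional[int] = None) -> str:
    """Append a temporal fragment (whole seconds) to a media URL."""
    if start < 0 or (end is not None and end <= start):
        raise ValueError("need 0 <= start < end")
    fragment = f"t={start}" if end is None else f"t={start},{end}"
    return f"{base_url}#{fragment}"


clip = segment_url("https://example.org/media/1", 60, 90)
from_minute_two = segment_url("https://example.org/media/1", 120)
```

Here `clip` addresses the span from 60 to 90 seconds, and `from_minute_two` addresses everything from 120 seconds onward.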