Jul 11 2012
 

Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Panel Discussion Proposal: “Effective Strategies for Open Source Collaboration”
Speaker(s): Tom Cramer, Jon William Butcher Dunn, Valorie Hollister, Jonathan Markow

This is a panel session, so it’s a little bit different. We’ve asked all of our DuraSpace experts here about collaborations they have been engaged in and will then turn to you for your experiences of collaboration – what works and what doesn’t.

So, starting with Tom. I’m going to talk about 3 different Open Source technologies we are involved in. The first of these is Blacklight, which is in use in many places. It’s a faceted search application – Ruby on Rails on top of Solr. It was originally developed at UVa around 2007 and first adopted outside UVa in 2009. It has had multiple installations, 10+ committer institutions etc.

Hydra is a framework for creating digital asset management apps to supplement Fedora. Started in 2008 in Hull, Stanford and Virginia with FedoraCommons. It’s institutionally-driven and developer-led.

And the last item is IIIF: International Image Interoperability Framework – I’ll be talking more on this later – an initiative by major research libraries across the world – a cooperative definition of APIs to enable cross-repository image collections. It’s a standards project, not a technology project.

Lessons learned…

DO: Work from a common vision; be productive, welcoming and fun; engineer face-time is essential; get great contributors – they lead to more great contributors too!

DON’T: over-plan, over-govern; establish too many cross institution dependencies; get hooked on single sources of funding.

Now over to Jon. A few collaborations. First up Sakaibrary. Sakai is an eLearning/course management tool used by dozens of institutions. There was a collaborative project between Indiana University and University of Michigan Libraries to develop extensions to Sakai and facilitate use of library resources in teaching and learning. Top down initiative from university head librarians. Mellon funding 2006-2008 (http://sakaibrary.org).

The second project is Variations on Video. This one is a collaboration between Indiana University and Northwestern University Libraries – with additional partners for testing and feedback. This is a single cross-institution team using Agile Scrum approaches.

Lessons learned from these projects… Success factors: initial planning periods – shared values and vision being established – helped very much; good project leadership and relationships between leaders important; collaborative development model. Some challenges: Divergent timelines; electronic communication vs. face-to-face – very important to meet face to face; existing community culture; shifts in institutional priorities and sustainability.

Now over to Val, Director of Community Programs for DuraSpace. Part of my role is to encourage teams to collaborate and gain momentum within the DSpace community. We are keen to get more voices into the development process. We had a DSpace Developer meeting on Monday and have made some initial tweaks, and continue to tweak, the programme. So what is the DSpace Community Advisory Team? Well, we are a group of mostly repository managers/administrators. Developers wanted help and users wanted more input. Formed in Jan 2011, 5-7 active members. DCAT helps review/refine new feature requests – to get new voices in there but also to share advice and provide developer help. We had a real mission to assess feature requests, gauge interest, and enable discussion.

Some of the successes of DCAT. We have reviewed/gathered feedback on 15+ new feature requests – 3 were included in the last release. It really has broadened the development discussion – developers and non-developers, inter/intra-institution. And it has been a useful help/resource for developers – DCAT ran a community survey and provided recommendations based on it, and gives feedback on feature implementation.

Challenges for us: no guarantee that a feature makes it in – despite everyone’s efforts features still might not make it in, because of resource limitations; continue to broaden discussion and broaden developer pool; DCAT could also be more helpful during the release process itself – to help with testing, working out bugs etc.

So the collaboration has been successful with discussion and features, but we continue to try to do better at this!

Now Jonathan is asking the panel: how important is governance in this process? How does decision making take place?

Tom: Different in different communities. And bottom up vs. top down makes a big difference. In bottom up it’s about developers working together, trusting each other, building the team but keeping code quality is challenging on a local and broader level for risk averse communities.

Jon: Governance was different between the two projects. In both cases we did have a project charter of sorts. For Sakaibrary it was more consensus based – good in some ways but maybe a bit less productive as a project as a result. In terms of prioritisation of features in the video project we are making use of the Scrum concept really, and the idea of product owners is very useful there. We try to involve the whole team but the product owner defines priorities. When we expand to other institutions with their own interests we may have to explore other ways of doing things – we’ll need to learn from Hydra etc.

Val: I think DCAT is a wee bit different. Initially this was set up between developers and DCAT and that has been an ongoing conversation. Someone taking the lead on behalf of developers was useful. And for features DCAT members tend to take the lead on a particular request or other to lead analysis etc. of it.

Q&A

Q1) In a team development effort there is great value in being able to pop into someone’s office and ask for help. And lots of decisions are made for free – a discussion happens really quickly. When collaborating, even a trivial decision can mean a 1 hr conference call. How do you deal with that?

A1 – Jon) In terms of the video project we take a couple of approaches – we use an IRC channel and Microsoft Lync for one-to-one discussion as needed. We also have a daily 15 min stand up meeting via telephone or video conference. And that agile approach with 2 week cycles means it’s not hugely costly to take the wrong approach or find we want to change something.

A1 – Tom) With conference calls we now feel that if it takes an hour we shouldn’t make that decision. A move to IRC rather than email is a problem across different time zones. Email lets you really think things through and that’s no bad thing. One member of the Blacklight community is loquacious but often answers his own questions inside of an hour! You just learn how to work together.

A1 – Jonathan) We really live on Skype and that’s great. But I miss water cooler moments, tacit understandings that develop there. There’s no good substitute for that.

 

Topic: High North Research Documents – a new thematic and global service reusing all open sources
Speaker(s): Obiajulu Odu, Leif Longva

Our next speakers are from the University of Tromso. The High North Research Documents is a project we began about six months ago. You may think that you are high in the North but we are from far arctic Norway. This map gives a different perspective on the globe, on the north. We often think of the north as the North of America, of Asia etc. but the far north is really a region of its own.

The Norwegian government has emphasized the importance of northern areas and the north is also of interest on an international level – politically and strategically; environmental and climate change issues; resource utilization; the northern sea route to the Pacific. And our university, Tromso, is the northernmost university in the world and we are concerned with making sure we lead research in the north. And we are involved in many research projects but there can be access issues. The solution is Open Access research literature and we thought that it would be a great idea to look at the metadata to extract a set of documents concerned with High North research.

The whole world is available through aggregators like OAIster (OCLC) and BASE (University of Bielefeld) and they have been harvesting OA documents across the world. We don’t want to repeat that work. We contacted the guys at Bielefeld and they were very helpful. We have been downloading their metadata locally, allowing us to analyse it as we wanted.

Our hypothesis was if we selected a set of keywords and they are in the metadata then the thematic scope of the document can be identified. So we set up a set of filtering words (keywords) applied to the metadata of BASE records based on: geographic terms; species names; languages and folks (nations); other keywords. We have mainly looked for English and Norwegian words, but there is a bigger research world out there.

The quality of keywords is an issue – are their meanings unambiguous? Labrador, for instance, for us is about Northern Canada, but it has a different meaning – farmer or peasant – in Spanish. Sami is a term for a people but it is also a common given name in Turkey and Finland! So for some keywords we filter on combinations of terms – so “sami AND language” or “sami AND people”. The filter process is applied only to selected metadata elements – title, description, subject. But it’s not perfect.
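As a rough illustration of the filtering described above, here is a minimal Python sketch. The rule structure, the field names, the example rules and the sample record are all assumptions made for this illustration; they are not the project's actual code or keyword list.

```python
import re

# Metadata elements the filter looks at (per the talk: title, description, subject).
FILTER_FIELDS = ("title", "description", "subject")

# Each rule is a set of words that must ALL occur (an AND combination).
# Hypothetical rules, for illustration only.
RULES = [
    {"sami", "language"},
    {"sami", "people"},
    {"svalbard"},
]


def tokens(record):
    """Collect lower-cased words from the selected metadata fields only."""
    words = set()
    for field in FILTER_FIELDS:
        value = record.get(field, "")
        values = value if isinstance(value, list) else [value]
        for text in values:
            words.update(re.findall(r"\w+", text.lower()))
    return words


def matching_rules(record, rules=RULES):
    """Return every rule whose words are all present in the record."""
    words = tokens(record)
    return [rule for rule in rules if rule <= words]


# Hypothetical record: "Sami" in the creator field alone would not match,
# because creator is not one of the filtered fields.
record = {
    "title": "Sami language revitalisation in northern Norway",
    "creator": "Sami Example",
    "subject": ["linguistics"],
}
hits = matching_rules(record)
```

Restricting the match to a few fields and requiring word combinations is what keeps an author called Sami from pulling an unrelated paper into the result set.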

Looking at the model we have around 36 million documents from 2150 scholarly resources. These are filtered and extracted. Records matching one subset of keywords go right into the High North Research Documents database. Another set of keywords we don’t trust as much, so records matching those go through a manual quality control first. Now over to my colleague Obiajulu.
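The two-track routing just described could be sketched along these lines. The keyword sets, the status names and the choice to let a trusted keyword win when both kinds match are assumptions for this example, not details given in the talk.

```python
# Hypothetical keyword sets, for illustration only.
TRUSTED = {"svalbard", "spitsbergen"}
NEEDS_REVIEW = {"labrador", "sami"}  # ambiguous terms


def route(matched_keywords):
    """Decide what happens to a record given the keywords it matched."""
    matched = set(matched_keywords)
    if matched & TRUSTED:
        # Assumption: one trusted keyword is enough to accept directly.
        return "accept"
    if matched & NEEDS_REVIEW:
        return "manual_review"
    return "discard"
```

For instance, `route(["labrador"])` sends the record to manual review, while `route(["labrador", "svalbard"])` accepts it directly.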

Thank you Leif. We use a series of modules in the High North System model. The Documents service itself is DSpace. The import module gets metadata records and puts them in our MySQL database. After documents are imported we have the extraction module, which applies the extraction criteria to the metadata. The Ingest module transforms metadata records relevant to the high north into DSpace XML format and imports them into a DSpace repository. And we have the option of adding custom information – including use of facets.

Our Admin Module allows us to add, edit or display all filtering words (keywords). And it allows us to edit the status of a record or records – Blacklist/reject; approved; modified. So why do we use DSpace? Well, we have used it for 8 or 9 years to date. It provides end users with both a regular search interface and faceted search/browsing. Our search and discovery interface is an extension of DSpace and it allows us to find out about any broken links in the system.

We are on High North RD v 1.1. 151,000 documents extracted from more than 50% of the sources appearing in BASE and from all over the world. Many different languages – even if we apply mainly English and Norwegian and Latin in the filtering process. Any subject but with weight on the hard sciences. And we are developing the list of keywords as a priority so we have more and better keywords.
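To give a feel for the ingest step, here is a small Python sketch that turns one metadata record into a Dublin Core XML fragment. It assumes that "DSpace XML format" means the dublin_core.xml file used in DSpace's Simple Archive Format, which the talk did not confirm; the record contents and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET


def to_dublin_core_xml(record):
    """Serialise {(element, qualifier): [values]} as a dublin_core document."""
    root = ET.Element("dublin_core")
    for (element, qualifier), values in record.items():
        for value in values:
            dcvalue = ET.SubElement(
                root,
                "dcvalue",
                element=element,
                # Unqualified fields are written with qualifier="none".
                qualifier=qualifier or "none",
            )
            dcvalue.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record that passed the High North filter.
record = {
    ("title", None): ["Sea ice and shipping on the northern sea route"],
    ("subject", None): ["sea ice", "shipping"],
    ("identifier", "uri"): ["http://example.org/record/1"],
}
xml_text = to_dublin_core_xml(record)
```

Custom information such as facet values would be added as further fields on the record before serialising.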

When we launched this we tried to get word out as far and wide as possible. Great feedback received so far. The data is really heterogeneous in quality, full text status etc. so feedback received has been great for finding any issues with access to full text documents.

Many use their repository for metadata only. That would be fine if we could identify where a record is metadata only. We could use the dc:rights but many people do not use this. How do we identify records without any full text documents? We need to weed out many non-OA records from High North RD – we only want OA documents, it’s not a bibliographic service we want to make. Looking at document types we have a large amount of text and articles/journals but also a lot of images (14-15% ish). The language distribution shows mostly English, with much smaller percentages in French, Norwegian… and other languages.

So looking at the site (http://highnorth.uit.no/). It’s DSpace and everything in it is included in a single collection. So… if I search for pollution we see 2200 results and huge numbers of keywords that can be drilled down into. You can filter by document type, date, languages etc.

And if we look at an individual record we have a clear feedback button that lets users tell us what the problem is!

Q&A

Q1) You mentioned checking quality of keywords you don’t trust, and that you have improvements coming to keywords. Are you quality checking the “trusted” keywords?

A1) When we have a problem record we can track back over the keywords and see if one of those is giving us problems. We have to do it this way.

We believe this to be a rather new method, to use keywords in this way to filter content. We haven’t come across it before, it’s simple but interesting. We’d love to hear about any other similar system if there are any. And it would be applicable to any topic.

Topic: International Image Interoperability Framework: Promoting an Ecosystem of Open Repositories and Open Tools for Global Scholarship
Speaker(s): Tom Cramer

I’m going to talk about IIIF but my colleagues here can also answer questions on this project. I think it would be great to get the open repositories community involved in this process and objectives.

There are huge amounts of image resources on the web – books, manuscripts, scrolls, etc. Loads of images and yet really excellent image delivery is hard, it’s slow, it’s expensive, it’s often very disjointed and often it’s too ugly. If you look at bright spots: Seadragon, Google Art Project, or other places with annotation or transcription, it’s amazing to see what they are doing vs. what we do. It’s like page turners a few years ago – there were loads, all mediocre. Can we do better?! And we – repositories, software developers, users, funders – all suffer because of this stuff.

So consider…

… a paleographer who would like to compare scribal hands from manuscripts at two different repositories – very different marks and annotations.

… an art and architecture instructor trying to assemble a teaching collection of images from multiple sources.

… a humanities scholar who would like to annotate a high resolution image of an historical map – lots of good tools but not all near those good resources.

… a repository manager who would like to drop a newspaper viewer with deep zoom into her site with no development or customization required.

… a funder who would like to underwrite digitization of scholarly resources and decouple content hosting and delivery.

We started last September a year long project to look at this – a group of 6 of the world’s leading libraries and Stanford. Last September we looked at the range of different image interfaces. Across our 7 sites there were 15 to 20 interfaces; including Oxford it was more like 40 or 50 interfaces. Oxford seems to have lots of legacy humanities interfaces – lovely but highly varied – hence the increase in numbers.

So we want specialised tools but a less specialised environment. So we have been working on the Parker on the Web project – a mediaeval manuscripts project with KCL and Stanford. The Roman de la Rose is similar in type. Every one of these many repositories is a silo – no interoperability. Every one is a one-off – big overhead to code and keep. And every user is forced to cope – many UIs, little integration, no way to compare one resource with another. They are great for researchers who fed into the design but much less useful for others.

Our problem is we have confused the roles and responsibilities of the stakeholders here. We have scholars who want to find, use, analyze, annotate. They want to mix and match, they want best of breed tools. We have toolers, who build useful tools and apps and want users and resources. And we have the repositories who want to host, preserve and enrich records.

So for the Parker project we had various elements managed via APIs. We have the T-PEN transcription tool. We sent T-PEN a hard drive full of TIFFs to work on. The Dictionary of Old English couldn’t take a big file of TIFFs but we gave them access to the database. We also had our own app. So our data fed into three applications here and we could have taken the data on some round trips – adding annotations before being fed into the database. And by taking those APIs into a Framework and up into an Ecosystem we could enable much more flexible solutions – ways to view resources in the same environment.

So we began some DMS Tech work. We pulled together technologists from a dozen institutions to look at best tools to use, best adaptations to make etc. and we came up with basic building blocks for the ecosystem: an image delivery API (specced and built); a data model for medieval manuscripts (M3/SharedCanvas) – we anticipate people wanting to page through documents, and for this type of manuscript the page order, flyleaves, inserts etc. are quite challenging; support for authentication and authorization – it would be great if everything was open and free but realistically it’s not; reference implementations of a load balanced, performant Djatoka server – this seemed to be everyone’s page turning software solution of choice; an interactive open source page turning and image viewing application; OAC-compatible tools for annotation (Digital Mappaemundi) and transcription (T-PEN).

We began the project last October, and some work is already available. The DMS Index pulls data from remote repositories and you can explore in a common way as the data is structured in a common way. You can also click to access annotation tools in DM, or to transcribe the page in T-PEN etc. So one index lets you explore and interact with this diverse collection of resources.

At the third DMS meeting we started wondering: if this makes sense for manuscripts, doesn’t it make sense for other image materials? IIIF basically takes the work of DMS and looks at how we can bring it to the wider world of images. We’ve spent the last 8 or 9 months putting together the basic elements. So there is a RESTful interface to pick up an image from a remote location. We have a draft version of the specification available for comment here: http://library.stanford.edu/iiif/image-api. What’s great is the possibility to bring functionality on images into your environment that you don’t already offer but would like to. Please do comment on the 0.9 draft; you have until 4pm Saturday (Edinburgh time).
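To make the RESTful idea concrete, here is a minimal Python sketch that builds an image request URL. It assumes the identifier/region/size/rotation/quality.format path pattern used in the published versions of the IIIF Image API; the 0.9 draft under discussion may differ in detail, and the server address, identifier and function name are hypothetical.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="full",
                   rotation="0", quality="native", fmt="jpg"):
    """Build an image request URL following the assumed IIIF path pattern."""
    # The identifier is percent-encoded so that slashes or spaces inside it
    # are not mistaken for path separators.
    safe_id = quote(identifier, safe="")
    return (f"{base.rstrip('/')}/{safe_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


# Hypothetical server and identifier: a 500x500 pixel detail starting at
# (100, 100), scaled to 250 pixels wide and rotated by 90 degrees.
url = iiif_image_url(
    "http://example.org/images/",
    "ms 001/f1r",
    region="100,100,500,500",
    size="250,",
    rotation="90",
)
```

Because every parameter lives in the URL path, a viewer hosted at one institution can request tiles or details from any repository that answers this pattern, which is the cross-repository behaviour the talk is after.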

The thing about getting images into a common environment is that you need metadata. We want and need to focus on just what the key metadata needs to be – labels, title, sequence, attribution etc. This is based on http://shared-canvas.org (a synthesis of OAC (Open Annotation Collaboration) and DMS).

From a software perspective we are not doing software development but we hope to foment lots of software development. So we have thought of this in terms of tiers for sharing images. Lots of interest in Djatoka, the IIIF Image API and then sets of tools for deep panning, zooming, rotating etc. And then moving into domain and modality specific apps. And so we have a wish list for what we want to see developed.

This was a one year planning effort – Sept 2011 – Aug 2012. We will probably do something at DLF as well. We have had three workshops. We are keen to work with those who want to expose their data in this sort of way. Just those organisations in the group have millions of items that could be in here.

So… What is the collective image base of the Open Repository community? What would it take to support IIIF APIs natively from the open repository platforms? What applications do you have that could benefit from IIIF? What use cases can you identify that could and should drive IIIF? What should IIIF do next? Please do let us know what we could do or what you would like us to do.

Useful links: IIIF: http://lib.stanford.edu/iiif; DMS Interop: http://lib.stanford.edu/dmm; Shared-canvas: http://shared-canvas.org.

Q&A

Q1) Are any of those tools available, open source?

A1) T-PEN and DM are probably available. Both Open Source-y. Not sure if the code is distributed yet. Shared Canvas code is available but not easy to install.

Q2) What about Djatoka and an improved, less buggy version?

A2) There is a need for this. Any patches or improvements would be useful. There is a need and no-one has stepped up to the plate yet. We expect that as part of IIIF that we will publish. The national library of Norway rewrote some of the coding in C, which improved performance three-fold. They are happy to share this. It is probably open source but hard to find the code – theoretically open source.

And with that we are off to lunch…

Posted July 11, 2012 at 10:03 am in LiveBlog, Updates