Jul 11 2012

Today we are liveblogging from the OR2012 conference at Lecture Theatre 4 (LT4), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.

If you are following the event online please add your comment to this post or use the #or2012 hashtag.

This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.

Topic: Built to Scale?
Speaker(s): Edwin Shin

I’m going to talk about a project I recently worked on with a really high volume of reads vs. writes: 250 million records – the largest Blacklight Solr application. It only took a couple of days with Solr to index these, but for reasonable query performance thresholds things get more complex. The records were staged in a relational database (Postgres), at around 1KB/record (bibliographic journal data). There are some great documented examples out there that helped us. And we had a good environment – 3 servers, each with 12 physical cores and 100GB RAM. Just moving all that data out of Postgres – 80GB compressed – took a long time. The rate of ingest for the first 10K records, if constant, suggested that all 250 million could be indexed in under a day, but performance really slowed down after those first records.

We assigned 32GB of heap to the JVM – we found RAM had more impact than CPU. We switched to Java 7. We added documents in batches of 1000, stopped forcing commits, and only committed every 1 million documents. So in the end we indexed, to the level we wanted, 250 million documents in 2.5 days. We were pretty happy with that.

Querying – we were working with 5 facets (Format, Journal, Author, Year and Keywords) and 7 queryable fields. Worst case was around a minute. Too slow. So we optimised querying by running optimize after indexing and adding newSearcher and firstSearcher event handlers. But it was still slow. So we started looking at sharding: 12 shards across 2 servers – 3 Tomcat instances per server, each Tomcat with 2 shards. This means splitting the index across machines, and Solr is good at letting you do that and search all the shards at once. Our worst case query dropped from 77 seconds to 8 seconds.
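Distributed search in Solr works by passing a `shards` parameter listing every shard core to query; a minimal sketch of building it for a layout like the one described (hostnames and core names are illustrative, not from the talk):

```python
def shards_param(instances, shards_per_instance=2):
    """Build Solr's distributed-search `shards` parameter from a list
    of host:port Tomcat instances, each hosting several shard cores."""
    return ",".join(
        f"{inst}/solr/shard{n}"
        for inst in instances
        for n in range(shards_per_instance)
    )

# 2 servers x 3 Tomcat instances each, 2 shards per Tomcat = 12 shards
instances = [f"server{s}:{8080 + t}" for s in (1, 2) for t in range(3)]
```

A query sent to any one core with `shards=<this list>` is then fanned out to all 12 shards and the results merged.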

But that’s still too slow. We noticed that filterCache wasn’t being used much – it needed to be bigger. Each shard had about 3 million unique keyword terms cached, but we hadn’t changed the default size of 512, so we bumped it to about 40,000. We also removed facets with large numbers of unique terms (e.g. keywords). Worst case queries were now down to less than 2 seconds.

The general theme is that there was no one big thing we did or could do, it was about looking at the data we were dealing with and making the right measures for our set up.

We recently set up a Hydra installation, again with a huge volume of data. We needed to set up ingest/update queues with a variable number of “worker” threads. It became clear that Fedora was the bottleneck for ingest. Fedora objects were created programmatically rather than by ingesting FOXML documents, which is slower; the FOXML route would have been fast but would have caused problems down the road – less flexibility, etc. Solr performed well and wasn’t a bottleneck. But we got errors and data corruption in Fedora when we had 12–15 concurrent worker threads. What was pretty troublesome was that we could semi-replicate this in staging, but we couldn’t get a test case and never got to the bottom of it. So we worked around it… and decided to “shard” a standalone Fedora repository. That’s not natively supported, so you have to do it yourself. Sharding is handled by ActiveFedora, using a simple hashing algorithm – much like the one Fedora uses internally for distributing files – to assign objects to shards. We started with just 2 shards. We get, on average, a pretty even distribution across the Fedora repositories, and this more or less doubled ingest performance without any negative impact.
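A minimal sketch of hash-based shard routing in that spirit (this is not ActiveFedora's exact algorithm, just the idea: hash the object's PID to pick a stable home repository):

```python
import hashlib

def pick_shard(pid, shard_urls):
    """Route a Fedora object to a shard by hashing its PID.
    A stable hash gives each PID a fixed home repository and a
    roughly even spread of objects across the shards."""
    digest = hashlib.md5(pid.encode("utf-8")).hexdigest()
    return shard_urls[int(digest, 16) % len(shard_urls)]
```

Because the routing is deterministic, both ingest and later reads land on the same repository without any lookup table.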

Another project, from the end of last year: 20 million digital objects, 10–39 read transactions per second, 24/7. High availability required – no downtime for reads, and no more than 24 hours downtime for writes. A very challenging set up.

So, the traditional approach for high uptime is the Fedora Journaling Module, which allows you to ingest once and replicate to many “follower” installations. Journaling is proven; it’s a simple and straightforward design, and every follower is a full redundant node. But that’s also the weakness: every follower is a full, redundant node, so huge amounts of data and computationally expensive processes land on EVERY node – expensive in terms of time, storage and traffic. And this approach assumes a Fedora-centric architecture; if you have a complex set up with other components, it’s more problematic still.

So we modelled the journaling and looked at what else we could do. We set up an ingest that was replicated, but fed out to a shared Fedora file system and into the nodes without doing FULL journaling.

Then there are backups, upgrades and disaster recovery – with 20 million digital objects. The classic argument for Fedora is that you can always rebuild, but in a disaster that could take months here. We found that most users used new materials – items from the last year – so we did some work around that to make the disaster recovery process faster.

Overall, the general moral of the story is that you can only make these types of improvements if you really know your data and your set up.

Q1) What was the garbage collector you mentioned?

A1) The G1 garbage collector that comes with Java 7.

Q2) Have you played with the Chaos Monkey idea? Netflix replicates across all its servers and randomly stops machines to train the programming team to deal with that issue. It’s a neat idea.

A2) I haven’t played with it yet, I’ve yet to meet a client who would let me play with that but it is a neat idea.

Topic: Inter-repository Linking of Research Objects with Webtracks
Speaker(s): Shirley Ying Crompton, Brian Matthews, Cameron Neylon, Simon Coles

Shirley is from STFC – the Science and Technology Facilities Council. We run large facilities for researchers and manage a huge amount of data every year; my group runs the e-Infrastructure for these facilities – including the ICAT Data Catalogues, the E-publications archive and the Petabyte Data Store. We also contribute to data management, data preservation etc.

Webtracks is a joint programme between STFC and the University of Southampton: Web-scale link TRACKing for research data and publications. Science on the web increasingly involves diverse data sources, services and objects, ranging from raw data from experiments through to contextual information, lab books, derived data, and research outputs such as publications and protein models. When data moves from research facility to home institution to the web, we lose the whole picture of the research process.

Linked data allows us to connect up all of these diverse areas. If we allow repositories to communicate, then we can capture the relationships between research resources in context. It allows different types of resources to be linked within a discipline – linking a formal publication to on-line blog posts and commentary – and annotation can be added to facilitate intelligent linking. It allows researchers to annotate their own work with materials outside their own institution.

Related protocols here: Trackback (tracking distributed blog conversations, with fixed semantics) and Semantic Pingback (an RPC protocol, peer-to-peer).

In Webtracks we took a two-pronged approach: an inter-repository communication protocol and a Restlet Framework implementation. The InteRCom protocol allows repositories to connect and describe their relationship (e.g. cito:isCitedBy). InteRCom is a two-stage protocol like Trackback: first, harvesting of resources and metadata; then a pinging process to post the link request. The architecture is based on the Restlet Framework (with a data access layer, app-specific config – security, encoding, tunneling – and a resource wrapper). This has to accommodate many different institutional policies – whitelisting, pingback (and checking whether a request is genuine), etc. Lastly you have to implement the resource cloud to expose the appropriate links.

Webtracks uses a Resource Info Model: a repository is connected to a resource, which is connected to a link, and each link has a subject, predicate and object. The link can be updated and tracked automatically using HTTP. We have two exemplars using Webtracks: the ICAT investigation resource – a DOI landing page plus an HTML representation with RDFa, so both machine- and human-readable versions – and EPubs, set up much like ICAT.
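The subject–predicate–object link model amounts to a single RDF statement; a sketch of serializing one in N-Triples with the CiTO vocabulary (the example URIs are illustrative, not from the talk):

```python
def link_triple(subject_uri, object_uri,
                predicate="http://purl.org/spar/cito/isCitedBy"):
    # One N-Triples statement recording a citation link between
    # two research resources held in different repositories.
    return f"<{subject_uri}> <{predicate}> <{object_uri}> ."
```

A pinged repository that accepts the link request would store and expose a statement like this on its links page.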

InteRCom citation linking – we can see the ICAT DOI landing page linking to the EPubs expression links page. That ICAT DOI also links to the ICAT investigation links page, which in turn links to the EPubs expression page, and that expression page feeds back into the EPubs expression links page.

Using the Smart Research Framework we have integrated services to automate prescriptive research workflows – attempting to preemptively capture all of the elements that make up the research project, including policy information, so the researcher can concentrate on their core work. That process will be triggered at STFC and will capture citation links along the way.

To summarise, Webtracks provides a simple but effective mechanism to propagate citation links and build a linked web of data. It links diverse types of digital research objects, restoring context to dispersed digital research outputs. There are no constraints on link semantics or metadata. It’s peer-to-peer, does not rely on a centralised service, and is a highly flexible approach.

Topic: ResourceSync: Web-based Resource Synchronization
Speaker(s): Simeon Warner, Todd Carpenter, Bernhard Haslhofer, Martin Klein, Nettie Legace, Carl Lagoze, Peter Murray, Michael L. Nelson, Robert Sanderson, Herbert Van de Sompel

Simeon is going to talk about resource synchronization. We are a big team and have funding from the Sloan Foundation and from JISC. I’m going to talk about discussions we’ve been having on the ResourceSync project, looking at replication of web material… it sounds simple but…

So… synchronization of what? Web resources – things with a URI that can be dereferenced and are cache-able. Hidden in that is something about support for different representations and content negotiation. No dependency on underlying OS, technologies etc. From small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources) – we want this to scale properly. And then there is the rate of change – slow (weeks/months) for an institutional repository, perhaps, or very fast (seconds) for a set of linked data URIs – and the question of what latency is acceptable. And we want this to work on/via/native to the web.

Why do this? Because lots of projects do synchronization, but case by case – the project team members have been involved in many of them. Lots of us have experience with OAI-PMH; it’s widely used in repositories, but it’s XML metadata only and web technologies have moved on hugely since 1999. There are loads of use cases here with very different needs. After much discussion we decided some use cases were in scope and some were not. The out-of-scope-for-now list: bidirectional synchronization; destination-defined selective synchronization (query); special understanding of complex objects; bulk URI migration; diffs (hooks?) – we understand this will be important for large objects, but there is no way to do it without knowing media types; intra-operation event tracking; content tracking.

So a use case: DBpedia Live duplication. 20 million entries, updated about once per second. We need push technology – we can’t be polling this all the time.

Another use case: arXiv mirroring. 1 million article versions, with about 800 created each day and updated at 8pm US Eastern time; metadata and full text for each article. Accuracy is very important, and we want a low barrier for others to adopt. This works now, but currently uses rsync, which is tied to one authentication regime.

Terminology here:

  • Resource – an object to be synchronized, a web resource
  • Source – system with the original or master resources
  • Destination – the system resources are synchronized to
  • Pull
  • Push
  • Metadata – information about resources such as URI, modification time, checksum etc. Not to be confused with metadata that ARE resources.

We believe there are 3 basic needs to meet for synchronization. (1) Baseline synchronization – a destination must be able to perform an initial load or catch-up with a source (to avoid out-of-band setup, and to provide discovery). (2) Incremental synchronization – a destination must have some way to keep up to date with changes at a source (subject to some latency; minimally create/update/delete). (3) Audit – it should be possible to determine whether a destination is synchronized with a source (subject to some latency; we want efficiency –> HTTP HEAD).

So two approaches to baseline synchronization: get an inventory of resources and copy them one by one via HTTP GET, or get a dump of the data and extract the metadata. For auditing we could do a new baseline synchronization and compare, but that is likely to be very inefficient; we can optimize by getting an inventory and comparing it with the destination copy – using timestamps, digests etc. smartly – with a latency issue to consider again. Then there is incremental synchronization. The simplest method is to audit, then copy all new/updated resources and remove deleted ones. We can optimize this by changing the communication – exchanging a ChangeSet listing only updates; by resource transfer – exchanging dumps for ChangeSets or even diffs; and with a change memory.
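The audit-then-copy step can be sketched by comparing inventories mapping URI to last-modified timestamp (a simplification of the metadata – timestamps, digests – the framework would actually carry):

```python
def diff_inventories(source, destination):
    """Classify what an incremental sync must do, given source and
    destination inventories mapping URI -> last-modified timestamp."""
    to_create = sorted(u for u in source if u not in destination)
    to_update = sorted(u for u in source
                       if u in destination and source[u] > destination[u])
    to_delete = sorted(u for u in destination if u not in source)
    return to_create, to_update, to_delete
```

An empty result on all three lists is exactly the audit condition: the destination is in sync with the source (up to the latency of the inventories themselves).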

We decided, for simplicity, to use pull, though some applications may need push. And we wanted to start from the simplest idea of all: Sitemaps as an inventory. So we have a framework based on Sitemaps. At level 0, the base level, you publish a sitemap and someone can grab all of your resources; a simple feed of URLs and last modification dates lets us track changes. The Sitemap format was designed to allow extension – it’s deliberately simple and extensible. There is an issue of size: the structure handles up to 2.5 billion resources before further extension is required. Should we try to make this look like RDF? We think not, but the Sitemap information can be mapped to RDF.
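A level-0 inventory is just a plain Sitemap of URLs with last-modification dates; a minimal sketch of serializing one (example URI invented):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def sitemap(resources):
    """Serialize an inventory (uri -> lastmod string) as a Sitemap
    <urlset>, the level-0 building block of the framework."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for uri, lastmod in resources.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = uri
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")
```

The size ceiling mentioned above falls out of the Sitemap protocol's own limits: 50,000 URLs per sitemap and 50,000 sitemaps per sitemapindex, i.e. 2.5 billion resources before further extension is needed.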

At the next level we have a ChangeSet. This reuses the Sitemap format but includes information only for change events over a certain period. To get a sense of how this looks we tried it with arXiv. Baseline synchronization and audit: 2.3 million resources (300GB); 46 sitemaps and 1 sitemapindex (50k resources/sitemap).

But what if I want a push application that will be quicker? We are trying out XMPP (as used by Twitter etc.), for which there is lots of experience and there are libraries to work with. This model is about rapid notification of change events via XMPP push. It was trialled with DBpedia Live: the LANL Research Library ran a significant-scale experiment, pushing notifications for the DBpedia Live database from Los Alamos to two remote sites using XMPP.

One thing we haven’t got to yet is dumps. Two thoughts so far: a Zip file with a Sitemap – a simple and widely used format, but a custom solution – or WARC, the Web ARChiving format, designed for just this purpose but not widely used. We may end up doing both.

A rather extended and concrete version of what I’ve said will be made available real soon now. The first draft of the sitemap-based spec is coming in July 2012. We will then publicize it and want your feedback, revisions, experiments etc. by September 2012. And hopefully we will have a final specification in August.


Q1) Wouldn’t you need to make a huge index file for a site like ArXiv?

A1) It depends on what you do. I have a program to index arXiv on my own machine and it takes an hour, but that’s a simplified process – I tested the “dumb” way; I’d do it differently on the server. And arXiv is in a Fedora repository, so you already have that list of metadata to find changes.

Q2) I was wondering as you were going over the SiteMap XML… have you considered what to do for multiple representations of the same thing?

A2) It gets really complex. We concluded that multiple representations with same URI is out of scope really.

Q3) Can I make a comment – we will soon be publishing use cases, probably on a wiki, probably on GitHub, and I would ask people to look at those and give us feedback.

