Today we are liveblogging from the OR2012 conference at Lecture Theatre 5 (LT5), Appleton Tower, part of the University of Edinburgh. Find out more by looking at the full program.
If you are following the event online please add your comment to this post or use the #or2012 hashtag.
This is a liveblog so there may be typos, spelling issues and errors. Please do let us know if you spot a correction and we will be happy to update the post.
Topic: Augmenting open repositories with a social functions ontology
Speaker(s): Jakub Jurkiewicz, Wojtek Sylwestrzak
The project began with ontologies, motivated by SYNAT project. The project requires specific social functions, and particular ontologies are used to analyze them as completely as possible.
This particular project started in 2001, aiming to create a platform for integrated digital libraries from a custom platform hosting the Virtual Library of Science. BWMeta was used so that different versions of the metadata schema could be put into place.
The Virtual Library has 9.2 million articles, mostly full text from journals, but also traditional library content. That traditional content creates problems with search because it is not all digitized.
SYNAT brings together 16 leading Polish research institutions, and the platform aims to manage all of this data in a way that users can interact with well – all ultimately using BWMeta 2.
In Poland an open mandate initiative requires the project to have the capacity to host open licensed data, and allow authors to publish their works (papers, data, etc). Support for restricted access content is also included, with a ‘moving wall’ for embargoed works – content is stored in the repository, and it will switch from closed to open access after a pre-decided time.
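As a rough illustration of the moving wall idea (not something shown in the talk), the access decision can be reduced to comparing today's date with a stored release date. The record type and field names below are hypothetical and are not taken from the SYNAT platform.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DepositedWork:
    """Hypothetical record; field names are illustrative, not SYNAT's schema."""
    title: str
    deposited_on: date
    open_from: Optional[date] = None  # None means no embargo: open on deposit


def access_level(work: DepositedWork, today: date) -> str:
    """Return "open" once the moving wall has passed, otherwise "closed"."""
    if work.open_from is None or today >= work.open_from:
        return "open"
    return "closed"


paper = DepositedWork("Example paper", date(2012, 7, 1), open_from=date(2013, 7, 1))
print(access_level(paper, date(2012, 12, 1)))  # closed
print(access_level(paper, date(2013, 7, 1)))   # open
```

Because the content is already stored in the repository, nothing has to be moved when the date passes; only the answer to the access check changes.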
Social functions of SYNAT…
Users can discuss the resources, organize amongst themselves into smaller groups, share reading lists, follow the activities of other users or organizations (published content, comments, conferences, etc). This is all part of the original project aims and goals.
Analysis of social functions was based upon some prior work for efficiency. Bibliographic elements would use Dublin Core and BIBO, and Friend of a Friend (FOAF) was used as well. All of the different objects (users, metadata fields, and so on) have been mapped to particular ontologies.
People on the platform can be users, authors. The assessment makes particular note of the fact that persons and users are different – there are likely more users than people involved with the platform. Also, each can be connected to specific works and people or not depending on user preference. People will have published objects as well as unofficial posts (forum thread, comment). Published objects can be related to events based on whether they were published in or because of said event.
So, objects include user profiles, published objects, forum activity and posts, groups, events. These are all related to one another using predicates (of, by, with). This model then satisfies the requirements of the project aims and goals.
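One way to picture objects related by predicates is as a list of subject, predicate, object statements that can be queried by pattern. This is only a sketch: the identifiers and predicate names below are made up for illustration and are not the ones used in the SYNAT ontology.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Illustrative statements; identifiers and predicate names are invented.
triples: List[Triple] = [
    ("user:anna", "memberOf", "group:reading-club"),
    ("user:anna", "authorOf", "post:42"),
    ("post:42", "commentsOn", "article:7"),
    ("article:7", "presentedAt", "event:or2012"),
]


def match(s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return every triple matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


print(match(s="user:anna"))    # everything stated about one user
print(match(p="presentedAt"))  # every object linked to an event
```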
It is important to point out how previous work was reused from existing ontologies. Reuse simplifies the analysis process and makes it more precise because of reiteration. There is also easy RDF export from the final system for comparability, though RDF is not used for storing in the database.
In the future, these social analysis functions will be implemented.
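To give a feel for what an RDF export built on reused vocabularies might look like, here is a small sketch that writes N-Triples by hand. The FOAF and Dublin Core Terms namespaces are the real ones; the `http://example.org/` identifiers, the sample data, and the helper names are assumptions for illustration, not the project's actual export code.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
DCTERMS = "http://purl.org/dc/terms/"
BASE = "http://example.org/"  # placeholder namespace for local identifiers


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain string literal, escaping backslashes, quotes and line breaks."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(statements) -> str:
    """Serialise (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in statements)


statements = [
    (iri(BASE + "person/1"), iri(FOAF + "name"), literal("Example Author")),
    (iri(BASE + "article/7"), iri(DCTERMS + "title"), literal('A "quoted" title')),
    (iri(BASE + "article/7"), iri(DCTERMS + "creator"), iri(BASE + "person/1")),
]
print(to_ntriples(statements))
```

A production export would more likely use an RDF library than string formatting; the point here is only that objects stored in another form can be mapped to shared vocabularies at export time.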
Q: Has ontology work been shared with the content suppliers for whatever purpose? Do they think it will add value?
A: They aren’t disinterested, but it isn’t something they are interested in for themselves. They are glad it is offered as part of the service.
Topic: Microblogging Macrochallenges for Repositories
Speaker(s): Leslie Carr, Adam Field
This all comes out of sociological work on the London riots. He’d done analysis with interviews and, most importantly, videos posted on Twitter via YouTube by passersby. People took these videos offline quite quickly out of fear of retribution – this meant going back to gather data was difficult.
Les is running a web science and social media course now, with a lot of emphasis on Twitter. It provides a good understanding of group feeling, given the constraints of Twitter. Why not extend repositories to make Twitter useful in that area?
The team built a harvester, which connects to the Twitter search API for now. No need for authentication, no ‘real’ API per se, but it works alright. You can only go back 1500 tweets per search, but that has been enough. The Search API is hit every 20 minutes. This was to be preserved for the sake of the system itself, but other people came out to share their own harvested tweets. There are coding benefits and persistent resource benefits.
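A polling harvester like this ends up seeing the same tweets more than once, and a poll that comes back full with nothing familiar in it suggests tweets were missed in between. The sketch below shows one possible way to merge polls and flag that case. It does not call any Twitter API; the function name, the tweet dictionaries, and the gap heuristic are all assumptions, with only the 1500 limit taken from the talk.

```python
from typing import Dict, Iterable, Tuple

PAGE_CAP = 1500  # how far back one search could reach, per the talk


def ingest(store: Dict[int, dict], batch: Iterable[dict],
           cap: int = PAGE_CAP) -> Tuple[int, bool]:
    """Merge one poll into the store, keyed by tweet id.

    Returns (number of new tweets, possible_gap). A gap is suspected when
    the poll is full and none of its tweets were already stored.
    """
    batch = list(batch)
    overlap = any(tweet["id"] in store for tweet in batch)
    added = 0
    for tweet in batch:
        if tweet["id"] not in store:
            store[tweet["id"]] = tweet
            added += 1
    possible_gap = len(batch) >= cap and not overlap
    return added, possible_gap


# Tiny demonstration with an artificially small cap of 3.
store: Dict[int, dict] = {}
print(ingest(store, [{"id": 1}, {"id": 2}], cap=3))             # (2, False)
print(ingest(store, [{"id": 2}, {"id": 3}], cap=3))             # (1, False)
print(ingest(store, [{"id": 7}, {"id": 8}, {"id": 9}], cap=3))  # (3, True)
```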
Tweets would be documents living under an EPrint, those documents themselves being XML files rendered into HTML. Unfortunately, this did not scale well: thousands of XML and HTML files ended up under one EPrint. When that system checked file sizes, it would take 30 minutes to render the information – which might as well be broken.
Tweets are quite structured data beyond just the text inside. Stored separately, the other fields make a very rich database. Treating tweets as first class objects in relation to a TweetStream makes them even more valuable.
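To illustrate how much structure sits in a single tweet, here is a sketch that pulls hashtags, mentions and links out of the text into separate fields. The `Tweet` class, the sample tweet, and the deliberately simple regular expressions are assumptions for illustration; they are not the EPrints data model, and real tweet metadata carries many more fields.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tweet:
    """Hypothetical first-class tweet object with fields derived from the text."""
    user: str
    text: str
    hashtags: List[str] = field(default_factory=list)
    mentions: List[str] = field(default_factory=list)
    urls: List[str] = field(default_factory=list)


def parse_tweet(user: str, text: str) -> Tweet:
    """Extract entities with simple patterns (good enough for a sketch only)."""
    urls = re.findall(r"https?://\S+", text)
    # Remove links first so a '#' fragment inside a URL is not read as a hashtag.
    without_urls = re.sub(r"https?://\S+", " ", text)
    return Tweet(
        user=user,
        text=text,
        hashtags=[tag.lower() for tag in re.findall(r"#(\w+)", without_urls)],
        mentions=re.findall(r"@(\w+)", without_urls),
        urls=urls,
    )


tweet = parse_tweet(
    "example_author",
    "Enjoying the repository session with @example_user #OR2012 #eprints http://t.co/abc123",
)
print(tweet.hashtags)  # ['or2012', 'eprints']
print(tweet.mentions)  # ['example_user']
print(tweet.urls)      # ['http://t.co/abc123']
```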
Live demo on screen of EPrints analysing OR2012 tweets. Who’s been talking and how many things have they said. Which hashtags, which people are discussed? What links are shared? What frequency of tweets per day? All exportable as JSON, CSV, HTML.
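The kind of summary shown in the demo (who tweeted most, which hashtags, how many tweets per day) can be approximated with a few counters once tweets are stored as structured records. This is a sketch over invented sample data, not the EPrints export code; the record layout and output field names are assumptions.

```python
import csv
import io
import json
from collections import Counter

# Invented sample records standing in for harvested tweets.
tweets = [
    {"user": "alice", "day": "2012-07-10", "hashtags": ["or2012"]},
    {"user": "bob", "day": "2012-07-10", "hashtags": ["or2012", "eprints"]},
    {"user": "alice", "day": "2012-07-11", "hashtags": ["or2012"]},
]

by_user = Counter(t["user"] for t in tweets)
by_tag = Counter(tag for t in tweets for tag in t["hashtags"])
by_day = Counter(t["day"] for t in tweets)

# JSON export of the whole summary.
summary = {
    "top_users": by_user.most_common(),
    "top_hashtags": by_tag.most_common(),
    "tweets_per_day": dict(sorted(by_day.items())),
}
print(json.dumps(summary, indent=2))

# CSV export of the per-day frequency table.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["day", "tweets"])
for day, count in sorted(by_day.items()):
    writer.writerow([day, count])
print(buffer.getvalue())
```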
There are limitations of the repository – EPrints is designed for publications, not millions of little objects.
Problems with harvesting. URLs are shortened with wrappers, now t.co. The system has trouble resolving all of these redirects, but where a link ends up is enormously important. Following a link, on average, takes 1 second. That’s a huge number with so much content. MySQL processing has also created some limitations, but those have been largely worked around – this took a great deal of optimization with complex understanding of the backend. A third problem was the Twilight problem: popular topics will spike to over 1500 tweets per harvest, and so a lot is missed. This could be overcome with the streaming API, but there are real time issues with using that. Quite complex.
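To put the one second per link in perspective: a million shared links would take 1,000,000 seconds, or roughly eleven and a half days, if each were followed separately. One mitigation is to remember every hop already resolved, so that popular links are only followed once. The sketch below is an assumption about how that could look, not the harvester's actual code; `next_hop` is a hypothetical stand-in for whatever makes the HTTP request and reads the redirect target, and here it is faked with a dictionary so the example runs offline.

```python
from typing import Callable, Dict, List, Optional


def resolve(url: str,
            next_hop: Callable[[str], Optional[str]],
            cache: Dict[str, str],
            max_hops: int = 10) -> str:
    """Follow redirects to the final URL, caching the result for every hop seen.

    next_hop returns the redirect target, or None when the URL is final.
    If max_hops is reached, the last URL reached is treated as final.
    """
    seen: List[str] = []
    current = url
    while current not in cache and len(seen) < max_hops:
        seen.append(current)
        target = next_hop(current)
        if target is None or target in seen:  # final URL, or a redirect loop
            break
        current = target
    final = cache.get(current, current)
    for hop in seen:
        cache[hop] = final
    return final


# Offline stand-in for the network: two short links wrapping the same page.
REDIRECTS = {
    "http://t.co/a": "http://bit.ly/b",
    "http://t.co/c": "http://bit.ly/b",
    "http://bit.ly/b": "http://example.org/page",
}
lookups: List[str] = []


def fake_next_hop(url: str) -> Optional[str]:
    lookups.append(url)  # count how many "requests" were made
    return REDIRECTS.get(url)


cache: Dict[str, str] = {}
print(resolve("http://t.co/a", fake_next_hop, cache))  # http://example.org/page
print(resolve("http://t.co/c", fake_next_hop, cache))  # http://example.org/page
print(len(lookups))  # 4: three for the first link, one for the second
```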
The future. Dealing with URL wrappers. Dealing with the unboundedness of the data – there is so much that optimizations will not be able to keep up. A new strategy for the magnitude problem has to be puzzled out. Potential archival of other content: YouTube videos and comments, Google results over time.
This Twitter harvester is available online for EPrints – lightweight harvesting for nontechnical people.
There are large-scale programs for this already, but some people (master’s and doctoral students) need smaller and more accessible tools.
Q: Why are you still using EPrints? It seems like there are a lot of hacks, and you would have been better off using a small application directly over MySQL.
A: EPrints is a mature preservation platform. Easy processing now is not the best thing for the long term. Repositories are supposed to provide that long-term preservation, so the challenges should be met rather than avoided.
Topic: Beyond Bibliographic Metadata: Augmenting the HKU IR
Speaker(s): David Palmer
At universities in Hong Kong, more knowledge exchange was desired to enable more discovery. Then, theoretically, innovation and education indicators would improve. The office of knowledge exchange chose the institutional repository, built on DSpace and developed with CILEA. The common comment on this work after it was first implemented was that part of the picture was missing.
Getting past thin and dirty metadata was a goal, along with augmenting metadata in general: profiles, patents, projects.
Publication data is pushed from HKU’s central research database, the Research Output System, filled by authors or assistants. Needs much better metadata. Now trying to get users to work with EndNote, DOI, ISBN so that cleaner metadata comes in.
Bibliographic rectification via merges or splits of author profiles, with a user API and robots. This has worked quite well.
Search and scrape of the database starts with numbers (DOI, ISBN, etc), then searches for strings (title, author). Each entry pulls citations and, if available, impact factors.
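The numbers-first, strings-second order can be sketched as a two-stage lookup: try exact identifiers, and only fall back to a normalised title and author comparison when no identifier matches. The record layout, the function names, and the sample identifiers below are invented for illustration; the HKU system's actual matching rules were not described in detail.

```python
import re
from typing import Dict, List, Optional

Record = Dict[str, str]


def normalise(text: str) -> str:
    """Lower-case and collapse punctuation and whitespace for string comparison."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_match(record: Record, candidates: List[Record]) -> Optional[Record]:
    """Match on DOI or ISBN first; fall back to title plus author strings."""
    for key in ("doi", "isbn"):
        value = record.get(key)
        if value:
            for candidate in candidates:
                if candidate.get(key, "").lower() == value.lower():
                    return candidate
    title = normalise(record.get("title", ""))
    author = normalise(record.get("author", ""))
    if not title:
        return None
    for candidate in candidates:
        if (normalise(candidate.get("title", "")) == title
                and normalise(candidate.get("author", "")) == author):
            return candidate
    return None


# Invented external records; the DOIs are placeholders, not real identifiers.
external = [
    {"doi": "10.1234/example.1", "title": "Open Repositories: A Survey", "author": "Chan, T."},
    {"title": "Metadata Quality in Practice", "author": "Lee, K."},
]

print(find_match({"doi": "10.1234/EXAMPLE.1"}, external)["title"])
print(find_match({"title": "metadata quality in practice!", "author": "Lee K"}, external)["author"])
print(find_match({"title": "Unknown work", "author": "Nobody"}, external))
```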
Lots of people involved in making this all work, and work well.
Author profiles include achievements, grants, cited as (for alternate naming) and external metrics via web scrape. Also includes prizes (making architects happy) and supervision with titles with a link (making educators happy).
Previously, theses and dissertations (which are very popular) were stored in three separate silos. Now they all integrate with this system for better interactivity of content for tracking, jumping between items.
Grants and projects are tracked and displayed, too. This shows what is available to be applied for, or what has been done already – publications resulting from. Patent records included, with histories and appropriate sharing of information based on application status, publication and granting. Links to published, granted patents and abstracts in whichever countries they exist.
With all of this data, other things can be shown: articles with fastest rate of receiving citation, most cited, who publishes the most. Internal metrics show off locations of site views, views over time, and more. Visualizations are improving, so users can see charts (webs of coauthors for an author, for example) and graphs and things. The data is all in one place from other silos, which is great because on-the-fly charts would be otherwise impossible.
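Once publication records sit in one place, figures like these fall out of simple aggregation. As a sketch over invented records (the names, counts and field layout are assumptions, not HKU data), the edges of a coauthor web, the most cited item, and papers per author could be derived like this:

```python
from collections import Counter
from itertools import combinations

# Invented sample records.
publications = [
    {"title": "Paper A", "authors": ["Chan", "Lee"], "citations": 12},
    {"title": "Paper B", "authors": ["Chan", "Lee", "Wong"], "citations": 30},
    {"title": "Paper C", "authors": ["Wong"], "citations": 5},
]

# Each pair of authors on the same paper is one edge in the coauthor web;
# the count is how many papers the pair has written together.
coauthor_edges = Counter()
for pub in publications:
    for pair in combinations(sorted(set(pub["authors"])), 2):
        coauthor_edges[pair] += 1

most_cited = max(publications, key=lambda pub: pub["citations"])
papers_per_author = Counter(a for pub in publications for a in pub["authors"])

print(coauthor_edges[("Chan", "Lee")])  # 2
print(most_cited["title"])              # Paper B
print(papers_per_author["Wong"])        # 2
```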
Has all of this increased visibility? Anecdotally, absolutely. People’s reputations are improving. Metrics show improvement as well. The hub is stickier – more pages per visit, more time per page, because everything is hyperlinked.
This work, done with CILEA, is going to be given back to the community in DSpace CRIS modules. Everything. For the sake of knowledge exchange. Mutual benefits will result, in terms of interoperability and co-development.
Q: Is there an API to access data your system has ground out of other sources?
A: A web service is in the works. The office of dentistry is scraping this data manually already, so it’s doable.