Three publications describing the Open Citations Corpus

Last September, I attended the Fifth Annual Conference on Open Access Scholarly Publishing, held in Riga, at which I had been invited to give a paper entitled The Open Citations Corpus – freeing scholarly citation data.  A recording of my talk is available here, and my PowerPoint presentation is separately available here.  My own reflections on the major themes of the conference are given in a separate Semantic Publishing Blog post.

While in Riga preparing to give that talk about the importance of open citation data, I received an invitation from Sara Abdulla, Chief Commissioning Editor at Nature, to write a Comment piece for their forthcoming special issue on Impact.  My immediate reaction was that this should be on the same theme, an idea to which Sara readily agreed.  The deadline for delivery of the article was 10 days later!

As soon as the Riga conference was over, I assembled all the material I had to hand that could be relevant to describing the Open Citations Corpus (OCC) in the context of conventional access to academic citation data from commercial sources.  That gave me a raw manuscript of some five thousand words, from which I had to distil an article of fewer than 1,300 words.  I then started editing, and asked my colleagues Silvio Peroni and Tanya Gray for their comments.

The end result, enriched by some imaginative artwork by the Nature team, was published a couple of weeks later, on 16 October [1].  It presents both the intellectual argument for open citation data and the practical obstacles to be overcome in achieving the goal of a substantial corpus of such data, as well as giving a general description of the Open Citations Corpus itself and of the development work we have planned for it.

Because of the drastic editing required to reduce the original draft to about a quarter of its size, all material not crucial to the central theme had to be cut.  I thus had the idea of subsequently developing the original draft into a full journal article that would include these additional themes, particularly Silvio’s work on the SPAR ontologies described in this Semantic Publishing Blog post [2], Tanya’s work on the CiTO Reference Annotation Tools described in this Semantic Publishing Blog post, and a wonderful analogy, devised by Silvio, between the scholarly citation network and Venice.  I also wanted to give authorship credit to Alex Dutton, who had undertaken almost all of the original software development work for the OCC.  For this reason, instead of assigning copyright to Nature for the Comment piece, I gave them a licence to publish, retaining copyright myself so that I could re-use the text.  I am pleased to say that they accepted this without comment.

Silvio and I then set to work to develop the draft into a proper article.  The result was a ten-thousand word paper submitted to the Journal of Documentation a week before Christmas [3].  We await the referees’ comments!

References

[1]     Shotton D (2013). Open citations. Nature 502: 295–297. http://www.nature.com/news/publishing-open-citations-1.13937. doi:10.1038/502295a.

[2]     Peroni S and Shotton D (2012). FaBiO and CiTO: ontologies for describing bibliographic resources and citations. Web Semantics: Science, Services and Agents on the World Wide Web 17: 33–34. doi:10.1016/j.websem.2012.08.001.

[3]     Peroni S, Dutton A, Gray T and Shotton D (2014). Setting our bibliographic references free: towards open citation data. Journal of Documentation (submitted for publication).

Posted in Uncategorized | 1 Comment

Open Citations – Doing some graph visualisations

Ongoing work on the Open Citations extensions project is now reaching the point of visualising – at very much a prototype level at this stage – the outputs of our earlier efforts to import and index the PubMed Central Open Access subset and arXiv.

Earlier in this project I asked David to specify a list of questions that he thought researchers might hope to answer by querying our Open Citations Corpus; the aim was to use these questions to guide our development, in the hope of providing a striking interface that also does something useful – there are too many data visualisations that look very pretty but do not actually add much to the data. So, considering that list of questions and how one might visualise the data to an end that is not only pretty but also functional, I set myself the following problem:

Identify what it is in a dataset that is not easy to find in a textual representation, and make it useful for search

Based on our earlier text search demonstrator, the answer soon became – of course – interactions: whilst the properties of a result object are obvious in a textual result set, the interactions between those objects are not – and sometimes it is the interactions that one wishes to use as search parameters.

What I did

Having found my purpose, I set about applying the superb D3.js library to the problem, using it to draw SVG representations of elasticsearch query results directly into the browser. After testing a number of different result layouts, I settled upon a zoomable, pannable force-directed network graph and combined it with some code from my PhD work to build in some connections on the fly. This is, as mentioned earlier, still a work in progress, but the results so far are pretty good.

Take the image above, for example: this is a static representation of the interactions between David Shotton and all other authors (purple) with whom he has published an article (green) in the PMC OA subset. The red dots are the journals these articles appear in, and the brown dots are citations. As a static image this could be fairly informative when marked up with appropriate metadata, and it does look quite nice; but, more than that, it can act as part of a search interface to enable a much improved search experience.
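The graph itself is drawn client-side with D3.js, but the data it consumes is simple. As a rough illustration, here is a minimal Python sketch (with field names that are assumptions rather than the real index mapping) of how elasticsearch hits could be flattened into the nodes-and-links structure a force-directed layout expects:

# A minimal sketch: flatten elasticsearch hits into nodes and links for a
# D3.js force-directed layout. Field names are illustrative, not the real mapping.

def hits_to_graph(hits):
    nodes, links, index = [], [], {}

    def node(node_id, label, kind):
        # Re-use a node if we have already seen it, otherwise append a new one.
        if node_id not in index:
            index[node_id] = len(nodes)
            nodes.append({"id": node_id, "label": label, "type": kind})
        return index[node_id]

    for hit in hits:
        record = hit["_source"]
        article = node(hit["_id"], record.get("title", ""), "article")
        for author in record.get("author", []):
            links.append({"source": article,
                          "target": node(author["name"], author["name"], "author")})
        journal = record.get("journal", {}).get("name")
        if journal:
            links.append({"source": article,
                          "target": node(journal, journal, "journal")})
        for cite in record.get("citation", []):
            cite_id = cite.get("id", cite.get("unstructured", ""))
            links.append({"source": article,
                          "target": node(cite_id, cite_id, "citation")})

    return {"nodes": nodes, "links": links}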

So far, producing a given image also narrows the result set; so whilst viewing the above image, the available suggestion dropdowns are automatically restricted to the subset of values relevant to the currently displayed image. Dropdown suggestions are listed in order of popularity count; upon typing one letter they switch to alphabetical order, and with multiple letters they become term searches. By typing in free-text search values or choosing suggestions, this visual representation of the current subset of results, combined with the automated restriction of further suggestions, should offer a simple yet powerful search experience. It is also possible to switch back to “list” view at any time, to see the current result set in a more traditional form. Further work – described below – will bring enhancements that add functionality to the elements of the visualisation too.
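For readers curious how such suggestion dropdowns might be driven, here is a hedged Python sketch of an elasticsearch query that returns author suggestions restricted to the current filters, ordered by document count. The index name, field names and client host are illustrative assumptions, not our actual configuration:

from elasticsearch import Elasticsearch  # assuming the elasticsearch-py client

es = Elasticsearch("http://localhost:9200")  # illustrative host; not our deployment

def author_suggestions(current_filters, prefix=None, size=25):
    """Suggest authors for the dropdown, restricted to the current result set.

    With nothing typed, suggestions are ordered by document count ("popularity");
    once the user types, an include filter narrows them. The index and field
    names here are assumptions, not the real mapping."""
    agg = {"terms": {"field": "author.name.exact", "size": size}}
    if prefix:
        # Approximate the typed-prefix behaviour with a regex include filter.
        agg["terms"]["include"] = prefix.lower() + ".*"
    query = {
        "size": 0,  # we only want the aggregation buckets, not the hits
        "query": {"bool": {"filter": current_filters}},
        "aggs": {"suggest": agg},
    }
    result = es.search(index="occ", body=query)
    return [bucket["key"] for bucket in result["aggregations"]["suggest"]["buckets"]]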

Try it

As with the search result list demonstrator, it is possible to embed the visual search tool in any web page. However, as it looks better with full-screen real estate, I have saved that particular trick for the time being and simply made it available at http://occ.cottagelabs.com/graphview.

Before you rush off to try it, given its prototype state you will need some pointers. Taking the above image as an example once more, do the following to reproduce it:

  • Use a modern browser – Chrome renders JavaScript the fastest – on a reasonably decent machine to access http://occ.cottagelabs.com/graphview – a large screen resolution would be particularly nice
  • Choose authors as a search suggestion type
  • Start typing Shotton – click on Shotton David when it appears in the list
  • (If the error described below, where it appears to return all results again, occurs, just keep going)
  • Tick the various display options to add author, journal and citation objects to the display

The next step would be to click an author or other entity bubble then choose to add that to the search terms, or start a new search based on that bubble or perhaps a subset of the returned bubbles; however this is all still in development.

For a more complex example, try choosing keywords, then type Malaria. Once the results are displayed, increase the result set size to 400 so they are all shown. Then try selecting the various authors, journals and citations tickboxes to add those objects, and try increasing the sizes to see how many you can get before your computer melts… On my laptop, asking for more than about 1000 of each object results in poor performance. But here is an example of the output – all 383 articles with the Malaria keyword in the PMC OA subset, showing all 70 journals in which they are published, with links to the top 100 authors and citations. Which journal do you think is the large purple dot in the middle?

Outstanding issues

Interface

  • Numerous buttons have no action yet – clear / help / prev / next / + search / labels. Once these and other search action buttons are added, the visualisation can become a true part of the search experience rather than just a pretty picture
  • Searches are sent asynchronously and occasionally overlap, resulting in large query result sets overwriting smaller ones. This needs a delay added on user interactions.
  • Some objects should become one – for example, some citations refer to the same article via both a DOI and a PMID, and some citations are also open access articles in our index, so they should be linked up as such.
  • There is as yet no visual cue that results are still loading, so it feels a bit in limbo. Easy fix.
  • Some of the client-side processing can be shifted to the backend (already in progress)
  • The date slider at the bottom is twitchy and needs smoother implementation and better underlying data (see below)

Data quality

Apart from the above technical tasks, we will need to revisit our data pipeline in order to answer more of the questions set by David. For example, we have very little affiliation data at present, and we are also missing a large amount of date information. Some data cleaning is also necessary – for example, keywords should all be lowercased to ensure we do not have subsets due solely to capitalisation. There are also certain types of data that we know nothing about as yet – for example author location, h-index and ORCID. However, this is all to be expected at this stage, and overall our ability to spot these issues so easily shows great progress.
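As a trivial illustration of the sort of cleaning meant here, a minimal Python sketch (the record layout and key names are assumptions for illustration) that lowercases and de-duplicates keywords at import time:

def clean_keywords(record):
    """Normalise keywords at import time so that, e.g., "Malaria" and "malaria"
    do not end up as two separate facet values.
    A sketch only; the real pipeline may do this at a different stage."""
    seen, cleaned = set(), []
    for kw in record.get("keyword", []):
        k = kw.strip().lower()
        if k and k not in seen:
            seen.add(k)
            cleaned.append(k)
    record["keyword"] = cleaned
    return record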

More to come

There is still work to be done on this graph interface and, in addition, we have some more demonstrators on the way too. In combination with the work on improving the pipeline and data quality, we should soon be able to perform queries that will answer more of our set questions – then we will identify what needs to be done next to answer the remaining ones!

Posted in Open Citations | 4 Comments

Open Citations Corpus Import Process

As part of the Open Citations project, we have been asked to review and improve the process of importing data into the Open Citations Corpus, taking the scripts from the initial project as our starting point.

The current import procedure evolved from several disconnected processes and requires running multiple command line scripts and transforming the data into different intermediate formats. As a consequence, it is not very efficient and we will be looking to improve on the speed and reliability of the import procedure. Moreover, there are two distinct procedures depending on the source of the data (arXiv or PubMed Central); we are hoping to unify the common parts of these procedures into a single process which can be simplified and normalised to improve code re-use and comprehensibility.

The Workflow

As PubMed Central provides an OAI-PMH feed, this could be used to retrieve article metadata and, for some articles, full text. Using this feed, rather than an FTP download (as used currently), would allow the metadata import for both arXiv and PubMed Central to follow a near-identical process, as we are already using the OAI-PMH feed for arXiv.
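A minimal Python sketch of such a harvester is given below; it simply pages through ListRecords responses by following resumption tokens. The endpoint URL and metadata prefix shown are the standard PMC values as we understand them, but should be treated as illustrative:

import requests
import xml.etree.ElementTree as ET

# The standard PMC OAI-PMH endpoint, as we understand it; treat as illustrative.
PMC_OAI = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def harvest(endpoint=PMC_OAI, metadata_prefix="oai_dc", from_date=None):
    """Yield <record> elements from an OAI-PMH ListRecords response,
    following resumption tokens until the list is exhausted."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # incremental harvesting from a given date
    while True:
        root = ET.fromstring(requests.get(endpoint, params=params).content)
        for record in root.findall(".//oai:record", NS):
            yield record
        token = root.find(".//oai:resumptionToken", NS)
        if token is None or not (token.text or "").strip():
            break
        # Subsequent requests carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}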

Also, rather than have intermediate databases and information stores, it would be cleaner to import from the information source straight into a datastore. The datastore could then be queried, allowing matches and linking between articles to be performed in situ. The process would therefore become:

  1. Pull new metadata from arXiv (OAI-PMH) and PubMed Central (OAI-PMH) and insert new records into the Open Citations Corpus datastore
  2. Pull new full text from arXiv and PubMed Central, extract citations, and match them with article data in the Open Citations datastore, creating links between these references and the metadata records for the cited articles. Store unmatched citations as nested records in the metadata for each article.
  3. On a scheduled basis (e.g. nightly), review each existing article’s unmatched citations and attempt to match them with the existing bibliographic records of other articles (a sketch of this matching step follows the list).
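As an illustration of step 3, here is a hedged Python sketch of what a matching pass could look like against an elasticsearch-backed store; the index name, field names and matching order (DOI, then PMID, then title) are assumptions for illustration rather than a description of the final implementation:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative; not the real deployment

def match_citation(citation):
    """Try to resolve one unmatched citation to an existing bibliographic record.

    Prefer exact identifier matches (DOI, then PMID), then fall back to a title
    match; return the matched record id, or None. Field names are illustrative."""
    for field in ("doi", "pmid"):
        value = citation.get(field)
        if value:
            hits = es.search(index="occ", body={
                "size": 1,
                "query": {"term": {"identifier." + field: value}},
            })["hits"]["hits"]
            if hits:
                return hits[0]["_id"]
    title = citation.get("title")
    if title:
        hits = es.search(index="occ", body={
            "size": 1,
            "query": {"match_phrase": {"title": title}},
        })["hits"]["hits"]
        if hits:
            return hits[0]["_id"]
    return None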

In outline, this looks like this: [workflow diagram]

The Datastore

Neo4J is currently used by the Related Work system as the final Open Citations Corpus datastore for the arXiv data. We propose instead to use BibServer as the final datastore, for its flexibility, its scalability and its suitability for the Open Citations use cases.

The Data Structure

The data stored within BibServer as BibJSON will be a collection of linked bibliographic records describing articles. Associated with each record and stored as nested data will be a list of matched citations (i.e. those for which the Open Citations Corpus has a bibliographic record), a list of unmatched citations, and a list of authors.

Authors will not be stored as separate entities. De-coupling and de-duplicating authors and articles could form the basis of a future project, perhaps using proprietary identifiers (such as ORCID, PubMed Author ID or arXiv Author ID) or email addresses, but this will not be considered further in this work package.
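For concreteness, here is an illustrative sketch of the intended record shape, written as a Python dict in BibJSON style; the exact keys and identifiers are assumptions, not the final schema:

# A hedged, illustrative sketch of the intended record shape in BibJSON style;
# the exact keys used in the Open Citations Corpus may differ.
article = {
    "title": "An example article",
    "journal": {"name": "Journal of Examples", "volume": "12"},
    "year": "2013",
    "identifier": [{"type": "doi", "id": "10.1234/example"}],  # illustrative DOI
    "author": [
        {"name": "Shotton D"},   # authors stored as nested data,
        {"name": "Peroni S"},    # not as separate de-duplicated entities
    ],
    "citation": [
        # matched citation: points at another record in the corpus
        {"identifier": [{"type": "pmid", "id": "12345678"}],
         "record": "occ:record/67890"},
        # unmatched citation: kept nested until a later matching pass succeeds
        {"unstructured": "Smith J (2010). Some cited work. Some Journal 1: 1-10."},
    ],
}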

Overall Aim

The overall aim of this work is to provide a consistent, simple and re-usable import pipeline for Open Citations Corpus data. In the fullness of time we’d expect it to be possible to add new data sources with minimal additional complexity. By importing data into the datastore at as early a stage as possible in the pipeline, we can use common tools for extracting, matching and deduplicating citations; the work for each data source, then, is just to convert the source data format into BibJSON and store it in BibServer, as sketched below.
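This is only a sketch of the intended shape (the decorator, function and record keys are invented for illustration), but it shows how the per-source work reduces to a single conversion function while storage, matching and de-duplication stay shared:

# A sketch of the intended pipeline shape: per-source converters that yield
# BibJSON dicts, plus shared code for storage, matching and de-duplication.
# Names are illustrative, not the real module layout.

CONVERTERS = {}

def converter(source_name):
    """Register a function that turns one raw record from a source into BibJSON."""
    def register(func):
        CONVERTERS[source_name] = func
        return func
    return register

@converter("arxiv")
def arxiv_to_bibjson(raw):
    return {"title": raw.get("title", ""), "author": raw.get("authors", [])}

@converter("pmc")
def pmc_to_bibjson(raw):
    return {"title": raw.get("article-title", ""), "author": raw.get("contrib", [])}

def ingest(source_name, raw_records, store):
    """Shared ingest path: convert each raw record, then hand it to the common
    storage/matching machinery (represented here by a simple callable)."""
    convert = CONVERTERS[source_name]
    for raw in raw_records:
        store(convert(raw))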

Posted in JISC, Open Citations | 2 Comments

Libraries and linked data #1: What are linked data?

The first of six blog posts about libraries and linked data, bearing this title, is to be found at

http://semanticpublishing.wordpress.com/2013/03/01/lld1-what-are-linked-data/.

A draft of that post, which erroneously appeared here in this blog, has been removed.

 

Posted in Uncategorized | 1 Comment

Open Citations – Indexing PubMed Central OA data

As part of our work on the Open Citations extensions project, I have recently been doing one of my favourite things – namely indexing large quantities of data then exploring it.

On this project we are interested in the PubMed Central Open Access subset, and more specifically, we are interested in what we can do with the citation data contained within the records that are in that subset – because, as they are open access, that citation data is public and freely available.

We are building a pipeline that will enable us to easily import data from the PMC OA and from other sources such as arXiv, so that we can do great things with it like explore it in a facetview, manage and edit it in a bibserver, visualise it, and stick it in the rather cool related-work prototype software. We are building on the earlier work of both the original Open Citations project, and of the Open Bibliography projects.

Work done so far

We have spent a few weeks getting to understand the original project software and clarifying some of the goals the project should achieve; we have put together a design for a processing pipeline to get the data from source right through to where we need it, in the shape that we need it. In the case of facetview / bibserver work, this means getting it into a wonderful elasticsearch index.

While Martyn continues work on the bits and pieces for managing the pipeline as a whole and pulling data from arXiv, I have built an automated and threadable toolchain for unpacking data out of the compressed file format it arrives in from the US National Institutes of Health, parsing the XML file format and converting it into BibJSON, and then bulk loading it into an elasticsearch index. This has gone quite well.
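In outline (and only in outline – the archive layout, XML paths and index name below are assumptions for illustration, and the real toolchain handles far more fields), that unpack-parse-load chain does something like this:

import tarfile
import xml.etree.ElementTree as ET
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # illustrative host and index name

def article_to_bibjson(tree):
    """Convert one parsed article XML tree into a minimal BibJSON-style dict
    (only a few fields shown; the real parser extracts much more)."""
    get = lambda path: (tree.findtext(path) or "").strip()
    return {
        "title": get(".//article-title"),
        "journal": {"name": get(".//journal-title")},
        "year": get(".//pub-date/year"),
        "author": [
            {"name": "%s %s" % (n.findtext("surname", ""), n.findtext("given-names", ""))}
            for n in tree.findall(".//contrib[@contrib-type='author']/name")
        ],
    }

def index_archive(path):
    """Unpack one compressed PMC OA archive, parse each article's XML and
    bulk-load the resulting BibJSON records into elasticsearch."""
    with tarfile.open(path, "r:gz") as archive:
        actions = []
        for member in archive:
            if not member.name.endswith(".nxml"):
                continue
            tree = ET.parse(archive.extractfile(member))
            actions.append({"_index": "occ", "_source": article_to_bibjson(tree)})
        helpers.bulk(es, actions)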

To fully browse what we have so far, check out http://occ.cottagelabs.com.

For the code: https://github.com/opencitations/OpenCitationsCorpus/tree/master/pipeline.

The indexing process

Whilst the toolchain is capable of running threaded, the server we are using only has 2 cores and I was not sure to what extent they would be utilised, so I ran the process single-threaded. It took five hours and ten minutes to build an index of the PMC OA subset, and we now have over 500,000 records. We can full-text search them and facet-browse them.

Some things of particular interest that I learnt: I have an article in the PMC OA! Also, PMIDs are not always 8 digits long – they appear in fact to be assigned incrementally from 1.

What next

At the moment no effort is made to create record objects for the citations we find within these records; however, plugging that into the toolchain is now relatively straightforward.

The full pipeline is of course still in progress, and so this work will need a wee bit of wiring into it.

Improve parsing. There are probably improvements to the parsing that we can make too, and so one of the next tasks will be to look at a few choice records and decide how better to parse them. The best way to get a look at the records for now is to use a browser like Firefox or Chrome and install the JSONview plugin, then go to occ.cottagelabs.com and have a bit of a search, then click the small blue arrows at the start of a record you are interested in to see it in full JSON straight from the index. Some further analysis on a few of these records would be a great next step, and should allow for improvements to both the data we can parse and to our representation of it.

Finish visualisations. Now that we have a good test dataset to work with, the various bits and pieces of visualisation work will be pulled together and put up on display somewhere soon. These, in addition to the search functionality already available, will enable us to answer the questions set as representative of project goals earlier in January (thanks David for those).

Posted in JISC, Open Citations | Leave a comment

The start of something new

Previously, my blog posts relating to semantic publishing have appeared in this Open Citations Blog.

However, because of the merger of the Open Citations Project with the Related Work Project, described here, this Open Citations Blog has been renamed Open Citations and Related Work and has been opened to contributions from others involved in developing Open Citations and Related Work.

It thus makes sense that future blog posts concerned with semantic publishing be given a separate, distinct home – a new Semantic Publishing Blog available at http://semanticpublishing.wordpress.com/ – to which previous blog posts on the semantic publishing theme that originally appeared in this Open Citations Blog will be copied.

Those interested should follow both the Open Citations and Related Work Blog and the Semantic Publishing Blog, and may also be interested in a third, more specialized blog concerning the online creation of research data management plans, entitled Creating data management plans online, available at http://datamanagementplanning.wordpress.com/.

Posted in Open Citations, Semantic Publishing | Leave a comment