Open Citations Corpus Import Process

As part of the Open Citations project, we have been asked to review and improve the process of importing data into the Open Citations Corpus, taking the scripts from the initial project as our starting point.

The current import procedure evolved from several disconnected processes and requires running multiple command line scripts and transforming the data into different intermediate formats. As a consequence, it is not very efficient and we will be looking to improve on the speed and reliability of the import procedure. Moreover, there are two distinct procedures depending on the source of the data (arXiv or PubMed Central); we are hoping to unify the common parts of these procedures into a single process which can be simplified and normalised to improve code re-use and comprehensibility.

The Workflow

As PubMed Central provides an OAI-PMH feed, this could be used to retrieve article metadata, and for some articles, full text. Using this feed, rather than an FTP download (as used currently) would allow the metadata import for both arXiv and PubMed Central to follow a near-identical process, as we are already using the OAI-PMH feed for arXiv.

Also, rather than have intermediate databases and information stores, it would be cleaner to import from the information source straight into a datastore. The datastore could then be queried, allowing matches and linking between articles to be performed in situ. The process would therefore become:

Pull new metadata from arXiv (OAI-PMH) and PubMed Central (OAI-PMH) and insert new records into the Open Citations Corpus datastore
Pull new full-text from arXiv and PubMed Central, extract citations, and match with article data in Open Citations server, creating links between these references and the metadata records for the cited articles. Store unmatched citations as nested records in the metadata for each article.
On a scheduled basis (e.g. nightly), review each existing article’s unmatched citations and attempt to match these with existing bibliographic records of other articles.

In outline, this looks like this:

The Datastore

Neo4J is currently used as the final Open Citations Corpus datastore for the arXiv data, by the Related Work system. We propose instead to use BibServer as the final datastore, for its flexibility and scalability, and suitability for the Open Citations use cases.

The Data Structure

The data stored within BibServer as BibJSON will be a collection of linked bibliographic records describing articles. Associated with each record and stored as nested data will be a list of matched citations (i.e. those for which the Open Citations Corpus has a bibliographic record), a list of unmatched citations, and a list of authors.

Authors will not be stored as separate entities. De-coupling and de-duplicating authors and articles could form the basis of a future project, perhaps using proprietary identifiers (such as ORCHID, PubMed Author ID or arXiv Author ID) or email addresses, but this will not be considered further in this work package.

Overall Aim

The overall aim of this work is to provide a consistent, simple and re-usable import pipeline for data for the Open Citations Corpus. In the fullness of time we’d expect it to be possible to add new data sources with minimal additional complexity. By using an approach whereby data is imported into the datastore at as early a stage as possible in the import pipeline, we can use common tools for extracting, matching, deduplicating citations; the work for each datasource, then, is just to convert the source data format into BibJSON and store it in BibServer.

Postscript

David Shotton writes: This productive collaboration between Cottage Labs and the Open Citations Corpus came to an end when Jisc funding ran out. The corpus has more recently been given a new lease of life, as described here, with a new instantiation named OpenCitations hosted at the Department of Computer Science and Engineering of the University of Bologna, with Silvio Peroni as Co-Director.

2 Responses to Open Citations Corpus Import Process

Nicola Ghirardi says:

June 24, 2013 at 11:12 am

Some question:
Is the revised workflow available somewhere?
Did you use an open source software for full-text analysis / citation extraction?
I am very interested (also) in arxiv import.

- davidshotton says:
  
  July 12, 2013 at 4:19 pm
  
  Is the revised workflow available somewhere?
  Our original processing pipeline is described at https://opencitations.wordpress.com/2011/07/01/the-citation-processing-pipeline-and-the-open-citations-corpus/. All the code is available at http://opencitations.net/source-code/. We are now in the process of revising the import pipeline, as described above. Recent code is available at https://github.com/opencitations/OpenCitationsCorpus. This is work in progress.
  
  Did you use an open source software for full-text analysis / citation extraction?
  We used our own scripts to parse the incoming reference lists from PubMed Central, that were well marked up in XML using the NLM DTD.
  
  I am very interested (also) in arxiv import.
  We are in the process of including all the references from arXiv in the Open Citations Corpus, but this work is currently stalled for lack of funding.

	RTD TIG Week: Augmen… on Coverage of open citation data…
	UZH – Universität Zü… on Coverage of open citation data…
	UZH – Universität Zü… on OpenCitations and the Initiati…
	The Initiative for O… on Academia’s missing refer…
	Coverage of open cit… on Elsevier references dominate t…