Citations as First-Class Data Entities: The OpenCitations Corpus

Requirements for citations to be treated as First-Class Data Entities

In my introductory blog post, I listed five requirements for the treatment of citations as first-class data entities. The third of these requirements is that they must be storable, searchable and retrievable in an open database designed for bibliographic citations.

In this post, I describe the current status of the OpenCitations Corpus, a well-structured open database specifically developed by OpenCitations and designed to store information about bibliographic citations as Linked Open Data, encoded in RDF (specifically JSON-LD).

What is OpenCitations?

OpenCitations (http://opencitations.net) is a scholarly infrastructure organization that has created, and is currently expanding the coverage of, the OpenCitations Corpus (OCC). The OCC is an open repository of scholarly citation data, made available under a Creative Commons CC0 public domain dedication, which provides accurate citation information (bibliographic references) harvested from the scholarly literature and encoded in RDF.

The Co-Directors of OpenCitations are David Shotton, Oxford e-Research Centre, University of Oxford (david.shotton@opencitations.net) and Silvio Peroni, Department of Computer Science and Engineering, University of Bologna (silvio.peroni@opencitations.net).

We are committed to open scholarship, open data, open access publication, and open source software. We espouse the FAIR data principles developed by Force11, of which David Shotton was a founding member. We also espouse the aim of the Initiative for Open Citations (I4OC), of which David Shotton and Silvio Peroni were both founding members: to promote the availability of citation data that is structured, separable, and open.

The principal activity of OpenCitations to date has been the establishment and population of the OpenCitations Corpus.

Holdings of the OpenCitations Corpus

We have so far concentrated on ingesting into the OpenCitations Corpus bibliographic references from open access papers available at PubMed Central, encoding these data in RDF, and curating the citation links they represent to a high standard, with metadata enrichment from the Crossref API and (for authors) the ORCID API.

To date (19th February 2018), the OCC has ingested the references from 302,758 citing bibliographic resources, and contains information about 12,830,347 citation links to 6,549,665 cited resources. Plans to expand the coverage of the OCC are outlined below.

User interfaces

The information within the OCC can be accessed via OSCAR, our new generic OpenCitations RDF Search Application (http://opencitations.net/search) [1], which can be used for textual searches over any triplestore presenting a SPARQL endpoint. Users can employ OSCAR to search the OCC for publication titles, author names, publication years, and identifiers (DOIs, PubMed IDs, PubMed Central IDs, ORCIDs, and OCC corpus identifiers). Such a search returns details of all bibliographic resources within the OCC matching the search term, from which their references can be obtained, if known. In the near future, we will complement OSCAR with a browse interface named LUCINDA.

We also provide a SPARQL endpoint for directly querying the Blazegraph triplestore in which we store the OCC RDF, and we plan in the near future to supplement such programmatic access with a REST API. In addition, the contents of the entire triplestore, and of the various sub-databases within the Corpus, together with their provenance information, are downloadable from Figshare as monthly dumps. Once the REST API has been developed, we will turn our attention to developing user interfaces for the interactive visualization of citation graphs.
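As a rough illustration of what programmatic access through a SPARQL endpoint might look like, the sketch below prepares a query for citing/cited pairs using the `cito:cites` property from the SPAR ontologies. The endpoint URL and the function names are assumptions made for this example and are not taken from this post; the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the endpoint address below is illustrative; check the
# OpenCitations website for the actual SPARQL endpoint URL.
SPARQL_ENDPOINT = "http://opencitations.net/sparql"


def build_citation_query(limit=10):
    """Return a SPARQL query that lists citing/cited resource pairs."""
    return (
        "PREFIX cito: <http://purl.org/spar/cito/>\n"
        "SELECT ?citing ?cited\n"
        "WHERE { ?citing cito:cites ?cited }\n"
        f"LIMIT {int(limit)}"
    )


def build_request(query, endpoint=SPARQL_ENDPOINT):
    """Prepare (but do not send) a GET request in the style of the SPARQL 1.1 Protocol."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialization of SPARQL results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(build_citation_query(limit=5))
print(request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` and parsing the response with `json.load` would then return the bindings, assuming the endpoint accepts GET queries and serves JSON results.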

The OpenCitations Data Model

As described in the previous blog post, we have just completed a comprehensive revision of the OpenCitations Data Model (OCDM, available at https://doi.org/10.6084/m9.figshare.3443876), which we use to capture descriptions of all aspects of the OCC citations and their provenance. This model makes extensive use of our SPAR (Semantic Publishing and Referencing) Ontologies (http://www.sparontologies.net/), which we developed to describe all aspects of the scholarly publishing domain in RDF.

The OpenCitations Data Model is freely available for third parties to use when recording their own bibliographic and citation information in RDF, with the advantage that data so modelled will be immediately compatible with those within the OpenCitations Corpus, which can act as a publishing venue for such third-party data.
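To give a feel for what a citation recorded as JSON-LD could look like, here is a minimal sketch built as a Python dictionary. The `@context`, the `example.org` identifiers, the title, and the helper function are all hypothetical and written by hand for illustration; only the `cito:` and `dcterms:` namespaces are real. The authoritative set of classes and properties is the one defined in the OCDM document itself.

```python
import json

# Hypothetical record: the identifiers and the hand-written @context are
# illustrative only and are not copied from the OpenCitations Corpus.
citation_record = {
    "@context": {
        "cito": "http://purl.org/spar/cito/",
        "dcterms": "http://purl.org/dc/terms/",
        # Treat the values of "cites" as IRIs rather than plain strings.
        "cites": {"@id": "cito:cites", "@type": "@id"},
        "title": "dcterms:title",
    },
    "@id": "https://example.org/br/1",
    "title": "An example citing article",
    "cites": ["https://example.org/br/2", "https://example.org/br/3"],
}


def cited_resources(record):
    """Return the IRIs of the resources cited by this record."""
    return list(record.get("cites", []))


print(json.dumps(citation_record, indent=2))
print(cited_resources(citation_record))
```

Each entry in `cites` corresponds to one citation link from the citing resource to a cited resource, which is the unit counted in the holdings figures above.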

Future ingest rate and data sources

Since July 2016, the instantiation of the OpenCitations Corpus currently running at the University of Bologna has been ingesting reference lists from biomedical journal articles at the relatively slow rate of about 200,000 citing bibliographic resources per year. During February 2018, ingestion into the Corpus is suspended, while we move the system to a completely new and more powerful server, supplemented by thirty Raspberry Pi ingest engines that will work in parallel feeding ingested data to the server.

This will increase our ingestion rate ~30-fold to about six million citing bibliographic resources per year, equivalent to ~240 million citations per year at 40 references per paper (the current OCC value is 42.4 references per paper). We should then be able to complete ingestion of the ~1.4 million remaining OA resources at PubMed Central within about three months.
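The arithmetic behind these projections can be checked in a few lines. All input figures are taken from this post; the variable names are simply labels chosen for the sketch.

```python
# Figures quoted in the post.
CURRENT_RATE = 200_000         # citing resources ingested per year
SPEED_UP = 30                  # expected increase with the new hardware
REFS_PER_PAPER = 40            # rounded working value used for projection
REMAINING_PMC = 1_400_000      # open access resources at PMC still to ingest
CITATION_LINKS = 12_830_347    # citation links held on 19 February 2018
CITING_RESOURCES = 302_758     # citing resources held on the same date

new_rate = CURRENT_RATE * SPEED_UP               # citing resources per year
citations_per_year = new_rate * REFS_PER_PAPER   # citations per year
months_for_pmc = REMAINING_PMC / new_rate * 12   # time to clear the PMC backlog
observed_refs_per_paper = CITATION_LINKS / CITING_RESOURCES

print(f"New ingest rate: {new_rate:,} citing resources per year")
print(f"Citations per year: {citations_per_year:,}")
print(f"Months to finish PMC: {months_for_pmc:.1f}")
print(f"Observed references per paper: {observed_refs_per_paper:.1f}")
```

This reproduces the six million resources and ~240 million citations per year, gives 2.8 months for the remaining PubMed Central resources, and recovers the 42.4 references per paper from the current holdings.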

At that stage, we plan to start ingesting references from the ~17 million journal articles whose deposited references are now open at Crossref as a consequence of the Initiative for Open Citations. The scholarly world currently publishes about 2.5 million new journal articles each year, of which about half will probably be open at Crossref (assuming Elsevier has not by then opened its references). So, by the end of 2020, Crossref will have ~650 million open references. In addition to ingesting new open Crossref references as they are made available, we will be able to eat into the backlog of existing Crossref open references at a catch-up rate of ~190 million per year. By the end of 2020, we anticipate that the OCC should contain ~650 million citations harvested from PMC and Crossref, roughly half the coverage of Web of Science. We are currently also considering ingest of references from other major bibliographic databases.
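The ~190 million catch-up figure is not derived explicitly in the text. One reading that reproduces it, offered here as an assumption rather than as the authors' stated calculation, is to subtract the references arriving each year in newly published open articles from the total ingest capacity of ~240 million citations per year.

```python
# Figures quoted in the post; the derivation itself is an assumed reading.
INGEST_CAPACITY = 240_000_000     # citations per year after the upgrade
NEW_ARTICLES_PER_YEAR = 2_500_000
OPEN_SHARE_DIVISOR = 2            # "about half" are open at Crossref
REFS_PER_PAPER = 40

new_open_articles = NEW_ARTICLES_PER_YEAR // OPEN_SHARE_DIVISOR
new_open_references = new_open_articles * REFS_PER_PAPER

# Capacity left over for the backlog once new references are kept up with.
catch_up_rate = INGEST_CAPACITY - new_open_references

print(f"New open references per year: {new_open_references:,}")
print(f"Catch-up rate on the backlog: {catch_up_rate:,}")
```

Under these assumptions, about 50 million new open references arrive each year, leaving 190 million citations per year of capacity for the existing backlog.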

Our vision for OpenCitations

Our vision is that OpenCitations should become a comprehensive source of open citation information from all disciplines of scholarly endeavour encoded as Linked Open Data, a key component of the academic open infrastructure used on a daily basis without charge by scholars worldwide.

To be of maximum utility, it requires effective graphical user interfaces and analytical tools to interrogate and quantify the data contained within the OCC. Since these data are all open, we anticipate that such interface and tool development will best be undertaken collaboratively within the open scholarly community, and we invite developers interested in such collaboration to contact us at contact@opencitations.net.

Reference

[1] Ivan Heibi, Silvio Peroni and David Shotton (2018). OSCAR: A customisable tool for free-text search over SPARQL endpoints. Accepted to the 2018 International Workshop on Semantics, Analytics, Visualisation: Enhancing Scholarly Dissemination (https://save-sd.github.io/2018/, co-located with The Web Conference), 24 April 2018, Lyon, France. Preprint available at https://w3id.org/people/essepuntato/papers/oscar-savesd2018.html
