Citations as First-Class Data Entities: Open Citation Identifiers

Requirements for citations to be treated as First-Class Data Entities

In my introductory blog post, I listed five requirements for the treatment of citations as first-class data entities.  The fourth of these requirements is that they must be identifiable using a global persistent identifier scheme.

At the recent PIDapalooza Conference on persistent identifiers, held in Girona, Spain, I launched the Open Citation Identifier (abbreviated OCI, in line with DOI), the new persistent identifier for citations [1].

In this post, I describe the Open Citation Identifier scheme, created and operated by OpenCitations, which supports the assignment of Open Citation Identifiers not only to the citations present in the OpenCitations Corpus (OCC) but also to open citations present in other bibliographic databases.

Structure and syntax of the Open Citation Identifier

Each OCI has a simple structure: oci:number-number, where “oci:” is the identifier prefix.

OCIs for citations stored within the OpenCitations Corpus are constructed by combining the OpenCitations Corpus local identifiers for the citing and cited bibliographic resources, separating them with a dash.  (For the definition of OCC local identifiers, see the OpenCitations Data Model.)

For example, oci:2544384-7295288 is a valid OCI for the citation between two papers stored within the OpenCitations Corpus. The first number is the OCC local identifier for the citing bibliographic resource [2], and the second is the OCC local identifier for the cited bibliographic resource [3]. These local identifiers are unique within the OCC.  [Note: Supplier prefixes are omitted from OCC local identifiers of bibliographic resources ingested into the OpenCitations Corpus prior to February 2018, but will be included within all OCC local identifiers of bibliographic resources ingested into the Corpus after that date.]
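As a rough illustration of this structure, the following Python sketch splits an OCI into its citing and cited parts. The pattern and the names `parse_oci` and `ParsedOci` are assumptions made for this example; they are not part of any OpenCitations software.

```python
import re
from typing import NamedTuple

# Assumed shape: "oci:" followed by two digit strings separated by a dash.
OCI_PATTERN = re.compile(r"oci:([0-9]+)-([0-9]+)")


class ParsedOci(NamedTuple):
    citing: str  # numerical identifier of the citing resource
    cited: str   # numerical identifier of the cited resource


def parse_oci(oci: str) -> ParsedOci:
    """Split an OCI into its two parts, kept as strings so leading zeros survive."""
    match = OCI_PATTERN.fullmatch(oci)
    if match is None:
        raise ValueError(f"not a well-formed OCI: {oci!r}")
    return ParsedOci(citing=match.group(1), cited=match.group(2))


parsed = parse_oci("oci:2544384-7295288")
print(parsed.citing, parsed.cited)
```

The parts are kept as strings rather than integers because identifiers that carry a supplier prefix begin with a zero, which an integer conversion would discard.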

OCIs for external resources identified by numerical identifiers

OCIs can also be created for citations between bibliographic resources described in an external bibliographic database, if they are similarly identified there by identifiers having a unique numerical part.  For example, the OCI for the citation that exists between Wikidata resources Q27931310 (the citing resource, [4]) and Q22252312 (the cited resource, [5]) is oci:01027931310-01022252312, where “010” is the assigned OCC supplier prefix for Wikidata.

The OCC supplier prefix consists of a positive number (following the pattern “nnn”, where “nnn” is a string of numerals of variable length which includes no zeros), enclosed between two zeros (e.g. “0420”).  The list of all assigned OCC supplier prefixes is given at https://github.com/opencitations/oci/blob/master/suppliers.csv.
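The prefix rule and the Wikidata example can be sketched in Python as follows. The regular expression is my reading of the rule stated above, and `is_supplier_prefix` and `wikidata_oci` are hypothetical helper names introduced only for illustration.

```python
import re

# Reading of the rule above: one or more non-zero digits enclosed between two zeros.
SUPPLIER_PREFIX = re.compile(r"0[1-9]+0")

WIKIDATA_PREFIX = "010"  # supplier prefix for Wikidata, as given in the text


def is_supplier_prefix(prefix: str) -> bool:
    """Check that a string has the shape of an OCC supplier prefix."""
    return SUPPLIER_PREFIX.fullmatch(prefix) is not None


def wikidata_oci(citing_qid: str, cited_qid: str) -> str:
    """Build an OCI from two Wikidata Q identifiers (hypothetical helper)."""
    parts = []
    for qid in (citing_qid, cited_qid):
        if re.fullmatch(r"Q[0-9]+", qid) is None:
            raise ValueError(f"not a Wikidata Q identifier: {qid!r}")
        # Drop the leading "Q" and prepend the supplier prefix.
        parts.append(WIKIDATA_PREFIX + qid[1:])
    return "oci:" + "-".join(parts)


print(is_supplier_prefix("0420"))
print(wikidata_oci("Q27931310", "Q22252312"))
```

Because the digits inside a prefix are never zero, the prefix always ends at the second zero encountered, so it can be separated from the rest of the identifier without ambiguity.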

OCIs for citations between resources identified by DOIs

OCIs can also be created for citations between bibliographic resources described in external bibliographic databases such as Crossref or DataCite, where they are identified by alphanumeric Digital Object Identifiers (DOIs) rather than by purely numerical strings.

To achieve this, each case-insensitive DOI is first normalized to lower case letters. Then, after omitting the initial “doi:10.” prefix, the alphanumeric string of the DOI is converted reversibly to a pure numerical string using the simple two-numeral lookup table for numerals, lower case letters and other characters presented at https://github.com/opencitations/oci/blob/master/lookup.csv. For example, using this lookup table, “1” becomes “01”, “2” becomes “02”, “a” becomes “10”, “b” becomes “11”, and “/” becomes “36”.  To the resulting number, the appropriate OCC supplier prefix is then added, to clearly identify its provenance.
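A minimal Python sketch of this conversion is given below. It is not the OpenCitations implementation. The table it uses is deliberately partial: numerals and lower case letters follow the pattern of the examples above, and the codes for “/”, “.” and “-” are the ones that can be read off the worked examples in this post. Any other character is rejected rather than guessed, and the full table remains the lookup.csv file linked above. The name `encode_doi` is hypothetical, and the supplier prefix is passed in by the caller.

```python
import string

# Partial lookup table (assumption: a subset of lookup.csv).
# "0".."9" -> "00".."09", "a".."z" -> "10".."35".
LOOKUP = {ch: f"{i:02d}" for i, ch in enumerate(string.digits + string.ascii_lowercase)}
# Codes for the three other characters that occur in this post's examples.
LOOKUP.update({"/": "36", ".": "37", "-": "63"})


def encode_doi(doi: str, supplier_prefix: str) -> str:
    """Convert a DOI to its numerical form, preceded by the supplier prefix."""
    normalized = doi.strip().lower()
    for lead in ("doi:10.", "10."):
        if normalized.startswith(lead):
            remainder = normalized[len(lead):]
            break
    else:
        raise ValueError(f"not a DOI: {doi!r}")
    try:
        digits = "".join(LOOKUP[ch] for ch in remainder)
    except KeyError as exc:
        raise ValueError(f"character {exc.args[0]!r} is not in the partial table") from None
    return supplier_prefix + digits


# "020" is the Crossref supplier prefix used later in this post.
print(encode_doi("doi:10.1108/jd-12-2013-0166", "020"))
```

Since every character maps to exactly two numerals, the numerical string can be read back two numerals at a time, which is what makes the conversion reversible.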

A citation documented in Crossref exists between the two publications [3] and [6], which are there identified by the DOIs doi:10.1108/jd-12-2013-0166 and doi:10.1371/journal.pcbi.1000361.  We can thus create an OCI for this Crossref citation by using numerical representations of the two DOIs. These numerical representations are:

0200101000836191363010263020001036300010606

and

02001030701361924302723102137251211183701000000030601

where the initial “020” in each case is the assigned OCC supplier prefix for Crossref.

From these two numerical representations of DOIs, the OCI for the Crossref citation between these two papers is easily constructed, and is:

oci:0200101000836191363010263020001036300010606-02001030701361924302723102137251211183701000000030601
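Because the conversion is reversible, the two DOIs can be recovered from this OCI. The Python sketch below illustrates the idea under the same caveat as any hand-built table: it decodes only numerals, lower case letters and the three other characters seen in this example, whereas the authoritative table is lookup.csv. The function names are hypothetical.

```python
import string

# Partial reverse table (assumption: a subset of lookup.csv).
DECODE = {f"{i:02d}": ch for i, ch in enumerate(string.digits + string.ascii_lowercase)}
DECODE.update({"36": "/", "37": ".", "63": "-"})

CROSSREF_PREFIX = "020"  # supplier prefix for Crossref, as given in the text


def decode_doi(number: str) -> str:
    """Turn the numerical form of a Crossref DOI back into the DOI itself."""
    if not number.startswith(CROSSREF_PREFIX):
        raise ValueError("missing Crossref supplier prefix")
    body = number[len(CROSSREF_PREFIX):]
    if len(body) % 2 != 0:
        raise ValueError("numerical part must have an even number of numerals")
    # Read the body two numerals at a time.
    pairs = [body[i:i + 2] for i in range(0, len(body), 2)]
    try:
        return "10." + "".join(DECODE[pair] for pair in pairs)
    except KeyError as exc:
        raise ValueError(f"code {exc.args[0]!r} is not in the partial table") from None


def crossref_oci_to_dois(oci: str) -> tuple:
    """Return the (citing DOI, cited DOI) pair encoded in a Crossref OCI."""
    if not oci.startswith("oci:"):
        raise ValueError("missing 'oci:' prefix")
    citing, cited = oci[len("oci:"):].split("-")
    return decode_doi(citing), decode_doi(cited)


oci = ("oci:0200101000836191363010263020001036300010606"
       "-02001030701361924302723102137251211183701000000030601")
print(crossref_oci_to_dois(oci))
```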

While this is long for an identifier, it should be remembered that it will be processed computationally, and is not intended for human readability.

In this way, Crossref OCIs can be assigned to all ~350 million open references within Crossref in which the cited paper as well as the citing paper has a DOI [7].

OCIs for the same citation recorded within different databases

If a citation is recorded in more than one bibliographic database, a separate OCI can be created for each instance, each OCI having a distinct supplier prefix and being specific to that database.

Thus, in addition to the Crossref OCI created from DOIs and described above for the citation from [3] to [6], a Wikidata OCI exists for the same citation recorded within Wikidata, having the form oci:01024260641-01021092566.

Upon resolution of an OCI, the Open Citation Identifier Resolution Service will pull metadata only from the database specified by the supplier prefix of the OCI.  Details of the Open Citation Identifier Resolution Service are given in the next blog post.
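To make the role of the supplier prefix concrete, here is a small Python sketch that reads the prefix from an OCI and names the database it points to. It knows only the two prefixes mentioned in this post, whereas the complete list is in suppliers.csv. It is an illustration of the idea and does not describe how the resolution service itself is implemented.

```python
import re
from typing import Optional

# Only the two supplier prefixes named in this post (assumption: partial list).
KNOWN_SUPPLIERS = {"010": "Wikidata", "020": "Crossref"}

# A supplier prefix: non-zero digits enclosed between two zeros.
PREFIX_AT_START = re.compile(r"0[1-9]+0")


def supplier_prefix(oci: str) -> Optional[str]:
    """Return the supplier prefix of the citing part of an OCI, or None if it has none."""
    if not oci.startswith("oci:"):
        raise ValueError("missing 'oci:' prefix")
    citing = oci[len("oci:"):].split("-")[0]
    match = PREFIX_AT_START.match(citing)
    return match.group(0) if match else None


def supplier_name(oci: str) -> str:
    """Name the database an OCI refers to, as far as this partial list allows."""
    prefix = supplier_prefix(oci)
    if prefix is None:
        # For example, an OCC local identifier assigned before prefixes were introduced.
        return "no supplier prefix"
    return KNOWN_SUPPLIERS.get(prefix, "supplier not in this partial list")


print(supplier_name("oci:01024260641-01021092566"))
```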

It is important to note that an OCI can only be used to specify a citation between a citing and a cited publication which is actually recorded within a bibliographic database.  For this reason, the OCI “oci:7295288-3962641” shown below the second diagram in the introductory blog post to this series is presently invalid.  While the OpenCitations Corpus has metadata describing both bibliographic resources [3] and [6], it has not yet ingested the reference list for the first bibliographic resource [3] (which has the OCC local identifier 7295288), having information about it only from a reference within a third paper, with no information about the references [3] itself contains.  As a result, at present OCC has no record that a citation actually exists between [3] and the second bibliographic resource [6] (which has the OCC local identifier 3962641).

Representing OCIs in RDF

To permit the description of OCIs in RDF, “oci” has been added as a new member of the class datacite:ResourceIdentifierScheme within the DataCite Ontology.

The resolvable URL for any citation identified by an OCI has the form “https://w3id.org/oc/virtual/ci/nnn-mmm”, where nnn-mmm represents the OCI with its “oci:” prefix removed. Currently, we are able to return the RDF description of all the citations contained in the OpenCitations Corpus and Wikidata. We are working to extend the coverage so as to include other datasets, e.g. Crossref.
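Constructing such a URL from an OCI is a simple string operation, sketched here in Python. The function name `oci_to_url` is hypothetical, and the sketch only builds the URL; it makes no request and says nothing about what the service returns.

```python
BASE_URL = "https://w3id.org/oc/virtual/ci/"  # URL form given in the text


def oci_to_url(oci: str) -> str:
    """Build the resolvable URL for an OCI by removing its 'oci:' prefix."""
    if not oci.startswith("oci:"):
        raise ValueError("missing 'oci:' prefix")
    return BASE_URL + oci[len("oci:"):]


print(oci_to_url("oci:2544384-7295288"))
```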

References

[1]     David Shotton (2018). Citations as first-class data entities. Open Citation Identifiers.  Conference presentation. PIDapalooza 2018, Girona, 23-24 January 2018. https://doi.org/10.6084/m9.figshare.5844972

[2]     Armen Yuri Gasparyan, Marlen Yessirkepov et al. (2015). Preserving the integrity of citations and references by all stakeholders of science communication.  J. Korean Med. Sci. 30:1545-1552. (English.)  https://doi.org/10.3346/jkms.2015.30.11.1545

[3]     Silvio Peroni, Alexander Dutton, Tanya Gray and David Shotton (2015). Setting our bibliographic references free: towards open citation data. Journal of Documentation, 71 (2): 253-277.  https://doi.org/10.1108/jd-12-2013-0166

[4]     Daniel K. Bricker, Eric B. Taylor et al. (2012). A Mitochondrial Pyruvate Carrier Required for Pyruvate Uptake in Yeast, Drosophila, and Humans. Science 337: 96-100.  https://doi.org/10.1126/science.1218099

[5]     Douglas Hanahan and Robert A. Weinberg (2011). Hallmarks of cancer: the next generation.  Cell 144: 646–674.  https://doi.org/10.1016/j.cell.2011.02.013

[6]     David Shotton, Katie Portwin, Graham Klyne and Alistair Miles (2009).  Adventures in semantic publishing: exemplar semantic enhancement of a research article. PLoS Computational Biology 5: e1000361. http://dx.doi.org/10.1371/journal.pcbi.1000361

[7]     Daniel Ecer (2017). Crossref Data Notebook (updated). Available at https://elifesci.org/crossref-data-notebook

 
