Requirements for citations to be treated as First-Class Data Entities
In my introductory blog post, I listed five requirements for the treatment of citations as first-class data entities. The fourth of these requirements is that they must be identifiable using a global persistent identifier scheme.
At the recent PIDapalooza Conference on persistent identifiers, held in Girona, Spain, I launched the Open Citation Identifier (abbreviated OCI, in line with DOI), the new persistent identifier for citations (Shotton, 2018).
In this post, I describe the Open Citation Identifier scheme, created and operated by OpenCitations, which supports the assignment of Open Citation Identifiers not only to the citations present in the OpenCitations Corpus (OCC) but also to open citations present in other bibliographic databases.
Structure and syntax of the Open Citation Identifier
Each OCI has a simple structure: oci:number-number, where “oci:” is the identifier prefix.
OCIs for citations stored within the OpenCitations Corpus are constructed by combining the OpenCitations Corpus local identifiers for the citing and cited bibliographic resources, separating them with a dash. (For the definition of OCC local identifiers, see the OpenCitations Data Model.)
For example, oci:2544384-7295288 is a valid OCI for the citation between two papers stored within the OpenCitations Corpus. The first number is the OCC local identifier for the citing bibliographic resource, and the second is the OCC local identifier for the cited bibliographic resource, both local identifiers being unique within the OCC. [Note: Supplier prefixes are omitted from OCC local identifiers of bibliographic resources ingested into the OpenCitations Corpus prior to February 2018, but will be included within all OCC local identifiers of bibliographic resources ingested into the Corpus after that date.]
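The construction just described can be sketched in a few lines of Python. This is a minimal illustration, not OpenCitations code: the function names make_oci and split_oci are hypothetical, and the pattern assumes that both parts of an OCI consist only of the numerals 0-9.

```python
import re

# Assumed shape of an OCI: "oci:" followed by two numeral strings joined by a hyphen.
OCI_PATTERN = re.compile(r"oci:([0-9]+)-([0-9]+)")
NUMERALS = re.compile(r"[0-9]+")


def make_oci(citing_id: str, cited_id: str) -> str:
    """Join the numerical identifiers of the citing and cited resources into an OCI."""
    for value in (citing_id, cited_id):
        if NUMERALS.fullmatch(value) is None:
            raise ValueError(f"not a purely numerical identifier: {value!r}")
    return f"oci:{citing_id}-{cited_id}"


def split_oci(oci: str) -> tuple[str, str]:
    """Return the (citing, cited) numerical identifiers contained in an OCI."""
    match = OCI_PATTERN.fullmatch(oci)
    if match is None:
        raise ValueError(f"not a well-formed OCI: {oci!r}")
    return match.group(1), match.group(2)


# The example from the text: citing resource 2544384, cited resource 7295288.
print(make_oci("2544384", "7295288"))
print(split_oci("oci:2544384-7295288"))
```

Note that a well-formed OCI is not necessarily a valid one: validity also requires that the citation is actually recorded in the database concerned.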
OCIs for external resources identified by numerical identifiers
OCIs can also be created for citations between bibliographic resources described in an external bibliographic database, if they are similarly identified there by identifiers having a unique numerical part. For example, the OCI for the citation that exists between Wikidata resources Q27931310 (the citing resource) and Q22252312 (the cited resource) is oci:01027931310-01022252312, where “010” is the assigned OCC supplier prefix for Wikidata.
The OCC supplier prefix consists of a positive number (following the pattern “nnn”, where “nnn” is a string of numerals of variable length which includes no zeros), enclosed between two zeros (e.g. “0420”). The list of all assigned OCC supplier prefixes is given at https://github.com/opencitations/oci/blob/master/suppliers.csv.
OCIs for citations between resources identified by DOIs
OCIs can also be created for citations between bibliographic resources described in external bibliographic databases such as Crossref or DataCite, where they are identified by alphanumeric Digital Object Identifiers (DOIs) rather than purely numerical strings.
To achieve this, each case-insensitive DOI is first normalized to lower case letters. Then, after omitting the initial “doi:10.” prefix, the alphanumeric string of the DOI is converted reversibly to a pure numerical string using the simple two-numeral lookup table for numerals, lower case letters and other characters presented at https://github.com/opencitations/oci/blob/master/lookup.csv. For example, using this lookup table, “1” becomes “01”, “2” becomes “02”, “a” becomes “10”, “b” becomes “11”, and “/” becomes “36”. To the resulting number, the appropriate OCC supplier prefix is then added, to clearly identify its provenance.
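The conversion can be sketched in Python as follows. This is an illustration under stated assumptions, not the OpenCitations implementation. The table below is only partial: it covers the numerals, the lower case letters (assumed to run consecutively from “a” = 10 to “z” = 35, which is consistent with the examples above) and “/”. Codes for other characters must be taken from lookup.csv, so this sketch rejects them rather than guessing. The function names and the DOI 10.1234/ABC are hypothetical.

```python
import string

# Partial lookup table: numerals 00-09, letters a-z 10-35 (assumed consecutive), "/" 36.
PARTIAL_LOOKUP = {digit: f"0{digit}" for digit in string.digits}
PARTIAL_LOOKUP.update(
    {letter: str(10 + index) for index, letter in enumerate(string.ascii_lowercase)}
)
PARTIAL_LOOKUP["/"] = "36"
REVERSE_LOOKUP = {code: char for char, code in PARTIAL_LOOKUP.items()}


def doi_to_numeric(doi: str, supplier_prefix: str = "020") -> str:
    """Lower-case the DOI, drop the leading "doi:10.", encode, and add the prefix."""
    normalized = doi.lower()
    if not normalized.startswith("doi:10."):
        raise ValueError(f"expected a DOI starting with 'doi:10.': {doi!r}")
    remainder = normalized[len("doi:10."):]
    codes = []
    for char in remainder:
        if char not in PARTIAL_LOOKUP:
            raise ValueError(f"no code for {char!r} in this partial table")
        codes.append(PARTIAL_LOOKUP[char])
    return supplier_prefix + "".join(codes)


def numeric_to_doi(numeric: str, supplier_prefix: str = "020") -> str:
    """Reverse the encoding by reading the numerals two at a time."""
    if not numeric.startswith(supplier_prefix):
        raise ValueError("supplier prefix does not match")
    body = numeric[len(supplier_prefix):]
    if len(body) % 2 != 0:
        raise ValueError("encoded part must have an even number of numerals")
    pairs = (body[i:i + 2] for i in range(0, len(body), 2))
    return "doi:10." + "".join(REVERSE_LOOKUP[pair] for pair in pairs)


encoded = doi_to_numeric("doi:10.1234/ABC")   # hypothetical DOI
print(encoded)
print(numeric_to_doi(encoded))
```

Because every character maps to exactly two numerals, the encoded string can always be read back in pairs, which is what makes the conversion reversible.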
A citation documented in Crossref exists between the two publications Peroni et al. (2015) and Shotton et al. (2009), which are there identified by the DOIs doi:10.1108/jd-12-2013-0166 and doi:10.1371/journal.pcbi.1000361. We can thus create an OCI for this Crossref citation by using numerical representations of the two DOIs, each of which begins with “020”, the assigned OCC supplier prefix for Crossref.
From these two numerical representations of DOIs, the OCI for the Crossref citation between these two papers is easily constructed by placing them, separated by a dash, after the “oci:” prefix.
While the resulting OCI is long for an identifier, it should be remembered that it will be processed computationally, and is not intended for human readability.
In this way, Crossref OCIs can be assigned to all ~350 million open references within Crossref in which the cited paper as well as the citing paper has a DOI (Ecer, 2017).
OCIs for the same citation recorded within different databases
If a citation is recorded in more than one bibliographic database, a separate OCI can be created for each instance, each OCI having a distinct supplier prefix and being specific to that database.
Thus, in addition to the Crossref OCI created from DOIs and described above for the citation from Peroni et al. (2015) to Shotton et al. (2009), a Wikidata OCI exists for the same citation recorded within Wikidata, having the form oci:01024260641-01021092566.
Upon resolution of an OCI, the Open Citation Identifier Resolution Service will pull metadata only from the database specified by the supplier prefix of the OCI. Details of the Open Citation Identifier Resolution Service are given in the next blog post.
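The role of the supplier prefix in choosing the source database can be illustrated with a short Python sketch. It is not the Resolution Service itself. The mapping contains only the two prefixes named in this post, the function name source_database is hypothetical, and two further assumptions are made: that both halves of an OCI carry the same supplier prefix, and that an OCI without any supplier prefix refers to the OpenCitations Corpus.

```python
import re

# Supplier prefixes named in this post; the full list is in suppliers.csv.
KNOWN_SUPPLIERS = {"010": "Wikidata", "020": "Crossref"}

OCI_PATTERN = re.compile(r"oci:([0-9]+)-([0-9]+)")
SUPPLIER_PREFIX = re.compile(r"0[1-9]+0")


def source_database(oci: str) -> str:
    """Name the database that an OCI's supplier prefix points to."""
    match = OCI_PATTERN.fullmatch(oci)
    if match is None:
        raise ValueError(f"not a well-formed OCI: {oci!r}")
    prefixes = set()
    for part in match.groups():
        found = SUPPLIER_PREFIX.match(part)
        prefixes.add(found.group(0) if found else None)
    if len(prefixes) != 1:
        # Assumption: one OCI describes a citation recorded in one database.
        raise ValueError("citing and cited parts carry different supplier prefixes")
    prefix = prefixes.pop()
    if prefix is None:
        return "OpenCitations Corpus"  # assumption for unprefixed OCC identifiers
    return KNOWN_SUPPLIERS.get(prefix, f"unlisted supplier {prefix}")


print(source_database("oci:01024260641-01021092566"))  # Wikidata OCI from the text
print(source_database("oci:2544384-7295288"))          # OCC OCI from the text
```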
It is important to note that an OCI can only be used to specify a citation between a citing and a cited publication which is actually recorded within a bibliographic database. For this reason, the OCI “oci:7295288-3962641” shown below the second diagram in the introductory blog post to this series is presently invalid. While the OpenCitations Corpus has metadata describing both bibliographic resources, it has not yet ingested the reference list for the first bibliographic resource (which has the OCC local identifier 7295288), having information about it only from a reference within a third paper, with no information about the references that the first resource itself contains. As a result, at present OCC has no record that a citation actually exists between it and the second bibliographic resource (which has the OCC local identifier 3962641).
Representing OCIs in RDF
To permit the description of OCIs in RDF, “oci” has been added as a new member of the class datacite:ResourceIdentifierScheme within the DataCite Ontology.
The resolvable URL for any citation identified by an OCI has the form “https://w3id.org/oc/virtual/ci/nnn-mmm”, where nnn-mmm represents the OCI with its “oci:” prefix removed. Currently, we are able to return the RDF description of all the citations contained in the OpenCitations Corpus and Wikidata. We are working to extend the coverage so as to include other datasets, e.g. Crossref.
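Deriving the URL from an OCI is a simple string operation, sketched below in Python. The function name oci_to_url is hypothetical; the base URL is the one given above.

```python
BASE_URL = "https://w3id.org/oc/virtual/ci/"


def oci_to_url(oci: str) -> str:
    """Drop the "oci:" prefix and append the remainder to the base URL."""
    if not oci.startswith("oci:"):
        raise ValueError(f"expected an identifier starting with 'oci:': {oci!r}")
    return BASE_URL + oci[len("oci:"):]


# The Wikidata example from earlier in this post.
print(oci_to_url("oci:01027931310-01022252312"))
```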
David Shotton (2018). Citations as first-class data entities: Open Citation Identifiers. Conference presentation. PIDapalooza 2018, Girona, 23-24 January 2018. https://doi.org/10.6084/m9.figshare.5844972
Armen Yuri Gasparyan, Marlen Yessirkepov et al. (2015). Preserving the integrity of citations and references by all stakeholders of science communication. J. Korean Med. Sci. 30: 1545-1552. https://doi.org/10.3346/jkms.2015.30.11.1545
Silvio Peroni, Alexander Dutton, Tanya Gray and David Shotton (2015). Setting our bibliographic references free: towards open citation data. Journal of Documentation 71 (2): 253-277. https://doi.org/10.1108/jd-12-2013-0166
Daniel K. Bricker, Eric B. Taylor et al. (2012). A Mitochondrial Pyruvate Carrier Required for Pyruvate Uptake in Yeast, Drosophila, and Humans. Science 337: 96-100.
Douglas Hanahan and Robert A. Weinberg (2011). Hallmarks of cancer: the next generation. Cell 144: 646-674. https://doi.org/10.1016/j.cell.2011.02.013
David Shotton, Katie Portwin, Graham Klyne and Alistair Miles (2009). Adventures in semantic publishing: exemplar semantic enhancement of a research article. PLoS Computational Biology 5: e1000361. https://doi.org/10.1371/journal.pcbi.1000361
Daniel Ecer (2017). Crossref Data Notebook (updated). Available at https://elifesci.org/crossref-data-notebook