COCI, the OpenCitations Index of Crossref open DOI-to-DOI references

In a previous series of blog posts we proposed the treatment of bibliographic citations as first-class data entities, permitting citations to be endowed with descriptive properties. In doing so, we outlined some specific requirements, namely that the citations should be machine readable, should conform to a specific data model (in this case the OpenCitations Data Model), should be stored in an accessible database under an open license, and should be identified using global persistent identifiers (specifically Open Citation Identifiers) which are resolvable using an identifier resolution service (namely the Open Citation Identifier Resolution Service).

In this blog post, we introduce COCI, the OpenCitations Index of Crossref open DOI-to-DOI references (see Footnote 1), our first open citation index, in which we have applied the concept of citations as first-class data entities to index the contents of one of the major open databases of scholarly citation information, namely Crossref, and to render and make available this information in machine-readable RDF.

Crossref contains metadata about publications (mainly academic journal articles) that are identified using Digital Object Identifiers (DOIs).  For about half of these publications Crossref also stores the reference lists of these articles submitted by the publishers (for discussion of why this is not true for all the publications, see this previous blog post).  Many of these references are to other publications bearing DOIs that are also described in Crossref, while others are to publications that lack DOIs and do not have Crossref descriptions.

COCI is an index of all the open DOI-to-DOI citations present in Crossref, and presently includes more than 300 million citations, obtained by parsing the open reference lists of the articles deposited there. COCI is available at http://opencitations.net/index/coci, and is released under a CC0 waiver.  COCI does not index Crossref references that are not open, nor Crossref open references to entities that lack DOIs.

What is an open citation index?

A citation index is a bibliographic index recording citations between publications, allowing the user to establish which later documents cite earlier documents. Several citation indexes are already available, some of which are freely accessible but not downloadable (e.g. Google Scholar), while others can be accessed only by paying significant access fees (e.g. Web of Science and Scopus). An open citation index contains only data about open citations, as defined in [1].

OpenCitations is a scholarly infrastructure organization dedicated to the promotion of semantic publishing by the use of semantic web (linked data) technologies, and engaged in advocacy for semantic publishing and open citations. It provides the OpenCitations Data Model and the SPAR (Semantic Publishing and Referencing) Ontologies for encoding scholarly bibliographic and citation data in RDF, and open software of generic applicability for searching, browsing and providing REST APIs over RDF triplestores. It has developed the OpenCitations Corpus (OCC) of open downloadable bibliographic and citation data recorded in RDF, and a system and resolution service for Open Citation Identifiers (OCIs), and it is currently developing a number of Open Citation Indexes using the data openly available in third-party bibliographic databases.

These Open Citation Indexes have the following characteristics in common:

  1. The citations they contain are all open [1];
  2. The citations are treated as first-class data entities;
  3. Each citation is identified by an Open Citation Identifier (OCI), which has a simple structure: the lower-case letters “oci” followed by a colon, followed by two numbers separated by a dash (e.g. oci:1-18);
  4. The citation metadata are recorded in RDF, based on the OpenCitations Data Model [2];
  5. The RDF statements for each citation record the basic properties shown in the following figure, which is based on the Citation Typing Ontology (CiTO) for describing the data, and the Provenance Ontology (PROV-O) for the provenance information.

The data model used for describing the citation data included in any Open Citation Index.
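The OCI syntax given in item 3 of the list above can be checked mechanically. The following is a minimal sketch in Python, not part of the OpenCitations software; the function name `split_oci` is our own, and the pattern assumes that each of the two numbers is a non-empty run of decimal digits.

```python
import re

# Pattern for the OCI syntax described above:
# "oci:" + digits + "-" + digits (e.g. "oci:1-18").
OCI_PATTERN = re.compile(r"oci:([0-9]+)-([0-9]+)")


def split_oci(oci):
    """Return the two numeric parts of an OCI as strings, or raise ValueError."""
    match = OCI_PATTERN.fullmatch(oci)
    if match is None:
        raise ValueError(f"not a well-formed OCI: {oci!r}")
    return match.group(1), match.group(2)
```

For instance, `split_oci("oci:1-18")` yields the pair `("1", "18")`, while a string such as `"OCI:1-18"` is rejected because the prefix must be in lower case.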

Parsing the Crossref collection

Over the past few months, we have parsed the entire Crossref bibliographic database to extract all the DOI-to-DOI citations included in the dataset, as well as additional information about each citation, specifically its creation date (i.e. the publication date of the citing entity) and the citation time span (i.e. the interval between the publication date of the cited entity and the publication date of the citing entity, to an accuracy determined by these publication dates as recorded in Crossref). These data for the open citations are now made available in COCI.

Each citation is described as an individual of the class cito:Citation and is identified by a URL structured as follows:

https://w3id.org/oc/index/coci/ci/[[OCI]]

The parameter [[OCI]] refers to the numerical part of the Open Citation Identifier (OCI) assigned to the citation, i.e. two numbers separated by a dash, in which the first number identifies the citing work and the second number identifies the cited work. For instance:

https://w3id.org/oc/index/coci/ci/02001010806360107050663080702026306630509-02001010806360107050663080702026305630301

For citations extracted from Crossref in which the citing and cited works are identified by DOIs, which includes all the COCI citations, the OCI is created in the following manner:

  1. Each case-insensitive DOI is first normalized to lower case letters.
  2. Then, after omitting the initial “doi:10.” prefix, the alphanumeric string of the DOI is converted reversibly to a pure numerical string using the simple two-numeral lookup table for numerals, lower case letters and other characters presented at https://github.com/opencitations/oci/blob/master/lookup.csv.
  3. Finally, each converted numerical string is prefixed by “020”, which indicates that Crossref is the supplier of the original metadata of the citation (as indicated at http://opencitations.net/oci).
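The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the OpenCitations implementation: the lookup table below is deliberately partial and contains only the codes that can be read off the example citation URL in this post (digits, “/” and “-”); the authoritative table, which also covers letters and other characters, is the lookup.csv file linked in step 2. All function names are our own.

```python
# Partial lookup table: only the characters whose two-numeral codes can be
# read off the example OCI in this post. The full table is lookup.csv.
PARTIAL_LOOKUP = {str(d): f"0{d}" for d in range(10)}
PARTIAL_LOOKUP.update({"/": "36", "-": "63"})
REVERSE_LOOKUP = {code: char for char, code in PARTIAL_LOOKUP.items()}

CROSSREF_PREFIX = "020"  # supplier prefix for Crossref (step 3)
DOI_PREFIX = "10."       # omitted before conversion (step 2)
CITATION_BASE = "https://w3id.org/oc/index/coci/ci/"


def doi_to_oci_part(doi):
    """Convert one DOI into the numeric string used in a COCI OCI."""
    doi = doi.lower()  # step 1: normalize to lower case
    if doi.startswith("doi:"):
        doi = doi[len("doi:"):]
    if not doi.startswith(DOI_PREFIX):
        raise ValueError(f"not a DOI: {doi!r}")
    converted = []
    for char in doi[len(DOI_PREFIX):]:  # step 2: two numerals per character
        if char not in PARTIAL_LOOKUP:
            raise ValueError(f"{char!r} is not in the partial table; see lookup.csv")
        converted.append(PARTIAL_LOOKUP[char])
    return CROSSREF_PREFIX + "".join(converted)  # step 3: supplier prefix


def oci_part_to_doi(part):
    """Reverse the conversion for one half of a COCI OCI."""
    if not part.startswith(CROSSREF_PREFIX):
        raise ValueError("missing the 020 supplier prefix")
    body = part[len(CROSSREF_PREFIX):]
    if len(body) % 2:
        raise ValueError("expected an even number of numerals")
    pairs = [body[i:i + 2] for i in range(0, len(body), 2)]
    unknown = [pair for pair in pairs if pair not in REVERSE_LOOKUP]
    if unknown:
        raise ValueError(f"codes not in the partial table: {unknown}")
    return DOI_PREFIX + "".join(REVERSE_LOOKUP[pair] for pair in pairs)


def citation_url(citing_doi, cited_doi):
    """Build the COCI URL of the citation from the citing DOI to the cited DOI."""
    return CITATION_BASE + doi_to_oci_part(citing_doi) + "-" + doi_to_oci_part(cited_doi)
```

Calling `citation_url("10.1186/1756-8722-6-59", "10.1186/1756-8722-5-31")` reproduces the example citation URL shown earlier in this section, and `oci_part_to_doi` recovers each DOI from its half of the OCI, which illustrates that the conversion is reversible.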

Currently COCI contains 316,243,802 citations and 45,145,889 bibliographic resources. We plan to update COCI at least every six months as more open DOI-to-DOI citations appear in Crossref.

How to access the citation data in COCI

All the data in COCI are available for inspection, download and reuse in the following ways.

SPARQL endpoint

The COCI SPARQL endpoint is available at https://w3id.org/oc/index/coci/sparql. If you access this URL with a browser, a GUI is shown that contains an editable text box in which you can compose and execute a SPARQL query. In addition, the COCI SPARQL endpoint can be queried programmatically over HTTP, e.g. (via curl):

curl -L -H "Accept: text/csv" "https://w3id.org/oc/index/coci/sparql?query=PREFIX%20cito%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fspar%2Fcito%2F%3E%0ASELECT%20%3Fcitation%20%3Fcreation%20%7B%20%3Fcitation%20a%20cito%3ACitation%20%3B%20cito%3AhasCitationCreationDate%20%3Fcreation%20%7D%20LIMIT%201"

The above GET call executes the following simple SPARQL query:

PREFIX cito: <http://purl.org/spar/cito/>
SELECT ?citation ?creation { 
    ?citation a cito:Citation ; 
        cito:hasCitationCreationDate ?creation 
} 
LIMIT 1

This query returns, in CSV format, the IRI of one citation accompanied by its creation date, as follows:

citation,creation
https://w3id.org/oc/index/coci/ci/02001000002360105020963000103015801090909000259040238024003010138381018136310232701044203370037122439026315-02001000002361027293701070800030100060007,1999-02
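The same GET call can be issued from a script. The sketch below uses only the Python standard library; the function names are our own, and the network call in `run_query` assumes that the endpoint is reachable and behaves as described in this post.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://w3id.org/oc/index/coci/sparql"

# The simple query shown above, written on two lines as in the curl call.
QUERY = (
    "PREFIX cito: <http://purl.org/spar/cito/>\n"
    "SELECT ?citation ?creation { ?citation a cito:Citation ; "
    "cito:hasCitationCreationDate ?creation } LIMIT 1"
)


def build_query_url(query, endpoint=ENDPOINT):
    """Percent-encode the query and append it as the 'query' parameter."""
    # quote_via=quote encodes spaces as %20 (as in the curl call), not "+".
    params = urllib.parse.urlencode({"query": query}, quote_via=urllib.parse.quote)
    return endpoint + "?" + params


def run_query(query, accept="text/csv", endpoint=ENDPOINT):
    """Send the GET request and return the response body as text."""
    request = urllib.request.Request(
        build_query_url(query, endpoint), headers={"Accept": accept}
    )
    # urlopen follows HTTP redirects, which plays the role of curl's -L flag.
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

`build_query_url(QUERY)` produces the same URL as the one passed to curl above, and `run_query(QUERY)` would return the CSV text of the result.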

To retrieve the information about a particular citation given its OCI, use the following SPARQL query instead:

PREFIX oci: <https://w3id.org/oc/index/coci/ci/>
PREFIX cito: <http://purl.org/spar/cito/>
SELECT DISTINCT ?citing ?cited ?creation ?timespan {
oci:02001010806360107050663080702026306630509-02001010806360107050663080702026305630301 a cito:Citation ;
    cito:hasCitingEntity ?citing ;
    cito:hasCitedEntity ?cited ;
    cito:hasCitationCreationDate ?creation ;
    cito:hasCitationTimeSpan ?timespan
}

In this case, the query returns the DOI URLs of the citing and cited entities, accompanied by the creation date and the timespan of the citation:

citing,cited,creation,timespan
http://dx.doi.org/10.1186/1756-8722-6-59,http://dx.doi.org/10.1186/1756-8722-5-31,2013,P1Y
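A CSV result such as the one above is straightforward to consume in a script. The following Python sketch parses the rows and splits the timespan into its components. It assumes that the timespan is a duration written as an optional minus sign followed by “P” and optional year, month and day parts (the form of the “P1Y” value above); other duration forms are not handled, and the function names are our own.

```python
import csv
import io
import re

# The result shown above, used here as sample input.
SAMPLE = (
    "citing,cited,creation,timespan\n"
    "http://dx.doi.org/10.1186/1756-8722-6-59,"
    "http://dx.doi.org/10.1186/1756-8722-5-31,2013,P1Y\n"
)

# Assumed shape of the timespan, e.g. "P1Y", "P1Y2M", "P10D" or "-P3M".
DURATION = re.compile(r"(-)?P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?")


def parse_timespan(value):
    """Split a duration string into sign, years, months and days."""
    match = DURATION.fullmatch(value)
    if match is None or not any(match.groups()[1:]):
        raise ValueError(f"unsupported duration: {value!r}")
    sign, years, months, days = match.groups()
    return {
        "negative": sign == "-",
        "years": int(years or 0),
        "months": int(months or 0),
        "days": int(days or 0),
    }


def parse_rows(csv_text):
    """Parse the CSV text of a result into a list of dictionaries."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["timespan"] = parse_timespan(row["timespan"])
        rows.append(row)
    return rows
```

With the sample above, `parse_rows(SAMPLE)[0]["timespan"]["years"]` is 1, i.e. the citing article was published one year after the cited one at the precision recorded in Crossref.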

The results can also be returned in JSON (using “Accept: application/json” in the header of the request) or XML (using “Accept: application/xml” in the header of the request). For example, opening the URL used in the first curl request above (the long string starting with “https”) in a browser returns an XML document describing the same result.

REST API

Citation information may also be retrieved by using the COCI REST API, available and documented at https://w3id.org/oc/index/coci/api/v1, which has been implemented by means of RAMOSE (the Restful API Manager Over SPARQL Endpoints). The operations that the COCI REST API makes available for getting the citation data are described in that documentation.
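As a hedged illustration of calling the API from a script, the Python sketch below builds a request URL by appending an operation name and a DOI to the base URL given above. The operation name `citations` and the assumption that the response is JSON are ours and are not stated in this post; please check both against the API documentation before relying on them. The function names are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://w3id.org/oc/index/coci/api/v1"


def operation_url(operation, value, base=API_BASE):
    """Build the URL of an API call as base/operation/value."""
    return f"{base}/{operation}/{value}"


def call_api(operation, value):
    """Perform the call and decode the response, assuming it is JSON."""
    request = urllib.request.Request(
        operation_url(operation, value), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Under these assumptions, `call_api("citations", "10.1186/1756-8722-6-59")` would request the citation data associated with that DOI.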

If you would like to suggest an additional operation to be included in this API, please use the issue tracker of the COCI API available on GitHub.

Search and browsing interfaces

An interface providing a free text search over the contents of COCI is available at http://opencitations.net/index/coci/search. It allows one to search for citation data according to the same operational principles implemented in the REST API discussed above. However, in this case, the results are returned in tabular form through a web interface implemented by means of OSCAR, the OpenCitations RDF Search Application, e.g. http://opencitations.net/index/coci/search?text=10.1186%2F1756-8722-6-59&rule=citingdoi.
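Search links of this kind can be generated programmatically. The short Python sketch below rebuilds the example URL above from a DOI; the parameter names `text` and `rule` and the rule value `citingdoi` are taken from that example, whereas any other rule values would need to be checked against the search interface itself.

```python
import urllib.parse

SEARCH_BASE = "http://opencitations.net/index/coci/search"


def search_url(text, rule="citingdoi"):
    """Build a search URL, percent-encoding the search text (e.g. "/" as %2F)."""
    return SEARCH_BASE + "?" + urllib.parse.urlencode({"text": text, "rule": rule})
```

Here `search_url("10.1186/1756-8722-6-59")` returns exactly the example link given above.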

Each OCI returned by the search interface is a clickable link that opens a new descriptive page detailing that citation. These web pages are created by LUCINDA, which is a JavaScript-based RDF data browser developed for exposing the statements contained in an RDF triplestore as descriptive human-readable HTML pages, e.g. http://opencitations.net/index/coci/browser/ci/02001010806360107050663080702026306630509-02001000002361222131237020000060000020201.

Data dumps

The dump of all the citation data available in COCI, including their provenance information, is downloadable from Figshare. These data are available in CSV and N-Triples formats, and each dump has a DOI assigned so as to be citable. Download links are available at http://opencitations.net/download#coci. A new dump will be made each time COCI is updated.

By content negotiation

Using the HTTP URI of the individual citations, it is possible to access their representations in different formats: HTML, RDF/XML, Turtle, and JSON-LD. This is possible through a content negotiation mechanism that disambiguates in which format the information about a citation should be returned, by looking at the “Accept” header declared in the request. For instance, accessing the URL https://w3id.org/oc/index/coci/ci/02001010806360107050663080702026306630509-02001000002361222131237020000060000020201 will return the citation data in HTML, while the GET request below will return the same information in Turtle:

curl -L -H "Accept: text/turtle" "https://w3id.org/oc/index/coci/ci/02001010806360107050663080702026306630509-02001000002361222131237020000060000020201"
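The same content negotiation can be performed from Python by setting the “Accept” header explicitly. In the sketch below, the media types are the standard registered ones for the four formats listed above; that the COCI server responds to exactly these values is our assumption (only text/turtle is confirmed by the curl call above), and the function names are our own.

```python
import urllib.request

# Standard media types for the formats listed above (assumed to be the
# values the server negotiates on; only text/turtle appears in this post).
MEDIA_TYPES = {
    "html": "text/html",
    "rdfxml": "application/rdf+xml",
    "turtle": "text/turtle",
    "jsonld": "application/ld+json",
}

CITATION_URL = (
    "https://w3id.org/oc/index/coci/ci/"
    "02001010806360107050663080702026306630509-"
    "02001000002361222131237020000060000020201"
)


def negotiation_request(url, fmt):
    """Prepare a GET request asking for the given representation."""
    if fmt not in MEDIA_TYPES:
        raise ValueError(f"unknown format: {fmt!r}")
    return urllib.request.Request(url, headers={"Accept": MEDIA_TYPES[fmt]})


def fetch_representation(url, fmt):
    """Send the request (following redirects) and return the body as text."""
    with urllib.request.urlopen(negotiation_request(url, fmt)) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_representation(CITATION_URL, "turtle")` corresponds to the curl call above.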

Conclusions

In this blog post we have introduced COCI, the OpenCitations Index of Crossref open DOI-to-DOI references, a citation index which contains more than 310 million open citations created from the ‘Open’ references included in Crossref. We plan soon to extend COCI to additionally include those DOI-to-DOI citations extracted from the ‘Limited’ set of Crossref reference data.

References

  1. Silvio Peroni and David Shotton (2018). Open Citation: Definition. figshare. DOI: https://doi.org/10.6084/m9.figshare.6683855
  2. Silvio Peroni and David Shotton (2018). The OpenCitations Data Model. figshare. DOI: https://doi.org/10.6084/m9.figshare.3443876

Footnotes

  1. At Crossref’s request, we have changed the originally proposed description of COCI from “the Crossref Open Citation Index (COCI)” to “COCI, the OpenCitations Index of Crossref open DOI-to-DOI references”, to make clear that COCI is an OpenCitations index and to avoid any implication that COCI is a Crossref service. We apologize for the initial ambiguity of our original wording and any confusion this may have caused.