New release of COCI: 450M DOI-to-DOI citation links now available

As introduced in a previous blog post, COCI is the OpenCitations Index of Crossref open DOI-to-DOI references, all released as CC0 material. It is our first OpenCitations Index of open citations, in which we have applied the concept of citations as first-class data entities to index the contents of one of the major databases of open scholarly citation information, namely Crossref, and to render and make available this information in machine-readable RDF.

We are now proud to announce a new release of COCI, the second, which now contains almost 450 million DOI-to-DOI citation links coming from both the ‘Open’ and the ‘Limited’ sets of Crossref reference data. This represents an increase of 42% in the number of indexed citations, compared with the initial release of COCI on 4th June 2018, which indexed 316,243,802 citations involving 45,145,889 bibliographic resources. In addition, the data model for COCI has now been extended so as to state directly the presence of journal self-citations and author self-citations.

Extended data model

The previous data model used for storing the citation data in COCI – which is itself a subset of the OpenCitations Data Model – has been extended so as to keep track of two particular types of self-citation, as shown in the following figure.

The new data model used in COCI for describing its citation data, which includes classes for describing two kinds of self-citations, i.e. journal self-citations and author self-citations.

Generally speaking, a self-citation is a citation in which the citing and the cited entities have something significant in common with one another, over and beyond their subject matter. The two kinds of self-citations we are now tracking are:

  • journal self-citation (class cito:JournalSelfCitation), i.e. a citation in which the citing and the cited entities are published in the same journal. This information has been obtained by comparing the ISSNs of the journals where two journal articles related by a citation have been published, as provided by Crossref. If they share the same ISSN, then the citation is described as a journal self-citation;
  • author self-citation (class cito:AuthorSelfCitation), i.e. a citation in which the citing and the cited entities have at least one author in common. This information has been obtained by comparing the ORCIDs associated with the authors of a citing bibliographic entity with the ORCIDs of the authors of the cited entity. In this case, if any ORCID is shared, then the citation is described as an author self-citation. This categorization excludes authors bearing the same name where the ORCIDs are not known, since, while these instances may be author self-citations, they may alternatively merely represent name coincidences of distinct individuals.
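
The two comparisons described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build COCI: the Work class and the function names are hypothetical, and the sketch assumes that sharing any one ISSN (or any one ORCID) is enough for a match.

```python
from dataclasses import dataclass, field


@dataclass
class Work:
    """Hypothetical stand-in for the Crossref metadata of one bibliographic entity."""
    doi: str
    issns: set = field(default_factory=set)    # ISSNs of the containing journal
    orcids: set = field(default_factory=set)   # ORCIDs of the authors, when known


def is_journal_self_citation(citing: Work, cited: Work) -> bool:
    # True when the two works share at least one ISSN.
    return bool(citing.issns & cited.issns)


def is_author_self_citation(citing: Work, cited: Work) -> bool:
    # True when at least one ORCID appears on both works.
    # Authors without a known ORCID never match, so name coincidences
    # between distinct individuals are not counted.
    return bool(citing.orcids & cited.orcids)
```

Because the comparison relies on set intersection, a work with no recorded ORCIDs can never be classified as an author self-citation, which mirrors the exclusion of name-only matches described in the list.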

It is worth mentioning that, while the ISSN information is usually present in the data returned by Crossref, the presence of ORCID data associated with the authors of the various papers represented in Crossref is presently very limited, so that the number of recorded author self-citations in COCI is likely to be a considerable underestimate.

In this new release, COCI contains 449,842,374 citations, of which 30,114,696 are recorded as journal self-citations and 251,699 are recorded as author self-citations.

Extended REST API

The REST API for querying COCI has been extended so as to return information about the aforementioned self-citations. In particular, the response to the operations “references” and “citations” now has two more fields, i.e. “journal_sc” and “author_sc”, that are set to “yes” if the citation returned is a journal self-citation or an author self-citation respectively, or “no” otherwise.

Using the capabilities of the REST API, it is also possible to keep in or exclude from the result set those citations that are (or are not) one of the aforementioned types of self-citation. For instance, the following call

https://w3id.org/oc/index/coci/api/v1/citations/10.1002/pol.1987.140251103?filter=journal_sc:yes

returns all the citations received by the article with DOI “10.1002/pol.1987.140251103” that are journal self-citations.
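
The same call can be assembled programmatically. The following Python sketch is illustrative only: the function names are hypothetical, the response is assumed to be a JSON list of records carrying the “journal_sc” and “author_sc” fields described above, and the “no” value of the filter is an assumption based on the field values rather than on a documented example.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://w3id.org/oc/index/coci/api/v1"


def citations_url(doi, journal_sc=None):
    """Build a 'citations' call, optionally filtered on the journal_sc field."""
    url = f"{API_BASE}/citations/{quote(doi, safe='/')}"
    if journal_sc is not None:
        url += f"?filter=journal_sc:{'yes' if journal_sc else 'no'}"
    return url


def fetch_citations(doi, journal_sc=None):
    """Perform the call and decode the body (assumed to be a JSON list)."""
    with urllib.request.urlopen(citations_url(doi, journal_sc)) as response:
        return json.load(response)


def count_self_citations(records):
    """Count the records flagged 'yes' in the journal_sc and author_sc fields."""
    return {
        key: sum(1 for record in records if record.get(key) == "yes")
        for key in ("journal_sc", "author_sc")
    }
```

Calling citations_url("10.1002/pol.1987.140251103", journal_sc=True) reproduces the URL shown above, while count_self_citations can be applied to any list of records returned by the “references” or “citations” operations.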

Conclusions

In this blog post we have introduced the second release of COCI, the OpenCitations Index of Crossref open DOI-to-DOI references, a citation index which now contains almost 450 million open citations created from the ‘Open’ and ‘Limited’ references included within Crossref.

As a reminder, all the data in COCI:

We plan soon to extend the OpenCitations Indexes by adding indexes of citations coming from other source datasets, including Wikidata and DataCite.


COCI, the OpenCitations Index of Crossref open DOI-to-DOI references

In a previous series of blog posts we proposed the treatment of bibliographic citations as first-class data entities, permitting citations to be endowed with descriptive properties. In doing so, we outlined some specific requirements, namely that the citations should be machine readable, should conform to a specific data model (in this case the OpenCitations Data Model), should be stored in an accessible database under an open license, and should be identified using global persistent identifiers (specifically Open Citation Identifiers) which are resolvable using an identifier resolution service (namely the Open Citation Identifier Resolution Service).

In this blog post, we introduce COCI, the OpenCitations Index of Crossref open DOI-to-DOI references (see footnote 1), our first open citation index, in which we have applied the concept of citations as first-class data entities to index the contents of one of the major open databases of scholarly citation information, namely Crossref, and to render and make available this information in machine-readable RDF.

Crossref contains metadata about publications (mainly academic journal articles) that are identified using Digital Object Identifiers (DOIs).  For about half of these publications Crossref also stores the reference lists of these articles submitted by the publishers (for discussion of why this is not true for all the publications, see this previous blog post).  Many of these references are to other publications bearing DOIs that are also described in Crossref, while others are to publications that lack DOIs and do not have Crossref descriptions.

COCI is an index of all the open DOI-to-DOI citations present in Crossref, and presently includes more than 300 million citations, obtained by parsing the open reference lists of the articles deposited there. COCI is available at http://opencitations.net/index/coci, and is released under a CC0 waiver.  COCI does not index Crossref references that are not open, nor Crossref open references to entities that lack DOIs.

What is an open citation index?

A citation index is a bibliographic index recording citations between publications, allowing the user to establish which later documents cite earlier documents. Several citation indexes are already available, some of which are freely accessible but not downloadable (e.g. Google Scholar), while others can be accessed only by paying significant access fees (e.g. Web of Science and Scopus). An open citation index contains only data about open citations, as defined in [1].

OpenCitations is a scholarly infrastructure organization dedicated to the promotion of semantic publishing by the use of semantic web (linked data) technologies, and engaged in advocacy for semantic publishing and open citations. It provides the OpenCitations Data Model and the SPAR (Semantic Publishing and Referencing) Ontologies for encoding scholarly bibliographic and citation data in RDF, and open software of generic applicability for searching, browsing and providing REST APIs over RDF triplestores. It has developed the OpenCitations Corpus (OCC) of open downloadable bibliographic and citation data recorded in RDF, and a system and resolution service for Open Citation Identifiers (OCIs), and it is currently developing a number of Open Citation Indexes using the data openly available in third-party bibliographic databases.

These Open Citation Indexes have the following characteristics in common:

  1. The citations they contain are all open [1];
  2. The citations are treated as first-class data entities;
  3. Each citation is identified by an Open Citation Identifier (OCI), which has a simple structure: the lower-case letters “oci” followed by a colon, followed by two numbers separated by a dash (e.g. oci:1-18);
  4. The citation metadata are recorded in RDF, based on the OpenCitations Data Model [2];
  5. The RDF statements for each citation record the basic properties shown in the following figure, which is based on the Citation Typing Ontology (CiTO) for describing the data, and the Provenance Ontology (PROV-O) for the provenance information.

The data model used for describing the citation data included in any Open Citation Index.

Parsing the Crossref collection

Over the past few months, we have parsed the entire Crossref bibliographic database to extract all the DOI-to-DOI citations included in the dataset, as well as additional information about each citation, specifically its creation date (i.e. the publication date of the citing entity) and the citation time span (i.e. the interval between the publication date of the cited entity and the publication date of the citing entity, to an accuracy determined by these publication dates as recorded in Crossref). These data for the open citations are now made available in COCI.
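
To make the notion of a citation time span concrete, here is a small Python sketch. The exact procedure used for COCI is not described in this post, so the rule implemented here is an assumption: the span is computed at the coarsest precision shared by the two dates, day components are ignored, and the result is written as an ISO 8601 duration. The function name is hypothetical.

```python
def citation_timespan(citing_date, cited_date):
    """Return an ISO 8601 duration between two partial dates (YYYY or YYYY-MM[-DD])."""
    citing = [int(part) for part in citing_date.split("-")]
    cited = [int(part) for part in cited_date.split("-")]
    # Use only the precision both dates share; this sketch stops at months.
    precision = min(len(citing), len(cited), 2)
    months = (citing[0] - cited[0]) * 12
    if precision == 2:
        months += citing[1] - cited[1]
    # A cited work published after the citing one gives a negative span.
    sign = "-" if months < 0 else ""
    years, rest = divmod(abs(months), 12)
    if precision == 1:
        return f"{sign}P{years}Y"
    return f"{sign}P{years}Y{rest}M"
```

For instance, a citing work dated 2013 and a cited work dated 2012 give a span of one year, whatever additional month or day information one of the two dates may carry.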

Each citation is described as an individual of the class cito:Citation and is identified by a URL structured as follows:

https://w3id.org/oc/index/coci/ci/[[OCI]]

The parameter [[OCI]] refers to the numerical part of the Open Citation Identifier (OCI) assigned to the citation, i.e. two numbers separated by a dash, in which the first number identifies the citing work and the second number identifies the cited work. For instance:

https://w3id.org/oc/index/coci/ci/02001010806360107050663080702026306630509-02001010806360107050663080702026305630301

For citations extracted from Crossref in which the citing and cited works are identified by DOIs, which includes all the COCI citations, the OCI is created in the following manner:

  1. Each case-insensitive DOI is first normalized to lower case letters.
  2. Then, after omitting the initial “doi:10.” prefix, the alphanumeric string of the DOI is converted reversibly to a pure numerical string using the simple two-numeral lookup table for numerals, lower case letters and other characters presented at https://github.com/opencitations/oci/blob/master/lookup.csv.
  3. Finally, each converted numerical string is prefixed by “020”, which indicates that Crossref is the supplier of the original metadata of the citation (as indicated at http://opencitations.net/oci).
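
The three steps above can be sketched as follows. This is a minimal illustration rather than the OpenCitations implementation: the function names are hypothetical, and the lookup table is only a partial one inferred from the examples in this post (digits map to 00 to 09, lower-case letters to 10 to 35, “/” to 36 and “-” to 63). The complete table is the lookup.csv file linked in step 2.

```python
import string

# Partial lookup table inferred from the examples in this post (assumption);
# the full table lives in lookup.csv in the opencitations/oci repository.
LOOKUP = {ch: f"{i:02d}" for i, ch in enumerate(string.digits + string.ascii_lowercase)}
LOOKUP.update({"/": "36", "-": "63"})

CROSSREF_PREFIX = "020"  # supplier prefix for Crossref


def doi_to_numeric(doi):
    """Convert one DOI into the numerical half of a COCI identifier."""
    doi = doi.strip().lower()              # step 1: normalise to lower case
    if doi.startswith("doi:"):
        doi = doi[4:]
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    body = doi[3:]                         # step 2: omit the initial "10."
    try:
        digits = "".join(LOOKUP[ch] for ch in body)
    except KeyError as err:
        raise ValueError(f"character {err.args[0]!r} is not in the partial table") from None
    return CROSSREF_PREFIX + digits        # step 3: add the supplier prefix


def make_oci(citing_doi, cited_doi):
    """Build the OCI for a citation between two DOI-identified works."""
    return f"oci:{doi_to_numeric(citing_doi)}-{doi_to_numeric(cited_doi)}"
```

Applied to the DOIs 10.1186/1756-8722-6-59 and 10.1186/1756-8722-5-31, make_oci yields the identifier used in the example URL above. DOIs containing characters outside the partial table raise an error instead of producing a wrong identifier.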

Currently COCI contains 316,243,802 citations and 45,145,889 bibliographic resources. We plan to update COCI at least every six months as more open DOI-to-DOI citations appear in Crossref.

How to access the citation data in COCI

All the data in COCI are available for inspection, download and reuse in the following ways.

SPARQL endpoint

By querying the COCI SPARQL endpoint at https://w3id.org/oc/index/coci/sparql. If you access this URL with a browser, a GUI will be shown containing an editable text box in which the user can compose and execute a SPARQL query. In addition, the COCI SPARQL endpoint can be queried programmatically over HTTP, e.g. (via curl):

curl -L -H "Accept: text/csv" "https://w3id.org/oc/index/coci/sparql?query=PREFIX%20cito%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fspar%2Fcito%2F%3E%0ASELECT%20%3Fcitation%20%3Fcreation%20%7B%20%3Fcitation%20a%20cito%3ACitation%20%3B%20cito%3AhasCitationCreationDate%20%3Fcreation%20%7D%20LIMIT%201"

The above GET call executes the following simple SPARQL query:

PREFIX cito: <http://purl.org/spar/cito/>
SELECT ?citation ?creation { 
    ?citation a cito:Citation ; 
        cito:hasCitationCreationDate ?creation 
} 
LIMIT 1

This query returns the IRI of one citation accompanied by its creation date in CSV format, as follows:

citation,creation
https://w3id.org/oc/index/coci/ci/02001000002360105020963000103015801090909000259040238024003010138381018136310232701044203370037122439026315-02001000002361027293701070800030100060007,1999-02
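
The GET call above can also be assembled in Python. The sketch below only builds the request URL and parses a CSV response body; it does not contact the endpoint, and the function names are hypothetical.

```python
import csv
import io
from urllib.parse import quote

ENDPOINT = "https://w3id.org/oc/index/coci/sparql"

QUERY = (
    "PREFIX cito: <http://purl.org/spar/cito/>\n"
    "SELECT ?citation ?creation { ?citation a cito:Citation ; "
    "cito:hasCitationCreationDate ?creation } LIMIT 1"
)


def sparql_get_url(query):
    # Percent-encode every reserved character, as in the curl call.
    return f"{ENDPOINT}?query={quote(query, safe='')}"


def parse_csv_result(text):
    # Turn a text/csv response body into a list of dicts keyed by variable name.
    return list(csv.DictReader(io.StringIO(text)))
```

Here sparql_get_url(QUERY) produces the URL passed to curl above, and parse_csv_result turns a response such as the two CSV lines shown into one dictionary with the keys “citation” and “creation”.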

Instead, the following SPARQL query should be used to get the information about a particular citation given its OCI:

PREFIX oci: <https://w3id.org/oc/index/coci/ci/>
PREFIX cito: <http://purl.org/spar/cito/>
SELECT DISTINCT ?citing ?cited ?creation ?timespan {
oci:02001010806360107050663080702026306630509-02001010806360107050663080702026305630301 a cito:Citation ;
    cito:hasCitingEntity ?citing ;
    cito:hasCitedEntity ?cited ;
    cito:hasCitationCreationDate ?creation ;
    cito:hasCitationTimeSpan ?timespan
}

In this case, the query will return the DOI URLs of the citing and cited entities, accompanied by the creation date and the timespan of the citation:

citing,cited,creation,timespan
http://dx.doi.org/10.1186/1756-8722-6-59,http://dx.doi.org/10.1186/1756-8722-5-31,2013,P1Y
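
To run the same lookup for an arbitrary citation, the query text can be generated from the numerical part of the OCI. The Python sketch below is illustrative and its names are hypothetical; it writes the citation as a full IRI instead of using the oci: prefix, which is equivalent in SPARQL.

```python
import re

# The numerical part of an OCI: two numbers separated by a dash.
OCI_NUMERIC = re.compile(r"^[0-9]+-[0-9]+$")


def citation_query(oci_numeric):
    """Build the SPARQL query describing one COCI citation."""
    if not OCI_NUMERIC.match(oci_numeric):
        raise ValueError(f"not the numerical part of an OCI: {oci_numeric!r}")
    return (
        "PREFIX cito: <http://purl.org/spar/cito/>\n"
        "SELECT DISTINCT ?citing ?cited ?creation ?timespan {\n"
        f"    <https://w3id.org/oc/index/coci/ci/{oci_numeric}> a cito:Citation ;\n"
        "        cito:hasCitingEntity ?citing ;\n"
        "        cito:hasCitedEntity ?cited ;\n"
        "        cito:hasCitationCreationDate ?creation ;\n"
        "        cito:hasCitationTimeSpan ?timespan\n"
        "}"
    )
```

Validating the identifier before interpolating it keeps malformed input out of the query text.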

It is worth mentioning that the results can also be returned in JSON (using “Accept: application/json” in the header of the request) or XML (using “Accept: application/xml” in the header of the request). For example, opening the URL of the above curl request (the long string starting with “https”) in a browser will return an XML document describing the same result.

REST API

Citation information may also be retrieved by using the COCI REST API, available and documented at https://w3id.org/oc/index/coci/api/v1, which has been implemented by means of RAMOSE (the Restful API Manager Over SPARQL Endpoints). Specifically, the COCI REST API makes available a mechanism for getting the citation data:

If you would like to suggest an additional operation to be included in this API, please use the issue tracker of the COCI API available on GitHub.

Search and browsing interfaces

An interface providing a free text search over the contents of COCI is available at http://opencitations.net/index/coci/search. It allows one to search for citation data according to the same operational principles implemented in the REST API discussed above. However, in this case, the results are returned in tabular form through a web interface implemented by means of OSCAR, the OpenCitations RDF Search Application, e.g. http://opencitations.net/index/coci/search?text=10.1186%2F1756-8722-6-59&rule=citingdoi.
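
A search link of this kind can be generated for any DOI. The short Python sketch below is illustrative: the function name is hypothetical, and “citingdoi” is the only rule value attested in this post, so any other value would be an assumption.

```python
from urllib.parse import urlencode

SEARCH_BASE = "http://opencitations.net/index/coci/search"


def search_url(text, rule="citingdoi"):
    # urlencode percent-encodes the "/" of the DOI, as in the example link.
    return f"{SEARCH_BASE}?{urlencode({'text': text, 'rule': rule})}"
```

With the DOI 10.1186/1756-8722-6-59, search_url returns the example link given above.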

Each OCI returned by the search interface is a clickable link that opens a new descriptive page detailing that citation. These web pages are created by LUCINDA, which is a Javascript-based RDF data browser developed for exposing the statements contained in an RDF triplestore as descriptive human-readable HTML pages, e.g. http://opencitations.net/index/coci/browser/ci/02001010806360107050663080702026306630509-02001000002361222131237020000060000020201.

Data dumps

The dump of all the citation data available in COCI, including their provenance information, is downloadable from Figshare. These data are available in CSV and N-Triples formats, and each dump has a DOI assigned so as to be citable. Download links are available at http://opencitations.net/download#coci. A new dump will be made each time COCI is updated.

By content negotiation

Using the HTTP URI of the individual citations, it is possible to access their representations in different formats: HTML, RDF/XML, Turtle, and JSON-LD. This is possible through a content negotiation mechanism that disambiguates in which format the information about a citation should be returned, by looking at the “Accept” header declared in the request. For instance, accessing the URL https://w3id.org/oc/index/coci/ci/02001010806360107050663080702026306630509-02001000002361222131237020000060000020201 will return the citation data in HTML, while the GET request below will return the same information in Turtle:

curl -L -H "Accept: text/turtle" "https://w3id.org/oc/index/coci/ci/02001010806360107050663080702026306630509-0200100000236122213123702000"
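
The same negotiation can be prepared in Python. In the sketch below, the function name and the format keys are hypothetical; “text/turtle” comes from the curl call above, while the other media types are the standard ones for HTML, RDF/XML and JSON-LD and are assumed to be the ones the service recognises. The request object is only built, not sent.

```python
import urllib.request

CITATION_BASE = "https://w3id.org/oc/index/coci/ci/"

# Standard media types for the four formats listed above (assumed to be
# the ones honoured by the content negotiation mechanism).
MEDIA_TYPES = {
    "html": "text/html",
    "rdfxml": "application/rdf+xml",
    "turtle": "text/turtle",
    "jsonld": "application/ld+json",
}


def citation_request(oci_numeric, fmt="turtle"):
    """Prepare a GET request for one citation in the chosen format."""
    return urllib.request.Request(
        CITATION_BASE + oci_numeric,
        headers={"Accept": MEDIA_TYPES[fmt]},
    )
```

Passing the returned object to urllib.request.urlopen would send the request and follow the w3id.org redirects, much as the -L option does for curl.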

Conclusions

In this blog post we have introduced COCI, the OpenCitations Index of Crossref open DOI-to-DOI references, a citation index which contains more than 310 million open citations created from the ‘Open’ references included in Crossref. We plan soon to extend COCI to additionally include those DOI-to-DOI citations extracted from the ‘Limited’ set of Crossref reference data.

References

  1. Silvio Peroni and David Shotton (2018). Open Citation: Definition. figshare. DOI: https://doi.org/10.6084/m9.figshare.6683855
  2. Silvio Peroni and David Shotton (2018). The OpenCitations Data Model. figshare. DOI: https://doi.org/10.6084/m9.figshare.3443876

Footnotes

  1. At Crossref’s request, we have changed the originally proposed description of COCI from “the Crossref Open Citation Index (COCI)” to “COCI, the OpenCitations Index of Crossref open DOI-to-DOI references”, to make clear that COCI is an OpenCitations index and to avoid any implication that COCI is a Crossref service. We apologize for the initial ambiguity of our original wording and any confusion this may have caused.

Early adopters of the OpenCitations Data Model

OpenCitations is very pleased to announce its collaboration with four new scholarly Research and Development projects that are early adopters of the recently updated OpenCitations Data Model, described in this blog post.

The four projects are similar, in that they each are independently using text mining and optical character recognition or PDF extraction techniques to extract citation information from the reference lists of published works, and are making these citations available as Linked Open Data. Three of the four will also use the OpenCitations Corpus as publication platform for their citation data.  The academic disciplines from which these citation data are being extracted are social science, humanities and economics.

1     Linked Open Citation Database (LOC-DB)

The Linked Open Citation Database, with partners in Mannheim, Stuttgart, Kiel, and Kaiserslautern (LOC-DB, https://locdb.bib.uni-mannheim.de/blog/en/), is the first of two German projects funded by the Deutsche Forschungsgemeinschaft (DFG) that are extracting citations from Social Science publications.  Dr. Annette Klein, Deputy Director of the Mannheim University Library, is the project manager.

The project is using approaches based on deep neural networks for reference detection, together with state-of-the-art methods for information extraction and semantic labelling of reference lists from electronic and print media with arbitrary layouts [3]. The raw data obtained will be manually checked against and linked with existing bibliographic metadata sources in an editorial system. They will then be structured in RDF using the OpenCitations Data Model, and published in the Linked Open Citations Database under a CC0 waiver. Using its libraries’ own Social Science print holdings and licensed electronic journals as subject material, this project will demonstrate how these citation extraction processes can be applied to the holdings of individual academic libraries, and can be integrated with library catalogues [1, 2, 3].

References

[1]       Kai Eckert, Anne Lauscher and Akansha Bhardwaj (2017) LOC-DB: A Linked Open Citation Database provided by Libraries. Motivation and Challenges.  EXCITE Workshop 2017: “Challenges in Extracting and Managing References”.  https://locdb.bib.uni-mannheim.de/wordpress/wp-content/uploads/2016/11/LOC-DB@EXCITE.pdf

[2]      Lauscher, Anne; Eckert, Kai; Galke, Lukas; Scherp, Ansgar; Rizvi, Syed Tahseen Raza; Ahmed, Sheraz; Dengel, Andreas; Zumstein, Philipp; Klein, Annette  (2018) Linked Open Citation Database: Enabling libraries to contribute to an open and interconnected citation graph. Accepted for the JCDL 2018: Joint Conference on Digital Libraries 2018, June 3-6, 2018 in Fort Worth, Texas [Preprint of the conference publication].
https://locdb.bib.uni-mannheim.de/wordpress/wp-content/uploads/2018/04/LOCDB-JCDL2018-paper-camera-ready.pdf

[3]       Bhardwaj A., Mercier D., Dengel A., Ahmed S. (2017). DeepBIBX: deep learning for image based bibliographic data extraction. In: Liu D., Xie S., Li Y., Zhao D., El-Alfy ES. (eds) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science, vol 10635. Springer, Cham [Conference publication].

2     The EXCITE (Extraction of Citations from PDF Documents) Project

The EXCITE Project (http://west.uni-koblenz.de/en/research/excite/), run jointly at the University of Koblenz-Landau and GESIS (Leibniz Institute for Social Sciences), is the second project funded by the Deutsche Forschungsgemeinschaft (DFG) that is extracting citations from Social Science publications.  It is headed by Steffen Staab, head of the Institute for Web Science and Technologies at the University of Koblenz-Landau, and Philipp Mayr of GESIS.

Since the social sciences are given only marginal coverage in the main bibliographic databases, this project aims to make more citation data available to researchers, with a particular focus on the German language social sciences. It has developed a set of algorithms for the extraction of reference information from PDF documents and for matching the reference entry strings thus obtained against bibliographic databases (see EXCITE git https://github.com/exciteproject/). It is using as its data sources the following Social Science collections: full texts from SSOAR, the GESIS Social Science Open Access Repository (https://www.gesis.org/ssoar/home/) and scattered PDF stocks from other social science collections including SOLIS, Springer Online Journals and CSA Sociological Abstracts [4, 5].

The EXCITE project organized an international developer and researcher workshop “Challenges in Extracting and Managing References” in March 2017 in Cologne. http://west.uni-koblenz.de/en/research/excite/workshop-2017

EXCITE will then structure the extracted bibliographic and citation data in RDF using the OpenCitations Data Model, and will use the OpenCitations Corpus as its publication platform, employing the OCC EXCITE supplier prefix 0110, described here, to identify the provenance of these citations.

References

[4]       Martin Körner (2016). Extraction from social science research papers using conditional random fields and distant supervision, Master’s Thesis, University of Koblenz-Landau, 2016.

[5]       Körner, M., Ghavimi, B., Mayr, P., Hartmann, H., & Staab, S. (2017). Evaluating reference string extraction using line-based conditional random fields: a case study with German language publications. In M. Kirikova, K. Nørvåg, G. A. Papadopoulos, J. Gamper, R. Wrembel, J. Darmont, & S. Rizzi (Eds.), New Trends in Databases and Information Systems (Vol. 767, pp. 137–145). Springer International Publishing. https://doi.org/10.1007/978-3-319-67162-8_15   Preprint: https://philippmayr.github.io/papers/Koerner-et-al2017.pdf

3    The Venice Scholar Index

The Venice Scholar Index is a citation index of literature on the history of Venice, indexing nearly 3000 volumes of scholarship from the mid 19th century to 2013, from which some 4 million bibliographic references have been extracted.

The Venice Scholar Index is the first prototype resulting from the Linked Books Project (https://dhlab.epfl.ch/page-127959-en.html), a project spearheaded by Giovanni Colavizza and Matteo Romanello of the Digital Humanities Laboratory at EPFL (École Polytechnique Fédérale de Lausanne), with partners in Venice, Milan and Rome.

The project is exploring the history of Venice through references to scholarly literature as well as archival documents found within publications.  To achieve this goal, the project has developed a system to automatically extract bibliographic references found within a large set of digitized books and journals, which has then been applied to the publications on the history of Venice, its main use case [6].

The Linked Books Project is specifically interested in analysing the interplay between citations to primary (e.g. archival) documents and those to secondary sources (scholarly literature), and the citation profiles of publications through time.  To this end, it developed the Venice Scholar Index, a rich search interface to navigate through the resulting network of citations, with the final aim of interlinking digital archives and digital libraries.

The citation data underlying the Venice Scholar Index are modelled using the OpenCitations Data Model, and will use the OpenCitations Corpus as their publication platform, using the OCC Venice Scholar Index supplier prefix 0120 to identify the provenance of these citations.

Reference

[6] Giovanni Colavizza, Matteo Romanello, and Frédéric Kaplan (2017). The references of references: a method to enrich humanities library catalogs with citation data. In International Journal on Digital Libraries 18 (March 8, 2017): 1–11. https://doi.org/10.1007/s00799-017-0210-1.

4    CitEcCyr – Citations in Economics published in Cyrillic

CitEcCyr (https://github.com/citeccyr/CitEcCyr) is an open repository of citation relationships obtained from research papers in the Russian language and Cyrillic script from Socionet (https://socionet.ru/) and RePEc (http://repec.org/) [7, 8]. The CitEcCyr project is headed by Oxana Medvedeva, is technically led by Sergey Parinov, and is funded by RANEPA (http://www.ranepa.ru/eng/), the Russian Presidential Academy of National Economy and Public Administration. CitEcCyr is also developing a suite of open software for the citation content analysis of these papers. This project intends to model its citations using the OpenCitations Data Model, and will use the OpenCitations Corpus as its publication platform, using the OCC CitEcCyr supplier prefix 0140 to identify the provenance of these citations.

However, since this is the first project from which OpenCitations will be importing bibliographic metadata and citations in a language other than English and in a script other than the Latin script, we at OpenCitations are going to have to crawl out of our comfortable ‘Western’ shells and learn to handle foreign languages and scripts other than Latin scripts.

For Russian language papers, we at OpenCitations will need to decide how best to handle three forms of text: Russian written in Cyrillic script, Cyrillic script transliterated into Latin script, and Russian translated into English and rendered in Latin script. In particular, since in the OpenCitations Corpus our reference entry records are the uncorrected literal texts of the references in the reference lists of the citing papers, these will need to be recorded as given in Cyrillic.

We will need to develop a policy for when to provide Latin script translations of (for example) titles and abstracts, if these are not provided by the data supplier.  To facilitate use of the OpenCitations Corpus by Russian scholars, we will also need to modify the OpenCitations web site, so as to render the static information displayed in the web pages in the language and script appropriate to the language setting on the user’s web browser.

Unfortunately, all this will take time, so we do not anticipate publishing citation data from the CitEcCyr project within OCC any time soon.  However, this collaboration will be of tremendous value to OpenCitations as well as to CitEcCyr, since the lessons learned by our collaboration with the CitEcCyr project will enable the OpenCitations Corpus to handle citation data not just in Russian, but also in Arabic, Chinese, Japanese and other languages where the Latin script is not used, something that is not found in other major bibliographic databases.

Watch this space!

References

[7]       Jose Manuel Barrueco, Thomas Krichel, Sergey Parinov, Victor Lyapunov, Oxana Medvedeva and Varvara Sergeeva (2017).  Towards open data for the citation content analysis.    https://arxiv.org/abs/1710.00302

[8]       Thomas Krichel (2017). CitEc to CitEcCyr – A stab at distributed citation systems.  Presented at the 2017 EXCITE workshop. http://west.uni-koblenz.de/sites/default/files/research/projects/excite/workshop-2017/slides/excite-workshop-2017_krichel_citec-to-citeccyr.pdf


Citations as First-Class Data Entities: The Open Citation Identifier Resolution Service

Requirements for citations to be treated as first-class data entities

In my introductory blog post, I listed five requirements for the treatment of citations as first-class data entities.  The fifth and final of these requirements is that there must be a Web-based identifier resolution service that takes the citation identifier as input and returns a description of the citation.

At the recent PIDapalooza Conference on persistent identifiers, held in Gerona, Spain, I described the Open Citation Identifier Resolution Service, the new resolution service for Open Citation Identifiers created and operated by OpenCitations [1].

In this post, I describe this Open Citation Identifier Resolution Service, which supports the resolution of Open Citation Identifiers not only of the citations documented in the OpenCitations Corpus (OCC), but also of open citations recorded in other bibliographic databases.

What is the Open Citation Identifier Resolution Service?

The Open Citation Identifier Resolution Service runs on the OpenCitations server, presenting itself to the user as a web page with the URI http://opencitations.net/oci.

When a user enters a valid OCI and clicks the “Look up citation” button, this activates the resolution service, which, after a brief delay, returns information about the citation itself and about the citing and cited bibliographic resources, as shown in the following screen image (which for clarity omits the provenance data associated with this citation).

This information can optionally be returned to the user in a variety of other formats: RDF/XML, Turtle or JSON-LD.

Clicking on the links provided will return additional metadata held by the OpenCitations Corpus for the citing and the cited documents.  In the near future, this service will be integrated with LUCINDA, the forthcoming OCC browse interface, to present this information in a more user-friendly fashion.

Using the Resolution Service with citations in an external resource via a SPARQL endpoint

The Open Citation Identifier Resolution Service currently works for citations between bibliographic resources both within the OpenCitations Corpus and within external bibliographic databases, provided that the external service uses bibliographic resource identifiers having a unique numerical part, and provides a SPARQL endpoint that makes available information about bibliographic resources and the references they contain.

It can therefore resolve OCIs identifying citations within Wikidata, such as oci:01027931310-01022252312, where, as explained in the previous blog post, “010” is the assigned OCC supplier prefix for Wikidata.

Entering this OCI in the Open Citation Identifier Resolution Service pulls live data from the Wikidata SPARQL endpoint and returns the following information about that citation, as shown in the following screen image (which, again, omits for clarity the provenance data associated with that citation):

Clicking on the links provided here returns information about the relevant Wikidata entities.

Citing paper:

Cited paper:

How the Resolution Service works

The bibliographic database supplying the metadata for a particular citation identified by an OCI is specified by the assigned OCC supplier prefix that forms part of the OCI, as described in the previous blog post. Each OCI is thus specific for and unique within a particular bibliographic database.

The resolution service takes the OCI entered into the search box, recognises the supplier prefix specifying the bibliographic database holding the citation information, parses the OCI into the database identifiers for the citing and cited entities, and then sends an appropriate SPARQL query to interrogate the SPARQL endpoint of the relevant database. When that database has returned information about the citation itself and about the citing and cited bibliographic resources, this is displayed to the user as shown in screen images above – or in other RDF formats (Turtle, JSON-LD, RDF/XML) according to the request.

It is important to realize that no other databases are contacted during this resolution process, and that the quality and accuracy of the metadata retrieved by the Open Citation Identifier Resolution Service is the responsibility of the database hosting that citation.  The OCI Resolution Service does no more than retrieve this information, and does nothing to address possible errors or omissions in the metadata coming from the hosting database.
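The parsing and query-building steps described above can be sketched in Python for the Wikidata case. This is only an illustration under stated assumptions: the function names are hypothetical, and the query actually sent by the resolution service is not given in this post. The sketch uses Wikidata's "cites work" property (P2860) in a simple ASK query, and it builds the query string without sending it.

```python
WIKIDATA_PREFIX = "010"  # OCC supplier prefix for Wikidata, as stated in the post


def parse_wikidata_oci(oci):
    """Split a Wikidata OCI into (citing, cited) Q-identifiers."""
    if not oci.startswith("oci:"):
        raise ValueError("an OCI must start with 'oci:'")
    citing, dash, cited = oci[len("oci:"):].partition("-")
    if not dash or not citing.isdigit() or not cited.isdigit():
        raise ValueError("expected the form oci:number-number")
    qids = []
    for part in (citing, cited):
        if not part.startswith(WIKIDATA_PREFIX):
            raise ValueError("not a Wikidata OCI (supplier prefix 010 expected)")
        # Removing the supplier prefix leaves the numerical part of the Q-identifier.
        qids.append("Q" + part[len(WIKIDATA_PREFIX):])
    return qids[0], qids[1]


def build_ask_query(citing_qid, cited_qid):
    """Build an illustrative SPARQL ASK query (an assumption, not the service's own query)."""
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        f"ASK {{ wd:{citing_qid} wdt:P2860 wd:{cited_qid} }}"
    )


citing, cited = parse_wikidata_oci("oci:01027931310-01022252312")
print(build_ask_query(citing, cited))
```

A real resolver would also retrieve metadata for the two resources and would dispatch on other supplier prefixes; both are omitted here for brevity.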

Using the Resolution Service with external citations via a REST API

While the resolution service presently works only to retrieve information from bibliographic databases having a SPARQL endpoint, we plan soon to extend this resolution service to work with information supplied by a bibliographic database via a REST API.

Coupled with the ability to create OCIs by numerical conversions of Digital Object Identifiers (DOIs), as explained in the previous blog post, the Open Citation Identifier Resolution Service could then be used to pull metadata live from the Crossref REST API for any of the ~350 million Crossref open references in which the cited paper as well as the citing paper has a DOI, and for which an OCI can thus be created.

Watch this space!

References

[1]     David Shotton (2018). Citations as first-class data entities. Open Citation Identifiers. Conference presentation. PIDapalooza 2018, Girona, 23-24 January 2018. https://doi.org/10.6084/m9.figshare.5844972


Citations as First-Class Data Entities: Open Citation Identifiers

Requirements for citations to be treated as First-Class Data Entities

In my introductory blog post, I listed five requirements for the treatment of citations as first-class data entities.  The fourth of these requirements is that they must be identifiable using a global persistent identifier scheme.

At the recent PIDapalooza Conference on persistent identifiers, held in Girona, Spain, I launched the Open Citation Identifier (abbreviated OCI, in line with DOI), the new persistent identifier for citations [1].

In this post, I describe the Open Citation Identifier scheme, created and operated by OpenCitations, which supports the assignment of Open Citation Identifiers not only to the citations present in the OpenCitations Corpus (OCC) but also to open citations present in other bibliographic databases.

Structure and syntax of the Open Citation Identifier

Each OCI has a simple structure: oci:number-number, where “oci:” is the identifier prefix.

OCIs for citations stored within the OpenCitations Corpus are constructed by combining the OpenCitations Corpus local identifiers for the citing and cited bibliographic resources, separating them with a dash. (For the definition of OCC local identifiers, see the OpenCitations Data Model.)

For example, oci:2544384-7295288 is a valid OCI for the citation between two papers stored within the OpenCitations Corpus, the first number being the OCC local identifier for the citing bibliographic resource [2], and the second being the OCC local identifier for the cited bibliographic resource [3], these bibliographic resource local identifiers being unique within the OCC. [Note: Supplier prefixes are omitted from OCC local identifiers of bibliographic resources ingested into the OpenCitations Corpus prior to February 2018, but will be included within all OCC local identifiers of bibliographic resources ingested into the Corpus after that date.]

OCIs for external resources identified by numerical identifiers

OCIs can also be created for citations between bibliographic resources described in an external bibliographic database, if they are similarly identified there by identifiers having a unique numerical part. For example, the OCI for the citation that exists between Wikidata resources Q27931310 (the citing resource, [4]) and Q22252312 (the cited resource, [5]) is oci:01027931310-01022252312, where “010” is the assigned OCC supplier prefix for Wikidata.

The OCC supplier prefix consists of a positive number (following the pattern “nnn”, where “nnn” is a string of numerals of variable length which includes no zeros), enclosed between two zeros (e.g. “0420”). The list of all assigned OCC supplier prefixes is given at https://github.com/opencitations/oci/blob/master/suppliers.csv.
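The prefix pattern just described can be expressed as a regular expression. The following is a minimal Python sketch, not code from OpenCitations: the function name is hypothetical, and the dictionary lists only the two prefixes named in these posts (the authoritative list is suppliers.csv).

```python
import re

# Pattern from the post: a zero, one or more non-zero digits, then a zero.
SUPPLIER_PREFIX = re.compile(r"0[1-9]+0")

# Only the prefixes mentioned in these posts; see suppliers.csv for the full list.
KNOWN_SUPPLIERS = {"010": "Wikidata", "020": "Crossref"}


def split_supplier_prefix(identifier):
    """Return (prefix, local_part); prefix is None when none is present."""
    match = SUPPLIER_PREFIX.match(identifier)
    if match is None:
        # E.g. OCC local identifiers ingested before February 2018.
        return None, identifier
    return match.group(0), identifier[match.end():]


print(split_supplier_prefix("01027931310"))
print(split_supplier_prefix("2544384"))
```

Because the digits between the two zeros may not themselves contain a zero, the first zero that follows them unambiguously ends the prefix, which is what lets a variable-length prefix be recognised without a separator.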

OCIs for citations between resources identified by DOIs

OCIs can also be created for citations between bibliographic resources described in external bibliographic databases such as Crossref or DataCite, where they are identified by alphanumeric Digital Object Identifiers (DOIs) rather than purely numerical strings.

To achieve this, each case-insensitive DOI is first normalized to lower case letters. Then, after omitting the initial “doi:10.” prefix, the alphanumeric string of the DOI is converted reversibly to a pure numerical string using the simple two-numeral lookup table for numerals, lower case letters and other characters presented at https://github.com/opencitations/oci/blob/master/lookup.csv. For example, using this lookup table, “1” becomes “01”, “2” becomes “02”, “a” becomes “10”, “b” becomes “11”, and “/” becomes “36”. The appropriate OCC supplier prefix is then prepended to the resulting number, to clearly identify its provenance.

A citation documented in Crossref exists between the two publications [3] and [6], which are there identified by the DOIs doi:10.1108/jd-12-2013-0166 and doi:10.1371/journal.pcbi.1000361.  We can thus create an OCI for this Crossref citation by using numerical representations of the two DOIs. These numerical representations are:

0200101000836191363010263020001036300010606

and

02001030701361924302723102137251211183701000000030601

where the initial “020” in each case is the assigned OCC supplier prefix for Crossref.

From these two numerical representations of DOIs, the OCI for the Crossref citation between these two papers is easily constructed, and is:

oci:0200101000836191363010263020001036300010606-02001030701361924302723102137251211183701000000030601

While this is long for an identifier, it should be remembered that it will be processed computationally, and is not intended for human readability.

In this way, Crossref OCIs can be assigned to all ~350 million open references within Crossref in which the cited paper as well as the citing paper has a DOI [7].
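The DOI-to-number conversion described above can be sketched in Python as follows. This is an illustration rather than the OpenCitations implementation: the function names are hypothetical, and the lookup table is only partial. The codes for numerals and lower case letters follow the examples in the text, while the codes for “.” (37) and “-” (63) are inferred from the worked example above; any other character needs the full lookup.csv.

```python
import string

CROSSREF_PREFIX = "020"  # OCC supplier prefix for Crossref, as stated in the post

# Partial two-numeral lookup table (see lookup.csv for the complete one).
LOOKUP = {d: "0" + d for d in string.digits}                       # "0" -> "00" ... "9" -> "09"
LOOKUP.update({c: str(10 + i) for i, c in enumerate(string.ascii_lowercase)})  # "a" -> "10" ... "z" -> "35"
LOOKUP.update({"/": "36", ".": "37", "-": "63"})                   # "." and "-" inferred from the example
REVERSE = {code: char for char, code in LOOKUP.items()}


def doi_to_numeric(doi, supplier_prefix=CROSSREF_PREFIX):
    """Convert a DOI to its prefixed numerical representation."""
    body = doi.lower()
    if body.startswith("doi:"):
        body = body[len("doi:"):]
    if not body.startswith("10."):
        raise ValueError("a DOI must start with '10.'")
    body = body[len("10."):]
    unknown = [ch for ch in body if ch not in LOOKUP]
    if unknown:
        raise ValueError(f"characters not in the partial table: {unknown}")
    return supplier_prefix + "".join(LOOKUP[ch] for ch in body)


def numeric_to_doi(numeric, supplier_prefix=CROSSREF_PREFIX):
    """Reverse the conversion, reading the digits two at a time."""
    digits = numeric[len(supplier_prefix):]
    pairs = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    return "10." + "".join(REVERSE[pair] for pair in pairs)


def make_oci(citing_doi, cited_doi):
    """Build the OCI for a citation between two DOI-identified resources."""
    return f"oci:{doi_to_numeric(citing_doi)}-{doi_to_numeric(cited_doi)}"


print(make_oci("doi:10.1108/jd-12-2013-0166", "doi:10.1371/journal.pcbi.1000361"))
```

Applied to the two DOIs above, this reproduces the OCI given in the text, and numeric_to_doi recovers the original DOIs, illustrating that the conversion is reversible.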

OCIs for the same citation recorded within different databases

If a citation is recorded in more than one bibliographic database, a separate OCI can be created for each instance, each OCI having a distinct supplier prefix and being specific to that database.

Thus, in addition to the Crossref OCI created from DOIs and described above for the citation from [3] to [6], a Wikidata OCI exists for the same citation recorded within Wikidata, having the form oci:01024260641-01021092566.

Upon resolution of an OCI, the Open Citation Identifier Resolution Service will pull metadata only from the database specified by the supplier prefix of the OCI.  Details of the Open Citation Identifier Resolution Service are given in the next blog post.

It is important to note that an OCI can only be used to specify a citation between a citing and a cited publication which is actually recorded within a bibliographic database.  For this reason, the OCI “oci:7295288-3962641” shown below the second diagram in the introductory blog post to this series is presently invalid.  While the OpenCitations Corpus has metadata describing both bibliographic resources [3] and [6], it has not yet ingested the reference list for the first bibliographic resource [3] (which has the OCC local identifier 7295288), having information about it only from a reference within a third paper, with no information about the references [3] itself contains.  As a result, at present OCC has no record that a citation actually exists between [3] and the second bibliographic resource [6] (which has the OCC local identifier 3962641).

Representing OCIs in RDF

To permit the description of OCIs in RDF, “oci” has been added as a new member of the class datacite:ResourceIdentifierScheme within the DataCite Ontology.

The resolvable URL for any citation identified by an OCI has the form “https://w3id.org/oc/virtual/ci/nnn-mmm”, where nnn-mmm represents the OCI with its “oci:” prefix removed. Currently, we are able to return the RDF description of all the citations contained in the OpenCitations Corpus and Wikidata. We are working to extend the coverage so as to include other datasets, e.g. Crossref.
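As a small illustration of the URL form just given, the following Python sketch turns an OCI into its resolvable URL. The function names are hypothetical, and the use of an HTTP Accept header to ask for a particular RDF serialisation is an assumption, not something this post states about the server. The request object is only constructed here, not sent.

```python
import urllib.request

BASE_URL = "https://w3id.org/oc/virtual/ci/"  # URL form given in the post


def oci_to_url(oci):
    """Return the resolvable URL for an OCI by removing its 'oci:' prefix."""
    if not oci.startswith("oci:"):
        raise ValueError("an OCI must start with 'oci:'")
    return BASE_URL + oci[len("oci:"):]


def build_request(oci, media_type="text/turtle"):
    """Prepare (but do not send) a request; content negotiation is assumed."""
    return urllib.request.Request(oci_to_url(oci), headers={"Accept": media_type})


print(oci_to_url("oci:01027931310-01022252312"))
```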

References

[1]     David Shotton (2018). Citations as first-class data entities. Open Citation Identifiers. Conference presentation. PIDapalooza 2018, Girona, 23-24 January 2018. https://doi.org/10.6084/m9.figshare.5844972

[2]     Armen Yuri Gasparyan, Marlen Yessirkepov et al. (2015). Preserving the integrity of citations and references by all stakeholders of science communication.  J. Korean Med. Sci. 30:1545-1552. (English.)  https://doi.org/10.3346/jkms.2015.30.11.1545

[3]     Silvio Peroni, Alexander Dutton, Tanya Gray and David Shotton (2015). Setting our bibliographic references free: towards open citation data. Journal of Documentation, 71 (2): 253-277.  https://doi.org/10.1108/jd-12-2013-0166

[4]     Daniel K. Bricker, Eric B. Taylor et al. (2012). A Mitochondrial Pyruvate Carrier Required for Pyruvate Uptake in Yeast, Drosophila, and Humans. Science 337: 96-100. https://doi.org/10.1126/science.1218099

[5]     Douglas Hanahan and Robert A. Weinberg (2011). Hallmarks of cancer: the next generation.  Cell 144: 646–674.  https://doi.org/10.1016/j.cell.2011.02.013

[6]     David Shotton, Katie Portwin, Graham Klyne and Alistair Miles (2009). Adventures in semantic publishing: exemplar semantic enhancement of a research article. PLoS Computational Biology 5: e1000361. https://doi.org/10.1371/journal.pcbi.1000361

[7]     Daniel Ecer (2017). Crossref Data Notebook (updated). Available at https://elifesci.org/crossref-data-notebook

 


Citations as First-Class Data Entities: The OpenCitations Corpus

Requirements for citations to be treated as First-Class Data Entities

In my introductory blog post, I listed five requirements for the treatment of citations as first-class data entities.  The third of these requirements is that they must be storable, searchable and retrievable in an open database designed for bibliographic citations.

In this post, I describe the current status of the OpenCitations Corpus, a well-structured open database specifically developed by OpenCitations and designed to store information about bibliographic citations as Linked Open Data, encoded in RDF (specifically JSON-LD).

What is OpenCitations?

OpenCitations (http://opencitations.net) is a scholarly infrastructure organization that has created and is currently expanding the coverage of the OpenCitations Corpus (OCC), an open repository of scholarly citation data made available under a Creative Commons CC0 public domain dedication, which provides in RDF accurate citation information (bibliographic references) harvested from the scholarly literature.

The Co-Directors of OpenCitations are David Shotton, Oxford e-Research Centre, University of Oxford (david.shotton@opencitations.net) and Silvio Peroni, Department of Computer Science and Engineering, University of Bologna (silvio.peroni@opencitations.net).

We are committed to open scholarship, open data, open access publication, and open source software.  We espouse the FAIR data principles developed by Force11, of which David Shotton was a founding member, and the aim of the Initiative for OpenCitations (I4OC), of which David Shotton and Silvio Peroni were both founding members, to promote the availability of citation data that is structured, separable, and open.

The principal activity of OpenCitations to date has been the establishment and population of the OpenCitations Corpus.

Holdings of the OpenCitations Corpus

We have so far concentrated on ingesting into the OpenCitations Corpus bibliographic references from open access papers available at PubMed Central, the encoding of these data in RDF, and high-quality curation of the citation links they represent, involving metadata enrichment from the Crossref API and (for authors) the ORCID API.

To date (19th February 2018), the OCC has ingested the references from 302,758 citing bibliographic resources, and contains information about 12,830,347 citation links to 6,549,665 cited resources. Plans to expand the coverage of the OCC are outlined below.

User interfaces

The information within the OCC can be accessed via OSCAR, our new generic OpenCitations RDF Search Application (http://opencitations.net/search) [1], which can be used for textual searches over any triplestore presenting a SPARQL endpoint. Users can employ OSCAR to search the OCC for publication titles, author names, publication years, and identifiers (DOIs, PubMed IDs, PubMed Central IDs, ORCIDs, and OCC corpus identifiers). Such a search returns details of all bibliographic resources within the OCC matching the search term, from which their references can be obtained, if known. In the near future, we will complement OSCAR with a browse interface named LUCINDA.

We also provide a SPARQL endpoint for directly querying the Blazegraph triplestore in which we store the OCC RDF, and we plan in the near future to supplement such programmatic access with a REST API.  In addition, the contents of the entire triplestore, and of the various sub-databases within the Corpus, together with their provenance information, are downloadable from Figshare as monthly dumps.  Once the REST API has been developed, we will turn our attention to developing user interfaces for the interactive visualization of citation graphs.

The OpenCitations Data Model

As described in the previous blog post, we have just completed a comprehensive revision of the OpenCitations Data Model (OCDM, available at https://doi.org/10.6084/m9.figshare.3443876), which we use to capture descriptions of all aspects of the OCC citations and their provenance. This model makes extensive use of our SPAR (Semantic Publishing and Referencing) Ontologies (http://www.sparontologies.net/), which we developed to describe all aspects of the scholarly publishing domain in RDF.

The OpenCitations Data Model is freely available for third parties to use when recording their own bibliographic and citation information in RDF, with the advantage that data so modelled will be immediately compatible with those within the OpenCitations Corpus, which can act as a publishing venue for such third-party data.

Future ingest rate and data sources

Since July 2016, the instantiation of the OpenCitations Corpus currently running at the University of Bologna has been ingesting reference lists from biomedical journal articles at the relatively slow rate of about 200,000 citing bibliographic resources per year. During February 2018, ingestion into the Corpus is suspended, while we move the system to a completely new and more powerful server, supplemented by thirty Raspberry Pi ingest engines that will work in parallel feeding ingested data to the server.

This will increase our ingestion rate ~30-fold to about six million citing bibliographic resources per year, equivalent to ~240 million citations per year at 40 references per paper (the current OCC value is 42.4 references per paper).  We should then be able to complete ingestion of the ~1.4 million remaining OA resources at PubMed Central within about three months.
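The arithmetic behind these estimates can be checked with a few lines of Python. All input figures are taken from this post; only the variable names are introduced here.

```python
# Figures quoted in this post.
current_rate = 200_000        # citing resources ingested per year since July 2016
speedup = 30                  # expected gain from the new server and ingest engines
refs_per_paper = 40           # round figure used for the estimate
remaining_pmc = 1_400_000     # OA resources at PubMed Central still to ingest
occ_citations = 12_830_347    # citation links in the OCC on 19 February 2018
occ_citing = 302_758          # citing resources in the OCC on the same date

new_rate = current_rate * speedup                    # 6,000,000 citing resources per year
citations_per_year = new_rate * refs_per_paper       # 240,000,000 citations per year
months_for_pmc = remaining_pmc / new_rate * 12       # about 2.8 months
observed_refs_per_paper = occ_citations / occ_citing # about 42.4

print(f"{new_rate:,} citing resources per year")
print(f"{citations_per_year:,} citations per year")
print(f"{months_for_pmc:.1f} months to finish PubMed Central")
print(f"{observed_refs_per_paper:.1f} references per paper in the OCC")
```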

At that stage, we plan to start ingesting references from the ~17 million journal articles whose deposited references are now open at Crossref as a consequence of the Initiative for Open Citations. The scholarly world currently publishes about 2.5 million new journal articles each year, of which about half will probably be open at Crossref (assuming Elsevier has not by then opened its references). So, by the end of 2020, Crossref will have ~650 million open references. In addition to ingesting new open Crossref references as they are made available, we will be able to eat into the backlog of existing Crossref open references at a catch-up rate of ~190 million per year. By the end of 2020, we anticipate that the OCC should contain ~650 million citations harvested from PMC and Crossref, roughly half the coverage of Web of Science. We are currently also considering ingest of references from other major bibliographic databases.

Our vision for OpenCitations

Our vision is that OpenCitations should become a comprehensive source of open citation information from all disciplines of scholarly endeavour encoded as Linked Open Data, a key component of the academic open infrastructure used on a daily basis without charge by scholars worldwide.

To be of maximum utility, it requires effective graphical user interfaces and analytical tools to interrogate and quantify the data contained within the OCC.  Since these data are all open, we anticipate that such interface and tool development will best be undertaken collaboratively within the open scholarly community, and we invite developers interested in such collaboration to contact us at contact@opencitations.net.

Reference

[1]     Ivan Heibi, Silvio Peroni and David Shotton (2018). OSCAR: A customisable tool for free-text search over SPARQL endpoints. Accepted to the 2018 International Workshop on Semantics, Analytics, Visualisation: Enhancing Scholarly Dissemination (https://save-sd.github.io/2018/, co-located with The Web Conference), 24 April 2018, Lyon, France. Preprint available at https://w3id.org/people/essepuntato/papers/oscar-savesd2018.html
