OpenCitations described

OpenCitations is an infrastructure organization for open scholarship dedicated to the publication of open bibliographic and citation data. We at OpenCitations are proud to announce the publication, in the first issue of Quantitative Science Studies, of a canonical paper in which we introduce and describe OpenCitations and outline its achievements and goals [1].

Here, I outline the contents of our paper, and provide definitive links on the topics described. Many of these topics have been the subjects of earlier blog posts.

This paper appears in the first Special Issue of QSS, which is dedicated to describing the bibliometric data sources that lie at the heart of scientometric research. The issue aims to characterize the most important data sources currently available and to show how they differ in various dimensions, for instance in the data they provide, their level of openness, and their support for making research reproducible. The first three papers in this special issue cover the most important commercial bibliographic data sources: Web of Science (Clarivate Analytics), Scopus (Elsevier), and Dimensions (Digital Science). The remaining three articles describe open data sources: Microsoft Academic, Crossref and OpenCitations.

In the introduction to our own paper, we describe the origins of OpenCitations, discuss the growth and benefits of open science, and introduce the Semantic Web techniques used at OpenCitations for recording and publishing our data. We then go on to describe OpenCitations’ services and data, namely Open Citation Identifiers, the OpenCitations Data Model, the SPAR (Semantic Publishing and Referencing) Ontologies, the OpenCitations Corpus, and the OpenCitations Indexes of citation data, of which the first and largest is COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, which currently holds information on over 624 million citations. We conclude our survey of OpenCitations’ services and data by outlining the generic open source software developed at OpenCitations. This includes OSCAR, the OpenCitations RDF Search Application for searching over RDF datasets; LUCINDA, OSCAR’s associated OpenCitations RDF Resource Browser; and RAMOSE, OpenCitations’ application for creating REST APIs over SPARQL endpoints, which opens Semantic Web datasets to those not familiar with SPARQL, the RDF query language.

In the second half of the paper, we describe OpenCitations as an organization in terms of its compliance with the principles for the sustainability of open infrastructures proposed by Bilder, Lin and Neylon (2015) [2], and report the selection of OpenCitations by the Global Sustainability Coalition for Open Science Services (SCOSS) as an open infrastructure organization worthy of crowd-funding support by the stakeholder community. We then provide usage statistics for our datasets and web site, and describe the adoption of OpenCitations data and services by the community, before concluding with a forward look at our proposed developments of OpenCitations activities.

References

[1] Silvio Peroni and David Shotton (2020). OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies 1 (1): 428-444. https://doi.org/10.1162/qss_a_00023

[2] Geoffrey Bilder, Jennifer Lin and Cameron Neylon (2015). Principles for open scholarly infrastructures. Figshare. https://doi.org/10.6084/m9.figshare.1314859


The first issue of Quantitative Science Studies

The memorable date 20/02/2020 saw the publication by MIT Press of the first issue of Volume One of a new journal, Quantitative Science Studies (QSS), the official open access journal of the International Society for Scientometrics and Informetrics (ISSI). QSS’s Editor in Chief is Ludo Waltman (CWTS, University of Leiden, Netherlands). Vincent Larivière (Université de Montréal, Montreal, Quebec, Canada) and Staša Milojević (Indiana University Bloomington, Bloomington, Indiana, USA) are its Associate Editors, and the journal has a large and distinguished editorial board.

What makes the launch of this new journal remarkable is the story of how it came into being. In 2019, the entire editorial team of the Journal of Informetrics (JOI), a leading journal in this field published by Elsevier, resigned en masse and decided to start an alternative journal, QSS, both because of Elsevier’s position on open citations, and because, in their opinion, the financial model used by Elsevier violates the scientific ethos.

Reproducibility in the field of scientometrics requires scientific metadata that are both of high quality and open, particularly those relating to bibliographic citations. The JOI editorial board was deeply concerned by the refusal of Elsevier to join almost all other large scholarly publishers in supporting the Initiative for Open Citations (I4OC). As we have previously reported on this blog, Elsevier is the largest contributor of bibliographic references to Crossref, but insists that these data should be kept closed.

Elsevier’s position, driven by commercial interests (since it sells access to citation data through Scopus), flies in the face of the scientific community’s clear move towards open science, with hundreds of scientometricians having signed an ISSI open letter urging scholarly publishers to support I4OC.

Science is a self-governing system, and the editorial team held the view that the ultimate responsibility for a scholarly journal should fall with the scientific community, who serve as the gatekeepers, producers, and consumers of scientific content.

The editorial team also believed Elsevier’s subscription fees to be excessive, and its article processing charges (APCs) for open access publishing to be unfairly high, thus limiting both those who can afford to read Elsevier journals and those who can afford to publish in them, so that publishing with Elsevier inevitably places major limits on scholarship, harming both science and society. It was for all these reasons that they forsook JOI and started QSS.

We at OpenCitations congratulate the editorial team for their courage in deciding to make this journal flip, and wish them, together with the ISSI and MIT Press, every success for this important new journal. We also commend the Technische Informationsbibliothek (TIB) – Leibniz Information Centre for Science and Technology and the Communication, Information, Media Centre (KIM) of the University of Konstanz, who, in collaboration with the Fair Open Access Alliance (FOAA), have generously agreed to cover APCs for the first three years of the QSS journal.


Introducing InTRePIDs – In-Text Reference Pointer Identifiers

Rationale

Readers of this blog will be familiar with Open Citation Identifiers (OCIs), described in an earlier post and formally defined in [1]. OCIs enable bibliographic citations, treated as first class information entities, to be uniquely identified and referenced, and are used to identify the >624 million individual citations indexed in the latest release of COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, as described in a recent post.

However, COCI and similar citation indexes do not provide any information about where within the citing paper a citation is generated, the textual contexts of the in-text reference pointers, or the reasons for including different in-text reference pointers denoting the same reference at different points within the text.

As explained in the preceding post describing the Open Biomedical Citations in Context Corpus funded by the Wellcome Trust and under development by OpenCitations, deep citation analysis requires a more nuanced approach to citations, which acknowledges that each in-text reference pointer that denotes a bibliographic reference in the reference list of a citing publication instantiates its own citation, as shown in Figure 1.

Figure 1. Citations between a citing paper and a cited paper instantiated both by the inclusion of a bibliographic reference within the reference list of the citing paper and by the inclusion within the text of the citing paper of one or more in-text reference pointers denoting that reference.

The pointer citations clearly involve the same cited publication as does the reference citation itself, but each has its own unique characteristics: the location and textual context of its in-text reference pointer within the text of the citing publication, and its particular rhetorical function which is determined by that context.

If the reference citation is open (as defined in [2]) and identified by an OCI, each in-text reference pointer related to that citation can be identified uniquely using an In-Text Reference Pointer Identifier (InTRePID).

InTRePIDs facilitate in-depth scholarship on in-text reference pointer locations and citation functions, and fine-grained analysis of the relationships between publications, by making it possible

  • to identify each in-text reference pointer with a unique PID,
  • to distinguish references that are cited only once from those that are cited multiple times,
  • to see which references are cited together (e.g. in the same sentence or within an in-text reference pointer list),
  • to determine from which section(s) of the article references are cited (e.g. Introduction, Methods, Discussion), and, potentially,
  • to determine the rhetorical function of the citations from analysis of their textual contexts, by the application of natural language processing, machine learning and artificial intelligence techniques to conduct sentiment analysis on the citation contexts.

Definition of an InTRePID

An InTRePID is composed of two parts separated by an oblique stroke:

intrepid:<oci-numerals>/<ordinal>-<total>

where

  • <oci-numerals> is the numerical part of the OCI uniquely identifying the particular open citation to which the in-text reference pointer and its denoted bibliographic reference relate. Thus an InTRePID can be assigned for any in-text reference pointer that relates to an open citation for which a valid OCI has been assigned;
  • <ordinal> identifies the nth occurrence of an in-text reference pointer within the text of the citing paper relating to that citation; and
  • <total> defines the total number of in-text reference pointers denoting that bibliographic reference within the citing paper.

For example, intrepid:070433-070475/4-6 is a valid InTRePID for an in-text reference pointer defined within the OpenCitations Citations in Context Corpus.

A formal definition document for the InTRePID is given in [3].
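As an illustration of this structure, here is a minimal Python sketch that splits an InTRePID into its three components and reassembles it. It is not part of the OpenCitations software: the names InTRePID, parse_intrepid and format_intrepid are hypothetical, and the sketch assumes that the OCI numerals consist of two digit strings joined by a hyphen and that the ordinal and total are also joined by a hyphen, as in intrepid:070433-070475/4-6. The formal definition [3] remains the authoritative reference.

```python
import re
from typing import NamedTuple


class InTRePID(NamedTuple):
    """Hypothetical container for the three components of an InTRePID."""
    oci_numerals: str  # numerical part of the OCI, e.g. "070433-070475"
    ordinal: int       # nth occurrence of the pointer in the citing text
    total: int         # total number of pointers denoting that reference


# Assumed shape: intrepid:<digits>-<digits>/<ordinal>-<total>
_PATTERN = re.compile(r"^intrepid:(\d+-\d+)/(\d+)-(\d+)$")


def parse_intrepid(text: str) -> InTRePID:
    """Split an InTRePID string into its components, or raise ValueError."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognisable InTRePID: {text!r}")
    ordinal, total = int(match.group(2)), int(match.group(3))
    # The ordinal counts occurrences, so it must lie between 1 and the total.
    if not 1 <= ordinal <= total:
        raise ValueError(f"ordinal {ordinal} is outside 1..{total}")
    return InTRePID(match.group(1), ordinal, total)


def format_intrepid(pointer: InTRePID) -> str:
    """Reassemble the string form from the parsed components."""
    return f"intrepid:{pointer.oci_numerals}/{pointer.ordinal}-{pointer.total}"
```

Parsing intrepid:070433-070475/4-6 with this sketch yields the OCI numerals 070433-070475, the ordinal 4 and the total 6, while a string whose ordinal exceeds its total is rejected.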

Exemplar in-text reference pointers

Consider the following citing paper:

Zou, J. et al. (2020). Phenotypic and genotypic correlates of penicillin susceptibility in nontoxigenic Corynebacterium diphtheriae, British Columbia, Canada, 2015–2018. Emerging Infectious Diseases, 26: 97-103. https://doi.org/10.3201/eid2601.191241

This paper contains six in-text reference pointers denoting Reference 13 in the reference list:

13. Lowe, C. et al. (2011). Cutaneous diphtheria in the urban poor population of Vancouver, British Columbia, Canada: a 10-year review. J. Clinical Microbiology 49: 2664-2666. https://doi.org/10.1128/JCM.00362-11

The InTRePIDs for these pointers are recorded within the OpenCitations Biomedical Citations in Context Corpus, together with the corpus identifiers and DOIs of the citing and cited papers, as shown in the excerpt presented in Figure 2.

Figure 2. An excerpt from the OpenCitations Biomedical Citations in Context Corpus, showing highlighted the InTRePIDs for the six in-text reference pointers within Zou, J. et al. (2020) denoting Reference 13, the reference to Lowe, C. et al. (2011), together with the internal corpus identifiers for each in-text reference pointer, and the corpus identifiers and DOIs for the citing and cited papers.

These six in-text reference pointers have the InTRePIDs intrepid:070433-070475/1-6 to intrepid:070433-070475/6-6. As examples, the first and the fourth are given below, each with its document location, its embedding sentence (including the in-text reference pointer list) and its InTRePID:

Introduction. “Nontoxigenic strains have been shown to have epidemic potential, causing infections in persons afflicted by homelessness, alcohol abuse, and injection drug use (9,13–15).” (intrepid:070433-070475/1-6)

Discussion. “We also noted ST5 and ST32 in our review from downtown Vancouver during 1998–2007 (13).” (intrepid:070433-070475/4-6)

The first of these discusses those people most susceptible to diphtheria infection, while the other discusses which multilocus sequence types (STs) of C. diphtheriae were found, thus relating to the organism causing the infection rather than to the infected individuals. The rhetorical function of these two in-text reference pointers is quite distinct.

To permit this information to be recorded within the OpenCitations Citations in Context Corpus, extensions were required to the OpenCitations Data Model, a new extended version of which was recently published [4], as described in a related blog post.

The OpenCitations InTRePID Resolution Service

To support the use of InTRePIDs to identify in-text reference pointers, OpenCitations has recently developed an InTRePID Resolution Service (currently in ‘beta’ in its development cycle), which is running at http://opencitations.net/intrepid. A screenshot of this service is shown in Figure 3.

Figure 3. A screenshot of the user interface of the InTRePID Resolution Service.

In addition to using the Web user interface shown in Figure 3, InTRePIDs can be entered into this resolution service in the form of resolvable URIs, e.g.

http://opencitations.net/intrepid/070433-070475/4-6

As shown in Figure 4, the OpenCitations InTRePID Resolution service returns metadata concerning the in-text reference pointer identified by the InTRePID, and the bibliographic reference that it denotes, from which further information about the citation and the citing and cited publications may be obtained by following the links provided.

Figure 4. A screenshot of the Web page displaying metadata returned by the InTRePID Resolution Service.

Note that as well as rendering this information in HTML on a web page, the resolution service can also provide it in a variety of machine-readable formats.
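The mapping from an InTRePID to its resolvable URI can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the use of an HTTP Accept header to ask for a machine-readable format is an assumption on our part, as is the "application/json" media type used as the default. The base URL is the one given above. The code builds the request without sending it.

```python
from urllib.request import Request

# Base address of the resolution service, as given in this post.
BASE_URL = "http://opencitations.net/intrepid"


def intrepid_to_url(intrepid: str) -> str:
    """Turn 'intrepid:<oci-numerals>/<ordinal>-<total>' into a resolvable URI."""
    prefix = "intrepid:"
    if not intrepid.startswith(prefix):
        raise ValueError(f"missing 'intrepid:' prefix: {intrepid!r}")
    return f"{BASE_URL}/{intrepid[len(prefix):]}"


def build_request(intrepid: str, media_type: str = "application/json") -> Request:
    """Prepare (but do not send) a request for the pointer's metadata.

    Assumption: the service selects the output format from the Accept
    header; the default media type here is only a placeholder.
    """
    return Request(intrepid_to_url(intrepid), headers={"Accept": media_type})
```

Passing the returned object to urllib.request.urlopen would then perform the actual lookup.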

Conclusion

InTRePIDs, which enable the identification of individual in-text reference pointers, and the InTRePID Resolution Service, are new services from OpenCitations that will facilitate scholarship on the textual contexts and rhetorical functions of such in-text reference pointers, and of the citations that they instantiate.

InTRePIDs were first announced on 30th January 2020 at PIDapalooza 2020, the Open Festival of Persistent Identifiers, held in Lisbon.

References

[1] Silvio Peroni and David Shotton (2019). Open Citation Identifier: Definition. Figshare. https://doi.org/10.6084/m9.figshare.7127816.v2

[2] Silvio Peroni and David Shotton (2018). Open Citation: Definition. Figshare. https://doi.org/10.6084/m9.figshare.6683855

[3] David Shotton, Marilena Daquino and Silvio Peroni (2020). In-Text Reference Pointer Identifier: Definition. Figshare. https://doi.org/10.6084/m9.figshare.11674032

[4] Marilena Daquino, Silvio Peroni and David Shotton (2019). The OpenCitations Data Model. Version 2.0. Figshare. https://doi.org/10.6084/m9.figshare.3443876


The Open Biomedical Citations in Context Corpus: Progress Report

The creation of the Open Biomedical Citations in Context Corpus (CCC) is the goal of a one-year project funded by the Wellcome Trust. The aim is to create a new open corpus of bibliographic and citation data that contain detailed information about individual in-text reference pointers in biomedical journal articles. The project is led by Professor Silvio Peroni of the Research Centre for Open Scholarly Metadata (University of Bologna), is being undertaken by Dr Marilena Daquino (University of Bologna), and actively involves the Oxford e-Research Centre (University of Oxford), the École de Bibliothéconomie et des Sciences de l’Information (Université de Montréal), and the Centre for Science and Technology Studies (CWTS, Leiden University).

An in-text reference pointer is a textual device (e.g. “[1]”, or “(Peroni and Shotton 2012)”) that appears in the main text of a citing work and denotes a bibliographic reference listed in the Bibliography section of the citing work. While a single in-text reference pointer uniquely denotes a single bibliographic reference, it can occur together with one or more other pointers, forming an in-text reference pointer list that denotes several references (e.g. “[5-13]”, or “(Peroni and Shotton 2012; Peroni and Shotton 2019)”). In-text reference pointers may appear in several places within the same citing publication (e.g. Introduction, Methods, Discussion), may occur within different document components (e.g. body text, figure captions, tables), and may address the cited publication for different purposes (e.g. as the source of an experimental protocol, as a data source, or for general background information).
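To make the notion of an in-text reference pointer list concrete, the following minimal Python sketch expands a numeric list such as “[5-13]” into the individual reference numbers it denotes. The function name expand_pointer_list is hypothetical and the code is not taken from the OpenCitations pipeline. It handles only numeric citation styles, with ranges and comma-separated items; author-year styles would need a different approach.

```python
import re
from typing import List


def expand_pointer_list(text: str) -> List[int]:
    """Expand a numeric pointer list, e.g. '[5-13]' or '(9,13–15)'.

    Returns the reference numbers denoted by the list, in order of
    appearance. Raises ValueError for anything that is not a number or
    a range of numbers.
    """
    inner = text.strip().strip("[]()")
    numbers: List[int] = []
    for part in re.split(r"\s*,\s*", inner):
        # A range may be written with a hyphen or an en dash.
        match = re.fullmatch(r"(\d+)\s*[-–]\s*(\d+)", part)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            numbers.extend(range(start, end + 1))
        elif part.isdigit():
            numbers.append(int(part))
        else:
            raise ValueError(f"unrecognised pointer: {part!r}")
    return numbers
```

Applied to “[5-13]”, the sketch returns the nine numbers from 5 to 13 inclusive, so the pointers to references 6 to 12 that are only implicit in the text become explicit.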

Unfortunately, current citation indexes contain no information about in-text reference pointers, such as the number of times a particular work is referenced in the citing work, the text of the sentences in which they occur, or the rhetorical purpose of such citations.

Having data at the level of individual in-text reference pointers offers many new opportunities, enabling one: (1) to distinguish between works that are referenced just once in a citing publication and those that are referenced multiple times, and thereby (potentially) to distinguish when a citation is fundamental for the understanding or the development of the citing work, or merely incidental; (2) to see which in-text reference pointers occur together (e.g. in the same sentence or the same paragraph), thus, potentially, to infer similarities between the co-cited publications; and (3) to determine in which specific sections of the publication these in-text references occur (e.g. Introduction, Methods, Results), and thus, potentially, by means of textual analysis of the citation contexts, to retrieve the rhetorical functions of the citations – i.e. the reason why an author cites another work. 
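The three kinds of analysis listed above can be sketched over a simple list of pointer records. The Python below is an illustration under stated assumptions, not a description of the CCC data model: the Pointer record, its field names and the three function names are all hypothetical, and a real analysis would read the corresponding information from the corpus.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Pointer(NamedTuple):
    """Hypothetical record for one in-text reference pointer."""
    reference: int  # number of the denoted reference in the reference list
    section: str    # section of the citing paper, e.g. "Introduction"
    sentence: int   # index of the embedding sentence in the citing paper


def mention_counts(pointers: Iterable[Pointer]) -> Counter:
    """(1) How many times each reference is pointed to in the text."""
    return Counter(p.reference for p in pointers)


def co_cited_pairs(pointers: Iterable[Pointer]) -> Set[Tuple[int, int]]:
    """(2) Pairs of references whose pointers share a sentence."""
    by_sentence: Dict[int, Set[int]] = defaultdict(set)
    for p in pointers:
        by_sentence[p.sentence].add(p.reference)
    pairs: Set[Tuple[int, int]] = set()
    for refs in by_sentence.values():
        pairs.update(combinations(sorted(refs), 2))
    return pairs


def sections_by_reference(pointers: Iterable[Pointer]) -> Dict[int, Set[str]]:
    """(3) The sections from which each reference is cited."""
    sections: Dict[int, Set[str]] = defaultdict(set)
    for p in pointers:
        sections[p.reference].add(p.section)
    return dict(sections)
```

For instance, given invented records in which reference 13 is pointed to once in the Introduction (in the same sentence as reference 9) and once in the Discussion, mention_counts reports two mentions for reference 13 and one for reference 9, and co_cited_pairs reports the single pair (9, 13).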

The goal of the CCC Project is to provide stakeholders with an exemplar Linked Open Data corpus, created from the open access biomedical research literature, that is tailored for such deep citation analyses. The corpus will be a new member of the collection of OpenCitations datasets, and will be accompanied by services for accessing and querying data.

In the CCC Project, we have achieved or are currently dealing with the following developments:

  • Extending the OpenCitations Data Model (OCDM). The OpenCitations Data Model has been extended and enriched with new terms and relations to represent bibliographic entities related to in-text reference pointers, such as the in-text reference pointers themselves, in-text reference pointer lists, discourse elements (e.g. sections, paragraphs, sentences), and annotations on citations, bibliographic references and in-text reference pointers. In addition, the provenance layer of the data model has been revised to provide meaningful provenance information in a more compact way. A revised version of the OCDM including these terms was published on November 8, 2019, and it is available on Figshare [1].
  • Extending the OpenCitations harvesting and data re-engineering pipeline. The CCC Project leverages existing OpenCitations technologies for building this new corpus, using as input articles from the Open Access Subset of biomedical literature hosted by Europe PubMed Central (EPMC) and encoded in XML. The OpenCitations pipelines for knowledge extraction (i.e. the software called BEE) and for data re-engineering (i.e. the software called SPACIN) have been enhanced so as to harvest relevant information from the full-text of the XML sources provided by EPMC, rather than just the reference lists, and to transform these data into RDF according to the revised OCDM. The source code of the new pipeline is available on GitHub.
  • Creating InTRePID, a new persistent identifier for in-text reference pointers. Different in-text reference pointers denoting the same bibliographic reference have distinct logical, rhetorical and textual contexts wherein they occur. To permit them to be identified individually and handled properly, we have recently developed a new persistent identifier, the In-Text Reference Pointer Identifier (InTRePID), for identifying individual in-text reference pointers relating to an open bibliographic citation. The InTRePID is based on the Open Citation Identifier (OCI), currently being used to identify the >624 million citations present in the new release of COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, described in the previous blog post. The formal definition of an InTRePID is available on Figshare [2]. In addition, an InTRePID Resolution Service has been developed (currently in beta-testing) to facilitate the retrieval of the metadata relating to in-text reference pointers. At the moment, a subset of the CCC corpus is available online for testing the InTRePID Resolution Service.
  • Development of services for accessing and querying the Citations in Context Corpus. Along with the development of the CCC itself, we are also developing services for querying data within the CCC. In particular, we are currently working to extend the RAMOSE software to provide an API for accessing the CCC triplestore. This CCC API will permit users to access the CCC corpus and retrieve detailed information about in-text reference pointers and their related annotations in a variety of human- and machine-readable formats. The source code of the API Manager is available on GitHub. The configuration file for querying the CCC corpus is still in the process of development.

Moreover, we are currently working to evaluate the data quality of the CCC corpus content and to develop reconciliation activities with information stored in Crossref. First, by means of new validation methods, we are testing whether the extracted in-text reference pointers are complete (i.e. determining that all the in-text reference pointers for a particular bibliographic reference have been correctly extracted from the text), and whether in-text reference pointer lists (e.g. “[5-13]”) have been correctly parsed to extract all the implicit pointers (in this case “[6]”, “[7]”, “[8]”, “[9]”, “[10]”, “[11]” and “[12]”) and to associate them correctly with the bibliographic references that they denote. This activity is fundamental in order to address the diverse citation styles adopted by different journals and to overcome possible inconsistencies in the publishers’ XML markup of the articles. Secondly, whenever a DOI is not specified for the citing or cited publications in the full-text of the citing publication, a text search using the Crossref API is performed in order to match possible candidates and supply the missing DOI. This reconciliation process can itself be error-prone, since recommended matches are obtained by means of a non-transparent scoring mechanism. We are therefore currently testing the application of a scoring threshold intended to eliminate false positives and retain only correct matches.
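The idea of a scoring threshold can be sketched as follows. This minimal Python example is not the code used in the CCC pipeline: the function name accept_match and the example threshold are hypothetical, and the sketch assumes that each candidate returned by the text search is a mapping carrying a "DOI" and a numeric "score", which is how we understand Crossref search results to be shaped. It accepts the best-scoring candidate only when its score reaches the threshold, and otherwise leaves the DOI missing.

```python
from typing import Any, Mapping, Optional, Sequence


def accept_match(candidates: Sequence[Mapping[str, Any]],
                 threshold: float) -> Optional[str]:
    """Return the DOI of the best candidate, or None if it scores too low.

    Assumption: every candidate has a "DOI" string and a numeric "score".
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda candidate: candidate["score"])
    # Below the threshold it is safer to keep the DOI missing than to
    # record a possibly wrong match.
    if best["score"] < threshold:
        return None
    return best["DOI"]
```

Choosing the threshold is a trade-off: a higher value rejects more false positives but also discards some correct matches, which is why the value has to be tested empirically rather than fixed in advance.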

The deployment of the enhanced OpenCitations pipeline for populating the CCC corpus automatically is planned to start in the coming weeks. For more details and to provide suggestions, please contact us!

References

[1] Marilena Daquino, Silvio Peroni and David Shotton (2019). The OpenCitations Data Model. Version 2.0. Figshare. https://doi.org/10.6084/m9.figshare.3443876

[2] David Shotton, Marilena Daquino and Silvio Peroni (2020). In-Text Reference Pointer Identifier: Definition. Figshare. https://doi.org/10.6084/m9.figshare.11674032


More than 624 million citations now available on COCI

COCI is the OpenCitations Index of Crossref open DOI-to-DOI citations, all released as CC0 material, and is described in the article

Heibi I, Peroni S, Shotton D (2019). Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. Scientometrics 121(2): 1213-1228. https://doi.org/10.1007/s11192-019-03217-6

COCI is our first OpenCitations Index of open citations, in which we have applied the concept of citations as first-class data entities, each identified using a unique persistent Open Citation Identifier (OCI), to index the contents of one of the major databases of open scholarly citation information, namely Crossref, and to render and make available this information in machine-readable RDF.

We are now proud to announce the third release of COCI, which contains more than 624 million DOI-to-DOI citation links coming from both the ‘Open’ and the ‘Limited’ sets of Crossref reference data. This represents an increase of 40% in the number of indexed citations, compared with the second release of COCI on 12th November 2018, which indexed more than 445 million citations. The data model used for this third release of COCI is the updated revision of the OpenCitations Data Model, published on 8 November 2019 and available at https://doi.org/10.6084/m9.figshare.3443876.

This new release of COCI has been created using new software developed specifically for this purpose, which is available on our GitHub repository under an open ISC license. This software automates the process of creating an OpenCitations Index compliant with the OpenCitations Data Model and creates the citation data and related provenance information in three different formats: CSV, N-Triples (RDF), and Scholix. The support for Scholix – a high-level interoperability framework supported by Crossref, DataCite, Europe PubMed Central, OpenAIRE and others  – has recently been added to provide an additional format for the exchange of information about the links between scholarly literature and datasets.

A great advantage of the new software is that it will now enable us to extend COCI (and any other OpenCitations Index) by means of incremental additions, rather than having to re-create the entire index at each update. This should enable us to release index updates more frequently than hitherto, thus keeping the index more closely in synchrony with the latest reference data released by Crossref. Note that we are currently running the software on previous dumps of Crossref data so as to retrieve all the citations that involve references in citing articles that were in the ‘Limited’ set when we downloaded it, but that currently appear in the Crossref ‘Closed’ data set due to more recent restrictive policy decisions taken by their publishers.

Finally, we wish to remind you that all the bibliographic and citation data in COCI:


OpenCitations selected for SCOSS second funding cycle

The Global Sustainability Coalition for Open Science Services (SCOSS) is launching its second funding cycle, and OpenCitations is one of three open science infrastructure organizations whose services have been evaluated and selected for presentation to the international scholarly community for crowd-sourced sustainability funding, along with the Public Knowledge Project (PKP) and the Directory of Open Access Books (DOAB).

OpenCitations is an innovative infrastructure organization for open scholarship dedicated to the publication of open bibliographic and citation data concerning academic publications as Linked Open Data using Semantic Web technologies, thereby providing a disruptive alternative to traditional proprietary citation indexes. It also undertakes related advocacy work, particularly as a founding member of the Initiative for Open Citations (I4OC).

OpenCitations developed the OpenCitations Corpus (OCC), a database of open downloadable bibliographic and citation data recorded in RDF and released under a Creative Commons CC0 public domain waiver, which currently contains information about 14 million citation links to over 7.5 million cited resources. In addition and separately, OpenCitations is currently developing a number of Open Citation Indexes, using the data openly available in third-party bibliographic databases. The first and largest of these is COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, which presently contains information encoded in RDF on more than 445 million citations, released under a CC0 waiver.

OpenCitations structures its data according to the OpenCitations Data Model (OCDM), which may also be employed by third parties, either for their own use or to structure their data for submission to and publication by OpenCitations. This model uses OpenCitations’ suite of SPAR (Semantic Publishing and Referencing) Ontologies developed to describe all aspects of the scholarly publishing domain. OpenCitations has also published open software of generic applicability for searching, browsing and providing REST APIs over RDF triplestores.

OpenCitations fully supports the founding principles of Open Science. It complies with the FAIR data principles proposed by Force11 that data should be findable, accessible, interoperable and re-usable, and it complies with the recommendations of I4OC that citation data, in particular, should be structured, separable and open. OpenCitations has published a formal definition of an open citation, and has launched a system for globally unique and persistent identifiers (PIDs) for bibliographic citations – the Open Citation Identifiers (OCIs) – for which it maintains an OCI resolution service.

OpenCitations has the potential to be a game-changer in the scholarly information landscape, giving institutions and individuals the ability to analyse and reuse publication citations in other infrastructures, in library collections, and in research. Open citation data are particularly valuable for bibliometric analysis, increasing the reproducibility of large-scale analyses by enabling the publication of the source data upon which analytical results are based. Since citation data are also crucial to evaluating research performance, such access to open, transparent citation data sources is a priority for Open Science.

SCOSS was formed in early 2017 with the purpose of providing a new coordinated and targeted crowd-sourcing and cost-sharing framework to enable the Open Access and Open Science communities to support the open infrastructure services on which they depend. In its first funding cycle, more than 1.5 million euros was pledged by more than 200 institutions worldwide to help fund and sustain the Directory of Open Access Journals (DOAJ) and SHERPA/RoMEO.

With the launch of its second funding cycle, SCOSS is appealing to academic institutions and their libraries, research institutes, publishers, funding organisations, national and regional governments, international organisations, learned societies and service providers worldwide —  everyone who is invested in Open Access and Open Science — to support one or more of these three new selected open infrastructure services through a three-year commitment.

For more details about the services, suggested funding levels, and how you can help support OpenCitations, please see https://sparceurope.org/download/7913/ or contact us at donations@opencitations.net.


COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations

Abstract

In this paper, we present COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations (http://opencitations.net/index/coci). COCI is the first open citation index created by OpenCitations, in which we have applied the concept of citations as first-class data entities, and it contains more than 445 million DOI-to-DOI citation links derived from the data available in Crossref. These citations are described in RDF by means of the new extended version of the OpenCitations Data Model (OCDM). We introduce the workflow we have developed for creating these data, and also show the additional services that facilitate the access to and querying of these data by means of different access points: a SPARQL endpoint, a REST API, bulk downloads, Web interfaces, and direct access to the citations via HTTP content negotiation. Finally, we present statistics regarding the use of COCI citation data, and we introduce several projects that have already started to use COCI data for different purposes.

Introduction

The availability of open scholarly citations [21] is a public good, of significant value to the academic community and the general public. In fact, citations not only serve as an acknowledgment medium [16], but also can be characterised topologically (by defining the connected graph between citing and cited entities and its evolution over time [19]), sociologically (such as for identifying odd conduct within or elitist access paths to scientific research [18]), quantitatively by creating citation-based metrics for evaluating the impact of an idea or a person [17], and financially by defining the scholarly value of a researcher within his/her own academic community [20]. The Initiative for Open Citations (I4OC, https://i4oc.org) has dedicated the past two years to persuading publishers to provide open citation data by means of the Crossref platform (https://crossref.org), obtaining the release of the reference lists of more than 43 million articles (as of February 2019), and it is this change of behaviour by the majority of academic publishers that has permitted COCI to be created.

OpenCitations (http://opencitations.net) is a scholarly infrastructure organization dedicated to open scholarship and the publication of open bibliographic and citation data by the use of Semantic Web (Linked Data) technologies, and is a founding member of I4OC. It has created and maintains the SPAR (Semantic Publishing and Referencing) Ontologies (http://www.sparontologies.net) [22] for encoding scholarly bibliographic and citation data in RDF, and has previously developed the OpenCitations Corpus (OCC) of open downloadable bibliographic and citation data recorded in RDF [4].

In this paper, we introduce a new dataset made available a few months ago by OpenCitations, namely COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations (https://w3id.org/oc/index/coci). This dataset, launched in July 2018, is the first of the indexes proposed by OpenCitations (https://w3id.org/oc/index), in which citations are exposed as first-class data entities with accompanying properties (i.e. individuals of the class cito:Citation as defined in CiTO [7]) instead of being defined simply as relations between two bibliographic resources (via the property cito:cites). Currently, COCI contains more than 445 million DOI-to-DOI citation links made available under a Creative Commons CC0 public domain waiver, which can be accessed and queried through a SPARQL endpoint, an HTTP REST API, searching/browsing Web interfaces, bulk download in different formats (CSV and N-Triples), or direct access via HTTP content negotiation.

The rest of the paper is organized as follows. In Section 2 we introduce some of the main RDF datasets containing scholarly bibliographic metadata and citations. In Section 3, we provide some details on the rationale and the technologies used to describe citations as first-class data entities, which are the main foundation of the development of COCI. In Section 4, we present COCI, including the workflow process developed for ingesting and exposing the open citation data available and other tools used for accessing these data. In Section 5, we show the scale of the community uptake of COCI since its launch by means of quantitative statistics on the use of its related services and by listing existing projects that are using it for specific purposes. Finally, in Section 6, we conclude the paper by sketching out related and upcoming projects.

Related works

We have noticed a growing interest within the Semantic Web community in creating and making available RDF datasets concerning the metadata of scholarly resources, particularly bibliographic resources. In this section, we briefly introduce some of the most relevant ones.

ScholarlyData (http://www.scholarlydata.org) [1] is a project that refactors the Semantic Web Dog Food so as to keep the dataset growing in good health. It uses the Conference Ontology, an improved version of the Semantic Web Conference Ontology, to describe metadata of documents (5,415, as of March 31, 2019), people (more than 1,100), and data about academic events (592) where such documents have been presented.

Another important source of bibliographic data in RDF is OpenAIRE (https://www.openaire.eu) [3]. Created with funding from the European Union, its RDF dataset makes available data for around 34 million research products created in the context of around 2.5 million research projects.

While important, these aforementioned datasets do not provide citation links between publications as part of their RDF data. In contrast, the following datasets do include citation data as part of the information they make available.

In 2017, Springer Nature announced SciGraph (https://scigraph.springernature.com) [2], a Linked Open Data platform aggregating data sources from Springer Nature and other key partners managing scholarly domain data. It contains data about journal articles (around 8 million, as of March 31, 2019) and book chapters (around 4.5 million), including their related citations, and information on around 7 million people involved in the publishing process.

The OpenCitations Corpus (OCC, https://w3id.org/oc/corpus) [4] is a collection of open bibliographic and citation data created by ourselves, harvested from the open access literature available in PubMed Central. As of March 31, 2019, it contains information about almost 14 million citation links to more than 7.5 million cited bibliographic resources.

WikiCite (https://meta.wikimedia.org/wiki/WikiCite) is a proposal, with a related series of workshops, which aims at building a bibliographic database in Wikidata [10] to serve all Wikimedia projects. Currently Wikidata hosts (as of March 29, 2019) more than 170 million citations.

Biotea (https://biotea.github.io) [5] is an RDF dataset containing information about some of the articles available in the Open Access subset of PubMed Central, which have been enhanced with specialized annotation pipelines. The last released dataset includes information extracted from 2,811 articles, including data on their citations.

Finally, Semantic Lancet [6] proposes to build a dataset of scholarly publication metadata and citations (including the specification of the citation functions) starting from articles published by Elsevier. To date it includes the bibliographic metadata, abstracts and citations of 291 articles published in the Journal of Web Semantics.

Indexing citations as first-class data entities

Citations are normally defined simply as links between published entities (from a citing entity to a cited entity). However, an alternative richer view is to regard each citation as a data entity in its own right, as illustrated in Figure 1. This alternative approach permits us to endow a citation with descriptive properties, such as those introduced in Table 1.

Figure 1. Two different ways of describing citations: as a relation between two bibliographic entities (top), or as an individual first-class data entity in its own right where the citing entity and the cited entity are among its attributed data.


The advantages of treating citations as first-class data entities are:

  • all the information regarding each citation is available in one place, since such information is defined as attributes of the citation itself;
  • citations become easier to describe, distinguish, count and process, and it becomes possible to distinguish separate citations within the citing entity to the cited entity, enabling one to count how many times, from which sections of the citing entity, and (in principle) for what purposes a particular cited entity is cited within the source paper;
  • if available in aggregate, citations described in this manner are easier to analyse using bibliometric methods, for example to determine how citation time spans vary by discipline.

We have extended the OpenCitations Data Model (OCDM, http://opencitations.net/model) [23] so as to define each citation as a first-class entity in a machine-readable manner. In particular, we have used the class cito:Citation defined in the revised and expanded Citation Typing Ontology (CiTO, http://purl.org/spar/cito) [7], which is part of the SPAR Ontologies [22]. This class allows us to define a permanent conceptual directional link from the citing bibliographic entity to a cited bibliographic entity, which can be accompanied by additional ontological terms for defining specific attributes, as introduced in Table 1.

Characteristic | Description | CiTO entity
citing entity | The bibliographic entity which acts as source for the citation. | Object property cito:hasCitingEntity.
cited entity | The bibliographic entity which acts as target for the citation. | Object property cito:hasCitedEntity.
citation creation date | The date on which the citation was created. This has the same numerical value as the publication date of the citing bibliographic resource, but is a property of the citation itself. When combined with the citation time span, it permits that citation to be located in history. | Data property cito:hasCitationCreationDate, one of xsd:date, xsd:gYearMonth, or xsd:gYear as datatype value.
citation timespan | The temporal characteristic of a citation, namely the interval between the publication date of the cited entity and the publication date of the citing entity. | Data property cito:hasCitationTimeSpan, xsd:duration as datatype value.
type | A classification of the citation according to particular dimensions, e.g. whether or not it is a self-citation. | Property rdf:type associated with one or more subclasses of cito:Citation, for example cito:AuthorSelfCitation (i.e. the citing and the cited entities have at least one author in common) and cito:JournalSelfCitation (i.e. the citing and the cited entities are published in the same journal).

Table 1. List of characteristics that can be associated with a citation when it is described as a first-class data entity, using the properties and classes available in CiTO for their definition in RDF.

So as to identify each citation precisely, when it is described as a first-class data entity and included in an open dataset, we have also developed the Open Citation Identifier (OCI) [24], which is a new globally unique persistent identifier for citations. OCIs are registered in the Identifiers.org platform (https://identifiers.org/oci) and recognized as persistent identifiers for citations by the EU FREYA Project (https://www.project-freya.eu) [25]. Each OCI has a simple structure: the lower-case letters oci followed by a colon, followed by two sequences of numerals separated by a dash, where the first sequence is the identifier for the citing bibliographic resource and the second sequence is the identifier for the cited bibliographic resource. For example, oci:0301-03018 is a valid OCI for a citation defined within the OpenCitations Corpus, while oci:02001010806360107050663080702026306630509-02001010806360107050663080702026305630301 is a valid OCI for a citation included in Crossref. It is worth mentioning that OCIs are not opaque identifiers, since they explicitly encode the directional relationship between the identified citing and cited entities, the provenance of the citation, i.e. the database that contains it, and the type of identifiers used in that database to identify the citing and cited entities. In addition, we have created the Open Citation Identifier Resolution Service (http://opencitations.net/oci), which is a resolution service for OCIs based on the Python application oci.py available at https://github.com/opencitations/oci. Given a valid OCI as input, this resolution service is able to retrieve citation data in RDF (as RDF/XML, Turtle or JSON-LD), or in Scholix, JSON or CSV formats. A more detailed explanation of OCIs and related material is available in [24].
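To make the structure of an OCI concrete, the following minimal Python sketch splits an OCI into its two numeral sequences and decodes a Crossref-based sequence back into a DOI. It is an illustration, not the oci.py implementation: the function names are hypothetical, and the two-digit code table is only a partial one inferred from the Crossref example above (digits, "/" and "-"), with the full table being lookup.csv in the oci repository. It also assumes that the leading "10." of the DOI is implied by the Crossref prefix 020.

```python
import re

# Partial two-digit code table, inferred from the example OCI in the text.
# The authoritative table is lookup.csv in the opencitations/oci repository.
PARTIAL_LOOKUP = {f"{digit:02d}": str(digit) for digit in range(10)}
PARTIAL_LOOKUP.update({"36": "/", "63": "-"})

CROSSREF_PREFIX = "020"  # supplier prefix indicating Crossref


def split_oci(oci):
    """Return the (citing, cited) numeral sequences of an OCI."""
    match = re.fullmatch(r"oci:([0-9]+)-([0-9]+)", oci)
    if match is None:
        raise ValueError(f"not a well-formed OCI: {oci!r}")
    return match.group(1), match.group(2)


def decode_crossref_doi(sequence):
    """Decode one Crossref-prefixed numeral sequence back into a DOI."""
    if not sequence.startswith(CROSSREF_PREFIX):
        raise ValueError("sequence does not carry the Crossref prefix 020")
    body = sequence[len(CROSSREF_PREFIX):]
    if len(body) % 2:
        raise ValueError("encoded body must have an even number of digits")
    pairs = [body[i:i + 2] for i in range(0, len(body), 2)]
    # Assumption: the "10." that starts every DOI is implied by the prefix.
    # Codes missing from the partial table raise a KeyError.
    return "10." + "".join(PARTIAL_LOOKUP[pair] for pair in pairs)


example = ("oci:02001010806360107050663080702026306630509"
           "-02001010806360107050663080702026305630301")
citing, cited = split_oci(example)
print(decode_crossref_doi(citing), "cites", decode_crossref_doi(cited))
```

Applied to the Crossref example above, the two sequences decode to the DOIs 10.1186/1756-8722-6-59 and 10.1186/1756-8722-5-31, which illustrates why OCIs are not opaque.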

At OpenCitations, we define an open citation index as a dataset containing citations that complies with the following requirements:

  • the citations contained are all open, according to the definition provided in [21];
  • the citations are all treated as first-class data entities;
  • each citation is identified by an Open Citation Identifier (OCI) [24];
  • the citation data are recorded in RDF according to the OpenCitations Data Model (OCDM) [23], where the OCI of a citation is embedded in the IRI defining it in RDF;
  • each citation defines the attributes shown in Table 1.

COCI: ingestion workflow, data, and services

COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, is the first citation index to be published by OpenCitations, in which we have applied the concept of citations as first-class data entities, introduced in the previous section, to index the contents of one of the major open databases of scholarly citation information, namely Crossref (https://crossref.org), and to render and make available this information in machine-readable RDF under a CC0 waiver. Crossref contains metadata about publications (mainly academic journal articles) that are identified using Digital Object Identifiers (DOIs). Out of more than 100 million publications recorded in Crossref, Crossref also stores the reference lists of more than 43 million publications deposited by the publishers. Many of these references are to other publications bearing DOIs that are also described in Crossref, while others are to publications that lack DOIs and do not have Crossref descriptions. Crossref organises such publications with associated reference lists according to three categories: closed, limited and open. These categories refer to publications for which the reference lists are not visible to anyone outside the Crossref Cited-by membership, are visible only to them and to Crossref Metadata Plus members, or are visible to all users, respectively.

Figure 2. The diagram of the data model adopted to define the new class for defining citations as first-class data entities, which forms part of the OpenCitations Data Model. This model uses terms from the Citation Typing Ontology (CiTO, http://purl.org/spar/cito) for describing the data, and from the Provenance Ontology (PROV-O, http://www.w3.org/ns/prov) to define the citation’s provenance.

Following the first release of COCI on June 4, 2018, the most recent version of COCI, released on November 12, 2018, contains more than 445 million DOI-to-DOI citations included in the open and the limited datasets of Crossref reference data. All the citation data in COCI and their provenance information, described according to the Graffoo diagram [27] presented in Figure 2, are included in two distinct graphs – https://w3id.org/oc/index/coci/ and https://w3id.org/oc/index/coci/prov/ respectively – released under a CC0 waiver, and compliant with the FAIR data principles [26].

An example of a citation included in COCI is shown in the following excerpt (in Turtle), where the OCI is embedded as part of the IRI of the citation (without the oci: prefix) after the ci/ (meaning citation according to the OpenCitations Data Model [23]):

@prefix cito: <http://purl.org/spar/cito/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://w3id.org/oc/index/coci/ci/02001010806360107050663080702026306630509-02001010806360107050663080702026305630301>
  a 
    cito:Citation,
    cito:JournalSelfCitation ;
  cito:hasCitationCreationDate "2013"^^xsd:gYear ;
  cito:hasCitationTimeSpan "P1Y"^^xsd:duration ;
  cito:hasCitingEntity <http://dx.doi.org/10.1186/1756-8722-6-59> ;
  cito:hasCitedEntity <http://dx.doi.org/10.1186/1756-8722-5-31> ;
  prov:generatedAtTime "2018-11-01T05:47:54+00:00"^^xsd:dateTime ;
  prov:hadPrimarySource <https://api.crossref.org/works/10.1186/1756-8722-6-59> ;
  prov:wasAttributedTo <https://w3id.org/oc/index/coci/prov/pa/1> .

In the following subsections we introduce the ingestion workflow developed for creating COCI, we provide some figures on the citations it contains, and we list the resources and services we have made available to permit access to and querying of the dataset.

Ingestion workflow

We processed all the data included in the October 2018 JSON dump of Crossref data, available to all the Crossref Metadata Plus members. The ingestion workflow, summarised in Figure 3, was organised in four distinct phases, and all the related scripts developed and used are released as open source code according to the ISC License and downloadable from the official GitHub repository of COCI at https://github.com/opencitations/coci.

Figure 3. A flowchart describing the workflow to build COCI. It is divided into four phases: (1) global data generation, (2) CSV generation, (3) conversion into RDF, and (4) updating the triplestore.

Phase 1: global data generation. We parse and process the entire Crossref bibliographic database to extract all the publications having a DOI and their available list of references. Through this process three datasets are generated, which are used in the next phase:

  • Dates: the publication dates of all the bibliographic entities in Crossref and of all their references if they explicitly specify a DOI and a publication date as structured data – e.g. see the fields DOI and year in the array reference in https://api.crossref.org/works/10.1007/978-3-030-00668-6_8. Where the same DOI is encountered multiple times, e.g. as a proper item indexed in Crossref and also as a reference in the reference list of another article deposited in Crossref, we use the full publication date defined in the indexed item.
  • ISSN: the ISSN (if any) and publication type (journal-article, book-chapter, etc.) of each bibliographic entity identified by a DOI indexed in Crossref.
  • ORCID: the ORCIDs (if any) associated with the authors of each bibliographic entity identified by a DOI indexed in Crossref.
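As an illustration of how the Dates dataset could be assembled, the sketch below walks over Crossref-like JSON records and applies the precedence rule described above: a date found on an indexed item wins over a year found only in a reference list. It is not the code used for COCI. The function names and the sample records (with invented DOIs) are hypothetical, and reading the item date from the Crossref field issued is an assumption, since the text does not say which date field was used.

```python
def format_date(parts):
    """Turn Crossref date-parts such as [2013, 8, 20] into '2013-08-20'."""
    year, rest = parts[0], parts[1:]
    return "-".join([f"{year:04d}"] + [f"{part:02d}" for part in rest])


def build_dates(works):
    """Map each DOI to the best publication date found in the records."""
    indexed = {}     # dates of items indexed in their own right
    referenced = {}  # years seen only inside reference lists

    for work in works:
        # Assumption: the item date is read from the 'issued' field.
        parts = (work.get("issued", {}).get("date-parts") or [[]])[0]
        if work.get("DOI") and parts and parts[0] is not None:
            indexed[work["DOI"].lower()] = format_date(parts)

        for reference in work.get("reference", []):
            doi = reference.get("DOI")
            year = str(reference.get("year", ""))[:4]
            if doi and len(year) == 4 and year.isdigit():
                referenced.setdefault(doi.lower(), year)

    # Dates from indexed items override years taken from reference lists.
    return {**referenced, **indexed}


# Invented records for illustration only.
sample_works = [
    {
        "DOI": "10.5555/citing",
        "issued": {"date-parts": [[2013, 8, 20]]},
        "reference": [
            {"DOI": "10.5555/cited", "year": "2012"},
            {"DOI": "10.5555/ref-only", "year": "2001"},
            {"unstructured": "A reference without a DOI."},
        ],
    },
    {"DOI": "10.5555/cited", "issued": {"date-parts": [[2012, 6, 18]]}},
]

dates = build_dates(sample_works)
print(dates)
```

In this invented sample, 10.5555/cited ends up with its full indexed date rather than the bare year found in the reference list, while 10.5555/ref-only keeps the year because no indexed item exists for it.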

Phase 2: CSV generation. We generate a CSV file such that each row represents a particular citation between a citing entity and a cited entity according to the data available in the Crossref dump, by looking at the DOI identifying the citing entity and all the DOIs specified in the reference list of such a citing entity according to the Crossref data. In particular, we execute the following four steps for each citation identified:

  1. We generate the OCI for the citation by encoding the DOIs of the citing and cited entities into numerical sequences using the lookup table available at https://github.com/opencitations/oci/blob/master/lookup.csv, which are prefixed by the supplier prefix 020 to indicate Crossref as the source of the citation.
  2. We retrieve the publication date of the citing entity from the Dates dataset and assign it as citation creation date.
  3. We retrieve the publication date of the cited entity (from the Dates dataset) and we use it, together with the publication date of the citing entity retrieved in the previous step, to calculate the citation timespan.
  4. We use the data contained in the ISSN and ORCID datasets to establish whether the citing and cited entity have been published in the same journal and/or have at least one author in common, and in these cases we assign the appropriate self-citation type(s) to the citation.
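Steps 2 to 4 can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the COCI implementation: the function names are hypothetical, the timespan is computed only at year or year-month precision (days are ignored for brevity), and a citation whose cited entity is dated later than the citing one yields a negative xsd:duration.

```python
def parse_partial_date(value):
    """Split 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' into a tuple of integers."""
    return tuple(int(part) for part in value.split("-"))


def citation_timespan(citing_date, cited_date):
    """Return the interval between two dates as an xsd:duration string.

    Simplification: the interval is computed at the coarsest precision
    shared by the two dates, capped at year-month.
    """
    citing = parse_partial_date(citing_date)
    cited = parse_partial_date(cited_date)
    precision = min(len(citing), len(cited), 2)

    months = (citing[0] - cited[0]) * 12
    if precision == 2:
        months += citing[1] - cited[1]

    sign = "-" if months < 0 else ""
    years, remainder = divmod(abs(months), 12)
    if precision == 1:
        return f"{sign}P{years}Y"
    return f"{sign}P{years}Y{remainder}M"


def self_citation_types(citing, cited, issn_by_doi, orcid_by_doi):
    """Return the CiTO self-citation classes that apply to a citation."""
    types = []
    if set(issn_by_doi.get(citing, ())) & set(issn_by_doi.get(cited, ())):
        types.append("cito:JournalSelfCitation")
    if set(orcid_by_doi.get(citing, ())) & set(orcid_by_doi.get(cited, ())):
        types.append("cito:AuthorSelfCitation")
    return types


# Invented lookup data for illustration only.
issn_by_doi = {"10.5555/a": ["1234-5678"], "10.5555/b": ["1234-5678"]}
orcid_by_doi = {"10.5555/a": ["0000-0000-0000-0001"]}

print(citation_timespan("2013", "2012"))
print(self_citation_types("10.5555/a", "10.5555/b", issn_by_doi, orcid_by_doi))
```

Because the ORCID lookup is empty for most DOIs, the author self-citation test rarely fires, which is the same sparsity effect that applies to the real data.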

Simultaneously with the creation of the CSV file of citation data, we generate a second CSV file containing the provenance information for each citation (identified by its OCI generated in the aforementioned Step 1). These provenance data include the agent responsible for the generation of the citation, the Crossref API call that refers to the data of the citing bibliographic entity containing the reference used to create the citation, and the creation date of the citation.

Phase 3: converting into RDF. The CSV files generated in the previous phase are then converted into RDF according to the N-Triples format, following the OWL model introduced in Figure 2, where the DOIs of the citing and cited entities become DOI URLs starting with http://dx.doi.org/, while the IRI of the citation includes its OCI (without the oci: prefix), as illustrated in the example given in the previous section.
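The conversion of one CSV row into N-Triples can be sketched as follows. The column names (oci, citing, cited, creation, timespan, journal_sc, author_sc) and the "yes" flag values are assumptions made for this illustration, as is the function name; the sample row reuses the values of the Turtle example given earlier. Provenance triples and the escaping of unusual characters in DOIs are omitted.

```python
CITO = "http://purl.org/spar/cito/"
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
CITATION_BASE = "https://w3id.org/oc/index/coci/ci/"
DOI_BASE = "http://dx.doi.org/"

# The length of the date string selects the XSD datatype.
DATE_TYPES = {4: "gYear", 7: "gYearMonth", 10: "date"}


def iri(value):
    return f"<{value}>"


def literal(value, datatype):
    return f'"{value}"^^<{XSD}{datatype}>'


def row_to_ntriples(row):
    """Convert one citation row (a dict) into a list of N-Triples lines."""
    oci = row["oci"].split(":", 1)[-1]  # drop the 'oci:' prefix if present
    subject = iri(CITATION_BASE + oci)

    classes = ["Citation"]
    if row.get("journal_sc") == "yes":
        classes.append("JournalSelfCitation")
    if row.get("author_sc") == "yes":
        classes.append("AuthorSelfCitation")

    triples = [(RDF_TYPE, iri(CITO + name)) for name in classes]
    triples += [
        (CITO + "hasCitationCreationDate",
         literal(row["creation"], DATE_TYPES[len(row["creation"])])),
        (CITO + "hasCitationTimeSpan", literal(row["timespan"], "duration")),
        (CITO + "hasCitingEntity", iri(DOI_BASE + row["citing"])),
        (CITO + "hasCitedEntity", iri(DOI_BASE + row["cited"])),
    ]
    return [f"{subject} <{predicate}> {obj} ." for predicate, obj in triples]


sample_row = {
    "oci": ("02001010806360107050663080702026306630509"
            "-02001010806360107050663080702026305630301"),
    "citing": "10.1186/1756-8722-6-59",
    "cited": "10.1186/1756-8722-5-31",
    "creation": "2013",
    "timespan": "P1Y",
    "journal_sc": "yes",
    "author_sc": "no",
}

lines = row_to_ntriples(sample_row)
print("\n".join(lines))
```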

Phase 4: updating the triplestore. The final RDF files generated in Phase 3 are used to update the triplestore used for the OpenCitations Indexes.

Data

COCI was first created and released on July 4, 2018, and most recently updated on November 12, 2018. Currently, it contains 445,826,118 citations between 46,534,705 bibliographic entities. These are stored by means of 2,259,134,894 RDF statements (around 5 RDF statements per citation) for describing the citation data, and 1,337,478,354 RDF statements (3 statements per citation) for describing the related provenance information. Of the citations stored, 29,755,045 (6.7%) are journal self-citations, while 250,991 (0.06%) are author self-citations. The number of identified author self-citations, based on author ORCIDs, is a significant underestimate of the true number, mainly due to the sparsity of the data concerning the ORCID author identifiers within the Crossref dump. Journal entities (i.e. journals, volumes, issues, and articles) are the most cited type of bibliographic entity, with over 420 million citations.
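The ratios quoted in this paragraph can be rechecked directly from the raw counts; the short Python snippet below does only that arithmetic, with variable names chosen for this illustration.

```python
citations = 445_826_118
data_statements = 2_259_134_894
provenance_statements = 1_337_478_354
journal_self_citations = 29_755_045
author_self_citations = 250_991

# RDF statements per citation, for citation data and for provenance.
data_per_citation = round(data_statements / citations, 2)
provenance_per_citation = provenance_statements / citations

# Shares of self-citations, as percentages of all citations.
journal_share = round(100 * journal_self_citations / citations, 1)
author_share = round(100 * author_self_citations / citations, 2)

print(data_per_citation, provenance_per_citation, journal_share, author_share)
```

The provenance count is exactly three times the number of citations, whereas the citation data average slightly more than five statements per citation because self-citations carry additional rdf:type statements.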

We also classify the cited documents according to their publishers – Table 2 shows the top ten publishers of citing and cited documents, calculated by looking at the DOI prefixes of the entities involved in each citation. As we can see, Elsevier is by far the publisher with the largest share of cited documents. It is also the largest publisher that is not participating in the Initiative for Open Citations by making its publications’ reference lists open at Crossref – which is highlighted by the very limited number of outgoing citations recorded in COCI. Its present refusal to open its article reference lists in Crossref, contrary to the practice of most of the major scholarly publishers, is contributing significantly to the invisibility of Elsevier’s own publications within the corpora of open citation data such as COCI that are increasingly being used by the scholarly community for discovery, citation network visualization and bibliometric analysis, as we describe below in Section 5.

Publisher | Outgoing citations | Incoming citations
Springer Nature | 79,860,827 | 52,257,862
Wiley | 76,819,685 | 48,174,542
Elsevier | 2,853,739 | 96,310,027
Informa UK Limited | 41,433,917 | 14,975,989
Institute of Electrical and Electronics Engineers (IEEE) | 30,114,985 | 20,940,703
American Physical Society (APS) | 15,729,297 | 16,065,862
SAGE Publications | 15,933,805 | 7,915,082
Ovid Technologies (Wolters Kluwer Health) | 9,971,274 | 12,840,293
Oxford University Press (OUP) | 9,891,000 | 11,466,659
AIP Publishing | 10,130,022 | 8,455,097

Table 2. A classification of the COCI citations according to the publishers of the cited (incoming citations) and citing (outgoing citations) documents. The table shows the top ten publishers by the overall number of incoming and outgoing citations to/from their published works. Those publishers shown in italics are not participating in the Initiative for Open Citations by making their publications’ reference lists open at Crossref – see https://i4oc.org for additional information.
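A classification of this kind can be sketched by counting citations per DOI prefix, as below. This is a hypothetical illustration using invented DOIs: the function names are not taken from the COCI code, and the mapping from a DOI prefix to a publisher name, which requires an external lookup, is not shown.

```python
from collections import Counter


def doi_prefix(doi):
    """Return the registrant prefix of a DOI, e.g. '10.5555'."""
    return doi.split("/", 1)[0]


def count_by_prefix(citations):
    """Count outgoing and incoming citations per DOI prefix.

    `citations` is an iterable of (citing DOI, cited DOI) pairs.
    """
    outgoing, incoming = Counter(), Counter()
    for citing, cited in citations:
        outgoing[doi_prefix(citing)] += 1
        incoming[doi_prefix(cited)] += 1
    return outgoing, incoming


# Invented citation pairs for illustration only.
sample_citations = [
    ("10.5555/a1", "10.7777/b1"),
    ("10.5555/a2", "10.7777/b1"),
    ("10.5555/a3", "10.5555/a1"),
    ("10.7777/b2", "10.5555/a1"),
]

outgoing, incoming = count_by_prefix(sample_citations)
print(outgoing.most_common(10))
print(incoming.most_common(10))
```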

Resources and services

The citation data in COCI can be accessed in a variety of convenient ways, listed as follows.

Open Citation Index SPARQL endpoint. We have made available a SPARQL endpoint for all the indexes released by OpenCitations, including COCI, at https://w3id.org/oc/index/sparql. When accessed with a browser, it shows a SPARQL endpoint editor GUI generated with YASGUI [8]. The endpoint can also be queried programmatically over HTTP, e.g. via curl. To access COCI data, the graph https://w3id.org/oc/index/coci/ must be specified in the SPARQL query.
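As a minimal sketch of such a query, the Python code below builds a SPARQL query restricted to the COCI graph and wraps it in an HTTP GET request. The helper names are hypothetical, and passing the query in a query parameter with a JSON results Accept header follows the general SPARQL 1.1 Protocol convention, which is assumed here to be supported by this endpoint. Nothing is sent over the network until the request is passed to urllib.request.urlopen.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://w3id.org/oc/index/sparql"
COCI_GRAPH = "https://w3id.org/oc/index/coci/"


def citations_query(cited_doi, limit=10):
    """Build a query listing citations that point to the given DOI."""
    return f"""
PREFIX cito: <http://purl.org/spar/cito/>
SELECT ?citation ?citing WHERE {{
  GRAPH <{COCI_GRAPH}> {{
    ?citation a cito:Citation ;
      cito:hasCitedEntity <http://dx.doi.org/{cited_doi}> ;
      cito:hasCitingEntity ?citing .
  }}
}}
LIMIT {int(limit)}
""".strip()


def build_request(query):
    """Wrap a SPARQL query in a GET request asking for JSON results."""
    url = f"{ENDPOINT}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


query = citations_query("10.1186/1756-8722-5-31", limit=5)
request = build_request(query)
print(query)
```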

COCI REST API. Citation data in COCI can be retrieved by using the COCI REST API, available at https://w3id.org/oc/index/coci/api/v1. The rationale for making a REST API available in addition to the SPARQL endpoint was to provide convenient access to the citation data included in COCI for Web developers and users who are not necessarily experts in Semantic Web technologies. This REST API, like all the other REST APIs made available by OpenCitations, has been implemented by means of RAMOSE, the Restful API Manager Over SPARQL Endpoints (https://github.com/opencitations/ramose), which is a Python application that allows one to create a REST API over any SPARQL endpoint by means of a simple configuration file that executes a SPARQL query depending on the particular API call specified. The configuration file for the COCI API is available at https://github.com/opencitations/api/blob/master/coci_v1.hf. Currently, the COCI REST API makes available four operations, which retrieve either (a) the citation data for all the references of a given DOI (operation: references), or (b) the citation data for all the citations received by a given DOI (operation: citations), or (c) the citation data for the citation identified by an OCI (operation: citation), or (d) the metadata for the articles identified by the specified DOIs (operation: metadata). It is worth mentioning that the latter operation strictly depends on live API calls to external services, namely the Crossref API (https://api.crossref.org), the DataCite API (https://api.datacite.org), and the Unpaywall API (http://api.unpaywall.org), to gather the metadata of the requested articles, such as the title, the authors, and the journal name, that are not explicitly included within the OpenCitations Index triplestore.
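The four operations can be illustrated with a small URL builder. The sketch below assumes that each call has the form base URL, operation name, then identifier as path segments, and that multiple DOIs for the metadata operation are joined with a double underscore; both are assumptions to be checked against the API documentation, and the function name is hypothetical. The code only builds URLs and performs no network access.

```python
from urllib.parse import quote

API_BASE = "https://w3id.org/oc/index/coci/api/v1"
OPERATIONS = {"references", "citations", "citation", "metadata"}


def api_url(operation, *identifiers):
    """Build the URL of a COCI REST API call (path layout is assumed)."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    if not identifiers:
        raise ValueError("at least one identifier is required")
    if operation != "metadata" and len(identifiers) != 1:
        raise ValueError(f"{operation!r} takes exactly one identifier")
    # Assumption: several DOIs are separated by a double underscore.
    joined = "__".join(identifiers)
    return f"{API_BASE}/{operation}/{quote(joined, safe='/')}"


print(api_url("references", "10.1186/1756-8722-6-59"))
print(api_url("citations", "10.1186/1756-8722-5-31"))
print(api_url("metadata", "10.1186/1756-8722-6-59", "10.1186/1756-8722-5-31"))
```

A resulting URL can then be fetched with any HTTP client, for example urllib.request.urlopen.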

Searching and browsing interfaces. We have additionally developed a user-friendly text search interface (https://w3id.org/oc/index/search), and a browsing interface (e.g. https://w3id.org/oc/index/browser/coci/ci/02001010806360107050663080702026306630509-02001010806360107050663080702026305630301), that can be used to search citation data in all the OpenCitations Indexes, including COCI, and to visualise and browse them, respectively. These two interfaces have been developed by means of OSCAR, the OpenCitations RDF Search Application (https://github.com/opencitations/oscar) [9], and LUCINDA, the OpenCitations RDF Resource Browser (https://github.com/opencitations/lucinda), which provide a configurable layer over SPARQL endpoints that permits one to easily create Web interfaces for querying and visualising the results of SPARQL queries.

Data dumps. All the citation data and provenance information in COCI are available as dumps stored in Figshare (https://figshare.com) in both CSV and N-Triples formats, while a dump of the whole triplestore is available on The Internet Archive (https://archive.org). The links to these dumps are available on the download page of the OpenCitations website (http://opencitations.net/download#coci).

Direct HTTP access. All the citation data in COCI can be accessed directly by means of the HTTP IRIs of the stored resources (via content negotiation, e.g. https://w3id.org/oc/index/coci/ci/02001010806360107050663080702026306630509-02001010806360107050663080702026305630301).
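Content negotiation amounts to sending an Accept header with the request for the citation IRI. The following sketch prepares such a request in Python without sending it. The function name is hypothetical, and the assumption that the server honours these particular media types should be verified against the service itself.

```python
from urllib.request import Request

CITATION_IRI = (
    "https://w3id.org/oc/index/coci/ci/"
    "02001010806360107050663080702026306630509"
    "-02001010806360107050663080702026305630301"
)

# Standard media types for common RDF serialisations.
FORMATS = {
    "turtle": "text/turtle",
    "rdfxml": "application/rdf+xml",
    "jsonld": "application/ld+json",
}


def negotiated_request(iri, fmt="turtle"):
    """Prepare a GET request asking for one RDF serialisation of a resource."""
    return Request(iri, headers={"Accept": FORMATS[fmt]})


request = negotiated_request(CITATION_IRI, "turtle")
print(request.full_url, request.get_header("Accept"))
```

Passing the prepared request to urllib.request.urlopen would perform the actual retrieval.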

Quantifying the use of COCI citation data

We have monitored accesses to COCI data since its launch in July 2018. The statistics and graphics we show in this section highlight two different aspects: the quantification of the use of COCI data and related services, and the community uptake, i.e. the reuse of COCI data within cross-community projects and studies. All the data of the charts described in this section are freely available for download from Figshare [15].

Quantitative analysis

Figure 4 shows the number of accesses made between July 2018 and February 2019 (inclusive) to the various COCI services described above – the search/browse interfaces, the REST API, SPARQL queries, and others (e.g. direct HTTP access to particular citations and visits to COCI webpages in the OpenCitations website). We have excluded from these counts all accesses made by automated agents and bots. As shown, the REST API is, by far, the most used service, with extensive usage recorded in the last four months, following the announcement of the second release of COCI. This is reasonable, considering that the REST API was developed precisely to accommodate the needs of generic Web users and developers, including (and in particular) those who are not expert in Semantic Web technologies. There is just one exception, in November 2018, when the SPARQL endpoint was used to retrieve quite a large amount of citation data. After further investigation, we noticed that a large proportion of the SPARQL calls came from a single source (according to the IP data stored in our log), which probably collected citation data for a specific set of entities.

Figure 4. The number of accesses to COCI-related services from July 2018 to February 2019. The scale used in the y-axis is logarithmic.

Figure 5 shows a particular cut of the figures given in Figure 4, which focuses on the REST API accesses only. In particular, we analysed which operations of the API were used the most. According to these figures, the most used operation is metadata (which was first introduced in the API in August 2018) which allows one to retrieve all the metadata describing certain publications. In contrast to the other API operations, this metadata search accepts one or more DOIs as input. The least used operation was citation, which allows one to retrieve citation data given an OCI, which should not be surprising, considering the currently limited knowledge of this new identifier system for citations.

Figure 5. The number of accesses made to each COCI REST API operation since the release of COCI in July 2018, classified into four categories (requested resource): references, citations, citation, and metadata, as defined in the text. Note again the logarithmic scale of the y-axis.

In addition, we have retrieved data about the views and downloads (as of March 29, 2019) of all the dumps uploaded to Figshare and to the Internet Archive. The CSV data dump received 1,321 views and 454 downloads, followed by the N-Triples data dump with 316 views and 93 downloads. The CSV provenance information dump had 166 views and 127 downloads, while the N-Triples provenance information dump had 95 views and 34 downloads. Finally, the least accessed dump was that of the entire triplestore available in the Internet Archive, uploaded for the first time in November 2018, which had only 3 views.

Community uptake

The data in COCI have already been used in various projects and initiatives. In this section, we list all the tools and studies of which we are aware that use them.

VOSviewer (http://www.vosviewer.com) [11] is a software tool, developed at Leiden University’s Centre for Science and Technology Studies (CWTS), for constructing and visualizing bibliometric networks, which may include journals, researchers, or individual publications, and may be constructed based on citation, bibliographic coupling, co-citation, and co-authorship relations. Starting from version 1.6.10 (released on January 10, 2019), VOSviewer can directly use citation data stored in COCI, retrieved by means of the COCI REST API.

Citation Gecko (http://citationgecko.com) is a novel literature mapping tool that allows one to map a research citation network using some initial seed articles. Citation Gecko is able to leverage citation links between seed papers and other papers to highlight papers of possible interest to the user, for which it uses COCI data (accessed via the REST API) to generate the citation network.
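The general idea of using citation links to and from seed papers to surface other papers of interest can be illustrated with a small sketch. It is not Citation Gecko's actual algorithm: the scoring rule (count the citation links each non-seed paper shares with the seed set), the function name, and the single-letter paper identifiers are all assumptions made for illustration.

```python
from collections import Counter


def rank_candidates(citations, seeds):
    """Rank non-seed papers by how many citation links they share with seeds.

    citations: iterable of (citing, cited) identifier pairs.
    seeds: identifiers of the initial seed papers.
    Returns a list of (paper, score) sorted by descending score, then by name.
    """
    seeds = set(seeds)
    scores = Counter()
    for citing, cited in citations:
        if citing in seeds and cited not in seeds:
            scores[cited] += 1   # a seed cites this paper
        elif cited in seeds and citing not in seeds:
            scores[citing] += 1  # this paper cites a seed
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up citation links between papers identified by single letters.
links = [("A", "X"), ("B", "X"), ("Y", "A"), ("A", "B"), ("X", "Z")]
print(rank_candidates(links, {"A", "B"}))
```

Links between two seeds, or between two non-seed papers, do not contribute to any score in this sketch.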

OCI Graphe (https://dossier-ng.univ-st-etienne.fr/scd/www/oci/OCI_graphe_accueil.html) is a Web tool that allows one to search for articles by means of the COCI REST API; the retrieved articles are then visualised in a graph showing the citations to them. It enriches this visualisation with additional information about the publication venues, publication dates, and other related metadata.

Zotero [12] is a free, easy-to-use tool to help users collect, organize, cite, and share research. Recently, the Open Citations Plugin for Zotero (https://github.com/zuphilip/zotero-open-citations) has been released, which allows users to retrieve open citation data extracted from COCI (via its REST API) for one or more articles included in a Zotero library.

COCI data, downloaded from the CSV dump available on Figshare, have also been used in at least two bibliometric studies. In particular, during the LIS Bibliometrics 2019 Event, Stephen Pearson presented a study (https://blog.research-plus.library.manchester.ac.uk/2019/03/04/using-open-citation-data-to-identify-new-research-opportunities/) run on publications by scholars at the University of Manchester, which used COCI to retrieve citations between these publications so as to investigate potential cross-discipline and cross-department collaborations. Similarly, COCI data were used to conduct an experiment on the latest Italian Scientific Habilitation [13] (the national exercise that evaluates whether a scholar is appropriate to receive an Associate/Full Professorship position in an Italian university), which aimed to replicate part of the outcomes of this evaluation exercise for the Computer Science research field by using only open scholarly data, including the citations available in COCI, rather than citation data from subscription services.
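A study of the first kind needs the citations whose citing and cited entities both belong to a given set of publications. The sketch below shows one way such a filter over a CSV dump might look. It does not reproduce the method of either study; the column names citing and cited are assumed to match the dump's header, and the DOIs and OCI values in the sample are made up.

```python
import csv
import io


def internal_citations(csv_file, dois):
    """Return (citing, cited) pairs where both DOIs belong to `dois`.

    csv_file: an open text file (or file-like object) with a header row
    that includes the assumed columns "citing" and "cited".
    """
    wanted = {doi.lower() for doi in dois}  # DOIs are case-insensitive
    reader = csv.DictReader(csv_file)
    return [
        (row["citing"], row["cited"])
        for row in reader
        if row["citing"].lower() in wanted and row["cited"].lower() in wanted
    ]


# Tiny in-memory stand-in for a dump file, with made-up identifiers.
sample_dump = io.StringIO(
    "oci,citing,cited\n"
    "1,10.1000/a,10.1000/b\n"
    "2,10.1000/a,10.1000/z\n"
    "3,10.1000/c,10.1000/a\n"
)
institution_dois = {"10.1000/a", "10.1000/b", "10.1000/c"}
print(internal_citations(sample_dump, institution_dois))
```

Because the reader streams rows one at a time, the same function could be pointed at a large dump file opened with the built-in open, although the result list itself is held in memory.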

Conclusions

In this paper, we have introduced COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. After an initial introduction of the notion of citations as first-class data entities, we have presented the ingestion workflow that has been implemented to create COCI, have detailed the data COCI contains, and have described the various services and resources that we have made available to access COCI data. Finally, we have presented some statistics about the use of COCI data, and have mentioned the tools and studies that have adopted COCI in recent months.

COCI is just the first open citations index that OpenCitations will make available. Using the experience we have gathered by creating it, we now plan the release of additional indexes, so as to extend the coverage of open citations available through the OpenCitations infrastructure. The first of these, recently released, is CROCI (https://w3id.org/oc/index/croci) [14], the Crowdsourced Open Citations Index, which contains citations deposited by individuals. CROCI is designed to permit scholars proactively to fill the open citations gap in COCI resulting from four causes: (a) the failure of many publishers using Crossref DOIs to deposit reference lists of their publications at Crossref; (b) the failure of some publishers that do deposit their reference lists to make these reference lists open, in accordance with the recommendations of the Initiative for Open Citations; (c) the absence from ~11% of Crossref reference metadata of the DOIs for cited articles which in fact have been assigned DOIs (https://www.crossref.org/blog/underreporting-of-matched-references-in-crossref-metadata/), a problem that Crossref are currently working hard to rectify; and (d) the existence of citations to published entities that lack Crossref DOIs. In the near future, we plan to extend the number of indexes by harvesting citations from other open datasets including Wikidata (https://www.wikidata.org), DataCite (https://datacite.org), and Dryad (https://datadryad.org). In addition, we plan to extend and generalise the current software developed for COCI, so as to facilitate more frequent updates of the indexes.

Acknowledgements

We gratefully acknowledge the financial support provided to us by the Alfred P. Sloan Foundation for the OpenCitations Enhancement Project (grant number G‐2017‐9800).

References

  1. Nuzzolese, A. G., Gentile, A. L., Presutti, V., Gangemi, A. (2016). Conference Linked Data: The ScholarlyData project. In Proceedings of the 15th International Semantic Web Conference (ISWC 2016): 150-158. DOI: https://doi.org/10.1007/978-3-319-46547-0_16
  2. Hammond, T., Pasin, M., & Theodoridis, E. (2017). Data integration and disintegration: Managing Springer Nature SciGraph with SHACL and OWL. In International Semantic Web Conference (Posters, Demos & Industry Tracks). http://ceur-ws.org/Vol-1963/paper493.pdf
  3. Alexiou, G., Vahdati, S., Lange, C., Papastefanatos, G., Lohmann, S. (2016). OpenAIRE LOD services: scholarly communication data as linked data. In Semantics, Analytics, Visualization. Enhancing Scholarly Data: 45-50. DOI: https://doi.org/10.1007/978-3-319-53637-8_6
  4. Peroni, S., Shotton, D., Vitali, F. (2017). One year of the OpenCitations Corpus – releasing RDF-based scholarly citation data into the public domain. In Proceedings of the 16th International Semantic Web Conference (ISWC 2017): 184-192. DOI: https://doi.org/10.1007/978-3-319-68204-4_19
  5. Garcia, A., Lopez, F., Garcia, L., Giraldo, O., Bucheli, V., Dumontier, M. (2018). Biotea: semantics for Pubmed Central. PeerJ, 6: e4201. DOI: https://doi.org/10.7717/peerj.4201
  6. Bagnacani, A., Ciancarini, P., Di Iorio, A., Nuzzolese, A. G., Peroni, S., Vitali, F. (2014). The Semantic Lancet Project: A Linked Open Dataset for Scholarly Publishing. In EKAW 2014 Satellite Events: 101-105. DOI: https://doi.org/10.1007/978-3-319-17966-7_10
  7. Peroni, S., Shotton, D. (2012). FaBiO and CiTO: ontologies for describing bibliographic resources and citations. Web Semantics, 17: 33-34. DOI: https://doi.org/10.1016/j.websem.2012.08.001
  8. Rietveld, L., Hoekstra, R. (2017). The YASGUI family of SPARQL clients. Semantic Web, 8(3): 373-383. DOI: https://doi.org/10.3233/SW-150197
  9. Heibi, I., Peroni, S., Shotton, D. (2018). OSCAR: A Customisable Tool for Free-Text Search over SPARQL Endpoints. In Semantics, Analytics, Visualization: 121-137. DOI: https://doi.org/10.1007/978-3-030-01379-0_9
  10. Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., Vrandečić, D. (2014). Introducing Wikidata to the linked data web. In Proceedings of the 13th International Semantic Web Conference (ISWC 2014): 50-65. DOI: https://doi.org/10.1007/978-3-319-11964-9_4
  11. van Eck, N., & Waltman, L. (2009). Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84(2), 523-538. DOI: https://doi.org/10.1007/s11192-009-0146-3
  12. Ahmed, K. M., Al Dhubaib, B. (2011). Zotero: A bibliographic assistant to researcher. Journal of Pharmacology and Pharmacotherapeutics, 2(4), 303. DOI: https://doi.org/10.4103/0976-500X.85940
  13. Di Iorio, A., Peroni, S., Poggi, F. (2019). Open data to evaluate academic researchers: an experiment with the Italian Scientific Habilitation. (To appear) Proceedings of the 17th International Conference on Scientometrics and Informetrics (ISSI 2019). https://arxiv.org/abs/1902.03287
  14. Heibi, I., Peroni, S., Shotton, D. (2019). Crowdsourcing open citations with CROCI – An analysis of the current status of open citations, and a proposal. (To appear) Proceedings of the 17th International Conference on Scientometrics and Informetrics (ISSI 2019). https://arxiv.org/abs/1902.02534
  15. Heibi, I., Peroni, S., Shotton, D. (2019). Usage statistics of COCI data. Figshare. DOI: https://doi.org/10.6084/m9.figshare.7873559
  16. Newton, I. (1675). Isaac Newton letter to Robert Hooke – Cambridge, 5 February 1675. https://digitallibrary.hsp.org/index.php/Detail/objects/9792 (last visited 23 March 2019)
  17. Schiermeier, Q. (2017). Initiative aims to break science’s citation paywall. Nature. DOI: https://doi.org/10.1038/nature.2017.21800
  18. Sugimoto, C. R., Waltman, L., Larivière, V., van Eck, N. J, Boyack, K. W., Wouters, P., de Rijcke, S. (2017). Open citations: A letter from the scientometric community to scholarly publishers. ISSI Society. http://issi-society.org/open-citations-letter (last visited 23 March 2019)
  19. Chawla, D. S. (2017). Now free: citation data from 14 million papers, and more might come. Science. https://www.sciencemag.org/news/2017/04/now-free-citation-data-14-million-papers-and-more-might-come (last visited 23 March 2019)
  20. Molteni, M. (2017). Tearing Down Science’s Citation Paywall, One Link at a Time. Wired. https://www.wired.com/2017/04/tearing-sciences-citation-paywall-one-link-time/ (last visited 23 March 2019)
  21. Peroni, S., Shotton, D. (2018). Open Citation: Definition. Figshare. DOI: https://doi.org/10.6084/m9.figshare.6683855
  22. Peroni, S., Shotton, D. (2018). The SPAR Ontologies. In Proceedings of the 17th International Semantic Web Conference (ISWC 2018): 119-136. DOI: https://doi.org/10.1007/978-3-030-00668-6_8
  23. Peroni, S., Shotton, D. (2018). The OpenCitations Data Model. Figshare. DOI: https://doi.org/10.6084/m9.figshare.3443876
  24. Peroni, S., Shotton, D. (2019). Open Citation Identifier: Definition. Figshare. DOI: https://doi.org/10.6084/m9.figshare.7127816
  25. Ferguson, C., McEntyre, J., Bunakov, V., Lambert, S., van der Sandt, S., Kotarski, R., … McCafferty, S. (2018). Survey of Current PID Services Landscape (Deliverable No. D3.1). Retrieved from FREYA project (EC Grant Agreement No 777523) website: https://www.project-freya.eu/en/deliverables/freya_d3-1.pdf
  26. Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. DOI: https://doi.org/10.1038/sdata.2016.18
  27. Falco, R., Gangemi, A., Peroni, S., Shotton, D., Vitali, F. (2014). Modelling OWL Ontologies with Graffoo. In The Semantic Web: ESWC 2014 Satellite Events: 320–325. DOI: https://doi.org/10.1007/978-3-319-11955-7_42

Footnotes

1. An in-depth description about the definition and use of citations as first-class data entities can be found at https://opencitations.wordpress.com/2018/02/19/citations-as-first-class-data-entities-introduction/.
2. Additional information on this classification of Crossref reference lists is available at https://www.crossref.org/reference-distribution/.
3. We have access to the limited dataset since we are members of the Crossref Metadata Plus plan.
4. We are aware that the current practice for DOI URLs is to use the base https://doi.org/ instead of http://dx.doi.org/. However, when one tries to resolve a DOI URL owned by Crossref by specifying an RDF format (e.g. Turtle) in the accept header of the request, the bibliographic entity is actually defined using the old URL structure starting with http://dx.doi.org/. For this reason, since COCI is derived entirely from Crossref data, we decided to stay with the approach currently used by Crossref.