OpenCitations selected for SCOSS second funding cycle

The Global Sustainability Coalition for Open Science Services (SCOSS) is launching its second funding cycle, and OpenCitations is one of three open science infrastructure organizations whose services have been evaluated and selected for presentation to the international scholarly community for crowd-sourced sustainability funding, along with the Public Knowledge Project (PKP) and the Directory of Open Access Books (DOAB).

OpenCitations is an innovative infrastructure organization for open scholarship dedicated to the publication of open bibliographic and citation data concerning academic publications as Linked Open Data using Semantic Web technologies, thereby providing a disruptive alternative to traditional proprietary citation indexes. It also undertakes related advocacy work, particularly as a founding member of the Initiative for Open Citations (I4OC).

OpenCitations developed the OpenCitations Corpus (OCC), a database of open downloadable bibliographic and citation data recorded in RDF and released under a Creative Commons CC0 public domain waiver, which currently contains information about 14 million citation links to over 7.5 million cited resources. In addition and separately, OpenCitations is currently developing a number of Open Citation Indexes, using the data openly available in third-party bibliographic databases. The first and largest of these is COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, which presently contains information encoded in RDF on more than 445 million citations, released under a CC0 waiver.

OpenCitations structures its data according to the OpenCitations Data Model (OCDM), which may also be employed by third parties, either for their own use or to structure their data for submission to and publication by OpenCitations. This model uses OpenCitations’ suite of SPAR (Semantic Publishing and Referencing) Ontologies, developed to describe all aspects of the scholarly publishing domain. OpenCitations has also published open software of generic applicability for searching, browsing and providing REST APIs over RDF triplestores.

OpenCitations fully supports the founding principles of Open Science. It complies with the FAIR data principles proposed by Force11 that data should be findable, accessible, interoperable and re-usable, and it complies with the recommendations of I4OC that citation data, in particular, should be structured, separable and open. OpenCitations has published a formal definition of an open citation, and has launched a system for globally unique and persistent identifiers (PIDs) for bibliographic citations – the Open Citation Identifiers (OCIs) – for which it maintains an OCI resolution service.

OpenCitations has the potential to be a game-changer in the scholarly information landscape, giving institutions and individuals the ability to analyse and reuse publication citations in other infrastructures, in library collections, and in research. Open citation data are particularly valuable for bibliometric analysis, increasing the reproducibility of large-scale analyses by enabling the publication of the source data upon which analytical results are based. Since citation data are also crucial to evaluating research performance, such access to open, transparent citation data sources is a priority for Open Science.

SCOSS was formed in early 2017 with the purpose of providing a new coordinated and targeted crowd-sourcing and cost-sharing framework to enable the Open Access and Open Science communities to support the open infrastructure services on which they depend. In its first funding cycle, more than 1.5 million euros was pledged by more than 200 institutions worldwide to help fund and sustain the Directory of Open Access Journals (DOAJ) and SHERPA/RoMEO.

With the launch of its second funding cycle, SCOSS is appealing to academic institutions and their libraries, research institutes, publishers, funding organisations, national and regional governments, international organisations, learned societies and service providers worldwide —  everyone who is invested in Open Access and Open Science — to support one or more of these three new selected open infrastructure services through a three-year commitment.

For more details about the services, suggested funding levels, and how you can help support OpenCitations, please see https://sparceurope.org/download/7913/ or contact us at donations@opencitations.net.


COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations

Abstract

In this paper, we present COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations (http://opencitations.net/index/coci). COCI is the first open citation index created by OpenCitations, in which we have applied the concept of citations as first-class data entities, and it contains more than 445 million DOI-to-DOI citation links derived from the data available in Crossref. These citations are described in RDF by means of the new extended version of the OpenCitations Data Model (OCDM). We introduce the workflow we have developed for creating these data, and also show the additional services that facilitate the access to and querying of these data by means of different access points: a SPARQL endpoint, a REST API, bulk downloads, Web interfaces, and direct access to the citations via HTTP content negotiation. Finally, we present statistics regarding the use of COCI citation data, and we introduce several projects that have already started to use COCI data for different purposes.

Introduction

The availability of open scholarly citations [21] is a public good, of significant value to the academic community and the general public. In fact, citations not only serve as an acknowledgment medium [16], but also can be characterised topologically (by defining the connected graph between citing and cited entities and its evolution over time [19]), sociologically (such as for identifying odd conduct within or elitist access paths to scientific research [18]), quantitatively by creating citation-based metrics for evaluating the impact of an idea or a person [17], and financially by defining the scholarly value of a researcher within his/her own academic community [20]. The Initiative for Open Citations (I4OC, https://i4oc.org) has dedicated the past two years to persuading publishers to provide open citation data by means of the Crossref platform (https://crossref.org), obtaining the release of the reference lists of more than 43 million articles (as of February 2019), and it is this change of behaviour by the majority of academic publishers that has permitted COCI to be created.

OpenCitations (http://opencitations.net) is a scholarly infrastructure organization dedicated to open scholarship and the publication of open bibliographic and citation data by the use of Semantic Web (Linked Data) technologies, and is a founding member of I4OC. It has created and maintains the SPAR (Semantic Publishing and Referencing) Ontologies (http://www.sparontologies.net) [22] for encoding scholarly bibliographic and citation data in RDF, and has previously developed the OpenCitations Corpus (OCC) of open downloadable bibliographic and citation data recorded in RDF [4].

In this paper, we introduce a new dataset made available a few months ago by OpenCitations, namely COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations (https://w3id.org/oc/index/coci). This dataset, launched in July 2018, is the first of the indexes proposed by OpenCitations (https://w3id.org/oc/index), in which citations are exposed as first-class data entities with accompanying properties (i.e. individuals of the class cito:Citation as defined in CiTO [7]) instead of being defined simply as relations between two bibliographic resources (via the property cito:cites). Currently, COCI contains more than 445 million DOI-to-DOI citation links made available under a Creative Commons CC0 public domain waiver, which can be accessed and queried through a SPARQL endpoint, an HTTP REST API, by means of searching/browsing Web interfaces, by bulk download in different formats (CSV and N-Triples), or by direct access via HTTP content negotiation.

The rest of the paper is organized as follows. In Section 2 we introduce some of the main RDF datasets containing scholarly bibliographic metadata and citations. In Section 3, we provide some details on the rationale and the technologies used to describe citations as first-class data entities, which are the main foundation of the development of COCI. In Section 4, we present COCI, including the workflow process developed for ingesting and exposing the open citation data available and other tools used for accessing these data. In Section 5, we show the scale of the community uptake of COCI since its launch by means of quantitative statistics on the use of its related services and by listing existing projects that are using it for specific purposes. Finally, in Section 6, we conclude the paper by sketching out related and upcoming projects.

Related works

We have noticed a recent growing interest within the Semantic Web community in creating and making available RDF datasets concerning the metadata of scholarly resources, particularly bibliographic resources. In this section, we briefly introduce some of the most relevant ones.

ScholarlyData (http://www.scholarlydata.org) [1] is a project that refactors the Semantic Web Dog Food so as to keep the dataset growing in good health. It uses the Conference Ontology, an improved version of the Semantic Web Conference Ontology, to describe metadata of documents (5,415, as of March 31, 2019), people (more than 1,100), and data about academic events (592) where such documents have been presented.

Another important source of bibliographic data in RDF is OpenAIRE (https://www.openaire.eu) [3]. Created with funding from the European Union, its RDF dataset makes available data for around 34 million research products created in the context of around 2.5 million research projects.

While important, these aforementioned datasets do not provide citation links between publications as part of their RDF data. In contrast, the following datasets do include citation data as part of the information they make available.

In 2017, Springer Nature announced SciGraph (https://scigraph.springernature.com) [2], a Linked Open Data platform aggregating data sources from Springer Nature and other key partners managing scholarly domain data. It contains data about journal articles (around 8 million, as of March 31, 2019) and book chapters (around 4.5 million), including their related citations, and information on around 7 million people involved in the publishing process.

The OpenCitations Corpus (OCC, https://w3id.org/oc/corpus) [4] is a collection of open bibliographic and citation data created by ourselves, harvested from the open access literature available in PubMed Central. As of March 31, 2019, it contains information about almost 14 million citation links to more than 7.5 million cited bibliographic resources.

WikiCite (https://meta.wikimedia.org/wiki/WikiCite) is a proposal, with a related series of workshops, which aims at building a bibliographic database in Wikidata [10] to serve all Wikimedia projects. Currently Wikidata hosts (as of March 29, 2019) more than 170 million citations.

Biotea (https://biotea.github.io) [5] is an RDF dataset containing information about some of the articles available in the Open Access subset of PubMed Central, which have been enhanced with specialized annotation pipelines. The last released dataset includes information extracted from 2,811 articles, including data on their citations.

Finally, Semantic Lancet [6] proposes to build a dataset of scholarly publication metadata and citations (including the specification of the citation functions) starting from articles published by Elsevier. To date it includes bibliographic metadata, abstract and citations of 291 articles published in the Journal of Web Semantics.

Indexing citations as first-class data entities

Citations are normally defined simply as links between published entities (from a citing entity to a cited entity). However, an alternative richer view is to regard each citation as a data entity in its own right, as illustrated in Figure 1. This alternative approach permits us to endow a citation with descriptive properties, such as those introduced in Table 1.

Figure 1. Two different ways of describing citations: as a relation between two bibliographic entities (top), or as an individual first-class data entity in its own right where the citing entity and the cited entity are among its attributed data.


The advantages of treating citations as first-class data entities are:

  • all the information regarding each citation is available in one place, since such information is defined as attributes of the citation itself;
  • citations become easier to describe, distinguish, count and process, and it becomes possible to distinguish separate citations within the citing entity to the cited entity, enabling one to count how many times, from which sections of the citing entity, and (in principle) for what purposes a particular cited entity is cited within the source paper;
  • if available in aggregate, citations described in this manner are easier to analyse using bibliometric methods, for example to determine how citation time spans vary by discipline.

We have extended the OpenCitations Data Model (OCDM, http://opencitations.net/model) [23] so as to define each citation as a first-class entity in a machine-readable manner. In particular, we have used the class cito:Citation defined in the revised and expanded Citation Typing Ontology (CiTO, http://purl.org/spar/cito) [7], which is part of the SPAR Ontologies [22]. This class allows us to define a permanent conceptual directional link from the citing bibliographic entity to a cited bibliographic entity, which can be accompanied by additional ontological terms for defining specific attributes, as introduced in Table 1.

| Characteristic | Description | CiTO entity |
| --- | --- | --- |
| citing entity | The bibliographic entity which acts as source for the citation. | Object property cito:hasCitingEntity. |
| cited entity | The bibliographic entity which acts as target for the citation. | Object property cito:hasCitedEntity. |
| citation creation date | The date on which the citation was created. This has the same numerical value as the publication date of the citing bibliographic resource, but is a property of the citation itself. When combined with the citation time span, it permits that citation to be located in history. | Data property cito:hasCitationCreationDate, one of xsd:date, xsd:gYearMonth, or xsd:gYear as datatype value. |
| citation timespan | The temporal characteristic of a citation, namely the interval between the publication date of the cited entity and the publication date of the citing entity. | Data property cito:hasCitationTimeSpan, xsd:duration as datatype value. |
| type | A classification of the citation according to particular dimensions, e.g. whether or not it is a self-citation. | Property rdf:type associated with one or more subclasses of cito:Citation – in particular, for example cito:AuthorSelfCitation (i.e. citing and the cited entities have at least one author in common) and cito:JournalSelfCitation (i.e. citing and the cited entities are published in the same journal). |

Table 1. List of characteristics that can be associated with a citation when it is described as a first-class data entity, using the properties and classes available in CiTO for their definition in RDF.
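As an illustration of the difference between the two views, the following is a minimal Python sketch, not part of the OpenCitations software. The class name Citation and its field names are hypothetical; only the CiTO terms in the comments and the example values (taken from the COCI citation between 10.1186/1756-8722-6-59 and 10.1186/1756-8722-5-31) come from this paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

# View 1: a citation as a bare relation (citing, cited), i.e. cito:cites.
CitationLink = Tuple[str, str]


# View 2: a citation as a first-class entity carrying the attributes of Table 1.
@dataclass(frozen=True)
class Citation:
    citing: str                           # cito:hasCitingEntity
    cited: str                            # cito:hasCitedEntity
    creation_date: Optional[str] = None   # cito:hasCitationCreationDate
    timespan: Optional[str] = None        # cito:hasCitationTimeSpan (xsd:duration)
    types: FrozenSet[str] = frozenset()   # subclasses of cito:Citation

    def as_link(self) -> CitationLink:
        """Collapse the entity back to the simple relation view."""
        return (self.citing, self.cited)


example = Citation(
    citing="http://dx.doi.org/10.1186/1756-8722-6-59",
    cited="http://dx.doi.org/10.1186/1756-8722-5-31",
    creation_date="2013",
    timespan="P1Y",
    types=frozenset({"cito:JournalSelfCitation"}),
)


def is_journal_self_citation(citation: Citation) -> bool:
    """All information needed to answer this lives on the citation itself."""
    return "cito:JournalSelfCitation" in citation.types
```

Because every attribute is attached to the citation, questions such as "is this a journal self-citation?" can be answered without consulting the citing or cited records.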

So as to identify each citation precisely, when described as a first-class data entity and included in an open dataset, we have also developed the Open Citation Identifier (OCI) [24], which is a new globally unique persistent identifier for citations. OCIs are registered in the Identifiers.org platform (https://identifiers.org/oci) and recognized as persistent identifiers for citations by the EU FREYA Project (https://www.project-freya.eu) [25]. Each OCI has a simple structure: the lower-case letters oci followed by a colon, followed by two sequences of numerals separated by a dash, where the first sequence is the identifier for the citing bibliographic resource and the second sequence is the identifier for the cited bibliographic resource. For example, oci:0301-03018 is a valid OCI for a citation defined within the OpenCitations Corpus, while oci:02001010806360107050663080702026306630509-02001010806360107050663080702026305630301 is a valid OCI for a citation included in Crossref. It is worth mentioning that OCIs are not opaque identifiers, since they explicitly encode directional relationships between identified citing and cited entities, the provenance of the citation, i.e. the database that contains it, and the type of identifiers used in that database to identify the citing and cited entities. In addition, we have created the Open Citation Identifier Resolution Service (http://opencitations.net/oci), which is a resolution service for OCIs based on the Python application oci.py available at https://github.com/opencitations/oci. Given a valid OCI as input, this resolution service is able to retrieve citation data in RDF (either as RDF/XML, Turtle or JSON-LD), or in Scholix, JSON or CSV formats. A more detailed explanation of OCIs and related material is available in [24].
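The surface structure of an OCI described above can be checked with a few lines of Python. This is only a syntactic sketch: the names OCI_PATTERN, ParsedOCI and parse_oci are hypothetical and are not taken from oci.py, and the check does not verify that the supplier prefix is known or that the identified entities exist.

```python
import re
from typing import NamedTuple

# "oci", a colon, then two sequences of numerals separated by a dash.
OCI_PATTERN = re.compile(r"oci:([0-9]+)-([0-9]+)")


class ParsedOCI(NamedTuple):
    citing_id: str  # numeric identifier of the citing resource
    cited_id: str   # numeric identifier of the cited resource


def parse_oci(oci: str) -> ParsedOCI:
    """Split an OCI into its citing and cited numeric sequences."""
    match = OCI_PATTERN.fullmatch(oci)
    if match is None:
        raise ValueError(f"not a syntactically valid OCI: {oci!r}")
    return ParsedOCI(citing_id=match.group(1), cited_id=match.group(2))


if __name__ == "__main__":
    print(parse_oci("oci:0301-03018"))
```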

At OpenCitations, we define an open citation index as a dataset containing citations that complies with the following requirements:

  • the citations contained are all open, according to the definition provided in [21];
  • the citations are all treated as first-class data entities;
  • each citation is identified by an Open Citation Identifier (OCI) [24];
  • the citation data are recorded in RDF according to the OpenCitations Data Model (OCDM) [23], where the OCI of a citation is embedded in the IRI defining it in RDF;
  • each citation defines the attributes shown in Table 1.

COCI: ingestion workflow, data, and services

COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, is the first citation index to be published by OpenCitations, in which we have applied the concept of citations as first-class data entities, introduced in the previous section, to index the contents of one of the major open databases of scholarly citation information, namely Crossref (https://crossref.org), and to render and make available this information in machine-readable RDF under a CC0 waiver. Crossref contains metadata about publications (mainly academic journal articles) that are identified using Digital Object Identifiers (DOIs). Out of more than 100 million publications recorded in Crossref, Crossref also stores the reference lists of more than 43 million publications deposited by the publishers. Many of these references are to other publications bearing DOIs that are also described in Crossref, while others are to publications that lack DOIs and do not have Crossref descriptions. Crossref organises such publications with associated reference lists according to three categories: closed, limited and open. These categories refer to publications for which the reference lists are, respectively, not visible to anyone outside the Crossref Cited-by membership, visible only to them and to Crossref Metadata Plus members, or visible to all users.

Figure 2. The diagram of the data model adopted to define the new class for defining citations as first-class data entities, which forms part of the OpenCitations Data Model. This model uses terms from the Citation Typing Ontology (CiTO, http://purl.org/spar/cito) for describing the data, and from the Provenance Ontology (PROV-O, http://www.w3.org/ns/prov) to define the citation’s provenance.

Following the first release of COCI on July 4, 2018, the most recent version of COCI, released on November 12, 2018, contains more than 445 million DOI-to-DOI citations included in the open and the limited datasets of Crossref reference data. All the citation data in COCI and their provenance information, described according to the Graffoo diagram [27] presented in Figure 2, are included in two distinct graphs – https://w3id.org/oc/index/coci/ and https://w3id.org/oc/index/coci/prov/ respectively – released under a CC0 waiver, and compliant with the FAIR data principles [26].

An example of a citation included in COCI is shown in the following excerpt (in Turtle), where the OCI is embedded as part of the IRI of the citation (without the oci: prefix) after the ci/ (meaning citation according to the OpenCitations Data Model [23]):

@prefix cito: <http://purl.org/spar/cito/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://w3id.org/oc/index/coci/ci/02001010806360107050663080702026306630509-02001010806360107050663080702026305630301>
  a 
    cito:Citation,
    cito:JournalSelfCitation ;
  cito:hasCitationCreationDate "2013"^^xsd:gYear ;
  cito:hasCitationTimeSpan "P1Y"^^xsd:duration ;
  cito:hasCitingEntity <http://dx.doi.org/10.1186/1756-8722-6-59> ;
  cito:hasCitedEntity <http://dx.doi.org/10.1186/1756-8722-5-31> ;
  prov:generatedAtTime "2018-11-01T05:47:54+00:00"^^xsd:dateTime ;
  prov:hadPrimarySource <https://api.crossref.org/works/10.1186/1756-8722-6-59> ;
  prov:wasAttributedTo <https://w3id.org/oc/index/coci/prov/pa/1> .

In the following subsections we introduce the ingestion workflow developed for creating COCI, we provide some figures on the citations it contains, and we list the resources and services we have made available to permit access to and querying of the dataset.

Ingestion workflow

We processed all the data included in the October 2018 JSON dump of Crossref data, available to all the Crossref Metadata Plus members. The ingestion workflow, summarised in Figure 3, was organised in four distinct phases, and all the related scripts developed and used are released as open source code under the ISC License and downloadable from the official GitHub repository of COCI at https://github.com/opencitations/coci.

Figure 3. A flowchart describing the workflow to build COCI. It is divided into four phases: (1) global data generation, (2) CSV generation, (3) conversion into RDF, and (4) updating the triplestore.

Phase 1: global data generation. We parse and process the entire Crossref bibliographic database to extract all the publications having a DOI and their available list of references. Through this process three datasets are generated, which are used in the next phase:

  • Dates: the publication dates of all the bibliographic entities in Crossref and of all their references if they explicitly specify a DOI and a publication date as structured data – e.g. see the fields DOI and year in the array reference in https://api.crossref.org/works/10.1007/978-3-030-00668-6_8. Where the same DOI is encountered multiple times, e.g. as a proper item indexed in Crossref and also as a reference in the reference list of another article deposited in Crossref, we use the full publication date defined in the indexed item.
  • ISSN: the ISSN (if any) and publication type (journal-article, book-chapter, etc.) of each bibliographic entity identified by a DOI indexed in Crossref.
  • ORCID: the ORCIDs (if any) associated with the authors of each bibliographic entity identified by a DOI indexed in Crossref.

Phase 2: CSV generation. We generate a CSV file such that each row represents a particular citation between a citing entity and a cited entity according to the data available in the Crossref dump, by looking at the DOI identifying the citing entity and all the DOIs specified in the reference list of such a citing entity according to the Crossref data. In particular, we execute the following four steps for each citation identified:

  1. We generate the OCI for the citation by encoding the DOIs of the citing and cited entities into numerical sequences using the lookup table available at https://github.com/opencitations/oci/blob/master/lookup.csv, which are prefixed by the supplier prefix 020 to indicate Crossref as the source of the citation.
  2. We retrieve the publication date of the citing entity from the Dates dataset and assign it as citation creation date.
  3. We retrieve the publication date of the cited entity (from the Dates dataset) and we use it, together with the publication date of the citing entity retrieved in the previous step, to calculate the citation timespan.
  4. We use the data contained in the ISSN and ORCID datasets to establish whether the citing and cited entity have been published in the same journal and/or have at least one author in common, and in these cases we assign the appropriate self-citation type(s) to the citation.
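The four steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the code in the COCI repository. The character table PARTIAL_LOOKUP is inferred solely from the worked example in this paper (digits map to two-digit codes, "/" to 36 and "-" to 63, with the leading "10." of the DOI not encoded after the 020 prefix); the full table is the lookup.csv file mentioned in step 1. The timespan is computed at year granularity only, and all function names are hypothetical.

```python
from typing import List, Optional, Set

CROSSREF_PREFIX = "020"  # supplier prefix indicating Crossref (step 1)

# Partial table inferred from the example OCI in this paper; NOT the full lookup.csv.
PARTIAL_LOOKUP = {str(digit): f"0{digit}" for digit in range(10)}
PARTIAL_LOOKUP.update({"/": "36", "-": "63"})


def encode_doi(doi: str) -> str:
    """Step 1 (partial): encode a DOI as a numeric sequence prefixed by 020."""
    if not doi.startswith("10."):
        raise ValueError(f"unexpected DOI: {doi!r}")
    try:
        body = "".join(PARTIAL_LOOKUP[char] for char in doi[len("10."):])
    except KeyError as error:
        raise ValueError(f"character {error} needs the full lookup table") from None
    return CROSSREF_PREFIX + body


def make_oci(citing_doi: str, cited_doi: str) -> str:
    """Step 1: the OCI joins the citing and cited sequences with a dash."""
    return f"oci:{encode_doi(citing_doi)}-{encode_doi(cited_doi)}"


def year_timespan(citing_year: int, cited_year: int) -> str:
    """Steps 2-3 (year granularity only): timespan as an xsd:duration string."""
    delta = citing_year - cited_year
    return f"{'-' if delta < 0 else ''}P{abs(delta)}Y"


def self_citation_types(citing_issn: Optional[str], cited_issn: Optional[str],
                        citing_orcids: Set[str], cited_orcids: Set[str]) -> List[str]:
    """Step 4: assign self-citation types from ISSN and ORCID data."""
    types = []
    if citing_issn is not None and citing_issn == cited_issn:
        types.append("cito:JournalSelfCitation")
    if citing_orcids & cited_orcids:
        types.append("cito:AuthorSelfCitation")
    return types
```

Applied to the DOIs 10.1186/1756-8722-6-59 and 10.1186/1756-8722-5-31, make_oci reproduces the OCI quoted earlier in this paper, which is what the partial table was derived from.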

Simultaneously with the creation of the CSV file of citation data, we generate a second CSV file containing the provenance information for each citation (identified by its OCI generated in the aforementioned Step 1). These provenance data include the agent responsible for the generation of the citation, the Crossref API call that refers to the data of the citing bibliographic entity containing the reference used to create the citation, and the creation date of the citation.

Phase 3: converting into RDF. The CSV files generated in the previous phase are then converted into RDF according to the N-Triples format, following the OWL model introduced in Figure 2, where the DOIs of the citing and cited entities become DOI URLs starting with http://dx.doi.org/, while the IRI of the citation includes its OCI (without the oci: prefix), as illustrated in the example given in the previous section.
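A minimal sketch of such a conversion, using only the terms that appear in the Turtle excerpt shown earlier, might look as follows. The function name citation_to_ntriples and the choice of date datatype from the length of the date string are assumptions made for illustration; the sketch performs no IRI escaping, emits no self-citation types, and is not the conversion script used for COCI.

```python
from typing import List

CITO = "http://purl.org/spar/cito/"
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
CITATION_BASE = "https://w3id.org/oc/index/coci/ci/"
DOI_BASE = "http://dx.doi.org/"

# Assumption: datatype picked from the string length (YYYY, YYYY-MM, YYYY-MM-DD).
DATE_TYPES = {4: "gYear", 7: "gYearMonth", 10: "date"}


def citation_to_ntriples(oci: str, citing_doi: str, cited_doi: str,
                         creation: str, timespan: str) -> List[str]:
    """Serialise one citation row as N-Triples statements."""
    # The citation IRI embeds the OCI without its "oci:" prefix.
    local_id = oci[len("oci:"):] if oci.startswith("oci:") else oci
    subject = f"<{CITATION_BASE}{local_id}>"
    date_type = DATE_TYPES[len(creation)]
    return [
        f"{subject} <{RDF_TYPE}> <{CITO}Citation> .",
        f"{subject} <{CITO}hasCitingEntity> <{DOI_BASE}{citing_doi}> .",
        f"{subject} <{CITO}hasCitedEntity> <{DOI_BASE}{cited_doi}> .",
        f'{subject} <{CITO}hasCitationCreationDate> "{creation}"^^<{XSD}{date_type}> .',
        f'{subject} <{CITO}hasCitationTimeSpan> "{timespan}"^^<{XSD}duration> .',
    ]


if __name__ == "__main__":
    for line in citation_to_ntriples(
        "oci:02001010806360107050663080702026306630509-02001010806360107050663080702026305630301",
        "10.1186/1756-8722-6-59", "10.1186/1756-8722-5-31", "2013", "P1Y",
    ):
        print(line)
```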

Phase 4: updating the triplestore. The final RDF files generated in Phase 3 are used to update the triplestore used for the OpenCitations Indexes.

Data

COCI was first created and released on July 4, 2018, and most recently updated on November 12, 2018. Currently, it contains 445,826,118 citations between 46,534,705 bibliographic entities. These are stored by means of 2,259,134,894 RDF statements (around 5 RDF statements per citation) for describing the citation data, and 1,337,478,354 RDF statements (3 statements per citation) for describing the related provenance information. Of the citations stored, 29,755,045 (6.7%) are journal self-citations, while 250,991 (0.06%) are author self-citations. The number of identified author self-citations, based on author ORCIDs, is a significant underestimate of the true number, mainly due to the sparsity of the data concerning the ORCID author identifiers within the Crossref dump. Journal entities (i.e. journals, volumes, issues, and articles) are the most frequently cited type of bibliographic entity, with over 420 million citations.
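The ratios quoted above can be re-derived from the raw counts with a short Python check. The variable names are ours; the figures are those given in the paragraph.

```python
# Counts reported for the November 12, 2018 release of COCI.
citations = 445_826_118
data_statements = 2_259_134_894
provenance_statements = 1_337_478_354
journal_self_citations = 29_755_045
author_self_citations = 250_991

# Average number of RDF statements per citation.
data_per_citation = data_statements / citations
provenance_per_citation = provenance_statements / citations

# Shares of self-citations, as percentages of all citations.
journal_share = 100 * journal_self_citations / citations
author_share = 100 * author_self_citations / citations

if __name__ == "__main__":
    print(f"data statements per citation: {data_per_citation:.2f}")
    print(f"provenance statements per citation: {provenance_per_citation:.2f}")
    print(f"journal self-citations: {journal_share:.1f}%")
    print(f"author self-citations: {author_share:.2f}%")
```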

We also classify the cited documents according to their publishers – Table 2 shows the top ten publishers of citing and cited documents, calculated by looking at the DOI prefixes of the entities involved in each citation. As we can see, Elsevier is by far the publisher with the largest number of cited documents. It is also the largest publisher that is not participating in the Initiative for Open Citations by making its publications’ reference lists open at Crossref – which is highlighted by the very limited number of outgoing citations recorded in COCI. Its present refusal to open its article reference lists in Crossref, contrary to the practice of most of the major scholarly publishers, is contributing significantly to the invisibility of Elsevier’s own publications within the corpora of open citation data such as COCI that are increasingly being used by the scholarly community for discovery, citation network visualization and bibliometric analysis, as we discuss in Section 5.

| Publisher | Outgoing citations | Incoming citations |
| --- | --- | --- |
| Springer Nature | 79,860,827 | 52,257,862 |
| Wiley | 76,819,685 | 48,174,542 |
| Elsevier | 2,853,739 | 96,310,027 |
| Informa UK Limited | 41,433,917 | 14,975,989 |
| Institute of Electrical and Electronics Engineers (IEEE) | 30,114,985 | 20,940,703 |
| American Physical Society (APS) | 15,729,297 | 16,065,862 |
| SAGE Publications | 15,933,805 | 7,915,082 |
| Ovid Technologies (Wolters Kluwer Health) | 9,971,274 | 12,840,293 |
| Oxford University Press (OUP) | 9,891,000 | 11,466,659 |
| AIP Publishing | 10,130,022 | 8,455,097 |

Table 2. A classification of the COCI citations according to the publishers of the cited (incoming citations) and citing (outgoing citations) documents. The table shows the top ten publishers by the overall number of incoming and outgoing citations to/from their published works. Those publishers shown in italics are not participating in the Initiative for Open Citations by making their publications’ reference lists open at Crossref – see https://i4oc.org for additional information.

Resources and services

The citation data in COCI can be accessed in a variety of convenient ways, listed as follows.

Open Citation Index SPARQL endpoint. We have made available a SPARQL endpoint for all the indexes released by OpenCitations, including COCI, which is available at https://w3id.org/oc/index/sparql. When accessed with a browser, it shows a SPARQL endpoint editor GUI generated with YASGUI [8]. This SPARQL endpoint can also be queried programmatically over HTTP, e.g. via curl. In order to access COCI data, the graph https://w3id.org/oc/index/coci/ must be specified in the SPARQL query.
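As a hedged illustration of a programmatic query, the sketch below builds a request that restricts the query to the COCI graph. It assumes that the endpoint accepts standard SPARQL 1.1 Protocol GET requests with a query parameter and can return JSON results; the function names are hypothetical, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

SPARQL_ENDPOINT = "https://w3id.org/oc/index/sparql"
COCI_GRAPH = "https://w3id.org/oc/index/coci/"


def build_query(cited_doi: str, limit: int = 10) -> str:
    """SPARQL query listing citations received by a DOI, scoped to the COCI graph."""
    return (
        "PREFIX cito: <http://purl.org/spar/cito/>\n"
        "SELECT ?citation ?citing\n"
        "WHERE {\n"
        f"  GRAPH <{COCI_GRAPH}> {{\n"
        f"    ?citation cito:hasCitedEntity <http://dx.doi.org/{cited_doi}> ;\n"
        "              cito:hasCitingEntity ?citing .\n"
        "  }\n"
        "}\n"
        f"LIMIT {limit}"
    )


def build_request(query: str) -> Request:
    """Wrap the query in a GET request (assumes SPARQL 1.1 Protocol support)."""
    url = SPARQL_ENDPOINT + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


if __name__ == "__main__":
    request = build_request(build_query("10.1186/1756-8722-5-31"))
    print(request.full_url)
    # To execute: urllib.request.urlopen(request), then parse the JSON response.
```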

COCI REST API. Citation data in COCI can be retrieved by using the COCI REST API, available at https://w3id.org/oc/index/coci/api/v1. The rationale for making a REST API available in addition to the SPARQL endpoint was to provide convenient access to the citation data included in COCI for Web developers and users who are not necessarily experts in Semantic Web technologies. This REST API, like all the other REST APIs made available by OpenCitations, has been implemented by means of RAMOSE, the Restful API Manager Over SPARQL Endpoints (https://github.com/opencitations/ramose), which is a Python application that allows one to create a REST API over any SPARQL endpoint by means of a simple configuration file that executes a SPARQL query depending on the particular API call specified. The configuration file for the COCI API is available at https://github.com/opencitations/api/blob/master/coci_v1.hf. Currently, the COCI REST API makes available four operations, which retrieve either (a) the citation data for all the references of a given DOI (operation: references), or (b) the citation data for all the citations received by a given DOI (operation: citations), or (c) the citation data for the citation identified by an OCI (operation: citation), or (d) the metadata for the articles identified by the specified DOIs (operation: metadata). It is worth mentioning that the last operation strictly depends on live API calls to external services, namely the Crossref API (https://api.crossref.org), the DataCite API (https://api.datacite.org), and the Unpaywall API (http://api.unpaywall.org), to gather the metadata of the requested articles, such as the title, the authors, and the journal name, that are not explicitly included within the OpenCitations Index triplestore.
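A small helper for composing calls to the four operations might look like the sketch below. The base URL and operation names come from the description above, but the path layout (base, then operation, then identifier) is an assumption made for illustration and should be checked against the API documentation; build_api_url is a hypothetical name.

```python
from urllib.parse import quote

API_BASE = "https://w3id.org/oc/index/coci/api/v1"
OPERATIONS = {"references", "citations", "citation", "metadata"}


def build_api_url(operation: str, identifier: str) -> str:
    """Compose an API call URL; the /{operation}/{identifier} layout is assumed."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    # Keep "/" unescaped, since DOIs contain a slash between prefix and suffix.
    return f"{API_BASE}/{operation}/{quote(identifier, safe='/')}"


if __name__ == "__main__":
    print(build_api_url("citations", "10.1186/1756-8722-5-31"))
    print(build_api_url("references", "10.1186/1756-8722-6-59"))
```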

Searching and browsing interfaces. We have additionally developed a user-friendly text search interface (https://w3id.org/oc/index/search), and a browsing interface (e.g. https://w3id.org/oc/index/browser/coci/ci/02001010806360107050663080702026306630509-02001010806360107050663080702026305630301), that can be used to search citation data in all the OpenCitations Indexes, including COCI, and to visualise and browse them, respectively. These two interfaces have been developed by means of OSCAR, the OpenCitations RDF Search Application (https://github.com/opencitations/oscar) [9], and LUCINDA, the OpenCitations RDF Resource Browser (https://github.com/opencitations/lucinda), which provide a configurable layer over SPARQL endpoints that makes it easy to create Web interfaces for querying and visualising the results of SPARQL queries.

Data dumps. All the citation data and provenance information in COCI are available as dumps stored in Figshare (https://figshare.com) in both CSV and N-Triples formats, while a dump of the whole triplestore is available on The Internet Archive (https://archive.org). The links to these dumps are available on the download page of the OpenCitations website (http://opencitations.net/download#coci).

Direct HTTP access. All the citation data in COCI can be accessed directly by means of the HTTP IRIs of the stored resources (via content negotiation, e.g. https://w3id.org/oc/index/coci/ci/02001010806360107050663080702026306630509-02001010806360107050663080702026305630301).
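
A minimal sketch of such a content-negotiated request, using only the Python standard library, is given below. It prepares the request without sending it. The choice of text/turtle as the requested media type is an assumption for illustration; which RDF serialisations the server actually offers should be verified separately.

```python
from urllib.request import Request

# IRI of the example citation given in the text.
CITATION_IRI = (
    "https://w3id.org/oc/index/coci/ci/"
    "02001010806360107050663080702026306630509-"
    "02001010806360107050663080702026305630301"
)


def negotiated_request(iri, media_type="text/turtle"):
    """Prepare (but do not send) a GET request asking for an RDF format.

    The default media type is an assumption, not a documented guarantee.
    """
    return Request(iri, headers={"Accept": media_type})


req = negotiated_request(CITATION_IRI)
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
# To actually fetch the data: urllib.request.urlopen(req).read()
```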

Quantifying the use of COCI citation data

We have monitored the accesses to COCI data since its launch in July 2018. The statistics and graphics we show in this section highlight two different aspects: the quantification of the use of COCI data – and related services – and the community uptake, i.e. the reuse of COCI data within cross-community projects and studies. All the data of the charts described in this section are freely available for download from Figshare [15].

Quantitative analysis

Figure 4 shows the number of accesses made between July 2018 and February 2019 (inclusive) to the various COCI services described above – the search/browse interfaces, the REST API, SPARQL queries, and others (e.g. direct HTTP access to particular citations and visits to COCI webpages in the OpenCitations website). We have excluded from all these counts all accesses made by automated agents and bots. As shown, the REST API is, by far, the most used service, with extensive usage recorded in the last four months, following the announcement of the second release of COCI. This is reasonable, considering that the REST API has been developed exactly for accommodating the needs of generic Web users and developers, including (and in particular) those who are not expert in Semantic Web technologies. There is just one exception in November 2018, where the SPARQL endpoint was used to retrieve quite a large amount of citation data. After further investigation, we noticed a large proportion of the SPARQL calls were coming from a single source (according to the IP data stored in our log), which probably collected citation data for a specific set of entities.

Figure 4. The number of accesses to COCI-related services from July 2018 to February 2019. The scale used in the y-axis is logarithmic.

Figure 5 shows a particular cut of the figures given in Figure 4, which focuses on the REST API accesses only. In particular, we analysed which operations of the API were used the most. According to these figures, the most used operation is metadata (which was first introduced in the API in August 2018) which allows one to retrieve all the metadata describing certain publications. In contrast to the other API operations, this metadata search accepts one or more DOIs as input. The least used operation was citation, which allows one to retrieve citation data given an OCI, which should not be surprising, considering the currently limited knowledge of this new identifier system for citations.

Figure 5. The number of accesses made to each COCI REST API operation since the release of COCI in July 2018, classified into four categories (requested resource): references, citations, citation, and metadata, as defined in the text. Note again the logarithmic scale of the y-axis.

In addition, we have retrieved data about the views and downloads (as of March 29, 2019) of all the dumps uploaded to Figshare and to the Internet Archive. The CSV data dump received 1,321 views and 454 downloads, followed by the N-Triples data dump with 316 views and 93 downloads. The CSV provenance information dump had 166 views and 127 downloads, while the N-Triples provenance information dump had 95 views and 34 downloads. Finally, the least accessed dump was that of the entire triplestore available in the Internet Archive, uploaded for the very first time in November 2018, which had only 3 views.

Community uptake

The data in COCI have already been used in various projects and initiatives. In this section, we list all the tools and studies of which we are aware that use them.

VOSviewer (http://www.vosviewer.com) [11] is a software tool, developed at Leiden University’s Centre for Science and Technology Studies (CWTS), for constructing and visualizing bibliometric networks, which may include journals, researchers, or individual publications, and may be constructed based on citation, bibliographic coupling, co-citation, and co-authorship relations. Starting from version 1.6.10 (released on January 10, 2019), VOSviewer can directly use citation data stored in COCI, retrieved by means of the COCI REST API.

Citation Gecko (http://citationgecko.com) is a novel literature mapping tool that allows one to map a research citation network using some initial seed articles. Citation Gecko is able to leverage citation links between seed papers and other papers to highlight papers of possible interest to the user, for which it uses COCI data (accessed via the REST API) to generate the citation network.

OCI Graphe (https://dossier-ng.univ-st-etienne.fr/scd/www/oci/OCI_graphe_accueil.html) is a Web tool that allows one to search for articles by means of the COCI REST API, which are then visualised in a graph showing citations to the retrieved articles. It enriches this visualisation by adding information about the publication venues, publication dates, and other related metadata.

Zotero [12] is a free, easy-to-use tool to help users collect, organize, cite, and share research. Recently, the Open Citations Plugin for Zotero (https://github.com/zuphilip/zotero-open-citations) has been released, which allows users to retrieve open citation data extracted from COCI (via its REST API) for one or more articles included in a Zotero library.

COCI data, downloaded from the CSV dump available on Figshare, have also been used in at least two bibliometric studies. In particular, during the LIS Bibliometrics 2019 Event, Stephen Pearson presented a study (https://blog.research-plus.library.manchester.ac.uk/2019/03/04/using-open-citation-data-to-identify-new-research-opportunities/) run on publications by scholars at the University of Manchester which used COCI to retrieve citations between these publications so as to investigate potential cross-discipline and cross-department collaborations. Similarly, COCI data were used to conduct an experiment on the latest Italian Scientific Habilitation [13] (the national exercise that evaluates whether a scholar is appropriate to receive an Associate/Full Professorship position in an Italian university), which aimed at trying to replicate part of the outcomes of this evaluation exercise for the Computer Science research field by using only open scholarly data, including the citations available in COCI, rather than citation data from subscription services.

Conclusions

In this paper, we have introduced COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. After an initial introduction of the notion of citations as first-class data entities, we have presented the ingestion workflow that has been implemented to create COCI, have detailed the data COCI contains, and have described the various services and resources that we have made available to access COCI data. Finally, we have presented some statistics about the use of COCI data, and have mentioned the tools and studies that have adopted COCI in recent months.

COCI is just the first open citations index that OpenCitations will make available. Using the experience we have gathered by creating it, we now plan the release of additional indexes, so as to extend the coverage of open citations available through the OpenCitations infrastructure. The first of these, recently released, is CROCI (https://w3id.org/oc/index/croci) [14], the Crowdsourced Open Citations Index, which contains citations deposited by individuals. CROCI is designed to permit scholars proactively to fill the open citations gap in COCI resulting from four causes: (a) the failure of many publishers using Crossref DOIs to deposit reference lists of their publications at Crossref; (b) the failure of some publishers that do deposit their reference lists to make these reference lists open, in accordance with the recommendations of the Initiative for Open Citations; (c) the absence from ~11% of Crossref reference metadata of the DOIs for cited articles which in fact have been assigned DOIs (https://www.crossref.org/blog/underreporting-of-matched-references-in-crossref-metadata/), a problem that Crossref are currently working hard to rectify; and (d) the existence of citations to published entities that lack Crossref DOIs. In the near future, we plan to extend the number of indexes by harvesting citations from other open datasets including Wikidata (https://www.wikidata.org), DataCite (https://datacite.org), and Dryad (https://datadryad.org). In addition, we plan to extend and generalise the current software developed for COCI, so as to facilitate more frequent updates of the indexes.

Acknowledgements

We gratefully acknowledge the financial support provided to us by the Alfred P. Sloan Foundation for the OpenCitations Enhancement Project (grant number G‐2017‐9800).

References

  1. Nuzzolese, A. G., Gentile, A. L., Presutti, V., Gangemi, A. (2016). Conference Linked Data: The ScholarlyData project. In Proceedings of the 15th International Semantic Web Conference (ISWC 2016): 150-158. DOI: https://doi.org/10.1007/978-3-319-46547-0_16
  2. Hammond, T., Pasin, M., & Theodoridis, E. (2017). Data integration and disintegration: Managing Springer Nature SciGraph with SHACL and OWL. In International Semantic Web Conference (Posters, Demos & Industry Tracks). http://ceur-ws.org/Vol-1963/paper493.pdf
  3. Alexiou, G., Vahdati, S., Lange, C., Papastefanatos, G., Lohmann, S. (2016). OpenAIRE LOD services: scholarly communication data as linked data. In Semantics, Analytics, Visualization. Enhancing Scholarly Data: 45-50. DOI: https://doi.org/10.1007/978-3-319-53637-8_6
  4. Peroni, S., Shotton, D., Vitali, F. (2017). One year of the OpenCitations Corpus – releasing RDF-based scholarly citation data into the public domain. In Proceedings of the 16th International Semantic Web Conference (ISWC 2017): 184-192. DOI: https://doi.org/10.1007/978-3-319-68204-4_19
  5. Garcia, A., Lopez, F., Garcia, L., Giraldo, O., Bucheli, V., Dumontier, M. (2018). Biotea: semantics for Pubmed Central. PeerJ, 6: e4201. DOI: https://doi.org/10.7717/peerj.4201
  6. Bagnacani, A., Ciancarini, P., Di Iorio, A., Nuzzolese, A. G., Peroni, S., Vitali, F. (2014). The Semantic Lancet Project: A Linked Open Dataset for Scholarly Publishing. In EKAW 2014 Satellite Events: 101-105. DOI: https://doi.org/10.1007/978-3-319-17966-7_10
  7. Peroni, S., Shotton, D. (2012). FaBiO and CiTO: ontologies for describing bibliographic resources and citations. Web Semantics, 17: 33-43. DOI: https://doi.org/10.1016/j.websem.2012.08.001
  8. Rietveld, L., Hoekstra, R. (2017). The YASGUI family of SPARQL clients. Semantic Web, 8(3): 373-383. DOI: https://doi.org/10.3233/SW-150197
  9. Heibi, I., Peroni, S., Shotton, D. (2018). OSCAR: A Customisable Tool for Free-Text Search over SPARQL Endpoints. In Semantics, Analytics, Visualization: 121-137. DOI: https://doi.org/10.1007/978-3-030-01379-0_9
  10. Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., Vrandečić, D. (2014). Introducing Wikidata to the linked data web. In Proceedings of the 13th International Semantic Web Conference (ISWC 2014): 50-65. DOI: https://doi.org/10.1007/978-3-319-11964-9_4
  11. van Eck, N., & Waltman, L. (2009). Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84(2), 523-538. DOI: https://doi.org/10.1007/s11192-009-0146-3
  12. Ahmed, K. M., Al Dhubaib, B. (2011). Zotero: A bibliographic assistant to researcher. Journal of Pharmacology and Pharmacotherapeutics, 2(4), 303. DOI: https://doi.org/10.4103/0976-500X.85940
  13. Di Iorio, A., Peroni, S., Poggi, F. (2019). Open data to evaluate academic researchers: an experiment with the Italian Scientific Habilitation. (To appear) Proceedings of the 17th International Conference on Scientometrics and Informetrics (ISSI 2019). https://arxiv.org/abs/1902.03287
  14. Heibi, I., Peroni, S., Shotton, D. (2019). Crowdsourcing open citations with CROCI – An analysis of the current status of open citations, and a proposal. (To appear) Proceedings of the 17th International Conference on Scientometrics and Informetrics (ISSI 2019). https://arxiv.org/abs/1902.02534
  15. Heibi, I., Peroni, S., Shotton, D. (2019). Usage statistics of COCI data. Figshare. DOI: https://doi.org/10.6084/m9.figshare.7873559
  16. Newton, I. (1675). Isaac Newton letter to Robert Hooke – Cambridge, 5 February 1675. https://digitallibrary.hsp.org/index.php/Detail/objects/9792 (last visited 23 March 2019)
  17. Schiermeier, Q. (2017). Initiative aims to break science’s citation paywall. Nature. DOI: https://doi.org/10.1038/nature.2017.21800
  18. Sugimoto, C. R., Waltman, L., Larivière, V., van Eck, N. J, Boyack, K. W., Wouters, P., de Rijcke, S. (2017). Open citations: A letter from the scientometric community to scholarly publishers. ISSI Society. http://issi-society.org/open-citations-letter (last visited 23 March 2019)
  19. Chawla, D. S. (2017). Now free: citation data from 14 million papers, and more might come. Science. https://www.sciencemag.org/news/2017/04/now-free-citation-data-14-million-papers-and-more-might-come (last visited 23 March 2019)
  20. Molteni, M. (2017). Tearing Down Science’s Citation Paywall, One Link at a Time. Wired. https://www.wired.com/2017/04/tearing-sciences-citation-paywall-one-link-time/ (last visited 23 March 2019)
  21. Peroni, S., Shotton, D. (2018). Open Citation: Definition. Figshare. DOI: https://doi.org/10.6084/m9.figshare.6683855
  22. Peroni, S., Shotton, D. (2018). The SPAR Ontologies. In Proceedings of the 17th International Semantic Web Conference (ISWC 2018): 119-136. DOI: https://doi.org/10.1007/978-3-030-00668-6_8
  23. Peroni, S., Shotton, D. (2018). The OpenCitations Data Model. Figshare. DOI: https://doi.org/10.6084/m9.figshare.3443876
  24. Peroni, S., Shotton, D. (2019). Open Citation Identifier: Definition. Figshare. DOI: https://doi.org/10.6084/m9.figshare.7127816
  25. Ferguson, C., McEntrye, J., Bunakov, V., Lambert, S., van der Sandt, S., Kotarski, R., … McCafferty, S. (2018). Survey of Current PID Services Landscape (Deliverable No. D3.1). Retrieved from FREYA project (EC Grant Agreement No 777523) website: https://www.project-freya.eu/en/deliverables/freya_d3-1.pdf
  26. Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. DOI: https://doi.org/10.1038/sdata.2016.18
  27. Falco, R., Gangemi, A., Peroni, S., Shotton, D., Vitali, F. (2014). Modelling OWL Ontologies with Graffoo. In The Semantic Web: ESWC 2014 Satellite Events: 320–325. DOI: https://doi.org/10.1007/978-3-319-11955-7_42

Footnotes

1. An in-depth description about the definition and use of citations as first-class data entities can be found at https://opencitations.wordpress.com/2018/02/19/citations-as-first-class-data-entities-introduction/.
2. Additional information on this classification of Crossref reference lists is available at https://www.crossref.org/reference-distribution/.
3. We have access to the limited dataset since we are members of the Crossref Metadata Plus plan.
4. We are aware that the current practice for DOI URLs is to use the base https://doi.org/ instead of http://dx.doi.org/. However, when one tries to resolve a DOI URL owned by Crossref by specifying an RDF format (e.g. Turtle) in the accept header of the request, the bibliographic entity is actually defined using the old URL structure starting with http://dx.doi.org/. For this reason, since COCI is derived entirely from Crossref data, we decided to stay with the approach currently used by Crossref.

Using the ORCID Public API for author disambiguation in the OpenCitations Corpus

Among the external services used, the ORCID Public API is of crucial importance for the task of author disambiguation. During the OCC ingestion workflow, the main metadata of an article are usually retrieved from the Crossref API. While the JSON schema used by Crossref to return the information requested by its APIs includes a field for specifying the ORCID for each of the authors of an article, this field is usually blank, since such information is commonly not available in the data provided by publishers. We therefore routinely use the ORCID Public API to try to retrieve ORCIDs for all authors and editors named in the Crossref metadata for a given DOI.

The process is organised as follows. Once we get back from Crossref the metadata about an article, we call the ORCID Public API and search for ORCIDs associated with the family names returned by Crossref of all the authors and editors (‘agents’) associated with that particular DOI. For instance, using the Crossref metadata about the article with DOI “10.1108/jd-12-2013-0166” (API call: https://api.crossref.org/works/10.1108/jd-12-2013-0166), we extract all the agents’ family names and call the ORCID Public API as follows:

https://pub.orcid.org/v2.1/search?q=(doi-self:10.1108/JD-12-2013-0166%20OR%20doi-self:10.1108/jd-12-2013-0166)%20AND%20(family-name:Peroni%20OR%20family-name:Dutton%20OR%20family-name:Gray%20OR%20family-name:Shotton)

The result of this query returned by ORCID is as follows:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<search:search num-found="2" 
  xmlns:search="http://www.orcid.org/ns/search" 
  xmlns:common="http://www.orcid.org/ns/common">
  <search:result>
    <common:orcid-identifier>
      <common:uri>https://orcid.org/0000-0003-0530-4305</common:uri>
      <common:path>0000-0003-0530-4305</common:path>
      <common:host>orcid.org</common:host>
    </common:orcid-identifier>
  </search:result>
  <search:result>
    <common:orcid-identifier>
      <common:uri>https://orcid.org/0000-0003-1448-3114</common:uri>
      <common:path>0000-0003-1448-3114</common:path>
      <common:host>orcid.org</common:host>
    </common:orcid-identifier>
  </search:result>
</search:search>
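
The search step above can be sketched in Python with the standard library alone. The function names (orcid_search_url, extract_orcids) are hypothetical and this is not the actual OCC ingestion code; the query pattern and the sample response are copied from the example above (with the XML declaration omitted), and no network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

SEARCH_URL = "https://pub.orcid.org/v2.1/search?q="
NS = {
    "search": "http://www.orcid.org/ns/search",
    "common": "http://www.orcid.org/ns/common",
}


def orcid_search_url(doi, family_names):
    """Build the search URL for one DOI and its agents' family names."""
    # Both upper-case and lower-case forms of the DOI are searched, as in the example.
    dois = " OR ".join(f"doi-self:{d}" for d in (doi.upper(), doi.lower()))
    names = " OR ".join(f"family-name:{n}" for n in family_names)
    query = f"({dois}) AND ({names})"
    # Keep ( ) : / unescaped, as in the example call; spaces become %20.
    return SEARCH_URL + quote(query, safe="():/")


def extract_orcids(xml_text):
    """Return the ORCID paths listed in a search response."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iterfind(".//common:path", NS)]


SAMPLE = """<search:search num-found="2"
  xmlns:search="http://www.orcid.org/ns/search"
  xmlns:common="http://www.orcid.org/ns/common">
  <search:result>
    <common:orcid-identifier>
      <common:uri>https://orcid.org/0000-0003-0530-4305</common:uri>
      <common:path>0000-0003-0530-4305</common:path>
      <common:host>orcid.org</common:host>
    </common:orcid-identifier>
  </search:result>
  <search:result>
    <common:orcid-identifier>
      <common:uri>https://orcid.org/0000-0003-1448-3114</common:uri>
      <common:path>0000-0003-1448-3114</common:path>
      <common:host>orcid.org</common:host>
    </common:orcid-identifier>
  </search:result>
</search:search>"""

url = orcid_search_url(
    "10.1108/jd-12-2013-0166", ["Peroni", "Dutton", "Gray", "Shotton"]
)
print(url)
print(extract_orcids(SAMPLE))
```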

Then, for each ORCID returned, we call the ORCID Public API again, as shown below for ORCID “0000-0003-0530-4305”, so as to get the full personal details of the agent with that ORCID:

https://pub.orcid.org/v2.1/0000-0003-0530-4305/personal-details

The result of this query is shown as follows:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<personal-details:personal-details
  path="/0000-0003-0530-4305/personal-details"
  xmlns:personal-details="http://www.orcid.org/ns/personal-details"
  ...>
  <personal-details:name 
    visibility="public" path="0000-0003-0530-4305">
    ...
    <personal-details:given-names>
      Silvio
    </personal-details:given-names>
    <personal-details:family-name>
      Peroni
    </personal-details:family-name>
  </personal-details:name>
  ...
</personal-details:personal-details>
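
As an illustration only, the given names and family name could be read from such a response as follows. The sample below is a trimmed, well-formed stand-in for the response shown above (the elided parts are simply dropped, not reconstructed), and parse_personal_details is a hypothetical helper rather than the code used in the OCC workflow.

```python
import xml.etree.ElementTree as ET

PD_NS = {"pd": "http://www.orcid.org/ns/personal-details"}

# Trimmed stand-in for the response above; elided parts are left out.
SAMPLE_DETAILS = """<personal-details:personal-details
  path="/0000-0003-0530-4305/personal-details"
  xmlns:personal-details="http://www.orcid.org/ns/personal-details">
  <personal-details:name visibility="public" path="0000-0003-0530-4305">
    <personal-details:given-names>
      Silvio
    </personal-details:given-names>
    <personal-details:family-name>
      Peroni
    </personal-details:family-name>
  </personal-details:name>
</personal-details:personal-details>"""


def parse_personal_details(xml_text):
    """Return ORCID, given names and family name, or None if no name is public."""
    root = ET.fromstring(xml_text)
    name = root.find("pd:name", PD_NS)
    if name is None:
        return None

    def text_of(tag):
        # Element text is padded with whitespace in the response, so strip it.
        el = name.find(f"pd:{tag}", PD_NS)
        return el.text.strip() if el is not None and el.text else None

    return {
        "orcid": name.get("path"),
        "given_names": text_of("given-names"),
        "family_name": text_of("family-name"),
    }


details = parse_personal_details(SAMPLE_DETAILS)
print(details)
```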

Then, two alternative situations are possible:

  • If the OpenCitations Corpus has already recorded the personal details and ORCID of that agent, we associate that agent with the new bibliographic resource identified by the input DOI.
  • Otherwise, if the personal details and ORCID of that agent have not been previously recorded in the OpenCitations Corpus, we create a new agent record with that ORCID as external identifier, specified by means of the DataCite Ontology, and we associate this new agent with the new bibliographic resource identified by the input DOI.

This process is repeated for all ORCIDs associated with that DOI.
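
The reuse-or-create decision described above can be sketched as follows. This is a deliberately simplified model: a plain Python dictionary keyed by ORCID stands in for the OpenCitations Corpus, whereas the real workflow stores agents as RDF and records the ORCID by means of the DataCite Ontology. The Agent class and link_agent function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Simplified agent record (hypothetical; the corpus itself uses RDF)."""
    orcid: str
    given_names: str
    family_name: str
    resources: set = field(default_factory=set)  # DOIs linked to this agent


def link_agent(corpus, details, doi):
    """Reuse the agent with this ORCID if recorded, else create it; link it to the DOI."""
    agent = corpus.get(details["orcid"])
    if agent is None:
        # Not yet recorded: create a new agent with the ORCID as identifier.
        agent = Agent(details["orcid"], details["given_names"], details["family_name"])
        corpus[agent.orcid] = agent
    agent.resources.add(doi)
    return agent


corpus = {}  # stand-in for the corpus, keyed by ORCID
details = {
    "orcid": "0000-0003-0530-4305",
    "given_names": "Silvio",
    "family_name": "Peroni",
}
first = link_agent(corpus, details, "10.1108/jd-12-2013-0166")
# A second, made-up DOI for the same agent reuses the existing record.
second = link_agent(corpus, details, "10.9999/hypothetical-doi")
print(first is second, sorted(first.resources))
```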

Software reuse in different applications

While the OCC ingestion workflow explained above regulates the ingestion of new citation data directly into the OpenCitations Corpus, the particular software library that implements this ingestion is generic in form, and is being reused in another application that we have recently released in prototype, namely BCite (sources available on GitHub). BCite is a Web application that enables users such as journal editors, starting with the ‘raw’ reference text strings supplied by the author as items in an article’s reference list, to obtain ‘clean’ verified and enriched bibliographic reference text strings, for inclusion in the reference list of the citing article they have in hand, so that accurate rather than erroneous references can be published in the version of record. Additionally, these references are at the same time transformed into RDF data compliant with the OpenCitations Data Model, including ORCIDs where available, thereby (in principle, although not yet in practice) permitting inclusion of the metadata for these cited works, and the citations for which they are the targets, into the OpenCitations Corpus itself.


Crowdsourcing open citations with CROCI

An analysis of the current status of open citations, and a proposal

Author(s)
Ivan Heibi (ivan.heibi2@unibo.it)
Digital Humanities Advanced Research Centre (DHARC), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy
Silvio Peroni (silvio.peroni@unibo.it)
Digital Humanities Advanced Research Centre (DHARC), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy
David Shotton (david.shotton@oerc.ox.ac.uk)
Oxford e-Research Centre, University of Oxford, Oxford, United Kingdom

Keywords: Open citations; COCI; CROCI; Crossref; I4OC; OpenCitations

Copyright notice: This work is licensed under a Creative Commons Attribution 4.0 International License. You are free to share (i.e. copy and redistribute the material in any medium or format) and adapt (e.g. remix, transform, and build upon the material) for any purpose, even commercially, under the following terms: attribution, i.e. you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. The licensor cannot revoke these freedoms as long as you follow the license terms.

Notes: submitted as a research-in-progress paper to the 17th International Conference on Scientometrics and Informetrics (ISSI 2019, https://www.issi2019.org). PDF version available on arXiv.

Abstract

In this paper, we analyse the current availability of open citations data in one particular dataset, namely COCI (the OpenCitations Index of Crossref open DOI-to-DOI citations; http://opencitations.net/index/coci) provided by OpenCitations. The results of these analyses show a persistent gap in the coverage of the currently available open citation data. In order to address this specific issue, we propose a strategy whereby the community (e.g. scholars and publishers) can directly involve themselves in crowdsourcing open citations, by uploading their citation data via the OpenCitations infrastructure into our new index, CROCI, the Crowdsourced Open Citations Index.

Introduction

The availability of open scholarly citations is a public good, which is of intrinsic value to the academic world as a whole (Shotton, 2013; Peroni et al. 2015; Shotton, 2018), and is particularly crucial for the scientometrics and informetrics community, since it supports reproducibility (Sugimoto et al., 2017) and enables fairness in research by removing such citation data from behind commercial paywalls (Schiermeier, 2017). Despite the positive early outcome of the Initiative for Open Citations (I4OC, https://i4oc.org), namely that almost all major scholarly publishers now release their publication reference lists, with the result that more than 500 million citations are now open via the Crossref API (https://api.crossref.org), and despite the related ongoing efforts of sister infrastructures and initiatives such as OpenCitations (http://opencitations.net) and WikiCite/Wikidata (https://www.wikidata.org), many scholarly citations are not freely available. While these initiatives have the potential to disrupt the traditional landscape of citation availability, which for the past half-century has been dominated by commercial interests, the present incomplete coverage of open citation data is one of the most significant impediments to open scholarship (van Eck et al., 2018).

In this work, we analyse the current availability of open citations data (Peroni and Shotton, 2018) within one particular dataset, namely COCI (the OpenCitations Index of Crossref open DOI-to-DOI citations; http://opencitations.net/index/coci). This dataset is provided by OpenCitations, a scholarly infrastructure organization dedicated to open scholarship and the publication of open bibliographic and citation data by the use of Semantic Web (Linked Data) technologies. Launched in July 2018, COCI is the first of the Indexes proposed by OpenCitations (http://opencitations.net/index) in which citations are exposed as first-class data entities with accompanying properties. It has already seen widespread usage (over nine hundred thousand API calls since launch, with half of these in January 2019), and has been adopted by external services such as VOSviewer (van Eck and Waltman, 2010).

In particular, in this paper we address the following research questions (RQs):

  1. What is the ratio between open citations vs. closed citations within each category of scholarly entities included in COCI (i.e. journals, books, proceedings, datasets, and others)?
  2. Which are the top twenty publishers in terms of the number of open citations received by their own publications, according to the citation data available in COCI?
  3. To what degree are the publishers highlighted in the previous analysis themselves contributing to the open citations movement, according to the data available in Crossref?

The results of these analyses show a persistent gap in the coverage of the currently available open citation data. To address this specific issue, we have developed a novel strategy whereby members of the community of scholars, authors, editors and publishers can directly involve themselves in crowdsourcing open citations, by uploading their citation data via the OpenCitations infrastructure into our new index, CROCI, the Crowdsourced Open Citations Index.

Methods and material

To answer the RQs mentioned above, we used open data and technologies coming from various parties. Specifically, the open CC0 citation data we used came from the CSV dump of the most recent release of COCI dated 12 November 2018 (OpenCitations, 2018), which contains 449,840,503 DOI-to-DOI citation links between 46,534,705 distinct bibliographic entities. The Crossref dump we used for the production of this most recent version of COCI was dated 3 October 2018, and included all the Crossref citation data available at that time in both the ‘open’ dataset (accessible by all) and the ‘limited’ dataset (accessible only to users of the Crossref Cited-by service and to Metadata Plus members of Crossref, of which OpenCitations is one – for details, see https://www.crossref.org/reference-distribution/).

We additionally extracted information about the number of closed citations to each of the 99,444,883 DOI-identified entities available in the October Crossref dump. This number was calculated by subtracting the number of open citations to each entity available within COCI from the value “is-referenced-by-count” available in the Crossref metadata for that particular cited entity, which reports all the DOI-to-DOI citation links that point to the cited entity from within the whole Crossref database (including those present in the Crossref ‘closed’ dataset).
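
The subtraction described above can be written as a short Python sketch. The DOIs and counts below are made up purely for illustration, and closed_citation_counts is a hypothetical helper rather than the code used in the study (which is linked at the end of this post).

```python
def closed_citation_counts(crossref_counts, coci_open_counts):
    """Closed citations per cited DOI.

    crossref_counts: DOI -> Crossref "is-referenced-by-count" value
    coci_open_counts: DOI -> number of open citations to that DOI in COCI
    DOIs absent from COCI are treated as having zero open citations.
    """
    return {
        doi: total - coci_open_counts.get(doi, 0)
        for doi, total in crossref_counts.items()
    }


# Made-up example values (not taken from the Crossref dump).
crossref_counts = {"10.9999/a": 12, "10.9999/b": 3, "10.9999/c": 5}
coci_open_counts = {"10.9999/a": 9, "10.9999/c": 5}

closed = closed_citation_counts(crossref_counts, coci_open_counts)
print(closed)
```

With these made-up inputs, the first DOI has 3 closed citations, the second has all 3 of its citations closed, and the third has none.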

Furthermore, we extracted the particular publication type of each entity, so as to identify it either as a journal article, or as a book chapter, etc. We determined these publication types for all the DOI-identified entities available in the Crossref dump we used. We then identified the publisher of each entity, by querying the Crossref API using the entity’s DOI prefix. This allowed us to group the number of open citations and closed citations to the articles published by that particular publisher, and to determine the top twenty publishers in terms of the number of open citations that their own publications had received.
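
A minimal sketch of the grouping step is given below, assuming the mapping from DOI prefix to publisher has already been obtained (in the study it came from the Crossref API; here it is a small hand-written dictionary). All prefixes, publisher names and counts are invented for the example, and doi_prefix and top_publishers are hypothetical helpers.

```python
from collections import Counter


def doi_prefix(doi):
    """Return the part of a DOI before the first slash, e.g. '10.9991'."""
    return doi.split("/", 1)[0]


def top_publishers(open_counts, prefix_to_publisher, n=20):
    """Sum open citations per publisher and return the n largest totals."""
    totals = Counter()
    for doi, count in open_counts.items():
        publisher = prefix_to_publisher.get(doi_prefix(doi), "unknown")
        totals[publisher] += count
    return totals.most_common(n)


# Invented data for illustration only.
prefix_to_publisher = {"10.9991": "Publisher A", "10.9992": "Publisher B"}
open_counts = {
    "10.9991/x1": 40,
    "10.9991/x2": 15,
    "10.9992/y1": 30,
    "10.9993/z1": 2,  # prefix missing from the mapping
}

ranking = top_publishers(open_counts, prefix_to_publisher, n=2)
print(ranking)
```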

Finally, we again queried the Crossref API, this time using the DOI prefixes of the citing entities, to check the participation of these top twenty publishers in terms of the number of open citations they were themselves publishing in response to the open citation movement sponsored by I4OC. Details of all these analyses are available online in CC0 (Heibi et al., 2019).

Results

First (RQ1) we determined the numbers of open citations and closed citations received by the entities in the Crossref dump. All the entity types retrieved from Crossref were aligned to one of the following five categories: journal, book, proceedings, dataset, other, as illustrated in (Heibi et al., 2019). The outcomes are summarised in Figure 1, where it is evident that the number of open citations available in COCI is always greater than the number of closed citations to these entities to which COCI does not have access, for each of the publication categories considered, with the categories proceedings and dataset having the largest ratios.

Figure 1. The number of open citations (available in COCI) vs. closed citations (according to Crossref data) of the cited entities within COCI, analyzed and grouped according to five distinct categories. [Note that the vertical axis has a logarithmic scale].

Analysis of the Crossref data shows that there are in total ~4.1 million DOIs that have received no open citations and at least one closed citation. Conversely, there are ~10.7 million DOIs that have received no closed citations and at least one open citation in COCI. Most of the papers in both these categories have received very few citations.

The outcome of the second analysis (RQ2) shows which publishers are receiving the most open citations. To this end, we considered all the open citations recorded in COCI, and compared them with the number of closed citations to these same entities recorded in Crossref. Figure 2 shows the top twenty publishers that received the greatest number of open citations. Elsevier is the first publisher according to this ranking, but it also records the highest number of closed citations received (~97M vs. ~105.5M). The highest ratio in terms of open citations vs. closed citations was recorded by IEEE publications (ratio 6.25 to 1), while the lowest ratio was for the American Chemical Society (ratio 0.73 to 1).

Figure 2. The top twenty publishers sorted in decreasing order according to the number of open citations the entities they published have received, according to the open citation data within COCI. We accompany this count with the number of closed citations to the entities published by each of them according to the values available in Crossref, the total numbers of citations to these publishers’ entities, and the percentages of these totals that are open or closed.

Considering the twenty publishers listed in Figure 2, we wanted additionally to know their current support for the open citation movement (RQ3). The results of this analysis (made by querying the Crossref API on 24 January 2019) are shown in Figure 3. Among the top ten publishers shown in Figure 2, i.e. those who themselves received the largest numbers of open citations, only five, namely Springer Nature, Wiley, the American Physical Society, Informa UK Limited, and Oxford University Press, are participating actively in the open publication of their own citations through Crossref.

Figure 3. The contributions to open citations made by the twenty publishers listed in Figure 2, as of 24 January 2019, according to the data available through the Crossref API. The counts listed in this table refer to the number of publications for which each publisher has submitted metadata to Crossref that include the publication’s reference list. The categories closed, limited and open refer to publications for which the reference lists are not visible to anyone outside the Crossref Cited-by membership, are visible only to them and to Crossref Metadata Plus members, or are visible to all, respectively. Additional information on this classification of Crossref reference lists is available at https://www.crossref.org/reference-distribution/. The final column in the table shows the total number of publications for which the publisher has submitted metadata to Crossref, whether or not those metadata include the reference lists of those publications.

It is noteworthy that JSTOR contributes very few references to Crossref, while the many citations directed towards its own holdings place JSTOR twelfth in the list of publishers receiving open citations (Figure 2).  However, as the last column of Figure 3 shows, all the major publishers listed here are failing to submit reference lists to Crossref for a large number of the publications for which they submit metadata, that number being the difference between the value in the last column for that publisher and the combined values in the preceding three columns.  JSTOR is the worst in this regard, submitting references with only 0.53% of its deposits to Crossref, while the American Physical Society is the best, submitting references with 96.54% of its publications recorded in Crossref.

Additional information about these analyses, including the code and the data we have used to compute all the figures, is available as a Jupyter notebook at https://github.com/sosgang/pushing-open-citations-issi2019/blob/master/script/croci_nb.ipynb.

It should be stressed that a very large number of potentially open citations are totally missing in the Crossref database, and consequently from COCI, for the simple reason that many publishers, particularly smaller ones with limited technical and financial resources, but also all the large ones shown in Figure 3 and most of the others, are simply not depositing with Crossref the reference lists for any or all of their publications.

Discussion

According to the data retrieved, the open DOI-to-DOI citations available in COCI exceed the number of closed DOI-to-DOI citations recorded in Crossref for every publication category, as shown in Figure 1. The journal category is the one receiving the most open citations overall, as expected considering the historical and present importance of journals in most areas of the scholarly ecosystem. However, the number of closed citations to journal articles within Crossref is also of great significance, since these 322 million closed citations represent 43% of the total.

It is important to note that about one third of these closed citations to journal articles (according to Figure 2) are references to entities published by Elsevier, and that references from within Elsevier’s own publications constitute the largest proportion of these closed citations, since Elsevier is the largest publisher of journal articles. Thus Elsevier’s present refusal to open its article references is contributing significantly to the invisibility of Elsevier’s own publications within the corpus of open citation data that is being increasingly used by the scholarly community for discovery, citation network visualization and bibliometric analysis.

It is also worth mentioning the discrepancy between the citations available in COCI, which come from the data contained in the open and limited Crossref datasets as of 3 October 2018, and those available within those same Crossref datasets as of 24 January 2019. The most significant difference relates to IEEE. While the citations present in COCI include those from IEEE publications to other entities prior to November 2018 (since in October 2018 its article metadata with references were present within the Crossref limited dataset), in November 2018 this scholarly society decided to close the main part of its Crossref references, and thus from that moment they became unavailable to Crossref Metadata Plus members such as OpenCitations, as highlighted in Figure 3. Thus IEEE citations from articles whose metadata was submitted to Crossref after the date of this switch to closed will no longer be automatically ingested into COCI.

To date, the majority of the citations present in Crossref that are not available in COCI come from just three publishers: Elsevier, the American Chemical Society and University of Chicago Press (Figure 3). In fact, considering the average value of 18.6 DOI-to-DOI citation links for each citing entity – calculated by dividing the total number of citations in COCI by the number of citing entities in the same dataset – these three publishers are holding more than 214 million DOI-to-DOI citations that could potentially be opened. (The IEEE citation data which were in the Crossref ‘limited’ category as of October 2018 are actually included in COCI, although those from that organization’s more recent publications will no longer be, as mentioned above).
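
The arithmetic behind this estimate can be sketched in a few lines of Python. The sketch uses only the two figures quoted in the paragraph above (18.6 links per citing entity and 214 million citations); the function names are hypothetical, and the number of citing entities it prints is back-calculated from those two figures, not read from Figure 3.

```python
# Figures quoted in the text.
AVERAGE_LINKS_PER_CITING_ENTITY = 18.6
ESTIMATED_CLOSED_CITATIONS = 214_000_000  # stated as a lower bound


def estimate_citations(citing_entities, average_links=AVERAGE_LINKS_PER_CITING_ENTITY):
    """Estimate DOI-to-DOI citations held in a set of citing entities."""
    return citing_entities * average_links


def implied_citing_entities(citations, average_links=AVERAGE_LINKS_PER_CITING_ENTITY):
    """Invert the estimate: how many citing entities a citation total implies."""
    return citations / average_links


# Back-calculated value, derived only from the two constants above.
implied = implied_citing_entities(ESTIMATED_CLOSED_CITATIONS)
print(f"Implied citing entities with closed references: {implied:,.0f}")
```

In other words, an estimate of more than 214 million closed citations corresponds to roughly 11.5 million citing publications at the COCI average.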

We think it is deeply regrettable and almost incomprehensible that any professional organization, learned society or university press, whose primary mission is to serve the interests of the practitioners, scholars and readers it represents, should choose not to open all its publications’ reference lists as a public good, whatever secondary added-value services it chooses to build on top of the citations that those reference lists contain.

CROCI, the Crowdsourced Open Citations Index

The results of the Initiative for Open Citations (I4OC) have been remarkable, since its efforts have led to the liberation of millions of citations in a relatively short time. However, many more citations, the lifeblood of scholarly communication, are still not available to the general public, as mentioned in the previous section. Some researchers and journal editors, in particular, have recently started to interact with publishers that are not participating in I4OC, in attempts to convince them to release their citation data. Remarkable examples of these activities are the petition promoted by Egon Willighagen (https://tinyurl.com/acs-petition) addressed to the American Chemical Society, and the several unsuccessful requests made to Elsevier by the Editorial Board of the Journal of Informetrics, which eventually resulted in the resignation of the entire Editorial Board on 10 January 2019 in response to Elsevier’s refusal to address their issues (http://www.issi-society.org/media/1380/resignation_final.pdf).

To provide a pragmatic alternative that would permit the harvesting of currently closed citations, so that they could then be made available to the public, we at OpenCitations have created a new OpenCitations Index: CROCI, the Crowdsourced Open Citations Index, into which individuals identified by ORCID identifiers may deposit citation information that they have a legal right to submit, and within which these submitted citation data will be published under a CC0 public domain waiver to emphasize and ensure their openness for every kind of reuse without limitation. Since citations are statements of fact about relationships between publications (resembling statements of fact about marriages between individual persons), they are not subject to copyright, although their specific textual arrangements within the reference lists of particular publications may be. Thus the citations from which the reference list of an author’s publication has been composed may legally be submitted to CROCI, although the formatted reference list cannot be. Similarly, citations extracted from within an individual’s electronic reference management system and presented in the requested format may be legally submitted to CROCI, irrespective of the original sources of these citations.

To populate CROCI, we ask researchers, authors, editors and publishers to provide us with their citation data organised in a simple four-column CSV file (“citing_id”, “citing_publication_date”, “cited_id”, “cited_publication_date”), where each row depicts a citation from the citing entity (“citing_id”, giving the DOI of the citing entity) published on a certain date (“citing_publication_date”, with the date value expressed in ISO format “yyyy-mm-dd”), to the cited entity (“cited_id”, giving the DOI of the cited entity) published on a certain date (“cited_publication_date”, again with the date value expressed in ISO format “yyyy-mm-dd”). The submitted dataset may contain an individual citation, groups of citations (for example those derived from the reference lists of one or more publications), or entire citation collections. Should any of the submitted citations be already present within CROCI, these duplicates will be automatically detected and ignored.
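
As an illustration of the four-column layout just described, the following Python sketch writes such a file with the standard csv module. It is a minimal example and not OpenCitations' own tooling: the function name is hypothetical, the DOIs under the 10.1234 prefix are placeholders and not real citations, and the local removal of repeated citing/cited pairs is only a convenience for the submitter, since CROCI detects duplicates itself.

```python
import csv
import io

# Column names as given in the text for a CROCI submission file.
CROCI_COLUMNS = [
    "citing_id",
    "citing_publication_date",
    "cited_id",
    "cited_publication_date",
]


def write_croci_csv(citations, handle):
    """Write citation dicts as a four-column CSV, skipping repeated pairs.

    Returns the number of data rows written.
    """
    writer = csv.DictWriter(handle, fieldnames=CROCI_COLUMNS, lineterminator="\n")
    writer.writeheader()
    seen = set()
    written = 0
    for row in citations:
        # DOIs are case-insensitive, so compare them in lower case.
        key = (row["citing_id"].lower(), row["cited_id"].lower())
        if key in seen:
            continue
        seen.add(key)
        writer.writerow({name: row.get(name, "") for name in CROCI_COLUMNS})
        written += 1
    return written


# Placeholder data: these DOIs are invented for illustration only.
example_citations = [
    {"citing_id": "10.1234/example.citing.1", "citing_publication_date": "2019-01-15",
     "cited_id": "10.1234/example.cited.1", "cited_publication_date": "2013-10-16"},
    {"citing_id": "10.1234/example.citing.1", "citing_publication_date": "2019-01-15",
     "cited_id": "10.1234/example.cited.2", "cited_publication_date": "2017"},
    # Same pair as the first row, differing only in letter case: skipped.
    {"citing_id": "10.1234/EXAMPLE.citing.1", "citing_publication_date": "2019-01-15",
     "cited_id": "10.1234/example.cited.1", "cited_publication_date": "2013-10-16"},
]

buffer = io.StringIO()
rows_written = write_croci_csv(example_citations, buffer)
print(buffer.getvalue())
```

To produce a file for upload, pass a handle opened with open("citations.csv", "w", newline="", encoding="utf-8") in place of the in-memory buffer.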

The date information given for each citation should be as complete as possible, and minimally should be the publication years of the citing and cited entities. However, if such date information  is unavailable, we will try to retrieve it automatically using OpenCitations technology already available. DOIs may be expressed in any of a variety of valid alternative formats, e.g. “https://doi.org/10.1038/502295a”, “http://dx.doi.org/10.1038/502295a”, “doi: 10.1038/502295a”, “doi:10.1038/502295a”, or simply “10.1038/502295a”.
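
Contributors who want to check their data before uploading could normalise the DOI variants listed above with a sketch like the one below. The function names, the choice to lower-case the result, the DOI pattern (a common heuristic, not an official validity test), and the acceptance of year-only and year-month dates are all assumptions made for this example; they do not describe how OpenCitations processes submissions.

```python
import re

# Prefixes corresponding to the alternative DOI forms listed in the text.
_DOI_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)

# Heuristic shape of a bare DOI: "10.", a registrant code, "/", a suffix.
_BARE_DOI = re.compile(r"^10\.\d{4,9}/\S+$")

# Year, year-month, or full ISO date. Day-of-month is not checked against
# the month length, so "2013-02-31" would still be accepted.
_DATE = re.compile(r"^\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?$")


def normalize_doi(value):
    """Return the bare, lower-cased DOI, or None if the value is not DOI-like."""
    bare = _DOI_PREFIX.sub("", value.strip()).strip().lower()
    return bare if _BARE_DOI.match(bare) else None


def is_acceptable_date(value):
    """True for 'yyyy', 'yyyy-mm' or 'yyyy-mm-dd' strings."""
    return bool(_DATE.match(value.strip()))


variants = [
    "https://doi.org/10.1038/502295a",
    "http://dx.doi.org/10.1038/502295a",
    "doi: 10.1038/502295a",
    "doi:10.1038/502295a",
    "10.1038/502295a",
]
normalized = {normalize_doi(v) for v in variants}
print(normalized)
```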

An example of such a CSV citations file can be found at https://github.com/opencitations/croci/blob/master/example.csv. As an alternative to submissions in CSV format, contributors can submit the same citation data using the Scholix format (Burton et al., 2017) – an example of such format can be found at https://github.com/opencitations/croci/blob/master/example.scholix.

Submission of such a citation dataset in CSV or Scholix format should be made as a file upload either to Figshare (https://figshare.com) or to Zenodo (https://zenodo.org). For provenance purposes, the ORCID personal identifier of the submitter of these citation data should be explicitly provided in the metadata or in the description of the Figshare/Zenodo object. Once such a citation data file upload has been made, the submitter should inform OpenCitations of this fact by adding a new issue to the GitHub issue tracker of the CROCI repository (https://github.com/opencitations/croci/issues).

OpenCitations will then process each submitted citation dataset and ingest the new citation information into CROCI. CROCI citations will be available at http://opencitations.net/index/croci using an appropriate REST API and SPARQL endpoint, and will additionally be published as periodic data dumps in Figshare, all releases being under CC0 waivers. We propose in future to enable combined searches over all the OpenCitations indexes, including COCI and CROCI.

We are confident that the community will respond positively to this proposal of a simple method by which the number of open citations available to the academic community can be increased, especially since the data files to be uploaded have a very simple structure and thus should be easy to prepare. In particular, we hope for submissions of citations from within the reference lists of authors’ green OA versions of papers published by Elsevier, IEEE, ACS and UCP, and from publishers not already submitting publication metadata to Crossref, so as to address existing gaps in open citations availability. We look forward to your active engagement in this initiative to further increase the availability of open scholarly citations.

Acknowledgements

The authors would like to thank the SoS Gang (https://sosgang.github.io) for their support, and for having made available a space (https://github.com/sosgang/pushing-open-citations-issi2019) within which to share openly all the scripts and data developed for this study.

Postscript

On the 5th February 2019, after these analyses had been concluded and the text of this paper had been finalized, Crossref published an announcement that DOIs were missing from approximately 11% of its references (https://www.crossref.org/blog/underreporting-of-matched-references-in-crossref-metadata/), because of an historical fault in the manner in which Crossref automatically processes references upon deposit by publishers. This fault means that the absolute numbers of the open and closed DOI-to-DOI citations given in this paper are significantly lower than they should be. While this does not invalidate the comparisons we have reported here, it is clearly regrettable. In April 2019, after Crossref have completed the re-processing of their reference data to include the missing DOIs, we will create an updated version of COCI, and will then recompute and republish the data presented here to include those citations for which Crossref is presently failing to assign DOIs correctly to the cited entities.

References

Burton, A., Fenner, M., Haak, W. & Manghi, P. (2017). Scholix Metadata Schema for Exchange of Scholarly Communication Links (Version v3). Zenodo. DOI: https://doi.org/10.5281/zenodo.1120265

Heibi, I., Peroni, S. & Shotton, D. (2019). Types, open citations, closed citations, publishers, and participation reports of Crossref entities. Version 1. Zenodo. DOI: https://doi.org/10.5281/zenodo.2558257

OpenCitations (2018). COCI CSV dataset of all the citation data. Version 3. Figshare. DOI: https://doi.org/10.6084/m9.figshare.6741422.v3

Peroni, S., Dutton, A., Gray, T. & Shotton, D. (2015). Setting our bibliographic references free: towards open citation data. Journal of Documentation, 71: 253-277. DOI: https://doi.org/10.1108/JD-12-2013-0166

Peroni, S. & Shotton, D. (2018). Open Citation: Definition. Version 1. Figshare. DOI: https://doi.org/10.6084/m9.figshare.6683855

Schiermeier, Q. (2017). Initiative aims to break science’s citation paywall. Nature News. DOI: https://doi.org/10.1038/nature.2017.21800

Shotton, D. (2013). Open citations. Nature, 502: 295-297. DOI: https://doi.org/10.1038/502295a

Shotton, D. (2018). Funders should mandate open citations. Nature, 553: 129. DOI: https://doi.org/10.1038/d41586-018-00104-7

Sugimoto, C. R., Waltman, L., Larivière, V., van Eck, N. J., Boyack, K. W., Wouters, P. & de Rijcke, S. (2017). Open citations: A letter from the scientometric community to scholarly publishers. ISSI. http://www.issi-society.org/open-citations-letter/ (last visited 26 January 2018)

van Eck, N.J. & Waltman, L. (2010). Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84(2): 523-538. DOI: https://doi.org/10.1007/s11192-009-0146-3

van Eck, N.J., Waltman, L., Larivière, V. & Sugimoto, C. R. (2018). Crossref as a new source of citation data: A comparison with Web of Science and Scopus. CWTS Blog. https://www.cwts.nl/blog?article=n-r2s234 (last visited 26 January 2018)


The OpenCitations Enhancement Project – final report

The OpenCitations Enhancement Project
Final report for the Alfred P. Sloan Foundation

Report period: 1st May 2017 – 30 November 2018.
Report written: 30th December 2018

Background

OpenCitations (http://opencitations.net) is a scholarly infrastructure organization dedicated to open scholarship and the publication of open bibliographic and citation data by the use of Semantic Web (Linked Data) technologies, and engaged in advocacy for semantic publishing and open citations [Peroni and Shotton, 2018b]. It provides the OpenCitations Data Model [Peroni and Shotton, 2018d], the SPAR (Semantic Publishing and Referencing) Ontologies [Peroni and Shotton, 2018e] for encoding scholarly bibliographic and citation data in RDF, and open software of generic applicability for searching, browsing and providing APIs over RDF triplestores. It has developed the OpenCitations Corpus (OCC) [Peroni et al., 2017] of open downloadable bibliographic and citation data recorded in RDF, and a system and resolution service for Open Citation Identifiers (OCIs) [Peroni and Shotton, 2018c], and it is currently developing a number of Open Citation Indexes using the data openly available in third-party bibliographic databases.

The Directors of OpenCitations are David Shotton, Oxford e-Research Centre, University of Oxford (david.shotton@opencitations.net), and Silvio Peroni, Digital Humanities Advanced Research Centre, Department of Classical Philology and Italian Studies, University of Bologna (silvio.peroni@opencitations.net). We are committed to open scholarship, open data, open access publication, and open source software. We espouse the FAIR data principles developed by Force11, of which David Shotton was a founding member (https://www.force11.org/group/fairgroup/fairprinciples), and the aim of the Initiative for Open Citations (I4OC, https://i4oc.org), of which both David Shotton and Silvio Peroni were founding members, to promote the availability of citation data that is structured, separable, and open.

Project personnel and roles

Ivan Heibi – Research Fellow

Ivan Heibi was appointed to the 12-month Research Fellowship position funded by the Sloan Foundation.

Ivan has been responsible for the development of new visualization and programming interfaces for exploring and making sense of the citation data included in the OCC and in the new OpenCitations Indexes, for the main part of the scripts related to the population and regular maintenance of COCI (The OpenCitations Index of Crossref open DOI-to-DOI citations, the first of the OpenCitations Indexes), and for conference presentations and paper writing.

Silvio Peroni – Lead Applicant and Principal Investigator

Silvio has been responsible for project management, for interview, appointment and supervision of the work of Ivan Heibi, for all aspects of software coding and technical developments required for the OpenCitations Corpus (OCC), the OpenCitations Indexes, and the Open Citation Identifier Resolution Service, for the ordering and management of new Sloan-funded hardware, and for conference presentations, paper writing and other forms of outreach and dissemination (e.g. blog and social networks).

David Shotton – Consultant Co-Investigator

David has been responsible for project management, interaction with publishers, conference presentations, paper writing, other forms of outreach and dissemination (e.g. blog and social networks), for web site and data model revision, and for independent usability evaluation, stress-testing and design feedback of new user interfaces and applications.

Project management

Project management has been straightforward, as should be the case, given the small size of our team. It has involved more than 1500 e-mail exchanges between David Shotton and Silvio Peroni, about 500 e-mail exchanges between Silvio Peroni and Ivan Heibi since November 2017, two dozen or so video conferences, some of which have involved collaborators, and extended face-to-face meetings during the WikiCite 2017 Conference in Vienna at the start of the project and during the Workshop on Open Citations 2018 in Bologna.

We have together harmoniously developed the concept of and vision for OpenCitations as an infrastructure organization, the structure and content of the OpenCitations web site (http://opencitations.net), the classes and properties of our supporting SPAR ontologies (http://www.sparontologies.net), and the community of collaborators and users of our developments. This has involved outreach and dissemination at a number of international research conferences, involvement with publishers through the Initiative for Open Citations, and associated publications.

We have studied and preliminarily tested a new scalable architecture centred on one powerful independent physical server, that both stores and handles all the data in the Corpus and in the new OpenCitations Indexes, and also offers adequate performance for query services. This server is supplemented by 30 additional small physical machines, Raspberry Pi 3Bs, working in parallel, each in charge of ingesting a defined set of reference lists and feeding the ingested data to the central server for further processing and storage as RDF in our Blazegraph triplestore.

Current status of the OpenCitations services

Currently, we release two different datasets – the OpenCitations Corpus and COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations – and several interfaces so as to make these data queryable from different access points.

Functionality and holdings

As of 29th December 2018, the OpenCitations Corpus (OCC) contains information about 13,964,148 citation links to 7,565,367 cited resources, ingested from 326,743 citing bibliographic resources obtained from the Open Access corpus of Europe PubMed Central and from the citation data imported from the EXCITE project. The main part of the development effort in the past months has been spent in implementing ingestion strategies that allow partners to provide us with citation data, stored according to the OpenCitations Data Model [Peroni and Shotton, 2018b], so as to be added directly into the Corpus. In September 2018, we successfully completed the ingestion of the initial data coming from the EXCITE project (citations from social sciences scholarly papers published by German publishers), and we are actively interacting with the LOC-DB project and the Venice Scholar Index so as to add their data to the OCC as well. In ongoing work, we are also collaborating with arXiv and EXCITE to harvest all the references from all PDF documents in the arXiv ePrints collection, to record these in RDF according to the OpenCitations Data Model, and to ingest them into the OpenCitations Corpus, as well as creating a new Index of these citations.

The first of these OpenCitations Indexes is COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, an RDF dataset containing details of all the citations that are specified by the open references to DOI-identified works present in Crossref, as of the latest COCI update. COCI does not index Crossref references that are closed, nor Crossref references to entities that lack DOIs. These citations are treated as first-class data entities, with accompanying properties including the citation’s timespan and possible kinds of self-citation characteristics, modelled according to the index data model described in the OpenCitations Indexes page. COCI was launched in July 2018, and the most recent update of COCI is dated 12 November 2018. It presently contains 449,840,503 citations between 46,534,705 bibliographic resources. COCI is the first citation index released by OpenCitations, being a bibliographic index recording citations between publications that permits the user to establish which later documents cite earlier documents, and to create citation graphs of these citations.

While full coverage of the scholarly citation graph depicted by the aforementioned datasets (or as full as practically possible) is required for the calculation of certain bibliometric indicators such as journal impact factors and individual h-indexes (Hirsch numbers), partial coverage while OCC grows is still of value, since it includes citations of all the most important biomedical papers obtained from the Open Access Subset of PubMed Central. These can be easily recognized by their large number of inward citation links, and can be used to explore the development of disciplines and research trends. In addition, COCI, with its wider scope, has sufficient coverage to be used for large-scale bibliometric analysis.

Purchase of new hardware, and testing and development of new software

Because of unavoidable academic teaching commitments for Silvio Peroni, the installation of the new Sloan-funded hardware for OCC, purchased in autumn 2017, had to be postponed. All the services (old and new) of OpenCitations were successfully transferred to our new server in October 2018. In order to implement this transition, the ingestion process of the OCC was halted, so as to allow Silvio Peroni to properly test the new hardware and to extend the existing ingestion software so as to be usable within the new parallel processing architecture. The final tests are currently running, and we will recommence the full ingestion process of the OCC using the new hardware configuration, with its greatly enhanced ingest rate, in January 2019. In the meantime, the new infrastructure has been used to allow us to create COCI, so far the largest RDF dataset of open citation data available worldwide.

In addition, we have completed the transition of all the OpenCitations software from the old GitHub repository (i.e. https://github.com/essepuntato/opencitations) to a new GitHub organization, namely https://github.com/opencitations. This organisation includes several repositories which permit third parties to initiate the whole suite of OpenCitations software on a local machine. This is of key importance for the resilience of this open source project.

User interfaces

SPARQL, the query language used to interrogate RDF triplestores, is a quite powerful language. However, one needs appropriate skills with Semantic Web technologies to master it for solving even easy search tasks. Thus ordinary web users cannot make proper use of these technologies without suitable instruction, leaving them in the hands of a limited number of experts. The datasets made available by OpenCitations suffered from similar issues.

Initially, the goal we had was to develop ad-hoc user interfaces to abstract the complexities of the SPARQL endpoints into well-designed Web interfaces that anyone could use. During the development, though, we thought it would be better to develop generic frameworks for building customizable interfaces that allow one to expose, in a more human-understandable way, RDF data stored in any RDF triplestore and accessible through any SPARQL endpoint, so as to foster reuse of such software in contexts that, in principle, might go far beyond the OpenCitations domain.

To this end we developed three different open software applications:

  • OSCAR [Heibi et al., 2018a] [Heibi et al., 2018b], a Javascript application for creating textual search interfaces to RDF data;
  • LUCINDA, another Javascript application for creating Web browsers over RDF data;
  • RAMOSE, a Python application that permits one to easily create and serve a conventional HTTP REST API over a SPARQL endpoint.

All these tools provide a configurable mechanism – by means of one single textual configuration file – that allows one to generate Web-based interfaces to any SPARQL endpoint. In practice, these tools are flexible Web-interface makers to RDF data, and as such represent a significant advance available to the entire community.

All these applications have been used to produce several interfaces to all the datasets released by OpenCitations. In particular, we have created user-friendly textual search interfaces (via OSCAR) both for the OCC (see the search box now on the OpenCitations home page, and the related search page) and for COCI (see the related search page). We have additionally developed browsing applications (via LUCINDA) to permit humans an easier navigation of all the entities included in the OCC (e.g. see the bibliographic resource br/1791056) and in COCI (e.g. see the citation oci:02001010806360107050663080702026306630509-02001010806360107050663080702026305630301). Finally, we have also implemented REST HTTP APIs (via RAMOSE) for simplifying the queries to both datasets, the OCC and COCI, by Web developers with no expertise in Semantic Web technologies.

In addition, in order to demonstrate their flexibility, we have also created two web pages using OSCAR, LUCINDA, and RAMOSE for permitting similar tasks (text query, browsing, and REST APIs) on the scholarly data in Wikidata / WikiCite – another project recently funded by the Alfred P. Sloan Foundation. These interfaces have been introduced in two distinct events: during the hack day of the Workshop on Open Citations 2018 and during the WikiCite 2018 Conference.

A further prototypical interface / service has been recently proposed so as to try to gather additional open citation data to include in the OCC, involving users of the scholarly domain such as editors and researchers. This application is called BCite [Daquino et al., 2018]. BCite is designed to provide a full workflow for citation discovery, allowing users to specify the references as provided by the authors of an article, to retrieve them in the required format and style, to double-check their correctness, and, finally, to create new open citation data according to the OpenCitations Data Model [Peroni and Shotton, 2018d], so as to permit their future integration into the OCC. Although BCite is presently only a prototype, we have received several commendations for this tool, and we are currently studying funding strategies to develop a full standalone application that can be used by anyone and that allows users to directly interact with the OCC, so as to upload new data into the Corpus.

Open Citation Identifiers

During the reporting period, it became increasingly evident to us that citations deserved treating as First Class Data Entities, which would give the following advantages:

  • All the information regarding each citation would be available in one place.
  • Citations become easier to describe, distinguish, count and process.
  • If available in aggregate, citations become easier to analyze using bibliometric methods, for example to determine how citation time spans vary by discipline.

Four developments were required to make this possible:

  • The metadata describing the citation must be definable in a machine-readable manner.
  • Such metadata must be storable, searchable and retrievable.
  • Each citation must be identifiable, using a globally unique Persistent Identifier.
  • There must be a Web-based resolution service that takes the identifier as input and returns a description of the citation.

To this end, we have achieved:

  • the first requirement by the addition of appropriate classes and properties to CiTO, the Citation Typing Ontology, and by the addition of a new member of the class datacite:ResourceIdentifierScheme in the DataCite Ontology, namely Open Citation Identifier;
  • the second requirement by modifying the OpenCitations Data Model [Peroni and Shotton, 2018d] so that citations can be properly described using these new ontology terms within the OpenCitations Corpus;
  • the third requirement by creating the syntax for this new Open Citation Identifier (OCI) [Peroni and Shotton, 2018c], and enabling the creation of such identifiers, both to specify citations within the OpenCitations Corpus, and also (importantly) to specify citations described in Wikidata (by QIDs) and in Crossref (by DOIs); and
  • the fourth requirement by creating a resolving service for OCIs at http://opencitations.net/oci and additional software (in Python) for retrieving information about a particular citation identified by an OCI.
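
To make the identifier syntax concrete, here is a small Python sketch that splits an OCI into its two numeric parts and builds a resolver address. It is not the OpenCitations Python software mentioned above: the function names are hypothetical, the reading of the first number as the citing entity and the second as the cited entity reflects our understanding of the OCI definition, and appending the identifier to http://opencitations.net/oci to form a URL is an assumption. The example identifier is the COCI citation quoted earlier in this report.

```python
import re

# An OCI as it appears in the text: optional "oci:" prefix, then two
# numeric identifiers separated by a dash.
_OCI = re.compile(r"^(?:oci:)?(\d+)-(\d+)$")

# Resolver address given in the text; the path layout below is assumed.
RESOLVER_BASE = "http://opencitations.net/oci"


def parse_oci(value):
    """Split an OCI into (citing_number, cited_number), or raise ValueError."""
    match = _OCI.match(value.strip())
    if match is None:
        raise ValueError(f"not a well-formed OCI: {value!r}")
    return match.group(1), match.group(2)


def resolver_url(value):
    """Build an (assumed) resolver URL for the given OCI."""
    citing, cited = parse_oci(value)
    return f"{RESOLVER_BASE}/{citing}-{cited}"


example_oci = (
    "oci:02001010806360107050663080702026306630509"
    "-02001010806360107050663080702026305630301"
)
citing_number, cited_number = parse_oci(example_oci)
print(citing_number)
print(cited_number)
print(resolver_url(example_oci))
```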

To date, OCIs have been actively used both in the OCC and in COCI to identify all the citations they contain. In addition, we now plan to create and publish additional OpenCitations Indexes of all the citations in DataCite, Wikidata and (of course) the OCC, which we hope will be of great benefit to bibliometricians in their analysis of citation networks, self-citation, etc., and in the calculation of citation metrics. OCIs have been recognised by the EU Project FREYA as unique global identifiers for bibliographic citations.

Collaborations and Users

OpenCitations Data Model

We are collaborating with the following groups and academic projects, both to promote the use of the OpenCitations Data Model (OCDM), and to provide a publication venue for the citation data that they are liberating from the scholarly literature:

  • Matteo Romanello of the Digital Humanities Laboratory at the University of Lausanne is using OCDM for modelling citations of the classical literature within ancient Venetian documents in the context of the Venice Scholar Index, and is currently working on producing a dataset of citation data compliant with OCDM so as to be ingested in the OCC.
  • Two DFG-funded German projects that are extracting citations from Social Science publications:
    • The Linked Open Citations Database (LOC-DB) at the University of Mannheim is using OCDM to model its data, with the aim of producing data that comply with the model so that they can be ingested into the OCC.
    • Steffen Staab (University of Koblenz) and Philipp Mayr (GESIS) are running the EXCITE Project, which uses OCDM to model its citation data and has already adopted the OCC as its publication platform. In fact, in September 2018, ~1 million citations coming from the EXCITE Project were successfully ingested into the OCC.
  • Sergey Parinov is technically leading CitEcCyr, which is an open repository of citation relationships obtained from research papers in the Russian language and Cyrillic script. This project intends to model its citations using the OCDM, and will use the OpenCitations Corpus as its publication platform.

Users of OpenCitations data

The following projects and organizations have let us know that they are using data from the OCC and COCI:

  • Wikidata includes alignments between several bibliographic entries with OCC resources;
  • OpenAIRE imported OCC metadata about articles into their LOD database;
  • Daniel Ecer and Lisa Knoll of eLife performed analytics on the OCC data;
  • Ontotext demonstrated SPARQL query federation between Springer Nature LOD and OCC;
  • Anna Kamińska published a bibliometrics case study of PLOS ONE articles in OCC;
  • Daniel Himmelstein processed OpenCitations data to create DOI-to-DOI citation tables;
  • Thiago Nunes and Daniel Schwabe are using OCC to exemplify their XPlain framework;
  • Antonina Dattolo and Marco Corbatto are using the OCC as source for VisualBib framework;
  • Nees Jan van Eck and Ludo Waltman extended VOSviewer so that it can use data in the OCC and COCI, a feature that will be officially released in the next version of the tool;
  • Barney Walker developed Citation Gecko, a graph-based citation discovery tool based on the OCC and COCI for retrieving citation data about the papers;
  • Philipp Zumstein developed a Zotero plugin that gives information about open citations using COCI;
  • Dominique Rouger developed a Web application that provides a visual graph representation of citation links in COCI.

Website statistics

From May 2017 to November 2018, the official OpenCitations website was accessed ~5.8M times, not counting hits from well-known spiders and crawlers. It is worth mentioning that the pages related to the available data and the services for querying them (i.e. “/corpus”, “/sparql”, and “/index” in the following diagram) together account for a very high percentage of the overall accesses, showing that the main reason people access the OpenCitations website is to explore and use the data in the OCC and in COCI. It is also clear that the introduction of COCI brought additional accesses to the OpenCitations services, and the trend is increasing: for example, in December 2018 (not shown in the following diagram) we got more than 200M accesses to “/index”, mainly related to the use of the COCI REST APIs.

Community outreach

From May 2017 to November 2018, the documents (i.e. [Peroni and Shotton, 2018b] [Peroni and Shotton, 2018c] [Peroni and Shotton, 2018d]) and the dumps of the OCC and of COCI available on Figshare (see http://opencitations.net/download) have been viewed 37,960 times and downloaded 3,842 times. The figure below summarizes how many views and downloads such resources have received month by month. For example, the latest version of COCI in CSV has been downloaded 239 times since its release in November (see https://doi.org/10.6084/m9.figshare.6741422.v3).

In the past nineteen months, the posts published by the official Twitter account of OpenCitations have been engaged by 599,200 distinct Twitter accounts, the Twitter profile (@opencitations) has been visited 12,719 times and has been mentioned in 565 tweets written by others, and it has collected an additional 1,925 followers. The diagram below shows all these statistics month by month.

Although we made relatively few new blog posts in the reporting period (there have been twenty more since then), from May to November 2018, the blog dedicated to OpenCitations (https://opencitations.wordpress.com) received 19,454 visits by 15,184 distinct users. As shown in the diagram below, the biggest peaks in visits were in July 2018 (the month when we launched COCI) and in September 2018 (the month of the Workshop on Open Citations).

Users of the SPAR Ontologies

The SPAR Ontologies [Peroni and Shotton, 2018e] are in use by about 40 other projects and organizations, including:

  • The United States Global Change Information System, which encodes federal information relating to climate change, makes extensive use of SPAR ontology terms.
  • The United Nations Document Ontology (UNDO) has been specifically aligned with FaBiO.
  • Wikidata has many classes that have been aligned with FaBiO or CiTO.
  • DBPedia’s DataID ontology uses the FaBiO and DataCite ontologies.
  • W3C’s Data on the Web Best Practices: Dataset Usage Vocabulary uses SPAR Ontologies.

For the full list, see http://www.sparontologies.net/uptake.

To date, as far as we are aware, more than 677 papers have been published that cite or use one or more of the SPAR ontologies. For the full list, see http://www.sparontologies.net/uptake#publications.

OpenCitations and the Initiative for Open Citations

OpenCitations and the Initiative for Open Citations, despite the similarity of their names, are two distinct organizations. The primary purpose of OpenCitations is to build and host the OpenCitations Corpus (OCC) and the OpenCitations Indexes, as well as additional services to browse, query, and analyse citation data. In contrast, the Initiative for Open Citations (I4OC, https://i4oc.org) is a separate and independent organization, whose founding was spearheaded by Dario Taraborelli of the Wikimedia Foundation. OpenCitations is one of several founding members of the Initiative for Open Citations, as documented at https://i4oc.org/#founders. I4OC is a pressure group that promotes the unrestricted availability of scholarly citation data, but does not itself host citation data.

Because open reference lists are necessary for the population of OCC, we at OpenCitations have devoted considerable effort to promoting I4OC’s aims, and we host the I4OC web site on behalf of that community.

Within a short space of time, I4OC has persuaded most of the major scholarly publishers to open their reference lists submitted to Crossref, so that the proportion of all references submitted to Crossref that are now open has risen from 1% to over 50%. These are now available for OpenCitations to harvest into the OpenCitations Corpus and publish in RDF, as well as for others to harvest and use as they wish [Shotton, 2018].

Publications during the reporting period

Scholarly papers

Marilena Daquino, Ilaria Tiddi, Silvio Peroni, David Shotton (2018). Creating Open Citation Data with BCite. In Emerging Topics in Semantic Technologies – ISWC 2018 Satellite Events: 83-93. DOI: https://doi.org/10.3233/978-1-61499-894-5-83, OA at http://ceur-ws.org/Vol-2184/paper-01.pdf

Ivan Heibi, Silvio Peroni, David Shotton (2018). Enabling text search on SPARQL-endpoints through OSCAR. Submitted for publication to Data Science – Methods, Infrastructure, and Applications. OA at https://w3id.org/people/essepuntato/papers/oscar-datascience2019/

Ivan Heibi, Silvio Peroni, David Shotton (2018). OSCAR: A Customisable Tool for Free-Text Search over SPARQL Endpoints. In Semantics, Analytics, Visualization – 3rd International Workshop, SAVE-SD 2017, and 4th International Workshop, SAVE-SD 2018, Revised Selected Papers: 121-137. DOI: https://doi.org/10.1007/978-3-030-01379-0_9, OA at https://w3id.org/people/essepuntato/papers/oscar-savesd2018.html

Silvio Peroni, David Shotton (2018). OpenCitations: enabling the FAIR use of open citation data. In Proceedings of the GARR Conference 2017 – The data way to Science – Selected Papers. DOI: https://doi.org/10.26314/GARR-Conf17-proceedings-19

Silvio Peroni, David Shotton (2018). The SPAR Ontologies. In Proceedings of the 17th International Semantic Web Conference (ISWC 2018): 119-136. DOI: https://doi.org/10.1007/978-3-030-00668-6_8

Silvio Peroni, David Shotton, Fabio Vitali (2017). One year of the OpenCitations Corpus: Releasing RDF-based scholarly citation data into the Public Domain. In Proceedings of the 16th International Semantic Web Conference (ISWC 2017): 184-192. DOI: https://doi.org/10.1007/978-3-319-68204-4_19, OA at https://w3id.org/people/essepuntato/papers/oc-iswc2017.html

David Shotton (2018). Funders should mandate open citations. Nature 553: 129. https://doi.org/10.1038/d41586-018-00104-7

Additional documents

Silvio Peroni, David Shotton (2018). Open Citation: Definition. Figshare. DOI: https://doi.org/10.6084/m9.figshare.6683855.v1

Silvio Peroni, David Shotton (2018). Open Citation Identifier: Definition. Figshare. DOI: https://doi.org/10.6084/m9.figshare.7127816.v1

Silvio Peroni, David Shotton (2018). The OpenCitations Data Model. Figshare. DOI: https://doi.org/10.6084/m9.figshare.3443876.v5

Blog posts

6 May 2017: Querying the OpenCitations Corpus

15 May 2017: The Sloan Foundation funds OpenCitations

24 Nov 2017: Milestone for I4OC – open references at Crossref exceed 50%

24 Nov 2017: Elsevier references dominate those that are not open at Crossref

28 Nov 2017: Openness of non-Elsevier references

9 Jan 2018: The new Crossref reference distribution policy

9 Jan 2018: Barriers to comprehensive reference availability

15 Jan 2018: Funders should mandate open citations

16 Jan 2018: Oxford University Press opens its references!

29 Jan 2018: OpenCitations and the Initiative for Open Citations: A Clarification

19 Feb 2018: Citations as First-Class Data Entities: Introduction

22 Feb 2018: Citations as First-Class Data Entities: Citation Descriptions

25 Feb 2018: Citations as First-Class Data Entities: The OpenCitations Data Model

4 Mar 2018: Citations as First-Class Data Entities: The OpenCitations Corpus

12 Mar 2018: Citations as First-Class Data Entities: Open Citation Identifiers

15 Mar 2018: Citations as First-Class Data Entities: The Open Citation Identifier Resolution Service

23 Mar 2018: Early adopters of the OpenCitations Data Model

17 Apr 2018: Workshop on Open Citations

12 Jul 2018: COCI, the OpenCitations Index of Crossref open DOI-to-DOI references

19 Nov 2018: New release of COCI: 450M DOI-to-DOI citation links now available

Conference presentations and other outreach

Workshop on Open Citations 2018

OpenCitations, the EXCITE Project and Europe PubMed Central ran the first Workshop on Open Citations (Twitter: @workshop_oc) at the University of Bologna in Bologna, Italy, on 3-5 September 2018. It was organised as follows:

  • Day One and Day Two: Formal presentations and discussions on the creation, availability, uses and applications of open bibliographic citations, and of bibliometric studies based upon them;
  • Day Three: A Hack Day on Open Citations to see what services can be prototyped using large volumes of open citation data.

The workshop involved 60 participants, including researchers, computer scientists, scholarly publishers, academic administrators, research funders and policy makers. The workshop was organised around the following topics:

  • Opening up citations: Initiatives, collaborations, methods and approaches for the creation of open access to bibliographic citations;
  • Policies and funding: Strategies, policies and mandates for promoting open access to citations, and transparency and reproducibility of research and research evaluation;
  • Publishers and learned societies: Approaches to, benefits of, and issues surrounding the deposit, distribution, and services for open bibliographic metadata and citations;
  • Projects: Metrics, visualizations and other projects. The uses and applications of open citations, and bibliometric analyses and metrics based upon them.

All the sessions were recorded and are available on the official YouTube channel of the University of Bologna, and are linked (together with slides) at the website of the workshop – https://workshop-oc.github.io.

A further Workshop on Open Citations is being planned for autumn 2019.

Presentations

We have made conference presentations on OpenCitations, the Initiative for Open Citations, and Open Citation Identifiers at the following international conferences and workshops:

WikiCite Conference 2017, Vienna, 23 May 2017, https://www.slideshare.net/essepuntato/opencitations (Silvio Peroni and David Shotton)

COASP 9, 9th Conference of Open Access Scholarly Publishing, Lisbon, 20 September 2017, https://www.slideshare.net/essepuntato/the-initiative-for-open-citations-and-the-opencitations-corpus (David Shotton)

SemSci 2017, 1st International Workshop on Enabling Open Semantic Science, Vienna, 21 October 2017, https://w3id.org/people/essepuntato/presentations/the-open-citations-revolution.html (Silvio Peroni)

ISWC 2017, 16th International Semantic Web Conference, Vienna, 24 October 2017, https://w3id.org/people/essepuntato/presentations/oc-iswc2017.html (Silvio Peroni)

FORCE 2017, Research Communication and e-Scholarship Conference, Berlin, 27 October 2017, http://w3id.org/people/essepuntato/presentations/oc-force2017.html (Silvio Peroni)

Linked Open Citation Database (LOC-DB) Workshop, Mannheim, 7 November 2017, https://locdb.bib.uni-mannheim.de/wordpress/wp-content/uploads/2017/10/Shotton-LOC-DB-Mannheim.pdf (David Shotton)

GARR Conference 2017, Venice, 16 November 2017, https://www.eventi.garr.it/it/documenti/conferenza-garr-2017/presentazioni-2/232-conf2017-presentazione-peroni/file (Silvio Peroni)

OpenCon 2017, Oxford, 1 December 2017, https://doi.org/10.6084/m9.figshare.5844981.v1 (David Shotton)

PIDapalooza Conference of Persistent Identifiers, Girona, 24 January 2018, https://doi.org/10.6084/m9.figshare.5844972.v2 (David Shotton)

2018 International Workshop on Semantics, Analytics, Visualisation: Enhancing Scholarly Dissemination, Lyon, 24 April 2018, https://doi.org/10.6084/m9.figshare.7531577.v1 (Ivan Heibi)

Workshop on Open Citations 2018, Bologna, 3 September 2018, https://workshop-oc.github.io/presentations/D1S3_David_Shotton.pdf (David Shotton)

Workshop on Open Citations 2018, Bologna, 4 September 2018, https://docs.google.com/presentation/d/1mybQmjhFY6kLtTE1TdONaxsl0nSjmRGOSCnFMTwfzWQ/edit?usp=sharing (Silvio Peroni)

The 5th Conference on Scholarly Publishing in the Context of Open Science (PUBMET 2018), Zadar, 20 September 2018, https://doi.org/10.6084/m9.figshare.7110653.v3 (Silvio Peroni)

The 17th International Semantic Web Conference (ISWC 2018), Monterey, 12 October 2018, https://doi.org/10.6084/m9.figshare.7151759.v1 (Silvio Peroni)

WikiCite Conference 2018, Berkeley, 27 November 2018, https://doi.org/10.6084/m9.figshare.7396667.v1 (Ivan Heibi)

A further presentation on Open Citation Identifiers will be given at the 2019 PIDapalooza Conference of Persistent Identifiers in Dublin in January 2019.

Tweets

We have tweeted about the project and related matters under the names @opencitations, @dshotton, @essepuntato, @ivanHeiB, @workshop_oc, and @i4oc_org.

Future sustainability

While presently the OpenCitations Corpus has only partial coverage, our aim is that OpenCitations should become a comprehensive source of open citation information from all disciplines of scholarly endeavour, used on a daily basis by scholars worldwide, to equal or better the commercial offerings from Clarivate Analytics (Web of Science) and Elsevier (Scopus).

We also wish to develop effective graphical user interfaces to explore the citation network, and analytical tools over our open data. Since the OCC and COCI data are all open and available for others also to build such tools, we anticipate that such developments will best be undertaken collaboratively, under some open community organization, and indeed such development is currently being undertaken in collaboration with colleagues from CWTS at the University of Leiden, famous for their development of VOSviewer.

In order to fully support open scholarship, OpenCitations needs to mature from being an academic research and development project to become a recognised scholarly infrastructure service such as PubMed. We wish to avoid becoming a commercial company, and see our development better served by being ‘adopted’ by a major established scholarly institution, such as a national or university library or an internationally recognised centre providing scholarly bibliographic services, that has already shown a commitment to open scholarship, and where the interaction between that institution and OpenCitations would be mutually beneficial. To this end, we are currently in the mid-phase of negotiations with two institutions.

Conclusions

The grantees wish to express their deep gratitude to the Alfred P. Sloan Foundation for financial support enabling them to undertake the OpenCitations Enhancement Project, without which the rapid developments reported here would not have been possible.

References

Marilena Daquino, Ilaria Tiddi, Silvio Peroni, David Shotton (2018). Creating Open Citation Data with BCite. In Emerging Topics in Semantic Technologies – ISWC 2018 Satellite Events: 83-93. DOI: https://doi.org/10.3233/978-1-61499-894-5-83, OA at http://ceur-ws.org/Vol-2184/paper-01.pdf

Ivan Heibi, Silvio Peroni, David Shotton (2018a). Enabling text search on SPARQL-endpoints through OSCAR. Submitted for publication to Data Science – Methods, Infrastructure, and Applications. OA at https://w3id.org/people/essepuntato/papers/oscar-datascience2019/

Ivan Heibi, Silvio Peroni, David Shotton (2018b). OSCAR: A Customisable Tool for Free-Text Search over SPARQL Endpoints. In Semantics, Analytics, Visualization – 3rd International Workshop, SAVE-SD 2017, and 4th International Workshop, SAVE-SD 2018, Revised Selected Papers: 121-137. DOI: https://doi.org/10.1007/978-3-030-01379-0_9, OA at https://w3id.org/people/essepuntato/papers/oscar-savesd2018.html

Silvio Peroni, Alexander Dutton, Tanya Gray, David Shotton (2015). Setting our bibliographic references free: towards open citation data. Journal of Documentation, 71: 253-77. DOI: https://doi.org/10.1108/JD-12-2013-0166, OA at http://speroni.web.cs.unibo.it/publications/peroni-2015-setting-bibliographic-references.pdf

Silvio Peroni, David Shotton (2018a). OpenCitations: enabling the FAIR use of open citation data. In Proceedings of the GARR Conference 2017 – The data way to Science – Selected Papers. DOI: https://doi.org/10.26314/GARR-Conf17-proceedings-19

Silvio Peroni, David Shotton (2018b). Open Citation: Definition. Figshare. DOI: https://doi.org/10.6084/m9.figshare.6683855.v1

Silvio Peroni, David Shotton (2018c). Open Citation Identifier: Definition. Figshare. DOI: https://doi.org/10.6084/m9.figshare.7127816.v1

Silvio Peroni, David Shotton (2018d). The OpenCitations Data Model. Figshare. DOI: https://doi.org/10.6084/m9.figshare.3443876.v5

Silvio Peroni, David Shotton (2018e). The SPAR Ontologies. In Proceedings of the 17th International Semantic Web Conference (ISWC 2018): 119-136. DOI: https://doi.org/10.1007/978-3-030-00668-6_8

Silvio Peroni, David Shotton, Fabio Vitali (2016a). Building Citation Networks with SPACIN. Knowledge Engineering and Knowledge Management – EKAW 2016 Satellite Events, EKM and Drift-an-LOD, Revised Selected Papers: 162-166. DOI: https://doi.org/10.1007/978-3-319-58694-6_23, OA at https://w3id.org/oc/paper/spacin-demo-ekaw2016.html

Silvio Peroni, David Shotton, Fabio Vitali (2016b). Freedom for bibliographic references: OpenCitations arise. In Proceedings of 2016 International Workshop on Linked Data for Information Extraction (LD4IE 2016): 32-43. http://ceur-ws.org/Vol-1699/paper-05.pdf

Silvio Peroni, David Shotton, Fabio Vitali (2017). One year of the OpenCitations Corpus: Releasing RDF-based scholarly citation data into the Public Domain. In Proceedings of the 16th International Semantic Web Conference (ISWC 2017): 184-192. DOI: https://doi.org/10.1007/978-3-319-68204-4_19, OA at https://w3id.org/people/essepuntato/papers/oc-iswc2017.html

David Shotton (2013). Open citations. Nature, 502: 295-297. https://doi.org/10.1038/502295a

David Shotton (2018). Funders should mandate open citations. Nature 553: 129. https://doi.org/10.1038/d41586-018-00104-7


The Wellcome Trust funds OpenCitations

The Open Biomedical Citations in Context Corpus funded by the Wellcome Trust

The Wellcome Trust, which funds research in big health challenges and campaigns for better science, has agreed to fund The Open Biomedical Citations in Context Corpus, a new project to enhance the OpenCitations Corpus, as part of the Open Research Fund programme.

As readers of this blog will know, the OpenCitations Corpus is an open scholarly citation database that freely and legally makes available accurate citation data (academic references) to assist scholars with their academic studies, and to serve knowledge to the wider public.

Objectives

The Open Biomedical Citations in Context Corpus, funded by the Wellcome Trust for 12 months from March 2019, will make the OpenCitations Corpus (OCC) more useful to the academic community by significantly expanding the kinds of citation data held within the Corpus, so as to provide data for each individual in-text reference and its semantic context, making it possible to distinguish references that are cited only once from those that are cited multiple times, to see which references are cited together (e.g. in the same sentence), to determine in which section of the article references are cited (e.g. Introduction, Methods), and, potentially, to retrieve the function of the citation.

At OpenCitations, we will achieve these objectives in the following ways:

  • by extending the OpenCitations Data Model so as to describe how the in-text reference data should be modeled in RDF for inclusion in the OpenCitations Corpus;
  • by developing scripts for extracting in-text references from articles within the Open Access Subset of biomedical literature hosted by Europe PubMed Central;
  • by extending the existing ingestion workflow so as to add the new in-text reference data into the Corpus;
  • by developing appropriate user interfaces for querying and browsing these new data.
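The kinds of questions listed in the objectives can be illustrated with a small sketch. The record layout below is purely hypothetical (the extension of the OpenCitations Data Model has yet to be defined) and the identifiers are made up. It only shows how per-pointer data would let one count repeated mentions, find references cited in the same sentence, and list the sections in which a work is cited.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from itertools import combinations
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class InTextReference:
    """One in-text reference pointer (hypothetical layout, not the OCDM)."""
    cited: str     # identifier of the cited work
    section: str   # e.g. "Introduction", "Methods"
    sentence: int  # article-wide index of the sentence holding the pointer


def mention_counts(pointers: Iterable[InTextReference]) -> Counter:
    """How many times each work is referred to in the text."""
    return Counter(p.cited for p in pointers)


def cited_together(pointers: Iterable[InTextReference]) -> Set[Tuple[str, str]]:
    """Pairs of works referred to within the same sentence."""
    by_sentence = defaultdict(set)
    for p in pointers:
        by_sentence[p.sentence].add(p.cited)
    pairs = set()
    for works in by_sentence.values():
        pairs.update(combinations(sorted(works), 2))
    return pairs


def sections_citing(pointers: Iterable[InTextReference], cited: str) -> Set[str]:
    """Sections of the article in which a given work is referred to."""
    return {p.section for p in pointers if p.cited == cited}


if __name__ == "__main__":
    sample = [
        InTextReference("work-A", "Introduction", 0),
        InTextReference("work-B", "Introduction", 0),
        InTextReference("work-A", "Methods", 5),
    ]
    print(mention_counts(sample))
    print(cited_together(sample))
    print(sections_citing(sample, "work-A"))
```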

Personnel

We are looking for a post-doctoral computer scientist / research engineer specifically to achieve the aforementioned objectives. This post-doctoral appointment will start on 1 March 2019. We seek a highly intelligent, skilled and motivated individual who is expert in Python, Semantic Web technologies, Linked Data and Web technologies. Additional expertise in Web Interface Design and Information Visualization would be highly beneficial, as would a strong and demonstrable commitment to open science and good team-working abilities.

The minimum formal requirement for this position is a Master's degree in computer science, computer science and engineering, telecommunications engineering, or an equivalent qualification, but it is expected that the successful applicant will have had research experience leading to a doctoral degree. The position has a net salary (exempt from income tax, after deduction of social security contributions) in excess of 23K euros per year.

The formal advertisement for this post (which will be held at the Digital Humanities Advanced Research Centre (DHARC), Department of Computer Classical Philology and Italian Studies, University of Bologna, Italy, under the supervision of Dr Silvio Peroni) is published online, accompanied by the activity plan (in Italian and English). Applications must be submitted exclusively online by logging in to the website https://concorsi.unibo.it (the default language is Italian, but there is a link to switch to English). People who do not have a @unibo.it email account must register on the platform. The deadline for applications is 25 January 2019 at 15:00 Central European Time. Please feel free to contact Silvio Peroni (silvio dot peroni at unibo dot it) for further information.

People involved

The people formally involved in the project are:

  • Vincent Larivière – École de Bibliothéconomie et des Sciences de l’Information, Université de Montréal, Canada;
  • Silvio Peroni (Principal Investigator) – Digital Humanities Advanced Research Centre (DHARC), Department of Computer Classical Philology and Italian Studies, University of Bologna, Italy, and Director of OpenCitations;
  • David Shotton – Oxford e-Research Centre, University of Oxford, Oxford, UK, and Director of OpenCitations;
  • Ludo Waltman – Centre for Science and Technology Studies (CWTS), Leiden University, Netherlands.

In addition, the project is supported by Europe PubMed Central (EMBL-EBI, Hinxton, UK).


New release of COCI: 445M DOI-to-DOI citation links now available

As introduced in a previous blog post, COCI is the OpenCitations Index of Crossref open DOI-to-DOI references, all released as CC0 material. It is our first OpenCitations Index of open citations, in which we have applied the concept of citations as first-class data entities to index the contents of one of the major databases of open scholarly citation information, namely Crossref, and to render and make available this information in machine-readable RDF.

We are now proud to announce a new release of COCI, the second, which now contains over 445 million DOI-to-DOI citation links coming from both the ‘Open’ and the ‘Limited’ sets of Crossref reference data. This represents an increase of 42% in the number of indexed citations, compared with the initial release of COCI on 4th June 2018, which indexed 316,243,802 citations involving 45,145,889 bibliographic resources. In addition, the data model for COCI has now been extended so as to state directly the presence of journal self-citations and author self-citations.

Extended data model

The previous data model used for storing the citation data in COCI – which is itself a subset of the OpenCitations Data Model – has been extended so as to keep track of two particular types of self-citation, as shown in the following figure.

The new data model used in COCI for describing its citation data, which includes classes for describing two kinds of self-citations, i.e. journal self-citations and author self-citations.

Generally speaking, a self-citation is a citation in which the citing and the cited entities have something significant in common with one another, over and beyond their subject matter. The two kinds of self-citations we are now tracking are:

  • journal self-citation (class cito:JournalSelfCitation), i.e. a citation in which the citing and the cited entities are published in the same journal. This information has been obtained by comparing the ISSNs of the journals in which the two journal articles related by a citation have been published, as provided by Crossref. If they share the same ISSN, then the citation is described as a journal self-citation;
  • author self-citation (class cito:AuthorSelfCitation), i.e. a citation in which the citing and the cited entities have at least one author in common. This information has been obtained by comparing the ORCIDs associated with the authors of a citing bibliographic entity with the ORCIDs of the authors of the cited entity. In this case, if any ORCID is shared, then the citation is described as an author self-citation. This categorization excludes authors bearing the same name where the ORCIDs are not known, since, while these instances may be author self-citations, they may alternatively merely represent name coincidences of distinct individuals.
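The two comparisons described above can be sketched in a few lines of Python. This is an illustration only, not the code used to build COCI. It assumes that each article comes with a list of ISSNs and a list of author ORCIDs (with None where an ORCID is unknown), and the helper names and sample identifiers are made up.

```python
from typing import Iterable, Optional, Set


def _clean(values: Iterable[Optional[str]]) -> Set[str]:
    """Drop missing values and normalise whitespace and case."""
    return {v.strip().upper() for v in values if v and v.strip()}


def is_journal_self_citation(citing_issns: Iterable[Optional[str]],
                             cited_issns: Iterable[Optional[str]]) -> bool:
    """True if the citing and cited articles share at least one ISSN."""
    return bool(_clean(citing_issns) & _clean(cited_issns))


def is_author_self_citation(citing_orcids: Iterable[Optional[str]],
                            cited_orcids: Iterable[Optional[str]]) -> bool:
    """True if at least one known ORCID appears on both articles.

    Authors without an ORCID are ignored, so matching names alone never
    count as a self-citation.
    """
    return bool(_clean(citing_orcids) & _clean(cited_orcids))


if __name__ == "__main__":
    # Made-up identifiers, used only to show the calls.
    print(is_journal_self_citation(["1234-5678"], ["1234-5678", "8765-4321"]))
    print(is_author_self_citation([None, "0000-0001-0000-0001"], [None]))
```

Treating missing ORCIDs as "no evidence" rather than as a possible match mirrors the exclusion of name coincidences described above.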

It is worth mentioning that, while ISSN information is usually present in the data returned by Crossref, ORCID data associated with the authors of the papers represented in Crossref are presently very limited, so that the number of recorded author self-citations in COCI is likely to be a considerable underestimate.

In this new release, COCI contains 445,826,118 citations, of which 30,114,696 are recorded as journal self-citations and 251,699 are recorded as author self-citations.

Extended REST API

The REST API for querying COCI has been extended so as to return information about the aforementioned self-citations. In particular, the response to the operations “references” and “citations” now has two more fields, i.e. “journal_sc” and “author_sc”, that are set to “yes” if the citation returned is a journal self-citation or an author self-citation respectively, or “no” otherwise.

Using the capabilities of the REST API, it is also possible to keep in or exclude from the result set those citations that are (or are not) one of the aforementioned types of self-citation. For instance, the following call

https://w3id.org/oc/index/coci/api/v1/citations/10.1002/pol.1987.140251103?filter=journal_sc:yes

returns all the citations to the article with DOI “10.1002/pol.1987.140251103” that are journal self-citations.
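As a rough sketch of how a script might use this, the Python below rebuilds the call shown above and reads the two new fields from the response. The base URL, the operation names and the “journal_sc” and “author_sc” fields come from this post. The helper names are hypothetical, and the sketch assumes that the API returns a JSON list of records when asked for application/json.

```python
import json
from typing import List, Optional
from urllib.parse import quote
from urllib.request import Request, urlopen

API_BASE = "https://w3id.org/oc/index/coci/api/v1"


def coci_url(operation: str, doi: str, journal_sc: Optional[str] = None) -> str:
    """Build a COCI API URL for "references" or "citations" of a DOI."""
    url = f"{API_BASE}/{operation}/{quote(doi, safe='/')}"
    if journal_sc is not None:
        url += f"?filter=journal_sc:{journal_sc}"
    return url


def fetch(url: str) -> List[dict]:
    """Fetch and decode a response (assumes a JSON list of records)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request) as response:
        return json.load(response)


def only_self_citations(records: List[dict], field: str = "journal_sc") -> List[dict]:
    """Keep the records whose self-citation field is set to "yes"."""
    return [r for r in records if r.get(field) == "yes"]


if __name__ == "__main__":
    url = coci_url("citations", "10.1002/pol.1987.140251103", journal_sc="yes")
    for record in fetch(url):
        print(record.get("journal_sc"), record.get("author_sc"))
```

The `only_self_citations` helper does on the client side what the `filter` parameter does on the server, which can be handy when both kinds of self-citation are needed from a single unfiltered response.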

Conclusions

In this blog post we have introduced the second release of COCI, the OpenCitations Index of Crossref open DOI-to-DOI references, a citation index which now contains almost 450 million open citations created from the ‘Open’ and ‘Limited’ references included within Crossref.

As a reminder, all the data in COCI:

We plan soon to extend the OpenCitations Indexes by adding indexes of citations coming from other source datasets, including Wikidata and DataCite.
