OpenCitations updates: interviews, outreach, research positions, citations

Several important recent events have involved OpenCitations directly as a participant. Here we introduce some of the most significant ones:

  • Our interviews during a Fireside Chat with John Chodacki during the Open Publishing Fest;
  • The SCOSS poster about OpenCitations and the other two selected infrastructures during LIBER 2020, which won the LIBER 2020 People’s Choice Poster Award;
  • The availability of two new short-term research positions with OpenCitations for our Wellcome Trust funded project (application closing deadline: 23 July 2020); and
  • A new release of COCI, bringing the total number of open citations available in this dataset to more than 721 million.

Open Publishing Fest

The Open Publishing Fest was a decentralized on-line public event held from 18 May to 29 May

to bring together communities supporting open source software, open content, and open publishing models.

(from the Open Publishing Fest website, last visited 4 July 2020)

Within this event, John Chodacki and Cameron Neylon organised a series of sessions named “Fireside Chats” in which they invited people from the open publishing community to discuss their careers and projects. David Shotton and Silvio Peroni were the guests in one of these chats with John Chodacki, in which we talked about the origins of OpenCitations and the plans for OpenCitations’ future. The video of our chat (see link below) is available on YouTube.

The recording of the fireside chat that David Shotton and Silvio Peroni had on May 28th 2020 with John Chodacki in the context of the Open Publishing Fest.

SCOSS at LIBER 2020: promoting open infrastructures

SCOSS, the Global Sustainability Coalition for Open Science Services, which selected OpenCitations last December as worthy of community crowd-funding support as part of its second funding cycle, participated in LIBER 2020 with a poster which showcases the infrastructures that they recommend that the community financially supports.

Tweet by SCOSS on the poster presented at LIBER 2020.

This poster won the People’s Choice Award, which confirms the crucial role that SCOSS plays in making the academic community aware of the importance of providing financial support to open infrastructures like OpenCitations for the benefit of the whole of society.

A tweet by SCOSS on the document they wrote on the importance of supporting open infrastructures.

Furthermore, Giannis Tsakonas, Director in the Library & Information Center at the University of Patras (Greece), and a member of the LIBER Executive Board as head of its Innovative Scholarly Communication Steering Committee, shared a wonderful thread on Twitter about OpenCitations and the other open infrastructures (DOAB, OAPEN, and PKP) that have been selected by SCOSS in its second funding cycle.

A thread by Giannis Tsakonas about us and the other open infrastructures selected by SCOSS for their second funding cycle.

Short-term research positions open at OpenCitations in Bologna for our Wellcome Trust funded project

Wellcome Trust announced they have extended all their grants that were due to end in 2020 or 2021, including our Open Research Fund funded project entitled “Open Biomedical Citations in Context Corpus”. This project aims at providing data for each individual in-text reference pointer (aka in-text citation) and its semantic context, making it possible to distinguish references that are cited only once in the text of a paper from those that are cited multiple times, to see which references are cited together (e.g. in the same sentence), to determine in which section of the article references are cited (e.g. Introduction, Methods), and, potentially, to retrieve the function of each citation.

Some preliminary outcomes of the project have already been described in a recent blog post, and a preprint describing some of the activities of the project has also been made available on arXiv. That paper focuses on the extensions we have made to the OpenCitations Data Model, used for the storage of data in all the OpenCitations datasets, to enable the additional metadata types resulting from the Citations in Context project to be recorded.

In the context of the funding extension to this Wellcome Trust project, the Department of Classical Philology and Italian Studies (FICLIT) at the University of Bologna has just opened two new positions for short-term (5 months) research fellowships from August to December 2020 inclusive, for which the application closing deadline is 23 July 2020. The main goals of these short-term research fellowships are (a) to develop the software for handling the data stored in the Open Biomedical Citations in Context Corpus, and (b) to develop indexing mechanisms to analyse a large number of documents simultaneously within our local computing environment, without having to use external services.

The net salary for each research fellowship is 1,600 EUR per month, tax free. The minimal requirement to apply for one of these positions is to have a Bachelor’s degree, although higher qualifications in Computer Science would be beneficial. Since these are University of Bologna positions, the application forms are in Italian. However, the description of the activity plan of the research fellowships is available in both Italian and English. We would be happy to provide further information, and help in completing the application forms if necessary, so please do not hesitate to email us.

New release of COCI with an additional 18 million citations

Every two months we are able to publish additions to COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. This latest release (dated 4 July 2020) extended COCI with more than 18 million additional citations, so that COCI now contains more than 721 million DOI-to-DOI citation links between more than 58.8 million bibliographic entities.

These new citations were harvested from the most recent Crossref data dump, downloaded on 8 June 2020, which includes the references of articles deposited in Crossref between 4 April 2020 and 4 June 2020. As before, we will use this new release of COCI to update the Coronavirus Open Citations Dataset, the third release of which will include details about relevant additional references and publications.

We remind you that COCI has been fully described in our open-access article

Ivan Heibi, Silvio Peroni & David Shotton (2019). Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. Scientometrics, 121 (2): 1213-1228. DOI: https://doi.org/10.1007/s11192-019-03217-6

and that all the bibliographic and citation data in COCI:


COCI has surpassed 700M citations

We are excited to share that COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, was on 12 May 2020 extended with more than 47 million additional citations, and has reached a total number of more than 702 million DOI-to-DOI citation links between more than 58 million bibliographic entities.

The citations added in this, the fifth release of COCI, came from the most recent Crossref dump, downloaded on 22 April 2020, which includes the references of the articles deposited in Crossref between 4 October 2019 and 4 April 2020. Such updates to COCI will now occur regularly every two months.

As a consequence, COCI now contains 702,772,530 citations, and also includes publications about the COVID-19 pandemic. We will use this new release of COCI to update the Coronavirus Open Citations Dataset, the second release of which will include details about these additional references and publications.

COCI, which is fully described in our open-access article, was one of the subjects of a multidisciplinary comparison between the major citation indexes recently published on arXiv. In addition, it has been recently mentioned on the Scholix web site as one of the implementors of the Scholix citation data format.

Finally, we wish to remind you that all the bibliographic and citation data in COCI:


Additional 31 million citations in COCI

We are proud to announce that COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, has just been extended with more than 31 million additional citations.

As introduced in an earlier blog post and an open-access article recently published in Scientometrics, COCI is our first OpenCitations Index of open citations. In COCI, we have applied the concept of citations as first-class data entities, each identified using a unique persistent Open Citation Identifier (OCI). COCI indexes the contents of one of the major databases of open scholarly citation information, namely Crossref, and renders and makes available this information in machine-readable RDF and in other formats.

The fourth release of COCI contains more than 655 million DOI-to-DOI citation links between more than 55 million bibliographic entities. The additional 31 million citations added in the new release come from the reprocessing of previous dumps of Crossref data. In particular, we retrieved all the citations that involve references in citing articles that were in the Crossref ‘Limited’ set when we downloaded it in October 2018. Such citing articles currently appear in the Crossref ‘Closed’ dataset due to more recent restrictive policy decisions taken by their publishers.

Finally, we wish to remind you that all the bibliographic and citation data in COCI:


The French National Fund for Open Science supports OpenCitations

The French National Fund for Open Science (FNSO) has decided to support OpenCitations, PKP, and DOAB as part of SCOSS, the Global Sustainability Coalition for Open Science Services.

FNSO has identified OpenCitations as an infrastructure disseminating bibliographic and citation metadata in open access with a level of quality and coverage that provides a workable, free and open alternative to the academic community’s current dependency on proprietary tools, therefore freeing up possibilities for citation analysis, promoting the evolution of bibliometric indicators and broadening knowledge of science.

The FNSO is contributing €250,000, which is 16.3% of the amount that was requested under SCOSS, and is committing to a political and technical partnership with OpenCitations.

OpenCitations is deeply honoured and delighted that the French Open Science Committee has chosen to award such a substantial portion of its open science budget to support our work. These funds will be spent (a) on strengthening our computational infrastructure, (b) on employing software engineers to develop new data sources and services, and data curators to ensure the highest possible quality of our data, and (c) on community engagement through workshops and publications.


OpenCitations described

OpenCitations is an infrastructure organization for open scholarship dedicated to the publication of open bibliographic and citation data. We at OpenCitations are proud to announce the publication, in the first issue of Quantitative Science Studies, of a canonical paper in which we introduce and describe OpenCitations and outline its achievements and goals [1].

Here, I outline the contents of our paper, and provide definitive links on the topics described. Many of these topics have been the subjects of earlier blog posts.

This paper appears in the first Special Issue of QSS, dedicated to the description of the bibliometric data sources that lie at the heart of scientometric research, which aims to characterize the most important data sources currently available and to show how they differ in various dimensions, for instance in the data they provide, their level of openness, and their support for making research reproducible. The first three papers in this special issue cover the most important commercial bibliographic data sources: Web of Science (Clarivate Analytics), Scopus (Elsevier), and Dimensions (Digital Science), while the remaining three articles describe open data sources: Microsoft Academic, Crossref and OpenCitations.

In the introduction to our own paper, we describe the origins of OpenCitations, discuss the growth and benefits of open science, and introduce the Semantic Web techniques used at OpenCitations for recording and publishing our data. We then go on to describe OpenCitations’ services and data, namely Open Citation Identifiers, the OpenCitations Data Model, the SPAR (Semantic Publishing and Referencing) Ontologies, the OpenCitations Corpus, and the OpenCitations Indexes of citation data, of which the first and largest is COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, which currently holds information on over 624 million citations. We conclude our survey of OpenCitations’ services and data by outlining the generic open source software developed at OpenCitations, including OSCAR, the OpenCitations RDF Search Application for searching over RDF datasets, LUCINDA, OSCAR’s associated OpenCitations RDF Resource Browser, and RAMOSE, OpenCitations’ application for creating REST APIs over SPARQL endpoints, thus opening Semantic Web datasets to those not familiar with SPARQL, the RDF query language.

In the second half of the paper, we describe OpenCitations as an organization in terms of its compliance with the principles for the sustainability of open infrastructures proposed by Bilder, Lin and Neylon (2015) [2], and report the selection of OpenCitations by the Global Sustainability Coalition for Open Science Services (SCOSS) as an open infrastructure organization worthy of crowd-funding support by the stakeholder community. We then provide usage statistics for our datasets and web site, and describe the adoption of OpenCitations data and services by the community, before concluding with a forward look at our proposed developments of OpenCitations activities.

References

[1] Silvio Peroni and David Shotton (2020). OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies 1 (1): 428-444. https://doi.org/10.1162/qss_a_00023

[2] Geoffrey Bilder, Jennifer Lin and Cameron Neylon (2015). Principles for open scholarly infrastructures. Figshare. https://doi.org/10.6084/m9.figshare.1314859


The first issue of Quantitative Science Studies

The memorable date 20/02/2020 saw the publication by MIT Press of the first issue of Volume One of a new journal, Quantitative Science Studies (QSS), the official open access journal of the International Society for Scientometrics and Informetrics (ISSI). QSS’s Editor in Chief is Ludo Waltman (CWTS, University of Leiden, Netherlands); Vincent Larivière (Université de Montréal, Montreal, Quebec, Canada) and Staša Milojević (Indiana University Bloomington, Bloomington, Indiana, USA) are its Associate Editors; and it has a large and distinguished editorial board.

What makes the launch of this new journal remarkable is the story of how it came into being. In 2019, the entire editorial team of the Journal of Informetrics (JOI), a leading journal in this field published by Elsevier, resigned en masse and decided to start an alternative journal, QSS, both because of Elsevier’s position on open citations, and because, in their opinion, the financial model used by Elsevier violates the scientific ethos.

Reproducibility in the field of scientometrics requires scientific metadata that are both of high-quality and open, particularly those relating to bibliographic citations. The JOI editorial board was deeply concerned by the refusal of Elsevier to join almost all other large scholarly publishers in supporting the Initiative for Open Citations (I4OC). As we have previously reported on this blog, Elsevier is the largest contributor of bibliographic references to Crossref, but insists that these data should be kept closed.

Elsevier’s position, driven by commercial interests (since it sells access to citation data through Scopus), flies in the face of the scientific community’s clear move towards open science, with hundreds of scientometricians having signed an ISSI open letter urging scholarly publishers to support I4OC.

Science is a self-governing system, and the editorial team held the view that the ultimate responsibility for a scholarly journal should rest with the scientific community, whose members serve as the gatekeepers, producers, and consumers of scientific content.

The editorial team also believed Elsevier’s subscription fees to be excessive, and its article processing charges (APCs) for open access publishing to be unfairly high, thus limiting both those who can afford to read Elsevier journals and those who can afford to publish in them, so that publishing with Elsevier inevitably places major limits on scholarship, harming both science and society. It was for all these reasons that they forsook JOI and started QSS.

We at OpenCitations congratulate the editorial team for their courage in deciding to make this journal flip, and wish them, together with the ISSI and MIT Press, every success for this important new journal. We also commend the Technische Informationsbibliothek (TIB) – Leibniz Information Centre for Science and Technology and the Communication, Information, Media Centre (KIM) of the University of Konstanz, who, in collaboration with the Fair Open Access Alliance (FOAA), have generously agreed to cover APCs for the first three years of the QSS journal.


Introducing InTRePIDs – In-Text Reference Pointer Identifiers

Rationale

Readers of this blog will be familiar with Open Citation Identifiers (OCIs), described in an earlier post and formally defined in [1]. OCIs enable bibliographic citations, treated as first class information entities, to be uniquely identified and referenced, and are used to identify the >624 million individual citations indexed in the latest release of COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, as described in a recent post.

However, COCI and similar citation indexes do not provide any information about where within the citing paper a citation is generated, the textual contexts of the in-text reference pointers, or the reasons for including different in-text reference pointers denoting the same reference at different points within the text.

As explained in the preceding post describing the Open Biomedical Citations in Context Corpus funded by the Wellcome Trust and under development by OpenCitations, deep citation analysis requires a more nuanced approach to citations, which acknowledges that each in-text reference pointer that denotes a bibliographic reference in the reference list of a citing publication instantiates its own citation, as shown in Figure 1.

Figure 1. Citations between a citing paper and a cited paper instantiated both by the inclusion of a bibliographic reference within the reference list of the citing paper and by the inclusion within the text of the citing paper of one or more in-text reference pointers denoting that reference.

The pointer citations clearly involve the same cited publication as does the reference citation itself, but each has its own unique characteristics: the location and textual context of its in-text reference pointer within the text of the citing publication, and its particular rhetorical function which is determined by that context.

If the reference citation is open (as defined in [2]) and identified by an OCI, each in-text reference pointer related to that citation can be identified uniquely using an In-Text Reference Pointer Identifier (InTRePID).

InTRePIDs facilitate in-depth scholarship on in-text reference pointer locations and citation functions, and fine-grained analysis of the relationships between publications, by making it possible

  • to identify each in-text reference pointer with a unique PID,
  • to distinguish references that are cited only once from those that are cited multiple times,
  • to see which references are cited together (e.g. in the same sentence or within an in-text reference pointer list),
  • to determine from which section(s) of the article references are cited (e.g. Introduction, Methods, Discussion), and, potentially,
  • to determine the rhetorical function of the citations from analysis of their textual contexts, by the application of natural language processing, machine learning and artificial intelligence techniques to conduct sentiment analysis on the citation contexts.

Definition of an InTRePID

An InTRePID is composed of two parts separated by an oblique stroke:

intrepid:<oci-numerals>/<ordinal>-<total>

where

  • <oci-numerals> is the numerical part of the OCI uniquely identifying the particular open citation to which the in-text reference pointer and its denoted bibliographic reference relate. Thus an InTRePID can be assigned for any in-text reference pointer that relates to an open citation for which a valid OCI has been assigned;
  • <ordinal> identifies the nth occurrence of an in-text reference pointer within the text of the citing paper relating to that citation; and
  • <total> defines the total number of in-text reference pointers denoting that bibliographic reference within the citing paper.

For example, intrepid:070433-070475/4-6 is a valid InTRePID for an in-text reference pointer defined within the OpenCitations Citations in Context Corpus.

A formal definition document for the InTRePID is given in [3].
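The structure described above can be illustrated with a minimal Python sketch that splits an InTRePID into its three components. The regular expression, the `Intrepid` tuple and the function names are assumptions made for illustration, derived only from the description and examples in this post (such as intrepid:070433-070475/4-6); the formal definition document remains the authoritative reference.

```python
import re
from typing import List, NamedTuple

# Assumed shape: "intrepid:" + OCI numerals (two numeral strings joined by a
# hyphen) + "/" + ordinal + "-" + total. This is inferred from the examples
# in this post, not copied from the formal definition.
INTREPID_PATTERN = re.compile(r"^intrepid:(\d+-\d+)/(\d+)-(\d+)$")


class Intrepid(NamedTuple):
    oci_numerals: str  # numerical part of the OCI of the related citation
    ordinal: int       # nth occurrence of the pointer in the citing paper
    total: int         # total number of pointers denoting that reference


def parse_intrepid(value: str) -> Intrepid:
    """Split an InTRePID string into its components, or raise ValueError."""
    match = INTREPID_PATTERN.match(value)
    if match is None:
        raise ValueError(f"not a well-formed InTRePID: {value!r}")
    ordinal, total = int(match.group(2)), int(match.group(3))
    if not 1 <= ordinal <= total:
        raise ValueError(f"ordinal {ordinal} is outside the range 1..{total}")
    return Intrepid(match.group(1), ordinal, total)


def sibling_intrepids(value: str) -> List[str]:
    """List the InTRePIDs of all pointers relating to the same citation."""
    parsed = parse_intrepid(value)
    return [
        f"intrepid:{parsed.oci_numerals}/{n}-{parsed.total}"
        for n in range(1, parsed.total + 1)
    ]
```

Because the total is carried inside the identifier, a single InTRePID is enough to enumerate the identifiers of all the other pointers that relate to the same citation.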

Exemplar in-text reference pointers

Consider the following citing paper:

Zou, J. et al. (2020). Phenotypic and genotypic correlates of penicillin susceptibility in nontoxigenic Corynebacterium diphtheriae, British Columbia, Canada, 2015–2018. Emerging Infectious Diseases, 26: 97-103. https://doi.org/10.3201/eid2601.191241

This paper contains six in-text reference pointers denoting Reference 13 in the reference list:

13. Lowe, C. et al. (2011). Cutaneous diphtheria in the urban poor population of Vancouver, British Columbia, Canada: a 10-year review. J. Clinical Microbiology 49: 2664-2666. https://doi.org/10.1128/JCM.00362-11

The InTRePIDs for these pointers are recorded within the OpenCitations Biomedical Citations in Context Corpus, together with the corpus identifiers and DOIs of the citing and cited papers, as shown in the excerpt presented in Figure 2.

Figure 2. An excerpt from the OpenCitations Biomedical Citations in Context Corpus, highlighting the InTRePIDs for the six in-text reference pointers within Zou, J. et al. (2020) denoting Reference 13, the reference to Lowe, C. et al. (2011), together with the internal corpus identifiers for each in-text reference pointer, and the corpus identifiers and DOIs for the citing and cited papers.

These six in-text reference pointers have the InTRePIDs intrepid:070433-070475/1-6 to intrepid:070433-070475/6-6. The first and the fourth, chosen as examples, are given here with their document locations, their embedding sentences, their in-text reference pointer lists, and their InTRePIDs:

Introduction. “Nontoxigenic strains have been shown to have epidemic potential, causing infections in persons afflicted by homelessness, alcohol abuse, and injection drug use (9,13–15).” (intrepid:070433-070475/1-6)

Discussion. “We also noted ST5 and ST32 in our review from downtown Vancouver during 1998–2007 (13).” (intrepid:070433-070475/4-6)

The first of these discusses those people most susceptible to diphtheria infection, while the other discusses which multilocus sequence types (STs) of C. diphtheriae were found, thus relating to the organism causing the infection rather than to the infected individuals. The rhetorical function of these two in-text reference pointers is quite distinct.

To permit this information to be recorded within the OpenCitations Citations in Context Corpus, extensions were required to the OpenCitations Data Model, a new extended version of which was recently published [4], as described in a related blog post.

The OpenCitations InTRePID Resolution Service

To support the use of InTRePIDs to identify in-text reference pointers, OpenCitations has recently developed an InTRePID Resolution Service (currently in ‘beta’ in its development cycle), which is running at http://opencitations.net/intrepid. A screenshot of this service is shown in Figure 3.

Figure 3. A screenshot of the user interface of the InTRePID Resolution Service.

In addition to using the Web user interface shown in Figure 3, InTRePIDs can be entered into this resolution service in the form of resolvable URIs, e.g.

http://opencitations.net/intrepid/070433-070475/4-6

As shown in Figure 4, the OpenCitations InTRePID Resolution service returns metadata concerning the in-text reference pointer identified by the InTRePID, and the bibliographic reference that it denotes, from which further information about the citation and the citing and cited publications may be obtained by following the links provided.

Figure 4. A screenshot of the Web page displaying metadata returned by the InTRePID Resolution Service.

Note that as well as rendering this information in HTML on a web page, the resolution service can also provide it in a variety of machine-readable formats.
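As a small illustration of the resolvable URI form given above, the following Python sketch converts between an InTRePID and its resolution URL. It performs string manipulation only and makes no network request; the function names are hypothetical, and the base URL is the one quoted in this post for the beta service.

```python
# Base URL of the beta resolution service, as quoted in this post.
RESOLVER_BASE = "http://opencitations.net/intrepid"
PREFIX = "intrepid:"


def intrepid_to_url(intrepid: str) -> str:
    """Build the resolvable URI for an InTRePID (hypothetical helper)."""
    if not intrepid.startswith(PREFIX):
        raise ValueError(f"expected a string starting with {PREFIX!r}")
    return f"{RESOLVER_BASE}/{intrepid[len(PREFIX):]}"


def url_to_intrepid(url: str) -> str:
    """Recover the InTRePID from a resolution URL (hypothetical helper)."""
    base = RESOLVER_BASE + "/"
    if not url.startswith(base):
        raise ValueError(f"expected a URL starting with {base!r}")
    return PREFIX + url[len(base):]
```

The resulting URL can then be opened in a browser or passed to any HTTP client.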

Conclusion

InTRePIDs, which enable the identification of individual in-text reference pointers, and the InTRePID Resolution Service, are new services from OpenCitations that will facilitate scholarship on the textual contexts and rhetorical functions of such in-text reference pointers, and of the citations that they instantiate.

InTRePIDs were first announced on 30th January 2020 at PIDapalooza 2020 in Lisbon, the Open Festival of Persistent Identifiers.

References

[1] Silvio Peroni and David Shotton (2019): Open Citation Identifier: Definition. Figshare. https://doi.org/10.6084/m9.figshare.7127816.v2

[2] Silvio Peroni and David Shotton (2018). Open Citation: Definition. Figshare. https://doi.org/10.6084/m9.figshare.6683855

[3] David Shotton, Marilena Daquino and Silvio Peroni (2020). In-Text Reference Pointer Identifier: Definition. Figshare. https://doi.org/10.6084/m9.figshare.11674032

[4] Marilena Daquino, Silvio Peroni and David Shotton (2019). The OpenCitations Data Model. Version 2.0. Figshare. https://doi.org/10.6084/m9.figshare.3443876


The Open Biomedical Citations in Context Corpus: Progress Report

The creation of the Open Biomedical Citations in Context Corpus (CCC) is the goal of a one-year project funded by the Wellcome Trust. The aim is to create a new open corpus of bibliographic and citation data that contain detailed information about individual in-text reference pointers in biomedical journal articles. The project is led by Professor Silvio Peroni of the Research Centre for Open Scholarly Metadata (University of Bologna), is being undertaken by Dr Marilena Daquino (University of Bologna), and actively involves the Oxford e-Research Centre (University of Oxford), the École de Bibliothéconomie et des Sciences de l’Information (Université de Montréal), and the Centre for Science and Technology Studies (CWTS) (Leiden University).

An in-text reference pointer is a textual device (e.g. “[1]”, or “(Peroni and Shotton 2012)”) that appears in the main text of a citing work and denotes a bibliographic reference listed in the Bibliography section of the citing work. While a single in-text reference pointer uniquely denotes a single bibliographic reference, it can occur together with one or more other pointers, forming an in-text reference pointer list that denotes several references (e.g. “[5-13]”, or “(Peroni and Shotton 2012; Peroni and Shotton 2019)”). In-text reference pointers may appear in several places within the same citing publication (e.g. Introduction, Methods, Discussion), may occur within different document components (e.g. body text, figure captions, tables), and may address the cited publication for different purposes (e.g. as the source of an experimental protocol, as a data source, or for general background information).

Unfortunately, current citation indexes contain no information about in-text reference pointers, such as the number of times a particular work is referenced in the citing work, the text of the sentences in which they occur, or the rhetorical purpose of such citations.

Having data at the level of individual in-text reference pointers offers many new opportunities, enabling one: (1) to distinguish between works that are referenced just once in a citing publication and those that are referenced multiple times, and thereby (potentially) to distinguish when a citation is fundamental for the understanding or the development of the citing work, or merely incidental; (2) to see which in-text reference pointers occur together (e.g. in the same sentence or the same paragraph), thus, potentially, to infer similarities between the co-cited publications; and (3) to determine in which specific sections of the publication these in-text references occur (e.g. Introduction, Methods, Results), and thus, potentially, by means of textual analysis of the citation contexts, to retrieve the rhetorical functions of the citations – i.e. the reason why an author cites another work. 
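The three opportunities listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions only: the `Pointer` record, its field names and the sample values are hypothetical, and do not reflect the actual structure of the CCC data.

```python
from collections import Counter, defaultdict
from typing import Dict, List, NamedTuple, Set


class Pointer(NamedTuple):
    """One in-text reference pointer (hypothetical record layout)."""
    reference: str  # label of the bibliographic reference it denotes
    section: str    # section of the citing paper in which it occurs
    sentence: int   # index of the sentence in which it occurs


def citation_counts(pointers: List[Pointer]) -> Counter:
    """(1) Count how many times each reference is pointed to in the text."""
    return Counter(p.reference for p in pointers)


def cited_together(pointers: List[Pointer]) -> Dict[int, Set[str]]:
    """(2) Group references whose pointers occur in the same sentence."""
    by_sentence: Dict[int, Set[str]] = defaultdict(set)
    for p in pointers:
        by_sentence[p.sentence].add(p.reference)
    return {s: refs for s, refs in by_sentence.items() if len(refs) > 1}


def sections_by_reference(pointers: List[Pointer]) -> Dict[str, Set[str]]:
    """(3) Record the sections from which each reference is cited."""
    sections: Dict[str, Set[str]] = defaultdict(set)
    for p in pointers:
        sections[p.reference].add(p.section)
    return dict(sections)


# Invented sample data, for illustration only.
SAMPLE = [
    Pointer("ref9", "Introduction", 0),
    Pointer("ref13", "Introduction", 0),
    Pointer("ref2", "Methods", 3),
    Pointer("ref13", "Discussion", 7),
]
```

With this sample, `ref13` is the only reference cited more than once, and it shares a sentence with `ref9`.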

The goal of the CCC Project is to provide stakeholders with an exemplar Linked Open Data corpus, created from the open access biomedical research literature, that is tailored for such deep citation analyses. The corpus will be a new member of the collection of OpenCitations datasets, and will be accompanied by services for accessing and querying data.

In the CCC Project, we have achieved or are currently dealing with the following developments:

  • Extending the OpenCitations Data Model (OCDM). The OpenCitations Data Model has been extended and enriched with new terms and relations to represent bibliographic entities related to in-text reference pointers, such as the in-text reference pointers themselves, in-text reference pointer lists, discourse elements (e.g. sections, paragraphs, sentences), and annotations on citations, bibliographic references and in-text reference pointers. In addition, the provenance layer of the data model has been revised to provide meaningful provenance information in a more compact way. A revised version of the OCDM including these terms was published on November 8, 2019, and it is available on Figshare [1].
  • Extending the OpenCitations harvesting and data re-engineering pipeline. The CCC Project leverages existing OpenCitations technologies for building this new corpus, using as input articles from the Open Access Subset of biomedical literature hosted by Europe PubMed Central (EPMC) and encoded in XML. The OpenCitations pipelines for knowledge extraction (i.e. the software called BEE) and for data re-engineering (i.e. the software called SPACIN) have been enhanced so as to harvest relevant information from the full text of the XML sources provided by EPMC, rather than just the reference lists, and to transform these data into RDF according to the revised OCDM. The source code of the new pipeline is available on GitHub.
  • Creating InTRePID, a new persistent identifier for in-text reference pointers. Different in-text reference pointers denoting the same bibliographic reference occur in distinct logical, rhetorical and textual contexts. To permit them to be identified individually and handled properly, we have recently developed a new persistent identifier, the In-Text Reference Pointer Identifier (InTRePID), for identifying individual in-text reference pointers relating to an open bibliographic citation. The InTRePID is based on the Open Citation Identifier (OCI), currently being used to identify the >624 million citations present in the new release of COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, described in the previous blog post. The formal definition of an InTRePID is available on Figshare [2]. In addition, an InTRePID Resolution Service has been developed (currently in beta-testing) to facilitate the retrieval of the metadata relating to in-text reference pointers. At the moment, a subset of the CCC corpus is available online for testing the InTRePID Resolution Service.
  • Developing services for accessing and querying the Citations in Context Corpus. Along with the development of the CCC itself, we are also developing services for querying data within the CCC. In particular, we are currently working to extend the RAMOSE software to provide an API for accessing the CCC triplestore. This CCC API will permit users to access the CCC corpus and retrieve detailed information about in-text reference pointers and their related annotations in a variety of human- and machine-readable formats. The source code of the API Manager is available on GitHub. The configuration file for querying the CCC corpus is still under development.

Moreover, we are currently evaluating the data quality of the CCC corpus and developing methods to reconcile its content with information stored in Crossref. First, by means of new validation methods, we are testing whether the extracted in-text reference pointers are complete (i.e. that all the in-text reference pointers for a particular bibliographic reference have been correctly extracted from the text), and whether in-text reference pointer lists (e.g. “[5-13]”) have been correctly parsed to extract all the implicit pointers (in this case “[6]”, “[7]”, “[8]”, “[9]”, “[10]”, “[11]” and “[12]”) and to associate them correctly with the bibliographic references that they denote. This activity is fundamental for handling the diverse citation styles adopted by different journals and for overcoming possible inconsistencies in the publishers’ XML markup of the articles. Second, whenever a DOI is not specified for the citing or cited publications in the full text of the citing publication, a text search using the Crossref API is performed in order to match possible candidates and supply the missing DOI. This reconciliation process can itself be error-prone, since recommended matches are obtained by means of a non-transparent scoring mechanism. We are therefore currently testing the application of a scoring threshold intended to eliminate false positives, so that only correct matches are retained.
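To make the two checks above more concrete, here is a minimal Python sketch. It is not the code of the CCC pipeline: the function names, the accepted pointer syntax (square brackets, commas or semicolons, hyphen or en-dash ranges) and the threshold value are all assumptions chosen for illustration. The candidate dictionaries simply mimic records carrying a "DOI" and a "score" field, as returned by a bibliographic text search.

```python
import re


def expand_pointer_list(pointer: str) -> list[int]:
    """Expand a numeric pointer list such as "[5-13]" or "[2, 4-6]" into
    every reference number it denotes, explicit and implicit."""
    inner = pointer.strip().strip("[]()")
    numbers: list[int] = []
    for part in re.split(r"\s*[,;]\s*", inner):
        # A range is two numbers joined by a hyphen or an en-dash.
        match = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", part)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            if start > end:
                raise ValueError(f"descending range in {pointer!r}")
            numbers.extend(range(start, end + 1))
        elif re.fullmatch(r"\d+", part):
            numbers.append(int(part))
        else:
            raise ValueError(f"unrecognised pointer fragment {part!r}")
    return numbers


def filter_by_score(candidates: list[dict], threshold: float) -> list[dict]:
    """Keep only the candidate matches whose score reaches the threshold.
    The threshold is a hypothetical parameter that would need tuning."""
    return [c for c in candidates if c.get("score", 0.0) >= threshold]


if __name__ == "__main__":
    print(expand_pointer_list("[5-13]"))
    candidates = [
        {"DOI": "10.1234/placeholder-a", "score": 92.5},
        {"DOI": "10.1234/placeholder-b", "score": 31.0},
    ]
    print(filter_by_score(candidates, threshold=60.0))
```

A fixed threshold trades recall for precision: raising it discards more wrong matches, but also leaves more references without a DOI.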

The deployment of the enhanced OpenCitations pipeline for populating the CCC corpus automatically is planned to start in the coming weeks. For more details and to provide suggestions, please contact us!

References

[1] Marilena Daquino, Silvio Peroni and David Shotton (2019). The OpenCitations Data Model. Version 2.0. Figshare. DOI: https://doi.org/10.6084/m9.figshare.3443876

[2] David Shotton, Marilena Daquino and Silvio Peroni (2020). In-Text Reference Pointer Identifier: Definition. Figshare. DOI: https://doi.org/10.6084/m9.figshare.11674032


More than 624 million citations now available on COCI

COCI is the OpenCitations Index of Crossref open DOI-to-DOI citations, all released as CC0 material, and is described in the article

Heibi I, Peroni S, Shotton D (2019). Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. Scientometrics 121(2): 1213-1228. https://doi.org/10.1007/s11192-019-03217-6

COCI is our first OpenCitations Index of open citations, in which we have applied the concept of citations as first-class data entities, each identified using a unique persistent Open Citation Identifier (OCI), to index the contents of one of the major databases of open scholarly citation information, namely Crossref, and to render and make available this information in machine-readable RDF.

We are now proud to announce the third release of COCI, which contains more than 624 million DOI-to-DOI citation links coming from both the ‘Open’ and the ‘Limited’ sets of Crossref reference data. This represents an increase of 40% in the number of indexed citations, compared with the second release of COCI on 12th November 2018, which indexed more than 445 million citations. The data model used for this third release of COCI is the updated revision of the OpenCitations Data Model, published on 8 November 2019 and available at https://doi.org/10.6084/m9.figshare.3443876.

This new release of COCI has been created using new software developed specifically for this purpose, which is available on our GitHub repository under an open ISC license. This software automates the process of creating an OpenCitations Index compliant with the OpenCitations Data Model and creates the citation data and related provenance information in three different formats: CSV, N-Triples (RDF), and Scholix. The support for Scholix – a high-level interoperability framework supported by Crossref, DataCite, Europe PubMed Central, OpenAIRE and others – has recently been added to provide an additional format for the exchange of information about the links between scholarly literature and datasets.
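As a rough illustration of what it means to emit the same citation in more than one format, the following Python sketch serialises a citation record as CSV and as N-Triples. It is not the OpenCitations software: the function names, the three-column CSV layout, the https://example.org/ci/ base IRI and the placeholder OCI and DOI values are assumptions made for this example, and Scholix output is left out. The property names are taken from the CiTO ontology.

```python
import csv
import io

CITO = "http://purl.org/spar/cito/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def to_csv(citations: list[dict]) -> str:
    """Serialise citations as CSV with an assumed three-column layout."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["oci", "citing", "cited"],
        extrasaction="ignore",  # silently drop any additional keys
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(citations)
    return buffer.getvalue()


def to_ntriples(citation: dict, base: str = "https://example.org/ci/") -> str:
    """Serialise one citation as N-Triples using CiTO properties.
    The base IRI is a placeholder, and DOIs are not IRI-escaped here."""
    subject = f"<{base}{citation['oci']}>"
    citing = f"<https://doi.org/{citation['citing']}>"
    cited = f"<https://doi.org/{citation['cited']}>"
    lines = [
        f"{subject} <{RDF_TYPE}> <{CITO}Citation> .",
        f"{subject} <{CITO}hasCitingEntity> {citing} .",
        f"{subject} <{CITO}hasCitedEntity> {cited} .",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values, not a real OCI or real DOIs.
    record = {"oci": "0123-0456", "citing": "10.1234/citing", "cited": "10.1234/cited"}
    print(to_csv([record]))
    print(to_ntriples(record))
```

Treating each citation as a resource with its own identifier is what allows the same record to be written out as a flat CSV row or as a set of RDF statements about that citation.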

A great advantage of the new software is that it will now enable us to extend COCI (and any other OpenCitations Index) by means of incremental additions, rather than having to re-create the entire index at each update. This should enable us to release index updates more frequently than hitherto, thus keeping the index more closely in synchrony with the latest reference data released by Crossref. Note that we are currently running the software on previous dumps of Crossref data so as to retrieve all the citations that involve references in citing articles that were in the ‘Limited’ set when we downloaded it, but that currently appear in the Crossref ‘Closed’ data set due to more recent restrictive policy decisions taken by their publishers.
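The idea of an incremental addition can be sketched in a few lines of Python. This is only a toy model under our own assumptions (an in-memory dictionary keyed by OCI, and a hypothetical function name), not how the OpenCitations software stores or updates an index: citations already present are left untouched, and only unseen ones are added.

```python
def incremental_update(index: dict[str, dict], new_citations: list[dict]) -> list[dict]:
    """Add to the index only the citations whose OCI is not yet present,
    and return the list of citations that were actually added."""
    added = []
    for citation in new_citations:
        oci = citation["oci"]
        if oci not in index:
            index[oci] = citation
            added.append(citation)
    return added


if __name__ == "__main__":
    # Placeholder OCIs, used purely for illustration.
    index = {"0123-0456": {"oci": "0123-0456"}}
    batch = [{"oci": "0123-0456"}, {"oci": "0123-0789"}]
    print(incremental_update(index, batch))
    print(sorted(index))
```

Because each citation carries its own identifier, deciding whether it is new reduces to a membership test, which is what makes updating cheaper than rebuilding the whole index.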

Finally, we wish to remind you that all the bibliographic and citation data in COCI:
