The Open Biomedical Citations in Context Corpus: Progress Report

The creation of the Open Biomedical Citations in Context Corpus (CCC) is the goal of a one-year project funded by the Wellcome Trust. The aim is to create a new open corpus of bibliographic and citation data that contain detailed information about individual in-text reference pointers in biomedical journal articles. The project is led by Professor Silvio Peroni of the Research Centre for Open Scholarly Metadata (University of Bologna), is being undertaken by Dr Marilena Daquino (University of Bologna), and actively involves the Oxford e-Research Centre (University of Oxford), the École de Bibliothéconomie et des Sciences de l’Information (Université de Montréal), and the Centre for Science and Technology Studies (CWTS, Leiden University).

An in-text reference pointer is a textual device (e.g. “[1]”, or “(Peroni and Shotton 2012)”) that appears in the main text of a citing work and denotes a bibliographic reference listed in the Bibliography section of the citing work. While a single in-text reference pointer uniquely denotes a single bibliographic reference, it can occur together with one or more other pointers, forming an in-text reference pointer list that denotes several references (e.g. “[5-13]”, or “(Peroni and Shotton 2012; Peroni and Shotton 2019)”). In-text reference pointers may appear in several places within the same citing publication (e.g. Introduction, Methods, Discussion), may occur within different document components (e.g. body text, figure captions, tables), and may address the cited publication for different purposes (e.g. as the source of an experimental protocol, as a data source, or for general background information).

Unfortunately, current citation indexes contain no information about in-text reference pointers, such as the number of times a particular work is referenced in the citing work, the text of the sentences in which they occur, or the rhetorical purpose of such citations.

Having data at the level of individual in-text reference pointers offers many new opportunities, enabling one: (1) to distinguish between works that are referenced just once in a citing publication and those that are referenced multiple times, and thereby (potentially) to distinguish when a citation is fundamental for the understanding or the development of the citing work, or merely incidental; (2) to see which in-text reference pointers occur together (e.g. in the same sentence or the same paragraph), thus, potentially, to infer similarities between the co-cited publications; and (3) to determine in which specific sections of the publication these in-text references occur (e.g. Introduction, Methods, Results), and thus, potentially, by means of textual analysis of the citation contexts, to retrieve the rhetorical functions of the citations – i.e. the reason why an author cites another work. 

The goal of the CCC Project is to provide stakeholders with an exemplar Linked Open Data corpus, created from the open access biomedical research literature, that is tailored for such deep citation analyses. The corpus will be a new member of the collection of OpenCitations datasets, and will be accompanied by services for accessing and querying data.

In the CCC Project, we have completed or are currently working on the following developments:

  • Extending the OpenCitations Data Model (OCDM). The OpenCitations Data Model has been extended and enriched with new terms and relations to represent bibliographic entities related to in-text reference pointers, such as the in-text reference pointers themselves, in-text reference pointer lists, discourse elements (e.g. sections, paragraphs, sentences), and annotations on citations, bibliographic references and in-text reference pointers. In addition, the provenance layer of the data model has been revised to provide meaningful provenance information in a more compact way. A revised version of the OCDM including these terms was published on November 8, 2019, and it is available on Figshare [1].
  • Extending the OpenCitations harvesting and data re-engineering pipeline. The CCC Project leverages existing OpenCitations technologies for building this new corpus, using as input articles from the Open Access Subset of biomedical literature hosted by Europe PubMed Central (EPMC) and encoded in XML. The OpenCitations pipelines for knowledge extraction (i.e. the software called BEE) and for data re-engineering (i.e. the software called SPACIN) have been enhanced so as to harvest relevant information from the full text of the XML sources provided by EPMC, rather than just the reference lists, and to transform these data into RDF according to the revised OCDM. The source code of the new pipeline is available on GitHub.
  • Creating InTRePID, a new persistent identifier for in-text reference pointers. Different in-text reference pointers denoting the same bibliographic reference have distinct logical, rhetorical and textual contexts wherein they occur. To permit them to be identified individually and handled properly, we have recently developed a new persistent identifier, the In-Text Reference Pointer Identifier (InTRePID), for identifying individual in-text reference pointers relating to an open bibliographic citation. The InTRePID is based on the Open Citation Identifier (OCI), currently being used to identify the >624 million citations present in the new release of COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, described in the previous blog post. The formal definition of an InTRePID is available on Figshare [2]. In addition, an InTRePID Resolution Service has been developed (currently in beta-testing) to facilitate the retrieval of the metadata relating to in-text reference pointers. At the moment, a subset of the CCC corpus is available online for testing the InTRePID Resolution Service.
  • Developing services for accessing and querying the Citations in Context Corpus. Along with the development of the CCC itself, we are also developing services for querying data within the CCC. In particular, we are currently working to extend the RAMOSE software to provide an API for accessing the CCC triplestore. This CCC API will permit users to access the CCC corpus and retrieve detailed information about in-text reference pointers and their related annotations in a variety of human- and machine-readable formats. The source code of the API Manager is available on GitHub. The configuration file for querying the CCC corpus is still under development.

Moreover, we are currently working to evaluate the quality of the content data in the CCC corpus and to reconcile it with information stored in Crossref. First, by means of new validation methods, we are testing whether the extracted in-text reference pointers are complete (i.e. that all the in-text reference pointers for a particular bibliographic reference have been correctly extracted from the text), and whether in-text reference pointer lists (e.g. “[5-13]”) have been correctly parsed to extract all the implicit pointers (in this case “[6]”, “[7]”, “[8]”, “[9]”, “[10]”, “[11]” and “[12]”) and to associate them correctly with the bibliographic references that they denote. This activity is fundamental for addressing the diverse citation styles adopted by different journals and for overcoming possible inconsistencies in the publishers’ XML markup of the articles. Second, whenever a DOI is not specified for the citing or cited publications in the full text of the citing publication, a text search using the Crossref API is performed in order to match possible candidates and supply the missing DOI. This reconciliation process can itself be error-prone, since recommended matches are obtained by means of a non-transparent scoring mechanism. We are therefore currently testing the application of a scoring threshold intended to filter out false positives and retain only correct matches.
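
The two checks described above can be illustrated with a minimal Python sketch. It is not the project's validation code: `expand_pointer_list` and `pick_doi` are hypothetical helper names, the parser assumes bracketed numeric pointers only (author-date styles are out of scope), and the threshold passed to `pick_doi` is an arbitrary value chosen by the caller. The candidate dictionaries mirror the `DOI` and `score` fields that the Crossref REST API returns for each matched work.

```python
import re
from typing import Dict, List, Optional

# Matches a bracketed numeric pointer list such as "[5-13]" or "[1, 3-5]".
_BRACKETED = re.compile(r"^\[\s*(.+?)\s*\]$")
# Matches a range such as "5-13"; a hyphen or an en dash is accepted.
_RANGE = re.compile(r"(\d+)\s*[-\u2013]\s*(\d+)")


def expand_pointer_list(text: str) -> List[int]:
    """Expand a numeric pointer list into every reference number it denotes
    (hypothetical helper, numeric citation styles only)."""
    match = _BRACKETED.match(text.strip())
    if match is None:
        raise ValueError(f"not a bracketed numeric pointer list: {text!r}")
    numbers: List[int] = []
    for part in match.group(1).split(","):
        part = part.strip()
        bounds = _RANGE.fullmatch(part)
        if bounds:
            start, end = int(bounds.group(1)), int(bounds.group(2))
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            # Include both endpoints and all implicit pointers in between.
            numbers.extend(range(start, end + 1))
        elif re.fullmatch(r"\d+", part):
            numbers.append(int(part))
        else:
            raise ValueError(f"unrecognised pointer: {part!r}")
    return numbers


def pick_doi(candidates: List[Dict], threshold: float) -> Optional[str]:
    """Return the DOI of the best-scoring candidate, or None when no
    candidate reaches the threshold (assumed reconciliation rule)."""
    best = max(candidates, key=lambda c: c.get("score", 0.0), default=None)
    if best is None or best.get("score", 0.0) < threshold:
        return None
    return best.get("DOI")
```

For example, `expand_pointer_list("[5-13]")` yields the numbers 5 through 13, that is, the two explicit endpoints plus the seven implicit pointers listed above. Returning `None` below the threshold, rather than the best available guess, trades recall for precision: a missing DOI can be filled in later, whereas a wrong one silently creates a false citation.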

The deployment of the enhanced OpenCitations pipeline for populating the CCC corpus automatically is planned to start in the coming weeks. For more details and to provide suggestions, please contact us!

References

[1] Marilena Daquino, Silvio Peroni and David Shotton (2019). The OpenCitations Data Model. Version 2.0. Figshare. DOI: https://doi.org/10.6084/m9.figshare.3443876

[2] David Shotton, Marilena Daquino and Silvio Peroni (2020). In-Text Reference Pointer Identifier: Definition. Figshare. DOI: https://doi.org/10.6084/m9.figshare.11674032
