From little acorns . . . A retrospective on OpenCitations

The initial vision

Now that OpenCitations is hosting over one billion freely available scholarly bibliographic citations, this is perhaps an opportune moment to look back to the start of this initiative. A little over eleven years ago, on 24 April 2010, I spoke at the Open Knowledge Foundation Conference, OKCon2010, in London, on the topic

OpenCitations: Publishing Bibliographic Citations as Linked Open Data

I reported that, earlier that same week, I had applied to Jisc for a one-year grant to fund the OpenCitations Project (opencitations.net). Jisc (at that time ‘The JISC’, the Joint Information Systems Committee) was tasked by the UK government, among other things, to support research and development in information technology for the benefit of the academic community.

The purpose of that original OpenCitations R&D project was to develop a prototype in which we:

  • harvested citations from the open access biomedical literature in PubMed Central;
  • described and linked them using CiTO, the Citation Typing Ontology [1];
  • encoded and organized them in an RDF triplestore; and
  • published them as Linked Open Data in the OpenCitations Corpus (OCC).

I told those at the conference that in this demonstration project, with limited JISC funding, we could not hope to “boil the whole ocean”, but that nevertheless there would be substantial benefits from even partial coverage of citation data from the scholarly literature:

  • We could show the way and establish best practice.
  • Despite partial coverage, all key papers would most likely be cited several times.
  • The overall topological structure of the citation network would be revealed.
  • We would create a ‘benchmark’ corpus of high-quality RDF citation data that could be used to develop analytical and visualization tools.
  • We could show the value of open citation data in helping scholars to discover full text articles of all types, and thus encourage subscription-access publishers to release their reference metadata.

The important thing, I said, was to make a start!

The Jisc OpenCitations Project

That JISC grant application was funded, and the project, to last for a year with modest funding of £100K, started in my lab in the Department of Zoology at Oxford University on 1st June 2010, and was subsequently extended for a further six months.

Using data from the Open Access subset of PubMed Central, we created the first prototype release of the OpenCitations Corpus of linked bibliographic citation data, containing 6,529,815 independent bibliographic records of both citing and cited entities, comprising references to ~20% of all post-1980 articles recorded in PubMed, including those to all the most important highly cited papers in every field of biomedical endeavour.

This achievement was almost entirely the result of the excellent work by our chief data wrangler Alex Dutton, whose skill and natural feel for linked data did wonders for this project. Ben O’Steen, Graham Klyne and Alistair Miles made important contributions.

The project also resulted in many other development, described here, most which were developed or at least initiated during a short but wonderfully productive collaboration with Silvio Peroni, who spent six months with me in 2010 as a doctoral student intern from the University of Bologna, to which he subsequently returned to complete his thesis and develop his academic career.

These included:

  • the deconstruction and re-development of the original version of CiTO into a suite of orthogonal and complementary ontologies covering the whole domain of scholarly publishing – the SPAR (Semantic Publishing and Referencing) Ontologies [2, 3];
  • the mapping of various existing metadata schemas into RDF using SPAR, including the DataCite Metadata Schema, and subsequently JATS, now the default NISO standard for XML markup of scholarly documents) [4]; and
  • the initiation of the Semantic Publishing Blog and this OpenCitations Blog.

Life after Jisc – the flowering of OpenCitations

After the Jisc funding ended and I, after a long career in biological teaching and research, formally retired from the Department of Zoology at the Oxford University, members of the initial OpenCitations team moved on to other things. Like so many grant-funded academic project whose initial financial support had dried up, OpenCitations could have foundered at that stage, as an interesting prototype but with too little content to be useful. However, the concept of providing an open alternative to proprietary citation indexes was too important to abandon. But how could it be transitioned into something enduring and useful, particularly when as a matter of principle one had decided that the citation data should be made freely available, thus precluding income generation by charging for ‘premium’ services or the formation of a commercial spin-off?

Finally, I realized that something radical needed to be done to move OpenCitations forward. I had maintained a lively collaboration with Silvio Peroni at the University of Bologna, resulting between 2011 and 2014 in the publication of 18 articles and conference papers concerning the SPAR ontologies, ontology development, documentation and visualization, and related topics, and in 2015 I invited him to start working with me directly on OpenCitations. It was the best decision I could have made. We decided to take the initial concept and re-implement it from the bottom up. OpenCitations gave Silvio a major computer science project to which he could apply his considerable talent, and soon resulted in the development of a revised RDF data model for describing citation data, the OpenCitations Data Model (OCDM) [5] and a suite of new software tools to harvest, organise and publish citations at linked open data [6]. The credit for almost all the subsequent conceptual and technical developments within OpenCitations, which have incrementally led to our present situation, is due to Silvio Peroni, and the scholarly community is indebted to him for the intelligence, skill and diligent application he has given to OpenCitations over the past six years. I am truly honoured to have Silvio as co-Director of OpenCitations, and wish to take this opportunity to acknowledge his contributions and to thank him publicly.

Our work on OpenCitations at that stage, summarized in [7], would not have been possible without the enthusiastic support of Silvio’s senior colleague Fabio Vitali and of the Department of Computer Science and Engineering at the University of Bologna, which not only provided a stimulating environment for Silvio’s post-doctoral work, but also supplied computing services and infrastructure at no charge to OpenCitations. It was also greatly helped by Professor David De Roure of Oxford University, who gave me an academic home and a formal affiliation within the Oxford e-Research Centre after my retirement from the Department of Zoology, which enabled me to continue to hold research grants.

As has been documented in earlier posts in this blog, we greatly benefitted in 2017 from a grant from the Alfred P. Sloan Foundation which enabled us to purchase a new and more powerful computing infrastructure for the sole use of OpenCitations and to extend and improve our software, and subsequently in 2019 by a project grant from the Wellcome Trust to develop the Open Biomedical Citations in Context Corpus, that permitted the extension of OCDM and SPAR for the characterization of in-text references and their textual contexts.

A significant breakthrough came in January 2018 with our decision to treat citations as first-class data entities, each with its own persistent identifier (PID), the Open Citations Identifier (OCI) [8]. This gave Silvio the freedom to envision a new kind of database, a citation index in which each citation had its own metadata, including citation timespan, citation categorization (e.g. self-citation), and of course the DOIs of the citing and cited publications. The creation of this new database was possible only with the incredible effort by Ivan Heibi, who served as a Research Fellow in the project funded by the Alfred P. Sloan Foundation at that time, and who was entirely responsible for developing the first version of the code necessary for creating such a database. Having harvested all the open references from Crossref metadata dumps, Silvio and Ivan created COCI, the OpenCitations Index of Crossref DOI-to-DOI Citations, which immediately became our principal source of open citations, the original OpenCitations Corpus being retained as a ‘sandbox’ in which to experiment with new data representations, for example those required for the Open Biomedical Citations in Context Corpus. Access to COCI was facilitated by Silvio’s development of a REST API, using his software tool RAMOSE (Restful API Manager Over SPARQL Endpoints), which enables the easily configurable deployment of a REST API over any SPARQL endpoint to an RDF triplestore

A significant breakthrough came in January 2018 with our decision to treat citations as first-class data entities, each with its own persistent identifier (PID), the Open Citations Identifier (OCI) [8]. This gave Silvio the freedom to envision a new kind of database, a citation index in which each citation had its own metadata, including citation timespan, citation categorization (e.g. self-citation), and of course the DOIs of the citing and cited publications. The creation of this new index was possible only with the incredible effort by Ivan Heibi, who served as a Research Fellow in the project funded by the Alfred P. Sloan Foundation at that time, and who was entirely responsible for developing the first version of the code necessary for creating such a database. Having harvested all the open references from Crossref metadata dumps, Silvio and Ivan created COCI, the OpenCitations Index of Crossref DOI-to-DOI Citations, which immediately became our principal source of open citations, the original OpenCitations Corpus being retained as a ‘sandbox’ in which to experiment with new data representations, for example those required for the Open Biomedical Citations in Context Corpus. Access to COCI was facilitated by Silvio’s development of a REST API, using his software tool RAMOSE (Restful API Manager Over SPARQL Endpoints), which enables the easily configurable deployment of a REST API over any SPARQL endpoint to an RDF triplestore [9]. We were able to organize our all data, both ‘traditional’ and new, and to encode it in RDF, thanks to the comprehensive OpenCitations Data Model [5], itself based on our SPAR Ontologies [3], which we evolved as necessary to accommodate new data representation requirements.

During this period we published a number of definitions, conference papers and journal articles documenting these advances, details of which can be found here. Of these, the most recent canonical publication describing OpenCitations as an infrastructure for open scholarship, and its datasets, tools, services and activities, is Peroni and Shotton (2020) [10]. We also established the Research Centre for Open Scholarly Metadata at the University of Bologna, primarily to handle administrative, financial and academic aspects of OpenCitations activities.

OpenCitations’ future

The problem remained: how to sustain the OpenCitations infrastructure financially. We were greatly helped by Bilder, Lin and Neylon’s formulation of the Principles of Open Scholarly Infrastructures (POSI) [11], in which they clearly pointing out that reliance solely on grant funding for specific projects was not the answer. OpenCitations compliance with POSI is described here. We were thus immensely grateful that SPARC Europe and other institutions had the wisdom to establish SCOSS (The Global Sustainability Coalition for Open Science Services) to facilitate the crowd-sourced financial support of useful open infrastructures by the scholarly community, including academic libraries, government agencies and other stakeholders. OpenCitations applied for SCOSS support in 2019, which led to the selection of OpenCitations for support in the SCOSS second round.

The donations we are now starting to receive from such stakeholders, and the new staff that this funding has recently allowed us to hire, signal the start of our transition from a financially vulnerable academic project to a sustainable open scholarly infrastructure of real value to the community.

The work of opening more of the global citation graph now requires two things:

  • that each publisher takes responsibility for ensuring that the references from all of its journal articles and books are submitted, together with all other bibliographic metadata, to open scholarly bibliographic metadata aggregators such as Crossref and DataCite, from which they can be indexed into open citation indexes of sufficient quality, depth of detail and breadth of coverage that these offer genuine alternatives to the expensive proprietary citation indexing services upon which the academic community presently relies; and
  • that the entire scholarly stakeholder community re-directs a fraction of the enormous sums currently spent on its subscriptions to proprietary bibliographic services in order to support Open Science infrastructures such as OpenCitations that making citations and other forms of scholarly metadata and objects freely available.

References

[1] David Shotton (2010). CiTO, the Citation Typing Ontology. J. Biomedical Semantics 1 (Suppl. 1): S6. http://dx.doi.org/10.1186/2041-1480-1-S1-S6

[2] Silvio Peroni, David Shotton (2012). FaBiO and CiTO: ontologies for describing bibliographic resources and citations. Web Semantics, 17: 33-34. https://doi.org/10.1016/j.websem.2012.08.001, OA at http://speroni.web.cs.unibo.it/publications/peroni-2012-fabio-cito-ontologies.pdf

[3] Silvio Peroni, David Shotton (2018). The SPAR Ontologies. In Proceedings of the 17th International Semantic Web Conference (ISWC 2018): 119-136. https://doi.org/10.1007/978-3-030-00668-6_8

[4] Peroni S, Lapeyre DA and Shotton D (2012) From Markup to Linked Data: Mapping NISO JATS v1.0 to RDF using the SPAR (Semantic Publishing and Referencing) Ontologies. Proc. 2012 JATS Conference, National Library of Medicine, Bethesda, Maryland, USA (October 2012): 16-17. http://www.ncbi.nlm.nih.gov/books/NBK100491/

[5] Marilena Daquino, Silvio Peroni , David Shotton (2020). The OpenCitations Data Model. Figshare. https://doi.org/10.6084/m9.figshare.3443876.v7

[6] Silvio Peroni, David Shotton, Fabio Vitali (2017). One Year of the OpenCitations Corpus: Releasing RDF-based scholarly citation data into the Public Domain. In The Semantic Web – ISWC 2017 (Lecture Notes in Computer Science Vol. 10588, pp. 184–192). Springer, Cham. https://doi.org/10.1007/978-3-319-68204-4_19

[7] Silvio Peroni, Alexander Dutton, Tanya Gray, David Shotton (2015). Setting our bibliographic references free: towards open citation data. Journal of Documentation, 71 (2): 253-277. http://dx.doi.org/10.1108/JD-12-2013-0166, OA at http://speroni.web.cs.unibo.it/publications/peroni-2015-setting-bibliographic-references.pdf

[8] Silvio Peroni, David Shotton (2019). Open Citation Identifier: Definition. Figshare. https://doi.org/10.6084/m9.figshare.7127816

[9] Daquino, M., Heibi, I., Peroni, S., & Shotton, D. (2021). Creating Restful APIs over SPARQL endpoints with RAMOSE. Semantic Web. http://arxiv.org/abs/2007.16079

[10] Silvio Peroni, David Shotton (2020). OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies, 1(1): 428-444. https://doi.org/10.1162/qss_a_00023

[11] Geoffrey Bilder, Jenny Lin, Cameron Neylon (2015). Principles for Open Scholarly Infrastructure. http://dx.doi.org/10.6084/m9.figshare.1314859

This entry was posted in Citations as First-Class Data Entities, JISC, Ontologies, Open Citation Identifiers, Open Citations, Open Science and tagged , , , , , , , , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s