Save the dates: OpenCitations October events 

With the numerous September events in which OpenCitations’ directors have recently been involved now behind us, it is time to announce the participation of our director Silvio Peroni in two October events.

On Wednesday 6th, Silvio will take part in the Beilstein OpenScience Symposium 2021 (October 5-7), giving a short presentation “Open Citations, an Open Infrastructure to Provide Access to Global Scholarly Bibliographic and Citation Data” during the Poster Flash Talk Presentation (17:10-18:00 CEST). The Beilstein OpenScience Symposium is an annual event that gathers leaders in the FAIR and Open Data movement, covering a wide range of research fields, including biomedical research, physics and social science, and exploring how open data practices are transforming sectors outside academia. The 2021 online edition will present a series of talks addressing the many ways in which data transparency contributes to research progress. Among them, the poster presentations involve short oral presentations on Wednesday, 6th October, to accompany the posters that will be displayed throughout the entire symposium. Poster abstracts are available in the Abstract Book, which can be downloaded from the Beilstein Symposium’s website: https://www.beilstein-institut.de/en/symposia/open-science/program/ .

You can register for the event here: https://www.beilstein-institut.de/en/symposia/open-science/registration/

The poster and slides from the presentation are available on Zenodo:

Peroni, S. (2021, October 6). Open Citations, an Open Infrastructure to Provide Access to Global Scholarly Bibliographic and Citation Data—Poster Flash Talk Session slides. Beilstein Open Science Symposium 2021, Virtual Event. Zenodo. https://doi.org/10.5281/zenodo.5553025

Peroni, S., Shotton, D. W., & Di Giambattista, C. (2021). Open Citations, an Open Infrastructure to Provide Access to Global Scholarly Bibliographic and Citation Data. https://doi.org/10.5281/zenodo.5553040

The second event is the annual European Computer Science Summit (ECSS), organized by Informatics Europe and involving academics, industry leaders, decision makers and others interested in Informatics/Computer Science research and education in Europe. ECSS 2021, “Informatics for a Sustainable Future” (Oct. 25-27), will be held as a hybrid event, with both online sessions and on-site sessions in Madrid at the Facultad de Ciencias de la Actividad Física y del Deporte (INEF), Universidad Politécnica de Madrid, located at Calle de Martín Fierro, 7.

During the last day of the meeting (Oct. 27), Silvio Peroni will be speaking at the “National Informatics Associations Workshop”, an annual workshop organised by Informatics Europe in collaboration with the National Informatics Associations in Europe. This year the workshop will address the themes Informatics in Interdisciplinary Curricula and Research Evaluation in Informatics, thus focusing on an important question: how to recognise, assess and credit research contributions specific to Informatics, such as conference publications and software artefacts. Elaborating on this, Silvio’s talk (to be delivered in person, rather than online!) is entitled “Open citations in Informatics” (9:00 CEST).  

For further information and registrations: https://www.informatics-europe.org/ecss/registration/how-to-register.html

We thank Beilstein Institute and Informatics Europe for involving OpenCitations in these international events, which provide opportunities to promote the OpenCitations infrastructure and services in stimulating environments.

We hope to see you there!

Posted in Bibliographic references, open access, Open Citations, Open Science, Uncategorized

OpenCitations in Five Hundred Words

Yesterday I gave a lightning talk at the 2021 OASPA Conference, with the title OpenCitations – what does the future hold? The poster accompanying my talk, published on Zenodo at https://doi.org/10.5281/zenodo.5526713, is reproduced below.

Poster for 2021 OASPA Conference Lightning Talk

Here is what I said:

= = =

Most of the talks at this conference have focussed on open access to textual content. But open bibliographic metadata is also vitally important, not least to enable the calculation of metrics that are both open and reproducible.

OpenCitations is a not-for-profit open infrastructure that provides such free access to global scholarly citations. We hold dear the values and principles that underpin Open Science, and are early adopters of the Principles of Open Scholarly Infrastructure (POSI) and the FAIR data principles.

Our largest citation index is COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, which currently indexes approximately 1.2 billion citations, released to the public domain under a CC0 waiver.

Our goal is for OpenCitations to provide open bibliographic citation information having scope, depth, accuracy and provenance that surpasses that of the commercial citation indexes, for access to which scholarly institutions presently pay enormous annual subscriptions.

I wish to mention just two of our planned developments:

OpenCitations Meta is a new database that will enable us to store in-house full bibliographic metadata about citing and cited publications. This will have two advantages: It will speed user queries, since we will no longer have to wait for responses to on-the-fly API calls to Crossref and ORCID to retrieve such metadata. More importantly, it will enable us to index the large number of references involving publications that do not have DOIs, something that for technical reasons is presently lacking.

Additionally, we plan to develop new citation indexes over other sources of open references, starting with DOCI, indexing references from DataCite, and NOCI, indexing the content of the NIH Open Citation Collection.

Our progress has until recently been severely manpower-limited. However, OpenCitations was fortunate to have been selected by SCOSS, the Global Sustainability Coalition for Open Science Services, as an open infrastructure providing a unique and valuable service, and worthy of crowdfunded financial support by the global stakeholder community of research institutes, academic libraries, funders and publishers.

As a result of the generous support so far provided or pledged by ~50 such institutions, we have already reached about one-third of our requested SCOSS budget, enabling us this year to appoint new staff to start our planned technical developments, to support our community outreach, and to help take our vision forward.

Such financial support is vital for our sustainability, since we generate no income from our provision of free data, services and software. We thus invite you too to contribute to OpenCitations.

However, we also seek community involvement in other ways: participation in the community-led governance of OpenCitations; help in developing our open source software and services; curatorial involvement to improve OpenCitations data; and collaborations with other like-minded infrastructures to develop federated access to open scholarly information of all types, thereby returning control over such information to the academic community that generated it in the first place.

If you would like to work with OpenCitations in any of these ways, please contact me.

= = =

Posted in Bibliographic references, Open Citations

Academia’s missing references

No-one is quite sure of the total number of scholarly publications within the global corpus. Indeed, that number will be strongly influenced by the degree to which, in addition to books and journal articles, one includes within the definition of scholarly publications ‘grey literature’ such as reports published by official bodies, patents, etc. Consequently, the total number of scholarly references within those publications is also unknown, and this number too will vary according to the inclusion criteria chosen. Furthermore, Crossref Event Data and similar datasets recording social media mentions of journal articles in blog posts and tweets extend the concept of a reference beyond that used in ‘conventional’ citation indexes such as COCI.

We celebrate the fact that well over one billion bibliographic citations are now openly available under CC0 waivers in NIH OCC (the National Institutes of Health Open Citation Collection) [1,2] and COCI (the OpenCitations Index of Crossref DOI-to-DOI Citations) [3]. Despite present gaps in their coverage, they include references to all the most important publications within the global corpus, because these will all have been cited multiple times.

Open references available from Crossref and other aggregators and indexes

Crossref, with over 1.6 billion open references, is the largest single source of such bibliographic metadata. Significant numbers of references are also available in a variety of other databases, repositories and indexes.

NIH OCC (the National Institutes of Health Open Citation Collection) is a merger of several citation databases, drawing on PubMed for crucial article metadata, and augmenting this with information from full-text articles that have been made freely available on the internet [1]. The CiteSeerX database, the arXiv preprint repository, and the Dryad data repository are examples of different types of infrastructure that also publish open bibliographic references, while there is further availability of article references from open aggregators such as DataCite and Wikidata. These may either use their own DOIs, DOIs from the Crossref DOI registration agency, or no DOIs at all. Either way, those references will not appear in Crossref.

What is lacking is semantic coherence and interoperability between these sources, permitting federated queries across them. This makes difficult the task of obtaining a comprehensive overview of the availability of open bibliographic references.

However, there are even more citations that are not yet freely and easily available anywhere in bulk, relating to the reference lists within publications of a number of distinct types. This blog post explores academia’s missing references – those that have not yet been documented within open freely accessible citation indexes – and what might be done to bring these into the public domain.

1 References that are closed at Crossref

Eight years ago, I wrote

“In this open-access age, it is a scandal that reference lists from journal articles — core elements of scholarly communication that permit the attribution of credit and integrate our independent research endeavours — are not readily and freely available for use by all scholars.” [4]

I stand by that statement, and, through OpenCitations [5], I have been working with colleagues to rectify the situation, as I described in my previous post. The Initiative for Open Citations (I4OC) can rightly be applauded for its part in encouraging almost all the major academic publishers who deposit references at Crossref to make them open.

The only major scholarly publisher not to be listed as an I4OC Participating Publisher is the Institute of Electrical and Electronics Engineers (IEEE), which, having deposited at Crossref reference lists for 58% of its preprints and publications, persists in keeping these deposited references closed and unavailable for indexing and re-use. It is to be hoped that IEEE will now realize that what it loses by not having references openly available outweighs any benefits it might have received from keeping them closed, and will join its fellow publishers in ensuring that its Crossref-deposited references are made open, both for its current issues and for its back-number journal articles. I thus call again upon IEEE to change its present position, as both Elsevier and the American Chemical Society had the courage to do recently, and to instruct Crossref to open all IEEE references. A single email to Crossref Support, with the instruction “Open all references”, is all it would take!

References in Crossref are either open, ‘limited’ or closed. Limited references are available to those subscribing to Crossref Metadata Plus, which includes OpenCitations, but not to the general public. Closed references are not freely available to anyone. The following table shows the number of Crossref works in each category, and the number of references within those categories.


Category             | Works                         | References    | Average references per work
Total                | 126,627,618                   |               |
Without references   | 70,477,843 (55.7% of total)   |               |
With references      | 56,149,775 (44.3% of total)   | 1,734,831,311 | 30.9
Open                 | 49,274,155 (38.9% of total)   | 1,605,120,229 | 32.6
Limited              | 2,933,323 (2.3% of total)     | 66,459,704    | 22.7
Closed               | 3,942,297 (3.1% of total)     | 63,251,378    | 16.0
Table 1. Numbers of works and their references recorded in Crossref on 31 July 2021.

2 References not deposited at Crossref for publications with Crossref DOIs

The number of works with Crossref DOIs that lack submitted references is surprisingly large. As of 31 July 2021, 70,477,843 publications (55.7% of all works recorded at Crossref) lacked deposited references (Table 1).

Crossref classifies all types of journal content, including editorials, book reviews and letters, as “journal articles”, so some of these works without deposited references genuinely lack them. However, the majority are conventional journal articles and books with reference lists that the publishers have simply not deposited at Crossref along with the other metadata for these works.

The average number of references per Crossref work with deposited references is 30.9 (Table 1). If, to make allowance for those works that genuinely lack references, we assume a conservative average of 25 references per work for the 70,477,843 works lacking deposited references, this means that there are over 1.75 billion references within these works that have not been deposited at Crossref, and thus are not conveniently available for indexing and reuse.
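
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the counts are copied from Table 1, the figure of 25 references per work is the conservative assumption stated in the text, and the variable names are purely illustrative.

```python
# Counts copied from Table 1 (Crossref, 31 July 2021).
WORKS_WITHOUT_REFERENCES = 70_477_843
WORKS_WITH_REFERENCES = 56_149_775
DEPOSITED_REFERENCES = 1_734_831_311

# Conservative per-work average assumed in the text (an assumption, not a measurement).
ASSUMED_AVERAGE = 25

# Average actually observed for works that do have deposited references.
observed_average = DEPOSITED_REFERENCES / WORKS_WITH_REFERENCES

# Estimated number of references that exist but were never deposited.
estimated_missing = WORKS_WITHOUT_REFERENCES * ASSUMED_AVERAGE

print(f"Observed average references per work: {observed_average:.1f}")
print(f"Estimated undeposited references: {estimated_missing:,}")
```

The product comes to 1,761,946,075, which is the "over 1.75 billion" quoted above. Using the observed average of 30.9 instead of 25 would give a larger figure still.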

These missing references relate both to large publishers that have submitted references for only some of their publications, and to small publishers that perhaps lack, or think they lack, the resources to deposit any reference lists in addition to the other metadata they are already sending to Crossref for each of their DOIs. However, there are several easy methods for depositing reference lists, as detailed by Crossref here. So I encourage all publishers who are supportive of Open Science to update their procedures and commence or complete the deposition of their publication reference lists at Crossref, starting with their current issues. Crossref Support will provide assistance if required. Note that a publisher does not have to subscribe to the Crossref Cited-by service to deposit its references!

3 Citations missing in COCI: open references in Crossref to publications lacking DOIs

COCI is the OpenCitations Index of Crossref DOI-to-DOI Citations, and, as the name suggests, it indexes Crossref open references from works with Crossref DOIs to other works that have DOIs [3]. It therefore does not index open references in Crossref to works that, for whatever reason, lack DOIs.

The most recent (September 2021) release of COCI, based on the August Crossref dump, contains 1,186,958,898 citations between 69,074,291 unique works, comprising 51,103,720 citing bibliographic resources bearing Crossref DOIs and 56,105,783 cited bibliographic resources. Of the cited bibliographic resources, 38,135,212 bear a DOI issued by Crossref and have open or limited references (thus also being COCI citing resources), while 17,970,571 either have a Crossref DOI but lack open or limited references or have a DOI issued by another DOI registration agency such as DataCite (thus not being COCI citing resources).
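
As a quick consistency check, the counts in the preceding paragraph fit together by simple inclusion-exclusion. The sketch below assumes that the 38,135,212 cited works described as "also being COCI citing resources" are exactly the overlap between the citing and cited sets; the variable names are illustrative only.

```python
# Counts quoted for the September 2021 COCI release.
citing = 51_103_720             # citing resources with Crossref DOIs
cited = 56_105_783              # all cited resources
cited_also_citing = 38_135_212  # cited resources that are also citing resources
cited_only = 17_970_571         # cited resources that are not citing resources

# The two kinds of cited resource should add up to the cited total.
assert cited_also_citing + cited_only == cited

# Inclusion-exclusion: count works appearing on both sides only once.
unique_works = citing + cited - cited_also_citing

print(f"Unique works in COCI: {unique_works:,}")
```

Under that assumption the result is 69,074,291, matching the number of unique works reported for the release.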

Note that in Crossref, the ratio of works with open or limited references to works without open or limited references is 0.7:1 (Table 1). However, in COCI, the ratio of cited works with Crossref DOIs containing open or limited references to all other cited works is 2.1:1. Thus works with Crossref DOIs containing open or limited references are three times more likely to be cited than works that either have a Crossref DOI but lack open or limited references or have a DOI issued by another DOI registration agency. This is most likely because the most important journals from almost all the larger publishers now have open references. However, it is still a remarkable ratio.
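
The two ratios, and the factor of three between them, can be checked as follows. This is a sketch with illustrative variable names; it mixes Table 1 (dated 31 July 2021) with COCI counts derived from the August dump, so the comparison is approximate by construction.

```python
# Crossref works, from Table 1.
total_works = 126_627_618
open_works = 49_274_155
limited_works = 2_933_323

with_open_or_limited = open_works + limited_works
without_open_or_limited = total_works - with_open_or_limited
crossref_ratio = with_open_or_limited / without_open_or_limited

# COCI cited works, from the September 2021 release.
cited_with_open_or_limited = 38_135_212
cited_other = 17_970_571
coci_ratio = cited_with_open_or_limited / cited_other

# How over-represented works with open or limited references are among cited works.
relative_likelihood = coci_ratio / crossref_ratio

print(f"Crossref ratio: {crossref_ratio:.1f}:1")
print(f"COCI ratio: {coci_ratio:.1f}:1")
print(f"Relative likelihood of being cited: {relative_likelihood:.1f}")
```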

Because references to works lacking DOIs are not included in COCI, the average number of bibliographic references per citing article in COCI is only 23.2, in contrast to the numbers of references to works of all types given in Table 1.

From these data, it can be seen that there are over 480 million open Crossref references to a wide variety of works lacking DOIs that OpenCitations does not index in COCI. This is because of an intentional and fundamental limitation in the structure of the Open Citation Identifier (OCI) [6], requiring both citing and cited publications to have identifiers of the same type, that lies at the heart of the functionality of our OpenCitations Indexes.
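
The "over 480 million" figure can be reconstructed from the numbers already given. Note the assumption made here: the total is reproduced only if limited references are counted alongside open ones, which is consistent with COCI indexing both, and the Table 1 counts (July) are being compared with a COCI release built from the August dump. Variable names are illustrative.

```python
# Reference counts from Table 1 (Crossref, 31 July 2021).
open_references = 1_605_120_229
limited_references = 66_459_704

# COCI September 2021 release.
coci_citations = 1_186_958_898
coci_citing_works = 51_103_720

# References available to OpenCitations that did not become DOI-to-DOI citations.
not_indexed = open_references + limited_references - coci_citations

# Average number of indexed references per citing work in COCI.
coci_average = coci_citations / coci_citing_works

print(f"Open or limited references not in COCI: {not_indexed:,}")
print(f"Average references per citing work in COCI: {coci_average:.1f}")
```

This gives 484,621,035 references outside COCI, and an average of 23.2 indexed references per citing work, compared with 32.6 and 22.7 for open and limited references respectively in Table 1.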

OpenCitations is currently developing a solution that, without compromising that intentional design limitation in OCIs, will nevertheless permit us to index and publish these ‘missing’ references as Linked Open Data. We will report on this development in due course.

Crossref Event Data is a Crossref service / database that records mentions of publications bearing Crossref DOIs in social media including tweets and blog posts, and in other non-traditional citation sources such as news items and Wikipedia articles. From today (Thursday 23rd September 2021), Crossref Event Data will start to include in its holdings open references from publications bearing Crossref DOIs to other publications bearing DOIs. Limited and closed references will not be included. Initially, open references from current publications will be included in Crossref Event Data, with open references from older works with DOIs being added later. In that respect it will come to resemble COCI, except that COCI also includes ‘limited’ references, treats citations as first-class data entities with their own identifiers, and makes its citation data available in RDF as Linked Open Data, as well as via a REST API. Subsequently, Crossref Event Data will also record references to publications with other forms of identifier, as OpenCitations also plans to do.

4 References available on publishers’ web sites

Many publishers, particularly those of Open Access works, already make their publication reference lists openly available on their own web sites. While this is commendable, it is not sufficient, since, if scholarly references are not made available in a centralized aggregator such as Crossref from which they can be conveniently harvested in bulk for analysis and re-use, they are much more difficult to access.

Scraping references from the HTML of individual web sites is difficult, time-consuming and liable to be incomplete. While Microsoft Academic achieved considerable success in scraping references from publishers’ web sites, possibly because of the special relationships these publishers have with the Microsoft search engine Bing, this service will unfortunately soon no longer be available, illustrating a problem inherent in scholarly infrastructures provided by commercial companies that do not adopt the Principles of Open Scholarly Infrastructure.

A consequence is that such publications will become increasingly ‘invisible’, as bibliographic and analytical services come to rely more and more on centrally available data.

6 References within PDFs of scholarly works lacking DOIs

There is a large but unknown quantity of reference-containing books, academic reports, patents and journal articles from publishers that, for their own good reasons, choose not to use DOIs. The text of many of these publications is already available in a marked-up machine-readable format such as JATS, used in preparation for publication, from which the reference lists could easily be extracted. Other publications are only available as PDFs, either as published Versions of Record or as preprints deposited in a variety of preprint repositories such as arXiv or CORE. Mining reference lists out of PDFs requires expertise in text mining and AI technologies, and is labour-intensive, since it usually requires the tuning of extraction algorithms to handle the particular styles and formatting of individual journals, one at a time. Two stages are involved: first, the recognition and extraction of the text of the individual references from the PDF, and second, the parsing of each text string into the component parts of the reference (author names, title, publication year, etc.). A considerable number of the citations within NIH OCC have been obtained in this manner [1]; commercial companies such as Lexical Intelligence specialize in this area; and publicly available software such as GROBID can be used for the purpose. However, the overall task of extracting ‘missing’ academic references from the global PDF corpus is daunting in magnitude and would require a well-funded organization.
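
To give a feel for the second stage, parsing a reference string into its component parts, here is a deliberately naive sketch. It is not how GROBID or any production system works: it is a single regular expression tuned to one citation style (the one used in the reference list of this post), and the names `REFERENCE_PATTERN`, `DOI_PATTERN` and `parse_reference` are invented for illustration. That it handles only one style is exactly the per-journal tuning problem described above.

```python
import re

# One regular expression for one style: "[n] Authors (Year). Title. Rest".
REFERENCE_PATTERN = re.compile(
    r"^(?:\[\d+\]\s*)?"          # optional numeric label such as "[3] "
    r"(?P<authors>.+?)\s*"       # author list, up to the bracketed year
    r"\((?P<year>\d{4})\)\.\s*"  # publication year, e.g. "(2019). "
    r"(?P<title>.+?)\.\s+"       # title, up to the first full stop plus space
    r"(?P<rest>.*)$"             # venue, volume, pages and identifier
)

# DOI expressed as a doi.org URL (with or without the legacy "dx." prefix).
DOI_PATTERN = re.compile(r"https?://(?:dx\.)?doi\.org/(?P<doi>10\.\S+)")


def parse_reference(text):
    """Split one reference string into parts; return None if the style differs."""
    match = REFERENCE_PATTERN.match(text.strip())
    if match is None:
        return None
    doi_match = DOI_PATTERN.search(match.group("rest"))
    return {
        "authors": [name.strip() for name in match.group("authors").split(",")],
        "year": match.group("year"),
        "title": match.group("title"),
        "doi": doi_match.group("doi").rstrip(".") if doi_match else None,
    }


example = (
    "[3] Ivan Heibi, Silvio Peroni, David Shotton (2019). Software review: COCI, "
    "the OpenCitations Index of Crossref open DOI-to-DOI citations. "
    "Scientometrics 121 (2): 1213-1228. https://doi.org/10.1007/s11192-019-03217-6"
)
parsed = parse_reference(example)
print(parsed)
```

A reference in any other style (year at the end, initials before surnames, a title containing a full stop) would either fail to match or be split wrongly, which is why real extraction pipelines rely on trained models rather than hand-written patterns.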

The correct way to proceed would be for each publisher to take responsibility for liberating the references of their own publications, whether or not the publications themselves are open access, and whether these references are already available in a marked-up machine-readable format or only within PDF documents. Then, if the publisher still chose not to use DOIs or to submit these metadata to Crossref, these references could be submitted directly to OpenCitations for aggregation and publication as Linked Open Data.

Conclusion

From the foregoing discussion it is clear that the academic community has a long way to go before the majority of scholarly citations, the products of their own labours, are openly available for analysis and re-use. We at OpenCitations are working to address these issues and to publish more of these missing citations. However, completion of the task will require a coordinated collaborative international effort.

Are you willing to be involved?

References

[1] B. Ian Hutchins et al. (2019). The NIH Open Citation Collection: A public access, broad coverage resource. PLoS Biol. 17 (10): e3000385. https://doi.org/10.1371/journal.pbio.3000385

[2] B. Ian Hutchins (2021). A tipping point for open citation data. Quantitative Science Studies 2 (2): 433–437. https://doi.org/10.1162/qss_c_00138

[3] Ivan Heibi, Silvio Peroni, David Shotton (2019). Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. Scientometrics 121 (2): 1213-1228. https://doi.org/10.1007/s11192-019-03217-6

[4] David Shotton (2013). Open citations. Nature, 502 (7471): 295-297. http://dx.doi.org/10.1038/502295a

[5] Silvio Peroni, David Shotton (2020). OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies, 1(1): 428-444. https://doi.org/10.1162/qss_a_00023

[6] Silvio Peroni, David Shotton (2019). Open Citation Identifier: Definition. Figshare. https://doi.org/10.6084/m9.figshare.7127816

Posted in Bibliographic references, Citations as First-Class Data Entities, Data publication, open access, Open Citations

Save the dates: OpenCitations’ September events 

We are happy to announce OpenCitations’ participation in a number of online conferences and events during the next few weeks. Our directors Silvio Peroni and David Shotton will be speaking at the Open Science Fair 2021, the OASPA Conference 2021 and Open Access Tage.  

Open Science Fair 2021 (20-23 September) is an event organized by OpenAIRE, in collaboration with some key international initiatives in the area of Open Science: COAR, EIFL, Force11, LA Referencia, LIBER, OPERAS, SPARC and SPARC Europe. As at a real fair, visitors can explore virtual pavilions and participate in various Keynote Talks, Parallel Sessions and Workshops dedicated to Open Science. Silvio Peroni will give two talks on Tuesday 21:

  • In the Lightning Talk, “ScholeXplorer and OpenCitations as the new frontier of open citation indexing” (11:30 CEST), coauthored with Paolo Manghi (OpenAIRE), Alessia Bardi (CNR-ISTI) and Sandro La Bruzzo (CNR-ISTI), Silvio will present ScholeXplorer and OpenCitations, two of the services included in the MONITOR portfolio of the OpenAIRE-Nexus project. More information and registrations at: https://www.opensciencefair.eu/2021/lightning-talks/scholexplorer-and-opencitations-as-the-new-frontier-of-open-citation-indexing
  • The Workshop “The perils of being invisible. Collective funding models for Open Science infrastructure” (16:30-18:00 CEST) “will help identify the main challenges of collective funding models for Open Science Infrastructure, as well as explore the path forward to make them more efficient”. Silvio Peroni, Niels Stern (DOAB/OAPEN), James MacGregor (PKP), Agata Morka (SPARC Europe/SCOSS), Jon Treadway (the Great North Wood Consulting), Jean-Francois Lutz (University of Lorraine) and Vanessa Proudman (SPARC Europe) will reflect on the evanescence of Open Science Infrastructure (OSI) in library budget considerations. The speakers will also promote interaction with other workshop participants in order to create a collective dialogue. You can register for the event here: https://www.opensciencefair.eu/2021/workshops/the-perils-of-being-invisible

The OASPA Conference 2021 (21-23 September), entitled “Designing 21st Century Knowledge Sharing Systems”, will be dedicated to “many timely and fundamental topics relating to open scholarly communication”, including  “the ongoing impact of the pandemic”.  David Shotton will take part in the Poster Lightning Talks Session 3 (Thursday 23, 1-2 pm BST), with the title “OpenCitations – what does the future hold?”, a reflection on OpenCitations’ values, data, services, achievements so far, and plans for the future. For further information and registration: https://oaspa.org/conference/  

Silvio Peroni, together with James MacGregor (Public Knowledge Project) and Niels Stern (OAPEN), will hold the Workshop “How Open Infrastructure Benefits Libraries?” (September 27, 11:30-13:00 CEST) as part of the Open Access Tage 2021 (27-29 September), an annual event dedicated to Open Access initiatives and community. During the workshop, the speakers will investigate the social and economic value of open infrastructures for libraries. For more information and to register for the event: https://oat21.sched.com/event/kdFg/workshop-2-how-open-infrastructure-benefits-libraries

We thank the organizers of these prestigious international events for having invited OpenCitations to participate. Open Science resonates and grows through such community-centered initiatives.

If you wish to learn more about Open Science, ongoing Open Access initiatives, and OpenCitations’ commitment to and activities within these areas, don’t miss the opportunity to participate in these on-line conferences … see you there! 

Posted in open access, Open Citations, Open scholarship, Open Science

California Digital Library invests in OpenCitations

OpenCitations is excited to announce that the California Digital Library (CDL) has joined our growing list of contributors.

CDL’s commitment to sustainable open scholarship has great value for the global scholarly community.  Through its investments and partnerships, CDL aims to create an international academic and librarian dialogue, trusting in the idea that “the university, its scholars and its libraries thrive when we transcend organizational boundaries and commit ourselves to shared investments”.

CDL’s contribution will generously support OpenCitations throughout 2021-2023. CDL funding in the fiscal year 2020-2021 also includes two other SCOSS-endorsed infrastructures, OAPEN and DOAB, the non-profit organization Open Access Switchboard, and the services PsyArXiv and SCOAP3 Books. As can be read in the recent post by Ellen Finnie, this investment reflects CDL’s “commitment to ‘invest in open’ by allocating a portion of our collections funding to the development of open content and infrastructure in support of UC scholarship and teaching”.

The OpenCitations team is grateful to be included in CDL’s ongoing investment in open infrastructure. Thank you!

Posted in Open Citations, Open scholarship, Open Science

92 million new citations added to COCI

It’s been a month since the announcement of 1.09 Billion Citations available in the July 2021 release of COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations.  

We’re now proud to announce the September 2021 release of COCI, which is based on open references to works with DOIs within the Crossref dump dated August 2021. This new release extends COCI with more than 92 Million additional citations, giving a total number of more than 1.18 Billion DOI-to-DOI citation links.

This latest release includes citations from the most recent articles published by the American Chemical Society, whose bibliographic references were opened in February 2021. Citations from the ACS back numbers will be available in the next COCI release, once a fresh processing of all the Crossref data has been completed.

You can find more information about COCI in our open-access article:

Ivan Heibi, Silvio Peroni & David Shotton (2019). Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. Scientometrics, 121 (2): 1213-1228. DOI: https://doi.org/10.1007/s11192-019-03217-6  

Finally, just a reminder that the bibliographic and citation data in COCI: 

  • can be queried using the OpenCitations Indexes SPARQL endpoint; 
  • can be retrieved by using the COCI REST API;
  • can be searched by using the OpenCitations Indexes Search Interface; 
  • are also available as dumps on Figshare in CSV, N-Triples, and Scholix; and 
  • can be freely re-used for any purpose. 
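
As an illustration of the REST API route, the sketch below builds a request URL for the citations of a given DOI and extracts the citing DOIs from the JSON that comes back. The base URL, the `citations` operation and the `citing` field name reflect our understanding of the COCI API at the time of writing and should be treated as assumptions to be checked against the current API documentation; the function names are invented for this example, and the sample records use hypothetical DOIs.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the COCI REST API (verify against the API documentation).
COCI_API = "https://opencitations.net/index/coci/api/v1"


def citations_url(doi):
    """Build the URL that asks for all citations pointing to the given DOI."""
    return f"{COCI_API}/citations/{quote(doi, safe='/')}"


def citing_dois(records):
    """Extract the citing DOI from each citation record in a decoded response."""
    return [record["citing"] for record in records if "citing" in record]


def fetch_citations(doi):
    """Call the API over the network and return the decoded list of records."""
    with urlopen(citations_url(doi)) as response:
        return json.load(response)


# Offline demonstration with hypothetical records shaped like an API response.
sample_records = [
    {"citing": "10.1234/hypothetical-a", "cited": "10.1007/s11192-019-03217-6"},
    {"citing": "10.1234/hypothetical-b", "cited": "10.1007/s11192-019-03217-6"},
]
print(citations_url("10.1007/s11192-019-03217-6"))
print(citing_dois(sample_records))
```

To query the live service, one would call `citing_dois(fetch_citations("10.1007/s11192-019-03217-6"))`; that call is left out of the sketch so that it runs without network access.
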
Posted in Bibliographic references, Citations as First-Class Data Entities, Data publication, open access, Open Citation Identifiers, Open Citations, Open Science

From little acorns . . . A retrospective on OpenCitations

The initial vision

Now that OpenCitations is hosting over one billion freely available scholarly bibliographic citations, this is perhaps an opportune moment to look back to the start of this initiative. A little over eleven years ago, on 24 April 2010, I spoke at the Open Knowledge Foundation Conference, OKCon2010, in London, on the topic

OpenCitations: Publishing Bibliographic Citations as Linked Open Data

I reported that, earlier that same week, I had applied to Jisc for a one-year grant to fund the OpenCitations Project (opencitations.net). Jisc (at that time ‘The JISC’, the Joint Information Systems Committee) was tasked by the UK government, among other things, to support research and development in information technology for the benefit of the academic community.

The purpose of that original OpenCitations R&D project was to develop a prototype in which we:

  • harvested citations from the open access biomedical literature in PubMed Central;
  • described and linked them using CiTO, the Citation Typing Ontology [1];
  • encoded and organized them in an RDF triplestore; and
  • published them as Linked Open Data in the OpenCitations Corpus (OCC).
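
To make the second and third steps concrete, here is a minimal sketch of a citation expressed as a single RDF statement using the CiTO property `cito:cites`, serialized as one N-Triples line. It is not the code used in the project: identifying each work by its DOI URL is a simplifying assumption, the two DOIs are hypothetical placeholders, and the function names are invented for illustration.

```python
# CiTO property stating that one work cites another.
CITO_CITES = "http://purl.org/spar/cito/cites"


def doi_iri(doi):
    """Turn a DOI into an IRI (assumption: works are identified by DOI URLs)."""
    return f"https://doi.org/{doi}"


def citation_triple(citing_doi, cited_doi):
    """Serialize 'citing work cites cited work' as one N-Triples statement."""
    return f"<{doi_iri(citing_doi)}> <{CITO_CITES}> <{doi_iri(cited_doi)}> ."


# Hypothetical DOIs, used only to show the shape of the output.
triple = citation_triple("10.1234/citing-example", "10.1234/cited-example")
print(triple)
```

Lines of this form can be loaded directly into an RDF triplestore, after which the citation network can be queried with SPARQL.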

I told those at the conference that in this demonstration project, with limited JISC funding, we could not hope to “boil the whole ocean”, but that nevertheless there would be substantial benefits from even partial coverage of citation data from the scholarly literature:

  • We could show the way and establish best practice.
  • Despite partial coverage, all key papers would most likely be cited several times.
  • The overall topological structure of the citation network would be revealed.
  • We would create a ‘benchmark’ corpus of high-quality RDF citation data that could be used to develop analytical and visualization tools.
  • We could show the value of open citation data in helping scholars to discover full text articles of all types, and thus encourage subscription-access publishers to release their reference metadata.

The important thing, I said, was to make a start!

The Jisc OpenCitations Project

That JISC grant application was funded, and the project, to last for a year with modest funding of £100K, started in my lab in the Department of Zoology at Oxford University on 1st June 2010, and was subsequently extended for a further six months.

Using data from the Open Access subset of PubMed Central, we created the first prototype release of the OpenCitations Corpus of linked bibliographic citation data, containing 6,529,815 independent bibliographic records of both citing and cited entities, comprising references to ~20% of all post-1980 articles recorded in PubMed, including those to all the most important highly cited papers in every field of biomedical endeavour.

This achievement was almost entirely the result of the excellent work by our chief data wrangler Alex Dutton, whose skill and natural feel for linked data did wonders for this project. Ben O’Steen, Graham Klyne and Alistair Miles made important contributions.

The project also resulted in many other developments, described here, most of which were developed or at least initiated during a short but wonderfully productive collaboration with Silvio Peroni, who spent six months with me in 2010 as a doctoral student intern from the University of Bologna, to which he subsequently returned to complete his thesis and develop his academic career.

These included:

  • the deconstruction and re-development of the original version of CiTO into a suite of orthogonal and complementary ontologies covering the whole domain of scholarly publishing – the SPAR (Semantic Publishing and Referencing) Ontologies [2, 3];
  • the mapping of various existing metadata schemas into RDF using SPAR, including the DataCite Metadata Schema, and subsequently JATS (now the default NISO standard for XML markup of scholarly documents) [4]; and
  • the initiation of the Semantic Publishing Blog and this OpenCitations Blog.

Life after Jisc – the flowering of OpenCitations

After the Jisc funding ended and I, after a long career in biological teaching and research, formally retired from the Department of Zoology at Oxford University, members of the initial OpenCitations team moved on to other things. Like so many grant-funded academic projects whose initial financial support had dried up, OpenCitations could have foundered at that stage, as an interesting prototype but with too little content to be useful. However, the concept of providing an open alternative to proprietary citation indexes was too important to abandon. But how could it be transitioned into something enduring and useful, particularly when as a matter of principle one had decided that the citation data should be made freely available, thus precluding income generation by charging for ‘premium’ services or the formation of a commercial spin-off?

Finally, I realized that something radical needed to be done to move OpenCitations forward. I had maintained a lively collaboration with Silvio Peroni at the University of Bologna, resulting between 2011 and 2014 in the publication of 18 articles and conference papers concerning the SPAR ontologies, ontology development, documentation and visualization, and related topics, and in 2015 I invited him to start working with me directly on OpenCitations. It was the best decision I could have made. We decided to take the initial concept and re-implement it from the bottom up. OpenCitations gave Silvio a major computer science project to which he could apply his considerable talent, and soon resulted in the development of a revised RDF data model for describing citation data, the OpenCitations Data Model (OCDM) [5], and a suite of new software tools to harvest, organise and publish citations as linked open data [6]. The credit for almost all the subsequent conceptual and technical developments within OpenCitations, which have incrementally led to our present situation, is due to Silvio Peroni, and the scholarly community is indebted to him for the intelligence, skill and diligent application he has given to OpenCitations over the past six years. I am truly honoured to have Silvio as co-Director of OpenCitations, and wish to take this opportunity to acknowledge his contributions and to thank him publicly.

Our work on OpenCitations at that stage, summarized in [7], would not have been possible without the enthusiastic support of Silvio’s senior colleague Fabio Vitali and of the Department of Computer Science and Engineering at the University of Bologna, which not only provided a stimulating environment for Silvio’s post-doctoral work, but also supplied computing services and infrastructure at no charge to OpenCitations. It was also greatly helped by Professor David De Roure of Oxford University, who gave me an academic home and a formal affiliation within the Oxford e-Research Centre after my retirement from the Department of Zoology, which enabled me to continue to hold research grants.

As has been documented in earlier posts in this blog, we greatly benefitted in 2017 from a grant from the Alfred P. Sloan Foundation which enabled us to purchase a new and more powerful computing infrastructure for the sole use of OpenCitations and to extend and improve our software, and subsequently in 2019 by a project grant from the Wellcome Trust to develop the Open Biomedical Citations in Context Corpus, that permitted the extension of OCDM and SPAR for the characterization of in-text references and their textual contexts.

A significant breakthrough came in January 2018 with our decision to treat citations as first-class data entities, each with its own persistent identifier (PID), the Open Citations Identifier (OCI) [8]. This gave Silvio the freedom to envision a new kind of database, a citation index in which each citation had its own metadata, including citation timespan, citation categorization (e.g. self-citation), and of course the DOIs of the citing and cited publications. The creation of this new index was possible only with the incredible effort by Ivan Heibi, who served as a Research Fellow in the project funded by the Alfred P. Sloan Foundation at that time, and who was entirely responsible for developing the first version of the code necessary for creating such a database. Having harvested all the open references from Crossref metadata dumps, Silvio and Ivan created COCI, the OpenCitations Index of Crossref DOI-to-DOI Citations, which immediately became our principal source of open citations, the original OpenCitations Corpus being retained as a ‘sandbox’ in which to experiment with new data representations, for example those required for the Open Biomedical Citations in Context Corpus. Access to COCI was facilitated by Silvio’s development of a REST API, using his software tool RAMOSE (Restful API Manager Over SPARQL Endpoints), which enables the easily configurable deployment of a REST API over any SPARQL endpoint to an RDF triplestore [9]. We were able to organize all our data, both ‘traditional’ and new, and to encode it in RDF, thanks to the comprehensive OpenCitations Data Model [5], itself based on our SPAR Ontologies [3], which we evolved as necessary to accommodate new data representation requirements.
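To illustrate what treating a citation as a first-class data entity means in practice, here is a minimal sketch of a citation record that carries its own metadata. It is not the OpenCitations implementation: the class and field names are hypothetical, the identifier is a placeholder rather than a real OCI, the timespan is expressed in days purely for simplicity, and self-citation is approximated here as any overlap between the two author sets.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Citation:
    """A citation treated as an entity with its own identifier and metadata."""

    identifier: str            # placeholder PID, not a real OCI
    citing_doi: str
    cited_doi: str
    citing_date: date
    cited_date: date
    citing_authors: frozenset
    cited_authors: frozenset

    @property
    def timespan_days(self) -> int:
        """Days between publication of the cited and the citing work."""
        return (self.citing_date - self.cited_date).days

    @property
    def is_author_self_citation(self) -> bool:
        """True if at least one author appears on both works (an assumption)."""
        return bool(self.citing_authors & self.cited_authors)


# Placeholder values, for illustration only.
example = Citation(
    identifier="citation-0001",
    citing_doi="10.1000/example.citing",
    cited_doi="10.1000/example.cited",
    citing_date=date(2019, 1, 31),
    cited_date=date(2019, 1, 1),
    citing_authors=frozenset({"Author A", "Author B"}),
    cited_authors=frozenset({"Author B", "Author C"}),
)
print(example.timespan_days, example.is_author_self_citation)
```

Because each record is self-describing, a citation index built from such records can be filtered or counted by timespan or by citation category without first having to look up the metadata of the two publications.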

During this period we published a number of definitions, conference papers and journal articles documenting these advances, details of which can be found here. Of these, the most recent canonical publication describing OpenCitations as an infrastructure for open scholarship, and its datasets, tools, services and activities, is Peroni and Shotton (2020) [10]. We also established the Research Centre for Open Scholarly Metadata at the University of Bologna, primarily to handle administrative, financial and academic aspects of OpenCitations activities.

OpenCitations’ future

The problem remained: how to sustain the OpenCitations infrastructure financially. We were greatly helped by Bilder, Lin and Neylon’s formulation of the Principles of Open Scholarly Infrastructures (POSI) [11], in which they clearly pointed out that reliance solely on grant funding for specific projects was not the answer. OpenCitations’ compliance with POSI is described here. We were thus immensely grateful that SPARC Europe and other institutions had the wisdom to establish SCOSS (The Global Sustainability Coalition for Open Science Services) to facilitate the crowd-sourced financial support of useful open infrastructures by the scholarly community, including academic libraries, government agencies and other stakeholders. OpenCitations applied for SCOSS support in 2019, which led to the selection of OpenCitations for support in the SCOSS second round.

The donations we are now starting to receive from such stakeholders, and the new staff that this funding has recently allowed us to hire, signal the start of our transition from a financially vulnerable academic project to a sustainable open scholarly infrastructure of real value to the community.

The work of opening more of the global citation graph now requires two things:

  • that each publisher takes responsibility for ensuring that the references from all of its journal articles and books are submitted, together with all other bibliographic metadata, to open scholarly bibliographic metadata aggregators such as Crossref and DataCite, from which they can be indexed into open citation indexes of sufficient quality, depth of detail and breadth of coverage that these offer genuine alternatives to the expensive proprietary citation indexing services upon which the academic community presently relies; and
  • that the entire scholarly stakeholder community re-directs a fraction of the enormous sums currently spent on its subscriptions to proprietary bibliographic services in order to support Open Science infrastructures such as OpenCitations that make citations and other forms of scholarly metadata and objects freely available.

References

[1] David Shotton (2010). CiTO, the Citation Typing Ontology. J. Biomedical Semantics 1 (Suppl. 1): S6. http://dx.doi.org/10.1186/2041-1480-1-S1-S6

[2] Silvio Peroni, David Shotton (2012). FaBiO and CiTO: ontologies for describing bibliographic resources and citations. Web Semantics, 17: 33-43. https://doi.org/10.1016/j.websem.2012.08.001, OA at http://speroni.web.cs.unibo.it/publications/peroni-2012-fabio-cito-ontologies.pdf

[3] Silvio Peroni, David Shotton (2018). The SPAR Ontologies. In Proceedings of the 17th International Semantic Web Conference (ISWC 2018): 119-136. https://doi.org/10.1007/978-3-030-00668-6_8

[4] Peroni S, Lapeyre DA and Shotton D (2012) From Markup to Linked Data: Mapping NISO JATS v1.0 to RDF using the SPAR (Semantic Publishing and Referencing) Ontologies. Proc. 2012 JATS Conference, National Library of Medicine, Bethesda, Maryland, USA (October 2012): 16-17. http://www.ncbi.nlm.nih.gov/books/NBK100491/

[5] Marilena Daquino, Silvio Peroni, David Shotton (2020). The OpenCitations Data Model. Figshare. https://doi.org/10.6084/m9.figshare.3443876.v7

[6] Silvio Peroni, David Shotton, Fabio Vitali (2017). One Year of the OpenCitations Corpus: Releasing RDF-based scholarly citation data into the Public Domain. In The Semantic Web – ISWC 2017 (Lecture Notes in Computer Science Vol. 10588, pp. 184–192). Springer, Cham. https://doi.org/10.1007/978-3-319-68204-4_19

[7] Silvio Peroni, Alexander Dutton, Tanya Gray, David Shotton (2015). Setting our bibliographic references free: towards open citation data. Journal of Documentation, 71 (2): 253-277. http://dx.doi.org/10.1108/JD-12-2013-0166, OA at http://speroni.web.cs.unibo.it/publications/peroni-2015-setting-bibliographic-references.pdf

[8] Silvio Peroni, David Shotton (2019). Open Citation Identifier: Definition. Figshare. https://doi.org/10.6084/m9.figshare.7127816

[9] Daquino, M., Heibi, I., Peroni, S., & Shotton, D. (2021). Creating Restful APIs over SPARQL endpoints with RAMOSE. Semantic Web. http://arxiv.org/abs/2007.16079

[10] Silvio Peroni, David Shotton (2020). OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies, 1(1): 428-444. https://doi.org/10.1162/qss_a_00023

[11] Geoffrey Bilder, Jenny Lin, Cameron Neylon (2015). Principles for Open Scholarly Infrastructure. http://dx.doi.org/10.6084/m9.figshare.1314859

Posted in Citations as First-Class Data Entities, JISC, Ontologies, Open Citation Identifiers, Open Citations, Open Science

Reflections on the global citation graph

In his call for open citations, Dario Taraborelli hailed the scholarly citation graph (in which the nodes (vertices) are individual academic publications and the links (edges) represent bibliographic citations from one publication to another) as one of humankind’s most important intellectual achievements.

We all understand that the inclusion within our own academic publications of bibliographic references to the works of others is one of the most explicit ways of acknowledging the thoughts, discoveries, achievements and influences of other scholars, and their contributions to our own work. Not only does what we gain from their publications enable us to make intellectual progress, by “standing on the shoulders of giants” as Newton once famously observed [1], but the influence of these publications extends forward in time across the entire intellectual landscape, like gigantic shadows cast at sunset, whether or not those influenced by these publications have occasion to reference them in their own works.

A bibliographic citation is not only “a conceptual directional link from a citing entity to a cited entity, created by a human performative act of making a citation”, but it is additionally both enduring and retrospective. Enduring, because once made it persists for ever within the global corpus of scholarly literature, and retrospective because (with the exception of occasional contemporaneous citations) the cited publication predates the citing publication.

At the anterior margin of a crawling cell, cellular protrusive extension (for example of a pseudopodium) is achieved by the catalysed polymerization of new filaments of the cytoskeletal protein actin from attachment sites on an existing stationary actin filament network, pushing the cell margin forward [2]. The scholarly citation network (or citation graph, the two terms here being used interchangeably) is similarly dynamic and temporally directional, being extended forward as new works of scholarship are published. Extension of knowledge is achieved by the catalytic inspiration provided by existing academic publications, themselves temporally stationary within the expanding citation network, leading to the publication of new works of scholarship that cite these previous publications and thus extend the citation network further into the future. The citation graph is thus not just an acyclic directed graph, but an acyclic temporally directed graph. Indeed, it is this temporal aspect of the citation network that is one of its most important features.
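The temporal direction of the graph can be stated as a simple check: every citation should point from a later publication to an earlier (or, occasionally, contemporaneous) one. The following is a minimal sketch under that assumption; the function name, the data layout and the publication labels are all hypothetical.

```python
from datetime import date


def temporal_violations(pub_dates, citations):
    """Return the (citing, cited) pairs whose cited work is dated later
    than the citing work.

    Contemporaneous citations (equal dates) are allowed.
    """
    return [
        (citing, cited)
        for citing, cited in citations
        if pub_dates[cited] > pub_dates[citing]
    ]


# Toy data: three publications and the citations between them.
pub_dates = {
    "A": date(2010, 1, 1),
    "B": date(2015, 1, 1),
    "C": date(2020, 1, 1),
}
citations = [("B", "A"), ("C", "B"), ("C", "A")]

print(temporal_violations(pub_dates, citations))                 # no violations
print(temporal_violations(pub_dates, citations + [("A", "C")]))  # A cannot cite C
```

When no citation points to a strictly later work, ordering the publications by date also gives an order in which each work appears no earlier than anything it cites, which is what makes a time axis a natural layout for the graph.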

To use another analogy, the human genealogical tree is inherently multidimensional and difficult to represent pictorially in its entirety, because each new birth brings together the family trees of the child’s two parents. However, unless the parents are seriously promiscuous, the resulting genealogical tree is not impossibly complex. In contrast, the scholarly citation network is much more highly interlinked, since each new publication cites not just two but many preceding (‘parent’) publications, which themselves may beget many other citations.

Visualization of the global scholarly citation graph, or portions of it, is thus inherently difficult, and the important temporal aspect of the graph is the one ignored by almost every method used for visualizing aspects of that graph. Existing methods may take the broad view, showing the links, and the strength of those links, between one scholarly domain and another, thus visualizing the ‘structure of science’. Alternatively, they may take a more detailed view of a small section of the graph, visualizing the proximity of individual publications to one another. Often a radial display is chosen for this, which shows in closest proximity those papers directly referenced by the selected publication in the centre, then at a greater radius those papers referenced by the cited papers shown in the inner circle, and so on. Because of the graph’s complexity, such displays quickly lose intelligibility after two citation links.
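The rings of such a radial display correspond to a breadth-first traversal of the reference lists, starting from the selected publication. The sketch below is only an illustration of that idea, not the method of any particular tool; the function name, the dictionary layout and the paper labels are hypothetical.

```python
from collections import deque


def citation_rings(references, centre, max_depth=2):
    """Group papers by the number of citation links separating them
    from the central publication.

    references maps each paper to the list of papers it cites.
    Returns {1: [directly cited papers], 2: [papers cited by those], ...}.
    """
    rings = {}
    seen = {centre}
    queue = deque([(centre, 0)])
    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the outermost ring
        for cited in references.get(paper, []):
            if cited not in seen:
                seen.add(cited)
                rings.setdefault(depth + 1, []).append(cited)
                queue.append((cited, depth + 1))
    return rings


# Toy reference lists, for illustration only.
references = {"P": ["A", "B"], "A": ["C"], "B": ["C", "D"], "C": ["E"]}
print(citation_rings(references, "P"))
```

Each paper is placed in the innermost ring in which it is reached, and the traversal stops at max_depth, reflecting the point at which such displays become hard to read.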

Among a small number of visualization applications that do not ignore the temporal aspect of the graph is Citeology, a temporally based citation network visualization tool developed some years ago by Justin Matejka and colleagues at the design software company Autodesk [3]. Unfortunately, this innovative software prototype was not central to that company’s mission, development ceased, and the Citeology Java app is no longer available. However, in his last email to me, Justin Matejka kindly offered to help others re-create this application.

There is thus an urgent need for innovative new open-source visualization tools that will clearly and dynamically display portions of the global citation graph, for example the direct and indirect citation connections between any two publications or any two individuals, along the temporal axis of publication date. Developers within the open science community please step forward!

References

[1] Isaac Newton, in a 1675 letter to Robert Hooke, wrote “If I have seen further it is by standing on the shoulders of Giants.” https://discover.hsp.org/Record/dc-9792/

[2] Bruce Alberts et al. (2014). Molecular Biology of the Cell. 6th Edition. Garland Science. Chapter 16, The Cytoskeleton.

[3] Justin Matejka, Tovi Grossman, George Fitzmaurice (2012). Citeology: Visualizing Paper Genealogy. ACM Extended Abstracts on Human Factors in Computing Systems. https://www.autodesk.com/research/publications/citeology https://d2f99xq7vri1nk.cloudfront.net/CiteologyVideo.mp4

Posted in Bibliographic references, Information visualization, Open Citations, Open scholarship, Open Science, Semantic Publishing