Academia’s missing references

No-one is quite sure of the total number of scholarly publications within the global corpus. Indeed, that number is strongly influenced by the degree to which, in addition to books and journal articles, one includes within the definition of scholarly publications ‘grey literature’ such as reports published by official bodies, patents, etc. Consequently, the total number of scholarly references within those publications is also unknown, and this number too will vary according to the inclusion criteria chosen. Furthermore, Crossref Event Data and similar datasets recording social media mentions of journal articles in blog posts and tweets extend the concept of a reference beyond that used in ‘conventional’ citation indexes such as COCI.

We celebrate the fact that well over one billion bibliographic citations are now openly available under CC0 waivers in NIH OCC (the National Institutes of Health Open Citation Collection) [1,2] and COCI (the OpenCitations Index of Crossref DOI-to-DOI Citations) [3]. Despite present gaps in their coverage, they include references to all the most important publications within the global corpus, because these will all have been cited multiple times.

Open references available from Crossref and other aggregators and indexes

Crossref, with over 1.6 billion open references, is the largest single source of such bibliographic metadata. Significant numbers of references are also available in a variety of other databases, repositories and indexes.

NIH OCC (the National Institutes of Health Open Citation Collection) is a merger of several citation databases, drawing on PubMed for crucial article metadata, and augmenting this with information from full-text articles that have been made freely available on the internet [1]. The CiteSeerX database, the arXiv preprint repository, and the Dryad data repository are examples of different types of infrastructure that also publish open bibliographic references, while further article references are available from open aggregators such as DataCite and Wikidata. The publications in these sources may use their own DOIs, DOIs from the Crossref DOI registration agency, or no DOIs at all. In every case, those references will not appear in Crossref.

What is lacking is semantic coherence and interoperability between these sources, which would permit federated queries across them. This lack makes it difficult to obtain a comprehensive overview of the availability of open bibliographic references.

However, there are even more citations that are not yet freely and easily available anywhere in bulk, relating to the reference lists within publications of a number of distinct types. This blog post explores academia’s missing references – those that have not yet been documented within open, freely accessible citation indexes – and what might be done to bring these into the public domain.

1 References that are closed at Crossref

Eight years ago, I wrote:

“In this open-access age, it is a scandal that reference lists from journal articles — core elements of scholarly communication that permit the attribution of credit and integrate our independent research endeavours — are not readily and freely available for use by all scholars.” [4]

I stand by that statement, and, through OpenCitations [5], I have been working with colleagues to rectify the situation, as I described in my previous post. The Initiative for Open Citations (I4OC) can rightly be applauded for its part in encouraging almost all the major academic publishers who deposit references at Crossref to make them open.

The only major scholarly publisher not listed as an I4OC Participating Publisher is the Institute of Electrical and Electronics Engineers (IEEE), which, having deposited at Crossref reference lists for 58% of its preprints and publications, persists in keeping these deposited references closed and unavailable for indexing and re-use. It is to be hoped that IEEE will now realize that what it loses by not having its references openly available outweighs any benefits it might have received from keeping them closed, and will join its fellow publishers in ensuring that its Crossref-deposited references are made open, both for its current issues and for its back-number journal articles. I thus call again upon IEEE to change its present position, as both Elsevier and the American Chemical Society had the courage to do recently, and to instruct Crossref to open all IEEE references. A single email to Crossref Support, with the instruction “Open all references”, is all it would take!

References in Crossref are either open, ‘limited’ or closed. Limited references are available to those subscribing to Crossref Metadata Plus, which includes OpenCitations, but not to the general public. Closed references are not freely available to anyone. The following table shows the number of Crossref works in each category, and the number of references within those categories.

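| Category | Works | References | Average references per work |
|---|---|---|---|
| Without references | 70,477,843 (55.7% of total) | | |
| With references | 56,149,775 (44.3% of total) | | 30.9 |
| of which open | (38.9% of total) | | |
| of which limited | (2.3% of total) | | |
| of which closed | (3.1% of total) | | |

Table 1. Numbers of works and their references recorded in Crossref on 31 July 2021.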

2 References not deposited at Crossref for publications with Crossref DOIs

The number of works with Crossref DOIs that lack submitted references is surprisingly large. As of 31 July 2021, 70,477,843 publications (55.7% of all works recorded at Crossref) lacked deposited references (Table 1).

Crossref classifies all types of journal content, including editorials, book reviews and letters, as “journal articles”, so some of these works without deposited references genuinely lack them. However, the majority are conventional journal articles and books with reference lists that the publishers have simply not deposited at Crossref along with the other metadata for these works.

The average number of references per Crossref work with deposited references is 30.9 (Table 1). If, to make allowance for those works that genuinely lack references, we assume a conservative average of 25 references per work for the 70,477,843 works lacking deposited references, this means that there are over 1.75 billion references within these works that have not been deposited at Crossref, and thus are not conveniently available for indexing and reuse.
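The estimate above is simple arithmetic, and can be rechecked with a few lines of Python. This is only a sketch of the calculation: the two constants are the figures quoted in the text, and the variable and function names are chosen purely for illustration.

```python
# Figures quoted in the text (Crossref, 31 July 2021).
WORKS_WITHOUT_DEPOSITED_REFERENCES = 70_477_843
ASSUMED_AVERAGE_REFERENCES_PER_WORK = 25  # the conservative assumption made above


def estimate_undeposited_references(works: int, average_per_work: int) -> int:
    """Multiply the number of works by the assumed average reference count."""
    return works * average_per_work


estimate = estimate_undeposited_references(
    WORKS_WITHOUT_DEPOSITED_REFERENCES,
    ASSUMED_AVERAGE_REFERENCES_PER_WORK,
)
print(f"Estimated undeposited references: {estimate:,}")
```

Changing the assumed average shows how sensitive the estimate is to that single choice.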

These missing references relate both to large publishers that have submitted references for only some of their publications, and to small publishers that perhaps lack, or think they lack, the resources to deposit any reference lists in addition to the other metadata they are already sending to Crossref for each of their DOIs. However, there are several easy methods for depositing reference lists, as detailed in Crossref’s documentation. So I encourage all publishers who are supportive of Open Science to update their procedures and commence or complete the deposition of their publication reference lists at Crossref, starting with their current issues. Crossref Support will provide assistance if required. Note that a publisher does not have to subscribe to the Crossref Cited-by service to deposit its references!
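To give a rough idea of what a deposited reference list can look like, the sketch below builds a small XML fragment with Python's standard library. The element names (`citation_list`, `citation`, `doi`, `unstructured_citation`) reflect my understanding of the Crossref deposit schema and should be treated as assumptions to be checked against Crossref's own documentation; the input dictionaries, their keys, and the sample DOI are invented for illustration. The fragment is not a complete deposit file.

```python
import xml.etree.ElementTree as ET


def build_citation_list(references):
    """Build a <citation_list> element from simple reference dictionaries.

    Each dictionary is a hypothetical input format: it carries either a
    "doi" value or a free-text "text" value for the reference.
    """
    citation_list = ET.Element("citation_list")
    for index, ref in enumerate(references, start=1):
        citation = ET.SubElement(citation_list, "citation", {"key": f"ref{index}"})
        if ref.get("doi"):
            # A reference to a work that has a DOI.
            ET.SubElement(citation, "doi").text = ref["doi"]
        else:
            # A reference available only as an unparsed text string.
            ET.SubElement(citation, "unstructured_citation").text = ref["text"]
    return citation_list


sample_references = [
    {"doi": "10.1234/example.1"},  # invented DOI
    {"text": "An example report without a DOI (2001)."},
]
xml_text = ET.tostring(build_citation_list(sample_references), encoding="unicode")
print(xml_text)
```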

3 Citations missing in COCI: open references in Crossref to publications lacking DOIs

COCI is the OpenCitations Index of Crossref DOI-to-DOI Citations, and, as the name suggests, it indexes Crossref open references from works with Crossref DOIs to other works that have DOIs [3]. It therefore does not index open references in Crossref to works that, for whatever reason, lack DOIs.

The most recent (September 2021) release of COCI, based on the August Crossref dump, contains 1,186,958,898 citations between 69,074,291 unique works, comprising 51,103,720 citing bibliographic resources bearing Crossref DOIs and 56,105,783 cited bibliographic resources. Of the cited bibliographic resources, 38,135,212 bear a DOI issued by Crossref and have open or limited references (thus also being COCI citing resources), while 17,970,571 either have a Crossref DOI but lack open or limited references or have a DOI issued by another DOI registration agency such as DataCite (thus not being COCI citing resources).

Note that in Crossref, the ratio of works with open or limited references to works without open or limited references is 0.7:1 (Table 1). However, in COCI, the ratio of cited works with Crossref DOIs containing open or limited references to all other cited works is 2.1:1. Thus works with Crossref DOIs containing open or limited references are three times more likely to be cited than works that either have a Crossref DOI but lack open or limited references or have a DOI issued by another DOI registration agency. This is most likely because the most important journals from almost all the larger publishers now have open references. However, it is still a remarkable ratio.

Because references to works lacking DOIs are not included in COCI, the average number of bibliographic references per citing article in COCI is only 23.2, in contrast to the numbers of references to works of all types given in Table 1.
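The COCI figures quoted in this section are mutually consistent, as the following sketch shows. All of the counts, and the 0.7:1 Crossref ratio, are taken directly from the text; only the variable names are my own.

```python
# Counts quoted for the September 2021 release of COCI.
citations = 1_186_958_898
citing_works = 51_103_720
cited_works = 56_105_783
cited_and_citing = 38_135_212  # cited works that are also COCI citing resources
cited_only = 17_970_571        # cited works that are not COCI citing resources

# Works appearing on either side of a citation, counting the overlap once.
unique_works = citing_works + cited_works - cited_and_citing

# Ratio of cited works with open or limited references to all other cited works.
coci_ratio = cited_and_citing / cited_only

# The corresponding ratio for all Crossref works, as stated in the text.
crossref_ratio = 0.7
relative_likelihood = coci_ratio / crossref_ratio

# Average number of DOI-to-DOI references per citing work in COCI.
average_references = citations / citing_works

print(f"Unique works: {unique_works:,}")
print(f"COCI ratio: {coci_ratio:.1f}:1")
print(f"Relative likelihood of being cited: {relative_likelihood:.1f}")
print(f"Average references per citing work: {average_references:.1f}")
```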

From these data, it can be seen that there are over 480 million open Crossref references to a wide variety of works lacking DOIs that OpenCitations does not index in COCI. This is because of an intentional and fundamental limitation in the structure of the Open Citation Identifier (OCI) [6], requiring both citing and cited publications to have identifiers of the same type, that lies at the heart of the functionality of our OpenCitations Indexes.
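The effect of this DOI-to-DOI restriction can be illustrated with a small sketch that separates a reference list into entries that could appear in a DOI-to-DOI index and entries that could not. The `DOI` and `unstructured` keys follow what I believe to be the shape of reference entries in Crossref's JSON metadata, but they should be read as assumptions; the sample records and the function name are invented, and this is not the code OpenCitations uses.

```python
def split_by_cited_doi(references):
    """Partition reference entries by whether the cited work has a DOI."""
    with_doi, without_doi = [], []
    for ref in references:
        # Treat a missing or empty DOI value as "no DOI".
        doi = (ref.get("DOI") or "").strip()
        if doi:
            with_doi.append(ref)
        else:
            without_doi.append(ref)
    return with_doi, without_doi


sample_references = [
    {"key": "ref1", "DOI": "10.1234/example.1"},  # invented DOI
    {"key": "ref2", "unstructured": "An example report without a DOI (2001)."},
    {"key": "ref3", "DOI": ""},
]
indexable, not_indexable = split_by_cited_doi(sample_references)
print(len(indexable), "reference(s) could be indexed DOI-to-DOI")
print(len(not_indexable), "reference(s) lack a DOI for the cited work")
```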

OpenCitations is currently developing a solution that, without compromising that intentional design limitation in OCIs, will nevertheless permit us to index and publish these ‘missing’ references as Linked Open Data. We will report on this development in due course.

Crossref Event Data is a Crossref service and database that records mentions of publications bearing Crossref DOIs in social media, including tweets and blog posts, and in other non-traditional citation sources such as news items and Wikipedia articles. From today (Thursday 23rd September 2021), Crossref Event Data will start to include in its holdings open references from publications bearing Crossref DOIs to other publications bearing DOIs. Limited and closed references will not be included. Initially, open references from current publications will be included in Crossref Event Data, with open references from older works with DOIs being added later. In that respect it will come to resemble COCI, except that COCI also includes ‘limited’ references, treats citations as first-class data entities with their own identifiers, and makes its citation data available in RDF as Linked Open Data, as well as via a REST API. Subsequently, Crossref Event Data will also record references to publications with other forms of identifier, as OpenCitations also plans to do.
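As an illustration of citations treated as data entities with their own identifiers, the sketch below builds a request URL for the COCI REST API and reads the citing DOIs out of a response. The base URL and the JSON field names (`oci`, `citing`, `cited`) reflect my understanding of the API at the time of writing and are assumptions that may not match the current service; the sample response, including its identifiers, is entirely invented, and no network request is made.

```python
import json
from urllib.parse import quote

# Assumed base URL for the COCI REST API; check the current documentation.
COCI_API_BASE = "https://opencitations.net/index/coci/api/v1"


def citations_url(doi: str) -> str:
    """Build the URL that asks for all citations received by a DOI."""
    return f"{COCI_API_BASE}/citations/{quote(doi, safe='/')}"


def citing_dois(response_text: str) -> list:
    """Extract the DOI of each citing work from a JSON response."""
    return [record["citing"] for record in json.loads(response_text)]


# Invented sample response: each record describes one citation, which
# carries its own identifier alongside the citing and cited DOIs.
SAMPLE_RESPONSE = json.dumps([
    {"oci": "placeholder-oci-1", "citing": "10.1234/citing.a", "cited": "10.1234/cited.x"},
    {"oci": "placeholder-oci-2", "citing": "10.1234/citing.b", "cited": "10.1234/cited.x"},
])

print(citations_url("10.1234/cited.x"))
print(citing_dois(SAMPLE_RESPONSE))
```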

4 References available on publishers’ web sites

Many publishers, particularly those of Open Access works, already make their publication reference lists openly available on their own web sites. While this is commendable, it is not sufficient, since, if scholarly references are not made available in a centralized aggregator such as Crossref from which they can be conveniently harvested in bulk for analysis and re-use, they are much more difficult to access.

Scraping references from the HTML of individual web sites is difficult, time-consuming and liable to be incomplete. While Microsoft Academic achieved considerable success in scraping references from publishers’ web sites, possibly because of the special relationships these publishers have with the Microsoft search engine Bing, this service will unfortunately soon no longer be available, illustrating a problem inherent in scholarly infrastructures provided by commercial companies that do not adopt the Principles of Open Scholarly Infrastructures.

A consequence is that such publications will become increasingly ‘invisible’, as bibliographic and analytical services come to rely more and more on centrally available data.

6 References within PDFs of scholarly works lacking DOIs

There is a large but unknown quantity of reference-containing books, academic reports, patents and journal articles from publishers that, for their own good reasons, choose not to use DOIs. The text of many of these publications is already available in a marked-up machine-readable format such as JATS, used in preparation for publication, from which the reference lists could easily be extracted. Other publications are only available as PDFs, either as published Versions of Record or as preprints deposited in a variety of preprint repositories such as arXiv or CORE. Mining reference lists out of PDFs requires expertise in text mining and AI technologies, and is labour-intensive, since it usually requires the tuning of extraction algorithms to handle the particular styles and formatting of individual journals, one at a time. Two stages are involved: first, the recognition and extraction of the text of the individual references from the PDF, and second, the parsing of each text string into the component parts of the reference (author names, title, publication year, etc.). A considerable number of the citations within NIH-OCC have been obtained in this manner [1], commercial companies such as Lexical Intelligence specialize in this area, and publicly available software such as GROBID is available for the purpose. However, the overall task of extracting ‘missing’ academic references from the global PDF corpus is daunting in magnitude and would require a well-funded organization.
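The second of these two stages can be illustrated with a deliberately naive sketch that uses a single regular expression to split one reference string into its parts. It assumes the layout “Authors (Year). Title. Source.” used in this post's own reference list, and will fail on most other styles, which is precisely why per-journal tuning is needed in practice. The pattern and function names are my own, and this is not how GROBID or NIH-OCC work.

```python
import re
from typing import Optional

# Matches: optional "[n]" label, authors, "(year).", title, then the source.
REFERENCE_PATTERN = re.compile(
    r"^\s*(?:\[\d+\]\s*)?"
    r"(?P<authors>.+?)\s*"
    r"\((?P<year>\d{4})\)\.\s*"
    r"(?P<title>.+?)\.\s+"
    r"(?P<source>.+?)\s*$"
)


def parse_reference(text: str) -> Optional[dict]:
    """Split a reference string into components, or return None if it does not fit."""
    match = REFERENCE_PATTERN.match(text)
    if match is None:
        return None
    return {
        "authors": [name.strip() for name in match.group("authors").split(",")],
        "year": int(match.group("year")),
        "title": match.group("title"),
        "source": match.group("source"),
    }


parsed = parse_reference(
    "[4] David Shotton (2013). Open citations. Nature, 502 (7471): 295-297."
)
print(parsed)
unparsed = parse_reference("Shotton D. Open citations. Nature 2013;502:295-7")
print(unparsed)
```

The second call returns `None` because the string follows a different citation style, a small reminder of how brittle rule-based parsing is.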

The correct way to proceed would be for each publisher to take responsibility for liberating the references of their own publications, whether or not the publications themselves are open access, and whether these references are already available in a marked-up machine-readable format or only within PDF documents. Then, if the publisher still chose not to use DOIs or to submit these metadata to Crossref, these references could be submitted directly to OpenCitations for aggregation and publication as Linked Open Data.


From the foregoing discussion it is clear that the academic community has a long way to go before the majority of scholarly citations, the products of its own labours, are openly available for analysis and re-use. We at OpenCitations are working to address these issues and to publish more of these missing citations. However, completion of the task will require a coordinated collaborative international effort.

Are you willing to be involved?


[1] B. Ian Hutchins et al. (2019). The NIH Open Citation Collection: A public access, broad coverage resource. PLoS Biol. 17 (10): e3000385.

[2] B. Ian Hutchins (2021). A tipping point for open citation data. Quantitative Science Studies 2 (2): 433–437.

[3] Ivan Heibi, Silvio Peroni, David Shotton (2019). Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. Scientometrics 121 (2): 1213–1228.

[4] David Shotton (2013). Open citations. Nature 502 (7471): 295–297.

[5] Silvio Peroni, David Shotton (2020). OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies 1 (1): 428–444.

[6] Silvio Peroni, David Shotton (2019). Open Citation Identifier: Definition. Figshare.
