Guest blog post by Alberto Martín-Martín, Facultad de Comunicación y Documentación, Universidad de Granada, Spain <firstname.lastname@example.org>
In this post, as a contribution to Open Access Week, Alberto Martín-Martín compares COCI and other sources of open citation data with citation data from subscription services, and comments on their relative coverage.
Comprehensive bibliographic metadata is essential for the development of effective understanding and analysis across all phases of the research workflow. Commercial actors have historically filled the role of infrastructure providers of bibliographic and citation data, but their choice of subscription-based business models and/or restrictive user licenses has significantly limited how users and other parties can access, build upon, and redistribute the information available on those platforms. Locking bibliographic and citation metadata behind these barriers is problematic, as it hinders innovation and is an obstacle to reproducibility.
Fortunately, the process of digital transformation that scientific communication is currently undergoing is providing us with the tools to get closer to the ideal of science as a public good. One of the most successful initiatives in this area is Crossref, arguably the single most critical piece of research metadata infrastructure currently in existence. I consider the best thing about it to be its commitment to openness. Not only is Crossref responsible for minting many of the DOIs that are assigned to academic publications, but it also publishes metadata about these publications (for over 120 million records in their latest public data file) without imposing any access or reuse limitations.
Crossref metadata has already boosted innovation in a variety of academic-oriented tools. New discovery services such as Dimensions, The Lens, and Scilit all take advantage of Crossref metadata to keep their indexes up to date with the latest publications. The open-source reference manager Zotero is able to pull metadata associated with a given DOI from Crossref’s servers, providing an easy way to populate one’s personal reference collection that is more reliable than using Google Scholar. The Unpaywall database uses Crossref metadata (among other data sources) to keep track of which documents are Open Access, and this data is in turn used by Unsub, a service that helps libraries make more informed decisions about their journal subscriptions.
Historically, citation indexing has been a functionality available only from a few subscription-based data sources (most notably Web of Science and Scopus), or from free but largely restricted sources (e.g., Google Scholar). In recent years, however, commercial exclusivity over citation data has been waning. Digital publishing workflows make it easier for publishers to deposit the list of cited references along with the rest of the metadata when they register a new document in Crossref, and many are already doing it. Crossref’s policy is to make these lists of references publicly available by default, although publishers can elect to prevent their public release. From this, it follows that if most publishers deposited their reference lists in Crossref and consented to make them open, a comprehensive open citation index, one that is free of the restrictions present in traditional platforms, could be built.
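As a rough illustration of that idea, the sketch below inverts a handful of deposited reference lists into a DOI-to-DOI citation index. The record layout (`doi`, `references`, `open`) and the function name `build_citation_index` are assumptions made for this example, not the actual schema or pipeline used by Crossref or OpenCitations, and the DOIs are invented placeholders.

```python
from collections import defaultdict


def build_citation_index(records):
    """Map each cited DOI to the set of DOIs that cite it.

    `records` is a list of dicts with hypothetical keys:
      doi        - DOI of the citing document
      references - DOIs matched in its reference list (None if no DOI matched)
      open       - whether the publisher allows the list to be public
    """
    index = defaultdict(set)
    for record in records:
        if not record["open"]:
            continue  # closed reference lists cannot feed an open index
        for cited in record["references"]:
            if cited is None:
                continue  # a DOI-to-DOI index cannot represent this citation
            # DOIs are case-insensitive, so normalize before comparing
            index[cited.lower()].add(record["doi"].lower())
    return dict(index)


# Invented placeholder records, for illustration only.
records = [
    {"doi": "10.1000/a", "references": ["10.1000/x", None], "open": True},
    {"doi": "10.1000/b", "references": ["10.1000/X"], "open": True},
    {"doi": "10.1000/c", "references": ["10.1000/x"], "open": False},
]

index = build_citation_index(records)
print(index)
```

In this toy data, the third record cites the same document as the first two, but because its reference list is closed that citation never reaches the index. This is the kind of gap that opening reference lists removes.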
The Initiative for Open Citations (I4OC) is an advocacy group that has been working since 2017 to achieve this precise goal, and it has already managed to convince a large number of publishers (over two thousand) to open the references they deposit in Crossref. In the first half of 2021, Elsevier, the American Chemical Society, and Wolters Kluwer joined this group, so that all the major scholarly publishers now support I4OC and have open references at Crossref, with the exception of IEEE (the Institute of Electrical and Electronics Engineers). Thanks to the efforts of I4OC and the collaboration of publishers, 88% of the publications for which publishers have deposited references in Crossref are now open. This has allowed organizations such as OpenCitations (one of the founding members of I4OC) to create a non-proprietary citation index using these data, namely COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. Other open citation indexes such as the NIH Open Citation Collection (NIH-OCC) and Refcat have also been recently released.
How do such open citation indexes compare to long-established indexes? In 2019, I set out with colleagues to analyze the coverage of citations contained within the most widely used academic bibliographic data sources (Web of Science, Scopus, and Google Scholar) to a selected corpus of 2,515 highly-cited English-language documents published in 2006 from 252 subject categories, and to compare this to the coverage provided by some of the more recent data sources (Microsoft Academic, Dimensions, and COCI). At that time, COCI was the smallest of the six indexes, containing only 28% of all citations. For comparison, Web of Science contained 52%, and Scopus contained 57%.
There are a number of reasons for those differences: first, at that point some of the larger commercial publishers including Elsevier, IEEE, and ACS, which routinely deposit references in Crossref, had not yet opened them. Second, many smaller publishers still do not deposit their reference lists in Crossref. Third, COCI only captures citation relationships between documents that have DOIs, thus missing citations to publications that lack them. Finally, while for our study data collection from all sources was carried out during May/June of 2019, COCI at that time had not been updated since November 2018, which increased its disadvantage when compared to other data sources with more frequent updates.
Since Elsevier is the largest academic publisher in the world, its recent opening of references at Crossref resulted in a significant increase in the total number of openly available Crossref references. The most recent version of COCI (dated 3 September 2021, and based on open references to works with DOIs within the Crossref dump dated August 2021) now contains both the processed references from Elsevier, and the references in the most recently published articles by ACS (the complete backfile of ACS references will appear in future versions of COCI).
Given these significant developments, how much has the picture changed? To find this out, I updated our 2019 analysis using the version of COCI released on September 3rd 2021 and the NIH-OCC dataset released in the same month. To carry out a reasonably fair comparison while reusing the data extracted in 2019 from the other sources, I employed the same corpus of target documents, and only used citations in which the citing document was published before the end of June 2019. The intention was to learn how much the coverage of open citation data has grown as a result of the subsequent opening of reference lists in Crossref that were not public in 2019, and similar efforts.
The combination of COCI’s and NIH-OCC’s September 2021 releases contained more than 1.62 million citations to our sample corpus of documents from all areas, a 91% increase over the 0.85 million citations that we were able to recover in 2019 from COCI alone. Considering the citations available in all data sources, 53% of all citations are now available from these two open sources under CC0 waivers, up from the 28% we found in 2019. This coverage now surpasses the 52% found by Web of Science, and is much closer to the 54% found by Dimensions, and the 57% covered by Scopus. The relative overlap between the open sources and the other data sources has also significantly increased: in 2019 COCI found 47% of the citations available in Web of Science, whereas now open citation data sources find 87% of the WoS citations. In the case of Scopus, in 2019 COCI found 44% of the citations available in Scopus: the percentage available from open sources has now increased to 81%. The number of citations found by COCI but not present in the other data sources has also grown slightly. These data are presented graphically in Figure 1.
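The percentages above are of two kinds: coverage relative to all citations found by any source, and overlap relative to the citations found by one other source. A minimal sketch of both calculations follows, using invented toy sets rather than the study data. The function names `coverage` and `overlap` are assumptions for this example. The last lines check the reported growth figure against the rounded totals given in the text.

```python
def coverage(found, all_sources):
    """Share of the union of all sources' citations that `found` contains."""
    union = set().union(*all_sources.values())
    return len(found & union) / len(union)


def overlap(found, reference):
    """Share of the citations in `reference` that are also in `found`."""
    return len(found & reference) / len(reference)


# Toy sets of (citing, cited) pairs, invented for illustration only.
sources = {
    "open": {("a", "x"), ("b", "x"), ("c", "y")},
    "wos": {("a", "x"), ("b", "x"), ("d", "y")},
    "scopus": {("a", "x"), ("c", "y"), ("d", "y"), ("e", "y")},
}

print(f"open coverage: {coverage(sources['open'], sources):.0%}")
print(f"open vs wos:   {overlap(sources['open'], sources['wos']):.0%}")

# Growth check using the rounded totals reported in the post (in millions).
growth = (1.62 - 0.85) / 0.85
print(f"growth: {growth:.0%}")
```

Note that the two measures use different denominators, which is why a source can cover just over half of all known citations while still finding most of the citations present in another index.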
Fig. 1. Percentage of citations found by each database, relative to all citations (first row), and relative to the number of citations found by the other databases (subsequent rows).
Where are these new citations coming from? Well, as we might expect, references from articles published in Elsevier journals comprise the lion’s share of the newly found citations in open data sources (close to half of all new citations), as shown in Figure 2. But there are also some IEEE citations here. This is because until recently reference lists from IEEE publications were available in the ‘limited’ Crossref category to members of Crossref Metadata Plus, a paid-for service that provides a few additional advantages over the free services Crossref provides. As a member of Crossref Metadata Plus, OpenCitations obtained these reference lists while they were available and included them in COCI. Subsequently, IEEE decided to make their references completely closed, which explains why references from more recent IEEE publications are not included in COCI.
Fig. 2. The increases between 2019 and 2021 of citations indexed by open sources (COCI + NIH-OCC) from the articles of different publishers.
There can be no doubt that open citation data is of benefit to the entire academic community. Thanks to COCI, NIH-OCC, and similar initiatives, and despite some setbacks, we are already witnessing how open infrastructure can help us develop models and practices that are better aligned with the opportunities that our current digital environment offers and the challenges that our society faces.
Conclusion: The coverage of citation data available under CC0 waivers from open sources is now comparable to that from subscription sources such as Web of Science and Scopus, offering a viable alternative upon which to base open and reproducible metrics of academic performance.