Performing live time-traversal queries on RDF datasets

Guest post by Arcangelo Massari, University of Bologna

In this post, Arcangelo Massari, who recently graduated in Digital Humanities and Digital Knowledge under Professor Silvio Peroni at the University of Bologna, shares the results of his master’s thesis.

A particular problem in information retrieval is that of obtaining data from an evolving dataset, independent of the time at which that item of data was added, changed or removed. To permit such time-independent queries to be performed over evolving RDF datasets, I have developed two new pieces of open source software, time-agnostic-library [1] and time-agnostic-browser [2], that are now available from the OpenCitations GitHub repository.

The time-agnostic-library is a Python library to perform live time-traversal queries on RDF datasets. Time-traversal means being agnostic about time: a SPARQL query that is not run on the current state of the collection but over its entire history or over a specified timespan of that history [3]. This tool allows materializations – obtaining all versions of an entity over time, or its status at a given time. Furthermore, SPARQL queries can be performed to get the delta between two or more versions of one or more resources. Thereby, the time-agnostic-library realizes all the retrieval functionalities described in the taxonomy by Fernández et al. [3].

To complement this query software, the time-agnostic-browser is a web application built on top of the time-agnostic-library to achieve the same results via a graphical user interface.

The primary purpose of these developments is to offer a system for browsing the provenance [4] of RDF statements across time: who produced them, when, where the information was taken from, and what changes were made compared to the previous state of the resource. Knowledge of such information is essential because data changes over time, either because of the natural evolution of concepts or due to the correction of mistakes. Indeed, the latest version of knowledge may not be the most accurate. Such phenomena are particularly tangible in the Web of Data, as highlighted in a study by the Dynamic Linked Data Observatory, which noted the modification of about 38% of the nearly 90,000 RDF documents monitored for 29 weeks, and the permanent disappearance of 5% of them [5] (Figure 1).

Figure 1. Donut chart showing the results of the study conducted by the Dynamic Linked Data Observatory on the evolution of RDF documents [5].

Additionally, the truthfulness of data cannot be assessed without provenance records and a system to query them. In fact, the truth value of an assertion on the Web is never absolute, as demonstrated by Wikipedia, which in its official policy on the subject states: “The threshold for inclusion in Wikipedia is verifiability, not truth.” [6]. The Semantic Web does not alter that condition, and trustworthiness has to be evaluated by each application by probing the context of the statements [7]. It is a challenging task and thus, in the Semantic Web Stack, trust is the highest and most complex level to satisfy, subsuming all the previous ones (Figure 2).

Figure 2. The Semantic Web layers [7]. Trust is the uppermost level of the stack, subsuming all the others.

Notwithstanding these premises, at present the most extensive RDF datasets – DBpedia [8], Wikidata [9], Yago [10], and the Dynamic Linked Data Observatory [11] – do not use RDF to track changes and record the provenance of such changes. Instead, they all adopt backup-based archiving policies. Some of them, such as Yago 4, record provenance but not changes. As far as citation databases are concerned, OpenCitations is the only infrastructure to implement change-tracking mechanisms and to record full RDF provenance records for each data entity. Among the leading players in this field, neither Web of Science nor Scopus has adopted similar solutions.

In accordance with the OpenCitations Data Model (OCDM) [12], a provenance snapshot is generated by OpenCitations every time a bibliographical entity is created or modified. Each snapshot (prov:Entity) records the responsible agent (prov:wasAttributedTo), the generation time (prov:generatedAtTime), the invalidation time (prov:invalidatedAtTime), the primary source (prov:hadPrimarySource), and a link to the previous snapshot (prov:wasDerivedFrom), using terms from the Provenance Ontology. In addition, OCDM introduced a system to simplify restoring an entity’s status at a given time, by saving the delta between two versions as a SPARQL update query (prov:hasUpdateQuery) [13] (Figure 3). This approach enables one to restore an entity to a specific timepoint (snapshot) in a straightforward way by applying the inverse operations, i.e., deletions instead of additions, etc.

Figure 3. Provenance in the OpenCitations Data Model.
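The idea of restoring a past state by applying inverse operations can be illustrated with a minimal sketch. It does not use the time-agnostic-library or real SPARQL update queries: the Snapshot class, the restore_state function, the URIs, and the sample triples are all hypothetical, and each delta is simplified to a pair of Python sets. It assumes, as described above, that only the current state is stored together with the delta recorded in each snapshot.

```python
from dataclasses import dataclass, field
from datetime import datetime

Triple = tuple[str, str, str]


@dataclass
class Snapshot:
    """Hypothetical stand-in for one provenance snapshot of an entity."""
    generated_at: datetime                             # cf. prov:generatedAtTime
    added: set[Triple] = field(default_factory=set)    # triples the update inserted
    removed: set[Triple] = field(default_factory=set)  # triples the update deleted


def restore_state(current: set[Triple], snapshots: list[Snapshot],
                  target: datetime) -> set[Triple]:
    """Rebuild the entity's triples as they stood at `target`."""
    state = set(current)
    # Walk from the newest snapshot backwards, undoing each later change.
    for snap in sorted(snapshots, key=lambda s: s.generated_at, reverse=True):
        if snap.generated_at <= target:
            break  # this snapshot and all older ones were already in effect
        state -= snap.added    # the inverse of an insertion is a deletion
        state |= snap.removed  # the inverse of a deletion is an insertion
    return state


ENTITY = "https://example.org/br/1"  # invented URI, for illustration only
old_title = (ENTITY, "dcterms:title", "Old title")
new_title = (ENTITY, "dcterms:title", "New title")
identifier = (ENTITY, "datacite:hasIdentifier", "https://example.org/id/1")

history = [
    Snapshot(datetime(2020, 1, 1), added={old_title}),
    Snapshot(datetime(2021, 1, 1), added={new_title}, removed={old_title}),
    Snapshot(datetime(2021, 6, 1), added={identifier}),
]
current_state = {new_title, identifier}

state_mid_2020 = restore_state(current_state, history, datetime(2020, 6, 1))
state_early_2021 = restore_state(current_state, history, datetime(2021, 3, 1))
```

In this toy history, state_mid_2020 contains only the old title and state_early_2021 only the new one, since the identifier was added later. Working backwards from the current state means the latest version is always available without any reconstruction.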

This OCDM-based approach is used in practice in all the datasets related to the OpenCitations infrastructure, such as COCI, an open index containing almost 1.2 billion DOI-to-DOI citation links derived from the open reference data available in Crossref [14]. It is important to note that this OpenCitations provenance model is generic and reusable in any other context. Since the time-agnostic-library leverages OCDM, it too is generic and can be used for any RDF dataset that tracks changes and provenance as OpenCitations does.

The time-agnostic-library is released under the ISC license and is downloadable through pip [1]. Test-driven development was adopted as a software development process during its creation [15]. It makes three main classes available to the user: AgnosticEntity, VersionQuery, and DeltaQuery, for materializations, version queries, and delta queries, respectively (Listing 1).

Listing 1. Code template to achieve materializations, time-traversal queries, and delta queries.

All three operations can be performed over the entire available history of the dataset, or by specifying a time interval via a tuple in the form (START, END).
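As a rough illustration of what restricting an operation to a (START, END) tuple means, the sketch below selects the snapshot timestamps that fall inside an interval. It is not the library's own code: the within function and the sample timestamps are hypothetical, and treating None as an open bound and both endpoints as inclusive are assumptions made for this example only.

```python
from datetime import datetime
from typing import Optional

Interval = tuple[Optional[datetime], Optional[datetime]]


def within(moment: datetime, interval: Optional[Interval] = None) -> bool:
    """True if `moment` falls inside (START, END); None means 'no bound'."""
    if interval is None:  # no interval given: consider the whole history
        return True
    start, end = interval
    return (start is None or moment >= start) and (end is None or moment <= end)


# Invented generation times of three snapshots of the same entity
snapshot_times = [datetime(2020, 1, 1), datetime(2021, 1, 1), datetime(2021, 6, 1)]

window = (datetime(2020, 6, 1), datetime(2021, 3, 1))
selected = [t for t in snapshot_times if within(t, window)]
```

Here only the snapshot generated on 1 January 2021 falls inside the window, while omitting the interval keeps all three.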

The time-agnostic-browser [2] is also released under the ISC license and can be run as a Flask application. It is organized into two macro-sections: “Explore” and “Query”. In the former, a text input accepts a URI. By submitting it, the entire history of the corresponding resource is displayed. In the latter, a text area receives a SPARQL query, which is resolved on all dataset states. Its main added value is hiding the triples and the complexity of the underlying RDF model: predicate URIs, as well as subjects and objects, appear in a human-readable format. Moreover, all the entities are displayed as links, providing shortcuts to reconstruct the history of the related resources (Figure 4).

Figure 4. Graphical user interface of an entity history reconstruction through the time-agnostic-browser.

The efficiency of time-agnostic-library was measured with two types of benchmarks [16], one on execution times and the other on the amount of computer memory (RAM) required by ten different use cases, each repeated ten times to produce significant results and avoid outliers. In light of these benchmarks, time-agnostic-library has proven effective for any materialization. Regarding structured queries, they are swift if all subjects are known or deducible. On the other hand, the presence of unknown subjects in the user’s SPARQL query involves the identification of all present and past entities that satisfy that pattern, and so requires a more significant amount of time and resources. Specifically, all materializations and the cross-version structured query with known subjects required about half a second and about 50 MB of RAM; conversely, with unknown subjects, 581 seconds and 519 MB of RAM are required. It can be concluded that the proposed software can be used effectively in all cases where the subject is known, that is, for any materialization or for SPARQL queries formulated without isolated triple patterns containing unknown subjects.

Other software solutions for such problems have been proposed. Table 1 shows the list of available software to perform materializations and time-traversal queries on RDF datasets. As can be observed, time-agnostic-library is the only one to support all retrieval functionalities without requiring pre-indexing processes. This feature makes it particularly suitable for use in scenarios with large amounts of data that often change over time. Moreover, compared to the approach of Im, Lee and Kim [17] and OSTRICH [18], the OpenCitations Data Model only requires storing the current state of the dataset, rather than the original one, allowing one to query the latest version, without additional computational effort to first re-create the original version.

Software | Version materialization | Delta materialization | Single-version structured query | Cross-version structured query | Single-delta structured query | Cross-delta structured query | Live
PromptDiff [19]  +++
SemVersion [20]  +++
Im, Lee, & Kim, 2012 [17]  +++++
R&Wbase [21]  ++++
x-RDF-3X [22]  +++
v-RDFCSA [23]  ++++++
OSTRICH [18]  +++
Tanon & Suchanek, 2019 [24]  ++++++
time-agnostic-library [1]  +++++++
Table 1. Comparison between time-agnostic-library and preexisting software to achieve materializations and time-traversal queries on RDF datasets.

The OpenCitations Data Model and the time-agnostic-library software are the prerequisites that will allow OpenCitations to involve third parties, for example members of staff in academic libraries, in the submission, curation and updating of OpenCitations bibliographic and citation data. At this stage, all entities in COCI have a single snapshot, namely the one made at the time of creation. However, since these entities may be modified, corrected or enriched over time, it is imperative to have appropriate software tools available for use by curators. With the time-agnostic-library software and its associated time-agnostic-browser, it will be possible for a curator to explore the entire history of the changes within an RDF dataset, to know when they were made, based on which source, and by which responsible agent, thus ensuring the reliability and verifiability of data, and facilitating any necessary further changes.

References

[1] A. Massari, time-agnostic-library. 2021. Available: https://archive.softwareheritage.org/swh:1:snp:d7fd1754377f45d16afb61efc770815b5a3c8f83

[2] A. Massari, time-agnostic-browser. 2021. Available: https://archive.softwareheritage.org/swh:1:dir:337f641375cca034eda39c2380b4a7878382fc4c

[3] J. D. Fernández, A. Polleres, and J. Umbrich, ‘Towards Efficient Archiving of Dynamic Linked Open Data’, in DIACHRON@ESWC, Portorož, Slovenia: Computer Science, 2015, pp. 34–49.

[4] W3C Provenance Incubator Group, ‘Provenance XG Final Report’. Dec. 2010. Available: http://www.w3.org/2005/Incubator/prov/XGR-prov-20101214/

[5] T. Käfer, A. Abdelrahman, J. Umbrich, P. O’Byrne, and A. Hogan, ‘Observing Linked Data Dynamics’, in The Semantic Web: Semantics and Big Data, vol. 7882, P. Cimiano, O. Corcho, V. Presutti, L. Hollink, and S. Rudolph, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 213–227. doi: 10.1007/978-3-642-38288-8_15

[6] S. L. Garfinkel, ‘Wikipedia and the Meaning of Truth’, MIT Technology Review, 2008, [Online]. Available: https://stephencodrington.com/Blogs/Hong_Kong_Blog/Entries/2009/4/11_What_is_Truth_files/Wikipedia%20and%20the%20Meaning%20of%20Truth.pdf

[7] M.-R. Koivunen and E. Miller, ‘Semantic Web Activity’, W3C, Nov. 02, 2001. https://www.w3.org/2001/12/semweb-fin/w3csw

[8] F. Orlandi and A. Passant, ‘Modelling provenance of DBpedia resources using Wikipedia contributions’, Journal of Web Semantics, vol. 9, no. 2, pp. 149–164, Jul. 2011, doi: 10.1016/j.websem.2011.03.002.

[9] P. Dooley and B. Božić, ‘Towards Linked Data for Wikidata Revisions and Twitter Trending Hashtags’, in Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services, Munich, Germany, Dec. 2019, pp. 166–175. doi: 10.1145/3366030.3366048.

[10] Yago Project, ‘Download data, code, and logo of Yago projects’, Yago, 2021. https://yago-knowledge.org/downloads (accessed Sep. 24, 2021).

[11] J. Umbrich, M. Hausenblas, A. Hogan, A. Polleres, and S. Decker, ‘Towards Dataset Dynamics: Change Frequency of Linked Open Data Sources’, in Proceedings of the WWW2010 Workshop on Linked Data on the Web, Raleigh, USA, 2010. Available: http://ceur-ws.org/Vol-628/ldow2010_paper12.pdf

[12] M. Daquino, S. Peroni, and D. Shotton, ‘The OpenCitations Data Model’, 2020, doi: 10.6084/M9.FIGSHARE.3443876.V7.

[13] S. Peroni, D. Shotton, and F. Vitali, ‘A Document-inspired Way for Tracking Changes of RDF Data’, in Detection, Representation and Management of Concept Drift in Linked Open Data, Bologna, 2016, pp. 26–33. Available: http://ceur-ws.org/Vol-1799/Drift-a-LOD2016_paper_4.pdf

[14] I. Heibi, S. Peroni, and D. Shotton, ‘Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations’, Scientometrics, vol. 121, no. 2, pp. 1213–1228, Nov. 2019, doi: 10.1007/s11192-019-03217-6.

[15] K. Beck, Test-driven development: by example. Boston: Addison-Wesley, 2003.

[16] A. Massari, ‘time-agnostic-library: benchmark results on execution times and RAM’. Zenodo, Oct. 05, 2021. doi: 10.5281/ZENODO.5549648.

[17] D.-H. Im, S.-W. Lee, and H.-J. Kim, ‘A Version Management Framework for RDF Triple Stores’, Int. J. Softw. Eng. Knowl. Eng., vol. 22, pp. 85–106, 2012.

[18] R. Taelman, M. V. Sande, and R. Verborgh, ‘OSTRICH: Versioned Random-Access Triple Store’, in Companion Proceedings of the Web Conference 2018, 2018, pp. 127–130. Available: https://core.ac.uk/download/pdf/157574975.pdf

[19] N. F. Noy and M. A. Musen, ‘Promptdiff: A Fixed-Point Algorithm for Comparing Ontology Versions’, in Proc. of IAAI, 2002, pp. 744–750.

[20] M. Völkel, W. Winkler, Y. Sure, S. Kruk, and M. Synak, ‘SemVersion: A Versioning System for RDF and Ontologies’, 2005.

[21] M. V. Sande, P. Colpaert, R. Verborgh, S. Coppens, E. Mannens, and R. V. Walle, ‘R&Wbase: Git for triples’, 2013.

[22] T. Neumann and G. Weikum, ‘x-RDF-3X: Fast Querying, High Update Rates, and Consistency for RDF Databases’, Proceedings of the VLDB Endowment, vol. 3, pp. 256–263, 2010.

[23] A. Cerdeira-Pena, A. Farina, J. D. Fernandez, and M. A. Martinez-Prieto, ‘Self-Indexing RDF Archives’, in 2016 Data Compression Conference (DCC), Snowbird, UT, USA, Mar. 2016, pp. 526–535. doi: 10.1109/DCC.2016.40.

[24] T. Pellissier Tanon and F. Suchanek, ‘Querying the Edit History of Wikidata’, in The Semantic Web: ESWC 2019 Satellite Events, vol. 11762, P. Hitzler, S. Kirrane, O. Hartig, V. de Boer, M.-E. Vidal, M. Maleshkova, S. Schlobach, K. Hammar, N. Lasierra, S. Stadtmüller, K. Hose, and R. Verborgh, Eds. Cham: Springer International Publishing, 2019, pp. 161–166. doi: 10.1007/978-3-030-32327-1_32.

Posted in Bibliographic references, Citations as First-Class Data Entities, Data publication, Open Citations

Coverage of open citation data approaches parity with Web of Science and Scopus

Guest blog post by Alberto Martín-Martín, Facultad de Comunicación y Documentación, Universidad de Granada, Spain <albertomartin@ugr.es>

In this post, as a contribution to Open Access Week, Alberto Martín-Martín shares his comparative analysis of COCI and other sources of open citation data with those from subscription services, and comments on their relative coverage.

Comprehensive bibliographic metadata is essential for the development of effective understanding and analysis across all phases of the research workflow. Commercial actors have historically filled the role of infrastructure providers of bibliographic and citation data, but their choice of subscription-based business models and/or restrictive user licenses has significantly limited how users and other parties can access, build upon, and redistribute the information available on those platforms. Locking bibliographic and citation metadata behind these barriers is problematic, as it hinders innovation and is an obstacle to reproducibility.

Fortunately, the process of digital transformation that scientific communication is currently undergoing is providing us with the tools to get closer to the ideal of science as a public good. One of the most successful initiatives in this area is Crossref, arguably the single most critical piece of research metadata infrastructure currently in existence. I consider the best thing about it to be its commitment to openness. Not only is Crossref responsible for minting many of the DOIs that are assigned to academic publications, but it also publishes metadata about these publications (over 120 million records in its latest public data file) without imposing any access or reuse limitations.

Crossref metadata has already boosted innovation in a variety of academic-oriented tools. New discovery services such as Dimensions, The Lens, and Scilit all take advantage of Crossref metadata to keep their indexes up to date with the latest publications. The open-source reference manager Zotero is able to pull metadata associated with a given DOI from Crossref’s servers, providing an easy way to populate one’s personal reference collection that is more reliable than using Google Scholar. The Unpaywall database uses Crossref metadata (among other data sources) to keep track of which documents are Open Access, and this data is in turn used by Unsub, a service that helps libraries make more informed decisions about their journal subscriptions.

Historically, citation indexing has been a functionality available only from a few subscription-based data sources (most notably Web of Science and Scopus), or from free but largely restricted sources (e.g., Google Scholar). In recent years, however, commercial exclusivity over citation data has been waning. Digital publishing workflows make it easier for publishers to deposit the list of cited references along with the rest of the metadata when they register a new document in Crossref, and many are already doing it. Crossref’s policy is to make these lists of references publicly available by default, although publishers can elect to prevent their public release. From this, it follows that if most publishers deposited their reference lists in Crossref and consented to make them open, a comprehensive open citation index, one that is free of the restrictions present in traditional platforms, could be built.

The Initiative for Open Citations (I4OC) is an advocacy group that has been working since 2017 to achieve this precise goal, and it has already managed to convince a large number of publishers (over two thousand) to open the references they deposit in Crossref. In the first half of 2021, Elsevier, the American Chemical Society, and Wolters Kluwer joined this group, so that all the major scholarly publishers now support I4OC and have open references at Crossref, with the exception of IEEE (the Institute of Electrical and Electronics Engineers). Thanks to the efforts of I4OC and the collaboration of publishers, 88% of the publications for which publishers have deposited references in Crossref are now open. This has allowed organizations such as OpenCitations (one of the founding members of I4OC) to create a non-proprietary citation index using these data, namely COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. Other open citation indexes such as the NIH Open Citation Collection (NIH-OCC) and Refcat have also been recently released.

How do such open citation indexes compare to long-established indexes? In 2019, I set out with colleagues to analyze the coverage of citations contained within the most widely used academic bibliographic data sources (Web of Science, Scopus, and Google Scholar) to a selected corpus of 2,515 highly-cited English-language documents published in 2006 from 252 subject categories, and to compare this to the coverage provided by some of the more recent data sources (Microsoft Academic, Dimensions, and COCI). At that time, COCI was the smallest of the six indexes, containing only 28% of all citations. For comparison, Web of Science contained 52%, and Scopus contained 57%.

There are a number of reasons for those differences: first, at that point some of the larger commercial publishers including Elsevier, IEEE, and ACS, which routinely deposit references in Crossref, had not yet opened them. Second, many smaller publishers still do not deposit their reference lists in Crossref. Third, COCI only captures citation relationships between documents that have DOIs, thus missing citations to publications that lack them. Finally, while for our study data collection from all sources was carried out during May/June of 2019, COCI at that time had not been updated since November 2018, which increased its disadvantage when compared to other data sources with more frequent updates.

Since Elsevier is the largest academic publisher in the world, its recent opening of references at Crossref resulted in a significant increase in the total number of openly available Crossref references. The most recent version of COCI (dated 3 September 2021, and based on open references to works with DOIs within the Crossref dump dated August 2021) now contains both the processed references from Elsevier, and the references in the most recently published articles by ACS (the complete backfile of ACS references will appear in future versions of COCI).

Given these significant developments, how much has the picture changed? To find this out, I updated our 2019 analysis using the version of COCI released on September 3rd 2021 and the NIH-OCC dataset released in the same month. To carry out a reasonably fair comparison while reusing the data extracted in 2019 from the other sources, I employed the same corpus of target documents, and only used citations in which the citing document was published before the end of June 2019. The intention was to learn how much the coverage of open citation data has grown as a result of the subsequent opening of reference lists in Crossref that were not public in 2019, and similar efforts.

The combination of COCI’s and NIH-OCC’s September 2021 releases contained more than 1.62 million citations to our sample corpus of documents from all areas, a 91% increase over the 0.85 million citations that we were able to recover in 2019 from COCI alone. Considering the citations available in all data sources, 53% of all citations are now available from these two open sources under CC0 waivers, up from the 28% we found in 2019. This coverage now surpasses the 52% found by Web of Science, and is much closer to the 54% found by Dimensions, and the 57% covered by Scopus. The relative overlap between COCI and the other data sources has also significantly increased: in 2019 COCI found 47% of the citations available in Web of Science, whereas now open citation data sources find 87% of the WoS citations. In the case of Scopus, in 2019 COCI found 44% of the citations available in Scopus: the percentage available from open sources has now increased to 81%. The number of citations found by COCI but not present in the other data sources has also increased slightly. These data are presented graphically in Figure 1.

Fig. 1. Percentage of citations found by each database, relative to all citations (first row), and relative to the number of citations found by the other databases (subsequent rows).
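The three kinds of figures quoted above (growth, overall coverage, and relative overlap) follow from simple set arithmetic. The sketch below is only an illustration of that arithmetic, not the analysis code used for the study: the function names are hypothetical and the toy citation sets are invented. Only the 0.85 and 1.62 million totals come from the text.

```python
def percent_increase(old: float, new: float) -> float:
    """Growth from `old` to `new`, as a percentage of `old`."""
    return 100 * (new - old) / old


def coverage(found: set, all_citations: set) -> float:
    """Share of all known citations that one source has found."""
    return 100 * len(found & all_citations) / len(all_citations)


def relative_overlap(source: set, reference: set) -> float:
    """Share of the reference source's citations also found by `source`."""
    return 100 * len(source & reference) / len(reference)


# Totals quoted in the text, in millions of citations
growth = percent_increase(0.85, 1.62)

# Invented toy data: each string stands for one citing-cited pair
open_sources = {"c1", "c2", "c3"}
wos = {"c2", "c3", "c4", "c5"}
everything = open_sources | wos

open_coverage = coverage(open_sources, everything)
open_vs_wos = relative_overlap(open_sources, wos)
```

With the totals from the text, growth is about 90.6, which rounds to the 91% reported. In the toy data the open sources cover 60% of all citations and 50% of those found by the second source.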

Where are these new citations coming from? Well, as we might expect, references from articles published in Elsevier journals comprise the lion’s share of the newly found citations in open data sources (close to half of all new citations), as shown in Figure 2. But there are also some IEEE citations here. This is because until recently reference lists from IEEE publications were available in the ‘limited’ Crossref category to members of Crossref Metadata Plus, a paid-for service that provides a few additional advantages over the free services Crossref provides. As a member of Crossref Metadata Plus, OpenCitations obtained these reference lists while they were available and included them in COCI. Subsequently, IEEE decided to make their references completely closed, explaining why references from more recent IEEE publications are not included in COCI.

Fig. 2. The increases between 2019 and 2021 of citations indexed by open sources (COCI + NIH-OCC) from the articles of different publishers.

There can be no doubt that open citation data is of benefit to the entire academic community. Thanks to COCI, NIH-OCC, and similar initiatives, and despite some setbacks, we are already witnessing how open infrastructure can help us develop models and practices that are better aligned with the opportunities that our current digital environment offers and the challenges that our society faces.

Conclusion: The coverage of citation data available under CC0 waivers from open sources is now comparable to that from subscription sources such as Web of Science and Scopus, offering a viable alternative upon which to base open and reproducible metrics of academic performance.

Posted in Bibliographic references, Open academic analytics, open access, Open Citations, Open Science

Open Access Tage 2021: valuable insights from the libraries in the German-speaking region 

On September 27, OpenCitations’ director Silvio Peroni, together with Niels Stern (DOAB/OAPEN) and James MacGregor (PKP), held the online workshop “How Open Infrastructure Benefits Libraries” during the Open Access Tage 2021. Open-Access-Tage (Open Access Days) are the annual central platform for the steadily growing Open Access and Open Science community from Germany, Austria and Switzerland, and are aimed at all those involved with the possibilities, conditions and perspectives of scientific publishing.  

The workshop gathered three of the SCOSS-supported infrastructures to discuss how Open Infrastructures (OIs) could encourage the engagement of university libraries, and how they could be a beneficial, game-changing alternative to commercial infrastructures. This theme, which was also presented during the last LIBER conference, was here discussed from a new perspective, by involving the specific case of the libraries from the German-speaking region. Their point of view particularly emerged during the second part of the workshop, during which the participants were divided into two breakout rooms to discuss two questions each. These are the answers and comments that emerged from the discussions.

1. What would prevent or encourage libraries in the German-speaking region to support open infrastructures? 

The three main concepts held to be crucial in this field were transparency, promotion and governance.  

Transparency: German libraries and public institutions often deal with strict funding limitations relating to donations. It is therefore crucial for OIs (a) to present in a clear way how libraries can get involved and the money needed, (b) to communicate what they do and how they can add value to libraries compared to other services, and (c) to clarify the direct return and benefits on investments. These points would make it easier to recommend OIs internally, especially when people from subject-specific institutions are interested in subject-independent OIs. Point (b) leads to the Promotion issue: Open Infrastructures should promote themselves not only on a global level, by communicating their impact in the open research movement as against non-transparent proprietary services, but also at a local level, by providing information about the usage (and the value) of their services at an institutional and/or national level. This case-by-case narration (with attention to the specific benefits) would make it easier for the institutions to evaluate the sustainability of the investment. An incentive to donate is being actively involved in the community governance, i.e., through a board membership.

Nevertheless, it is also necessary for libraries to “take courage” when investing in such OIs, and, when possible, to overcome administrative boundaries by forming consortia. Finally, of particular note was a desire to see locally-managed sub-communities that could speak specifically to the German (or whichever) language environment, much as ORCID arranges itself.

2. Community Governance. What kind of involvement do you want to see, how do you want to be involved? 

Some common problems which prevent the institutions from being involved are (a) a general concern about the fact that negotiations with publishers are typically the main focus of OA discussions, leaving little time to focus on OIs and other open initiatives, and (b) a lack of time, or of guidelines, for evaluating the different infrastructures to invest in. This is why SCOSS was appreciated as an intermediary in the decision process, because of its own rigorous evaluation and selection mechanism. The community-funding approach proposed by SCOSS thus seems to be the preferred way by which to support OIs.

Regarding community governance, one idea could be to involve interested scholars in the governance of the open infrastructures (with the library acting as an interface between the open infrastructure and the scholars) rather than only involving library staff, although this idea was argued against in the second group, as researchers are often perceived as too busy to be functional in operational infrastructure groups. What also emerged from this second question is an interest in community involvement on different levels, for example as a community of practices or through discussion boards, mailing lists, periodic meet-ups, workshops, newsletters, etc. The community could also be articulated into local sub-communities, as in the successful case of ORCID and ORCID_DE.

Posted in Bibliographic references, open access, Open Citations, Open scholarship, Open Science

Save the dates: OpenCitations October events 

With the numerous September events in which the OpenCitations’ directors have recently been involved behind us, it is now time to announce the participation of our director Silvio Peroni in two October events. 

On Wednesday 6th, Silvio will take part in the Beilstein OpenScience Symposium 2021 (October 5-7), giving a short presentation “Open Citations, an Open Infrastructure to Provide Access to Global Scholarly Bibliographic and Citation Data” during the Poster Flash Talk Presentation (17:10-18:00 CEST). The Beilstein OpenScience Symposium is an annual event that gathers leaders in the FAIR and Open Data movement, covering a wide range of research fields, including biomedical research, physics and social science, and exploring how open data practices are transforming sectors outside academia. The 2021 online edition will present a series of talks addressing the many ways that data transparency contributes to research progress. Among them, the poster presentations involve short oral presentations on Wednesday, 6th October, to accompany the posters that will be displayed throughout the entire symposium. Poster abstracts are available in the Abstract Book that can be downloaded from the Beilstein Symposium’s website: https://www.beilstein-institut.de/en/symposia/open-science/program/ .

You can register for the event here: https://www.beilstein-institut.de/en/symposia/open-science/registration/

The poster and slides from the presentation are available on Zenodo:

Peroni, S. (2021, October 6). Open Citations, an Open Infrastructure to Provide Access to Global Scholarly Bibliographic and Citation Data—Poster Flash Talk Session slides. Beilstein Open Science Symposium 2021, Virtual Event. Zenodo. https://doi.org/10.5281/zenodo.5553025

Peroni, S., Shotton, D. W., & Di Giambattista, C. (2021). Open Citations, an Open Infrastructure to Provide Access to Global Scholarly Bibliographic and Citation Data. https://doi.org/10.5281/zenodo.5553040

The second event is the annual European Computer Science Summit (ECSS), organized by Informatics Europe and involving academics, industry leaders, decision makers and others interested in Informatics/Computer Science research and education in Europe. ECSS 2021, “Informatics for a Sustainable Future” (Oct. 25-27), will be held as a hybrid event, with both online and on-site sessions, the latter in Madrid at the Facultad de Ciencias de la Actividad Física y del Deporte (INEF), Universidad Politécnica de Madrid, Calle de Martín Fierro, 7.

During the last day of the meeting (Oct. 27), Silvio Peroni will be speaking at the “National Informatics Associations Workshop”, an annual workshop organised by Informatics Europe in collaboration with the National Informatics Associations in Europe. This year the workshop will address the themes Informatics in Interdisciplinary Curricula and Research Evaluation in Informatics, thus focusing on an important question: how to recognise, assess and credit research contributions specific to Informatics, such as conference publications and software artefacts. Elaborating on this, Silvio’s talk (to be delivered in person, rather than online!) is entitled “Open citations in Informatics” (9:00 CEST).  

For further information and registrations: https://www.informatics-europe.org/ecss/registration/how-to-register.html

We thank Beilstein Institute and Informatics Europe for involving OpenCitations in these international events, which provide opportunities to promote the OpenCitations infrastructure and services in stimulating environments.

We hope to see you there!

Posted in Bibliographic references, open access, Open Citations, Open Science, Uncategorized

OpenCitations in Five Hundred Words

Yesterday I gave a lightning talk at the 2021 OASPA Conference, with the title OpenCitations – what does the future hold? The poster accompanying my talk, published on Zenodo at https://doi.org/10.5281/zenodo.5526713, is reproduced below.

Poster for 2021 OASPA Conference Lightning Talk

Here is what I said:

= = =

Most of the talks at this conference have focussed on open access to textual content. But open bibliographic metadata is also vitally important, not least to enable the calculation of metrics that are both open and reproducible.

OpenCitations is a not-for-profit open infrastructure that provides such free access to global scholarly citations. We hold dear the values and principles that underpin Open Science, and are early adopters of the Principles of Open Scholarly Infrastructure (POSI) and the FAIR data principles.

Our largest citation index is COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, which currently indexes approximately 1.2 billion citations, released to the public domain under a CC0 waiver.

Our goal is for OpenCitations to provide open bibliographic citation information having scope, depth, accuracy and provenance that surpasses that of the commercial citation indexes, for access to which scholarly institutions presently pay enormous annual subscriptions.

I wish to mention just two of our planned developments:

OpenCitations Meta is a new database that will enable us to store in-house full bibliographic metadata about citing and cited publications. This will have two advantages: It will speed user queries, since we will no longer have to wait for responses to on-the-fly API calls to Crossref and ORCID to retrieve such metadata. More importantly, it will enable us to index the large number of references involving publications that do not have DOIs, something that for technical reasons is presently lacking.

Additionally, we plan to develop new citation indexes over other sources of open references, starting with DOCI, indexing references from DataCite, and NOCI, indexing the content of the NIH Open Citation Collection.

Our progress has until recently been severely manpower-limited. However, OpenCitations was fortunate to have been selected by SCOSS, the Global Sustainability Coalition for Open Science Services, as an open infrastructure providing a unique and valuable service, and worthy of crowdfunded financial support by the global stakeholder community of research institutes, academic libraries, funders and publishers.

As a result of the generous support so far provided or pledged by ~50 such institutions, we have already reached about one-third of our requested SCOSS budget, enabling us this year to appoint new staff to start our planned technical developments, to support our community outreach, and to help take our vision forward.

Such financial support is vital for our sustainability, since we generate no income from our provision of free data, services and software. We thus invite you too to contribute to OpenCitations.

However, we also seek community involvement in other ways: participation in the community-led governance of OpenCitations; help in developing our open source software and services; curatorial involvement to improve OpenCitations data; and collaborations with other like-minded infrastructures to develop federated access to open scholarly information of all types, thereby returning control over such information to the academic community that generated it in the first place.

If you would like to work with OpenCitations in any of these ways, please contact me.

= = =

Posted in Bibliographic references, Open Citations

Academia’s missing references

No-one is quite sure of the total number of scholarly publications within the global corpus. Indeed, that number will be strongly influenced by the degree to which, in addition to books and journal articles, one includes within the definition of scholarly publications ‘grey literature’ such as reports published by official bodies, patents, etc. Consequently, the total number of scholarly references within those publications is also unknown, and this number too will vary according to the inclusion criteria chosen. Furthermore, Crossref Event Data and similar datasets recording social media mentions of journal articles in blog posts and tweets extend the concept of a reference beyond that used in ‘conventional’ citation indexes such as COCI.

We celebrate the fact that well over one billion bibliographic citations are now openly available under CC0 waivers in NIH OCC (the National Institutes of Health Open Citation Collection) [1,2] and COCI (the OpenCitations Index of Crossref DOI-to-DOI Citations) [3]. Despite present gaps in their coverage, they include references to all the most important publications within the global corpus, because these will all have been cited multiple times.

Open references available from Crossref and other aggregators and indexes

Crossref, with over 1.6 billion open references, is the largest single source of such bibliographic metadata. Significant numbers of references are also available in a variety of other databases, repositories and indexes.

NIH OCC (the National Institutes of Health Open Citation Collection) is a merger of several citation databases, drawing on PubMed for crucial article metadata, and augmenting this with information from full-text articles that have been made freely available on the internet [1]. The CiteSeerX database, the arXiv preprint repository, and the Dryad data repository are examples of different types of infrastructure that also publish open bibliographic references, while there is further availability of article references from open aggregators such as DataCite and Wikidata. These may either use their own DOIs, DOIs from the Crossref DOI registration agency, or no DOIs at all. Either way, those references will not appear in Crossref.

What is lacking is semantic coherence and interoperability between these sources, which would permit federated queries across them. This lack makes it difficult to obtain a comprehensive overview of the availability of open bibliographic references.

However, there are even more citations that are not yet freely and easily available anywhere in bulk, relating to the reference lists within publications of a number of distinct types. This blog post explores academia’s missing references – those that have not yet been documented within open freely accessible citation indexes – and what might be done to bring these into the public domain.

1 References that are closed at Crossref

Eight years ago, I wrote:

“In this open-access age, it is a scandal that reference lists from journal articles — core elements of scholarly communication that permit the attribution of credit and integrate our independent research endeavours — are not readily and freely available for use by all scholars.” [4]

I stand by that statement, and, through OpenCitations [5], I have been working with colleagues to rectify the situation, as I described in my previous post. The Initiative for Open Citations (I4OC) can rightly be applauded for its part in encouraging almost all the major academic publishers who deposit references at Crossref to make them open.

The only major scholarly publisher not to be listed as an I4OC Participating Publisher is the Institute of Electrical and Electronics Engineers (IEEE), which, having deposited at Crossref reference lists for 58% of its preprints and publications, persists in keeping these deposited references closed and unavailable for indexing and re-use. It is to be hoped that IEEE will now realize that what it loses by not having references openly available outweighs any benefits it might have received from keeping them closed, and will join its fellow publishers in ensuring that its Crossref-deposited references are made open, both for its current issues and for its back-number journal articles. I thus call again upon IEEE to change its present position, as both Elsevier and the American Chemical Society had the courage to do recently, and to instruct Crossref to open all IEEE references. A single email to Crossref Support, with the instruction “Open all references”, is all it would take!

References in Crossref are either open, ‘limited’ or closed. Limited references are available to those subscribing to Crossref Metadata Plus, which includes OpenCitations, but not to the general public. Closed references are not freely available to anyone. The following table shows the number of Crossref works in each category, and the number of references within those categories.


| | Works | References | Average references per work |
|---|---|---|---|
| Total | 126,627,618 | | |
| Without references | 70,477,843 (55.7% of total) | | |
| With references | 56,149,775 (44.3% of total) | 1,734,831,311 | 30.9 |
| Open | 49,274,155 (38.9% of total) | 1,605,120,229 | 32.6 |
| Limited | 2,933,323 (2.3% of total) | 66,459,704 | 22.7 |
| Closed | 3,942,297 (3.1% of total) | 63,251,378 | 16.0 |

Table 1. Numbers of works and their references recorded in Crossref on 31 July 2021.

2 References not deposited at Crossref for publications with Crossref DOIs

The number of works with Crossref DOIs that lack submitted references is surprisingly large. As of 31 July 2021, 70,477,843 publications (55.7% of all works recorded at Crossref) lacked deposited references (Table 1).

Crossref classifies all types of journal content, including editorials, book reviews and letters, as “journal articles”, so some of these works without deposited references genuinely lack them. However, the majority are conventional journal articles and books with reference lists that the publishers have simply not deposited at Crossref along with the other metadata for these works.

The average number of references per Crossref work with deposited references is 30.9 (Table 1). If, to make allowance for those works that genuinely lack references, we assume a conservative average of 25 references per work for the 70,477,843 works lacking deposited references, this means that there are over 1.75 billion references within these works that have not been deposited at Crossref, and thus are not conveniently available for indexing and reuse.
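The arithmetic behind this estimate can be checked with a short Python sketch. The count of works comes from Table 1, the figure of 25 references per work is the assumption stated above, and the variable names are purely illustrative.

```python
# Works recorded at Crossref without deposited references (Table 1, 31 July 2021).
works_without_references = 70_477_843

# Assumed conservative average for these works; the observed average for
# works that do have deposited references is 30.9.
assumed_references_per_work = 25

estimated_missing_references = works_without_references * assumed_references_per_work

print(f"{estimated_missing_references:,}")                   # 1,761,946,075
print(f"{estimated_missing_references / 1e9:.2f} billion")   # 1.76 billion
```

The result is sensitive to the assumed average: each reference added to or removed from that assumption shifts the estimate by about 70 million.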

These missing references relate both to large publishers that have submitted references for only some of their publications, and to small publishers that perhaps lack, or think they lack, the resources to deposit any reference lists in addition to the other metadata they are already sending to Crossref for each of their DOIs. However, there are several easy methods for depositing reference lists, as detailed in Crossref’s documentation. So I encourage all publishers who are supportive of Open Science to update their procedures and commence or complete the deposition of their publication reference lists at Crossref, starting with their current issues. Crossref Support will provide assistance if required. Note that a publisher does not have to subscribe to the Crossref Cited-by service to deposit its references!

3 Citations missing in COCI: open references in Crossref to publications lacking DOIs

COCI is the OpenCitations Index of Crossref DOI-to-DOI Citations, and, as the name suggests, it indexes Crossref open references from works with Crossref DOIs to other works that have DOIs [3]. It therefore does not index open references in Crossref to works that, for whatever reason, lack DOIs.

The most recent (September 2021) release of COCI, based on the August Crossref dump, contains 1,186,958,898 citations between 69,074,291 unique works, comprising 51,103,720 citing bibliographic resources bearing Crossref DOIs and 56,105,783 cited bibliographic resources. Of the cited bibliographic resources, 38,135,212 bear a DOI issued by Crossref and have open or limited references (thus also being COCI citing resources), while 17,970,571 either have a Crossref DOI but lack open or limited references or have a DOI issued by another DOI registration agency such as DataCite (thus not being COCI citing resources).

Note that in Crossref, the ratio of works with open or limited references to works without open or limited references is 0.7:1 (Table 1). However, in COCI, the ratio of cited works with Crossref DOIs containing open or limited references to all other cited works is 2.1:1. Thus works with Crossref DOIs containing open or limited references are three times more likely to be cited than works that either have a Crossref DOI but lack open or limited references or have a DOI issued by another DOI registration agency. This is most likely because the most important journals from almost all the larger publishers now have open references. However, it is still a remarkable ratio.
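For readers who wish to reproduce these ratios, the following Python sketch derives them from the figures in Table 1 and from the COCI release statistics quoted above. It assumes that "works without open or limited references" means closed works plus works with no deposited references at all; the variable names are illustrative only.

```python
# Works by reference status in Crossref (Table 1, 31 July 2021).
crossref_open = 49_274_155
crossref_limited = 2_933_323
crossref_closed = 3_942_297
crossref_without_references = 70_477_843

# Cited works in COCI (September 2021 release).
coci_cited_with_open_or_limited = 38_135_212
coci_cited_other = 17_970_571

# Assumption: "without open or limited references" = closed + no references.
with_open_or_limited = crossref_open + crossref_limited
without_open_or_limited = crossref_closed + crossref_without_references

crossref_ratio = with_open_or_limited / without_open_or_limited
coci_ratio = coci_cited_with_open_or_limited / coci_cited_other

print(f"Crossref ratio: {crossref_ratio:.1f}:1")                 # Crossref ratio: 0.7:1
print(f"COCI ratio: {coci_ratio:.1f}:1")                         # COCI ratio: 2.1:1
print(f"Relative likelihood: {coci_ratio / crossref_ratio:.1f}")  # Relative likelihood: 3.0
```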

Because references to works lacking DOIs are not included in COCI, the average number of bibliographic references per citing article in COCI is only 23.2, in contrast to the numbers of references to works of all types given in Table 1.

From these data, it can be seen that there are over 480 million open Crossref references to a wide variety of works lacking DOIs that OpenCitations does not index in COCI. This is because of an intentional and fundamental limitation in the structure of the Open Citation Identifier (OCI) [6], requiring both citing and cited publications to have identifiers of the same type, that lies at the heart of the functionality of our OpenCitations Indexes.
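The two figures just quoted can be approximated with a few lines of Python. This is a sketch under stated assumptions: it treats the references available to COCI as the open plus limited references in Table 1 (COCI indexes both), and it compares a 31 July Crossref snapshot with a COCI release built from the August dump, so the difference is indicative rather than exact.

```python
# COCI, September 2021 release.
coci_citations = 1_186_958_898
coci_citing_works = 51_103_720

# References in Crossref by status (Table 1, 31 July 2021).
crossref_open_references = 1_605_120_229
crossref_limited_references = 66_459_704

# Average number of DOI-to-DOI references per citing work in COCI.
average_per_citing_work = coci_citations / coci_citing_works

# Assumption: references usable by COCI = open + limited references.
usable_references = crossref_open_references + crossref_limited_references
not_indexed = usable_references - coci_citations

print(f"{average_per_citing_work:.1f}")  # 23.2
print(f"{not_indexed:,}")                # 484,621,035
```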

OpenCitations is currently developing a solution that, without compromising that intentional design limitation in OCIs, will nevertheless permit us to index and publish these ‘missing’ references as Linked Open Data. We will report on this development in due course.

Crossref Event Data is a Crossref service and database that records mentions of publications bearing Crossref DOIs in social media, including tweets and blog posts, and in other non-traditional citation sources such as news items and Wikipedia articles. From today (Thursday 23rd September 2021), Crossref Event Data will start to include in its holdings open references from publications bearing Crossref DOIs to other publications bearing DOIs. Limited and closed references will not be included. Initially, open references from current publications will be included in Crossref Event Data, with open references from older works with DOIs being added later. In that respect it will come to resemble COCI, except that COCI also includes ‘limited’ references, treats citations as first-class data entities with their own identifiers, and makes its citation data available in RDF as Linked Open Data, as well as via a REST API. Subsequently, Crossref Event Data will also record references to publications with other forms of identifier, as OpenCitations also plans to do.

4 References available on publishers’ web sites

Many publishers, particularly those of Open Access works, already make their publication reference lists openly available on their own web sites. While this is commendable, it is not sufficient, since, if scholarly references are not made available in a centralized aggregator such as Crossref from which they can be conveniently harvested in bulk for analysis and re-use, they are much more difficult to access.

Scraping references from the HTML of individual web sites is difficult, time-consuming and liable to be incomplete. Microsoft Academic achieved considerable success in scraping references from publishers’ web sites, which was possible because of the special relationships these publishers have with the Microsoft search engine Bing. Unfortunately, this service will soon no longer be available, illustrating a problem inherent in scholarly infrastructures provided by commercial companies that do not adopt the Principles of Open Scholarly Infrastructure.

A consequence is that such publications will become increasingly ‘invisible’, as bibliographic and analytical services come to rely more and more on centrally available data.

6 References within PDFs of scholarly works lacking DOIs

There is a large but unknown quantity of reference-containing books, academic reports, patents and journal articles from publishers that, for their own good reasons, choose not to use DOIs. The text of many of these publications is already available in a marked-up machine-readable format such as JATS, used in preparation for publication, from which the reference lists could easily be extracted. Other publications are only available as PDFs, either as published Versions of Record or as preprints deposited in a variety of preprint repositories such as arXiv or CORE. Mining reference lists out of PDFs requires expertise in text mining and AI technologies, and is labour-intensive, since it usually requires the tuning of extraction algorithms to handle the particular styles and formatting of individual journals, one at a time. Two stages are involved: first, the recognition and extraction of the text of the individual references from the PDF, and second, the parsing of each text string into the component parts of the reference (author names, title, publication year, etc.). A considerable number of the citations within NIH-OCC have been obtained in this manner [1]; commercial companies such as Lexical Intelligence specialize in this area; and software such as GROBID is publicly available for the purpose. However, the overall task of extracting ‘missing’ academic references from the global PDF corpus is daunting in magnitude and would require a well-funded organization.
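To give a feel for the second stage only, here is a deliberately naive Python sketch that parses one reference string into its parts. The regular expression is a hypothetical simplification fitted to the "Authors (Year). Title. Rest" style used in this post's own reference list; it is not how GROBID or any production tool works, and it would need retuning for every other citation style, which is exactly the labour described above.

```python
import re

# Illustrative pattern for references shaped like "Authors (Year). Title. Rest".
# This is an assumption for demonstration, not a general-purpose parser.
REFERENCE_PATTERN = re.compile(
    r"^(?:\[\d+\]\s*)?"          # optional label such as "[4]"
    r"(?P<authors>.+?)\s*"       # author names, up to the year
    r"\((?P<year>\d{4})\)\.\s*"  # publication year in parentheses
    r"(?P<title>.+?)\.\s+"       # title, up to the first full stop and space
    r"(?P<rest>.*)$"             # venue, volume, pages, identifier
)
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def parse_reference(text):
    """Split one reference string into fields, or return None if it does not match."""
    match = REFERENCE_PATTERN.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    doi = DOI_PATTERN.search(fields["rest"])
    fields["doi"] = doi.group(0) if doi else None
    return fields


example = (
    "[4] David Shotton (2013). Open citations. Nature, 502 (7471): 295-297. "
    "http://dx.doi.org/10.1038/502295a"
)
parsed = parse_reference(example)
print(parsed["authors"], parsed["year"], parsed["title"], parsed["doi"])
```

A title that itself contains a full stop followed by a space would be cut short by this pattern, one small illustration of why robust extraction calls for trained models rather than hand-written rules.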

The correct way to proceed would be for each publisher to take responsibility for liberating the references of their own publications, whether or not the publications themselves are open access, and whether these references are already available in a marked-up machine-readable format or only within PDF documents. Then, if the publisher still chose neither to use DOIs nor to submit these metadata to Crossref, these references could be submitted directly to OpenCitations for aggregation and publication as Linked Open Data.

Conclusion

From the foregoing discussion it is clear that the academic community has a long way to go before the majority of scholarly citations, the products of their own labours, are openly available for analysis and re-use. We at OpenCitations are working to address these issues and to publish more of these missing citations. However, completion of the task will require a coordinated collaborative international effort.

Are you willing to be involved?

References

[1] B. Ian Hutchins et al. (2019). The NIH Open Citation Collection: A public access, broad coverage resource. PLoS Biol. 17 (10): e3000385. https://doi.org/10.1371/journal.pbio.3000385

[2] B. Ian Hutchins (2021). A tipping point for open citation data. Quantitative Science Studies 2 (2): 433–437. https://doi.org/10.1162/qss_c_00138

[3] Ivan Heibi, Silvio Peroni, David Shotton (2019). Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. Scientometrics 121 (2): 1213-1228. https://doi.org/10.1007/s11192-019-03217-6

[4] David Shotton (2013). Open citations. Nature, 502 (7471): 295-297. http://dx.doi.org/10.1038/502295a

[5] Silvio Peroni, David Shotton (2020). OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies, 1(1): 428-444. https://doi.org/10.1162/qss_a_00023

[6] Silvio Peroni, David Shotton (2019). Open Citation Identifier: Definition. Figshare. https://doi.org/10.6084/m9.figshare.7127816

Posted in Bibliographic references, Citations as First-Class Data Entities, Data publication, open access, Open Citations

Save the dates: OpenCitations’ September events 

We are happy to announce OpenCitations’ participation in a number of online conferences and events during the next few weeks. Our directors Silvio Peroni and David Shotton will be speaking at the Open Science Fair 2021, the OASPA Conference 2021 and Open Access Tage.  

Open Science Fair 2021 (20-23 September) is an event organized by OpenAIRE, in collaboration with some key international initiatives in the area of Open Science: COAR, EIFL, Force11, LA Referencia, LIBER, OPERAS, SPARC and SPARC Europe. As at a real fair, visitors can explore virtual pavilions and participate in various Keynote Talks, Parallel Sessions and Workshops dedicated to Open Science. Silvio Peroni will give two talks on Tuesday 21 September:

  • In the Lightning Talk, “ScholeXplorer and OpenCitations as the new frontier of open citation indexing” (11:30 CEST), coauthored with Paolo Manghi (OpenAIRE), Alessia Bardi (CNR-ISTI) and Sandro La Bruzzo (CNR-ISTI), Silvio will present ScholeXplorer and OpenCitations, two of the services included in the MONITOR portfolio of the OpenAIRE-Nexus project. More information and registrations at: https://www.opensciencefair.eu/2021/lightning-talks/scholexplorer-and-opencitations-as-the-new-frontier-of-open-citation-indexing
  • The Workshop “The perils of being invisible. Collective funding models for Open Science infrastructure” (16:30-18:00 CEST) “will help identify the main challenges of collective funding models for Open Science Infrastructure, as well as explore the path forward to make them more efficient”. Silvio Peroni, Niels Stern (DOAB/OAPEN), James MacGregor (PKP), Agata Morka (SPARC Europe/SCOSS), Jon Treadway (the Great North Wood Consulting), Jean-Francois Lutz (University of Lorraine) and Vanessa Proudman (SPARC Europe) will reflect on the evanescence of Open Science Infrastructure (OSI) in library budget considerations. The speakers will also promote interaction with other workshop participants in order to create a collective dialogue. You can register for the event here: https://www.opensciencefair.eu/2021/workshops/the-perils-of-being-invisible

The OASPA Conference 2021 (21-23 September), entitled “Designing 21st Century Knowledge Sharing Systems”, will be dedicated to “many timely and fundamental topics relating to open scholarly communication”, including  “the ongoing impact of the pandemic”.  David Shotton will take part in the Poster Lightning Talks Session 3 (Thursday 23, 1-2 pm BST), with the title “OpenCitations – what does the future hold?”, a reflection on OpenCitations’ values, data, services, achievements so far, and plans for the future. For further information and registration: https://oaspa.org/conference/  

Silvio Peroni, together with James MacGregor (Public Knowledge Project) and Niels Stern (OAPEN), will hold the Workshop “How Open Infrastructure Benefits Libraries?” (September 27, 11:30-13:00 CEST) as part of the Open Access Tage 2021 (27-29 September), an annual event dedicated to Open Access initiatives and community. During the workshop, the speakers will investigate the social and economic value of open infrastructures for libraries. For more information and to register for the event: https://oat21.sched.com/event/kdFg/workshop-2-how-open-infrastructure-benefits-libraries

We thank the organizers of these prestigious international events for having invited OpenCitations to participate. Open Science resounds and grows through such community-centered initiatives.

If you wish to learn more about Open Science, ongoing Open Access initiatives, and OpenCitations’ commitment to and activities within these areas, don’t miss the opportunity to participate in these on-line conferences … see you there! 

Posted in open access, Open Citations, Open scholarship, Open Science

California Digital Library invests in OpenCitations

OpenCitations is excited to announce that the California Digital Library (CDL) has joined our growing list of contributors.

CDL’s commitment to sustainable open scholarship has great value for the global scholarly community.  Through its investments and partnerships, CDL aims to create an international academic and librarian dialogue, trusting in the idea that “the university, its scholars and its libraries thrive when we transcend organizational boundaries and commit ourselves to shared investments”.

CDL’s contribution will generously support OpenCitations throughout 2021-2023. CDL funding in the fiscal year 2020-2021 also includes two other SCOSS-endorsed infrastructures, OAPEN and DOAB, the non-profit organization Open Access Switchboard, and the services PsyArXiv and SCOAP3 Books. As can be read in the recent post by Ellen Finnie, this investment reflects CDL’s “commitment to ‘invest in open’ by allocating a portion of our collections funding to the development of open content and infrastructure in support of UC scholarship and teaching”.

The OpenCitations team is grateful to be included in CDL’s ongoing investment in open infrastructure. Thank you!

Posted in Open Citations, Open scholarship, Open Science