The OpenCitations Roadmap is now publicly available on Trello

Want to keep yourself updated about the ongoing activities of OpenCitations? We have now publicly released the OpenCitations Roadmap, available on Trello.com:

https://trello.com/b/RprHYoKL/opencitations

The OpenCitations Roadmap consists of a board filled with colour-labelled cards which present the goals reached so far, the current projects and activities, and the future plans. By clicking on a card, you can view a description of the activity, its progress state, and who in the OpenCitations team is working on it.

The OpenCitations Roadmap covers all kinds of activities divided according to the scope, identified by the coloured labels, in particular:

  • light blue for the technical development, such as the development of the software for the creation of the new database OpenCitations Meta and of DOCI, the OpenCitations Index of DataCite open DOI-to-DOI citations, and the re-engineering of the infrastructure and the website;
  • green for the data model implementation;
  • yellow for the data development, such as the bi-monthly COCI releases;
  • purple for the events and outreach activities.

The cards also highlight the activities related to the two EC-funded projects OpenCitations is involved in, OpenAIRE Nexus (blue label) and RISIS2 (orange label). We thank the OpenAIRE team for the help and suggestions during the Roadmap review process.

The OpenCitations Roadmap is an open work in progress that will reflect the developments and growth of OpenCitations. At OpenCitations, we don’t want this Roadmap to be just an online ‘showcase’, but a space in which to share ideas and opinions. We invite you – the members of our community, our stakeholders, the other Open Science actors, researchers, and librarians, and anyone who is interested in OpenCitations activities – to add a comment or a question in the ‘Leave feedback’ card. This will help us to better understand our strong and weak points, and to stay in touch with the needs and thoughts of the community.

In this way, supplementing the conventional communications channels of email and the social platforms (our blog, Twitter, LinkedIn), the OpenCitations Roadmap will become a new virtual place for dialogue, where you can directly contribute to improving OpenCitations.

Posted in Information visualization, Open Citations, Open Science, Web interface design

OpenCitations and EC funding: OpenAIRE Nexus and RISIS2

The incentives for new OpenCitations innovative solutions

Two years ago, in their canonical 2020 QSS paper on OpenCitations, Silvio Peroni and David Shotton anticipated the creation of the new database, OpenCitations Meta, able to “offer a faster and richer service” by storing bibliographic metadata “in house”. Meta would “avoid duplication of data by efficiently permitting us to keep […] a single copy of the metadata for each of the bibliographic entities involved as citing or cited entities in the different OpenCitations’ citation indexes”, would remove the requirement for potentially slow API calls to external metadata sources such as Crossref and ORCID, and would enable us to index citations involving entities lacking DOIs.

Important synergies to achieve goals

Today, thanks to the recent involvement of OpenCitations in two EC-funded projects, the OpenAIRE-Nexus Project (Horizon 2020 EU funded project, GA: 101017452) and the RISIS2 Project (Horizon 2020 EU funded project, GA: 824091), the development of OpenCitations Meta has commenced, with a planned release date later in 2022.

The OpenAIRE-Nexus project started in January 2021 to embrace and expand the operation of thirteen services, provided by the OpenAIRE infrastructure, public institutions, organisations and universities, and classified into three portfolios entitled PUBLISH, MONITOR, and DISCOVER. The OpenAIRE-Nexus portfolios focus on the demands of the three main categories of the research lifecycle. OpenAIRE-Nexus therefore makes sure that these services are integrated to provide a uniform Open Science Scholarly Communication package for the European Open Science Cloud (EOSC). Within the OpenAIRE-Nexus project there is scope for producing not only support materials (factsheets, guides, video tutorials, demos) but also training sessions where the services in the three portfolios will be showcased, anticipating the EOSC onboarding process. The role of OpenCitations in the project is to provide open bibliographic citations, and to interconnect and integrate its functionalities (and vice versa) with the OpenAIRE Research Graph, the core component of the OpenAIRE infrastructure and services and of the EOSC Resource Catalogue, and with other OpenAIRE-Nexus services such as EpiSciences and OpenAIRE MONITOR.

Additionally, we are happy to announce our recent involvement in the RISIS2 Project. The Research Infrastructure for Science and Innovation Policy Studies (RISIS) is a project funded by the European Union under the Horizon 2020 Research and Innovation Programme. RISIS2 involves 18 partners working together to create and maintain a research infrastructure for the field of Science, Technology, and Innovation (STI) Studies, and to build an advanced research community in this field. OpenCitations’ contributions to RISIS2 will include not only the creation of OpenCitations Meta but also the development of a new citation index of open references, the OpenCitations Index of DataCite Open Citations (DOCI), which will be based on the open reference holdings of DataCite and, together with COCI, will be cross-searchable through our unified OpenCitations API.

Lessons learnt so far

A year into the OpenAIRE-Nexus project, we have found that one of the most significant benefits for OpenCitations is our involvement with this wide cooperative network of European research infrastructures, services, and communities, within which we can exchange experiences, ideas, and knowledge, and discuss any challenges and outcomes with our colleagues. More importantly, OpenCitations is now positioned within the Open Science ecosystem as a valuable, innovative infrastructure with strong proof of integration and interoperable operations. Being part of the OpenAIRE-Nexus team has opened up more future challenges and expectations, and raised the bar for the inclusion of more functionalities of value. Thanks to the dedication of its efficient communication team, OpenAIRE is also helping us by communicating OpenCitations services to additional users and stakeholders, by including us within the comprehensive OpenAIRE services catalogue, by releasing an OpenCitations factsheet, and by permitting us to present the latest information on OpenCitations through established events (e.g. Open Science FAIR 2022). FAIR and openness of information is our motto, and we strongly promote this through all our activities.

Expanding our team

As announced in our previous blog post “Five reasons why 2021 has been a great year for OpenCitations”, the support we receive from the EU as part of OpenAIRE-Nexus has enabled our recent appointment of Arcangelo Massari, a software developer who is now playing a crucial role in the creation and development of OpenCitations Meta.

As the year 2022 progresses, we look forward to bringing you further information about other new goals for OpenCitations, made possible by the support we receive from our numerous partnerships.

Posted in open access, Open Citations, Open scholarship, Open Science

Five reasons why 2021 has been a great year for OpenCitations

2021 is just behind us. Since January is “the Monday of the months”, as F. Scott Fitzgerald once wrote[1], it’s a good time to take stock of what happened at OpenCitations during the past year.

Among the numerous events, achievements and challenges that 2021 brought with it, we want to highlight five milestones which make us proud to look back:

1. We extended our coverage to well over one billion citations

During 2021, OpenCitations’ largest index COCI (the OpenCitations Index of Crossref open DOI-to-DOI citations) was able to include for the first time the citation links involving references that had been opened at Crossref by Elsevier and the American Chemical Society, thereby greatly expanding its coverage. The last release of COCI (November 2021) is based on open references to works with DOIs within the Crossref dump dated October 2021, and, as a result, COCI now contains information on more than 1.23 billion citations involving almost 70 million publications.
A recent analysis by Alberto Martín-Martín (Facultad de Comunicación y Documentación, Universidad de Granada, Spain), published on the OpenCitations Blog in October, shows that the citation coverage provided by OpenCitations is approaching parity with that of the leading commercial citation indexes, Web of Science and Scopus, offering a viable alternative upon which to base open and reproducible metrics of academic performance.

2. OpenCitations team grew

Last summer, we appointed Claudio Fabbri as our Administrator and Research Manager to take responsibility for the day-to-day administrative and financial activities of OpenCitations; Chiara Di Giambattista as Communications Director and Community Development Manager to take care of all communications and community interactions made on behalf of OpenCitations; and Giuseppe Grieco as our new Software and Systems Developer to take charge of technical development related to the OpenCitations services.

Thanks to the support from the OpenAIRE Nexus project, the team has also recently welcomed Arcangelo Massari as our new Software Developer to take care of the development of the new database OpenCitations Meta. We anticipate further appointments during 2022!

Our International Advisory Board met in November, and we thank its members for the valuable advice they provided. The Board will meet again later this month.

3. We participated in many international meetings

During the past year, OpenCitations’ directors Silvio Peroni and David Shotton took part in numerous international conferences, webinars and workshops, including the LIBER Annual Conference 2021, the OS Fair 2021, OASPA 2021 and FORCE2021. These provided excellent opportunities to describe and promote OpenCitations, to reach out to new potential stakeholders, and to discuss with other experts the main themes of our activities and plans as they relate to Open Science.

The year ended with a bang, with the announcement during the closing session of FORCE2021 that the 2021 Open Publishing Award for Open Data had been awarded to OpenCitations.

4. We received a world of support

In 2021, thanks to our involvement in the SCOSS funding campaign and to our commitment to reaching out to the libraries and universities potentially interested in OpenCitations, we gathered a wide international community of stakeholders and supporters around us. We are deeply thankful to the 6 consortia and 56 institutions across the globe which are now supporting us financially, thus making it possible for us to enhance our services and expand our team. You can find the full list of our supporters on the OpenCitations website and in our recent Thank You video.

Additionally, in January 2021, we started our involvement in the EC-funded OpenAIRE Nexus project, bringing us into closer collaboration with our European colleagues and infrastructures, including OpenAIRE. The main aim of the project is to create a framework of services for assisting in publishing research, monitoring its impact, helping promote its discovery, and integrating it into the European Open Science Cloud (EOSC) “for the benefit of the open science community worldwide”. In OpenCitations, we’re thrilled to be part of this collaborative project by providing open bibliographic citations as part of the open data components of OpenAIRE and the EOSC.

5. We set the stage for future developments

Thanks to the research grants and the support and endorsement we have received from the international scholarly community, we are now working on a variety of new services, thus setting our goals for the coming years. In particular, we want to enhance OpenCitations partnerships and dialogue with the scholarly community; to collaborate with colleagues to develop new services that will expand our citation coverage, including new OpenCitations indexes of NIH-OCC, of DataCite and of other sources of open references, that will all be searchable through a single API; and to create OpenCitations Meta, our new database that will hold comprehensive bibliographic metadata of the publications involved in our indexes citations, thereby enabling faster query responses and the ability to host citations involving publications lacking DOIs[2].

[1] F. Scott Fitzgerald (2002). The beautiful and damned (page 50 in the original 1922 edition); United Kingdom: Dover Publications. https://www.google.it/books/edition/The_Beautiful_and_Damned/-tUoAwAAQBAJ?hl=en&gbpv=0

[2] Silvio Peroni, David Shotton; OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies 2020; 1 (1): 428–444. doi: https://doi.org/10.1162/qss_a_00023

Posted in Data publication, Open Citations, Open scholarship, Open Science

Open citations in Informatics: current status and lines of research

This post was first published on QWERTY: musings from the rabbit hole, a blog by Silvio Peroni

A few months ago, I was invited to give a talk at the European Computer Science Symposium on an aspect of my research I particularly care about, that of open citations. In the presentation I addressed the current status of open citation availability in a particular domain, Informatics, by using two open datasets: DBLP, for gathering bibliographic metadata about relevant publications, and OpenCitations’ COCI, for identifying the citations in which such publications are involved. This post briefly introduces the preliminaries and the results obtained from the material used to prepare the talk.

Open citations and where to find them

A citation is a conceptual directional link between a citing entity and a cited entity which is defined by means of specific textual devices contained in the text of the citing entity, e.g. a bibliographic reference denoted by an in-text reference pointer (e.g. “[3]” or “(Doe et al, 2021)”). While reasons for citing may vary, citations are used in academia for acknowledging others’ work and enabling building trails of relations defining how science evolves in time.

The data needed to describe a citation should include, at least, a representation of such a conceptual link and the basic bibliographic metadata to identify the citing and cited entities, i.e. those typically used for defining bibliographic references such as authors’ names, year of publication, the title of the work, venue of publication, pages, identifiers, etc. We say that a citation is open when these citation data are in the public domain and can be retrieved freely (via the HTTP protocol) in a structured and machine-readable format (e.g. JSON or RDF) without accessing the source of the citing article defining it, which, potentially, could be behind a paywall.

OpenCitations [full disclosure: I am one of its directors] is one of the founders of the Initiative for Open Citations (I4OC) and one of the open scholarly infrastructures providing open citation data through several channels (REST APIs, SPARQL endpoints, Web interfaces, full dumps in different formats). As of 31 December 2021, it makes available more than 1.2 billion open DOI-to-DOI citation links between more than 69.5 million bibliographic resources, which are mainly journal articles but also include books, book chapters, datasets, and other DOI-identified resources. The entities involved in such citations come from different domains, spanning from Medicine articles to Humanities publications, and have recently approached parity with those included in well-known proprietary services such as Web of Science and Scopus.

What about Informatics

Such a huge mass of available open citations enables us to analyse citation coverage in different scholarly disciplines, e.g. to understand which publishers contributed to the availability of open citation data in a discipline and to check what the citation trails between different disciplines are. However, to compute such citation coverage, we need some information that allows us to identify when a particular bibliographic resource involved in a citation belongs to the discipline we want to analyse. We can use information about the subject categories of publications (e.g. that of Web of Science), if included in citation indexes, to identify the discipline(s) of a given bibliographic resource. Unfortunately, OpenCitations does not provide this information and, as such, we need to rely on external repositories for gathering subject categories of publications, e.g. collections of bibliographic metadata of disciplinary publications.

In the context of Informatics, there is at least one well-known resource gathering and exposing bibliographic metadata of a large part of Computer Science publications, i.e. DBLP. As of 30 December 2021, DBLP contains more than 5.9 million publications published in 1,781 journals and in the proceedings of 5,621 conferences, involving more than 2.9 million authors that are manually curated (and disambiguated) by the DBLP team.

DBLP can be used as a proxy to understand if a particular publication belongs to the Computer Science subject category. Through it, it is possible to understand how many citations in OpenCitations involve Computer Science publications by comparing the DOIs of citing and cited entities with those available in DBLP. In particular, using the OpenCitations’ COCI September 2021 dump and the DBLP October dump, I found that more than 80 million citations in COCI involved at least one of the 4,637,865 entities in DBLP (considering only journal articles, conference proceedings papers, books and book chapters). As shown in Figure 1, only 39% of these citations are between citing and cited entities both included in DBLP, while the rest of them either come from or go to publications not listed in DBLP, which may therefore not be Computer Science publications.
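
The DOI comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the study: the function name `classify_citations`, the made-up DOIs, and the representation of citations as in-memory (citing, cited) pairs are all assumptions introduced for the example.

```python
from collections import Counter


def classify_citations(citations, dblp_dois):
    """Count citations by which side (citing, cited, or both) is in DBLP.

    `citations` is an iterable of (citing_doi, cited_doi) pairs and
    `dblp_dois` a collection of DOIs taken from DBLP. Both inputs are
    hypothetical; DOIs are compared case-insensitively.
    """
    dblp = {doi.lower() for doi in dblp_dois}
    counts = Counter()
    for citing, cited in citations:
        citing_in = citing.lower() in dblp
        cited_in = cited.lower() in dblp
        if citing_in and cited_in:
            counts["both"] += 1
        elif citing_in:
            counts["citing_only"] += 1
        elif cited_in:
            counts["cited_only"] += 1
        # Citations with neither side in DBLP are not counted at all.
    return counts


# Toy data with made-up DOIs, for illustration only.
dblp_dois = {"10.1000/a", "10.1000/b"}
citations = [
    ("10.1000/a", "10.1000/b"),  # both entities in DBLP
    ("10.1000/A", "10.9999/x"),  # only the citing entity in DBLP
    ("10.9999/y", "10.1000/b"),  # only the cited entity in DBLP
    ("10.9999/y", "10.9999/x"),  # neither entity in DBLP
]
counts = classify_citations(citations, dblp_dois)
share_both = counts["both"] / sum(counts.values())
```

With real dumps the set of DBLP DOIs would be built once and the COCI citations streamed from disk, but the classification logic would stay the same.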

Figure 1. A Venn diagram showing how many citations involving Computer Science publications (obtained from DBLP) are included in OpenCitations.

Additional information about the publishers of such DBLP entities, retrieved by querying the Crossref API and the DataCite API with the entities’ DOIs, is shown in Table 1. IEEE is the publisher with the largest number of entities among those considered for this study, and its entities are involved in more than 18.9 million incoming and 21.5 million outgoing citations. The other large publishers, in terms of entities and citations, are Springer, Elsevier, ACM and Wiley. It is worth mentioning that the two publishers responsible for publishing mainly Computer Science journals and a relatively low number of conference proceedings (if any), i.e. Elsevier and Wiley, are those providing the highest number of openly-available references per publication (on average, around 29 and 37 cited works for each publication respectively).

Publisher    DBLP entities    COCI incoming citations    COCI outgoing citations
IEEE         1,730,485        18,930,055                 21,582,093
Springer     1,012,534        18,482,132                 11,179,566
Elsevier     574,860          15,536,207                 17,019,716
ACM          433,188          3,695,255                  6,050,342
Wiley        89,662           3,350,183                  3,357,065
Table 1. The DBLP entities retrieved in the study grouped by their publisher and their incoming and outgoing citations according to COCI.
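
The per-publication averages mentioned in the text can be checked directly from the figures in Table 1. The short Python sketch below divides each publisher's outgoing citations by its number of DBLP entities; the variable names are chosen for this example only.

```python
# Figures copied from Table 1: (DBLP entities, COCI outgoing citations).
table_1 = {
    "IEEE": (1_730_485, 21_582_093),
    "Springer": (1_012_534, 11_179_566),
    "Elsevier": (574_860, 17_019_716),
    "ACM": (433_188, 6_050_342),
    "Wiley": (89_662, 3_357_065),
}

# Average number of open outgoing references per DBLP entity.
avg_outgoing = {
    publisher: outgoing / entities
    for publisher, (entities, outgoing) in table_1.items()
}

# Publishers sorted from the highest to the lowest average.
ranking = sorted(avg_outgoing, key=avg_outgoing.get, reverse=True)

for publisher in ranking:
    print(f"{publisher}: {avg_outgoing[publisher]:.1f}")
```

Elsevier comes out at about 29.6 and Wiley at about 37.4 references per publication, consistent with the rounded values of 29 and 37 given above, while the other three publishers stay below 14.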

Future developments

Of course, this study does not provide full coverage of open citations in Computer Science but just a preliminary insight. First, as noted above, DBLP does not completely cover all CS-related publications, since some venues are not listed there (yet). Thus, relevant open citations in COCI are missed when both the citing and the cited entity are CS publications that are not in DBLP. However, it is worth mentioning that no bibliographic and citation database (including commercial and proprietary ones) has full disciplinary coverage anyway, and DBLP is, probably, the most comprehensive collection of Computer Science publication metadata (something that could be assessed in future analysis).

Along the same lines, the index of open citations used, i.e. COCI, does not contain all the citations defined in CS publications, but only DOI-to-DOI citations as retrievable from Crossref data. Although Crossref is the biggest DOI provider and it is used by the majority of the big publishers, citations defined in publications with a non-Crossref DOI (e.g. DataCite) and those not having any DOI assigned (e.g. the papers published in CEUR Workshop Proceedings) are not included in COCI and, consequently, have not been used in the analysis. However, OpenCitations plans to extend its data coverage by adding more sources in the coming years. Thus, it would be interesting to replicate the same analysis in the future to see if and how much the coverage increases, at least in the context of Computer Science publications.

Still on the subject of coverage, currently (i.e. as of 31 December 2021) the only publisher among those included in Table 1 that is not providing open references through Crossref is IEEE. Indeed, while COCI includes several citations involving IEEE publications as citing entities, no such citations are available after October 2018, when IEEE decided to stop allowing Crossref Metadata Plus users to access these reference data.

Finally, analysing the preliminary results of this study, it would be interesting to understand which are the main subject categories of non-DBLP publications included in the 61% of citations shown in Figure 1 (e.g. by using the Scimago Journal and Country Rank database to retrieve their subject categories) to understand what are the citation dynamics between Informatics and other disciplines. However, I will leave the answer to this question to future analysis.

A final remark on reproducibility

Since several of the suggestions provided above start from the idea of either replicating or extending this study with additional materials and insights, it is important that all data and software used to perform the analysis are available online to permit its reproducibility. To this end, I have published online, with open licenses, both the software and the data retrieved, to enable anyone to reuse them freely for any purpose.

Posted in Open Citations, Open Science

OpenCitations receives the Open Publishing Award in Open Data

“What role does ‘open’ play in making this project special?”

This apparently easy, but not banal, question was asked in the Open Publishing Awards nomination form, and at OpenCitations we prefaced our answer to it by stating “For OpenCitations, ‘open’ is the crucial value and the final purpose.” We consider the free availability of bibliographic citation data to be a necessary condition for the establishment of an open knowledge graph, and believe that having citations open helps achieve a more transparent, accessible and comprehensive research practice.

Since 2019, the Open Publishing Awards, founded and organized by the Coko Foundation and sponsored by OASPA, Crossref and Cloud68.Co, “celebrate software and content in publishing that use open licenses but also, importantly, provide a chance to reflect on the strategic value of openness”. The award judges considered open access projects divided into five categories: Open Publishing Lifetime Contribution, Open Content, Open Publishing Models, Open Source Software and Open Data.

It is in this final Open Data category of the Open Publishing Awards that OpenCitations was selected, as an infrastructure that perfectly represents the open principles, from among the few semantic web and linked open data initiatives currently available in the scholarly communication landscape. The award was announced in the Open Publishing Awards Ceremony, during the closing session of the FORCE2021 conference “Joining Forces to Advance the Future of Research Communication” (7-9 December). You can learn more about the Awards and the other projects selected here: https://openpublishingawards.org/results/2021/index.html

The greatest honour for OpenCitations was receiving the following comment given on behalf of the jury panel, which included open source and scholarly communication experts:

“At the time of writing this review, the largest database provided by OpenCitations contains more than 1.23 billion citations. Compiling this database in a license-friendly way is a feat on its own, but combine that with OpenCitations’ persistence (established 11 years ago), their active and consistent involvement with the community, and the number of works that were made possible by their effort (Google Scholar lists 1440 results), it is clear that OpenCitations is one of the fundamental projects in open publishing, specifically in open scientific publishing”.

We are proud and humbled to count the Open Publishing Award in Open Data among the acknowledgements so far received by OpenCitations. Despite the term “award”, the Open Publishing Awards, in fact, don’t aim to proclaim winners, but rather to “shake the hands” of some projects which seem to be following (and tracing) the right path towards more open knowledge. All the projects awarded help to define more concretely what “open” means, and at the same time their example encourages awareness of the variety of open publishing projects, and reflection on the common values and goals that bring together so many different people, institutions and organizations.

Recognizing the commitment to the openness of knowledge and research of the not-for-profit and collaborative projects like OpenCitations is about community, not competition.

As Silvio recently stated:

“OpenCitations is a plural. Together, we are OpenCitations.”

Posted in Data publication, open access, Open Citations, Open Science

Performing live time-traversal queries on RDF datasets

Guest post by Arcangelo Massari, University of Bologna

In this post, Arcangelo Massari, who recently graduated in Digital Humanities and Digital Knowledge under Professor Silvio Peroni at the University of Bologna, shares the results of his master’s thesis.

A particular problem in information retrieval is that of obtaining data from an evolving dataset, independent of the time at which that item of data was added, changed or removed. To permit such time-independent queries to be performed over evolving RDF datasets, I have developed two new pieces of open source software, time-agnostic-library [1] and time-agnostic-browser [2], that are now available from the OpenCitations GitHub repository.

The time-agnostic-library is a Python library to perform live time-traversal queries on RDF datasets. Time-traversal means being agnostic about time: a SPARQL query that is not run on the current state of the collection but over its entire history or over a specified timespan of that history [3]. This tool allows materializations – obtaining all versions of an entity over time, or its status at a given time. Furthermore, SPARQL queries can be performed to get the delta between two or more versions of one or more resources. Thereby, the time-agnostic-library realizes all the retrieval functionalities described in the taxonomy by Fernández et al. [3].

To complement this query software, the time-agnostic-browser is a web application built on top of the time-agnostic-library to achieve the same results via a graphical user interface.

The primary purpose of these developments is to offer a system for browsing the provenance [4] of RDF statements across time: who produced them, when, where the information was taken from, and what changes were made compared to the previous state of the resource. Knowledge of such information is essential because data changes over time, either because of the natural evolution of concepts or due to the correction of mistakes. Indeed, the latest version of knowledge may not be the most accurate. Such phenomena are particularly tangible in the Web of Data, as highlighted in a study by the Dynamic Linked Data Observatory, which noted the modification of about 38% of the nearly 90,000 RDF documents monitored for 29 weeks, and the permanent disappearance of 5% of them [5] (Figure 1).

Figure 1. Donut chart showing the results of the study conducted by the Dynamic Linked Data Observatory on the evolution of RDF documents [5].

Additionally, the truthfulness of data cannot be assessed without provenance records and a system to query them. In fact, the truth value of an assertion on the Web is never absolute, as demonstrated by Wikipedia, which in its official policy on the subject states: “The threshold for inclusion in Wikipedia is verifiability, not truth.” [6]. The Semantic Web does not alter that condition, and trustworthiness has to be evaluated by each application by probing the context of the statements [7]. It is a challenging task and thus, in the Semantic Web Stack, trust is the highest and most complex level to satisfy, subsuming all the previous ones (Figure 2).

Figure 2. The Semantic Web layers [7]. Trust is the uppermost level of the stack, subsuming all the others.

Notwithstanding these premises, at present the most extensive RDF datasets – DBPedia [8], Wikidata [9], Yago [10], and the Dynamic Linked Data Observatory [11] – do not use RDF to track changes and record the provenance of such changes. Instead, they all adopt backup-based archiving policies. Some of them, such as Yago 4, record provenance but not changes. As far as citation databases are concerned, OpenCitations is the only infrastructure to implement change-tracking mechanisms and to record full RDF provenance records for each data entity. Among the leading players in this field, neither Web of Science nor Scopus has adopted similar solutions.

In accordance with the OpenCitations Data Model (OCDM) [12], a provenance snapshot is generated by OpenCitations every time a bibliographical entity is created or modified. Each snapshot (prov:Entity) records the responsible agent (prov:wasAttributedTo), the generation time (prov:generatedAtTime), the invalidation time (prov:invalidatedAtTime), the primary source (prov:hadPrimarySource), and a link to the previous snapshot (prov:wasDerivedFrom), using terms from the Provenance Ontology. In addition, OCDM introduced a system to simplify restoring an entity’s status at a given time, by saving the delta between two versions as a SPARQL update query (prov:hasUpdateQuery) [13] (Figure 3). This approach enables one to restore an entity to a specific timepoint (snapshot) in a straightforward way by applying the inverse operations, i.e., deletions instead of additions, etc.

Figure 3. Provenance in the OpenCitations Data Model.
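
The idea of restoring an earlier state by applying inverse operations can be illustrated with a small Python sketch. It rests on simplifying assumptions: each delta is modelled as plain sets of added and removed triples rather than as a SPARQL update query, and the function name `restore_snapshot` and the toy data are hypothetical, not part of OCDM or of the time-agnostic-library.

```python
def restore_snapshot(current_state, deltas, target_index):
    """Return the entity's triples as they were at snapshot `target_index`.

    `deltas[i]` describes the change that produced snapshot i from
    snapshot i - 1, as a dict with 'added' and 'removed' sets of triples
    (a stand-in for the stored update query). `current_state` is the set
    of triples after the last snapshot.
    """
    state = set(current_state)
    # Undo the changes, newest first, down to the snapshot after the target.
    for delta in reversed(deltas[target_index + 1:]):
        state -= delta["added"]    # the inverse of an addition is a deletion
        state |= delta["removed"]  # the inverse of a deletion is an addition
    return state


# Toy history of one bibliographic resource (made-up identifiers).
TYPO = ("br/1", "title", "Open citatoins")
FIXED = ("br/1", "title", "Open citations")
YEAR = ("br/1", "year", "2020")

deltas = [
    {"added": {TYPO}, "removed": set()},    # snapshot 0: creation
    {"added": {FIXED}, "removed": {TYPO}},  # snapshot 1: title corrected
    {"added": {YEAR}, "removed": set()},    # snapshot 2: year added
]
current_state = {FIXED, YEAR}

state_at_creation = restore_snapshot(current_state, deltas, 0)
state_after_fix = restore_snapshot(current_state, deltas, 1)
```

Because only the current state and the deltas are needed, no full copy of each past version has to be stored in this model.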

This solution is concretely used in all the datasets related to the OpenCitations infrastructure, such as COCI, an open index containing almost 1.2 billion DOI-to-DOI citation links derived from the open reference data available in Crossref [14]. It is important to note that this OpenCitations provenance model is generic and reusable in any other context. Since the time-agnostic-library leverages OCDM, it too is generic and can be used for any RDF dataset that tracks changes and provenance as OpenCitations does.

The time-agnostic-library is released under the ISC license and is downloadable through pip [1]. Test-driven development was adopted as a software development process during its creation [15]. It makes three main classes available to the user: AgnosticEntity, VersionQuery, and DeltaQuery, for materializations, version queries, and delta queries, respectively (Listing 1).

Listing 1. Code template to achieve materializations, time-traversal queries, and delta queries.

All three operations can be performed over the entire available history of the dataset, or by specifying a time interval via a tuple in the form (START, END).
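As a rough illustration of what restricting an operation to a (START, END) interval means, the sketch below selects the snapshot times that fall inside such a tuple. It does not use the time-agnostic-library API: the function names, the inclusive bounds, and the use of `None` for an open-ended bound are assumptions made for this example only.

```python
from datetime import datetime
from typing import Optional

# Hypothetical interval type: either bound may be None, meaning "unbounded".
Interval = tuple[Optional[datetime], Optional[datetime]]


def within(interval: Interval, moment: datetime) -> bool:
    """Return True if `moment` lies inside the (inclusive) interval."""
    start, end = interval
    return (start is None or moment >= start) and (end is None or moment <= end)


def select_snapshots(times: list[datetime],
                     interval: Optional[Interval] = None) -> list[datetime]:
    """Return the snapshot times to consider, oldest first.

    With no interval the whole history is returned; otherwise only the
    snapshots generated between START and END are kept.
    """
    if interval is None:
        return sorted(times)
    return sorted(t for t in times if within(interval, t))


history = [datetime(2021, 6, 1), datetime(2021, 1, 1), datetime(2021, 9, 15)]
whole = select_snapshots(history)
spring_to_summer = select_snapshots(history, (datetime(2021, 3, 1), datetime(2021, 8, 31)))
print(spring_to_summer)
```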

The time-agnostic-browser [2] is also released under the ISC license and can be run as a Flask application. It is organized into two macro-sections: “Explore” and “Query”. In the former, a text input accepts a URI. By submitting it, the entire history of the corresponding resource is displayed. In the latter, a text area receives a SPARQL query, which is resolved on all dataset states. Its main added value is hiding the triples and the complexity of the underlying RDF model: predicate URIs, as well as subjects and objects, appear in a human-readable format. Moreover, all the entities are displayed as links, providing shortcuts to reconstruct the history of the related resources (Figure 4).

Figure 4. Graphical user interface of an entity history reconstruction through the time-agnostic-browser.

The efficiency of time-agnostic-library was measured with two types of benchmarks [16], one on execution times and the other on the amount of computer memory (RAM) required by ten different use cases, each repeated ten times to produce reliable results and reduce the effect of outliers. In light of these benchmarks, time-agnostic-library has proven effective for any materialization. Structured queries are swift if all subjects are known or deducible. On the other hand, the presence of unknown subjects in the user’s SPARQL query involves the identification of all present and past entities that satisfy that pattern, and so requires a more significant amount of time and resources. Specifically, all materializations and the cross-version structured query with known subjects required about half a second and about 50 MB of RAM; conversely, with unknown subjects, 581 seconds and 519 MB of RAM were required. It can be concluded that the proposed software can be used effectively in all cases where the subject is known, that is, for any materialization or for SPARQL queries formulated without isolated triple patterns containing unknown subjects.
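A benchmark of this general kind (one operation repeated ten times, recording time and memory) could be sketched as follows. This is not the code behind the published results [16]: the `benchmark` function and the placeholder workload are assumptions for illustration, and `tracemalloc` only tracks memory allocated by Python itself, which is not the same as the RAM used by the whole process.

```python
import statistics
import time
import tracemalloc
from typing import Callable


def benchmark(operation: Callable[[], object], repetitions: int = 10) -> dict:
    """Run `operation` several times and summarise duration and peak memory."""
    durations, peaks = [], []
    for _ in range(repetitions):
        tracemalloc.start()
        started = time.perf_counter()
        operation()
        durations.append(time.perf_counter() - started)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
        tracemalloc.stop()
        peaks.append(peak / 1_000_000)             # bytes to megabytes
    # The median is less sensitive to outliers than the mean.
    return {
        "runs": repetitions,
        "median_seconds": statistics.median(durations),
        "median_peak_mb": statistics.median(peaks),
    }


# Placeholder workload standing in for a materialization or a query.
result = benchmark(lambda: sorted(range(100_000), reverse=True))
print(result)
```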

Other software solutions for such problems have been proposed. Table 1 shows the list of available software to perform materializations and time-traversal queries on RDF datasets. As can be observed, time-agnostic-library is the only one to support all retrieval functionalities without requiring pre-indexing processes. This feature makes it particularly suitable for use in scenarios with large amounts of data that often change over time. Moreover, compared to the approach of Im, Lee and Kim [17] and OSTRICH [18], the OpenCitations Data Model only requires storing the current state of the dataset, rather than the original one, allowing one to query the latest version, without additional computational effort to first re-create the original version.

Software | Version materialization | Delta materialization | Single-version structured query | Cross-version structured query | Single-delta structured query | Cross-delta structured query | Live
PromptDiff [19] +++
SemVersion [20] +++
Im, Lee, & Kim, 2012 [17] +++++
R&Wbase [21] ++++
x-RDF-3X [22] +++
v-RDFCSA [23] ++++++
OSTRICH [18] +++
Tanon & Suchanek, 2019 [24] ++++++
time-agnostic-library [1] +++++++
Table 1. Comparison between time-agnostic-library and preexisting software to achieve materializations and time-traversal queries on RDF datasets.

The OpenCitations Data Model and the time-agnostic-library software are the prerequisites that will allow OpenCitations to involve third parties, for example members of staff in academic libraries, in the submission, curation and updating of OpenCitations bibliographic and citation data. At this stage, all entities in COCI have a single snapshot: the one made at the time of creation. However, since these entities may be modified, corrected or enriched over time, it is imperative to have appropriate software tools available for use by curators. With the time-agnostic-library software and its associated time-agnostic-browser, it will be possible for a curator to explore the entire history of the changes within an RDF dataset, to know when they were made, based on which source, and by which responsible agent, thus ensuring the reliability and verifiability of data, and facilitating any necessary further changes.

References

[1] A. Massari, time-agnostic-library. 2021. Available: https://archive.softwareheritage.org/swh:1:snp:d7fd1754377f45d16afb61efc770815b5a3c8f83

[2] A. Massari, time-agnostic-browser. 2021. Available: https://archive.softwareheritage.org/swh:1:dir:337f641375cca034eda39c2380b4a7878382fc4c

[3] J. D. Fernández, A. Polleres, and J. Umbrich, ‘Towards Efficient Archiving of Dynamic Linked’, in DIACRON@ESWC, Portorož, Slovenia: Computer Science, 2015, pp. 34–49.

[4] December, ‘Provenance XG Final Report’. 2010. Available: http://www.w3.org/2005/Incubator/prov/XGR-prov-20101214/

[5] T. Käfer, A. Abdelrahman, J. Umbrich, P. O’Byrne, and A. Hogan, ‘Observing Linked Data Dynamics’, in The Semantic Web: Semantics and Big Data, vol. 7882, P. Cimiano, O. Corcho, V. Presutti, L. Hollink, and S. Rudolph, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 213–227. doi: 10.1007/978-3-642-38288-8_15

[6] S. L. Garfinkel, ‘Wikipedia and the Meaning of Truth’, MIT Technology Review, 2008, [Online]. Available: https://stephencodrington.com/Blogs/Hong_Kong_Blog/Entries/2009/4/11_What_is_Truth_files/Wikipedia%20and%20the%20Meaning%20of%20Truth.pdf

[7] M.-R. Koivunen and E. Miller, ‘Semantic Web Activity’, W3C, Nov. 02, 2001. https://www.w3.org/2001/12/semweb-fin/w3csw

[8] F. Orlandi and A. Passant, ‘Modelling provenance of DBpedia resources using Wikipedia contributions’, Journal of Web Semantics, vol. 9, no. 2, pp. 149–164, Jul. 2011, doi: 10.1016/j.websem.2011.03.002.

[9] P. Dooley and B. Božić, ‘Towards Linked Data for Wikidata Revisions and Twitter Trending Hashtags’, in Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services, Munich Germany, Dec. 2019, pp. 166–175. doi: 10.1145/3366030.3366048.

[10] Yago Project, ‘Download data, code, and logo of Yago projects’, Yago, 2021. https://yago-knowledge.org/downloads (accessed Sep. 24, 2021).

[11] J. Umbrich, M. Hausenblas, A. Hogan, A. Polleres, and S. Decker, ‘Towards Dataset Dynamics: Change Frequency of Linked Open Data Sources’, in Proceedings of the WWW2010 Workshop on Linked Data on the Web, Raleigh, USA, 2010. Available: http://ceur-ws.org/Vol-628/ldow2010_paper12.pdf

[12] M. Daquino, S. Peroni, and D. Shotton, ‘The OpenCitations Data Model’, 2020, doi: 10.6084/M9.FIGSHARE.3443876.V7.

[13] S. Peroni, D. Shotton, and F. Vitali, ‘A Document-inspired Way for Tracking Changes of RDF Data’, in Detection, Representation and Management of Concept Drift in Linked Open Data, Bologna, 2016, pp. 26–33. Available: http://ceur-ws.org/Vol-1799/Drift-a-LOD2016_paper_4.pdf

[14] I. Heibi, S. Peroni, and D. Shotton, ‘Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations’, Scientometrics, vol. 121, no. 2, pp. 1213–1228, Nov. 2019, doi: 10.1007/s11192-019-03217-6.

[15] K. Beck, Test-driven development: by example. Boston: Addison-Wesley, 2003.

[16] A. Massari, ‘time-agnostic-library: benchmark results on execution times and RAM’. Zenodo, Oct. 05, 2021. doi: 10.5281/ZENODO.5549648.

[17] D.-H. Im, S.-W. Lee, and H.-J. Kim, ‘A Version Management Framework for RDF Triple Stores’, Int. J. Softw. Eng. Knowl. Eng., vol. 22, pp. 85–106, 2012.

[18] R. Taelman, M. V. Sande, and R. Verborgh, ‘OSTRICH: Versioned Random-Access Triple Store’, in Companion Proceedings of the Web Conference 2018, 2018, pp. 127–130. Available: https://core.ac.uk/download/pdf/157574975.pdf

[19] N. F. Noy and M. A. Musen, ‘Promptdiff: A Fixed-Point Algorithm for Comparing Ontology Versions’, in Proc. of IAAI, 2002, pp. 744–750.

[20] M. Völkel, W. Winkler, Y. Sure, S. Kruk, and M. Synak, ‘SemVersion: A Versioning System for RDF and Ontologies’, 2005.

[21] M. V. Sande, P. Colpaert, R. Verborgh, S. Coppens, E. Mannens, and R. V. Walle, ‘R&Wbase: Git for triples’, 2013.

[22] T. Neumann and G. Weikum, ‘x-RDF-3X: Fast Querying, High Update Rates, and Consistency for RDF Databases’, Proceedings of the VLDB Endowment, vol. 3, pp. 256–263, 2010.

[23] A. Cerdeira-Pena, A. Farina, J. D. Fernandez, and M. A. Martinez-Prieto, ‘Self-Indexing RDF Archives’, in 2016 Data Compression Conference (DCC), Snowbird, UT, USA, Mar. 2016, pp. 526–535. doi: 10.1109/DCC.2016.40.

[24] T. Pellissier Tanon and F. Suchanek, ‘Querying the Edit History of Wikidata’, in The Semantic Web: ESWC 2019 Satellite Events, vol. 11762, P. Hitzler, S. Kirrane, O. Hartig, V. de Boer, M.-E. Vidal, M. Maleshkova, S. Schlobach, K. Hammar, N. Lasierra, S. Stadtmüller, K. Hose, and R. Verborgh, Eds. Cham: Springer International Publishing, 2019, pp. 161–166. doi: 10.1007/978-3-030-32327-1_32.

Posted in Bibliographic references, Citations as First-Class Data Entities, Data publication, Open Citations

Coverage of open citation data approaches parity with Web of Science and Scopus

Guest blog post by Alberto Martín-Martín, Facultad de Comunicación y Documentación, Universidad de Granada, Spain <albertomartin@ugr.es>

In this post, as a contribution to Open Access Week, Alberto Martín-Martín shares his comparative analysis of COCI and other sources of open citation data with those from subscription services, and comments on their relative coverage.

Comprehensive bibliographic metadata is essential for the development of effective understanding and analysis across all phases of the research workflow. Commercial actors have historically filled the role of infrastructure providers of bibliographic and citation data, but their choice of subscription-based business models and/or restrictive user licenses has significantly limited how users and other parties can access, build upon, and redistribute the information available on those platforms. Locking bibliographic and citation metadata behind these barriers is problematic, as it hinders innovation and is an obstacle to reproducibility.

Fortunately, the process of digital transformation that scientific communication is currently undergoing is providing us with the tools to get closer to the ideal of science as a public good. One of the most successful initiatives in this area is Crossref, arguably the single most critical piece of research metadata infrastructure currently in existence. I consider the best thing about it to be its commitment to openness. Not only is Crossref responsible for minting many of the DOIs that are assigned to academic publications, but it also publishes metadata about these publications (for more than 120 million records in its latest public data file) without imposing any access or reuse limitations.

Crossref metadata has already boosted innovation in a variety of academic-oriented tools. New discovery services such as Dimensions, The Lens, and Scilit all take advantage of Crossref metadata to keep their indexes up to date with the latest publications. The open-source reference manager Zotero is able to pull metadata associated with a given DOI from Crossref’s servers, providing an easy way to populate one’s personal reference collection that is more reliable than using Google Scholar. The Unpaywall database uses Crossref metadata (among other data sources) to keep track of which documents are Open Access, and this data is in turn used by Unsub, a service that helps libraries make more informed decisions about their journal subscriptions.

Historically, citation indexing has been a functionality available only from a few subscription-based data sources (most notably Web of Science and Scopus), or from free but largely restricted sources (e.g., Google Scholar). In recent years, however, commercial exclusivity over citation data has been waning. Digital publishing workflows make it easier for publishers to deposit the list of cited references along with the rest of the metadata when they register a new document in Crossref, and many are already doing it. Crossref’s policy is to make these lists of references publicly available by default, although publishers can elect to prevent their public release. From this, it follows that if most publishers deposited their reference lists in Crossref and consented to make them open, a comprehensive open citation index, one that is free of the restrictions present in traditional platforms, could be built.

The Initiative for Open Citations (I4OC) is an advocacy group that has been working since 2017 to achieve this precise goal, and it has already managed to convince a large number of publishers (over two thousand) to open the references they deposit in Crossref. In the first half of 2021, Elsevier, the American Chemical Society, and Wolters Kluwer joined this group, so that all the major scholarly publishers now support I4OC and have open references at Crossref, with the exception of IEEE (the Institute of Electrical and Electronics Engineers). Thanks to the efforts of I4OC and the collaboration of publishers, 88% of the publications for which publishers have deposited references in Crossref are now open. This has allowed organizations such as OpenCitations (one of the founding members of I4OC) to create a non-proprietary citation index using these data, namely COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. Other open citation indexes such as the NIH Open Citation Collection (NIH-OCC) and Refcat have also been recently released.

How do such open citation indexes compare to long-established indexes? In 2019, I set out with colleagues to analyze the coverage of citations contained within the most widely used academic bibliographic data sources (Web of Science, Scopus, and Google Scholar) to a selected corpus of 2,515 highly-cited English-language documents published in 2006 from 252 subject categories, and to compare this to the coverage provided by some of the more recent data sources (Microsoft Academic, Dimensions, and COCI). At that time, COCI was the smallest of the six indexes, containing only 28% of all citations. For comparison, Web of Science contained 52%, and Scopus contained 57%.

There are a number of reasons for those differences: first, at that point some of the larger commercial publishers including Elsevier, IEEE, and ACS, which routinely deposit references in Crossref, had not yet opened them. Second, many smaller publishers still do not deposit their reference lists in Crossref. Third, COCI only captures citation relationships between documents that have DOIs, thus missing citations to publications that lack them. Finally, while for our study data collection from all sources was carried out during May/June of 2019, COCI at that time had not been updated since November 2018, which increased its disadvantage when compared to other data sources with more frequent updates.
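To make the third point concrete, here is a minimal Python sketch of how DOI-to-DOI citation pairs could be derived from Crossref-style work records. It is not the actual COCI pipeline: the function name `citation_pairs` and the sample records (with made-up DOIs) are assumptions for illustration; only the `DOI` and `reference` field names follow the layout of Crossref work metadata.

```python
def citation_pairs(works: list[dict]) -> list[tuple[str, str]]:
    """Extract (citing DOI, cited DOI) pairs from Crossref-style work records.

    References that carry no DOI cannot be turned into a DOI-to-DOI link
    and are skipped, which is why such citations are missing from the index.
    """
    pairs = []
    for work in works:
        citing = work.get("DOI")
        if not citing:
            continue
        # A work with closed or undeposited references has no "reference" list.
        for ref in work.get("reference", []):
            cited = ref.get("DOI")
            if cited:
                # DOIs are case-insensitive, so normalise before storing.
                pairs.append((citing.lower(), cited.lower()))
    return pairs


# Invented sample records, for illustration only.
sample_works = [
    {
        "DOI": "10.1234/citing.1",
        "reference": [
            {"key": "ref1", "DOI": "10.1234/CITED.A"},
            {"key": "ref2", "unstructured": "A book chapter without a DOI, 1998."},
        ],
    },
    {"DOI": "10.1234/citing.2"},  # no open reference list available
]

pairs = citation_pairs(sample_works)
print(pairs)
```

In this toy input, one of the two references of the first work is lost because it has no DOI, and the second work contributes nothing because its reference list is not available.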

Since Elsevier is the largest academic publisher in the world, its recent opening of references at Crossref resulted in a significant increase in the total number of openly available Crossref references. The most recent version of COCI (dated 3 September 2021, and based on open references to works with DOIs within the Crossref dump dated August 2021) now contains both the processed references from Elsevier, and the references in the most recently published articles by ACS (the complete backfile of ACS references will appear in future versions of COCI).

Given these significant developments, how much has the picture changed? To find this out, I updated our 2019 analysis using the version of COCI released on September 3rd 2021 and the NIH-OCC dataset released in the same month. To carry out a reasonably fair comparison while reusing the data extracted in 2019 from the other sources, I employed the same corpus of target documents, and only used citations in which the citing document was published before the end of June 2019. The intention was to learn how much the coverage of open citation data has grown as a result of the subsequent opening of reference lists in Crossref that were not public in 2019, and similar efforts.

The combination of COCI’s and NIH-OCC’s September 2021 releases contained more than 1.62 million citations to our sample corpus of documents from all areas, a 91% increase over the 0.85 million citations that we were able to recover in 2019 from COCI alone. Considering the citations available in all data sources, 53% of all citations are now available from these two open sources under CC0 waivers, up from the 28% we found in 2019. This coverage now surpasses the 52% found by Web of Science, and is much closer to the 54% found by Dimensions, and the 57% covered by Scopus. The relative overlap between COCI and the other data sources has also significantly increased: in 2019 COCI found 47% of the citations available in Web of Science, whereas now open citation data sources find 87% of the WoS citations. In the case of Scopus, in 2019 COCI found 44% of the citations available in Scopus: the percentage available from open sources has now increased to 81%. The number of citations found by COCI but not present in the other data sources has also increased slightly. These data are presented graphically in Figure 1.

Fig. 1. Percentage of citations found by each database, relative to all citations (first row), and relative to the number of citations found by the other databases (subsequent rows).
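As a quick check of the arithmetic behind these figures, the short Python sketch below recomputes the growth from 0.85 to 1.62 million citations and shows how a relative coverage percentage of the kind plotted in Figure 1 can be obtained from two sets of citations. The function names and the tiny example sets are hypothetical; only the two citation counts come from the analysis above.

```python
def percent_increase(old: float, new: float) -> float:
    """Growth from `old` to `new`, expressed as a percentage of `old`."""
    return (new - old) / old * 100


def relative_coverage(found: set, reference: set) -> float:
    """Share (in %) of the citations in `reference` that also appear in `found`."""
    return len(found & reference) / len(reference) * 100


# Citation counts in millions: COCI in 2019 vs. COCI + NIH-OCC in 2021.
growth = percent_increase(0.85, 1.62)
print(round(growth))  # 91

# Toy citation identifiers, invented for illustration.
open_sources = {"c1", "c2", "c3", "c5"}
wos = {"c1", "c2", "c3", "c4"}
all_citations = open_sources | wos

print(relative_coverage(open_sources, wos))            # 75.0
print(relative_coverage(open_sources, all_citations))  # 80.0
```

The first row of Figure 1 corresponds to the second kind of call (coverage relative to all citations found by any source), while the subsequent rows correspond to the first kind (coverage relative to one other database).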

Where are these new citations coming from? Well, as we might expect, references from articles published in Elsevier journals comprise the lion’s share of the newly found citations in open data sources (close to half of all new citations), as shown in Figure 2. But there are also some IEEE citations here. This is because until recently reference lists from IEEE publications were available in the ‘limited’ Crossref category to members of Crossref Metadata Plus, a paid-for service that provides a few additional advantages over the free services Crossref provides. As a member of Crossref Metadata Plus, OpenCitations obtained these reference lists while they were available and included them in COCI. Subsequently, IEEE decided to make their references completely closed, explaining why references from more recent IEEE publications are not included in COCI.

Fig. 2. Increases between 2019 and 2021 in the citations indexed by open sources (COCI + NIH-OCC) from the articles of different publishers.

There can be no doubt that open citation data is of benefit to the entire academic community. Thanks to COCI, NIH-OCC, and similar initiatives, and despite some setbacks, we are already witnessing how open infrastructure can help us develop models and practices that are better aligned with the opportunities that our current digital environment offers and the challenges that our society faces.

Conclusion: The coverage of citation data available under CC0 waivers from open sources is now comparable to that from subscription sources such as Web of Science and Scopus, offering a viable alternative upon which to base open and reproducible metrics of academic performance.

Posted in Bibliographic references, Open academic analytics, open access, Open Citations, Open Science

Open Access Tage 2021: valuable insights from the libraries in the German-speaking region 

On September 27, OpenCitations’ director Silvio Peroni, together with Niels Stern (DOAB/OAPEN) and James MacGregor (PKP), held the online workshop “How Open Infrastructure Benefits Libraries” during the Open Access Tage 2021. Open-Access-Tage (Open Access Days) are the annual central platform for the steadily growing Open Access and Open Science community from Germany, Austria and Switzerland, and are aimed at all those involved with the possibilities, conditions and perspectives of scientific publishing.  

The workshop gathered three of the SCOSS-supported infrastructures to discuss how Open Infrastructures (OIs) could encourage the engagement of university libraries, and how they could be a beneficial, game-changing alternative to commercial infrastructures. This theme, which was also presented during the last LIBER conference, was discussed here from a new perspective, by involving the specific case of the libraries from the German-speaking region. Their point of view particularly emerged during the second part of the workshop, during which the participants were divided into two breakout rooms to discuss two questions each. These are the answers and comments that emerged from the discussions.

1. What would prevent or encourage libraries in the German-speaking region to support open infrastructures? 

The three main concepts held to be crucial in this field were transparency, promotion and governance.  

Transparency: German libraries and public institutions often deal with strict funding limitations relating to donations. It is therefore crucial for OIs (a) to present clearly how libraries can get involved and how much money is needed, (b) to communicate what they do and how they can add value to libraries compared to other services, and (c) to clarify the direct return and benefits on investments. These points would make it easier to recommend OIs internally, especially when people from subject-specific institutions are interested in subject-independent OIs. Point (b) leads to the Promotion issue: Open Infrastructures should promote themselves not only on a global level, by communicating their impact in the open research movement as against non-transparent proprietary services, but also at a local level, by providing information about the usage (and the value) of their services at an institutional and/or national level. This case-by-case narration (with attention to the specific benefits) would make it easier for the institutions to evaluate the sustainability of the investment. An incentive to donate is being actively involved in the community governance, i.e., through a board membership.

Nevertheless, it is also necessary for libraries to “take courage” when investing in such OIs, and, when possible, to overcome administrative boundaries by forming consortia. Finally, of particular note was a desire to see locally-managed sub-communities that could speak specifically to the German (or whichever) language environment, much as ORCID arranges itself.

2. Community Governance. What kind of involvement do you want to see, how do you want to be involved? 

Some common problems which prevent the institutions from being involved are (a) a general concern about the fact that negotiations with publishers are typically the main focus of OA discussions, leaving little time to focus on OIs and other open initiatives, and (b) a lack of time, or of guidelines, for evaluating the different infrastructures to invest in. This is why SCOSS was appreciated as an intermediary in the decision process, because of its own rigorous evaluation and selection mechanism. The community-funding approach proposed by SCOSS thus seems to be the preferred way by which to support OIs.

Regarding community governance, one idea could be to involve interested scholars in the governance of the open infrastructures (with the library acting as an interface between the open infrastructure and the scholars) rather than only involving library staff, although this idea was argued against in the second group, as researchers are often perceived as too busy to be functional in operational infrastructure groups. What also emerged from this second question is an interest in community involvement on different levels, for example as a community of practice or through discussion boards, mailing lists, periodic meet-ups, workshops, newsletters, etc. The community could also be articulated into local sub-communities, as in the successful case of ORCID and ORCID_DE.

Posted in Bibliographic references, open access, Open Citations, Open scholarship, Open Science

Save the dates: OpenCitations October events 

With the numerous September events in which the OpenCitations directors have recently been involved behind us, it is now time to announce the participation of our director Silvio Peroni in two October events.

On Wednesday 6th, Silvio will take part in the Beilstein OpenScience Symposium 2021 (October 5-7), giving a short presentation “Open Citations, an Open Infrastructure to Provide Access to Global Scholarly Bibliographic and Citation Data” during the Poster Flash Talk Presentation (17:10-18:00 CEST). The Beilstein OpenScience Symposium is an annual event that gathers leaders in the FAIR and Open Data movement, covering a wide range of research fields, including biomedical research, physics and social science, and exploring how open data practices are transforming sectors outside academia. The 2021 online edition will present a series of talks addressing the many ways that data transparency contributes to research progress. Among them, the poster presentations involve short oral presentations on Wednesday, 6th October, to accompany the posters that will be displayed throughout the entire symposium. Poster abstracts are available in the Abstract Book that can be downloaded from the Beilstein Symposium website: https://www.beilstein-institut.de/en/symposia/open-science/program/ .

You can register for the event here: https://www.beilstein-institut.de/en/symposia/open-science/registration/

The poster and slides from the presentation are available on Zenodo:

Peroni, S. (2021, October 6). Open Citations, an Open Infrastructure to Provide Access to Global Scholarly Bibliographic and Citation Data—Poster Flash Talk Session slides. Beilstein Open Science Symposium 2021, Virtual Event. Zenodo. https://doi.org/10.5281/zenodo.5553025

Peroni, S., Shotton, D. W., & Di Giambattista, C. (2021). Open Citations, an Open Infrastructure to Provide Access to Global Scholarly Bibliographic and Citation Data. https://doi.org/10.5281/zenodo.5553040

The second event is the annual European Computer Science Summit (ECSS), organized by Informatics Europe and involving academics, industry leaders, decision makers and others interested in Informatics/Computer Science research and education in Europe. ECSS 2021, “Informatics for a Sustainable Future” (Oct. 25-27), will be held as a hybrid event, involving both online and on-site sessions held in Madrid, at the Facultad de Ciencias de la Actividad Física y del Deporte (INEF), Universidad Politécnica de Madrid, located at Calle de Martín Fierro, 7.

During the last day of the meeting (Oct. 27), Silvio Peroni will be speaking at the “National Informatics Associations Workshop”, an annual workshop organised by Informatics Europe in collaboration with the National Informatics Associations in Europe. This year the workshop will address the themes Informatics in Interdisciplinary Curricula and Research Evaluation in Informatics, thus focusing on an important question: how to recognise, assess and credit research contributions specific to Informatics, such as conference publications and software artefacts. Elaborating on this, Silvio’s talk (to be delivered in person, rather than online!) is entitled “Open citations in Informatics” (9:00 CEST).  

For further information and registration: https://www.informatics-europe.org/ecss/registration/how-to-register.html

We thank Beilstein Institute and Informatics Europe for involving OpenCitations in these international events, which provide opportunities to promote the OpenCitations infrastructure and services in stimulating environments.

We hope to see you there!

Posted in Bibliographic references, open access, Open Citations, Open Science, Uncategorized