The blog of OpenCitations (http://opencitations.net), an independent infrastructure organization for open scholarship dedicated to the publication of open bibliographic and citation data by the use of Semantic Web (Linked Data) technologies.
We are happy to announce OpenCitations’ participation in a number of online conferences and events during the next few weeks. Our directors Silvio Peroni and David Shotton will be speaking at the Open Science Fair 2021, the OASPA Conference 2021 and Open Access Tage.
Open Science Fair 2021 (20-23 September) is an event organized by OpenAIRE, in collaboration with some key international initiatives in the area of Open Science: COAR, EIFL, Force11, LA Referencia, LIBER, OPERAS, SPARC, SPARC Europe. As at a real fair, visitors can explore virtual pavilions and take part in various Keynote Talks, Parallel Sessions and Workshops dedicated to Open Science. Silvio Peroni will give two talks on Tuesday 21:
The Workshop “The perils of being invisible. Collective funding models for Open Science infrastructure” (16:30-18:00 CEST) “will help identify the main challenges of collective funding models for Open Science Infrastructure, as well as explore the path forward to make them more efficient”. Silvio Peroni, Niels Stern (DOAB/OAPEN), James MacGregor (PKP), Agata Morka (SPARC Europe/SCOSS), Jon Treadway (the Great North Wood Consulting), Jean-Francois Lutz (University of Lorraine) and Vanessa Proudman (SPARC Europe) will reflect on the evanescence of Open Science Infrastructure (OSI) in library budget considerations. The speakers will also promote interaction with other workshop participants in order to create a collective dialogue. You can register for the event here: https://www.opensciencefair.eu/2021/workshops/the-perils-of-being-invisible
The OASPA Conference 2021 (21-23 September), entitled “Designing 21st Century Knowledge Sharing Systems”, will be dedicated to “many timely and fundamental topics relating to open scholarly communication”, including “the ongoing impact of the pandemic”. David Shotton will take part in the Poster Lightning Talks Session 3 (Thursday 23, 1-2 pm BST), with the title “OpenCitations – what does the future hold?”, a reflection on OpenCitations’ values, data, services, achievements so far, and plans for the future. For further information and registration: https://oaspa.org/conference/
Silvio Peroni, together with James MacGregor (Public Knowledge Project) and Niels Stern (OAPEN), will hold the Workshop “How Open Infrastructure Benefits Libraries?” (September 27, 11:30-13:00 CEST) as part of the Open Access Tage 2021 (27-29 September), an annual event dedicated to Open Access initiatives and community. During the workshop, the speakers will investigate the social and economic value of open infrastructures for libraries. For more information and to register for the event: https://oat21.sched.com/event/kdFg/workshop-2-how-open-infrastructure-benefits-libraries
We thank the organizers of these prestigious international events for having invited OpenCitations to participate. Open Science resounds and grows through such community-centered initiatives.
If you wish to learn more about Open Science, ongoing Open Access initiatives, and OpenCitations’ commitment to and activities within these areas, don’t miss the opportunity to participate in these online conferences … see you there!
CDL’s commitment to sustainable open scholarship has great value for the global scholarly community. Through its investments and partnerships, CDL aims to create an international academic and librarian dialogue, trusting in the idea that “the university, its scholars and its libraries thrive when we transcend organizational boundaries and commit ourselves to shared investments”.
CDL’s contribution will generously support OpenCitations throughout 2021-2023. CDL funding in the fiscal year 2020-2021 also includes two other SCOSS-endorsed infrastructures, OAPEN and DOAB, the non-profit organization Open Access Switchboard, and the services PsyArXiv and SCOAP3 Books. As can be read in the recent post by Ellen Finnie, this investment reflects CDL’s “commitment to ’invest in open’ by allocating a portion of our collections funding to the development of open content and infrastructure in support of UC scholarship and teaching”.
The OpenCitations team is grateful to be included in CDL’s ongoing investment in open infrastructure. Thank you!
We’re now proud to announce the September 2021 release of COCI, which is based on open references to works with DOIs within the Crossref dump dated August 2021. This new release extends COCI with more than 92 Million additional citations, giving a total number of more than 1.18 Billion DOI-to-DOI citation links.
This latest release includes citations from the most recent articles published by the American Chemical Society, whose bibliographic references were opened in February 2021. Citations from the ACS back issues will be available in the next COCI release, once a new processing of all the Crossref data has been completed.
Ivan Heibi, Silvio Peroni & David Shotton (2019). Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. Scientometrics, 121 (2): 1213-1228. DOI: https://doi.org/10.1007/s11192-019-03217-6
Finally, just a reminder that the bibliographic and citation data in COCI:
can be queried using the OpenCitations Indexes SPARQL endpoint;
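As a rough illustration of what such a query might look like, here is a minimal Python sketch that builds a SPARQL query for the citations pointing to a given DOI and encodes it as a GET request URL. The class and property names come from CiTO; the endpoint URL, the DOI-based IRI form of the cited entity, and the helper names are assumptions made for this example and should be checked against the OpenCitations documentation. No request is actually sent.

```python
from urllib.parse import urlencode

# Assumed endpoint URL; check the OpenCitations website for the current one.
SPARQL_ENDPOINT = "https://opencitations.net/index/sparql"


def build_citations_query(cited_doi: str, limit: int = 10) -> str:
    """Return a SPARQL query listing citations whose cited entity is `cited_doi`."""
    # Assumption: cited entities are identified by DOI-based IRIs of this form.
    cited_iri = f"http://dx.doi.org/{cited_doi}"
    return (
        "PREFIX cito: <http://purl.org/spar/cito/>\n"
        "SELECT ?citation ?citing WHERE {\n"
        "  ?citation a cito:Citation ;\n"
        "            cito:hasCitingEntity ?citing ;\n"
        f"            cito:hasCitedEntity <{cited_iri}> .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(query: str) -> str:
    """Encode the query as a GET request URL (SPARQL 1.1 Protocol style)."""
    return f"{SPARQL_ENDPOINT}?{urlencode({'query': query})}"


# Example: citations of the COCI software review cited above.
query = build_citations_query("10.1007/s11192-019-03217-6", limit=5)
print(build_request_url(query))
```

The resulting URL can be pasted into a browser or passed to any HTTP client; the response format depends on the endpoint's content negotiation.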
Now that OpenCitations is hosting over one billion freely available scholarly bibliographic citations, this is perhaps an opportune moment to look back to the start of this initiative. A little over eleven years ago, on 24 April 2010, I spoke at the Open Knowledge Foundation Conference, OKCon2010, in London, on the topic
OpenCitations: Publishing Bibliographic Citations as Linked Open Data
I reported that, earlier that same week, I had applied to Jisc for a one-year grant to fund the OpenCitations Project (opencitations.net). Jisc (at that time ‘The JISC’, the Joint Information Systems Committee) was tasked by the UK government, among other things, to support research and development in information technology for the benefit of the academic community.
The purpose of that original OpenCitations R&D project was to develop a prototype in which we:
harvested citations from the open access biomedical literature in PubMed Central;
described and linked them using CiTO, the Citation Typing Ontology;
encoded and organized them in an RDF triplestore; and
I told those at the conference that in this demonstration project, with limited JISC funding, we could not hope to “boil the whole ocean”, but that nevertheless there would be substantial benefits from even partial coverage of citation data from the scholarly literature:
We could show the way and establish best practice.
Despite partial coverage, all key papers would most likely be cited several times.
The overall topological structure of the citation network would be revealed.
We would create a ‘benchmark’ corpus of high-quality RDF citation data that could be used to develop analytical and visualization tools.
We could show the value of open citation data in helping scholars to discover full text articles of all types, and thus encourage subscription-access publishers to release their reference metadata.
The important thing, I said, was to make a start!
The Jisc OpenCitations Project
That JISC grant application was funded, and the project, to last for a year with modest funding of £100K, started in my lab in the Department of Zoology at Oxford University on 1st June 2010, and was subsequently extended for a further six months.
Using data from the Open Access subset of PubMed Central, we created the first prototype release of the OpenCitations Corpus of linked bibliographic citation data, containing 6,529,815 independent bibliographic records of both citing and cited entities, comprising references to ~20% of all post-1980 articles recorded in PubMed, including those to all the most important highly cited papers in every field of biomedical endeavour.
This achievement was almost entirely the result of the excellent work by our chief data wrangler Alex Dutton, whose skill and natural feel for linked data did wonders for this project. Ben O’Steen, Graham Klyne and Alistair Miles made important contributions.
The project also resulted in many other developments, described here, most of which were developed or at least initiated during a short but wonderfully productive collaboration with Silvio Peroni, who spent six months with me in 2010 as a doctoral student intern from the University of Bologna, to which he subsequently returned to complete his thesis and develop his academic career.
the deconstruction and re-development of the original version of CiTO into a suite of orthogonal and complementary ontologies covering the whole domain of scholarly publishing – the SPAR (Semantic Publishing and Referencing) Ontologies [2, 3];
the mapping of various existing metadata schemas into RDF using SPAR, including the DataCite Metadata Schema, and subsequently JATS (now the default NISO standard for XML markup of scholarly documents); and
After the Jisc funding ended and I, after a long career in biological teaching and research, formally retired from the Department of Zoology at Oxford University, members of the initial OpenCitations team moved on to other things. Like so many grant-funded academic projects whose initial financial support had dried up, OpenCitations could have foundered at that stage, as an interesting prototype but with too little content to be useful. However, the concept of providing an open alternative to proprietary citation indexes was too important to abandon. But how could it be transitioned into something enduring and useful, particularly when as a matter of principle one had decided that the citation data should be made freely available, thus precluding income generation by charging for ‘premium’ services or the formation of a commercial spin-off?
Finally, I realized that something radical needed to be done to move OpenCitations forward. I had maintained a lively collaboration with Silvio Peroni at the University of Bologna, resulting between 2011 and 2014 in the publication of 18 articles and conference papers concerning the SPAR ontologies, ontology development, documentation and visualization, and related topics, and in 2015 I invited him to start working with me directly on OpenCitations. It was the best decision I could have made. We decided to take the initial concept and re-implement it from the bottom up. OpenCitations gave Silvio a major computer science project to which he could apply his considerable talent, and soon resulted in the development of a revised RDF data model for describing citation data, the OpenCitations Data Model (OCDM), and a suite of new software tools to harvest, organise and publish citations as linked open data. The credit for almost all the subsequent conceptual and technical developments within OpenCitations, which have incrementally led to our present situation, is due to Silvio Peroni, and the scholarly community is indebted to him for the intelligence, skill and diligent application he has given to OpenCitations over the past six years. I am truly honoured to have Silvio as co-Director of OpenCitations, and wish to take this opportunity to acknowledge his contributions and to thank him publicly.
Our work on OpenCitations at that stage, summarized in , would not have been possible without the enthusiastic support of Silvio’s senior colleague Fabio Vitali and of the Department of Computer Science and Engineering at the University of Bologna, which not only provided a stimulating environment for Silvio’s post-doctoral work, but also supplied computing services and infrastructure at no charge to OpenCitations. It was also greatly helped by Professor David De Roure of Oxford University, who gave me an academic home and a formal affiliation within the Oxford e-Research Centre after my retirement from the Department of Zoology, which enabled me to continue to hold research grants.
A significant breakthrough came in January 2018 with our decision to treat citations as first-class data entities, each with its own persistent identifier (PID), the Open Citations Identifier (OCI). This gave Silvio the freedom to envision a new kind of database, a citation index in which each citation had its own metadata, including citation timespan, citation categorization (e.g. self-citation), and of course the DOIs of the citing and cited publications. The creation of this new index was possible only with the incredible effort by Ivan Heibi, who served as a Research Fellow in the project funded by the Alfred P. Sloan Foundation at that time, and who was entirely responsible for developing the first version of the code necessary for creating such a database. Having harvested all the open references from Crossref metadata dumps, Silvio and Ivan created COCI, the OpenCitations Index of Crossref DOI-to-DOI Citations, which immediately became our principal source of open citations, the original OpenCitations Corpus being retained as a ‘sandbox’ in which to experiment with new data representations, for example those required for the Open Biomedical Citations in Context Corpus. Access to COCI was facilitated by Silvio’s development of a REST API, using his software tool RAMOSE (Restful API Manager Over SPARQL Endpoints), which enables the easily configurable deployment of a REST API over any SPARQL endpoint to an RDF triplestore. We were able to organize all our data, both ‘traditional’ and new, and to encode it in RDF, thanks to the comprehensive OpenCitations Data Model, itself based on our SPAR Ontologies, which we evolved as necessary to accommodate new data representation requirements.
During this period we published a number of definitions, conference papers and journal articles documenting these advances, details of which can be found here. Of these, the most recent canonical publication describing OpenCitations as an infrastructure for open scholarship, and its datasets, tools, services and activities, is Peroni and Shotton (2020). We also established the Research Centre for Open Scholarly Metadata at the University of Bologna, primarily to handle administrative, financial and academic aspects of OpenCitations activities.
The problem remained: how to sustain the OpenCitations infrastructure financially. We were greatly helped by Bilder, Lin and Neylon’s formulation of the Principles of Open Scholarly Infrastructures (POSI), in which they clearly pointed out that reliance solely on grant funding for specific projects was not the answer. OpenCitations’ compliance with POSI is described here. We were thus immensely grateful that SPARC Europe and other institutions had the wisdom to establish SCOSS (The Global Sustainability Coalition for Open Science Services) to facilitate the crowd-sourced financial support of useful open infrastructures by the scholarly community, including academic libraries, government agencies and other stakeholders. OpenCitations applied for SCOSS support in 2019, which led to the selection of OpenCitations for support in the SCOSS second round.
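To make the idea of a citation as a first-class data entity more concrete, here is a minimal Python sketch of a record that carries its own identifier and metadata. It is an illustration only, not the OpenCitations Data Model: the class and field names are hypothetical, the identifier and DOIs are placeholders rather than real values, and the timespan is reduced to a simple number of days.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Citation:
    """A citation treated as a data entity in its own right (illustrative only)."""

    oci: str                  # identifier of the citation itself (placeholder below)
    citing_doi: str           # DOI of the citing publication
    cited_doi: str            # DOI of the cited publication
    citing_date: date         # publication date of the citing work
    cited_date: date          # publication date of the cited work
    is_self_citation: bool = False  # example of a citation categorization

    @property
    def timespan_days(self) -> int:
        """Interval between publication of the cited and the citing work."""
        return (self.citing_date - self.cited_date).days


# Hypothetical record: neither the identifier nor the DOIs are real.
example = Citation(
    oci="oci:1-2",
    citing_doi="10.1234/example.citing",
    cited_doi="10.1234/example.cited",
    citing_date=date(2019, 1, 31),
    cited_date=date(2019, 1, 1),
)
print(example.oci, example.timespan_days)
```

Because the citation has its own identifier, metadata such as the timespan or a self-citation flag can be attached to the link itself rather than to either publication.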
The donations we are now starting to receive from such stakeholders, and the new staff that this funding has recently allowed us to hire, signal the start of our transition from a financially vulnerable academic project to a sustainable open scholarly infrastructure of real value to the community.
The work of opening more of the global citation graph now requires two things:
that each publisher takes responsibility for ensuring that the references from all of its journal articles and books are submitted, together with all other bibliographic metadata, to open scholarly bibliographic metadata aggregators such as Crossref and DataCite, from which they can be indexed into open citation indexes of sufficient quality, depth of detail and breadth of coverage that these offer genuine alternatives to the expensive proprietary citation indexing services upon which the academic community presently relies; and
that the entire scholarly stakeholder community re-directs a fraction of the enormous sums currently spent on its subscriptions to proprietary bibliographic services in order to support Open Science infrastructures such as OpenCitations that make citations and other forms of scholarly metadata and objects freely available.
 Peroni S, Lapeyre DA and Shotton D (2012) From Markup to Linked Data: Mapping NISO JATS v1.0 to RDF using the SPAR (Semantic Publishing and Referencing) Ontologies. Proc. 2012 JATS Conference, National Library of Medicine, Bethesda, Maryland, USA (October 2012): 16-17. http://www.ncbi.nlm.nih.gov/books/NBK100491/
 Silvio Peroni, David Shotton, Fabio Vitali (2017). One Year of the OpenCitations Corpus: Releasing RDF-based scholarly citation data into the Public Domain. In The Semantic Web – ISWC 2017 (Lecture Notes in Computer Science Vol. 10588, pp. 184–192). Springer, Cham. https://doi.org/10.1007/978-3-319-68204-4_19
In his call for open citations, Dario Taraborelli hailed the scholarly citation graph (in which the nodes (vertices) are individual academic publications and the links (edges) represent bibliographic citations from one publication to another) as one of humankind’s most important intellectual achievements.
We all understand that the inclusion within our own academic publications of bibliographic references to the works of others is one of the most explicit ways of acknowledging the thoughts, discoveries, achievements and influences of other scholars, and their contributions to our own work. Not only does what we gain from their publications enable us to make intellectual progress, by “standing on the shoulders of giants” as Newton once famously observed, but the influence of these publications extends forward in time across the entire intellectual landscape, like gigantic shadows cast at sunset, whether or not those influenced by these publications have occasion to reference them in their own works.
At the anterior margin of a crawling cell, cellular protrusive extension (for example of a pseudopodium) is achieved by the catalysed polymerization of new filaments of the cytoskeletal protein actin from attachment sites on an existing stationary actin filament network, pushing the cell margin forward. The scholarly citation network (or citation graph, the two terms here being used interchangeably) is similarly dynamic and temporally directional, being extended forward as new works of scholarship are published. Extension of knowledge is achieved by the catalytic inspiration provided by existing academic publications, themselves temporally stationary within the expanding citation network, leading to the publication of new works of scholarship that cite these previous publications and thus extend the citation network further into the future. The citation graph is thus not just an acyclic directed graph, but an acyclic temporally directed graph. Indeed, it is this temporal aspect of the citation network that is one of its most important features.
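The two properties named above can be checked mechanically on a small example. The following Python sketch uses a hypothetical set of four publications and their citation links (all identifiers and dates are invented for illustration) and tests both that every citing work is later than the work it cites and that the graph contains no cycles. If every edge respects the temporal ordering, acyclicity follows automatically; real-world data may of course contain exceptions that such a check would surface.

```python
from datetime import date

# Hypothetical publications: identifier -> publication date.
published = {
    "A": date(2010, 4, 24),
    "B": date(2015, 6, 1),
    "C": date(2018, 1, 15),
    "D": date(2021, 9, 1),
}

# Citation edges point from the citing work to the cited work.
citations = [("B", "A"), ("C", "A"), ("C", "B"), ("D", "C"), ("D", "B")]


def temporal_violations(edges, dates):
    """Return the edges whose citing work is not later than its cited work."""
    return [(citing, cited) for citing, cited in edges
            if dates[citing] <= dates[cited]]


def is_acyclic(edges):
    """Kahn's algorithm: repeatedly remove nodes that have no incoming edges."""
    nodes = {node for edge in edges for node in edge}
    indegree = {node: 0 for node in nodes}
    for _, cited in edges:
        indegree[cited] += 1
    queue = [node for node in nodes if indegree[node] == 0]
    removed = 0
    while queue:
        node = queue.pop()
        removed += 1
        for citing, cited in edges:
            if citing == node:
                indegree[cited] -= 1
                if indegree[cited] == 0:
                    queue.append(cited)
    # Every node can be removed only if the graph contains no cycle.
    return removed == len(nodes)


print(temporal_violations(citations, published))
print(is_acyclic(citations))
```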
To use another analogy, the human genealogical tree is inherently multidimensional and difficult to represent pictorially in its entirety, because each new birth brings together the family trees of the child’s two parents. However, unless the parents are seriously promiscuous, the resulting genealogical tree is not impossibly complex. In contrast, the scholarly citation network is much more highly interlinked, since each new publication cites not just two but many preceding (‘parent’) publications, which themselves may beget many other citations.
Visualization of the global scholarly citation graph, or portions of it, is thus inherently difficult, and the important temporal aspect of the graph is the one ignored by almost every method used for visualizing aspects of that graph. Existing methods may take the broad view, showing the links, and the strength of those links, between one scholarly domain and another, thus visualizing the ‘structure of science’. Alternatively, they may take a more detailed view of a small section of the graph, visualizing the proximity of individual publications to one another. Often a radial display is chosen for this, that shows in closest proximity those papers directly referenced by the selected publication in the centre, then at a greater radius those papers referenced by the cited papers shown in the inner circle, and so on. Because of the graph’s complexity, such displays quickly lose intelligibility after two citation links.
Among a small number of visualization applications that do not ignore the temporal aspect of the graph is Citeology, a temporally based citation network visualization tool developed some years ago by Justin Matejka and colleagues at the design software company Autodesk. Unfortunately, this innovative software prototype was not central to that company’s mission, development ceased, and the Citeology Java app is no longer available. However, in his last email to me, Justin Matejka kindly offered to help others re-create this application.
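The grouping that underlies such a radial display can be sketched as a breadth-first traversal of reference links from the central publication. The Python sketch below is an illustration under stated assumptions: the reference lists are invented, the function name is hypothetical, and it only computes which works belong in which ring, leaving the actual drawing aside.

```python
from collections import deque

# Hypothetical reference lists: each work maps to the works it cites.
references = {
    "P": ["Q", "R"],
    "Q": ["S"],
    "R": ["S", "T"],
    "S": ["U"],
    "T": [],
    "U": [],
}


def rings(graph, centre, max_depth=2):
    """Group works by their minimum number of citation links from `centre`."""
    depth = {centre: 0}
    queue = deque([centre])
    while queue:
        work = queue.popleft()
        if depth[work] == max_depth:
            continue  # do not expand beyond the outermost ring
        for cited in graph.get(work, []):
            if cited not in depth:
                depth[cited] = depth[work] + 1
                queue.append(cited)
    grouped = {}
    for work, d in depth.items():
        grouped.setdefault(d, []).append(work)
    return grouped


print(rings(references, "P"))
```

Even in this toy example the number of works per ring grows with each step, which hints at why displays of this kind become hard to read beyond two citation links.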
There is thus an urgent need for innovative new open-source visualization tools that will clearly and dynamically display portions of the global citation graph, for example the direct and indirect citation connections between any two publications or any two individuals, along the temporal axis of publication date. Developers within the open science community please step forward!
“Infrastructure at its best is invisible. We tend to only notice it when it fails. If successful, it is stable and sustainable. Above all, it is trusted and relied on by the broad community it serves. Trust must run strongly across each of the following areas: running the infrastructure (governance), funding it (sustainability), and preserving community ownership of it (insurance)”.
OpenCitations too espouses POSI and, in January 2021, we monitored the extent of our own compliance with POSI, the results of which are shown in the following diagram.
Coverage across the research enterprise
We gather citations from global scholarship
Advisory board currently lacks executive power and is not elected
Membership open to all those espousing open science
Everything is open
OpenCitations lobbies to achieve open scholarly citations and bibliographic metadata; it does not engage in political or financial lobbying
Since all our data are open, others can recreate our service
Formal incentives to fulfill mission & wind-down
No formal plan for wind-down has yet been drawn up
Time-limited funds used only for time-limited activities
Grant income should be used solely for grant projects
Goal to generate surplus
Goal not yet realized – income so far too limited
Goal to create contingency fund to support operations for 12 months
Goal not yet realized – income so far too limited
Mission-consistent revenue generation
Membership fees and solicited donations
Revenue based on services, not data
All data and services freely given to community, and thus do not generate income
All software under open source licenses
All data available under CC0 waiver
All data available via REST APIs, SPARQL endpoints, query interfaces and data dumps
We will not patent anything: OpenCitations’ infrastructure is free to replicate
We at OpenCitations are proud of the results reached in the Insurance area, but realise that we still have some way to go in the other areas. Although the general situation is already satisfactory, we are working to strengthen our weak points.
We want to express our gratitude to the 18 institutional members and customers of the Consortium of Swiss Academic Libraries which have now pledged 89,250 euros to support OpenCitations over the next three years. This generous donation is part of a total funding of 320,250 euros destined for the three services currently being promoted by SCOSS: DOAB and OAPEN, PKP, and OpenCitations.
The Consortium of Swiss Academic Libraries involves all cantonal universities, the ETH Domain, the Swiss National Library and other institutions from the fields of education and research as well as from the public sector, with the core task of licensing of e-resources (electronic journals, databases, eBooks) for its members and customers.
As can be read in this post, Susanne Aerni, Head of Consortial Services, commented on the pledge: “This pledge exemplifies the broad Swiss commitment to vital infrastructure for Open Access and Open Science. All Swiss Universities, all institutions of the ETH-domain, some Universities of Applied Science, CERN, and the Swiss National Science Foundation support these three vital services through the Consortium of Swiss Academic Libraries.”
Thank you, Switzerland, for your support to OpenCitations!
“The competitive benefits of closing access to citation data diminish with each new citation released to the public domain, but the benefits of open data remain. Going forward, citation data is almost completely public domain”.
With these words, from the article “A tipping point for open citations data” (July 15, 2021), Ian Hutchins celebrated the threshold crossing of one billion citations on public-domain databases in February 2021.
Now, a new significant milestone has been reached. We are excited to announce that COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, has just been extended with 334 million additional citations. Its most recent release, the COCI July 2021 release, now contains a total of 1.09 billion DOI-to-DOI citation links derived from open references within Crossref, which includes the references of articles deposited or opened in Crossref between November 2020 and January 2021.
These numbers make us proud, and confirm the essential value of the Initiative for Open Citations (I4OC). Since 2018, the mission of I4OC has been to persuade publishers to provide open citation data by means of the Crossref platform. I4OC’s untiring commitment has led the major academic publishers to a progressive change of heart regarding open citations, and the scholarly community to a deeper interest in this openness.
These factors contributed to the creation of COCI in 2018, the first open citation index created by OpenCitations, in which we applied the concept of citations as first-class data entities (Heibi I., Peroni S., Shotton D., 2019). Over the last three years, COCI has been extended in a series of releases, by harvesting citations mostly from Crossref data dumps, starting from an initial coverage of 300 million citations (First release).
A crucial event that preceded (and delayed!) this latest COCI release was Elsevier’s endorsement of DORA, the Declaration on Research Assessment, in December 2020, thereby making “reference lists for all articles published in Elsevier journals openly available via Crossref so they can be available for reuse. This means other important initiatives like I4OC can draw on this metadata”. As described in our previous post, Elsevier’s welcome commitment led to the opening of many previously closed references from its numerous academic journals submitted to Crossref. Now, after an extended period of data ingestion and processing, all these newly opened Elsevier references are available at OpenCitations within COCI.
Elsevier’s involvement has both a practical and a symbolic value. Even if publishing more than one billion citations is a thrilling achievement, and – as Hutchins wrote – we are now at a tipping point with regard to open citations data, this milestone is not the last stop. Together with the other organizations and projects that participate in the Initiative for Open Citations, we will keep stressing how urgent it is for the remaining academic publishers to join our cause, and sharing our values with the whole academic community to make all existing citations data freely open and accessible. Recalling what Dario Taraborelli wrote in the conclusion of his article “The citation graph is one of humankind’s most important intellectual achievements”, “the world is waiting for the citation graph to become a public good”.
Here’s some information about their backgrounds, positions and roles at OpenCitations.
Claudio Fabbri (Administrator and Research Manager):
Claudio has taken responsibility for all the day-to-day administrative activities of OpenCitations, including financial arrangements with those organisations supporting OpenCitations and bureaucratic interactions involving our host institution, the University of Bologna.
“After graduating in philosophy, I worked for several years as a librarian in university libraries. Considering my professional background, it came naturally to me to get in touch first with the Open Access movement and then, in a broader view, with Open Science principles. Now, as the Research Manager for OpenCitations, I’m proud to lead the administrative work for this great infrastructure. I’m interested in the history of science, especially how scientific communication has changed over the years”.
Chiara Di Giambattista (Communications Director and Community Development Manager):
Chiara takes care of all communications and social interactions made on behalf of OpenCitations, in terms of outreach to the scholarly community, contacts with present and potential supporting organisations, and OpenCitations’ communications via social media. She is developing a comprehensive mission statement and communications strategy for OpenCitations, designed to enhance the community’s understanding of OpenCitations’ global mission and awareness of its financial needs, to increase usage of OpenCitations services, and to foster collaborative interactions with like-minded infrastructures and institutions.
“I graduated in Visual Arts at the University of Bologna with the thesis “Presenting the future in the Digital Age. The semiotic convergence between art and IxD design in the practice of Future Casting”. Being deeply passionate about the adaptive potential of language in all its forms, from classical linguistics to visual semiotics, I have various experiences in web journalism and publishing. My role in OpenCitations is to improve the engagement of OpenCitations with the scholarly community, to develop OpenCitations’ social media platforms (LinkedIn and Twitter) and the OpenCitations blog, and to communicate and promote the values, services and impact of OpenCitations throughout the world”.
Giuseppe Grieco (Software and Systems Developer):
As a skilled computer scientist, Giuseppe will take charge of all the technical development related to the OpenCitations services, data acquisition, and the provision of bibliographic metadata and citations.
“I have a degree in computer science. I have been passionate about this field since I was thirteen years old, when I started to learn the basics of algorithms and programming. Most recently, I have been dedicating myself to artificial intelligence by pursuing Masters studies at the University of Pisa. I have always tried to take every opportunity to put into practice what I’ve been studying. Now, with my appointment at OpenCitations, I will do so by developing upgrades and enhancements to the existing software infrastructure”.
More new people will be getting involved in OpenCitations in the autumn. Watch this space!
The interconnection between Wikipedia and Wikidata is now larger than ever.
The Wikipedia Citations dataset currently includes around 30M citations from Wikipedia pages to a variety of sources, of which 4M are to scientific publications. Increasing the connections with external data services and providing structured data for one of the key elements of Wikipedia articles has two significant benefits: first, better discoverability of relevant encyclopedic articles related to scholarly studies; second, the establishment of Wikipedia as a social authority and policy hub, which would enable policymakers to assess the importance of an article, person, research group or institution by looking at how many Wikipedia articles cite them.
These are the motivations behind the “Wikipedia Citations in Wikidata” project, supported by a grant from the WikiCite Initiative. From January 2021 until the end of April, the team of Silvio Peroni (director of OpenCitations), Giovanni Colavizza, Marilena Daquino, Gabriele Pisciotta and Simone Persiani from the University of Bologna (Department of Classical Philology and Italian Studies) worked on developing a codebase to enrich Wikidata with citations to scholarly publications that are currently referenced in English Wikipedia. This codebase consists of four software modules in Python and integrates new components: a classifier to distinguish citations by cited source, and a look-up module to equip citations with identifiers from Crossref or other APIs. In so doing, Wikipedia Citations extends prior work, which focused only on citations already equipped with identifiers.
The first two steps of the workflow (extractor and converter) implement the mapping between the various ways citations are represented in Wikipedia articles and the OpenCitations Data Model (OCDM). A further component (enricher) is responsible for finding new identifiers for the entities in an OCDM-compliant dataset. The final step (pusher) implements the mapping between the OCDM and Wikidata. The code has been released on GitHub.
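The four-step workflow described above can be illustrated with a minimal Python sketch. It is not the project's actual code: every function name, the simplified record fields and the regular expression are assumptions made for illustration, the records are far simpler than the real OCDM, and the look-up is an injected function standing in for a call to Crossref or another API. The pusher here only builds statements and does not write to Wikidata.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# All names below are hypothetical; they do not mirror the released modules.

# Matches simple, non-nested templates such as {{cite journal |title=... |doi=...}}
CITE_TEMPLATE = re.compile(r"\{\{\s*cite\s+(\w+)\s*\|([^{}]*)\}\}", re.IGNORECASE)


def extract(wikitext: str) -> List[Dict[str, str]]:
    """Extractor: pull citation templates out of an article's wikitext."""
    citations = []
    for match in CITE_TEMPLATE.finditer(wikitext):
        fields = {"type": match.group(1).lower()}
        for part in match.group(2).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                fields[key.strip().lower()] = value.strip()
        citations.append(fields)
    return citations


def convert(page_title: str, raw: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Converter: map template fields to a simplified, OCDM-inspired record."""
    return {
        "citing": page_title,
        "cited_title": raw.get("title"),
        "cited_type": raw["type"],
        "doi": raw.get("doi"),
    }


def enrich(
    record: Dict[str, Optional[str]],
    lookup: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Enricher: try to find an identifier for records that lack one."""
    if record["doi"] is None and record["cited_title"]:
        return {**record, "doi": lookup(record["cited_title"])}
    return record


def push(record: Dict[str, Optional[str]]) -> List[Tuple[str, str, str]]:
    """Pusher (dry run): build a 'cites' statement for identified records."""
    if record["doi"] is None:
        return []  # nothing to upload without an identifier
    return [(record["citing"], "cites", record["doi"])]


def run_pipeline(
    page_title: str,
    wikitext: str,
    lookup: Callable[[str], Optional[str]],
) -> List[Tuple[str, str, str]]:
    """Chain the four steps for a single article."""
    statements = []
    for raw in extract(wikitext):
        record = enrich(convert(page_title, raw), lookup)
        statements.extend(push(record))
    return statements
```

Passing the look-up as a parameter keeps the sketch runnable offline; in practice that function would query an external service and would need to handle ambiguous or missing matches.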
The extensive documentation that accompanies the release of the codebase is crucial for one of the principal aims of the project, i.e., the adoption and reuse of the codebase by the community in other relevant Wikimedia projects. The engagement of various communities (Wikidata, libraries, scholars…) is fostered on the one hand by increasing the amount of citation data included in Wikidata, and on the other by blogging and sharing updates on Twitter and public mailing lists.
This project, whose ambitious purpose is to make Wikipedia content more discoverable and to enrich Wikidata with a ready-to-use corpus for further analysis or for developing new services, is open to future developments. The intention is to use the software to create a dataset of English Wikipedia citations in order to understand, in particular, how many new entities (i.e., citing Wikipedia pages, cited articles and venues, authors) should be added to Wikidata to upload the whole set of extracted citations, with the result of adding a massive number of new bibliographic entities to the dataset.
The first steps have been taken. Now we aim to extend the engagement of the community involved, especially those scholars who leverage Wikidata in existing services, and to interact with the scholars, libraries and institutions interested in a new approach to research, focused on people (from individuals to research groups) and their intellectual relevance.