Crowdsourcing open citations with CROCI

An analysis of the current status of open citations, and a proposal

Author(s)
Ivan Heibi (ivan.heibi2@unibo.it)
Digital Humanities Advanced Research Centre (DHARC), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy
Silvio Peroni (silvio.peroni@unibo.it)
Digital Humanities Advanced Research Centre (DHARC), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy
David Shotton (david.shotton@oerc.ox.ac.uk)
Oxford e-Research Centre, University of Oxford, Oxford, United Kingdom

Keywords: Open citations; COCI; CROCI; Crossref; I4OC; OpenCitations

Copyright notice: This work is licensed under a Creative Commons Attribution 4.0 International License. You are free to share (i.e. copy and redistribute the material in any medium or format) and adapt (e.g. remix, transform, and build upon the material) for any purpose, even commercially, under the following terms: attribution, i.e. you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. The licensor cannot revoke these freedoms as long as you follow the license terms.

Notes: submitted as a research-in-progress paper to the 17th International Conference on Scientometrics and Informetrics (ISSI 2019, https://www.issi2019.org). A PDF version is available on arXiv.

Abstract

In this paper, we analyse the current availability of open citations data in one particular dataset, namely COCI (the OpenCitations Index of Crossref open DOI-to-DOI citations; http://opencitations.net/index/coci) provided by OpenCitations. The results of these analyses show a persistent gap in the coverage of the currently available open citation data. In order to address this specific issue, we propose a strategy whereby the community (e.g. scholars and publishers) can directly involve themselves in crowdsourcing open citations, by uploading their citation data via the OpenCitations infrastructure into our new index, CROCI, the Crowdsourced Open Citations Index.

Introduction

The availability of open scholarly citations is a public good, which is of intrinsic value to the academic world as a whole (Shotton, 2013; Peroni et al. 2015; Shotton, 2018), and is particularly crucial for the scientometrics and informetrics community, since it supports reproducibility (Sugimoto et al., 2017) and enables fairness in research by removing such citation data from behind commercial paywalls (Schiermeier, 2017). Despite the positive early outcome of the Initiative for Open Citations (I4OC, https://i4oc.org), namely that almost all major scholarly publishers now release their publication reference lists, with the result that more than 500 million citations are now open via the Crossref API (https://api.crossref.org), and despite the related ongoing efforts of sister infrastructures and initiatives such as OpenCitations (http://opencitations.net) and WikiCite/Wikidata (https://www.wikidata.org), many scholarly citations are not freely available. While these initiatives have the potential to disrupt the traditional landscape of citation availability, which for the past half-century has been dominated by commercial interests, the present incomplete coverage of open citation data is one of the most significant impediments to open scholarship (van Eck et al., 2018).

In this work, we analyse the current availability of open citations data (Peroni and Shotton, 2018) within one particular dataset, namely COCI (the OpenCitations Index of Crossref open DOI-to-DOI citations; http://opencitations.net/index/coci). This dataset is provided by OpenCitations, a scholarly infrastructure organization dedicated to open scholarship and the publication of open bibliographic and citation data by the use of Semantic Web (Linked Data) technologies. Launched in July 2018, COCI is the first of the Indexes proposed by OpenCitations (http://opencitations.net/index) in which citations are exposed as first-class data entities with accompanying properties. It has already seen widespread usage (over nine hundred thousand API calls since launch, with half of these in January 2019), and has been adopted by external services such as VOSviewer (van Eck and Waltman, 2010).

In particular, in this paper we address the following research questions (RQs):

  1. What is the ratio of open to closed citations within each category of scholarly entities included in COCI (i.e. journals, books, proceedings, datasets, and others)?
  2. Which are the top twenty publishers in terms of the number of open citations received by their own publications, according to the citation data available in COCI?
  3. To what degree are the publishers highlighted in the previous analysis themselves contributing to the open citations movement, according to the data available in Crossref?

The results of these analyses show a persistent gap in the coverage of the currently available open citation data. To address this specific issue, we have developed a novel strategy whereby members of the community of scholars, authors, editors and publishers can directly involve themselves in crowdsourcing open citations, by uploading their citation data via the OpenCitations infrastructure into our new index, CROCI, the Crowdsourced Open Citations Index.

Methods and material

To answer the RQs mentioned above, we used open data and technologies coming from various parties. Specifically, the open CC0 citation data we used came from the CSV dump of the most recent release of COCI, dated 12 November 2018 (OpenCitations, 2018), which contains 449,840,503 DOI-to-DOI citation links between 46,534,705 distinct bibliographic entities. The Crossref dump we used for the production of this most recent version of COCI was dated 3 October 2018, and included all the Crossref citation data available at that time in both the ‘open’ dataset (accessible by all) and the ‘limited’ dataset (accessible only to users of the Crossref Cited-by service and to Metadata Plus members of Crossref, of which OpenCitations is one – for details, see https://www.crossref.org/reference-distribution/).

We additionally extracted information about the number of closed citations to each of the 99,444,883 DOI-identified entities available in the October Crossref dump. This number was calculated by subtracting the number of open citations to each entity available within COCI from the value “is-referenced-by-count” available in the Crossref metadata for that particular cited entity, which reports all the DOI-to-DOI citation links that point to the cited entity from within the whole Crossref database (including those present in the Crossref ‘closed’ dataset).
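The subtraction described above can be sketched as a small function. This is a minimal illustration: `is-referenced-by-count` is the real Crossref metadata field, while the function and parameter names below are ours.

```python
def closed_citations(is_referenced_by_count: int, open_in_coci: int) -> int:
    """Number of closed citations to a cited entity.

    `is_referenced_by_count` is the Crossref metadata value counting all
    DOI-to-DOI citation links pointing at the entity from within the whole
    Crossref database; `open_in_coci` is the number of open citations to
    the same entity recorded in COCI. The difference is floored at zero to
    guard against counts retrieved on different dates.
    """
    return max(is_referenced_by_count - open_in_coci, 0)

# e.g. an entity cited 120 times in Crossref overall, 85 of them open in COCI,
# has 35 closed citations
```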

Furthermore, we extracted the particular publication type of each entity, so as to identify it as a journal article, a book chapter, etc. We determined these publication types for all the DOI-identified entities available in the Crossref dump we used. We then identified the publisher of each entity, by querying the Crossref API using the entity’s DOI prefix. This allowed us to group the numbers of open and closed citations to the entities published by each particular publisher, and to determine the top twenty publishers in terms of the number of open citations that their own publications had received.
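The publisher lookup works because a DOI prefix (the part before the first slash) identifies the registrant. A minimal sketch of deriving that grouping key; the `/prefixes/{prefix}` route mentioned in the docstring is part of the public Crossref REST API, while the function name is ours:

```python
def doi_prefix(doi: str) -> str:
    """Return the registrant prefix of a DOI, e.g. '10.1016' for an
    Elsevier-registered DOI.

    The prefix is everything before the first slash; querying the Crossref
    REST API at https://api.crossref.org/prefixes/<prefix> then yields the
    name of the publisher that owns the prefix.
    """
    return doi.split("/", 1)[0]
```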

Finally, we again queried the Crossref API, this time using the DOI prefixes of the citing entities, to determine the extent to which these top twenty publishers were themselves publishing open citations in support of the open citation movement sponsored by I4OC. Details of all these analyses are available online under a CC0 waiver (Heibi et al., 2019).

Results

First (RQ1), we determined the numbers of open citations and closed citations received by the entities in the Crossref dump. All the entity types retrieved from Crossref were aligned to one of the following five categories: journal, book, proceedings, dataset, and other, as illustrated in (Heibi et al., 2019). The outcomes are summarised in Figure 1, where it is evident that, for each of the publication categories considered, the number of open citations available in COCI is always greater than the number of closed citations (to which COCI does not have access), with the categories proceedings and dataset having the largest ratios.
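The alignment of Crossref entity types to the five categories can be pictured as a simple lookup table. The mapping below is an illustrative subset only; the full alignment used in the analysis is the one documented in (Heibi et al., 2019).

```python
# Illustrative subset of the Crossref 'type' -> category alignment
CATEGORY_MAP = {
    "journal-article": "journal",
    "journal-issue": "journal",
    "book": "book",
    "book-chapter": "book",
    "monograph": "book",
    "proceedings": "proceedings",
    "proceedings-article": "proceedings",
    "dataset": "dataset",
}

def category_for(crossref_type: str) -> str:
    """Map a Crossref 'type' value to one of the five analysis categories;
    anything not explicitly aligned falls into 'other'."""
    return CATEGORY_MAP.get(crossref_type, "other")
```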

Figure 1. The number of open citations (available in COCI) vs. closed citations (according to Crossref data) of the cited entities within COCI, analyzed and grouped according to five distinct categories. [Note that the vertical axis has a logarithmic scale].

Analysis of the Crossref data shows that there are in total ~4.1 million DOIs that have received no open citations and at least one closed citation. Conversely, there are ~10.7 million DOIs that have received no closed citations and at least one open citation in COCI. Most of the papers in both these categories have received very few citations.

The outcome of the second analysis (RQ2) shows which publishers are receiving the most open citations. To this end, we considered all the open citations recorded in COCI, and compared them with the number of closed citations to these same entities recorded in Crossref. Figure 2 shows the top twenty publishers that received the greatest number of open citations. Elsevier ranks first, but it also records the highest number of closed citations received (~97M vs. ~105.5M). The highest ratio of open to closed citations was recorded by IEEE publications (6.25 to 1), while the lowest was for the American Chemical Society (0.73 to 1).

Figure 2. The top twenty publishers sorted in decreasing order according to the number of open citations the entities they published have received, according to the open citation data within COCI. We accompany this count with the number of closed citations to the entities published by each of them according to the values available in Crossref, the total numbers of citations to these publishers’ entities, and the percentages of these totals that are open or closed.

Considering the twenty publishers listed in Figure 2, we wanted additionally to know their current support for the open citation movement (RQ3). The results of this analysis (made by querying the Crossref API on 24 January 2019) are shown in Figure 3. Among the top ten publishers shown in Figure 2, i.e. those who themselves received the largest numbers of open citations, only five, namely Springer Nature, Wiley, the American Physical Society, Informa UK Limited, and Oxford University Press, are participating actively in the open publication of their own citations through Crossref.

Figure 3. The contributions to open citations made by the twenty publishers listed in Figure 2, as of 24 January 2019, according to the data available through the Crossref API. The counts listed in this table refer to the number of publications for which each publisher has submitted metadata to Crossref that include the publication’s reference list. The categories closed, limited and open refer to publications for which the reference lists are not visible to anyone outside the Crossref Cited-by membership, are visible only to them and to Crossref Metadata Plus members, or are visible to all, respectively. Additional information on this classification of Crossref reference lists is available at https://www.crossref.org/reference-distribution/. The final column in the table shows the total number of publications for which the publisher has submitted metadata to Crossref, whether or not those metadata include the reference lists of those publications.

It is noteworthy that JSTOR contributes very few references to Crossref, while the many citations directed towards its own holdings place JSTOR twelfth in the list of publishers receiving open citations (Figure 2). However, as the last column of Figure 3 shows, all the major publishers listed here are failing to submit reference lists to Crossref for a large number of the publications for which they submit metadata, that number being the difference between the value in the last column for that publisher and the combined values in the preceding three columns. JSTOR is the worst in this regard, submitting references with only 0.53% of its deposits to Crossref, while the American Physical Society is the best, submitting references with 96.54% of its publications recorded in Crossref.

Additional information about these analyses, including the code and the data we have used to compute all the figures, is available as a Jupyter notebook at https://github.com/sosgang/pushing-open-citations-issi2019/blob/master/script/croci_nb.ipynb.

It should be stressed that a very large number of potentially open citations are totally missing in the Crossref database, and consequently from COCI, for the simple reason that many publishers, particularly smaller ones with limited technical and financial resources, but also all the large ones shown in Figure 3 and most of the others, are simply not depositing with Crossref the reference lists for any or all of their publications.

Discussion

According to the data retrieved, the open DOI-to-DOI citations available in COCI exceed the number of closed DOI-to-DOI citations recorded in Crossref for every publication category, as shown in Figure 1. The journal category is the one receiving the most open citations overall, as expected considering the historical and present importance of journals in most areas of the scholarly ecosystem. However, the number of closed citations to journal articles within Crossref is also of great significance, since these 322 million closed citations represent 43% of the total.

It is important to note that about one third of these closed citations to journal articles (according to Figure 2) are references to entities published by Elsevier, and that references from within Elsevier’s own publications constitute the largest proportion of these closed citations, since Elsevier is the largest publisher of journal articles. Thus Elsevier’s present refusal to open its article references is contributing significantly to the invisibility of Elsevier’s own publications within the corpus of open citation data that is being increasingly used by the scholarly community for discovery, citation network visualization and bibliometric analysis.

It is also worth mentioning the discrepancy between the citations available in COCI, which come from the data contained in the open and limited Crossref datasets as of 3 October 2018, and those available within those same Crossref datasets as of 24 January 2019. The most significant difference relates to IEEE. While the citations present in COCI include those from IEEE publications to other entities prior to November 2018 (since in October 2018 its article metadata with references were present within the Crossref limited dataset), in November 2018 this scholarly society decided to close the main part of its Crossref references, and thus from that moment they became unavailable even to Crossref Metadata Plus members such as OpenCitations, as highlighted in Figure 3. Thus IEEE citations from articles whose metadata were submitted to Crossref after the date of this switch to closed will no longer be automatically ingested into COCI.

To date, the majority of the citations present in Crossref that are not available in COCI come from just three publishers: Elsevier, the American Chemical Society and University of Chicago Press (Figure 3). In fact, considering the average value of 18.6 DOI-to-DOI citation links for each citing entity – calculated by dividing the total number of citations in COCI by the number of citing entities in the same dataset – these three publishers are holding more than 214 million DOI-to-DOI citations that could potentially be opened. (The IEEE citation data that were in the Crossref ‘limited’ category as of October 2018 are actually included in COCI, although those from that organization’s more recent publications will no longer be, as mentioned above.)

We think it is deeply regrettable and almost incomprehensible that any professional organization, learned society or university press, whose primary mission is to serve the interests of the practitioners, scholars and readers it represents, should choose not to open all its publications’ reference lists as a public good, whatever secondary added-value services it chooses to build on top of the citations that those reference lists contain.

CROCI, the Crowdsourced Open Citations Index

The results of the Initiative for Open Citations (I4OC) have been remarkable, since its efforts have led to the liberation of millions of citations in a relatively short time. However, many more citations, the lifeblood of scholarly communication, are still not available to the general public, as mentioned in the previous section. Some researchers and journal editors, in particular, have recently started to interact with publishers that are not participating in I4OC, in attempts to convince them to release their citation data. Remarkable examples of these activities are the petition promoted by Egon Willighagen (https://tinyurl.com/acs-petition) addressed to the American Chemical Society, and the several unsuccessful requests made to Elsevier by the Editorial Board of the Journal of Informetrics, which eventually resulted in the resignation of the entire Editorial Board on 10 January 2019 in response to Elsevier’s refusal to address their concerns (http://www.issi-society.org/media/1380/resignation_final.pdf).

To provide a pragmatic alternative that would permit the harvesting of currently closed citations, so that they could then be made available to the public, we at OpenCitations have created a new OpenCitations Index: CROCI, the Crowdsourced Open Citations Index, into which individuals identified by ORCID identifiers may deposit citation information that they have a legal right to submit, and within which these submitted citation data will be published under a CC0 public domain waiver to emphasize and ensure their openness for every kind of reuse without limitation. Since citations are statements of fact about relationships between publications (resembling statements of fact about marriages between individual persons), they are not subject to copyright, although their specific textual arrangements within the reference lists of particular publications may be. Thus the citations from which the reference list of an author’s publication has been composed may legally be submitted to CROCI, although the formatted reference list cannot be. Similarly, citations extracted from within an individual’s electronic reference management system and presented in the requested format may be legally submitted to CROCI, irrespective of the original sources of these citations.

To populate CROCI, we ask researchers, authors, editors and publishers to provide us with their citation data organised in a simple four-column CSV file (“citing_id”, “citing_publication_date”, “cited_id”, “cited_publication_date”), where each row describes a citation from the citing entity (“citing_id”, giving the DOI of the citing entity) published on a certain date (“citing_publication_date”, with the date value expressed in ISO format “yyyy-mm-dd”), to the cited entity (“cited_id”, giving the DOI of the cited entity) published on a certain date (“cited_publication_date”, again with the date value expressed in ISO format “yyyy-mm-dd”). The submitted dataset may contain an individual citation, groups of citations (for example those derived from the reference lists of one or more publications), or entire citation collections. Should any of the submitted citations be already present within CROCI, these duplicates will be automatically detected and ignored.
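A conforming submission file can be generated with a few lines of code. This is a minimal sketch: the column names are exactly those requested above, but the DOIs and dates in the example row are placeholders, not a real citation.

```python
import csv

# The four required column names, in order
HEADER = ["citing_id", "citing_publication_date", "cited_id", "cited_publication_date"]

# Hypothetical DOIs and dates, used only to show the expected row shape;
# the cited entity's date may minimally be just the publication year
rows = [
    ("10.1234/citing.article", "2018-03-12", "10.5678/cited.article", "2016"),
]

with open("croci_submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(HEADER)   # header row with the four column names
    writer.writerows(rows)    # one row per DOI-to-DOI citation
```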

The date information given for each citation should be as complete as possible, and minimally should be the publication years of the citing and cited entities. However, if such date information is unavailable, we will try to retrieve it automatically using OpenCitations technology already available. DOIs may be expressed in any of a variety of valid alternative formats, e.g. “https://doi.org/10.1038/502295a”, “http://dx.doi.org/10.1038/502295a”, “doi: 10.1038/502295a”, “doi:10.1038/502295a”, or simply “10.1038/502295a”.
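All of the alternative DOI forms listed above can be reduced to the bare DOI with a single regular expression. This is a sketch; the function name is ours and is not part of any OpenCitations tooling.

```python
import re

def normalise_doi(value: str) -> str:
    """Extract the bare DOI from any of its common written forms.

    Handles resolver URLs (https://doi.org/..., http://dx.doi.org/...),
    'doi:' labels with or without a following space, and already-bare DOIs.
    A DOI starts with '10.' followed by a 4-9 digit registrant code,
    a slash, and the suffix.
    """
    match = re.search(r"10\.\d{4,9}/\S+", value)
    if match is None:
        raise ValueError(f"no DOI found in {value!r}")
    return match.group(0)
```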

An example of such a CSV citations file can be found at https://github.com/opencitations/croci/blob/master/example.csv. As an alternative to submission in CSV format, contributors can submit the same citation data using the Scholix format (Burton et al., 2017) – an example of this format can be found at https://github.com/opencitations/croci/blob/master/example.scholix.

Submission of such a citation dataset in CSV or Scholix format should be made as a file upload either to Figshare (https://figshare.com) or to Zenodo (https://zenodo.org). For provenance purposes, the ORCID personal identifier of the submitter of these citation data should be explicitly provided in the metadata or in the description of the Figshare/Zenodo object. Once such a citation data file upload has been made, the submitter should inform OpenCitations of this fact by adding a new issue to the GitHub issue tracker of the CROCI repository (https://github.com/opencitations/croci/issues).

OpenCitations will then process each submitted citation dataset and ingest the new citation information into CROCI. CROCI citations will be available at http://opencitations.net/index/croci using an appropriate REST API and SPARQL endpoint, and will additionally be published as periodic data dumps in Figshare, all releases being under CC0 waivers. We propose in future to enable combined searches over all the OpenCitations indexes, including COCI and CROCI.

We are confident that the community will respond positively to this proposal of a simple method by which the number of open citations available to the academic community can be increased, in particular since the data files to be uploaded have a very simple structure and thus should be easy to prepare. In particular, we hope for submissions of citations from within the reference lists of authors’ green OA versions of papers published by Elsevier, IEEE, ACS and UCP, and from publishers not already submitting publication metadata to Crossref, so as to address existing gaps in open citations availability. We look forward to your active engagement in this initiative to further increase the availability of open scholarly citations.

Acknowledgements

The authors would like to thank the SoS Gang (https://sosgang.github.io) for their support, and for having made available a space (https://github.com/sosgang/pushing-open-citations-issi2019) within which to share openly all the scripts and data developed for this study.

Postscript

On 5th February 2019, after these analyses had been concluded and the text of this paper had been finalized, Crossref published an announcement that DOIs were missing from approximately 11% of its references (https://www.crossref.org/blog/underreporting-of-matched-references-in-crossref-metadata/), because of an historical fault in the manner in which Crossref automatically processes references upon deposit by publishers. This fault means that the absolute numbers of the open and closed DOI-to-DOI citations given in this paper are significantly lower than they should be. While this does not invalidate the comparisons we have reported here, it is clearly regrettable. In April 2019, after Crossref has completed the re-processing of its reference data to include the missing DOIs, we will create an updated version of COCI, and will then recompute and republish the data presented here to include those citations for which Crossref is presently failing to assign DOIs correctly to the cited entities.

References

Burton, A., Fenner, M., Haak, W. & Manghi, P. (2017). Scholix Metadata Schema for Exchange of Scholarly Communication Links (Version v3). Zenodo. DOI: https://doi.org/10.5281/zenodo.1120265

Heibi, I., Peroni, S. & Shotton, D. (2019). Types, open citations, closed citations, publishers, and participation reports of Crossref entities. Version 1. Zenodo. DOI: https://doi.org/10.5281/zenodo.2558257

OpenCitations (2018). COCI CSV dataset of all the citation data. Version 3. Figshare. DOI: https://doi.org/10.6084/m9.figshare.6741422.v3

Peroni, S., Dutton, A., Gray, T. & Shotton, D. (2015). Setting our bibliographic references free: towards open citation data. Journal of Documentation, 71: 253-277. DOI: https://doi.org/10.1108/JD-12-2013-0166

Peroni, S. & Shotton, D. (2018). Open Citation: Definition. Version 1. Figshare. DOI: https://doi.org/10.6084/m9.figshare.6683855

Schiermeier, Q. (2017). Initiative aims to break science’s citation paywall. Nature News. DOI: https://doi.org/10.1038/nature.2017.21800

Shotton, D. (2013). Open citations. Nature, 502: 295-297. DOI: https://doi.org/10.1038/502295a

Shotton, D. (2018). Funders should mandate open citations. Nature, 553: 129. DOI: https://doi.org/10.1038/d41586-018-00104-7

Sugimoto, C. R., Waltman, L., Larivière, V., van Eck, N. J., Boyack, K. W., Wouters, P. & de Rijcke, S. (2017). Open citations: A letter from the scientometric community to scholarly publishers. ISSI. http://www.issi-society.org/open-citations-letter/ (last visited 26 January 2018)

van Eck, N.J. & Waltman, L. (2010). Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84(2): 523-538. DOI: https://doi.org/10.1007/s11192-009-0146-3

van Eck, N.J., Waltman, L., Larivière, V. & Sugimoto, C. R. (2018). Crossref as a new source of citation data: A comparison with Web of Science and Scopus. CWTS Blog. https://www.cwts.nl/blog?article=n-r2s234 (last visited 26 January 2018)


The OpenCitations Enhancement Project – final report

The OpenCitations Enhancement Project
Final report for the Alfred P. Sloan Foundation

Report period: 1st May 2017 – 30 November 2018.
Report written: 30th December 2018

Background

OpenCitations (http://opencitations.net) is a scholarly infrastructure organization dedicated to open scholarship and the publication of open bibliographic and citation data by the use of Semantic Web (Linked Data) technologies, and engaged in advocacy for semantic publishing and open citations [Peroni and Shotton, 2018b]. It provides the OpenCitations Data Model [Peroni and Shotton, 2018d], the SPAR (Semantic Publishing and Referencing) Ontologies [Peroni and Shotton, 2018e] for encoding scholarly bibliographic and citation data in RDF, and open software of generic applicability for searching, browsing and providing APIs over RDF triplestores. It has developed the OpenCitations Corpus (OCC) [Peroni et al., 2017] of open downloadable bibliographic and citation data recorded in RDF, and a system and resolution service for Open Citation Identifiers (OCIs) [Peroni and Shotton, 2018c], and it is currently developing a number of Open Citation Indexes using the data openly available in third-party bibliographic databases.

The Directors of OpenCitations are David Shotton, Oxford e-Research Centre, University of Oxford (david.shotton@opencitations.net), and Silvio Peroni, Digital Humanities Advanced Research Centre, Department of Classical Philology and Italian Studies, University of Bologna (silvio.peroni@opencitations.net). We are committed to open scholarship, open data, open access publication, and open source software. We espouse the FAIR data principles developed by Force11, of which David Shotton was a founding member (https://www.force11.org/group/fairgroup/fairprinciples), and the aim of the Initiative for OpenCitations (I4OC, https://i4oc.org), of which both David Shotton and Silvio Peroni were founding members, to promote the availability of citation data that is structured, separable, and open.

Project personnel and roles

Ivan Heibi – Research Fellow

Ivan Heibi was appointed to the 12-month Research Fellowship position funded by the Sloan Foundation.

Ivan has been responsible for the development of new visualization and programming interfaces for exploring and making sense of the citation data included in the OCC and in the new OpenCitations Indexes, for the main part of the scripts related to the population and regular maintenance of COCI (The OpenCitations Index of Crossref open DOI-to-DOI citations, the first of the OpenCitations Indexes), and for conference presentations and paper writing.

Silvio Peroni – Lead Applicant and Principal Investigator

Silvio has been responsible for project management, for interview, appointment and supervision of the work of Ivan Heibi, for all aspects of software coding and technical developments required for the OpenCitations Corpus (OCC), the OpenCitations Indexes, and the Open Citation Identifier Resolution Service, for the ordering and management of new Sloan-funded hardware, and for conference presentations, paper writing and other forms of outreach and dissemination (e.g. blog and social networks).

David Shotton – Consultant Co-Investigator

David has been responsible for project management, interaction with publishers, conference presentations, paper writing, other forms of outreach and dissemination (e.g. blog and social networks), for web site and data model revision, and for independent usability evaluation, stress-testing and design feedback of new user interfaces and applications.

Project management

Project management has been straightforward, as should be the case, given the small size of our team. It has involved more than 1500 e-mail exchanges between David Shotton and Silvio Peroni, about 500 e-mail exchanges between Silvio Peroni and Ivan Heibi since November 2017, two dozen or so video conferences, some of which have involved collaborators, and extended face-to-face meetings during the WikiCite 2017 Conference in Vienna at the start of the project and during the Workshop on Open Citations 2018 in Bologna.

We have together harmoniously developed the concept of and vision for OpenCitations as an infrastructure organization, the structure and content of the OpenCitations web site (http://opencitations.net), the classes and properties of our supporting SPAR ontologies (http://www.sparontologies.net), and the community of collaborators and users of our developments. This has involved outreach and dissemination at a number of international research conferences, involvement with publishers through the Initiative for Open Citations, and associated publications.

We have studied and preliminarily tested a new scalable architecture centred on one powerful independent physical server, that both stores and handles all the data in the Corpus and in the new OpenCitations Indexes, and also offers adequate performance for query services. This server is supplemented by 30 additional small physical machines, Raspberry Pi 3Bs, working in parallel, each in charge of ingesting a defined set of reference lists and feeding the ingested data to the central server for further processing and storage as RDF in our Blazegraph triplestore.

Current status of the OpenCitations services

Currently, we release two different datasets – the OpenCitations Corpus and COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations – and several interfaces so as to make these data queryable from different access points.

Functionality and holdings

As of 29th December 2018, the OpenCitations Corpus (OCC) contains information about 13,964,148 citation links to 7,565,367 cited resources, ingested from 326,743 citing bibliographic resources obtained from the Open Access corpus of Europe PubMed Central and from the citation data imported from the EXCITE project. The main part of the development effort in the past months has been spent in implementing ingestion strategies that allow partners to provide us with citation data, stored according to the OpenCitations Data Model [Peroni and Shotton, 2018d], so that these can be added directly into the Corpus. In September 2018, we successfully completed the ingestion of the initial data coming from the EXCITE project (citations from social sciences scholarly papers published by German publishers), and we are actively interacting with the LOC-DB project and the Venice Scholar Index so as to add their data to the OCC as well. In ongoing work, we are also collaborating with arXiv and EXCITE to harvest all the references from all PDF documents in the arXiv ePrints collection, to record these in RDF according to the OpenCitations Data Model, and to ingest them into the OpenCitations Corpus, as well as creating a new Index of these citations.

The first of these OpenCitations Indexes is COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations, an RDF dataset containing details of all the citations that are specified by the open references to DOI-identified works present in Crossref, as of the latest COCI update. COCI does not index Crossref references that are closed, nor Crossref references to entities that lack DOIs. These citations are treated as first-class data entities, with accompanying properties including the citation timespan and self-citation characteristics, modelled according to the index data model described in the OpenCitations Indexes page. COCI was launched in July 2018, and the most recent update of COCI is dated 12 November 2018. It presently contains 449,840,503 citations between 46,534,705 bibliographic resources. COCI is the first citation index released by OpenCitations, being a bibliographic index recording citations between publications that permits the user to establish which later documents cite earlier documents, and to create citation graphs of these citations.
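The citation timespan property just mentioned records the interval between the publication dates of the cited and citing works as an XSD duration string. The following Python sketch illustrates how such a value can be computed; the function name is ours, and month lengths are approximated, so this should be read as an illustration of the idea rather than the implementation actually used for COCI:

```python
from datetime import date

def citation_timespan(citing: date, cited: date) -> str:
    """Express the interval between the cited and the citing publication
    dates as a simplified XSD duration string, e.g. 'P1Y2M20D'."""
    years = citing.year - cited.year
    months = citing.month - cited.month
    days = citing.day - cited.day
    if days < 0:            # borrow a month
        months -= 1
        days += 30          # approximation; real code would use calendar month lengths
    if months < 0:          # borrow a year
        years -= 1
        months += 12
    return f"P{years}Y{months}M{days}D"

# A citation made one year and two months after the cited work appeared:
print(citation_timespan(date(2018, 3, 1), date(2017, 1, 1)))  # P1Y2M0D
```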

While full coverage of the scholarly citation graph represented by the aforementioned datasets (or as full as practically possible) is required for the calculation of certain bibliometric indicators such as journal impact factors and individual h-indexes (Hirsch numbers), partial coverage while the OCC grows is still of value, since it includes citations of all the most important biomedical papers obtained from the Open Access Subset of PubMed Central. These can be easily recognized by their large number of inward citation links, and can be used to explore the development of disciplines and research trends. In addition, COCI, with its wider scope, has sufficient coverage to be used for large-scale bibliometric analyses.

Purchase of new hardware, and testing and development of new software

Because of unavoidable academic teaching commitments for Silvio Peroni, the installation of the new Sloan-funded hardware for the OCC, purchased in autumn 2017, had to be postponed. All the services (old and new) of OpenCitations were successfully transferred to our new server in October 2018. To implement this transition, the ingestion process of the OCC was halted, so as to allow Silvio Peroni to properly test the new hardware and to extend the existing ingestion software to make it usable within the new parallel processing architecture. The final tests are currently running, and we will recommence the full ingestion process of the OCC using the new hardware configuration, with its greatly enhanced ingest rate, in January 2019. In the meantime, the new infrastructure has been used to create COCI, so far the largest RDF dataset of open citation data available worldwide.

In addition, we have completed the transition of all the OpenCitations software from the old GitHub repository (i.e. https://github.com/essepuntato/opencitations) to a new GitHub organization, namely https://github.com/opencitations. This organisation includes several repositories which permit third parties to initiate the whole suite of OpenCitations software on a local machine. This is of key importance for the resilience of this open source project.

User interfaces

SPARQL, the query language used to interrogate RDF triplestores, is a powerful language. However, mastering it, even for simple search tasks, requires appropriate skills in Semantic Web technologies. Ordinary web users therefore cannot use such technologies without instruction, leaving them in the hands of a limited number of experts. The datasets made available by OpenCitations suffered from similar issues.

Initially, our goal was to develop ad-hoc user interfaces that abstract the complexities of the SPARQL endpoints into well-designed Web interfaces that anyone could use. During development, though, we decided it would be better to create generic frameworks for building customizable interfaces that allow one to expose, in a more human-understandable way, RDF data stored in any RDF triplestore and accessible through any SPARQL endpoint, so as to foster reuse of this software in contexts that, in principle, might go far beyond the OpenCitations domain.

To this end we developed three different open software applications:

  • OSCAR [Heibi et al., 2018a] [Heibi et al., 2018b], a Javascript application for creating textual search interfaces to RDF data;
  • LUCINDA, another Javascript application for creating Web browsers over RDF data;
  • RAMOSE, a Python application that permits one to easily create and serve a conventional HTTP REST API over a SPARQL endpoint.

All these tools provide a configurable mechanism – by means of one single textual configuration file – that allows one to generate Web-based interfaces to any SPARQL endpoint. In practice, these tools are flexible Web-interface makers to RDF data, and as such represent a significant advance available to the entire community.
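The mechanism these tools share can be illustrated in miniature: a configuration pairs each interface element (here, an API route) with a parameterised SPARQL query, which is then instantiated from the values in an incoming call. The following Python sketch is illustrative only; it does not reproduce the actual configuration syntax of OSCAR, LUCINDA, or RAMOSE, and the route and query are invented:

```python
import re
from string import Template

# A hypothetical single-file configuration: each API route is paired
# with a SPARQL query template (syntax illustrative, not RAMOSE's own).
CONFIG = {
    "/citations/{oci}": Template(
        "SELECT ?citing ?cited WHERE { "
        "?c <http://purl.org/spar/cito/hasCitingEntity> ?citing ; "
        "   <http://purl.org/spar/cito/hasCitedEntity> ?cited . "
        "FILTER(STRENDS(STR(?c), '$oci')) }"
    )
}

def build_query(route: str, call: str) -> str:
    """Match an API call against a route pattern and instantiate
    the corresponding SPARQL query template with the URL parameters."""
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", route) + "$"
    match = re.match(pattern, call)
    if match is None:
        raise ValueError(f"{call!r} does not match {route!r}")
    return CONFIG[route].substitute(**match.groupdict())

query = build_query("/citations/{oci}", "/citations/020010-020020")
```

The appeal of this design is that adding a new API operation requires editing only the configuration file, not the serving code.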

All these applications have been used to produce several interfaces to the datasets released by OpenCitations. In particular, we have created user-friendly textual search interfaces (via OSCAR) both for the OCC (see the search box now on the OpenCitations home page, and the related search page) and for COCI (see the related search page). We have additionally developed browsing applications (via LUCINDA) to permit easier human navigation of all the entities included in the OCC (e.g. see the bibliographic resource br/1791056) and in COCI (e.g. see the citation oci:02001010806360107050663080702026306630509-02001010806360107050663080702026305630301). Finally, we have also implemented REST HTTP APIs (via RAMOSE) to simplify querying of both datasets, the OCC and COCI, by Web developers with no expertise in Semantic Web technologies.

In addition, in order to demonstrate their flexibility, we have also created two web pages using OSCAR, LUCINDA, and RAMOSE to permit similar tasks (text query, browsing, and REST APIs) on the scholarly data in Wikidata / WikiCite – another project recently funded by the Alfred P. Sloan Foundation. These interfaces were introduced at two distinct events: the hack day of the Workshop on Open Citations 2018 and the WikiCite 2018 Conference.

A further prototypical interface / service, BCite [Daquino et al., 2018], has recently been proposed to gather additional open citation data for inclusion in the OCC, involving users from the scholarly domain such as editors and researchers. BCite is designed to provide a full workflow for citation discovery, allowing users to specify the references as provided by the authors of an article, to retrieve them in the required format and style, to double-check their correctness, and, finally, to create new open citation data according to the OpenCitations Data Model [Peroni and Shotton, 2018d], so as to permit their future integration into the OCC. While BCite is presently only a prototype, it has received several commendations, and we are currently studying funding strategies to develop a full standalone application that can be used by anyone and that allows users to interact directly with the OCC, so as to upload new data into the Corpus.

Open Citation Identifiers

During the reporting period, it became increasingly evident to us that citations deserve to be treated as First Class Data Entities, which would give the following advantages:

  • All the information regarding each citation would be available in one place.
  • Citations become easier to describe, distinguish, count and process.
  • If available in aggregate, citations become easier to analyze using bibliometric methods, for example to determine how citation time spans vary by discipline.

Four developments were required to make this possible:

  • The metadata describing the citation must be definable in a machine-readable manner.
  • Such metadata must be storable, searchable and retrievable.
  • Each citation must be identifiable, using a globally unique Persistent Identifier.
  • There must be a Web-based resolution service that takes the identifier as input and returns a description of the citation.
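Taken together, the first two requirements amount to a self-contained RDF description of each citation. The following Python sketch builds such a description as Turtle using only stdlib string formatting; the property names follow CiTO, but the identifiers are invented for illustration and the property set is simplified relative to the full OpenCitations Data Model:

```python
# Sketch of a first-class citation description in Turtle.
# The OCI and DOIs below are hypothetical examples.
CITATION_TEMPLATE = """\
@prefix cito: <http://purl.org/spar/cito/> .

<https://w3id.org/oc/index/coci/ci/{oci}> a cito:Citation ;
    cito:hasCitingEntity <http://dx.doi.org/{citing}> ;
    cito:hasCitedEntity <http://dx.doi.org/{cited}> ;
    cito:hasCitationCreationDate "{created}" ;
    cito:hasCitationTimeSpan "{timespan}" .
"""

def describe_citation(oci, citing, cited, created, timespan):
    """Render the metadata of a single citation as a Turtle snippet,
    so all the information about that citation sits in one place."""
    return CITATION_TEMPLATE.format(oci=oci, citing=citing, cited=cited,
                                    created=created, timespan=timespan)

ttl = describe_citation("0200100-0200200", "10.1000/a", "10.1000/b",
                        "2018-03-01", "P1Y2M")
```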

To this end, we have achieved:

  • the first requirement by the addition of appropriate classes and properties to CiTO, the Citation Typing Ontology, and by the addition of a new member of the class datacite:ResourceIdentifierScheme in the DataCite Ontology, namely Open Citation Identifier;
  • the second requirement by modifying the OpenCitations Data Model [Peroni and Shotton, 2018d] so that citations can be properly described using these new ontology terms within the OpenCitations Corpus;
  • the third requirement by creating the syntax for this new Open Citation Identifier (OCI) [Peroni and Shotton, 2018c], and enabling the creation of such identifiers, both to specify citations within the OpenCitations Corpus, and also (importantly) to specify citations described in Wikidata (by QIDs) and in Crossref (by DOIs); and
  • the fourth requirement by creating a resolving service for OCIs at http://opencitations.net/oci and additional software (in Python) for retrieving information about a particular citation identified by an OCI.
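The structure of an OCI makes the resolution step straightforward: the identifier consists of two numeric strings joined by a hyphen, each of which carries a supplier prefix indicating which dataset assigned it (020 for Crossref-derived citations such as those in COCI, 010 for Wikidata). The sketch below illustrates the parsing step only; the prefix table is partial and the authoritative registry is given in [Peroni and Shotton, 2018c]:

```python
# Partial supplier-prefix table; see the OCI definition document
# for the authoritative registry.
PREFIXES = {
    "010": "Wikidata",   # QID-based identifiers
    "020": "Crossref",   # DOI-based identifiers
}

def parse_oci(oci: str) -> dict:
    """Split an OCI into its citing and cited parts and report,
    where the prefix is known, which supplier assigned each part."""
    value = oci[4:] if oci.startswith("oci:") else oci
    citing, cited = value.split("-")
    def supplier(number: str) -> str:
        # Unprefixed numbers are assumed here to be OCC records.
        return PREFIXES.get(number[:3], "OpenCitations Corpus")
    return {"citing": citing, "cited": cited,
            "citing_supplier": supplier(citing),
            "cited_supplier": supplier(cited)}

info = parse_oci("oci:0200101-0200202")
```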

To date, OCIs have been actively used both in the OCC and in COCI to identify all the citations they contain. In addition, we now plan to create and publish additional OpenCitations Indexes of all the citations in DataCite, Wikidata and (of course) the OCC, which we hope will be of great benefit to bibliometricians in their analysis of citation networks, self-citation, etc., and in the calculation of citation metrics. OCIs have been recognised by the EU Project FREYA as unique global identifiers for bibliographic citations.

Collaborations and Users

OpenCitations Data Model

We are collaborating with the following groups and academic projects, both to promote the use of the OpenCitations Data Model (OCDM), and to provide a publication venue for the citation data that they are liberating from the scholarly literature:

  • Matteo Romanello of the Digital Humanities Laboratory at the University of Lausanne is using OCDM for modelling citations of the classical literature within ancient Venetian documents in the context of the Venice Scholar Index, and is currently working on producing a dataset of citation data compliant with OCDM so as to be ingested in the OCC.
  • Two DFG-funded German projects that are extracting citations from Social Science publications:
    • The Linked Open Citations Database (LOC-DB) at the University of Mannheim is using the OCDM to model their data, with the aim of producing them in accordance with that model so that they can be ingested into the OCC.
    • Steffen Staab (University of Koblenz) and Philipp Mayr (GESIS) are running the EXCITE Project, which uses the OCDM to model their citation data, and has already adopted the OCC as their publication platform. Indeed, in September 2018, ~1 million citations coming from the EXCITE Project were successfully ingested into the OCC.
  • Sergey Parinov is technically leading CitEcCyr, which is an open repository of citation relationships obtained from research papers in the Russian language and Cyrillic script. This project intends to model its citations using the OCDM, and will use the OpenCitations Corpus as its publication platform.

Users of OpenCitations data

The following projects and organizations have let us know that they are using data from the OCC and COCI:

  • Wikidata includes alignments between several bibliographic entries and OCC resources;
  • OpenAIRE imported OCC metadata about articles into their LOD database;
  • Daniel Ecer and Lisa Knoll of eLife performed analytics on the OCC data;
  • Ontotext demonstrated SPARQL query federation between Springer Nature LOD and OCC;
  • Anna Kamińska published a bibliometrics case study of PLOS ONE articles in OCC;
  • Daniel Himmelstein processed OpenCitations data to create DOI-to-DOI citation tables;
  • Thiago Nunes and Daniel Schwabe are using OCC to exemplify their XPlain framework;
  • Antonina Dattolo and Marco Corbatto are using the OCC as a source for their VisualBib framework;
  • Nees Jan van Eck and Ludo Waltman extended VOSviewer to use data from the OCC and COCI; this extension will be officially published in the next release of the tool;
  • Barney Walker developed Citation Gecko, a graph-based citation discovery tool that retrieves citation data about papers from the OCC and COCI;
  • Philipp Zumstein developed a Zotero plugin that gives information about open citations using COCI;
  • Dominique Rouger developed a Web application that provides a visual graph representation of citation links in COCI.

Website statistics

From May 2017 to November 2018, the official OpenCitations website was accessed ~5.8M times – excluding hits made by well-known spiders and crawlers. It is worth mentioning that the pages related to the available data and the services for querying them (i.e. “/corpus”, “/sparql”, and “/index” in the following diagram) together account for a very high percentage of the overall accesses, showing that the main reason people access the OpenCitations website is to explore and use the data in the OCC and in COCI. It is also clear that the introduction of COCI brought additional accesses to the OpenCitations services, and the trend is increasing – e.g. in December 2018 (not shown in the following diagram) we received more than 200M accesses to “/index”, mainly related to the use of the COCI REST APIs.

Community outreach

From May 2017 to November 2018, the documents (i.e. [Peroni and Shotton, 2018b] [Peroni and Shotton, 2018c] [Peroni and Shotton, 2018d]) and the dumps of the OCC and of COCI available on Figshare (see http://opencitations.net/download) have been viewed 37,960 times and downloaded 3,842 times. The figure below summarizes how many views and downloads such resources have received month by month. For example, the latest version of COCI in CSV has been downloaded 239 times since its release in November (see https://doi.org/10.6084/m9.figshare.6741422.v3).

In the past nineteen months, the posts published by the official Twitter account of OpenCitations have received engagements from 599,200 distinct Twitter accounts, the Twitter profile (@opencitations) has been visited 12,719 times and has been mentioned in 565 tweets written by others, and it has gained an additional 1,925 followers. The diagram below shows all these statistics month by month.

Although we made relatively few new blog posts in the reporting period (there have been twenty more since then), from May to November 2018, the blog dedicated to OpenCitations (https://opencitations.wordpress.com) received 19,454 visits by 15,184 distinct users. As shown in the diagram below, the biggest peaks in terms of visits were in July 2018 (the month when we launched COCI) and in September 2018 (the month of the Workshop on Open Citations).

Users of the SPAR Ontologies

The SPAR Ontologies [Peroni and Shotton, 2018e] are in use by about 40 other projects and organizations, including:

  • The United States Global Change Information System, which encodes federal information relating to climate change, makes extensive use of SPAR ontology terms.
  • The United Nations Document Ontology (UNDO) has been specifically aligned with FaBiO.
  • Wikidata has many classes that have been aligned with FaBiO or CiTO.
  • DBPedia’s DataID ontology uses the FaBiO and DataCite ontologies.
  • W3C’s Data on the Web Best Practices: Dataset Usage Vocabulary uses SPAR Ontologies.

For the full list, see http://www.sparontologies.net/uptake.

To date, as far as we are aware, more than 677 papers have been published that cite or use one or more of the SPAR ontologies. For the full list, see http://www.sparontologies.net/uptake#publications.

OpenCitations and the Initiative for Open Citations

OpenCitations and the Initiative for Open Citations, despite the similarity of their names, are two distinct organizations. The primary purpose of OpenCitations is to host and build the OpenCitations Corpus (OCC) and the OpenCitations Indexes, as well as additional services to browse, query, and analyse citation data. In contrast, the Initiative for Open Citations (I4OC, https://i4oc.org) is a separate and independent organization, whose founding was spearheaded by Dario Taraborelli of the WikiMedia Foundation. OpenCitations is one of several founding members of the Initiative for Open Citations, as documented at https://i4oc.org/#founders. I4OC is a pressure group that promotes the unrestricted availability of scholarly citation data, but does not itself host citation data.

Because open reference lists are necessary for the population of OCC, we at OpenCitations have devoted considerable effort to promoting I4OC’s aims, and we host the I4OC web site on behalf of that community.

Within a short space of time, I4OC has persuaded most of the major scholarly publishers to open their reference lists submitted to Crossref, so that the proportion of all references submitted to Crossref that are now open has risen from 1% to over 50%. These are now available for OpenCitations to harvest into the OpenCitations Corpus and publish in RDF, as well as for others to harvest and use as they wish [Shotton, 2018].

Publications during the reporting period

Scholarly papers

Marilena Daquino, Ilaria Tiddi, Silvio Peroni, David Shotton (2018). Creating Open Citation Data with BCite. In Emerging Topics in Semantic Technologies – ISWC 2018 Satellite Events: 83-93. DOI: https://doi.org/10.3233/978-1-61499-894-5-83, OA at http://ceur-ws.org/Vol-2184/paper-01.pdf

Ivan Heibi, Silvio Peroni, David Shotton (2018). Enabling text search on SPARQL-endpoints through OSCAR. Submitted for publication to Data Science – Methods, Infrastructure, and Applications. OA at https://w3id.org/people/essepuntato/papers/oscar-datascience2019/

Ivan Heibi, Silvio Peroni, David Shotton (2018). OSCAR: A Customisable Tool for Free-Text Search over SPARQL Endpoints. In Semantics, Analytics, Visualization – 3rd International Workshop, SAVE-SD 2017, and 4th International Workshop, SAVE-SD 2018, Revised Selected Papers: 121-137. DOI: https://doi.org/10.1007/978-3-030-01379-0_9, OA at https://w3id.org/people/essepuntato/papers/oscar-savesd2018.html

Silvio Peroni, David Shotton (2018). OpenCitations: enabling the FAIR use of open citation data. In Proceedings of the GARR Conference 2017 – The data way to Science – Selected Papers. DOI: https://doi.org/10.26314/GARR-Conf17-proceedings-19

Silvio Peroni, David Shotton (2018). The SPAR Ontologies. In Proceedings of the 17th International Semantic Web Conference (ISWC 2018): 119-136. DOI: https://doi.org/10.1007/978-3-030-00668-6_8

Silvio Peroni, David Shotton, Fabio Vitali (2017). One year of the OpenCitations Corpus: Releasing RDF-based scholarly citation data into the Public Domain. In Proceedings of the 16th International Semantic Web Conference (ISWC 2017): 184-192. DOI: https://doi.org/10.1007/978-3-319-68204-4_19, OA at https://w3id.org/people/essepuntato/papers/oc-iswc2017.html

David Shotton (2018). Funders should mandate open citations. Nature 553: 129. https://doi.org/10.1038/d41586-018-00104-7

Additional documents

Silvio Peroni, David Shotton (2018). Open Citation: Definition. Figshare. DOI: https://doi.org/10.6084/m9.figshare.6683855.v1

Silvio Peroni, David Shotton (2018). Open Citation Identifier: Definition. Figshare. DOI: https://doi.org/10.6084/m9.figshare.7127816.v1

Silvio Peroni, David Shotton (2018). The OpenCitations Data Model. Figshare. DOI: https://doi.org/10.6084/m9.figshare.3443876.v5

Blog posts

6 May 2017: Querying the OpenCitations Corpus

15 May 2017: The Sloan Foundation funds OpenCitations

24 Nov 2017: Milestone for I4OC – open references at Crossref exceed 50%

24 Nov 2017: Elsevier references dominate those that are not open at Crossref

28 Nov 2017: Openness of non-Elsevier references

9 Jan 2018: The new Crossref reference distribution policy

9 Jan 2018: Barriers to comprehensive reference availability

15 Jan 2018: Funders should mandate open citations

16 Jan 2018: Oxford University Press opens its references!

29 Jan 2018: OpenCitations and the Initiative for Open Citations: A Clarification

19 Feb 2018: Citations as First-Class Data Entities: Introduction

22 Feb 2018: Citations as First-Class Data Entities: Citation Descriptions

25 Feb 2018: Citations as First-Class Data Entities: The OpenCitations Data Model

4 Mar 2018: Citations as First-Class Data Entities: The OpenCitations Corpus

12 Mar 2018: Citations as First-Class Data Entities: Open Citation Identifiers

15 Mar 2018: Citations as First-Class Data Entities: The Open Citation Identifier Resolution Service

23 Mar 2018: Early adopters of the OpenCitations Data Model

17 Apr 2018: Workshop on Open Citations

12 Jul 2018: COCI, the OpenCitations Index of Crossref open DOI-to-DOI references

19 Nov 2018: New release of COCI: 450M DOI-to-DOI citation links now available

Conference presentations and other outreach

Workshop on Open Citations 2018

OpenCitations, the EXCITE Project and Europe PubMed Central ran the first Workshop on Open Citations (Twitter: @workshop_oc) at the University of Bologna in Bologna, Italy, on 3-5 September 2018. It was organised as follows:

  • Day One and Day Two: Formal presentations and discussions on the creation, availability, uses and applications of open bibliographic citations, and of bibliometric studies based upon them;
  • Day Three: A Hack Day on Open Citations to see what services can be prototyped using large volumes of open citation data.

The workshop involved 60 participants, including researchers, computer scientists, scholarly publishers, academic administrators, research funders and policy makers. The workshop was organised around the following topics:

  • Opening up citations: Initiatives, collaborations, methods and approaches for the creation of open access to bibliographic citations;
  • Policies and funding: Strategies, policies and mandates for promoting open access to citations, and transparency and reproducibility of research and research evaluation;
  • Publishers and learned societies: Approaches to, benefits of, and issues surrounding the deposit, distribution, and services for open bibliographic metadata and citations;
  • Projects: Metrics, visualizations and other projects. The uses and applications of open citations, and bibliometric analyses and metrics based upon them.

All the sessions were recorded and are available on the official YouTube channel of the University of Bologna, and are linked (together with slides) at the website of the workshop – https://workshop-oc.github.io.

A further Workshop on Open Citations is being planned for autumn 2019.

Presentations

We have made conference presentations on OpenCitations, the Initiative for Open Citations, and Open Citation Identifiers at the following international conferences and workshops:

WikiCite Conference 2017, Vienna, 23 May 2017, https://www.slideshare.net/essepuntato/opencitations (Silvio Peroni and David Shotton)

COASP 9, 9th Conference of Open Access Scholarly Publishing, Lisbon, 20 September 2017, https://www.slideshare.net/essepuntato/the-initiative-for-open-citations-and-the-opencitations-corpus (David Shotton)

SemSci 2017, 1st International Workshop on Enabling Open Semantic Science, Vienna, 21 October 2017, https://w3id.org/people/essepuntato/presentations/the-open-citations-revolution.html (Silvio Peroni)

ISWC 2017, 16th International Semantic Web Conference, Vienna, 24 October 2017, https://w3id.org/people/essepuntato/presentations/oc-iswc2017.html (Silvio Peroni)

FORCE 2017, Research Communication and e-Scholarship Conference, Berlin, 27 October 2017, http://w3id.org/people/essepuntato/presentations/oc-force2017.html (Silvio Peroni)

Linked Open Citation Database (LOC-DB) Workshop, Mannheim, 7 November 2017, https://locdb.bib.uni-mannheim.de/wordpress/wp-content/uploads/2017/10/Shotton-LOC-DB-Mannheim.pdf (David Shotton)

GARR Conference 2017, Venice, 16 November 2017, https://www.eventi.garr.it/it/documenti/conferenza-garr-2017/presentazioni-2/232-conf2017-presentazione-peroni/file (Silvio Peroni)

OpenCon 2017, Oxford, 1 December 2017, https://doi.org/10.6084/m9.figshare.5844981.v1 (David Shotton)

PIDapalooza Conference of Persistent Identifiers, Girona, 24 January 2018, https://doi.org/10.6084/m9.figshare.5844972.v2 (David Shotton)

2018 International Workshop on Semantics, Analytics, Visualisation: Enhancing Scholarly Dissemination, Lyon, 24 April 2018, https://doi.org/10.6084/m9.figshare.7531577.v1 (Ivan Heibi)

Workshop on Open Citations 2018, Bologna, 3 September 2018, https://workshop-oc.github.io/presentations/D1S3_David_Shotton.pdf (David Shotton)

Workshop on Open Citations 2018, Bologna, 4 September 2018, https://docs.google.com/presentation/d/1mybQmjhFY6kLtTE1TdONaxsl0nSjmRGOSCnFMTwfzWQ/edit?usp=sharing (Silvio Peroni)

The 5th Conference on Scholarly Publishing in the Context of Open Science (PUBMET 2018), Zadar, 20 September 2018, https://doi.org/10.6084/m9.figshare.7110653.v3 (Silvio Peroni)

The 17th International Semantic Web Conference (ISWC 2018), Monterey, 12 October 2018, https://doi.org/10.6084/m9.figshare.7151759.v1 (Silvio Peroni)

WikiCite Conference 2018, Berkeley, 27 November 2018, https://doi.org/10.6084/m9.figshare.7396667.v1 (Ivan Heibi)

A further presentation on Open Citation Identifiers will be given at the 2019 PIDapalooza Conference of Persistent Identifiers in Dublin in January 2019.

Tweets

We have tweeted about the project and related matters under the names @opencitations, @dshotton, @essepuntato, @ivanHeiB, @workshop_oc, and @i4oc_org.

Future sustainability

While presently the OpenCitations Corpus has only partial coverage, our aim is that OpenCitations should become a comprehensive source of open citation information from all disciplines of scholarly endeavour, used on a daily basis by scholars worldwide, to equal or better the commercial offerings from Clarivate Analytics (Web of Science) and Elsevier (Scopus).

We also wish to develop effective graphical user interfaces to explore the citation network, and analytical tools over our open data. Since the OCC and COCI data are all open and available for others also to build such tools, we anticipate that such developments will best be undertaken collaboratively, under some open community organization, and indeed such development is currently being undertaken in collaboration with colleagues from CWTS at the University of Leiden, famous for their development of VOSviewer.

In order to fully support open scholarship, OpenCitations needs to mature from being an academic research and development project into a recognised scholarly infrastructure service such as PubMed. We wish to avoid becoming a commercial company, and see our development as better served by being ‘adopted’ by a major established scholarly institution, such as a national or university library or an internationally recognised centre providing scholarly bibliographic services, that has already shown a commitment to open scholarship, and where the interaction between that institution and OpenCitations would be mutually beneficial. To this end, we are currently in the mid-phase of negotiations with two institutions.

Conclusions

The grantees wish to express their deep gratitude to the Alfred P. Sloan Foundation for financial support enabling them to undertake the OpenCitations Enhancement Project, without which the rapid developments reported here would not have been possible.

References

Marilena Daquino, Ilaria Tiddi, Silvio Peroni, David Shotton (2018). Creating Open Citation Data with BCite. In Emerging Topics in Semantic Technologies – ISWC 2018 Satellite Events: 83-93. DOI: https://doi.org/10.3233/978-1-61499-894-5-83, OA at http://ceur-ws.org/Vol-2184/paper-01.pdf

Ivan Heibi, Silvio Peroni, David Shotton (2018a). Enabling text search on SPARQL-endpoints through OSCAR. Submitted for publication to Data Science – Methods, Infrastructure, and Applications. OA at https://w3id.org/people/essepuntato/papers/oscar-datascience2019/

Ivan Heibi, Silvio Peroni, David Shotton (2018b). OSCAR: A Customisable Tool for Free-Text Search over SPARQL Endpoints. In Semantics, Analytics, Visualization – 3rd International Workshop, SAVE-SD 2017, and 4th International Workshop, SAVE-SD 2018, Revised Selected Papers: 121-137. DOI: https://doi.org/10.1007/978-3-030-01379-0_9, OA at https://w3id.org/people/essepuntato/papers/oscar-savesd2018.html

Silvio Peroni, Alexander Dutton, Tanya Gray, David Shotton (2015). Setting our bibliographic references free: towards open citation data. Journal of Documentation, 71: 253-77. DOI: https://doi.org/10.1108/JD-12-2013-0166, OA at http://speroni.web.cs.unibo.it/publications/peroni-2015-setting-bibliographic-references.pdf

Silvio Peroni, David Shotton (2018a). OpenCitations: enabling the FAIR use of open citation data. In Proceedings of the GARR Conference 2017 – The data way to Science – Selected Papers. DOI: https://doi.org/10.26314/GARR-Conf17-proceedings-19

Silvio Peroni, David Shotton (2018b). Open Citation: Definition. Figshare. DOI: https://doi.org/10.6084/m9.figshare.6683855.v1

Silvio Peroni, David Shotton (2018c). Open Citation Identifier: Definition. Figshare. DOI: https://doi.org/10.6084/m9.figshare.7127816.v1

Silvio Peroni, David Shotton (2018d). The OpenCitations Data Model. Figshare. DOI: https://doi.org/10.6084/m9.figshare.3443876.v5

Silvio Peroni, David Shotton (2018e). The SPAR Ontologies. In Proceedings of the 17th International Semantic Web Conference (ISWC 2018): 119-136. DOI: https://doi.org/10.1007/978-3-030-00668-6_8

Silvio Peroni, David Shotton, Fabio Vitali (2016a). Building Citation Networks with SPACIN. Knowledge Engineering and Knowledge Management – EKAW 2016 Satellite Events, EKM and Drift-an-LOD, Revised Selected Papers: 162-166. DOI: https://doi.org/10.1007/978-3-319-58694-6_23, OA at https://w3id.org/oc/paper/spacin-demo-ekaw2016.html

Silvio Peroni, David Shotton, Fabio Vitali (2016b). Freedom for bibliographic references: OpenCitations arise. In Proceedings of 2016 International Workshop on Linked Data for Information Extraction (LD4IE 2016): 32-43. http://ceur-ws.org/Vol-1699/paper-05.pdf

Silvio Peroni, David Shotton, Fabio Vitali (2017). One year of the OpenCitations Corpus: Releasing RDF-based scholarly citation data into the Public Domain. In Proceedings of the 16th International Semantic Web Conference (ISWC 2017): 184-192. DOI: https://doi.org/10.1007/978-3-319-68204-4_19, OA at https://w3id.org/people/essepuntato/papers/oc-iswc2017.html

David Shotton (2013). Open citations. Nature, 502: 295-297. https://doi.org/10.1038/502295a

David Shotton (2018). Funders should mandate open citations. Nature 553: 129. https://doi.org/10.1038/d41586-018-00104-7


The Wellcome Trust funds OpenCitations

The Open Biomedical Citations in Context Corpus funded by the Wellcome Trust

The Wellcome Trust, which funds research in big health challenges and campaigns for better science, has agreed to fund The Open Biomedical Citations in Context Corpus, a new project to enhance the OpenCitations Corpus, as part of the Open Research Fund programme.

As readers of this blog will know, the OpenCitations Corpus is an open scholarly citation database that freely and legally makes available accurate citation data (academic references) to assist scholars with their academic studies, and to serve knowledge to the wider public.

Objectives

The Open Biomedical Citations in Context Corpus, funded by the Wellcome Trust for 12 months from March 2019, will make the OpenCitations Corpus (OCC) more useful to the academic community by significantly expanding the kinds of citation data held within the Corpus, so as to provide data for each individual in-text reference and its semantic context. This will make it possible to distinguish references that are cited only once from those that are cited multiple times, to see which references are cited together (e.g. in the same sentence), to determine in which section of the article references are cited (e.g. Introduction, Methods), and, potentially, to retrieve the function of the citation.

At OpenCitations, we will achieve these objectives in the following ways:

  • by extending the OpenCitations Data Model so as to describe how the in-text reference data should be modeled in RDF for inclusion in the OpenCitations Corpus;
  • by developing scripts for extracting in-text references from articles within the Open Access Subset of biomedical literature hosted by Europe PubMed Central;
  • by extending the existing ingestion workflow so as to add the new in-text reference data into the Corpus;
  • by developing appropriate user interfaces for querying and browsing these new data.

Personnel

We are looking for a post-doctoral computer scientist / research engineer specifically to achieve the aforementioned objectives. This post-doctoral appointment will start on the 1st of March 2019. We seek a highly intelligent, skilled and motivated individual who is expert in Python, Semantic Web technologies, Linked Data and Web technologies. Additional expertise in Web Interface Design and Information Visualization would be highly beneficial, as would a strong and demonstrable commitment to open science and team-working abilities.

The minimal formal requirement for this position is a Master's degree in computer science, computer science and engineering, telecommunications engineering, or an equivalent qualification, but it is expected that the successful applicant will have had research experience leading to a doctoral degree. The position has a net salary (exempt from income tax, after deduction of social security contributions) in excess of 23K euros per year.

The formal advertisement for this post – which will be held at the Digital Humanities Advanced Research Centre (DHARC), Department of Classical Philology and Italian Studies, University of Bologna, Italy, under the supervision of Dr Silvio Peroni – is published online, and is accompanied by the activity plan (in Italian and English). The application must be submitted exclusively online by logging in to the website https://concorsi.unibo.it (by default in Italian, but there is a link to switch the language to English). People who do not have a @unibo.it email account must register on the platform. The deadline for applications is the 25th January 2019 at 15:00 Central European Time. Please feel free to contact Silvio Peroni (silvio dot peroni at unibo dot it) for further information.

People involved

The people formally involved in the project are:

  • Vincent Larivière – École de Bibliothéconomie et des Sciences de l’Information, Université de Montréal, Canada;
  • Silvio Peroni (Principal Investigator) – Digital Humanities Advanced Research Centre (DHARC), Department of Classical Philology and Italian Studies, University of Bologna, Italy, and Director of OpenCitations;
  • David Shotton – Oxford e-Research Centre, University of Oxford, Oxford, UK, and Director of OpenCitations;
  • Ludo Waltman – Centre for Science and Technology Studies (CWTS), Leiden University, Netherlands.

In addition, the project is supported by Europe PubMed Central (EMBL-EBI, Hinxton, UK).


New release of COCI: 450M DOI-to-DOI citation links now available

As introduced in a previous blog post, COCI is the OpenCitations Index of Crossref open DOI-to-DOI references, all released as CC0 material. It is our first OpenCitations Index of open citations, in which we have applied the concept of citations as first-class data entities to index the contents of one of the major databases of open scholarly citation information, namely Crossref, and to render and make available this information in machine-readable RDF.

We are now proud to announce the second release of COCI, which now contains almost 450 million DOI-to-DOI citation links coming from both the ‘Open’ and the ‘Limited’ sets of Crossref reference data. This represents an increase of 42% in the number of indexed citations, compared with the initial release of COCI on 4th June 2018, which indexed 316,243,802 citations involving 45,145,889 bibliographic resources. In addition, the data model for COCI has now been extended so as to state directly the presence of journal self-citations and author self-citations.

Extended data model

The previous data model used for storing the citation data in COCI – which is itself a subset of the OpenCitations Data Model – has been extended so as to keep track of two particular types of self-citation, as shown in the following figure.

The new data model used in COCI for describing its citation data, which includes classes for describing two kinds of self-citations, i.e. journal self-citations and author self-citations.

Generally speaking, a self-citation is a citation in which the citing and the cited entities have something significant in common with one another, over and beyond their subject matter. The two kinds of self-citations we are now tracking are:

  • journal self-citation (class cito:JournalSelfCitation), i.e. a citation in which the citing and the cited entities are published in the same journal. This information has been obtained by comparing the ISSNs of the journals in which two journal articles related by a citation have been published, as provided by Crossref. If they share the same ISSN, then the citation is described as a journal self-citation;
  • author self-citation (class cito:AuthorSelfCitation), i.e. a citation in which the citing and the cited entities have at least one author in common. This information has been obtained by comparing the ORCIDs associated with the authors of a citing bibliographic entity with the ORCIDs of the authors of the cited entity. If any ORCID is shared, then the citation is described as an author self-citation. This categorization excludes authors bearing the same name where the ORCIDs are not known, since, while these instances may be author self-citations, they may alternatively merely represent name coincidences of distinct individuals.
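As a sketch (our own illustration, not the actual COCI ingestion code), the two checks above amount to set intersections over ISSNs and ORCIDs; the function name and input shape below are hypothetical:

```python
def classify_self_citation(citing, cited):
    """citing and cited are dicts with optional 'issn' and 'orcid' sets,
    e.g. {"issn": {"1756-8722"}, "orcid": {"0000-0002-1825-0097"}}."""
    # Journal self-citation: the two articles share at least one ISSN.
    journal_sc = bool(citing.get("issn", set()) & cited.get("issn", set()))
    # Author self-citation: the two author lists share at least one ORCID.
    # Shared author *names* are deliberately ignored, since homonyms of
    # distinct individuals would produce false positives.
    author_sc = bool(citing.get("orcid", set()) & cited.get("orcid", set()))
    return {"journal_sc": "yes" if journal_sc else "no",
            "author_sc": "yes" if author_sc else "no"}
```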

It is worth mentioning that, while ISSN information is usually present in the data returned by Crossref, the presence of ORCID data associated with the authors of the various papers represented in Crossref is presently very limited, so the number of recorded author self-citations in COCI is likely to be a considerable underestimate.

In this new release, COCI contains 449,842,374 citations, of which 30,114,696 are recorded as journal self-citations and 251,699 are recorded as author self-citations.

Extended REST API

The REST API for querying COCI has been extended so as to return information about the aforementioned self-citations. In particular, the response to the operations “references” and “citations” now has two more fields, i.e. “journal_sc” and “author_sc”, that are set to “yes” if the citation returned is a journal self-citation or an author self-citation respectively, or “no” otherwise.

Using the capabilities of the REST API, it is also possible to include in, or exclude from, the result set those citations that are (or are not) one of the aforementioned types of self-citation. For instance, the following call

https://w3id.org/oc/index/coci/api/v1/citations/10.1002/pol.1987.140251103?filter=journal_sc:yes

returns all the citations pointing to the article with DOI “10.1002/pol.1987.140251103” that are journal self-citations.
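Once a JSON response from the “citations” operation has been obtained, the “journal_sc” field can also be used to filter client-side. A minimal sketch, using an invented two-citation response (real responses come from the COCI REST API):

```python
import json

# Invented sample, shaped like the output of the "citations" operation:
# each citation carries "journal_sc" and "author_sc" flags.
sample_response = json.loads("""[
  {"oci": "1-2", "citing": "10.0000/a", "cited": "10.0000/b",
   "journal_sc": "yes", "author_sc": "no"},
  {"oci": "3-2", "citing": "10.0000/c", "cited": "10.0000/b",
   "journal_sc": "no", "author_sc": "no"}
]""")

# Keep only journal self-citations, mirroring filter=journal_sc:yes.
journal_self = [c for c in sample_response if c["journal_sc"] == "yes"]
print(len(journal_self))  # → 1
```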

Conclusions

In this blog post we have introduced the second release of COCI, the OpenCitations Index of Crossref open DOI-to-DOI references, a citation index which now contains almost 450 million open citations created from the ‘Open’ and ‘Limited’ references included within Crossref.

As a reminder, all the data in COCI are released under a CC0 waiver, and are available for inspection, download and reuse.

We plan soon to extend the OpenCitations Indexes by adding indexes of citations coming from other source datasets, including Wikidata and DataCite.


COCI, the OpenCitations Index of Crossref open DOI-to-DOI references

In a previous series of blog posts we proposed the treatment of bibliographic citations as first-class data entities, permitting citations to be endowed with descriptive properties. In doing so, we outlined some specific requirements, namely that the citations should be machine readable, should conform to a specific data model (in this case the OpenCitations Data Model), should be stored in an accessible database under an open license, and should be identified using global persistent identifiers (specifically Open Citation Identifiers) which are resolvable using an identifier resolution service (namely the Open Citation Identifier Resolution Service).

In this blog post, we introduce COCI, the OpenCitations Index of Crossref open DOI-to-DOI references1, our first open citation index, in which we have applied the concept of citations as first-class data entities to index the contents of one of the major open databases of scholarly citation information, namely Crossref, and to render and make available this information in machine-readable RDF.

Crossref contains metadata about publications (mainly academic journal articles) that are identified using Digital Object Identifiers (DOIs).  For about half of these publications Crossref also stores the reference lists of these articles submitted by the publishers (for discussion of why this is not true for all the publications, see this previous blog post).  Many of these references are to other publications bearing DOIs that are also described in Crossref, while others are to publications that lack DOIs and do not have Crossref descriptions.

COCI is an index of all the open DOI-to-DOI citations present in Crossref, and presently includes more than 300 million citations, obtained by parsing the open reference lists of the articles deposited there. COCI is available at http://opencitations.net/index/coci, and is released under a CC0 waiver.  COCI does not index Crossref references that are not open, nor Crossref open references to entities that lack DOIs.

What is an open citation index?

A citation index is a bibliographic index recording citations between publications, allowing the user to establish which later documents cite earlier documents. Several citation indexes are already available, some of which are freely accessible but not downloadable (e.g. Google Scholar), while others can be accessed only by paying significant access fees (e.g. Web of Science and Scopus). An open citation index contains only data about open citations, as defined in [1].

OpenCitations is a scholarly infrastructure organization dedicated to the promotion of semantic publishing by the use of semantic web (linked data) technologies, and engaged in advocacy for semantic publishing and open citations. It provides the OpenCitations Data Model and the SPAR (Semantic Publishing and Referencing) Ontologies for encoding scholarly bibliographic and citation data in RDF, and open software of generic applicability for searching, browsing and providing REST APIs over RDF triplestores. It has developed the OpenCitations Corpus (OCC) of open downloadable bibliographic and citation data recorded in RDF, and a system and resolution service for Open Citation Identifiers (OCIs), and it is currently developing a number of Open Citation Indexes using the data openly available in third-party bibliographic databases.

These Open Citation Indexes have the following characteristics in common:

  1. The citations they contain are all open [1].
  2. The citations are treated as first-class data entities;
  3. Each citation is identified by an Open Citation Identifier (OCI), which has a simple structure: the lower-case letters “oci” followed by a colon, followed by two numbers separated by a dash (e.g. oci:1-18);
  4. The citation metadata are recorded in RDF, based on the OpenCitations Data Model [2];
  5. The RDF statements for each citation record the basic properties shown in the following figure, which is based on the Citation Typing Ontology (CiTO) for describing the data, and the Provenance Ontology (PROV-O) for the provenance information.
The data model used for describing the citation data included in any Open Citation Index.


Parsing the Crossref collection

Over the past few months, we have parsed the entire Crossref bibliographic database to extract all the DOI-to-DOI citations included in the dataset, as well as additional information about each citation, specifically its creation date (i.e. the publication date of the citing entity) and the citation time span (i.e. the interval between the publication date of the cited entity and the publication date of the citing entity, to an accuracy determined by these publication dates as recorded in Crossref). These data for the open citations are now made available in COCI.
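As an illustration of the citation time span (a sketch of our own, not OpenCitations' code): with the year or year-month precision provided by Crossref, the span reduces to simple date arithmetic rendered as an ISO 8601 duration, such as “P1Y” for one year:

```python
def citation_timespan(citing_date, cited_date):
    """Compute an ISO 8601 duration between two 'YYYY' or 'YYYY-MM' dates,
    citing minus cited, at the precision both dates support."""
    cy, *cm = (int(p) for p in citing_date.split("-"))
    dy, *dm = (int(p) for p in cited_date.split("-"))
    years = cy - dy
    if cm and dm:  # month precision only when both dates provide it
        months = cm[0] - dm[0]
        if months < 0:
            years, months = years - 1, months + 12
        if months:
            return f"P{years}Y{months}M" if years else f"P{months}M"
    return f"P{years}Y"

print(citation_timespan("2013", "2012"))  # → P1Y
```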

Each citation is described as an individual of the class cito:Citation and is identified by a URL structured as follows:

https://w3id.org/oc/index/coci/ci/[[OCI]]

The parameter [[OCI]] refers to the numerical part of the Open Citation Identifier (OCI) assigned to the citation, i.e. two numbers separated by a dash, in which the first number identifies the citing work and the second number identifies the cited work. For instance:

https://w3id.org/oc/index/coci/ci/02001010806360107050663080702026306630509-02001010806360107050663080702026305630301

For citations extracted from Crossref in which the citing and cited works are identified by DOIs, which includes all the COCI citations, the OCI is created in the following manner:

  1. Each case-insensitive DOI is first normalized to lower case letters.
  2. Then, after omitting the initial “doi:10.” prefix, the alphanumeric string of the DOI is converted reversibly to a pure numerical string using the simple two-numeral lookup table for numerals, lower case letters and other characters presented at https://github.com/opencitations/oci/blob/master/lookup.csv.
  3. Finally, each converted numeral string is prefixed with “020”, which indicates that Crossref is the supplier of the original metadata of the citation (as indicated at http://opencitations.net/oci).
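These three steps can be sketched in Python. The lookup dictionary below contains only the handful of entries needed for the example DOIs in this post (digits, “/” and “-”); the full table is in the lookup.csv file linked above:

```python
# Partial lookup table: digits map to "0"+digit, "/" to "36", "-" to "63".
LOOKUP = {str(d): f"0{d}" for d in range(10)}
LOOKUP.update({"/": "36", "-": "63"})

def doi_to_oci_numeral(doi, supplier_prefix="020"):
    suffix = doi.lower()             # step 1: normalize to lower case
    if suffix.startswith("10."):     # step 2: omit the "10." prefix...
        suffix = suffix[3:]
    numerals = "".join(LOOKUP[c] for c in suffix)  # ...and convert each character
    return supplier_prefix + numerals              # step 3: prefix with "020"

# Reproduces the OCI shown earlier for 10.1186/1756-8722-6-59
# citing 10.1186/1756-8722-5-31:
citing = doi_to_oci_numeral("10.1186/1756-8722-6-59")
cited = doi_to_oci_numeral("10.1186/1756-8722-5-31")
print(f"{citing}-{cited}")
```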

Currently COCI contains 316,243,802 citations and 45,145,889 bibliographic resources. We plan to update COCI at least every six months as more open DOI-to-DOI citations appear in Crossref.

How to access the citation data in COCI

All the data in COCI are available for inspection, download and reuse in the following ways.

SPARQL endpoint

By querying the COCI SPARQL endpoint at https://w3id.org/oc/index/coci/sparql. If you access this URL with a browser, a GUI will be shown, containing an editable text box that enables the user to compose and execute a SPARQL query. In addition, the COCI SPARQL endpoint can be queried using the REST protocol, e.g. (via curl):

curl -L -H "Accept: text/csv" "https://w3id.org/oc/index/coci/sparql?query=PREFIX%20cito%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fspar%2Fcito%2F%3E%0ASELECT%20%3Fcitation%20%3Fcreation%20%7B%20%3Fcitation%20a%20cito%3ACitation%20%3B%20cito%3AhasCitationCreationDate%20%3Fcreation%20%7D%20LIMIT%201"

The above GET call executes the following simple SPARQL query:

PREFIX cito: <http://purl.org/spar/cito/>
SELECT ?citation ?creation { 
    ?citation a cito:Citation ; 
        cito:hasCitationCreationDate ?creation 
} 
LIMIT 1

This query returns the IRI of one citation, accompanied by its creation date, in CSV format:

citation,creation
https://w3id.org/oc/index/coci/ci/02001000002360105020963000103015801090909000259040238024003010138381018136310232701044203370037122439026315-02001000002361027293701070800030100060007,1999-02

The following SPARQL query can instead be used to retrieve the information about a particular citation, given its OCI:

PREFIX oci: <https://w3id.org/oc/index/coci/ci/>
PREFIX cito: <http://purl.org/spar/cito/>
SELECT DISTINCT ?citing ?cited ?creation ?timespan {
    oci:02001010806360107050663080702026306630509-02001010806360107050663080702026305630301 a cito:Citation ;
        cito:hasCitingEntity ?citing ;
        cito:hasCitedEntity ?cited ;
        cito:hasCitationCreationDate ?creation ;
        cito:hasCitationTimeSpan ?timespan
}

In this case, the query returns the DOI URLs of the citing and cited entities, accompanied by the creation date and the timespan of the citation:

citing,cited,creation,timespan
http://dx.doi.org/10.1186/1756-8722-6-59,http://dx.doi.org/10.1186/1756-8722-5-31,2013,P1Y

It is worth mentioning that the results can also be returned in JSON (using “Accept: application/json” in the header of the request) or XML (using “Accept: application/xml” in the header of the request). For example, accessing the long URL of the above request (the one starting with “https”) with a browser will return an XML document describing the same result.

REST API

Citation information may also be retrieved by using the COCI REST API, available and documented at https://w3id.org/oc/index/coci/api/v1, which has been implemented by means of RAMOSE (the Restful API Manager Over SPARQL Endpoints). Specifically, the COCI REST API makes available operations for retrieving the citation data.

If you would like to suggest an additional operation to be included in this API, please use the issue tracker of the COCI API available on GitHub.

Search and browsing interfaces

An interface providing a free text search over the contents of COCI is available at http://opencitations.net/index/coci/search. It allows one to search for citation data according to the same operational principles implemented in the REST API discussed above. However, in this case, the results are returned in tabular form through a web interface implemented by means of OSCAR, the OpenCitations RDF Search Application, e.g. http://opencitations.net/index/coci/search?text=10.1186%2F1756-8722-6-59&rule=citingdoi.

Each OCI returned by the search interface is a clickable link that opens a new descriptive page detailing that citation. These web pages are created by LUCINDA, which is a Javascript-based RDF data browser developed for exposing the statements contained in an RDF triplestore as descriptive human-readable HTML pages, e.g. http://opencitations.net/index/coci/browser/ci/02001010806360107050663080702026306630509-02001000002361222131237020000060000020201.

Data dumps

The dump of all the citation data available in COCI, including their provenance information, is downloadable from Figshare. These data are available in CSV and N-Triples formats, and each dump has a DOI assigned so as to be citable. Download links are available at http://opencitations.net/download#coci. A new dump will be made each time COCI is updated.
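For example, a dump row can be read with any CSV library. The column names in this sketch are illustrative assumptions (check the header of the actual dump on Figshare), and the row itself is invented:

```python
import csv
import io

# Hypothetical one-row COCI-style CSV; real dumps are downloaded from
# Figshare and may use different column names.
sample = io.StringIO(
    "oci,citing,cited,creation,timespan\n"
    "1-2,10.0000/a,10.0000/b,2013,P1Y\n"
)
rows = list(csv.DictReader(sample))
print(rows[0]["citing"], rows[0]["timespan"])  # → 10.0000/a P1Y
```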

By content negotiation

Using the HTTP URI of the individual citations, it is possible to access their representations in different formats: HTML, RDF/XML, Turtle, and JSON-LD. This is possible through a content negotiation mechanism that determines the format in which the information about a citation should be returned, by looking at the “Accept” header of the request. For instance, accessing the URL https://w3id.org/oc/index/coci/ci/02001010806360107050663080702026306630509-02001000002361222131237020000060000020201 will return the citation data in HTML, while the GET request below will return the same information in Turtle:

curl -L -H "Accept: text/turtle" "https://w3id.org/oc/index/coci/ci/02001010806360107050663080702026306630509-0200100000236122213123702000"

Conclusions

In this blog post we have introduced COCI, the OpenCitations Index of Crossref open DOI-to-DOI references, a citation index which contains more than 310 million open citations created from the ‘Open’ references included in Crossref. We plan soon to extend COCI to additionally include those DOI-to-DOI citations extracted from the ‘Limited’ set of Crossref reference data.

References

  1. Silvio Peroni and David Shotton (2018). Open Citation: Definition. figshare. DOI: https://doi.org/10.6084/m9.figshare.6683855
  2. Silvio Peroni and David Shotton (2018). The OpenCitations Data Model. figshare. DOI: https://doi.org/10.6084/m9.figshare.3443876

Footnotes

  1. At Crossref’s request, we have changed the originally proposed description of COCI from “the Crossref Open Citation Index (COCI)” to “COCI, the OpenCitations Index of Crossref open DOI-to-DOI references”, to make clear that COCI is an OpenCitations index and to avoid any implication that COCI is a Crossref service. We apologize for the initial ambiguity of our original wording and any confusion this may have caused.

Early adopters of the OpenCitations Data Model

OpenCitations is very pleased to announce its collaboration with four new scholarly Research and Development projects that are early adopters of the recently updated OpenCitations Data Model, described in this blog post.

The four projects are similar, in that each is independently using text mining and optical character recognition or PDF extraction techniques to extract citation information from the reference lists of published works, and is making these citations available as Linked Open Data. Three of the four will also use the OpenCitations Corpus as the publication platform for their citation data. The academic disciplines from which these citation data are being extracted are the social sciences, the humanities and economics.

1     Linked Open Citation Database (LOC-DB)

The Linked Open Citation Database, with partners in Mannheim, Stuttgart, Kiel, and Kaiserslautern (LOC-DB, https://locdb.bib.uni-mannheim.de/blog/en/), is the first of two German projects funded by the Deutsche Forschungsgemeinschaft (DFG) that are extracting citations from Social Science publications.  Dr. Annette Klein, Deputy Director of the Mannheim University Library, is the project manager.

The project is using approaches based on deep neural networks for reference detection, and state-of-the-art methods for information extraction and semantic labelling of reference lists from electronic and print media with arbitrary layouts [3]. The raw data obtained will be manually checked against and linked with existing bibliographic metadata sources in an editorial system. They will then be structured in RDF using the OpenCitations Data Model, and published in the Linked Open Citation Database under a CC0 waiver. Using its libraries’ own Social Science print holdings and licensed electronic journals as subject material, this project will demonstrate how these citation extraction processes can be applied to the holdings of individual academic libraries, and can be integrated with library catalogues [1, 2, 3].

References

[1]       Kai Eckert, Anne Lauscher and Akansha Bhardwaj (2017) LOC-DB: A Linked Open Citation Database provided by Libraries. Motivation and Challenges.  EXCITE Workshop 2017: “Challenges in Extracting and Managing References”.  https://locdb.bib.uni-mannheim.de/wordpress/wp-content/uploads/2016/11/LOC-DB@EXCITE.pdf

[2]      Lauscher, Anne; Eckert, Kai; Galke, Lukas; Scherp, Ansgar; Rizvi, Syed Tahseen Raza; Ahmed, Sheraz; Dengel, Andreas; Zumstein, Philipp; Klein, Annette  (2018) Linked Open Citation Database: Enabling libraries to contribute to an open and interconnected citation graph. Accepted for the JCDL 2018: Joint Conference on Digital Libraries 2018, June 3-6, 2018 in Fort Worth, Texas [Preprint of the conference publication].
https://locdb.bib.uni-mannheim.de/wordpress/wp-content/uploads/2018/04/LOCDB-JCDL2018-paper-camera-ready.pdf

[3]       Bhardwaj A., Mercier D., Dengel A., Ahmed S. (2017). DeepBIBX: deep learning for image based bibliographic data extraction. In: Liu D., Xie S., Li Y., Zhao D., El-Alfy ES. (eds) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science, vol 10635. Springer, Cham [Conference publication].

2     The EXCITE (Extraction of Citations from PDF Documents) Project

The EXCITE Project (http://west.uni-koblenz.de/en/research/excite/), run jointly at the University of Koblenz-Landau and GESIS (Leibniz Institute for Social Sciences), is the second project funded by the Deutsche Forschungsgemeinschaft (DFG) that is extracting citations from Social Science publications.  It is headed by Steffen Staab, head of the Institute for Web Science and Technologies at the University of Koblenz-Landau, and Philipp Mayr of GESIS.

Since the social sciences are given only marginal coverage in the main bibliographic databases, this project aims to make more citation data available to researchers, with a particular focus on the German language social sciences.  It has developed a set of algorithms for the extraction of reference information from PDF documents and for matching the reference entry strings thus obtained against bibliographic databases (see EXCITE git https://github.com/exciteproject/).  It is using as its data sources the following Social Science collections: full texts from SSOAR, the Gesis Social Science Open Access Repository (https://www.gesis.org/ssoar/home/) and scattered pdf stocks from other social science collections including SOLIS, Springer Online Journals and CSA Sociological Abstracts [4, 5].

The EXCITE project organized an international developer and researcher workshop “Challenges in Extracting and Managing References” in March 2017 in Cologne. http://west.uni-koblenz.de/en/research/excite/workshop-2017

EXCITE will then structure the extracted bibliographic and citation data in RDF using the OpenCitations Data Model, and will use the OpenCitations Corpus as its publication platform, employing the OCC EXCITE supplier prefix 0110, described here, to identify the provenance of these citations.

References

[4]       Martin Körner (2016). Extraction from social science research papers using conditional random fields and distant supervision, Master’s Thesis, University of Koblenz-Landau, 2016.

[5]       Körner, M., Ghavimi, B., Mayr, P., Hartmann, H., & Staab, S. (2017). Evaluating reference string extraction using line-based conditional random fields: a case study with german language publications. In M. Kirikova, K. Nørvåg, G. A. Papadopoulos, J. Gamper, R. Wrembel, J. Darmont, & S. Rizzi (Hrsg.), New Trends in Databases and Information Systems (Bd. 767, S. 137–145). Springer International Publishing. https://doi.org/10.1007/978-3-319-67162-8_15   Preprint: https://philippmayr.github.io/papers/Koerner-et-al2017.pdf

3    The Venice Scholar Index

The Venice Scholar Index is a citation index of literature on the history of Venice, indexing nearly 3000 volumes of scholarship from the mid-19th century to 2013, from which some 4 million bibliographic references have been extracted.

The Venice Scholar Index is the first prototype resulting from the Linked Books Project (https://dhlab.epfl.ch/page-127959-en.html), a project spearheaded by Giovanni Colavizza and Matteo Romanello of the Digital Humanities Laboratory at EPFL (École Polytechnique Fédérale de Lausanne), with partners in Venice, Milan and Rome.

The project is exploring the history of Venice through references to scholarly literature as well as archival documents found within publications.  To achieve this goal, the project has developed a system to automatically extract bibliographic references found within a large set of digitized books and journals, which has then been applied to the publications on the history of Venice, its main use case [6].

The Linked Books Project is specifically interested in analysing the interplay between citations to primary (e.g. archival) documents and those to secondary sources (scholarly literature), and the citation profiles of publications through time.  To this end, it developed the Venice Scholar Index, a rich search interface to navigate through the resulting network of citations, with the final aim of interlinking digital archives and digital libraries.

The citation data underlying the Venice Scholar Index are modelled using the OpenCitations Data Model, and the project will use the OpenCitations Corpus as its publication platform, using the OCC Venice Scholar Index supplier prefix 0120 to identify the provenance of these citations.

Reference

[6] Giovanni Colavizza, Matteo Romanello, and Frédéric Kaplan (2017). The references of references: a method to enrich humanities library catalogs with citation data. In International Journal on Digital Libraries 18 (March 8, 2017): 1–11. https://doi.org/10.1007/s00799-017-0210-1.

4    CitEcCyr – Citations in Economics published in Cyrillic

CitEcCyr (https://github.com/citeccyr/CitEcCyr) is an open repository of citation relationships obtained from research papers in the Russian language and Cyrillic script from Socionet (https://socionet.ru/) and RePEc (http://repec.org/) [7, 8]. The CitEcCyr project is headed by Oxana Medvedeva, is technically led by Sergey Parinov, and is funded by RANEPA (http://www.ranepa.ru/eng/), the Russian Presidential Academy of National Economy and Public Administration. CitEcCyr is also developing a suite of open software for the citation content analysis of these papers. This project intends to model its citations using the OpenCitations Data Model, and will use the OpenCitations Corpus as its publication platform, using the OCC CitEcCyr supplier prefix 0140 to identify the provenance of these citations.

However, since this is the first project from which OpenCitations will be importing bibliographic metadata and citations in a language other than English and in a script other than Latin, we at OpenCitations are going to have to crawl out of our comfortable ‘Western’ shells and learn to handle other languages and non-Latin scripts.

For Russian language papers written using Cyrillic script, we at OpenCitations will need to decide how best to handle Russian language written using Cyrillic script, Cyrillic script transliterated into Latin script, and Russian language translated into English and rendered using Latin script. In particular, since in the OpenCitations Corpus our reference entry records are the uncorrected literal texts of the references in the reference lists of the citing papers, these will need to be recorded as given, in Cyrillic.

We will need to develop a policy for when to provide Latin script translations of (for example) titles and abstracts, if these are not provided by the data supplier.  To facilitate use of the OpenCitations Corpus by Russian scholars, we will also need to modify the OpenCitations web site, so as to render the static information displayed in the web pages in the language and script appropriate to the language setting on the user’s web browser.

Unfortunately, all this will take time, so we do not anticipate publishing citation data from the CitEcCyr project within OCC any time soon.  However, this collaboration will be of tremendous value to OpenCitations as well as to CitEcCyr, since the lessons learned by our collaboration with the CitEcCyr project will enable the OpenCitations Corpus to handle citation data not just in Russian, but also in Arabic, Chinese, Japanese and other languages where the Latin script is not used, something that is not found in other major bibliographic databases.

Watch this space!

References

[7]       Jose Manuel Barrueco, Thomas Krichel, Sergey Parinov, Victor Lyapunov, Oxana Medvedeva and Varvara Sergeeva (2017).  Towards open data for the citation content analysis.    https://arxiv.org/abs/1710.00302

[8]       Thomas Krichel (2017). CitEc to CitEcCyr – A stab at distributed citation systems.  Presented at the 2017 EXCITE workshop. http://west.uni-koblenz.de/sites/default/files/research/projects/excite/workshop-2017/slides/excite-workshop-2017_krichel_citec-to-citeccyr.pdf
