Early adopters of the OpenCitations Data Model

OpenCitations is very pleased to announce its collaboration with four new scholarly Research and Development projects that are early adopters of the recently updated OpenCitations Data Model, described in this blog post.

The four projects are similar in that each is independently using text mining and optical character recognition or PDF extraction techniques to extract citation information from the reference lists of published works, and is making these citations available as Linked Open Data. Three of the four will also use the OpenCitations Corpus as the publication platform for their citation data.  The academic disciplines from which these citation data are being extracted are social science, humanities and economics.

1     Linked Open Citation Database (LOC-DB)

The Linked Open Citation Database, with partners in Mannheim, Stuttgart, Kiel, and Kaiserslautern (LOC-DB, https://locdb.bib.uni-mannheim.de/blog/en/), is the first of two German projects funded by the Deutsche Forschungsgemeinschaft (DFG) that are extracting citations from Social Science publications.  Dr. Annette Klein, Deputy Director of the Mannheim University Library, is the project manager.

The project is using approaches based on deep neural networks for reference detection, together with state-of-the-art methods for information extraction and semantic labelling of reference lists from electronic and print media with arbitrary layouts [3].  The raw data obtained will be manually checked against and linked with existing bibliographic metadata sources in an editorial system.  They will then be structured in RDF using the OpenCitations Data Model, and published in the Linked Open Citation Database under a CC0 waiver. Using its libraries’ own Social Science print holdings and licensed electronic journals as subject material, this project will demonstrate how these citation extraction processes can be applied to the holdings of individual academic libraries, and can be integrated with library catalogues [1, 2, 3].

References

[1]       Kai Eckert, Anne Lauscher and Akansha Bhardwaj (2017) LOC-DB: A Linked Open Citation Database provided by Libraries. Motivation and Challenges.  EXCITE Workshop 2017: “Challenges in Extracting and Managing References”.  https://locdb.bib.uni-mannheim.de/wordpress/wp-content/uploads/2016/11/LOC-DB@EXCITE.pdf

[2]      Lauscher, Anne; Eckert, Kai; Galke, Lukas; Scherp, Ansgar; Rizvi, Syed Tahseen Raza; Ahmed, Sheraz; Dengel, Andreas; Zumstein, Philipp; Klein, Annette  (2018) Linked Open Citation Database: Enabling libraries to contribute to an open and interconnected citation graph. Accepted for the JCDL 2018: Joint Conference on Digital Libraries 2018, June 3-6, 2018 in Fort Worth, Texas [Preprint of the conference publication].
https://locdb.bib.uni-mannheim.de/wordpress/wp-content/uploads/2018/04/LOCDB-JCDL2018-paper-camera-ready.pdf

[3]       Bhardwaj A., Mercier D., Dengel A., Ahmed S. (2017). DeepBIBX: deep learning for image based bibliographic data extraction. In: Liu D., Xie S., Li Y., Zhao D., El-Alfy ES. (eds) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science, vol 10635. Springer, Cham [Conference publication].

2     The EXCITE (Extraction of Citations from PDF Documents) Project

The EXCITE Project (http://west.uni-koblenz.de/en/research/excite/), run jointly at the University of Koblenz-Landau and GESIS (Leibniz Institute for Social Sciences), is the second project funded by the Deutsche Forschungsgemeinschaft (DFG) that is extracting citations from Social Science publications.  It is headed by Steffen Staab, head of the Institute for Web Science and Technologies at the University of Koblenz-Landau, and Philipp Mayr of GESIS.

Since the social sciences are given only marginal coverage in the main bibliographic databases, this project aims to make more citation data available to researchers, with a particular focus on the German language social sciences.  It has developed a set of algorithms for the extraction of reference information from PDF documents and for matching the reference entry strings thus obtained against bibliographic databases (see EXCITE git https://github.com/exciteproject/).  It is using as its data sources the following Social Science collections: full texts from SSOAR, the GESIS Social Science Open Access Repository (https://www.gesis.org/ssoar/home/), and scattered PDF stocks from other social science collections including SOLIS, Springer Online Journals and CSA Sociological Abstracts [4, 5].

The EXCITE project organized an international developer and researcher workshop “Challenges in Extracting and Managing References” in March 2017 in Cologne. http://west.uni-koblenz.de/en/research/excite/workshop-2017

EXCITE will then structure the extracted bibliographic and citation data in RDF using the OpenCitations Data Model, and will use the OpenCitations Corpus as its publication platform, employing the OCC EXCITE supplier prefix 0110, described here, to identify the provenance of these citations.

References

[4]       Martin Körner (2016). Extraction from social science research papers using conditional random fields and distant supervision, Master’s Thesis, University of Koblenz-Landau, 2016.

[5]       Körner, M., Ghavimi, B., Mayr, P., Hartmann, H., & Staab, S. (2017). Evaluating reference string extraction using line-based conditional random fields: a case study with German language publications. In M. Kirikova, K. Nørvåg, G. A. Papadopoulos, J. Gamper, R. Wrembel, J. Darmont, & S. Rizzi (Eds.), New Trends in Databases and Information Systems (Vol. 767, pp. 137–145). Springer International Publishing. https://doi.org/10.1007/978-3-319-67162-8_15   Preprint: https://philippmayr.github.io/papers/Koerner-et-al2017.pdf

3    The Venice Scholar Index

The Venice Scholar Index is a citation index of literature on the history of Venice, indexing nearly 3000 volumes of scholarship from the mid 19th century to 2013, from which some 4 million bibliographic references have been extracted.

The Venice Scholar Index is the first prototype resulting from the Linked Books Project (https://dhlab.epfl.ch/page-127959-en.html), a project spearheaded by Giovanni Colavizza and Matteo Romanello of the Digital Humanities Laboratory at EPFL (École Polytechnique Fédérale de Lausanne), with partners in Venice, Milan and Rome.

The project is exploring the history of Venice through references to scholarly literature as well as archival documents found within publications.  To achieve this goal, the project has developed a system to automatically extract bibliographic references found within a large set of digitized books and journals, which has then been applied to the publications on the history of Venice, its main use case [6].

The Linked Books Project is specifically interested in analysing the interplay between citations to primary (e.g. archival) documents and those to secondary sources (scholarly literature), and the citation profiles of publications through time.  To this end, it developed the Venice Scholar Index, a rich search interface to navigate through the resulting network of citations, with the final aim of interlinking digital archives and digital libraries.

The citation data underlying the Venice Scholar Index are modelled using the OpenCitations Data Model, and will use the OpenCitations Corpus as their publication platform, with the OCC Venice Scholar Index supplier prefix 0120 identifying the provenance of these citations.

Reference

[6] Giovanni Colavizza, Matteo Romanello, and Frédéric Kaplan (2017). The references of references: a method to enrich humanities library catalogs with citation data. In International Journal on Digital Libraries 18 (March 8, 2017): 1–11. https://doi.org/10.1007/s00799-017-0210-1.

4    CitEcCyr – Citations in Economics published in Cyrillic

CitEcCyr (https://github.com/citeccyr/CitEcCyr) is an open repository of citation relationships obtained from research papers in the Russian language and Cyrillic script from Socionet (https://socionet.ru/) and RePEc (http://repec.org/) [7, 8].  The CitEcCyr project is headed by Oxana Medvedeva, is technically led by Sergey Parinov, and is funded by RANEPA (http://www.ranepa.ru/eng/), the Russian Presidential Academy of National Economy and Public Administration. CitEcCyr is also developing a suite of open software for the citation content analysis of these papers.  This project intends to model its citations using the OpenCitations Data Model, and will use the OpenCitations Corpus as its publication platform, using the OCC CitEcCyr supplier prefix 0140 to identify the provenance of these citations.

However, since this is the first project from which OpenCitations will be importing bibliographic metadata and citations in a language other than English and in a script other than the Latin script, we at OpenCitations are going to have to crawl out of our comfortable ‘Western’ shells and learn to handle foreign languages and scripts other than Latin scripts.

For Russian language papers, we at OpenCitations will need to decide how best to handle Russian text written in Cyrillic script, Cyrillic script transliterated into Latin script, and Russian translated into English and rendered in Latin script.  In particular, since in the OpenCitations Corpus our reference entry records are the uncorrected literal texts of the references in the reference lists of the citing papers, these will need to be recorded as given in Cyrillic.

We will need to develop a policy for when to provide Latin script translations of (for example) titles and abstracts, if these are not provided by the data supplier.  To facilitate use of the OpenCitations Corpus by Russian scholars, we will also need to modify the OpenCitations web site, so as to render the static information displayed in the web pages in the language and script appropriate to the language setting on the user’s web browser.

Unfortunately, all this will take time, so we do not anticipate publishing citation data from the CitEcCyr project within OCC any time soon.  However, this collaboration will be of tremendous value to OpenCitations as well as to CitEcCyr, since the lessons learned by our collaboration with the CitEcCyr project will enable the OpenCitations Corpus to handle citation data not just in Russian, but also in Arabic, Chinese, Japanese and other languages where the Latin script is not used, something that is not found in other major bibliographic databases.

Watch this space!

References

[7]       Jose Manuel Barrueco, Thomas Krichel, Sergey Parinov, Victor Lyapunov, Oxana Medvedeva and Varvara Sergeeva (2017).  Towards open data for the citation content analysis.    https://arxiv.org/abs/1710.00302

[8]       Thomas Krichel (2017). CitEc to CitEcCyr – A stab at distributed citation systems.  Presented at the 2017 EXCITE workshop. http://west.uni-koblenz.de/sites/default/files/research/projects/excite/workshop-2017/slides/excite-workshop-2017_krichel_citec-to-citeccyr.pdf

Posted in Bibliographic references, Citations as First-Class Data Entities, Open Citations, Open scholarship, Semantic Publishing

Citations as First-Class Data Entities: The Open Citation Identifier Resolution Service

Requirements for citations to be treated as first-class data entities

In my introductory blog post, I listed five requirements for the treatment of citations as first-class data entities.  The fifth and final of these requirements is that there must be a Web-based identifier resolution service that takes the citation identifier as input and returns a description of the citation.

At the recent PIDapalooza Conference on persistent identifiers, held in Girona, Spain, I described the Open Citation Identifier Resolution Service, the new resolution service for Open Citation Identifiers created and operated by OpenCitations [1].

In this post, I describe this Open Citation Identifier Resolution Service, which supports the resolution of Open Citation Identifiers not only of the citations documented in the OpenCitations Corpus (OCC), but also of open citations recorded in other bibliographic databases.

What is the Open Citation Identifier Resolution Service?

The Open Citation Identifier Resolution Service runs on the OpenCitations server, presenting itself to the user as a web page with the URI http://opencitations.net/oci.

When a user enters a valid OCI and clicks the “Look up citation” button, this activates the resolution service, which, after a brief delay, returns information about the citation itself and about the citing and cited bibliographic resources, as shown in the following screen image (which for clarity omits the provenance data associated with this citation).

This information can optionally be returned to the user in a variety of other formats: RDF/XML, Turtle or JSON-LD.

Clicking on the links provided will return additional metadata held by the OpenCitations Corpus for the citing and the cited documents.  In the near future, this service will be integrated with LUCINDA, the forthcoming OCC browse interface, to present this information in a more user-friendly fashion.

Using the Resolution Service with citations in an external resource via a SPARQL endpoint

The Open Citation Identifier Resolution Service currently works for citations between bibliographic resources both within the OpenCitations Corpus and within external bibliographic databases, provided that the external service uses bibliographic resource identifiers having a unique numerical part, and provides a SPARQL endpoint that makes available information about bibliographic resources and the references they contain.

It can therefore resolve OCIs identifying citations within Wikidata, such as oci:01027931310-01022252312, where, as explained in the previous blog post, “010” is the assigned OCC supplier prefix for Wikidata.

Entering this OCI in the Open Citation Identifier Resolution Service pulls live data from the Wikidata SPARQL endpoint and returns the following information about that citation, as shown in the following screen image (which, again, omits for clarity the provenance data associated with that citation):

Clicking on the links provided here returns information about the relevant Wikidata entities.

Citing paper: Wikidata entity Q27931310 [screen image]

Cited paper: Wikidata entity Q22252312 [screen image]

How the Resolution Service works

The bibliographic database supplying the metadata for a particular citation identified by an OCI is specified by the assigned OCC supplier prefix that forms part of the OCI, as described in the previous blog post. Each OCI is thus specific for and unique within a particular bibliographic database.

The resolution service takes the OCI entered into the search box, recognises the supplier prefix specifying the bibliographic database holding the citation information, parses the OCI into the database identifiers for the citing and cited entities, and then sends an appropriate SPARQL query to the SPARQL endpoint of the relevant database. When that database has returned information about the citation itself and about the citing and cited bibliographic resources, this is displayed to the user as shown in the screen images above, or returned in one of the RDF formats (Turtle, JSON-LD, RDF/XML) according to the request.
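The parsing step can be sketched in a few lines of Python. This is only an illustration, not the code that runs on the OpenCitations server: the function name `parse_oci` is hypothetical, and the regular expression assumes the supplier prefix structure given in the Open Citation Identifiers post (a run of non-zero numerals enclosed between two zeros), with OCIs lacking such a prefix treated as older OpenCitations Corpus identifiers.

```python
import re

# Assumed shape of one half of an OCI: an optional supplier prefix
# ("0", one or more non-zero numerals, "0") followed by the numerical
# identifier used by the supplying database.
_PART = re.compile(r"^(0[1-9]+0)?(\d+)$")


def parse_oci(oci):
    """Split an OCI into ((prefix, id), (prefix, id)) for citing and cited.

    The prefix is None when the identifier carries no supplier prefix.
    """
    if not oci.startswith("oci:"):
        raise ValueError("an OCI must start with 'oci:'")
    citing, dash, cited = oci[len("oci:"):].partition("-")
    if not dash:
        raise ValueError("an OCI must contain a dash between its two numbers")
    parsed = []
    for part in (citing, cited):
        match = _PART.match(part)
        if match is None:
            raise ValueError(f"not a numerical identifier: {part!r}")
        parsed.append((match.group(1), match.group(2)))
    return tuple(parsed)


# The Wikidata example used in this post: prefix "010", then the numerical
# parts of the Wikidata identifiers Q27931310 and Q22252312.
print(parse_oci("oci:01027931310-01022252312"))
```

A real resolver would then look up the prefix in the list of assigned suppliers to find the SPARQL endpoint to query; that lookup and the query itself are left out here because their details are not given in this post.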

It is important to realize that no other databases are contacted during this resolution process, and that the quality and accuracy of the metadata retrieved by the Open Citation Identifier Resolution Service is the responsibility of the database hosting that citation.  The OCI Resolution Service does no more than retrieve this information, and does nothing to address possible errors or omissions in the metadata coming from the hosting database.

Using the Resolution Service with external citations via a REST API

While the resolution service presently works only to retrieve information from bibliographic databases having a SPARQL endpoint, we plan soon to extend this resolution service to work with information supplied by a bibliographic database via a REST API.

Coupled with the ability to create OCIs by numerical conversions of Digital Object Identifiers (DOIs), as explained in the previous blog post, the Open Citation Identifier Resolution Service could then be used to pull metadata live from the Crossref REST API for any of the ~350 million Crossref open references in which the cited paper as well as the citing paper has a DOI, and for which an OCI can thus be created.

Watch this space!

References

[1]     David Shotton (2018). Citations as first-class data entities. Open Citation Identifiers.  Conference presentation. PIDapalooza 2018, Girona, 23-24 January 2018. https://doi.org/10.6084/m9.figshare.5844972

Posted in Bibliographic references, Citations as First-Class Data Entities, Open Citation Identifiers, Open Citations, Semantic Publishing

Citations as First-Class Data Entities: Open Citation Identifiers

Requirements for citations to be treated as First-Class Data Entities

In my introductory blog post, I listed five requirements for the treatment of citations as first-class data entities.  The fourth of these requirements is that they must be identifiable using a global persistent identifier scheme.

At the recent PIDapalooza Conference on persistent identifiers, held in Girona, Spain, I launched the Open Citation Identifier (abbreviated OCI, in line with DOI), the new persistent identifier for citations [1].

In this post, I describe the Open Citation Identifier scheme, created and operated by OpenCitations, which supports the assignment of Open Citation Identifiers not only to the citations present in the OpenCitations Corpus (OCC) but also to open citations present in other bibliographic databases.

Structure and syntax of the Open Citation Identifier

Each OCI has a simple structure: oci:number-number, where “oci:” is the identifier prefix.

OCIs for citations stored within the OpenCitations Corpus are constructed by combining the OpenCitations Corpus local identifiers for the citing and cited bibliographic resources, separating them with a dash.  (For the definition of OCC local identifiers, see the OpenCitations Data Model.)

For example, oci:2544384-7295288 is a valid OCI for the citation between two papers stored within the OpenCitations Corpus, the first number being the OCC local identifier for the citing bibliographic resource [2], and the second being the OCC local identifier for the cited bibliographic resource [3], these bibliographic resource local identifiers being unique within the OCC.  [Note: Supplier prefixes are omitted from OCC local identifiers of bibliographic resources ingested into the OpenCitations Corpus prior to February 2018, but will be included within all OCC local identifiers of bibliographic resources ingested into the Corpus after that date.]

OCIs for external resources identified by numerical identifiers

OCIs can also be created for bibliographic resources described in an external bibliographic database, if they are similarly identified there by identifiers having a unique numerical part.  For example, the OCI for the citation that exists between Wikidata resources Q27931310 (the citing resource, [4]) and Q22252312 (the cited resource, [5]) is oci:01027931310-01022252312, where “010” is the assigned OCC supplier prefix for Wikidata.

The OCC supplier prefix consists of a positive number (following the pattern “nnn”, where “nnn” is a string of numerals of variable length which includes no zeros), enclosed between two zeros (e.g. “0420”).  The list of all assigned OCC supplier prefixes is given at https://github.com/opencitations/oci/blob/master/suppliers.csv.

OCIs for citations between resources identified by DOIs

OCIs can also be created for bibliographic resources described in external bibliographic databases such as Crossref or DataCite, where they are identified by alphanumeric Digital Object Identifiers (DOIs) rather than purely numerical strings.

To achieve this, each case-insensitive DOI is first normalized to lower case letters. Then, after omitting the initial “doi:10.” prefix, the alphanumeric string of the DOI is converted reversibly to a pure numerical string using the simple two-numeral lookup table for numerals, lower case letters and other characters presented at https://github.com/opencitations/oci/blob/master/lookup.csv. For example, using this lookup table, “1” becomes “01”, “2” becomes “02”, “a” becomes “10”, “b” becomes “11”, and “/” becomes “36”.  To the resulting number, the appropriate OCC supplier prefix is then added, to clearly identify its provenance.

A citation documented in Crossref exists between the two publications [3] and [6], which are there identified by the DOIs doi:10.1108/jd-12-2013-0166 and doi:10.1371/journal.pcbi.1000361.  We can thus create an OCI for this Crossref citation by using numerical representations of the two DOIs. These numerical representations are:

0200101000836191363010263020001036300010606

and

02001030701361924302723102137251211183701000000030601

where the initial “020” in each case is the assigned OCC supplier prefix for Crossref.

From these two numerical representations of DOIs, the OCI for the Crossref citation between these two papers is easily constructed, and is:

oci:0200101000836191363010263020001036300010606-02001030701361924302723102137251211183701000000030601

While this is long for an identifier, it should be remembered that it will be processed computationally, and is not intended for human readability.

In this way, Crossref OCIs can be assigned to all ~350 million open references within Crossref in which the cited paper as well as the citing paper has a DOI [7].
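The DOI conversion described in this section can be sketched in Python as follows. This is an illustration under stated assumptions rather than the OpenCitations implementation: the function names are hypothetical, and the table below is only partial. It covers numerals and lower case letters, “/” as given in the text, and the codes for “.” (37) and “-” (63), which are read off the worked Crossref example above; the complete table is the one in lookup.csv.

```python
import string

# Partial lookup table (assumption: only the characters evidenced in this
# post). Numerals map to "00".."09" and lower case letters to "10".."35".
LOOKUP = {c: f"{i:02d}" for i, c in enumerate(string.digits + string.ascii_lowercase)}
LOOKUP.update({"/": "36", ".": "37", "-": "63"})
REVERSE = {code: char for char, code in LOOKUP.items()}

CROSSREF_PREFIX = "020"  # assigned OCC supplier prefix for Crossref


def doi_to_numeric(doi, supplier_prefix=CROSSREF_PREFIX):
    """Convert a DOI to its prefixed numerical representation."""
    text = doi.lower()
    if text.startswith("doi:"):
        text = text[len("doi:"):]
    if not text.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    try:
        digits = "".join(LOOKUP[c] for c in text[len("10."):])
    except KeyError as err:
        raise ValueError(f"character {err} is not in the partial table") from err
    return supplier_prefix + digits


def numeric_to_doi(number, supplier_prefix=CROSSREF_PREFIX):
    """Reverse the conversion, two numerals at a time."""
    body = number[len(supplier_prefix):]
    pairs = (body[i:i + 2] for i in range(0, len(body), 2))
    return "doi:10." + "".join(REVERSE[p] for p in pairs)


def crossref_oci(citing_doi, cited_doi):
    """Build the OCI for a Crossref citation from the two DOIs."""
    return f"oci:{doi_to_numeric(citing_doi)}-{doi_to_numeric(cited_doi)}"


print(crossref_oci("doi:10.1108/jd-12-2013-0166",
                   "doi:10.1371/journal.pcbi.1000361"))
```

Because every character maps to exactly two numerals, the numerical string can be read back pair by pair, which is what makes the conversion reversible.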

OCIs for the same citation recorded within different databases

If a citation is recorded in more than one bibliographic database, a separate OCI can be created for each instance, each OCI having a distinct supplier prefix and being specific to that database.

Thus, in addition to the Crossref OCI created from DOIs and described above for the citation from [3] to [6], a Wikidata OCI exists for the same citation recorded within Wikidata, having the form oci:01024260641-01021092566.

Upon resolution of an OCI, the Open Citation Identifier Resolution Service will pull metadata only from the database specified by the supplier prefix of the OCI.  Details of the Open Citation Identifier Resolution Service are given in the next blog post.

It is important to note that an OCI can only be used to specify a citation between a citing and a cited publication which is actually recorded within a bibliographic database.  For this reason, the OCI “oci:7295288-3962641” shown below the second diagram in the introductory blog post to this series is presently invalid.  While the OpenCitations Corpus has metadata describing both bibliographic resources [3] and [6], it has not yet ingested the reference list for the first bibliographic resource [3] (which has the OCC local identifier 7295288), having information about it only from a reference within a third paper, with no information about the references [3] itself contains.  As a result, at present OCC has no record that a citation actually exists between [3] and the second bibliographic resource [6] (which has the OCC local identifier 3962641).

Representing OCIs in RDF

To permit the description of OCIs in RDF, “oci” has been added as a new member of the class datacite:ResourceIdentifierScheme within the DataCite Ontology.

The resolvable URL for any citation identified by an OCI has the form “https://w3id.org/oc/virtual/ci/nnn-mmm”, where nnn-mmm represents the OCI with its “oci:” prefix removed. Currently, we are able to return the RDF description of all the citations contained in the OpenCitations Corpus and Wikidata. We are working to extend the coverage so as to include other datasets, e.g. Crossref.
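As a small illustration of this URL pattern, the following Python sketch turns an OCI into its resolvable URL. The function name `oci_to_url` is hypothetical; only the base URL and the rule of dropping the “oci:” prefix come from the text above.

```python
BASE_URL = "https://w3id.org/oc/virtual/ci/"


def oci_to_url(oci):
    """Return the resolvable URL for an OCI by dropping its 'oci:' prefix."""
    if not oci.startswith("oci:"):
        raise ValueError("an OCI must start with 'oci:'")
    return BASE_URL + oci[len("oci:"):]


# The Wikidata citation used as an example earlier in this post.
print(oci_to_url("oci:01027931310-01022252312"))
```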

References

[1]     David Shotton (2018). Citations as first-class data entities. Open Citation Identifiers.  Conference presentation. PIDapalooza 2018, Girona, 23-24 January 2018. https://doi.org/10.6084/m9.figshare.5844972

[2]     Armen Yuri Gasparyan, Marlen Yessirkepov et al. (2015). Preserving the integrity of citations and references by all stakeholders of science communication.  J. Korean Med. Sci. 30:1545-1552. (English.)  https://doi.org/10.3346/jkms.2015.30.11.1545

[3]     Silvio Peroni, Alexander Dutton, Tanya Gray and David Shotton (2015). Setting our bibliographic references free: towards open citation data. Journal of Documentation, 71 (2): 253-277.  https://doi.org/10.1108/jd-12-2013-0166

[4]     Daniel K. Bricker, Eric B. Taylor et al. (2012). A Mitochondrial Pyruvate Carrier Required for Pyruvate Uptake in Yeast, Drosophila, and Humans. Science 337: 96-100.
https://doi.org/10.1126/science.1218099

[5]     Douglas Hanahan and Robert A. Weinberg (2011). Hallmarks of cancer: the next generation.  Cell 144: 646–674.  https://doi.org/10.1016/j.cell.2011.02.013

[6]     David Shotton, Katie Portwin, Graham Klyne and Alistair Miles (2009).  Adventures in semantic publishing: exemplar semantic enhancement of a research article. PLoS Computational Biology 5: e1000361. http://dx.doi.org/10.1371/journal.pcbi.1000361

[7]     Daniel Ecer (2017). Crossref Data Notebook (updated). Available at https://elifesci.org/crossref-data-notebook

 

Posted in Bibliographic references, Citations as First-Class Data Entities, Open Citation Identifiers, Open Citations, Semantic Publishing

Citations as First-Class Data Entities: The OpenCitations Corpus

Requirements for citations to be treated as First-Class Data Entities

In my introductory blog post, I listed five requirements for the treatment of citations as first-class data entities.  The third of these requirements is that they must be storable, searchable and retrievable in an open database designed for bibliographic citations.

In this post, I describe the current status of the OpenCitations Corpus, a well-structured open database specifically developed by OpenCitations and designed to store information about bibliographic citations as Linked Open Data, encoded in RDF (specifically JSON-LD).

What is OpenCitations?

OpenCitations (http://opencitations.net) is a scholarly infrastructure organization that has created and is currently expanding the coverage of the OpenCitations Corpus (OCC), an open repository of scholarly citation data made available under a Creative Commons CC0 public domain dedication, which provides in RDF accurate citation information (bibliographic references) harvested from the scholarly literature.

The Co-Directors of OpenCitations are David Shotton, Oxford e-Research Centre, University of Oxford (david.shotton@opencitations.net) and Silvio Peroni, Department of Computer Science and Engineering, University of Bologna (silvio.peroni@opencitations.net).

We are committed to open scholarship, open data, open access publication, and open source software.  We espouse the FAIR data principles developed by Force11, of which David Shotton was a founding member, and the aim of the Initiative for OpenCitations (I4OC), of which David Shotton and Silvio Peroni were both founding members, to promote the availability of citation data that is structured, separable, and open.

The principal activity of OpenCitations to date has been the establishment and population of the OpenCitations Corpus.

Holdings of the OpenCitations Corpus

We have so far concentrated on ingesting into the OpenCitations Corpus bibliographic references from open access papers available at PubMed Central, the encoding of these data in RDF, and high-quality curation of the citation links they represent, involving metadata enrichment from the Crossref API and (for authors) the ORCID API.

To date (19th February 2018), the OCC has ingested the references from 302,758 citing bibliographic resources, and contains information about 12,830,347 citation links to 6,549,665 cited resources. Plans to expand the coverage of the OCC are outlined below.

User interfaces

The information within the OCC can be accessed via OSCAR, our new generic OpenCitations RDF Search Application (http://opencitations.net/search) [1], which can be used for textual searches over any triplestore presenting a SPARQL endpoint.  Users can employ OSCAR to search the OCC for publication titles, author names, publication years, and identifiers (DOIs, PubMed IDs, PubMed Central IDs, ORCIDs, and OCC corpus identifiers). Such a search returns details of all bibliographic resources within the OCC matching the search term, from which their references can be obtained, if known. In the near future, we will complement OSCAR with a browse interface named LUCINDA.

We also provide a SPARQL endpoint for directly querying the Blazegraph triplestore in which we store the OCC RDF, and we plan in the near future to supplement such programmatic access with a REST API.  In addition, the contents of the entire triplestore, and of the various sub-databases within the Corpus, together with their provenance information, are downloadable from Figshare as monthly dumps.  Once the REST API has been developed, we will turn our attention to developing user interfaces for the interactive visualization of citation graphs.

The OpenCitations Data Model

As described in the previous blog post, we have just completed a comprehensive revision of the OpenCitations Data Model (OCDM, available at https://doi.org/10.6084/m9.figshare.3443876), which we use to capture descriptions of all aspects of the OCC citations and their provenance. This model makes extensive use of our SPAR (Semantic Publishing and Referencing) Ontologies (http://www.sparontologies.net/), which we developed to describe all aspects of the scholarly publishing domain in RDF.

The OpenCitations Data Model is freely available for third parties to use when recording their own bibliographic and citation information in RDF, with the advantage that data so modelled will be immediately compatible with those within the OpenCitations Corpus, which can act as a publishing venue for such third-party data.

Future ingest rate and data sources

Since July 2016, the instantiation of the OpenCitations Corpus currently running at the University of Bologna has been ingesting reference lists from biomedical journal articles at the relatively slow rate of about 200,000 citing bibliographic resources per year. During February 2018, ingestion into the Corpus is suspended, while we move the system to a completely new and more powerful server, supplemented by thirty Raspberry Pi ingest engines that will work in parallel feeding ingested data to the server.

This will increase our ingestion rate ~30-fold to about six million citing bibliographic resources per year, equivalent to ~240 million citations per year at 40 references per paper (the current OCC value is 42.4 references per paper).  We should then be able to complete ingestion of the ~1.4 million remaining OA resources at PubMed Central within about three months.

At that stage, we plan to start ingesting references from the ~17 million journal articles whose deposited references are now open at Crossref as a consequence of the Initiative for Open Citations.  The scholarly world currently publishes about 2.5 million new journal articles each year, of which about half will probably be open at Crossref (assuming Elsevier has not by then opened its references).  So, by the end of 2020, Crossref will have ~650 million open references.  In addition to ingesting new open Crossref references as they are made available, we will be able to eat into the backlog of existing Crossref open references at a catch-up rate of ~190 million per year.  By the end of 2020, we anticipate that the OCC should contain ~650 million citations harvested from PMC and Crossref, roughly half the coverage of Web of Science.  We are currently also considering ingest of references from other major bibliographic databases.
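The arithmetic behind these projections can be checked with a short Python calculation. All input figures are taken from this post; the variable names are illustrative, and the derivation of the ~190 million catch-up rate (total yearly capacity minus the references expected from newly published open articles) is one reading of the text rather than a calculation stated in it.

```python
# Figures quoted in this post.
current_rate = 200_000        # citing resources ingested per year so far
speedup = 30                  # thirty ingest engines working in parallel
refs_per_paper = 40           # round figure used for the projection
remaining_pmc = 1_400_000     # OA resources at PubMed Central still to ingest

new_rate = current_rate * speedup                   # citing resources per year
citations_per_year = new_rate * refs_per_paper      # citations per year
months_to_finish_pmc = remaining_pmc / new_rate * 12

# Observed average in the OCC on 19 February 2018.
observed_refs_per_paper = 12_830_347 / 302_758

# Assumed reading of the catch-up rate: capacity left over after ingesting
# the references of the new open articles published each year.
new_open_articles = 2_500_000 // 2
new_open_refs = new_open_articles * refs_per_paper
catch_up_rate = citations_per_year - new_open_refs

print(f"{new_rate:,} citing resources per year")
print(f"{citations_per_year:,} citations per year")
print(f"{months_to_finish_pmc:.1f} months to finish PubMed Central")
print(f"{observed_refs_per_paper:.1f} references per paper observed")
print(f"{catch_up_rate:,} backlog citations per year")
```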

Our vision for OpenCitations

Our vision is that OpenCitations should become a comprehensive source of open citation information from all disciplines of scholarly endeavour encoded as Linked Open Data, a key component of the academic open infrastructure used on a daily basis without charge by scholars worldwide.

To be of maximum utility, it requires effective graphical user interfaces and analytical tools to interrogate and quantify the data contained within the OCC.  Since these data are all open, we anticipate that such interface and tool development will best be undertaken collaboratively within the open scholarly community, and we invite developers interested in such collaboration to contact us at contact@opencitations.net.

Reference

[1]     Ivan Heibi, Silvio Peroni and David Shotton (2018).  OSCAR: A customisable tool for free-text search over SPARQL endpoints. Accepted to the 2018 International Workshop on Semantics, Analytics, Visualisation: Enhancing Scholarly Dissemination Workshop (https://save-sd.github.io/2018/, co-located with The Web Conference), 24 April 2018 – Lyon, France.  Preprint available at https://w3id.org/people/essepuntato/papers/oscar-savesd2018.html


Citations as First-Class Data Entities: The OpenCitations Data Model

Requirements for citations to be treated as First-Class Data Entities

In my introductory blog post, I listed five requirements for the treatment of citations as first-class data entities.  The second of these requirements is that they must have metadata structured using a generic yet appropriately detailed data model.

To fulfil that requirement, OpenCitations is pleased to announce the publication on 13 February 2018 of the OpenCitations Data Model, v1.6 [1].  This replaces the previous version, v1.5.3, published on 13 July 2016.

The data model has been expanded and enhanced to improve the recording of publication dates, to include the treatment of citations as first-class data entities, and to permit the model’s adoption by third parties who may wish to use it to model their own citation data, or to prepare their citation data for publication in the OpenCitations Corpus (OCC).  To facilitate this, the document describing this data model is published under a Creative Commons Attribution 4.0 International license.

In addition to a change in the title from “Metadata for the OpenCitations Corpus” to “The OpenCitations Data Model”, and the use of the name “OpenCitations” (one token with two words in camel case) in place of “Open Citations” (with the space separating the two words), the substantive changes in the model from the previous version are as follows:

New class

A new class, Archival document, has been added as a subclass of bibliographic resource, to permit the model to be used for work on ancient manuscripts.

Publication dates

The mechanism for recording the publication dates of bibliographic resources has been improved, and now accepts the full date of publication (yyyy-mm-dd, if available), or the year plus the month of publication (yyyy-mm, if the full date is not available), or failing that just the year of publication (yyyy, as in the previous version of the data model).   In order to support this modification in the OWL mapping, prism:publicationDate is now used instead of fabio:hasPublicationYear.
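
A minimal sketch of how the three accepted forms might be recognised is given below. The function name is hypothetical, and pairing each form with an XSD datatype (xsd:date, xsd:gYearMonth, xsd:gYear) is an assumption of this sketch rather than something stated above.

```python
import re
from datetime import date

# Most specific form first, so that a full date is never reported as a bare year.
_FORMS = [
    ("xsd:date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("xsd:gYearMonth", re.compile(r"\d{4}-\d{2}")),
    ("xsd:gYear", re.compile(r"\d{4}")),
]


def publication_date_datatype(value: str) -> str:
    """Return the assumed XSD datatype for a yyyy-mm-dd, yyyy-mm or yyyy string."""
    for datatype, pattern in _FORMS:
        if pattern.fullmatch(value):
            parts = [int(p) for p in value.split("-")]
            # Pad a missing month or day with 1 so that date() rejects
            # impossible values such as month 13 (it raises ValueError).
            date(*(parts + [1, 1])[:3])
            return datatype
    raise ValueError(f"not a yyyy-mm-dd, yyyy-mm or yyyy date: {value!r}")


print(publication_date_datatype("2018-02-13"))
print(publication_date_datatype("2018-02"))
print(publication_date_datatype("2018"))
```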

Citations as first-class data entities

A new class of bibliographic entity, Citation, has been added to permit the description of citations as first-class data entities.  This class has been assigned sub-classes (e.g. Author self-citation) and properties (e.g. citation time span) to permit the description of citations in a manner helpful for bibliometric analysis.  These, and associated changes to CiTO, the Citation Typing Ontology, are described more fully in the previous blog post.

Virtual entities

The OpenCitations Data Model now permits the definition of virtual entities, i.e. bibliographic entities that are defined on-the-fly, only when they are requested (for example, by accessing their URLs). These are defined either by using data relating to non-virtual bibliographic entities that are already available within the OCC, or by using data that are themselves obtained on-the-fly from an external supplier (e.g. Wikidata).

This approach of using virtual RDF resources is optional, and is simply employed for storage efficiency, to avoid duplication of information within the OCC triplestore. As of January 2018, only one type of bibliographic entity is defined as a virtual entity, namely a citation (a member of the class Citation).

Such a virtual entity does not have the full provenance information normally associated with other bibliographic entities within the OCC, but it does record the date of its creation and has direct links both to the agent responsible for its creation and to the source data used in its construction.

Because we do not separately store these virtual entities within the Corpus triplestore, they cannot be directly queried by means of the OCC SPARQL end-point, nor are they stored within its data dumps. However, the data associated with an OCC virtual entity can be obtained by accessing its URL, which has the form “https://w3id.org/oc/virtual/xyz”, clearly distinguishable from the URIs used for other (non-virtual) OCC bibliographic entities, which have the form “https://w3id.org/oc/corpus/xyz”.  More details and examples are given in the Data Model document itself.
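
Since the two kinds of entity differ only in the path segment after “/oc/”, a client can tell them apart from the URI alone. The helper below is a hypothetical sketch based on the two URI forms quoted above.

```python
VIRTUAL_BASE = "https://w3id.org/oc/virtual/"
CORPUS_BASE = "https://w3id.org/oc/corpus/"


def entity_kind(uri: str) -> str:
    """Classify an OCC URI as 'virtual', 'corpus' or 'unknown' by its prefix."""
    if uri.startswith(VIRTUAL_BASE):
        return "virtual"   # built on the fly; absent from SPARQL results and dumps
    if uri.startswith(CORPUS_BASE):
        return "corpus"    # stored in the triplestore
    return "unknown"


print(entity_kind("https://w3id.org/oc/virtual/ci/1-18"))
print(entity_kind("https://w3id.org/oc/corpus/br/1"))
```

A client could use such a check to decide whether to dereference the URL directly (virtual) or to query the SPARQL end-point (corpus).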

Additionally, for citations defined using Open Citation Identifiers (OCIs, described in a subsequent blog post), details of the cited and citing publications may be readily obtained by using the Open Citation Identifier Resolution Service at http://opencitations.net/oci.

Supplier prefixes

To enable citation data created by third parties to be incorporated within the OpenCitations Corpus, from February 2018 the OCC local identifiers for bibliographic resources now include a supplier prefix which clearly identifies the provenance of the data.  The prefix consists of a positive number (following the pattern “nnn”, where “nnn” is a string of numerals of variable length which includes no zeros), enclosed between two zeros (e.g. “0420”).
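
The prefix pattern can be expressed as a regular expression. The sketch below is illustrative only (the function names are hypothetical). It also shows one apparent consequence of the “no zeros” rule: the prefix ends at the first zero after the leading one, so it can be separated from the rest of a local identifier without knowing its length in advance.

```python
import re
from typing import Tuple

# A leading zero, one or more non-zero digits, then a closing zero.
SUPPLIER_PREFIX = re.compile(r"0[1-9]+0")
LOCAL_ID = re.compile(r"(0[1-9]+0)(\d+)")


def is_supplier_prefix(text: str) -> bool:
    """True if the whole string matches the supplier prefix pattern."""
    return SUPPLIER_PREFIX.fullmatch(text) is not None


def split_local_id(local_id: str) -> Tuple[str, str]:
    """Split a local identifier into (supplier prefix, body)."""
    match = LOCAL_ID.fullmatch(local_id)
    if match is None:
        raise ValueError(f"not a prefixed local identifier: {local_id!r}")
    return match.group(1), match.group(2)


print(is_supplier_prefix("0420"))
print(split_local_id("042012345"))
```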

To ensure uniqueness of prefixes used by different suppliers, all organizations wishing to adopt the OpenCitations Data Model and to use it to create publicly available citation data, whether these are published in the OpenCitations Corpus or independently, must apply to OpenCitations for a unique supplier prefix, by sending an email to support@opencitations.net.  A list of already assigned supplier prefixes is available at https://github.com/opencitations/oci/blob/master/suppliers.csv.

The appropriate supplier prefix is combined with a unique numerical string that forms the ‘body’ of the identifier to create the local identifier used in OCC to identify an individual bibliographic resource.  OCC local identifiers for citations (as opposed to bibliographic resources) are constructed by combining the local identifiers for the citing and cited bibliographic resources, separating them with a dash.  Thus, when the citing and cited bibliographic resources are described in an external bibliographic database in which each is identified by an identifier with a unique numerical part, the supplier prefix is prepended to each numerical part to form the two local identifiers, which are then joined with a dash.

For example, the citation between citing Wikidata resource Q27931310 and cited Wikidata resource Q22252312 is given the OCC local citation identifier “01027931310-01022252312”, where “010” is the OCC supplier prefix (defined above) for Wikidata.  How these OCC local identifiers for citations are used to create Open Citation Identifiers is described in a separate blog post.
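
The construction in this example can be sketched as follows. The function names are hypothetical, and dropping the leading “Q” of a Wikidata identifier to obtain its numerical part is inferred from the example rather than taken from the Data Model document.

```python
WIKIDATA_PREFIX = "010"  # OCC supplier prefix for Wikidata, as given in the text


def local_id(supplier_prefix: str, numeric_part: str) -> str:
    """Prepend the supplier prefix to the numerical part of an external identifier."""
    if not (numeric_part.isascii() and numeric_part.isdigit()):
        raise ValueError(f"numerical part expected, got {numeric_part!r}")
    return supplier_prefix + numeric_part


def wikidata_citation_id(citing_qid: str, cited_qid: str) -> str:
    """Build the OCC local citation identifier for two Wikidata items."""
    citing = local_id(WIKIDATA_PREFIX, citing_qid.lstrip("Q"))
    cited = local_id(WIKIDATA_PREFIX, cited_qid.lstrip("Q"))
    return f"{citing}-{cited}"


print(wikidata_citation_id("Q27931310", "Q22252312"))
```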

 

We commend the OpenCitations Data Model to anyone considering the storage of citation information, particularly if it is to be encoded in RDF, and we welcome contributions of citation data encoded using this model for publication within the OpenCitations Corpus.

Reference

[1]     Silvio Peroni, David Shotton (2018). The OpenCitations Data Model. Version 1.6. figshare. https://doi.org/10.6084/m9.figshare.3443876


Citations as First-Class Data Entities: Citation Descriptions

Requirements for citations to be treated as First-Class Data Entities

In my introductory blog post, I listed five requirements for the treatment of citations as first-class data entities.  The first of these requirements is that they must be definable in a machine-readable manner as a member of the class “Citation”, and describable using appropriate ontology terms.

This blog post describes recent additions to the OpenCitations Data Model, and to CiTO, the Citation Typing Ontology, that permit the required richer description of citations.

Changes to the OpenCitations Data Model

In the OpenCitations Data Model (OCDM), itself described in the following blog post, we have created new classes and properties that permit citations to be described in richer ways appropriate for bibliometric research.  These changes have been inspired by the publications of Vincent Larivière, Ludo Waltman and their colleagues [1-3].

These new classes and properties and their definitions are described below:

New classes

  • Citation: a permanent conceptual directional link from the citing bibliographic resource to a cited bibliographic resource, created by the performative act of an author citing a published work that is relevant to the current work, typically made by including a bibliographic reference in the reference list of the citing work, or by the inclusion within the citing work of a link, in the form of an HTTP Uniform Resource Locator (URL), to the cited bibliographic resource on the World Wide Web.

The class Citation has sub-classes defining a particular type of citation.

  • Self-citation: a citation in which the citing and the cited entities have something significant in common with one another. Sub-classes include:
    • Affiliation self-citation: a citation in which at least one author from each of the citing and the cited entities is affiliated with the same academic institution.
    • Author network self-citation: a citation in which at least one author of the citing entity has direct or indirect co-authorship links with one of the authors of the cited entity.
    • Author self-citation: a citation in which the citing and the cited entities have at least one author in common.
    • Funder self-citation: a citation in which the works reported in the citing and the cited entities were funded by the same funding agency.
    • Journal self-citation: a citation in which the citing and the cited entities are published in the same journal.
  • Journal cartel citation: a citation from one journal to another journal which forms one of a very large number of citations from the citing journal to recent articles in the cited journal.
  • Distant citation: a citation in which the citing and the cited entities have nothing significant in common with one another over and beyond their subject matter.
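
Several of these sub-classes can be detected by simple set comparisons over the metadata of the two works. The sketch below is an illustration only: the `Work` record and its fields are hypothetical, authors and institutions are compared as plain strings (real data would need disambiguated identifiers), and the author network and journal cartel cases are left out because they need data beyond the two works themselves.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Work:
    """Hypothetical minimal metadata record for a bibliographic resource."""
    authors: Set[str] = field(default_factory=set)
    affiliations: Set[str] = field(default_factory=set)
    funders: Set[str] = field(default_factory=set)
    journal: Optional[str] = None


def self_citation_types(citing: Work, cited: Work) -> Set[str]:
    """Return the self-citation sub-classes that apply to a citation."""
    types = set()
    if citing.authors & cited.authors:
        types.add("Author self-citation")
    if citing.affiliations & cited.affiliations:
        types.add("Affiliation self-citation")
    if citing.funders & cited.funders:
        types.add("Funder self-citation")
    if citing.journal is not None and citing.journal == cited.journal:
        types.add("Journal self-citation")
    return types


citing = Work(authors={"A", "B"}, funders={"Funder X"}, journal="Journal 1")
cited = Work(authors={"B", "C"}, funders={"Funder Y"}, journal="Journal 1")
print(sorted(self_citation_types(citing, cited)))
```

An empty result means only that none of these four checks matched; it does not by itself establish a distant citation.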

New object properties

  • has citing document: The bibliographic resource which acts as source for the citation.
  • has cited document: The bibliographic resource which acts as target for the citation.

New data properties

  • has citation creation date: The date on which the citation was created. This has the same numerical value as the publication date of the citing bibliographic resource, but is a property of the citation itself. When combined with the citation time span, it permits that citation to be located in history.
  • has citation time span: The temporal characteristic of a citation, namely the interval between the publication date of the cited entity and the publication date of the citing entity.
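
As an illustration of the time span, the sketch below derives an ISO 8601 duration (the lexical form used by xsd:duration) from the two publication dates. It is a simplification under stated assumptions: the function name is hypothetical, any day component is ignored, and the result is given in years when either date is known only to the year, otherwise in years and months.

```python
def citation_time_span(citing_date: str, cited_date: str) -> str:
    """Interval from the cited to the citing publication date as a duration.

    Dates may be 'yyyy', 'yyyy-mm' or 'yyyy-mm-dd'; days are ignored here.
    """
    citing = [int(p) for p in citing_date.split("-")][:2]
    cited = [int(p) for p in cited_date.split("-")][:2]

    if len(citing) == 2 and len(cited) == 2:
        # Both months are known: work in whole months, then split into Y and M.
        months = (citing[0] - cited[0]) * 12 + (citing[1] - cited[1])
        sign = "-" if months < 0 else ""
        years, rest = divmod(abs(months), 12)
        return f"{sign}P{years}Y{rest}M"

    # At least one date is year-only: fall back to year precision.
    years = citing[0] - cited[0]
    sign = "-" if years < 0 else ""
    return f"{sign}P{abs(years)}Y"


print(citation_time_span("2016", "2006"))
print(citation_time_span("2016-05-20", "2006-03-01"))
```

If the cited work was published after the citing one, the span is written with a leading minus sign, which xsd:duration permits.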

Changes to CiTO, the Citation Typing Ontology

To complement these additions to the OpenCitations Data Model, and to permit these richer characteristics of citations to be encoded in RDF, we have additionally made the following changes to CiTO, the Citation Typing Ontology.

New classes

The class cito:SelfCitation has been renamed cito:AuthorSelfCitation, with an unchanged definition (“a citation in which the citing and the cited entities have at least one author in common”).

A new class cito:SelfCitation has been created, with the same more general definition as that of the corresponding class in the OCDM (“a citation in which the citing and the cited entities have something significant in common with one another”). In CiTO, this now has five sub-classes:

  • cito:AuthorSelfCitation
  • cito:JournalSelfCitation
  • cito:FunderSelfCitation
  • cito:AffiliationSelfCitation
  • cito:AuthorNetworkSelfCitation

with the definitions given above for these sub-classes in the OCDM.

New object properties

To complement the OCDM properties, we have within CiTO the following object properties:

  • cito:hasCitedEntity (“A property that relates a citation to the cited entity”) and
  • cito:hasCitingEntity (“A property that relates a citation to the citing entity”).

CiTO also has the following relevant object property:

  • cito:sharesPublicationVenueWith

with the sub-property cito:sharesJournalWith.

New data properties

To match the additions in the OCDM, we have added these new data properties to CiTO, which have the same definitions as those in the OCDM:

  • cito:hasCitationCreationDate
  • cito:hasCitationTimeSpan.

In addition, the class cito:AuthorNetworkSelfCitation is accompanied by the new data property:

  • cito:hasCoAuthorshipCitationLevel

which specifies the minimal distance that one of the authors of a citing entity has with regard to one of the authors of a cited entity according to their co-author network. For instance, a citation has a co-authorship citation level equal to 1 if at least one author of the citing entity has previously published as co-author with one of the authors of the cited entity. Similarly, we say that a citation has a co-authorship citation level equal to 2 if at least one author of the citing entity has previously published as co-author with someone who has in turn previously published as co-author with one of the authors of the cited entity. And so on.
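
This minimal distance is a shortest-path length in the co-author network, so it can be computed with a breadth-first search. The sketch below is one possible way of doing so, not a description of how the value is computed for CiTO or the OCC. The network is assumed to be a dictionary mapping each author to the set of people they have previously co-authored with (listed in both directions), and returning 0 for a shared author is a convention of this sketch.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Set


def co_authorship_citation_level(
    citing_authors: Iterable[str],
    cited_authors: Iterable[str],
    co_authors: Dict[str, Set[str]],
) -> Optional[int]:
    """Shortest co-authorship distance between any citing and any cited author.

    Returns None when the two groups of authors are not connected.
    """
    targets = set(cited_authors)
    seen = set(citing_authors)
    queue = deque((author, 0) for author in seen)

    while queue:
        author, level = queue.popleft()
        if author in targets:
            return level  # breadth-first order guarantees this is minimal
        for neighbour in co_authors.get(author, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, level + 1))
    return None


network = {"Ann": {"Carl"}, "Carl": {"Ann", "Bea"}, "Bea": {"Carl"}}
print(co_authorship_citation_level(["Ann"], ["Bea"], network))
```

Here Ann has published with Carl, and Carl with Bea, so a citation from Ann's paper to Bea's paper has level 2.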

Describing a citation in RDF

Describing a citation between two articles in RDF as a simple link is straightforward but relatively uninformative:

@prefix cito: <http://purl.org/spar/cito/> .

<https://w3id.org/oc/corpus/br/1>
      cito:cites
          <https://w3id.org/oc/corpus/br/18> .

The alternative RDF description of a citation as a first-class data entity could include the following triples (omitting any provenance information in this example), where br/1 and br/18 are the internal identifiers for the citing bibliographic resource and the cited bibliographic resource within the OpenCitations Corpus:

@prefix cito: <http://purl.org/spar/cito/> .
@prefix datacite: <http://purl.org/spar/datacite/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://w3id.org/oc/virtual/ci/1-18> a cito:Citation ;
     cito:hasCitingEntity <https://w3id.org/oc/corpus/br/1> ;
     cito:hasCitedEntity <https://w3id.org/oc/corpus/br/18> ;
     cito:hasCitationCreationDate "2016"^^xsd:gYear ;
     cito:hasCitationTimeSpan "P10Y"^^xsd:duration ;
     datacite:hasIdentifier <https://w3id.org/oc/virtual/id/ci-1-18> .
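
For readers who want to produce such descriptions programmatically, the sketch below assembles the same six statements as a Turtle string. The function is hypothetical and simply follows the URI patterns visible in the example above; it assumes a year-only creation date and does not emit the prefix declarations.

```python
CORPUS_BR = "https://w3id.org/oc/corpus/br/"
VIRTUAL = "https://w3id.org/oc/virtual/"


def citation_turtle(citing_br: str, cited_br: str, year: str, time_span: str) -> str:
    """Return a Turtle description of a citation between two OCC resources."""
    pair = f"{citing_br}-{cited_br}"
    lines = [
        f"<{VIRTUAL}ci/{pair}> a cito:Citation ;",
        f"     cito:hasCitingEntity <{CORPUS_BR}{citing_br}> ;",
        f"     cito:hasCitedEntity <{CORPUS_BR}{cited_br}> ;",
        f'     cito:hasCitationCreationDate "{year}"^^xsd:gYear ;',
        f'     cito:hasCitationTimeSpan "{time_span}"^^xsd:duration ;',
        f"     datacite:hasIdentifier <{VIRTUAL}id/ci-{pair}> .",
    ]
    return "\n".join(lines)


print(citation_turtle("1", "18", "2016", "P10Y"))
```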

The meaning of “virtual” in the URI of this citation is explained in the following blog post about the OpenCitations Data Model.

The following diagram prepared by Silvio Peroni shows the semantic relationships for a citation currently handled by the OpenCitations Corpus (omitting the sub-classes of the class cito:Citation).  Explanation of OCI, the Open Citation Identifier, is given in a subsequent post.

References

[1]     Matthew L. Wallace, Vincent Larivière and Yves Gingras (2012). A Small World of Citations? The Influence of Collaboration Networks on Citation Practices.  PLoS ONE 7(3): e33339. https://doi.org/10.1371/journal.pone.0033339

[2]     Philippe Mongeon, Ludo Waltman and Sarah de Rijcke (2016). What do we know about journal citation cartels? A call for information.  CWTS blog post. Available at https://www.cwts.nl/blog?article=n-q2w2b4

[3]     Ludo Waltman and Caspar Chorus (2016). Journal self-citations are increasingly biased toward impact factor years. CWTS blog post. Available at https://www.cwts.nl/blog?article=n-q2x264
