Using the ORCID Public API for author disambiguation in the OpenCitations Corpus

Among the external services used, the ORCID Public API is crucial for the task of author disambiguation. During the OCC ingestion workflow, the main metadata of an article are usually retrieved from the Crossref API. Although the JSON schema that Crossref uses for its API responses includes a field for the ORCID of each author of an article, this field is usually empty, because publishers commonly do not include this information in the data they provide. We therefore routinely use the ORCID Public API to try to retrieve ORCIDs for all authors and editors named in the Crossref metadata for a given DOI.

The process is organised as follows. Once Crossref has returned the metadata for an article, we call the ORCID Public API and search for ORCIDs that are associated both with that particular DOI and with the family name of any of the authors and editors (‘agents’) listed by Crossref. For instance, using the Crossref metadata about the article with DOI “10.1108/jd-12-2013-0166” (API call: https://api.crossref.org/works/10.1108/jd-12-2013-0166), we extract all the agents’ family names and call the ORCID Public API as follows:

https://pub.orcid.org/v2.1/search?q=(doi-self:10.1108/JD-12-2013-0166%20OR%20doi-self:10.1108/jd-12-2013-0166)%20AND%20(family-name:Peroni%20OR%20family-name:Dutton%20OR%20family-name:Gray%20OR%20family-name:Shotton)
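
As an illustration, a query URL of this shape could be assembled as in the following Python sketch. The function name `build_orcid_search_url` is hypothetical, and generating an upper-case and a lower-case variant of the DOI is an assumption inferred from the example URL above rather than a documented detail of the OCC software. Family names containing spaces or characters that are special in the search syntax would need extra escaping, which is not handled here.

```python
from urllib.parse import quote

# Base of the ORCID Public API search endpoint, as used in the example above.
ORCID_SEARCH_URL = "https://pub.orcid.org/v2.1/search?q="


def build_orcid_search_url(doi, family_names):
    """Build a search URL matching the DOI AND any of the family names.

    Hypothetical helper: the DOI is queried in upper and lower case
    (an assumption based on the example URL in the text).
    """
    doi_clause = " OR ".join(
        f"doi-self:{variant}" for variant in (doi.upper(), doi.lower())
    )
    name_clause = " OR ".join(f"family-name:{name}" for name in family_names)
    query = f"({doi_clause}) AND ({name_clause})"
    # Percent-encode spaces but keep the characters the example URL leaves as-is.
    return ORCID_SEARCH_URL + quote(query, safe="():/")
```

Calling `build_orcid_search_url("10.1108/jd-12-2013-0166", ["Peroni", "Dutton", "Gray", "Shotton"])` reproduces the URL shown above.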

The result that ORCID returns for the query URL above is as follows:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<search:search num-found="2" 
  xmlns:search="http://www.orcid.org/ns/search" 
  xmlns:common="http://www.orcid.org/ns/common">
  <search:result>
    <common:orcid-identifier>
      <common:uri>https://orcid.org/0000-0003-0530-4305</common:uri>
      <common:path>0000-0003-0530-4305</common:path>
      <common:host>orcid.org</common:host>
    </common:orcid-identifier>
  </search:result>
  <search:result>
    <common:orcid-identifier>
      <common:uri>https://orcid.org/0000-0003-1448-3114</common:uri>
      <common:path>0000-0003-1448-3114</common:path>
      <common:host>orcid.org</common:host>
    </common:orcid-identifier>
  </search:result>
</search:search>
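
A response of this kind can be reduced to a plain list of ORCIDs with a few lines of Python. The sketch below uses only the standard library; the function name `extract_orcids` is hypothetical, and the embedded sample is simply the response shown above without its XML declaration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the search response shown above.
NS = {
    "search": "http://www.orcid.org/ns/search",
    "common": "http://www.orcid.org/ns/common",
}

SAMPLE_SEARCH_XML = """
<search:search num-found="2"
  xmlns:search="http://www.orcid.org/ns/search"
  xmlns:common="http://www.orcid.org/ns/common">
  <search:result>
    <common:orcid-identifier>
      <common:uri>https://orcid.org/0000-0003-0530-4305</common:uri>
      <common:path>0000-0003-0530-4305</common:path>
      <common:host>orcid.org</common:host>
    </common:orcid-identifier>
  </search:result>
  <search:result>
    <common:orcid-identifier>
      <common:uri>https://orcid.org/0000-0003-1448-3114</common:uri>
      <common:path>0000-0003-1448-3114</common:path>
      <common:host>orcid.org</common:host>
    </common:orcid-identifier>
  </search:result>
</search:search>
"""


def extract_orcids(search_xml):
    """Return the ORCID paths of all results, in document order."""
    root = ET.fromstring(search_xml.strip())
    paths = root.findall("search:result/common:orcid-identifier/common:path", NS)
    return [p.text.strip() for p in paths if p.text]


orcids = extract_orcids(SAMPLE_SEARCH_XML)
```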

Then, for each ORCID returned, we call the ORCID Public API again to get the full personal details of the agent with that ORCID. For ORCID “0000-0003-0530-4305”, the call is:

https://pub.orcid.org/v2.1/0000-0003-0530-4305/personal-details

The result of this query is as follows:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<personal-details:personal-details
  path="/0000-0003-0530-4305/personal-details"
  xmlns:personal-details="http://www.orcid.org/ns/personal-details"
  ...>
  <personal-details:name 
    visibility="public" path="0000-0003-0530-4305">
    ...
    <personal-details:given-names>
      Silvio
    </personal-details:given-names>
    <personal-details:family-name>
      Peroni
    </personal-details:family-name>
  </personal-details:name>
  ...
</personal-details:personal-details>
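
The given and family names can be read from such a response as in the following sketch. The function name `parse_personal_details` is hypothetical, and the embedded sample is a trimmed version of the response above in which the elided parts are simply left out so that the XML is well-formed. Whitespace around the names is stripped, since the element text may span several lines.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the personal-details response shown above.
PD_NS = {"pd": "http://www.orcid.org/ns/personal-details"}

# Trimmed sample: the parts elided in the text are omitted, not reconstructed.
SAMPLE_DETAILS_XML = """
<personal-details:personal-details
  path="/0000-0003-0530-4305/personal-details"
  xmlns:personal-details="http://www.orcid.org/ns/personal-details">
  <personal-details:name
    visibility="public" path="0000-0003-0530-4305">
    <personal-details:given-names>
      Silvio
    </personal-details:given-names>
    <personal-details:family-name>
      Peroni
    </personal-details:family-name>
  </personal-details:name>
</personal-details:personal-details>
"""


def parse_personal_details(details_xml):
    """Return (given_names, family_name); missing values become ''."""
    root = ET.fromstring(details_xml.strip())
    given = root.findtext("pd:name/pd:given-names", default="", namespaces=PD_NS)
    family = root.findtext("pd:name/pd:family-name", default="", namespaces=PD_NS)
    return given.strip(), family.strip()


given_names, family_name = parse_personal_details(SAMPLE_DETAILS_XML)
```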

Two alternative situations are then possible:

If the OpenCitations Corpus already holds the personal details and ORCID of that agent, we associate the existing agent with the new bibliographic resource identified by the input DOI.
Otherwise, we create a new agent record with that ORCID as an external identifier, specified by means of the DataCite Ontology, and we associate this new agent with the new bibliographic resource identified by the input DOI.

This process is repeated for all ORCIDs associated with that DOI.
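
The reuse-or-create decision described above can be sketched as follows. This is only an illustration under simplifying assumptions: the corpus is modelled as an in-memory dictionary keyed by ORCID, whereas the actual OpenCitations Corpus stores RDF data and records the ORCID by means of the DataCite Ontology. The names `Agent` and `link_agent` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Simplified stand-in for an agent record in the corpus."""
    orcid: str
    given_names: str
    family_name: str
    # DOIs of the bibliographic resources this agent is associated with.
    resources: set = field(default_factory=set)


def link_agent(corpus, orcid, given_names, family_name, doi):
    """Reuse the agent already recorded for this ORCID, or create a new one,
    then associate it with the resource identified by the DOI."""
    agent = corpus.get(orcid)
    if agent is None:
        # Not yet recorded: create a new agent with the ORCID as identifier.
        agent = Agent(orcid, given_names, family_name)
        corpus[orcid] = agent
    agent.resources.add(doi)
    return agent


# Usage: the second call finds the agent created by the first one.
# "10.1234/example" is a made-up DOI used only for illustration.
corpus = {}
first = link_agent(corpus, "0000-0003-0530-4305", "Silvio", "Peroni",
                   "10.1108/jd-12-2013-0166")
second = link_agent(corpus, "0000-0003-0530-4305", "Silvio", "Peroni",
                    "10.1234/example")
```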

Software reuse in different applications

While the OCC ingestion workflow explained above governs the ingestion of new citation data directly into the OpenCitations Corpus, the software library that implements it is generic, and is being reused in another application that we have recently released as a prototype, namely BCite (sources available on GitHub). BCite is a Web application for users such as journal editors. Starting from the ‘raw’ reference text strings supplied by the author as items in an article’s reference list, it returns ‘clean’, verified and enriched bibliographic reference text strings for inclusion in the reference list of the citing article, so that accurate rather than erroneous references can be published in the version of record. At the same time, these references are transformed into RDF data compliant with the OpenCitations Data Model, including ORCIDs where available, thereby (in principle, although not yet in practice) permitting inclusion of the metadata for these cited works, and of the citations for which they are the targets, in the OpenCitations Corpus itself.
