As previously described, the PubMed Central Open Access subset of journal articles yielded 6,529,815 independent bibliographic records of both citing and cited entities, while our use of the PubMed Entrez API provided a further 2,304,143 bibliographic records for the same cited entities. Before converting these references into RDF to create the Open Citations Corpus, we attempted to remove errors in the data.
Some of the references we collected were to highly cited papers, while 2,505,879 referenced papers were cited only once. Figure 1 shows the number of citations per paper for the 100 most highly cited papers in our records – the left-hand end of what is a classic long-tail dataset.
We have not yet analysed the topics of these papers, but can reveal that the paper most highly cited from within the OASS, with 2150 citations, is
Altschul et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389-3402. doi: 10.1093/nar/25.17.3389.
In an ideal world, all OASS references to an individual paper would be identical, and would exactly match the data on that paper extracted from the Entrez API. However, as we have already seen for author names, this is not the case. As with most datasets in the world, a significant proportion (~1%) of our input reference data is either incomplete or erroneous.
We attempted to correct these errors by comparing references that appeared to cite the same bibliographic entity, and from this comparison extracting the correct data for authors, title, etc. using the following rules:
- Accepting the longest author list, and names bearing accents over those lacking them.
- Accepting DOIs and PubMed IDs from references that have them, after eliminating mal-formed identifiers (e.g. DOIs lacking the “10.****” prefix), and using a majority vote if different identifiers were given for the same paper.
- Accepting those variants of titles, journal names, etc. held in common by the majority of references.
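The rules above can be sketched in code. The following is a simplified illustration, not the actual Open Citations Corpus implementation; the field names (`authors`, `doi`, `pmid`, `title`) are assumptions made for the example.

```python
from collections import Counter

def merge_references(refs):
    """Merge duplicate references to the same paper into one record,
    applying the selection rules described above.

    Each ref is a dict that may contain 'authors' (a list of names),
    'doi', 'pmid', and 'title'. Missing fields are simply absent.
    """
    merged = {}

    # Rule 1: accept the longest author list on offer.
    author_lists = [r['authors'] for r in refs if r.get('authors')]
    if author_lists:
        merged['authors'] = max(author_lists, key=len)

    # Rule 2: accept identifiers by majority vote, first discarding
    # mal-formed DOIs (a well-formed DOI starts with a "10." prefix).
    dois = [r['doi'] for r in refs
            if r.get('doi') and r['doi'].startswith('10.')]
    if dois:
        merged['doi'] = Counter(dois).most_common(1)[0][0]

    pmids = [r['pmid'] for r in refs if r.get('pmid')]
    if pmids:
        merged['pmid'] = Counter(pmids).most_common(1)[0][0]

    # Rule 3: accept the title variant held by the majority of references.
    titles = [r['title'] for r in refs if r.get('title')]
    if titles:
        merged['title'] = Counter(titles).most_common(1)[0][0]

    return merged
```

Given three references to the same paper – one with a truncated author list, one with a DOI missing its “10.” prefix, one with a misspelt title – this function recovers the longest author list, the valid DOI, and the majority title.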
This voting method was weighted in favour of data we judged to be most reliable, namely the PubMed records returned from the Entrez API, and metadata about cited papers that were within the OASS coming from the independent bibliographic records we had for those papers.
As a result of these activities, we coalesced the independent references from different OASS articles to the same multiply cited papers into a set of 3,578,598 unique bibliographic records of citation targets, describing 204,637 OASS articles and 3,373,961 articles outside the OASS. In doing so, we were also able to select from the multiple references those elements (author list, title, etc.) judged to be correct for each target.
However, of the 2,505,879 papers that are only cited once from within the OASS, 1,246,967 lacked a PubMed ID, so for these we were unable to gather confirmatory evidence for the accuracy of the citation from the Entrez API. These references, which are to the least significant papers in the corpus, are therefore provided “as is” from PubMed Central, without any external corroboration of their accuracy.
How these error-correction processes fitted into the data processing pipeline used to create the Open Citations Corpus is described in the next blog post.