To illustrate three kinds of problems in obtaining correct author lists for Open Citation data from articles in the PubMed Central Open Access subset (OASS), I take three examples: the first results from a publication policy, the second from the mishandling of an authorship attribution at the time of publication, and the third exemplifies errors introduced when handling non-English personal names.
In the paper by Reis et al. (2008), which we took as the subject for our exercise in semantic publishing enhancements described in , we find the following entry for Reference 40 in the reference list:
40. Maciel EAP, Carvalho ALF, Nascimento SF, Matos RB, Gouveia EL, et al. (2008) Household transmission of Leptospira infection in urban slum communities. PLoS Negl Trop Dis 2: e154. doi:10.1371/journal.pntd.0000154.
Note that it is the policy of the publisher, the Public Library of Science, to list only the first five authors in references to papers that have more than five, despite publishing online-only journals where article length is not an issue.
The XML for this reference in the document is as follows:
<meta name="citation_reference" content="citation_title=Household transmission of Leptospira infection in urban slum communities.; citation_author=EAP Maciel; citation_author=ALF Carvalho; citation_author=SF Nascimento; citation_author=RB Matos; citation_author=EL Gouveia; citation_journal_title=PLoS Negl Trop Dis; citation_volume=2; citation_number=40; citation_pages=e154. doi:10.1371/journal.pntd.0000154; citation_date=2008; " />
Note that in the XML, all indication that there are more than five authors, i.e. the “et al.” present in the human-readable reference, is totally lost. There is thus no way of telling from the XML for this paper retrieved from PubMed Central that the full authorship for this paper is as follows:
Elves A. P. Maciel, Ana Luiza F. de Carvalho, Simone F. Nascimento, Rosan B. de Matos, Edilane L. Gouveia, Mitermayer G. Reis and Albert I. Ko.
The last two authors of the cited paper, Mitermayer Reis and Albert Ko, who are the lead author and the senior author, respectively, of the citing paper, are both omitted from the PLoS reference to the cited paper, and hence from the data automatically extracted by Open Citations from the OASS.
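As a rough illustration of what an automated extractor sees, the sketch below parses the `content` string of the `citation_reference` element shown above and flags a possible truncation. It assumes that field values never contain a semicolon, and the function names `parse_citation_reference` and `may_be_truncated` are hypothetical, not part of the Open Citations code. The five-author check is only a heuristic based on the PLoS policy described above.

```python
def parse_citation_reference(content: str) -> dict:
    """Split a 'key=value; key=value; ...' string into a dict.

    Repeated citation_author fields are collected into a list.
    Assumption: no field value contains a semicolon.
    """
    record = {"citation_author": []}
    for field in content.split(";"):
        key, sep, value = field.strip().partition("=")
        if not sep:
            continue  # skip the empty fragment after the final ';'
        if key == "citation_author":
            record["citation_author"].append(value)
        else:
            record[key] = value
    return record


def may_be_truncated(record: dict, cap: int = 5) -> bool:
    """Heuristic only: exactly `cap` authors may mean the list was cut off."""
    return len(record["citation_author"]) == cap


content = (
    "citation_title=Household transmission of Leptospira infection "
    "in urban slum communities.; "
    "citation_author=EAP Maciel; citation_author=ALF Carvalho; "
    "citation_author=SF Nascimento; citation_author=RB Matos; "
    "citation_author=EL Gouveia; citation_journal_title=PLoS Negl Trop Dis; "
    "citation_volume=2; citation_number=40; "
    "citation_pages=e154. doi:10.1371/journal.pntd.0000154; "
    "citation_date=2008; "
)

record = parse_citation_reference(content)
print(record["citation_author"])
print("possibly truncated:", may_be_truncated(record))
```

A reference that really has exactly five authors would also be flagged, so a flag of this kind can only tell us which records deserve a second lookup elsewhere.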
A second example from the same citing paper is the cited reference to Ko et al. (1999). As the reference at the foot of this page shows, the author list includes five names “and the Salvador Leptospirosis Study Group”. Group attributions of this kind are commonplace, particularly in papers resulting from large collaborative projects. However, conventional markup systems such as the NLM-DTD have no systematic way of handling such information. Surprisingly, the attribution is even stated incorrectly in the human-readable version of the reference in the Reis et al. paper. Reference 6 in the article’s reference list reads:
thus including “Salvador Leptospirosis Study Group” as part of the title, an error also present in the XML version of the paper:
<meta name="citation_reference" content="citation_title=Urban epidemic of severe leptospirosis in Brazil. Salvador Leptospirosis Study Group.; citation_author=AI Ko; citation_author=MG Reis; citation_author=CM Ribeiro Dourado; citation_author=WD Johnson; citation_author=LW Riley; citation_journal_title=Lancet; citation_volume=354; citation_number=6; citation_pages=820-825; citation_date=1999; " />
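One way to catch a group attribution that has leaked into a title is to look for a trailing sentence that ends in a typical group designation. The sketch below is only a heuristic under that assumption: the list of suffixes ("Study Group", "Consortium", and so on) is my own guess at common patterns, and `split_group_from_title` is a hypothetical helper name.

```python
import re

# Matches "<title sentence>. <group name ending in a known suffix>."
# The suffix list is an assumption and would need extending in practice.
GROUP_IN_TITLE = re.compile(
    r"^(?P<title>.+?[.?!])\s+"
    r"(?P<group>[^.?!]*\b(?:Study Group|Consortium|Collaboration|Investigators))"
    r"\.?$"
)


def split_group_from_title(title: str):
    """Return (title, group); group is None when no group suffix is found."""
    match = GROUP_IN_TITLE.match(title.strip())
    if match:
        return match.group("title"), match.group("group")
    return title, None


title, group = split_group_from_title(
    "Urban epidemic of severe leptospirosis in Brazil. "
    "Salvador Leptospirosis Study Group."
)
print(title)
print(group)
```

A title that legitimately ends with one of these phrases would be split wrongly, so matches are better treated as candidates for review than as automatic corrections.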
The third example, also taken from the reference list of Reis et al. (2008), illustrates the problems of handling non-English names and titles. Reference 39 in the PLoS article’s reference list reads:
Dias JP, Teixeira MG, Costa MC, Mendes CM, Guimaraes P, et al. (2007) Factors associated with Leptospira sp infection in a large urban center in Northeastern Brazil. Rev Soc Bras Med Trop 40: 499–504.
Note that the author list is truncated, that the generic name “Leptospira sp” in the title is italicized, and that neither the DOI nor the PubMed ID is provided, although the article has both.
The landing page for this article on the publisher’s web site (http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0037-86822007000500002&lng=en&nrm=iso&tlng=en) shows the following:
Revista da Sociedade Brasileira de Medicina Tropical
Print version ISSN 0037-8682
Rev. Soc. Bras. Med. Trop. vol.40 no.5 Uberaba Oct. 2007
doi: 10.1590/S0037-86822007000500002 ARTIGO ARTICLE
Fatores associados à infecção por Leptospira sp em um grande centro urbano do Nordeste do Brasil
Juarez Pereira DiasI; Maria Glória TeixeiraI; Maria Conceição Nascimento CostaI; Carlos Maurício Cardeal MendesI; Patrícia GuimarãesII; Mitermayer Galvão ReisII; Albert KoII,III; Maurício Lima BarretoI
The paper is published in English. Note that there is an alternative Portuguese title, that the generic name “Leptospira sp” is italicized in both, and that a DOI is provided, although page numbers are not given for this online version of the article. Note also the accents and structures of the full Brazilian author names.
Clicking on the tab Article in PDF format on the landing page takes one to the PDF download page that gives the following reference with page numbers, but lacking English title and DOI, and lacking the full author list:
DIAS, Juarez Pereira et al. Factors associated with Leptospira sp infection in a large urban center in northeastern Brazil. Rev. Soc. Bras. Med. Trop. [online]. 2007, vol.40, n.5, pp. 499-504. ISSN 0037-8682.
A manual search for the same article in PubMed returns the following information:
Rev Soc Bras Med Trop. 2007 Sep-Oct;40(5):499-504.
Factors associated with Leptospira sp infection in a large urban center in northeastern Brazil.
Note the correct accentuation of the surname “Guimarães”, but the loss of the last given-name initial for Maria Conceição Nascimento Costa, and the loss of italicization of the generic name “Leptospira sp” in the title. Note also the absence of the Portuguese title and the DOI, and the addition of a PubMed ID.
The real problems with this reference arise when we look at the starting XML corpus upon which our linked open citation data output is based, which in turn is based on the original PLoS submission of the Reis et al. (2008) paper to PubMed Central.
The author list for this reference #39 in the PMC XML for Reis et al. (2008) is as follows:
Here we see, as in the HTML reference list of the original Reis et al. (2008) paper, the truncation of the list of authors to the first five, and the loss of accent on the surname “Guimarães”. Surprisingly, the title is recorded as
<article-title>Factors associated with <italic>Leptospira</italic> sp infection in a large urban center in Northeastern Brazil.</article-title>
correctly showing the italic “Leptospira” but also correctly not italicizing the following “sp”!
All is not lost, however, in terms of the full author list. Since the OASS XML for Reis et al. (2008) contains a PubMed ID for this Dias et al. (2007) paper:
we can retrieve the PubMed bibliographic record for this paper by querying the Entrez API, from which we recover:
<Item Name="AuthorList" Type="List">
<Item Name="Author" Type="String">Dias JP</Item>
<Item Name="Author" Type="String">Teixeira MG</Item>
<Item Name="Author" Type="String">Costa MC</Item>
<Item Name="Author" Type="String">Mendes CM</Item>
<Item Name="Author" Type="String">Guimarães P</Item>
<Item Name="Author" Type="String">Reis MG</Item>
<Item Name="Author" Type="String">Ko A</Item>
<Item Name="Author" Type="String">Barreto ML</Item>
</Item>
<Item Name="LastAuthor" Type="String">Barreto ML</Item>
Note that here we have the full author list containing the correctly accented surname “Guimarães”.
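For readers who want to try this, the following minimal sketch pulls the author names out of an Entrez summary fragment like the one above using Python's standard `xml.etree.ElementTree`. It works on a local string rather than calling the Entrez service, the enclosing `<DocSum>` element is added here only so that the fragment has a single root, and `authors_from_docsum` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the Entrez output above; the <DocSum> wrapper is
# an assumption added so the fragment parses as one well-formed document.
ESUMMARY_FRAGMENT = """
<DocSum>
  <Item Name="AuthorList" Type="List">
    <Item Name="Author" Type="String">Dias JP</Item>
    <Item Name="Author" Type="String">Teixeira MG</Item>
    <Item Name="Author" Type="String">Costa MC</Item>
    <Item Name="Author" Type="String">Mendes CM</Item>
    <Item Name="Author" Type="String">Guimarães P</Item>
    <Item Name="Author" Type="String">Reis MG</Item>
    <Item Name="Author" Type="String">Ko A</Item>
    <Item Name="Author" Type="String">Barreto ML</Item>
  </Item>
  <Item Name="LastAuthor" Type="String">Barreto ML</Item>
</DocSum>
"""


def authors_from_docsum(xml_text: str) -> list:
    """Return the text of every Author item inside the AuthorList item."""
    root = ET.fromstring(xml_text.strip())
    path = "./Item[@Name='AuthorList']/Item[@Name='Author']"
    return [item.text for item in root.findall(path)]


entrez_authors = authors_from_docsum(ESUMMARY_FRAGMENT)
print(len(entrez_authors), "authors:", entrez_authors)
```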
Thus, by matching this record with the original PLoS record for Dias et al., and selecting the longer list and the more accentuated names, we can correct the omissions in the original PubMed Central OASS data.
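The matching rule just described, "take the longer list and the more accentuated names", might be sketched as follows. This is an illustration rather than our production code: it assumes both lists use the same "Surname Initials" format and the same author order, and the helper names are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks, e.g. 'Guimarães' -> 'Guimaraes'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_non_ascii(text: str) -> int:
    """Crude measure of how 'accentuated' a name is."""
    return sum(1 for ch in text if ord(ch) > 127)


def merge_author_lists(first: list, second: list) -> list:
    """Start from the longer list; at each position prefer the variant
    with more accented characters when both names agree once accents
    are stripped. Assumes both lists share format and author order."""
    longer, shorter = (first, second) if len(first) >= len(second) else (second, first)
    merged = []
    for index, name in enumerate(longer):
        if index < len(shorter):
            other = shorter[index]
            same_name = strip_accents(other).casefold() == strip_accents(name).casefold()
            if same_name and count_non_ascii(other) > count_non_ascii(name):
                name = other
        merged.append(name)
    return merged


pmc_authors = ["Dias JP", "Teixeira MG", "Costa MC", "Mendes CM", "Guimaraes P"]
entrez_authors = ["Dias JP", "Teixeira MG", "Costa MC", "Mendes CM",
                  "Guimarães P", "Reis MG", "Ko A", "Barreto ML"]
print(merge_author_lists(pmc_authors, entrez_authors))
```

Comparing names position by position keeps the sketch short, but it would miss cases where one source reorders or drops an author in the middle of the list.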
However, Entrez output is not infallible. In the Entrez XML output for reference #44 in Reis et al. (2008), namely the paper by Travassos and Williams (2004), we find:
<Item Name="doi" Type="String">/S0102-311X2004000300003</Item>
The correct DOI for this article is doi:10.1590/S0102-311X2004000300003. However, the Entrez output from PubMed is missing the DOI prefix “10.1590”. When converted to its URI form during our automated processing of the source data from XML to RDF, this truncated value would become
if we did not take steps to check for the correct DOI syntax.
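A syntax check of this kind can be quite small. The sketch below accepts a DOI only if it starts with "10.", a numeric registrant code, and a slash followed by a non-empty suffix. The regular expression is a simplified assumption rather than a full statement of DOI syntax, and the resolver base `https://doi.org/` and the function names are likewise illustrative choices.

```python
import re

# Simplified pattern: "10." + 4-9 digit registrant code + "/" + suffix.
# This is an assumption that covers common DOIs, not the full DOI syntax.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def is_valid_doi(doi: str) -> bool:
    """Return True when the string looks like a complete DOI."""
    return bool(DOI_PATTERN.match(doi))


def doi_to_uri(doi: str, resolver: str = "https://doi.org/"):
    """Return the resolver URI for a DOI, or None if the syntax check fails."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    if not is_valid_doi(doi):
        return None  # refuse to build a URI from a malformed DOI
    return resolver + doi


print(doi_to_uri("/S0102-311X2004000300003"))
print(doi_to_uri("doi:10.1590/S0102-311X2004000300003"))
```

Returning `None` rather than a repaired value is deliberate in this sketch: the missing prefix cannot be guessed from the suffix alone, so the record has to be corrected from another source.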
These few examples, all taken from a single OASS article, illustrate some of the problems we have had to face in creating accurate and reliable RDF to enable us to publish these reference lists as open citation data.
What is shocking to me with regard to PLoS, perhaps the leading Open Access publisher, is threefold. First, it does not systematically include both DOIs and PubMed IDs in both the HTML and XML versions of article references on the PLoS web site, despite inserting PubMed IDs into the records it marks up in NLM-DTD XML and sends to PubMed Central. Second, it persists with its policy of not listing all the authors. Third, it does not include proper accents and diacritical marks, particularly for non-English names.
The methods used to correct citation errors are described in the next blog post, while the data processing pipeline through which we pass the input data to generate our RDF output Open Citations Corpus is described in the following blog post.
Reis RB, Ribeiro GS, Felzemburgh RDM, Santana FS, Mohr S, Melendez SXTO, Queiroz A, Santos AC, Ravines RR, Tassinari WS, Carvalho MS, Reis MG, Ko AI (2008). Impact of environment and social gradient on Leptospira infection in urban slums. PLoS Negl Trop Dis 2(4): e228. doi:10.1371/journal.pntd.0000228.
Ko AI, Reis MG, Ribeiro Dourado CM, Johnson WD Jr, Riley LW and the Salvador Leptospirosis Study Group (1999). Urban epidemic of severe leptospirosis in Brazil. Lancet 354: 820–825. doi:10.1016/S0140-6736(99)80012-9.
Dias JP, Teixeira MG, Costa MC, Mendes CM, Guimarães P, Reis MG, Ko A, Barreto ML (2007). Factors associated with Leptospira sp infection in a large urban center in northeastern Brazil. Rev Soc Bras Med Trop 40(5): 499–504. doi:10.1590/S0037-86822007000500002.
Travassos C, Williams DR (2004). The concept and measurement of race and their relationship to public health: A review focused on Brazil and the United States. Cad Saude Publica 20: 660–678. doi:10.1590/S0102-311X2004000300003.