Libraries and linked data #1: What are linked data?

The first of six blog posts about libraries and linked data, bearing this title, is to be found at

http://semanticpublishing.wordpress.com/2013/03/01/lld1-what-are-linked-data/.

A draft of that post, which erroneously appeared on this blog, has been removed.

 

Posted in Uncategorized | 1 Comment

Open Citations – Indexing PubMed Central OA data

As part of our work on the Open Citations extensions project, I have recently been doing one of my favourite things – namely indexing large quantities of data then exploring it.

On this project we are interested in the PubMed Central Open Access subset, and more specifically, we are interested in what we can do with the citation data contained within the records that are in that subset – because, as they are open access, that citation data is public and freely available.

We are building a pipeline that will enable us to easily import data from the PMC OA and from other sources such as arXiv, so that we can do great things with it like explore it in a facetview, manage and edit it in a bibserver, visualise it, and stick it in the rather cool related-work prototype software. We are building on the earlier work of both the original Open Citations project, and of the Open Bibliography projects.

Work done so far

We have spent a few weeks getting to understand the original project software and clarifying some of the goals the project should achieve; we have put together a design for a processing pipeline to get the data from source right through to where we need it, in the shape that we need it. In the case of facetview / bibserver work, this means getting it into a wonderful elasticsearch index.

While Martyn continues work on the bits and pieces for managing the pipeline as a whole and pulling data from arXiv, I have built an automated and threadable toolchain for unpacking data out of the compressed file format it arrives in from the US National Institutes of Health, parsing the XML file format and converting it into BibJSON, and then bulk loading it into an elasticsearch index. This has gone quite well.
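The parse, convert and bulk-load steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code (which is linked below): the sample XML is a heavily simplified stand-in for a PMC article record, the function names article_to_bibjson and bulk_payload and the index name "occ" are hypothetical, and the output is only a BibJSON-like dictionary. The sketch builds the newline-delimited body that the elasticsearch bulk API expects, but does not send it anywhere.

```python
import json
import xml.etree.ElementTree as ET

# Simplified stand-in for one PMC article record (an assumption for illustration).
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmid">1</article-id>
      <title-group><article-title>An example article</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>"""


def article_to_bibjson(xml_text):
    """Convert one article XML string into a BibJSON-like dict."""
    record = {"type": "article", "identifier": [], "author": []}
    meta = ET.fromstring(xml_text).find("./front/article-meta")
    if meta is None:
        return record
    title = meta.find("./title-group/article-title")
    if title is not None:
        # itertext() also collects text nested inside inline markup.
        record["title"] = "".join(title.itertext()).strip()
    for id_el in meta.findall("article-id"):
        record["identifier"].append(
            {"type": id_el.get("pub-id-type"), "id": (id_el.text or "").strip()}
        )
    for name in meta.findall("./contrib-group/contrib/name"):
        given = name.findtext("given-names", default="")
        surname = name.findtext("surname", default="")
        record["author"].append({"name": f"{given} {surname}".strip()})
    return record


def bulk_payload(records, index_name="occ"):
    """Build a bulk request body: one action line, then one document line."""
    lines = []
    for rec in records:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(rec))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bulk_payload([article_to_bibjson(SAMPLE_XML)]))
```

Batching many records into one bulk body, rather than indexing them one request at a time, is what keeps loading hundreds of thousands of records practical.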

To fully browse what we have so far, check out http://occ.cottagelabs.com.

For the code: https://github.com/opencitations/OpenCitationsCorpus/tree/master/pipeline.

The indexing process

Whilst the toolchain is capable of running threaded, the server we are using only has 2 cores and I was not sure to what extent they would be utilised, so I ran the process single-threaded. It took five hours and ten minutes to build an index of the PMC OA subset, and we now have over 500,000 records. We can full-text search them and facet browse them.

Some things of particular interest that I learnt – I have an article in the PMC OA! And also PMIDs are not always 8 digits long – they appear in fact to be incremental from 1.
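In practice this means that any check on PMIDs should not assume a fixed length. A minimal sketch, assuming a PMID is simply a positive integer written without leading zeros; the function name is_plausible_pmid is hypothetical and not part of the toolchain:

```python
import re

# Assumption: a PMID is a positive integer with no leading zeros, of any length.
_PMID_RE = re.compile(r"[1-9][0-9]*")


def is_plausible_pmid(value):
    """Return True if value looks like a PMID, regardless of digit count."""
    return bool(_PMID_RE.fullmatch(str(value).strip()))
```

Under that assumption, "1" and "12345678" are both accepted, while "0" and "PMC123" are rejected.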

What next

At the moment no effort is made to create record objects for the citations we find within these records; however, plugging that into the toolchain is now relatively straightforward.

The full pipeline is of course still in progress, and so this work will need a wee bit of wiring into it.

Improve parsing. There are probably improvements to the parsing that we can make too, and so one of the next tasks will be to look at a few choice records and decide how better to parse them. The best way to get a look at the records for now is to use a browser like Firefox or Chrome and install the JSONview plugin, then go to occ.cottagelabs.com and have a bit of a search, then click the small blue arrows at the start of a record you are interested in to see it in full JSON straight from the index. Some further analysis on a few of these records would be a great next step, and should allow for improvements to both the data we can parse and to our representation of it.

Finish visualisations. Now that we have a good test dataset to work with, the various bits and pieces of visualisation work will be pulled together and put up on display somewhere soon. These, in addition to the search functionality already available, will enable us to answer the questions set as representative of project goals earlier in January (thanks David for those).

Postscript

David Shotton writes: This productive collaboration between Cottage Labs and the Open Citations Corpus came to an end when Jisc funding ran out.  The corpus has more recently been given a new lease of life, as described here, with a new instantiation named OpenCitations hosted at the Department of Computer Science and Engineering of the University of Bologna, with Silvio Peroni as Co-Director.

Posted in JISC, Open Citations | Leave a comment

The start of something new

Previously, my blog posts relating to semantic publishing have appeared in this Open Citations Blog.

However, because of the merger of the Open Citations Project with the Related Work Project, described here, this Open Citations Blog has been renamed Open Citations and Related Work and has been opened to contributions from others involved in developing Open Citations and Related Work.

It thus makes sense that future blog posts concerned with semantic publishing be given a separate home – a new Semantic Publishing Blog available at http://semanticpublishing.wordpress.com/ – to which previous blog posts on the semantic publishing theme that originally appeared in this Open Citations Blog will be copied.

Those interested should follow both the Open Citations and Related Work Blog and the Semantic Publishing Blog, and may also be interested in a third, more specialized blog concerning the on-line creation of research data management plans, entitled Creating data management plans online, which is available at http://datamanagementplanning.wordpress.com/.

Postscript

David Shotton writes: With the return of Heinrich Hartmann and René Pickhardt to Germany and their involvement in other things, the potential collaboration between the Open Citations Project and their Related Work Project came to nothing.  Separately, the corpus has recently been renamed OpenCitations and given a new lease of life, described here, with Silvio Peroni as Co-Director.  For this reason, this blog was renamed “OpenCitations” on 30th March 2017.

 

Posted in Open Citations, Semantic Publishing | Leave a comment

Why openness benefits research

David writes: Dr Heinrich Hartmann is a new colleague of mine, who, having been working in the Mathematical Institute of Oxford University, has just returned to Germany to start a new job in a leading semantic web research group, that of Steffen Staab at the Institute for Web Science and Technologies, University of Koblenz-Landau.  What follows are our thoughts about research openness, which relate to our decision, described in the next blog post, to merge our bibliographic citation projects.

The following text is jointly authored by David Shotton (david.shotton@zoo.ox.ac.uk) and Heinrich Hartmann (hartmann@uni-koblenz.de)

Transparency is essential for trust and credibility in the research community, and true openness brings great opportunities for academia. The internet facilitates the free flow of information and knowledge, and permits new forms of communication both for researchers and for the general public. Already, today’s children can listen freely on the internet to university courses taught by world-leading scientists, and everybody has the best encyclopaedia ever written (Wikipedia) at their fingertips.  These are real game changers. Opening up the research literature is the next logical step.

Open publishing

We believe that the current academic publishing model – whereby researchers give their content to commercial publishers and then buy it back from them at enormous cost by means of journal subscription fees – has become absurd, since it no longer helps researchers to distribute their findings, but rather prevents the work from being widely read, by hiding it behind subscription paywalls.  Would it not be much better to let this information flow freely, accessible to everybody who wants to read it?

Of course, such a vision of openness for academic publishing raises issues of finance and quality control – who will pay for open access publishing, and how can we ensure that scientific rigor accompanies open publication?  While the internet enables dissemination of information at a fraction of the cost of traditional print publication, publishing clearly involves more than electronic dissemination.  It is for this reason that we, with others, are presently planning a high-level conference on modern scientific communication, entitled Rigor and Openness in 21st Century Science, to be held in Oxford next spring.

However, new publication funding models are being developed, particularly in the United Kingdom, where Research Councils UK and the Wellcome Trust are insisting that papers reporting research results obtained as a result of their research funding should be published under an open Creative Commons CC-By attribution licence when an article processing charge (APC) is levied, so that the works are freely available for text mining and re-use [1].  What is significant is that they are backing their words with funding to enable it.  Cameron Neylon has recently written a commentary in Nature about the importance of this [2].

Furthermore, peer review is being carefully examined by several forward-looking publishers to determine how well open alternatives to the present system of confidential review actually work.

The role of social media in science

Much academic research is done in relative isolation, because topics have become so specialized that there may be only a few experts in the whole world who really understand each particular research problem.  These experts may be located on different continents, and may not know about one another – a situation that is particularly true for Ph.D. students and other young researchers, who may not yet be familiar with the literature in their field, and who may have formed few personal relationships with colleagues in other institutions through attendance at research conferences.  New forms of academic social media can play a role here, to catalyse interactions between geographically separated academics, and many experiments in this area are being conducted.

Academic social media can also play an important role in filtering the wealth of new articles published every day, and in alerting people to the small fraction of these that are most relevant to them.  Typically, junior researchers rely on recommendations from friends and colleagues about which articles are worth reading, but if academic social media can be used to broaden this recommendation network, they will provide a significant service.

Fears and benefits of openness

Of course researchers, particularly early in their careers, are cautious about sharing their discoveries too early or too widely, for fear they may get ‘scooped’, since they naturally and quite properly wish to obtain credit for their own work by being the first to publish it.  However, what is often missed by people of this mind-set is that working openly with other people can have benefits too.  It can be a lot more fun, can lead to more sustainable motivation, can result in incredibly rapid collaborative progress, and hence can often lead to better results.  An essential pre-requisite for this is a willingness to share one’s ideas and to make contact with like-minded people.  An example of a researcher who practises openness in his day-to-day research is Giorgio Gilestro, Lecturer in Systems Neurobiology in the Department of Life Sciences at Imperial College London, who publishes his research group’s Open Lab Book online.

Our personal experience, not least in the joint Open Citations and Related Work developments described in the next blog post, is that you gain more than you lose by being open!

References

[1] Wellcome Trust announcement: Open access: CC-BY licence required for all articles which incur an open access publication fee – FAQ. Available from http://www.wellcome.ac.uk/stellent/groups/corporatesite/@policy_communications/documents/web_document/WTVM055715.pdf.

[2] Cameron Neylon (2012). Science publishing: Open access must enable open use.  Nature 492: 348–349.  doi:10.1038/492348a.

Posted in JISC, Open Citations | 3 Comments

Open letter to publishers

[The text of this post was updated on 27-09-2013 and 04-04-2017 to reflect a new CrossRef metadata best practice document and a change in their URI.]

Today I wrote an open letter to all scholarly journal publishers, available online here, entitled:

Open your article reference lists for inclusion in the Open Citations Corpus.

In this letter, I request that publishers open the bibliographic citation data in their journal article reference lists.  There is a growing movement to make such bibliographic citation data open – for example, Nature Publishing Group’s open Linked Data Platform now includes citation metadata for all published article references.

Provided a publisher is already depositing article references with CrossRef as part of the CrossRef CitedBy Linking service, all the publisher needs to do is inform CrossRef that it is willing for CrossRef to freely distribute these references, for example in response to queries against the CrossRef XML API.  We will then harvest them from CrossRef and incorporate them as open linked data in the Open Citations Corpus.

Nature Publishing Group, Taylor & Francis, the American Association for the Advancement of Science (who publish Science) and Oxford University Press, as well as a number of open-access publishers, have already given their consent to CrossRef to do this for some or all of their journals.

If not already a subscriber to the CrossRef CitedBy Linking service, a publisher can register for this useful service free of charge.  Having done so, there is nothing further the publisher needs to do to ‘open’ its reference data, other than to give its consent to CrossRef.  This can be done automatically, in the submitted article metadata, or (for back numbers) by informing CrossRef directly.

Even Open Access publishers, publishing articles under a CC-By open licence, need to give this specific permission to CrossRef for this to occur, because CrossRef policy is that all publishers, including open access publishers, have to opt in to any distribution of references that CrossRef makes.

For new submissions, publishers should follow instructions detailed in the CrossRef blog at https://www.crossref.org/blog/distributing-references-via-crossref/, which contains the following key instruction:

“In order for publishers to distribute references along with standard bibliographic metadata, publishers need to set the <reference_distribution_opt> metadata element to “any” for each DOI deposit where they want to make references openly available.”

In this way, publishers can choose to open the reference lists for all their journals, or to do so on a journal-by-journal or on an article-by-article basis (useful for ‘hybrid’ subscription-access journals in which only some articles are open access).

To open reference lists for back numbers, a publisher needs to e-mail CrossRef to express its intent, using the template shown at the foot of this post, as detailed in my Open Letter to Publishers.

I have copied this open letter to the CEOs of the Open Access Scholarly Publishers Association (OASPA), of the Association of Learned and Professional Society Publishers (ALPSP), and of the International Association of Scientific, Technical & Medical Publishers (STM), asking them to distribute it to their members, perhaps in association with their next Members News Letter, as CrossRef itself is planning to do later this month.

Please spread the word about this, particularly to publishers who may not be members of these professional associations.  Thanks.

= = =

Template for an e-mail to CrossRef expressing willingness to open reference lists in previously published and future journal articles.

To support@crossref.org

I am writing on behalf of *** [name of publisher] to confirm that *** [name of publisher] is willing for the bibliographic reference lists within the articles in [delete as necessary:] all our journals [or] the attached list of journals to be made freely available by CrossRef, for inclusion in the Open Citations Corpus. These journals are associated with the following DOI prefix(es): 10.**** [Please complete DOI prefix(es) – see footnote].
Yours sincerely [name, position, date]
= = =
Footnote: Publishers’ DOI prefixes are listed at http://www.crossref.org/06members/50go-live.html by name of publisher.

 

Posted in JISC, Open Citations, Semantic Publishing | 16 Comments