Why researchers don’t publish data

Evidence submitted by David Shotton in response to the Royal Society’s Policy Study “Science as a Public Enterprise” Call for Evidence, addressing the following two topics raised by that call:

Getting Researcher buy-in. How do we get researchers to be more willing to share data? What is there to be learned from disciplines such as genomics which have norms which favour wide sharing of data?

Ensuring we generate useful metadata. For open data to be useful, it needs to be sufficiently well described. The researchers creating the dataset are in the best position to create the metadata; but as things stand, the incentives for them to do a thorough job of this are not always very strong. Do we need to change incentives?

= = =

“I guess I have been invited to contribute evidence to the evidence session on digital curation at the Royal Society on 5th August 2011 to present the view from the shop floor – or rather from the laboratory bench.  I would like to mention three pressures that presently combine to prevent researchers from publishing their data.

Pressure one: Information volume

When I started research, you could, if you were very fortunate (as I was), solve a protein structure to low resolution within six months and to medium resolution within three or four years, and you could hope to know something about all the protein structures that had so far been determined.  Today, you can collect the crystallographic structure factor data for a new protein in a few minutes at the Diamond Light Source, and can compute its 3D structure on your laptop during the train ride home. PDB currently contains the structures of about 74,000 macromolecules, and you are unlikely to know the structures of more than a handful of these.

Looking at the same problem from a different perspective, PubMed currently receives a million articles per year.  If you imagine there might be a thousand biomedical specialisms (if you slice the salami thinly enough), then as a specialist you can expect on average twenty new papers in your field each week, an impossible number to carve out time for from your other activities if you wish to read them properly.

Thus you will never catch up – there is just too much scientific information around now.  You would like to know about it all, to keep abreast of your field, but the task is impossible.  Researchers are thus under overwhelming pressure, and have to run just to stand still.  They have no spare time to undertake data curation activities for which they receive little or no academic reward in terms of peer esteem, tenure or promotion.

Pressure two: Institutional pressures

The principal pressures researchers are under from their departments and institutions are (a) to win grants and (b) to publish in high impact journals, because these things influence departmental income both (a) directly through full economic costs from funding agencies, and (b) through high RAE/REF scores that in England determine funding from HEFCE.  From the viewpoint of a Head of Department trying to establish or maintain his department’s reputation and financial health, nothing else matters.  I have known these factors as the deciding ones in academic appointments.  Nobler concepts of scientific excellence and of scientific altruism in the form of data publication become submerged beneath these pressures.

Pressure three: Cognitive overheads of data management

Appropriate ontologies and technical infrastructures for data preservation increasingly exist, but the concepts surrounding metadata creation, repository deposit and data accessibility are foreign to most biomedical researchers, leading to cognitive and skill barriers that prevent them from undertaking routine best-practice data management.

Put crudely, the large amount of effort involved in preparing data for publication, coupled with the negligible incentives and rewards, prevents researchers in most biomedical specialisms from doing so.

Having said that, research scientists are perfectly able to provide structured metadata when it is necessary to do so.  With the switch to on-line journal article submission, publishers have devised lengthy web forms that require completion with details of co-authors and their affiliations, funding agencies, etc. before you are permitted to upload your manuscript – forms that for certain publishers can take the best part of an afternoon to complete for a new submission involving many authors, figures and supplementary files.  Because researchers have no choice but to comply with the metadata requests, they do so: it is the only way to achieve their desired goal of publication in the chosen journal.

That the fields of genomics and macromolecular structures are exceptions to the rule that data are not widely published is due to two factors:

  • First, their datasets are relatively simple, homogeneous and well-defined  – linear nucleotide or amino acid sequences, lists of structure factors, and lists of atomic coordinates – in comparison with the heterogeneity of data in fields such as ecology or animal behaviour, simplifying the tasks of data management and metadata creation.
  • Second, and more important, is the fact that in the early 1970s journals such as Nature started to mandate database accession numbers as a precondition of publishing sequence or structure papers – this brought about an almost instantaneous change in attitudes among our research community!

For other disciplines, while I commend journals’ and research councils’ recent policies regarding data publication, I believe we will only achieve radical change when funders and publishers mandate data publication as a pre-condition of applying for a further grant or of article submission.  Toothless research council data policies, however laudable, are of little use unless backed up by some policing.  ‘Sticks’ are required to achieve desired policy aims, as well as the ‘carrots’ of better personal data management and data security obtained by employing easy-to-use tools and systems.”

= = =

The following post describes what we are doing, with funding help from the JISC, to help mitigate these pressures and provide tools and services to assist researchers in data publication.

7 Responses to Why researchers don’t publish data

  3. David, are you being politic? You seem to have provided researchers with three excuses (I’m too busy/it’s too difficult/it’s my department’s fault) to absolve them from personal responsibility. Why don’t researchers publish data? Because they don’t want to. That’s it. Full stop. Stop the buck.

    I think that you’re absolutely right about mandates, but I think that the mandates have to be rooted in the intention of “the scholarly system” to systematically change scientific values and professional practice.

  4. davidshotton says:

    Fair comment, Les. We all agree that researchers should aim for the gold standard of rapidly making their data fully disclosed and publicly available, both to back up claims in research papers and to allow for reuse. But I was trying to reflect the reality on the ground as I saw it, in a stark manner that people would take note of. Of course researchers have a responsibility. But the golden rule is that no-one ever does anything until it reaches the top of that person’s priority agenda. If/when that will occur for data publication is determined by a balance of things: competing pressures, difficulty of task, rewards for accomplishing task and/or penalties for not doing so. At present, the competing pressures and difficulties of data publication are too great, and the rewards and penalties too small to push data publication to the top of most researchers’ agendas.

    So what can we do? Influence those who can influence those who determine research excellence assessment criteria, to reward data publication more fully. Influence journal editors, publishers and funding agencies to encourage or mandate data publication. Create data management systems that ease the researcher’s task of data management and publication / metadata creation. Provide mechanisms that facilitate citation of published data. Show researchers the evidence for the benefits of data publication, in terms of increased article citation rates if the articles are linked to data publications. Support repositories like Dryad that accept data linked to journal articles. . . . (What have I missed?)

    Together, all these things will contribute to achieving the sea change in attitudes I discuss in the subsequent post, in which the researcher both recognizes and **acts on** his/her personal responsibility to publish data.
