How to summit the mountain of data

Have we reached a turning point on the question of data sharing in medical research?

More than in any previous scientific era, the daily conduct of research is producing a virtual mountain of data. It is rare to go to a scientific meeting where "big data" is not a topic. Genomics and its related disciplines are often called on to demonstrate this point. These fields are producing trillions of data points about DNA, the proteins coded by individual genes, and how those are affected by cell type, location, and epigenetic conditions. The Director of NIH's National Human Genome Research Institute, Eric Green, M.D., told many of us at the recent GREAT/GRAND meeting that the pace of progress in gene sequencing technologies outpaces even Moore's law for the semiconductor industry.

The latent potential for discovery that resides in big data from laboratory, clinical care and community settings is staggering. Consider for instance the adoption of electronic health records (EHRs), which could be mined to match the phenotypic expression of disease to genetic profiles at a population level - but not if these data sit on separate systems or in separate laboratories. Addressing the interoperability issue with EHRs is part of the larger conversation on data sharing.

It is hard to argue against the concept of sharing data to improve health, cut down research and health care costs, or keep researchers from unnecessarily going down a "dead-end research path." At the same time, sharing data is not simple. There are, without doubt, unique challenges: cultural, institutional and regulatory barriers, not to mention concerns about the veracity of a given data set and the fair and appropriate inclusion of data from all relevant populations. I was reminded by a recent series in the Incidental Economist that how big data are turned into meaningful and actionable information (the questions asked and the quality of the interpretations made from the data) is just as essential as the mechanisms of data collection.

Notwithstanding these important considerations, we as a community are in the right place at the right time in science, and we have the privilege of using the exponential increase in data to generate new knowledge that advances health. Most people I know would welcome our collective efforts to create a research environment where the right incentives and protections enable scientists to openly exchange all the inputs and outputs of scientific inquiry.

New policies are part of the solution

In the past few years, there's been a groundswell of policy changes aimed at encouraging data sharing among scientists. Pham-Kanter, Zinner and Campbell recently replicated a 2000 survey on data sharing practices among medical researchers to determine whether new journal procedures and funding agency requirements have altered attitudes and behaviors.

A majority of respondents indicated that NIH policies, particularly those related to genomic data sharing and requirements to include data sharing plans in funding proposals, were the most influential drivers of increased data sharing. This is welcome news, signaling that policy levers can be effective at incentivizing data sharing. But policies alone won't get us all the way there. The mindset shift required for data sharing is a cultural one, which I believe we as a scientific community are ready to tackle.

Making the climb

In September, the NIH's new Associate Director for Data Science (ADDS), Phil Bourne, Ph.D., convened the first ADDS Data Science Meeting to discuss strategies and initiatives for the next five years of data science at the NIH. The needed culture shift for data sharing was perfectly summed up by one of the thought leaders in attendance. To paraphrase, we need to move from researchers feeling a sense of exclusive ownership of their data to an environment where the highest aim for new data is to get it in front of as many eyes as possible. With the rise of crowdsourcing, this mindset about data and discovery is already taking hold.

To advance this vision, the agency has launched the Big Data to Knowledge Initiative (BD2K). With an expected budget of over $650M through 2020, BD2K aims to catalyze the biomedical research enterprise to maximize the value of the growing volume and complexity of data, including the challenge of data sharing. The pioneers funded under this initiative will be sherpas, laying way markers and supplies along our trail up data mountain.


Best Practice Example

In 2012, AAMC began offering Innovation Awards to recognize best practices in research and research education. Among the 2014 cohort of awardees is the Washington University Human Research Protection Office Digital Commons (HRPO). The commons allows faculty experts to copyright, upload, and track the use of their publications, conference presentations and other materials. It has led to over 4,500 downloads and 11,000 page views, promoting the faculty's expertise and spreading the utility of their accrued knowledge. Initiatives such as this help build confidence in the mechanisms of data sharing.

To learn more, see the profiles of all six 2014 Innovation Award recipients, published here.