A Scalable Solution for FAIR Data Sharing


May 18, 2023 - By Sarah Paris

Open access to data is a fundamental component of team science. The NIH now requires that plans for data sharing be part of all grant submissions. By 2025, data generated as part of all federally funded research must be publicly available, and it must be FAIR (findable, accessible, interoperable, and reusable).

Reaching this goal will require a focused and collaborative effort. Currently, “almost all datasets that are online are not FAIR. They are not findable, and you can’t search through them,” said Mark Musen, MD, PhD, a professor of medicine and of biomedical data science and director of the Stanford Center for Biomedical Informatics Research (BMIR). “There is no systemized way of classifying datasets, no MeSH vocabulary that would allow a third party to find and understand exactly what studies have been done,” he said. “The vision for data sharing to advance our scientific knowledge will not become reality until we can describe datasets in a more standardized, richer way, so that people can query data the way they query journal articles.”

With this in mind, the National Institutes of Health (NIH) recently awarded $10M annually to Musen and his team of collaborators. The goal is twofold: to create a data hub and to demonstrate how metadata can be enriched to make datasets FAIR. The NIH Rapid Acceleration of Diagnostics Data Hub (RADx® Data Hub) will provide access to curated, de-identified COVID-19 data and allow researchers to find and aggregate data and perform targeted analyses.

“This will be a sort of cloud-based way for researchers to engage with that data, bring their own data, if they’re interested, and really investigate all of the ways that COVID has spread across our country and what testing data can provide in terms of answering research questions,” said Susan Gregurick, NIH’s associate director for data science and the director of its Office of Data Science Strategy, at a conference last year.

The data hub is part of the larger RADx® initiative, a national call for scientists and organizations to contribute their ideas for new COVID-19 testing approaches and strategies, from basic studies to clinical trials to understanding barriers to access for underserved populations. Data generated by each RADx investigator are sent to one of five coordinating centers for cleaning and standardization, after which they are submitted to the data hub. The goal is a community-driven, integrated approach for harmonizing data through shared models and common elements, according to Gregurick.

“For us, the big excitement will be when investigators around the world will start to use the data in the hub to make new discoveries that were not apparent at the time of the original experiments,” said Musen. “These discoveries will be possible only because of technology that we have been developing at Stanford for several years, which makes it easy to describe datasets using community standards and to search for and integrate data using these standards.”


There is an urgent need to speed up innovation and discovery. “My research is focused on developing novel diagnostics and prognostics by leveraging biological, clinical, and technical heterogeneity across multiple datasets in public domain. However, we spend months looking for appropriate datasets. A FAIR-compliant data hub will substantially accelerate our ability to develop new diagnostic and prognostic tests and identify novel drug targets,” commented Purvesh Khatri, PhD, associate professor of medicine and biomedical data science.

The 3.0 version of the RADx® Data Hub that is now being built by BMIR and its collaborators is based on two earlier versions. BMIR’s role is to lead and coordinate the project and to enhance the metadata used to describe the datasets, so that other investigators can find and reuse them.

“We are working very closely with the primary investigators and the data-coordination centers to determine how to describe data in ways that are both reasonable and acceptable to the original investigators, as well as natural for third parties, so they can find what they are looking for,” said Musen. “Our work to develop the RADx Data Hub builds on a general approach to structured data management that we have explored in several large consortia that want to ensure that their experimental data are shareable and reusable.”


The data hub’s architecture, data flow, and security features will be developed by Booz Allen Hamilton, a global IT consulting firm. The Renaissance Computing Institute (RENCI), affiliated with the University of North Carolina, will provide administrative support and community outreach.

“We have a strong tradition of team science at BMIR,” said Musen. “But our collaborations tend to involve groups at other institutions and in the tech industry, rather than with our Stanford peers. We hope the data hub will provide an opportunity to partner with and among Stanford-based investigators involved in COVID-related research.”

“Stanford Medicine has been a leading force during the COVID pandemic from the early days of developing lab testing for SARS-CoV-2 and treatment clinical trials to the present with leadership in studies of long COVID. It’s exciting that with the RADx Data Hub program, Stanford investigators will now also get to access large-scale data and can now address a broad number of questions relevant to multiple aspects of the COVID pandemic,” said Upinder Singh, MD, a professor and chief of infectious disease and geographical medicine and a professor of microbiology and immunology.

Sharing data systematically also involves a culture shift. Traditionally, researchers have been incentivized to keep their research to themselves. But team science, as recognized and rewarded by today’s funding agencies, is based on collaboration and a broad availability of data. The RADx® Data Hub creates a real-world, practical model of how this can be accomplished.

The 3.0 version of the RADx® Data Hub is planned to be up and running by the end of the year. A 1.0 version is currently accessible at https://radx-hub.nih.gov, with data from over 125 studies.

The NIH is offering awards for researchers who want to use the data hub for secondary analysis and reexploration of the data. Two notices of special interest were issued last fall: https://grants.nih.gov/grants/guide/notice-files/NOT-OD-23-040.html and https://grants.nih.gov/grants/guide/notice-files/NOT-OD-23-041.html.