“Repositories of Horrible Stuff”
CEDAR to the Rescue
Making Large Data Easily Available Online
Several years ago, Mark Musen, MD, PhD, wrote: “The ultimate Big Data challenge lies not in the data, but in the metadata — the machine-readable descriptions that provide data about the data. It is not enough to simply put data online; data are not usable until they can be ‘explained’ in a manner that both humans and computers can process.”
Musen is a professor of biomedical informatics and director of the Stanford Center for Biomedical Informatics Research. He is also the head of CEDAR, the Center for Expanded Data Annotation and Retrieval, which helps researchers comply with requirements to archive their data so others can understand and use them. In a recent interview, Musen provided clarity about the problem of metadata.
Why is it a problem for researchers to comply with the requirement to publish their metadata?
The greatest challenge of this whole enterprise is the problem of “What’s in it for me?” We reward scientists for authoring journal articles and for creating PDFs, but we don’t have a system that recognizes the data contributions that scientists make. We need to change the culture so that when other investigators report secondary analyses of data, or when data sets are re-explored and then lead to new discoveries, there is a benefit to the original investigator other than being acknowledged in someone else’s paper. Currently, investigators don’t have the motivation to spend a lot of time making their experimental data easily available online, and they generally lack tools to enable them to do so in a standardized, reproducible fashion.
Are there problems with data currently in repositories?
We’re starting to see an emphasis not just on putting the data into repositories but on actually doing a good job of it. The National Center for Biotechnology Information (NCBI) maintains most of the NIH repositories for experimental data, but it generally does no more than make sure that the forms are filled in. So NCBI databases contain lots of horrible stuff; for instance, some 25 percent of the metadata values that are supposed to be numeric don’t actually parse as numbers.
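The parsing failure Musen describes is easy to reproduce. A minimal sketch (the field names and records are invented for illustration, not drawn from any NCBI database) of how one might flag metadata values that are supposed to be numeric but do not parse as numbers:

```python
# Hypothetical metadata records; "age" is meant to be numeric.
records = [
    {"age": "34"},
    {"age": "thirty-four"},  # free text where a number belongs
    {"age": "N/A"},          # placeholder where a number belongs
    {"age": "41.5"},
]

def parses_as_number(value):
    """Return True if the value can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False

bad = [r for r in records if not parses_as_number(r["age"])]
print(f"{len(bad)} of {len(records)} 'age' values do not parse as numbers")
```

Run against real repository metadata, a check this simple is what surfaces figures like the 25 percent Musen cites.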
Is this where CEDAR has a role to play?
Precisely. The idea of CEDAR is to make it easier and more attractive for investigators to publish their data because more science is going to come out of it if they do. CEDAR has a whole library of templates that correspond to “minimal information models” for describing different classes of experiments. And we have technology that makes it easy to fill in one of these templates to describe your particular experiment when you are ready to upload your data sets to a repository. By filling in the template, you create standardized, searchable metadata that future investigators will use to locate the data and to make sense of what you have done. Using a cache of metadata that it already has stored, CEDAR can make suggestions as you’re filling out a template to accelerate the process of creating the metadata in the first place.
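The template idea can be sketched in a few lines. This is a hypothetical simplification, not CEDAR's actual template schema: a minimal-information model reduced to required fields with expected types, plus a completeness check of the kind a template-filling tool performs before metadata is uploaded.

```python
# Hypothetical "minimal information" template: required fields and
# their expected types. The field names here are invented.
TEMPLATE = {"organism": str, "tissue": str, "sample_count": int}

def validate(metadata):
    """Return a list of problems; an empty list means the metadata
    satisfies the template."""
    errors = []
    for field, expected in TEMPLATE.items():
        if field not in metadata:
            errors.append(f"missing required field: {field}")
        elif not isinstance(metadata[field], expected):
            errors.append(f"{field} should be of type {expected.__name__}")
    return errors

print(validate({"organism": "Homo sapiens", "sample_count": "12"}))
```

Because every investigator fills in the same fields with the same types, the resulting metadata is uniform enough to search across studies, which is what makes the data findable later.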
How is CEDAR being used today?
CEDAR helps investigators put data sets online — with well-described metadata — that will allow future scientists to perform new analyses and make new discoveries. Our collaborations with several large research consortia show that it's not all that difficult for investigators to do a great job of annotating their data sets in a way that will benefit the entire scientific community. Immunologists in the Antibody Society use CEDAR to upload their data and metadata to repositories at the NIH. Scientists developing the Library of Integrated Network-Based Cellular Signatures use CEDAR in association with their own data coordinating and integration center. The Irish Health Research Board and the Dutch Clinical Funding Agency are evaluating using CEDAR to review proposed metadata before making funding decisions about new studies.