Reducing the Pain of Writing Metadata


Mark Musen, MD, PhD, and Scott Delp, PhD

Over the past four or five decades, scientific journals have tended to limit article length, and the methods section has borne the brunt of the cuts: smaller fonts, tighter word limits, and banishment to online-only supplements. As a result, it has become increasingly difficult for one group of investigators to replicate the findings of another; one study in psychology, reported in the August 2015 issue of Science, succeeded in only 36 of 100 replication attempts. Even when investigators make their experimental data publicly available, as federal funding agencies require, other investigators may be stymied in their attempts to make sense of the data or to reanalyze them in any meaningful way.

And yet, “the scientific method requires nothing less than that experiments be reproducible and that the data be available for other scientists to examine and reinterpret.” That sentence, borrowed with the permission of Mark Musen, MD, PhD (professor, Biomedical Informatics), from his June 2015 article in the Journal of the American Medical Informatics Association, is the foundation of his current effort to help authors create the metadata that annotate their scientific results.

Metadata is sometimes defined as data that describes other data. It is essentially the “methods section” that researchers must write so that others can use their data and thus reproduce their findings or build on them. Once they have completed their experiments and amassed their data, however, the last thing biomedical researchers want to do is go back to the beginning and explain, step by step, the process they followed and the details of the investigation. Musen and his colleagues aim to ease the chore of writing metadata because, as he says, “people hate to author metadata.”
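As a purely illustrative sketch (the field names here are assumptions, not any standard schema), the metadata for a single gene-expression dataset might amount to little more than a handful of labeled fields describing how the data were produced:

```python
# Hypothetical metadata record describing a gene-expression dataset.
# Field names and values are illustrative only, not a CEDAR or NIH schema.
metadata = {
    "dataset": "liver_expression_2015.csv",  # the primary data being described
    "organism": "Homo sapiens",
    "tissue": "liver",
    "assay": "RNA-seq",
    "instrument": "Illumina HiSeq 2500",
    "processing": "reads aligned to GRCh38; counts normalized to FPKM",
}
```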

The Center for Expanded Data Annotation and Retrieval (CEDAR) was created to develop computer-assisted approaches that overcome the impediments to creating high-quality biomedical metadata. CEDAR is supported by the Big Data to Knowledge Initiative of the National Institutes of Health, and its goal is to develop new technology that eases the authoring and management of biomedical experimental metadata.

Musen, who is CEDAR’s principal investigator, explains: “There is a compelling need to solve the metadata problem so that we can move on to the next way in which we can use computer-stored knowledge to drive biomedical investigation. Ultimately, the goal is to replace the dissemination of scientific results in the form of prose journal articles with computer-interpretable information—allowing Google-like agents to ‘read’ the literature and to summarize what they find.”

CEDAR has embarked on this project by having groups of biomedical scientists create metadata templates and store them in a repository, where other scientists can use all or parts of them to author their own metadata. Choosing the template components that best match their needs, scientists then annotate their own data by filling in metadata acquisition forms drawn from the repository. Once completed, the metadata will accompany the primary data to archives where other scientists will have access to them. These metadata will also remain in CEDAR’s repository, which will be analyzed repeatedly to find patterns that enable the metadata acquisition tools to offer predictive data entry (essentially pre-populating the forms with likely text), simplifying and speeding metadata authoring.
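A minimal sketch of this idea, assuming a hypothetical template and repository (this is not CEDAR’s actual software or API), might look like the following: a shared list of template fields, a store of previously submitted records, and a naive form of predictive data entry that pre-populates each field with the value seen most often in past submissions.

```python
# A minimal, hypothetical sketch of template-based metadata authoring with
# predictive data entry; it does not represent CEDAR's actual software or API.
from collections import Counter

# Fields that a hypothetical community template asks scientists to fill in.
TEMPLATE_FIELDS = ["organism", "tissue", "assay", "instrument"]

# Hypothetical repository of metadata records submitted by other scientists.
REPOSITORY = [
    {"organism": "Homo sapiens", "tissue": "liver", "assay": "RNA-seq",
     "instrument": "Illumina HiSeq 2500"},
    {"organism": "Homo sapiens", "tissue": "liver", "assay": "RNA-seq",
     "instrument": "Illumina HiSeq 2500"},
    {"organism": "Mus musculus", "tissue": "brain", "assay": "ChIP-seq",
     "instrument": "Illumina NextSeq 500"},
]

def suggest_defaults(repository, fields):
    """Pre-populate each field with the value seen most often in past records."""
    suggestions = {}
    for field in fields:
        counts = Counter(record[field] for record in repository if field in record)
        if counts:
            suggestions[field] = counts.most_common(1)[0][0]
    return suggestions

def author_metadata(overrides):
    """Start from the predicted values; the scientist only types what differs."""
    record = suggest_defaults(REPOSITORY, TEMPLATE_FIELDS)
    record.update(overrides)
    return record

# Example: the form opens pre-filled with the most common values, and the
# scientist changes just one field.
print(author_metadata({"tissue": "kidney"}))
```

A real system would predict values with far more sophistication, but the sketch captures the basic workflow: templates supply the structure, the repository supplies the likely answers, and the scientist supplies only what is new.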

“Natural language can be very ambiguous,” says Musen, “and one of the challenges we have sometimes is to use the computer to clarify what an actual procedure is. Thirty years ago we had two oncologists at Stanford looking at a protocol that they had just written together. What was startling was that they couldn’t agree whether they were to first give the chemotherapy and then the radiotherapy, or first give the radiotherapy and then the chemotherapy. It wasn’t until they could see it in very clear terms in the computer system that they realized the text of the protocol that they had written together was ambiguous.”

Musen concludes: “This doesn’t happen very often but it points out the fact that there is a great advantage to the clarity that the computer system offers. It often obviates some of the problems that people run into when dealing with natural language.”  

Tools such as those developed by CEDAR will not only help make metadata more precise, but also more complete and comprehensive. If online datasets can be made more understandable to both humans and computers, we should expect nothing less than better science.