A team of researchers from the Department of Veterans Affairs, Oak Ridge National Laboratory, Harvard’s T.H. Chan School of Public Health, Harvard Medical School and Brigham and Women’s Hospital has developed a novel, machine learning–based technique to explore and identify relationships among medical concepts using electronic health record data across multiple healthcare providers.

The method, called Knowledge Extraction via Sparse Embedding Regression, or KESER, was published recently in npj Digital Medicine. The process integrates electronic health record data from two large institutions, the VA and Boston-based Partners Healthcare, and automates feature selection, which in turn supports phenotype identification algorithms and knowledge discovery.

“KESER provides a high-level view of the relationships between clinical knowledge that we can’t always see when caring for patients at the individual or group level,” said Dr. Katherine Liao, a principal investigator of KESER at VA Boston and associate professor of medicine at Harvard Medical School. “We look forward to translating the study’s methods and results from applications in clinical research to advancements in clinical care.”

The project is part of the phenomics core work directed by Drs. Kelly Cho and Mike Gaziano from VA Boston and Harvard under the VA’s Million Veteran Program, or MVP, a “national research program to learn how genes, lifestyle, and military exposures affect health and illness,” according to the VA Office of Research and Development MVP website.

In 2016, ORNL began collaborating with the VA on MVP-CHAMPION, a big-data initiative under MVP, to create a large precision-medicine platform to host the VA’s vast medical record dataset, which consists of records for some 24 million veterans. To strengthen crosscutting innovation in support of numerous research projects under this joint VA-DOE program, ORNL worked closely with the MVP Data Core from VA Boston and Harvard to identify specific research areas to pursue. Among those was an effort to answer the question: What elements do we need to find within electronic health records to correctly identify a given phenotype?

Working with what they believe is the largest cohort of healthcare data used for this type of research in the U.S., the team set out to automate the identification of phenotypic relationships while providing visibility into the underlying machine learning assumptions and decision processes.