Stanford scientists use AI to generate synthetic biological data

December 2, 2024

A transformative new AI model from researchers at the March of Dimes Prematurity Research Center (PRC) at Stanford can use existing electronic medical records (EMRs) on millions of people to generate a reliable synthetic biological profile of each of those people in a matter of minutes.

The achievement, spearheaded by Stanford PRC researcher Dr. Nima Aghaeepour under the leadership of PRC principal investigator Dr. David Stevenson, was recently published in Briefings in Bioinformatics. It dramatically simplifies and expedites medical research by eliminating the need to run years-long, multimillion-dollar studies aimed at obtaining biological samples like blood and urine to mine for clues about disease and, in the case of the PRC team, about preterm birth and other adverse pregnancy outcomes.

With the new model, researchers can take EMRs containing information about millions of women's lives (from diagnoses and prescriptions to alcohol consumption, marital status, exercise history, and more) and almost instantaneously generate protein-related data for each woman. That data offers insights into a myriad of bodily processes, including inflammation, that could lead to the identification of a future biomarker for predicting preterm birth.

While diagnostic biomarkers would need to be tested and validated in traditional clinical trials involving real patients giving real samples, the heavy and expensive lift of the initial data collection that leads to biomarker discovery could be eliminated thanks to the new model, freeing up money and time for scientists to invest in other discoveries.

And though the model detailed in the paper was successful in generating protein profiles for a validation cohort of pregnant women, Dr. Aghaeepour said that a newer iteration of the model currently being developed is also able to generate meaningful synthetic biological data associated with Parkinson’s disease, various cancers, and other conditions. This broad reach would allow doctors to generate complex biological profiles for a range of medical conditions quickly and with minimal cost.

“With this model, AI can look at millions of EMR with tens of millions of characteristics and estimate a corresponding biological profile for each of those people without the need to enroll patients into studies, draw blood, and run expensive biological assays,” said Dr. Aghaeepour. “It means that in the future, we will be able to generate biological data in millions upon millions of patients basically at the price of electricity.”

To create the model, Stanford MD-PhD student Dr. David Seong, the lead author of the study, linked a bank of EMRs from the Stanford University hospital system with protein data from 171 blood samples drawn from 61 pregnant women. Once trained on these two data sets, the model was validated on a group of pregnant patients within the EMR group, generating the right profile for the right woman for many proteins (with correlations reaching as high as 78%). Now, the model no longer needs real-life samples to spit out biological profiles and can function on EMR inputs alone.
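
For readers curious about what a per-protein correlation check can look like, here is a minimal sketch in Python. It is not the study's code: the function names, the protein names, and the toy numbers are all hypothetical, and it assumes a plain Pearson correlation between measured and model-estimated protein levels, which the article does not specify.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def per_protein_correlation(measured, estimated):
    """Map each protein name to the correlation of measured vs. estimated levels."""
    return {name: pearson(measured[name], estimated[name]) for name in measured}


# Toy, made-up values for illustration only (not data from the study).
# Each list holds one value per held-out sample, in the same sample order.
measured = {
    "protein_A": [1.0, 2.0, 3.0, 4.0],
    "protein_B": [2.0, 1.0, 4.0, 3.0],
}
estimated = {
    "protein_A": [1.1, 1.9, 3.2, 3.8],
    "protein_B": [3.0, 3.0, 2.0, 2.0],
}

scores = per_protein_correlation(measured, estimated)
for name, r in scores.items():
    print(f"{name}: r = {r:.2f}")
```

In this toy setup, a protein whose estimated levels track the measured ones scores close to 1, while a poorly estimated protein scores near zero or below. That is one way a model could work well for some proteins and not for others.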

Despite this, Dr. Aghaeepour cautioned that the model, which currently works for only a subset of proteins, would still need to be fed voluminous new real-life biological data sets to continue learning, to increase its accuracy in pregnant women, and to generalize to other populations.

“Given that future iterations of this model will be more robust, it’s safe to say that this capability will dramatically reshape the way scientists conduct their research and speed the pace of discovery in the field of preterm birth,” he said. “But it’s also clear that this model will not replace the need for real biological samples.”

“Not only does the model depend on the addition of real biological data to fuel its improvement, but any exciting biomarker insights that come from the synthetic data generated by the model will still need to be tested in trials involving people—this strategy just gets us there much faster.”