Aiden Sims was a 9-day-old who presented with congestion and fever. He had been sick for only a day, but he had declined rapidly. Slightly hypoxic, he grunted and retracted during my examination. Unfortunately, Aiden never improved.
In fact, Aiden is not a person but an artificially generated data point.
The data problem
Real medical data poses a number of problems: it is messy, often unstructured, contained in multiple formats, expensive, and impossible to share openly. In contrast, synthetic medical data can be generated to a user’s specifications for the purpose of interoperability, and sharing is not restricted by HIPAA.
Real medical data must also be collected, whereas synthetic medical data can be generated, sometimes in large volumes. A major question remains, however: can we use synthetic medical data to get real insights into medical issues?
I created a web application that uses procedurally generated pediatric patients with infectious respiratory symptoms to see if it would be possible to use synthetic medical data to generate real insights. Each of these patients has a name, medical history, complete history, and physical exam.
Procedural generation means that all of these characteristics were created by a randomization algorithm in the background rather than being pulled from a predefined dataset, so no two patients are the same. I drew on original research in the field and used specific historical and examination findings that were relevant to the presence of pneumonia as characteristics of my data set. (Pediatrics. 2022;149:e2021051405.)
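As a rough illustration of procedural generation (not the app's actual code), a patient can be assembled from random draws over each characteristic. Everything in this sketch is an assumption made for the example: the field names, the name lists, the value ranges, and the probabilities are hypothetical and are not taken from the app or from the cited study.

```python
import random
from dataclasses import dataclass

# Hypothetical name pools; a real generator would use much larger lists.
FIRST_NAMES = ["Aiden", "Maya", "Leo", "Sofia"]
LAST_NAMES = ["Sims", "Park", "Reyes", "Novak"]


@dataclass
class Patient:
    name: str
    age_days: int
    fever_days: int
    resp_rate: int      # breaths per minute
    spo2: int           # pulse oximetry, percent
    grunting: bool
    retractions: bool


def generate_patient(rng: random.Random) -> Patient:
    """Build one synthetic patient from random draws (illustrative ranges only)."""
    age_days = rng.randint(1, 365 * 17)
    # Illustrative baseline: younger children breathe faster.
    if age_days < 365:
        base_rr = 40
    elif age_days < 365 * 6:
        base_rr = 24
    else:
        base_rr = 18
    return Patient(
        name=f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        age_days=age_days,
        fever_days=rng.randint(0, 7),
        resp_rate=base_rr + rng.randint(-4, 20),
        spo2=rng.randint(88, 100),
        grunting=rng.random() < 0.1,      # assumed 10% frequency
        retractions=rng.random() < 0.2,   # assumed 20% frequency
    )


rng = random.Random(42)  # fixed seed so the example is repeatable
patients = [generate_patient(rng) for _ in range(5)]
for p in patients:
    print(p)
```

Seeding the generator makes a run reproducible, which is convenient for testing; an unseeded generator would give each user a fresh stream of patients.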
Patients were generated and physician users could then decide whether a chest X-ray was indicated. Then the process would start again with a new patient.
To encourage participation, I offered a $20 reward to the user who achieved the highest score, which was based on accuracy and number of patients seen. I used the collected data to build a random forest model that assessed the accuracy of each individual’s guess against a model built on all other guesses. I used this same model to create a facet of the app that allowed users to adjust parameters for their own hypothetical patients to predict the likelihood of receiving a chest X-ray.
The use case
Determining the true usefulness of our synthetic data is only possible by testing it in a real environment. The approximation of physician practice patterns, however, is a short mental leap from this data set; using it to build hypotheses for best practices or clinical decision rules is a slightly bigger leap but still a real possibility.
As an example, let’s look at how a patient’s age and vital signs help us determine if a chest X-ray was obtained (using patients over the age of one year). (See charts.) It’s easy to see that some vital signs change with age while others don’t. Just as easily, we can see that the deviation in some vital signs correlates much more with the completion of a chest X-ray than the deviation in others. If we combine a patient’s pulse oximetry and respiratory rate, we can isolate a group of individuals likely to receive a chest X-ray: those with high respiratory rates and low pulse oximetry readings (although, surprisingly, some in this cohort did not).
We can further match the likelihood of receiving a chest X-ray to the duration of reported fever using logistic regression and see that the two appear to be well correlated: the longer you have had a fever, the greater your chance of having an X-ray.
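For readers who want to see what such a fit looks like, here is a minimal sketch of a one-variable logistic regression of chest X-ray (yes/no) on days of fever. It does not use the project's data: the sample below is simulated under an assumed relationship, and the function names, learning rate, and epoch count are choices made for this example only. A statistics library would normally do the fitting; plain gradient descent is used here so the example runs without extra packages.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=1500):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by batch gradient descent on the log-loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of log-loss w.r.t. the linear term
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Simulated data (an assumption for illustration, not the app's records):
# the chance of an X-ray is made to rise with each additional day of fever.
rng = random.Random(0)
fever_days = [rng.randint(0, 7) for _ in range(300)]
got_cxr = [1 if rng.random() < sigmoid(-2.0 + 0.5 * d) else 0 for d in fever_days]

b0, b1 = fit_logistic(fever_days, got_cxr)


def predict(days: int) -> float:
    """Estimated probability of a chest X-ray after `days` of fever."""
    return sigmoid(b0 + b1 * days)


for d in (0, 3, 7):
    print(f"{d} days of fever: estimated X-ray probability {predict(d):.2f}")
```

A positive fitted slope (`b1`) corresponds to the pattern described above: each additional day of fever raises the estimated odds of an X-ray.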
This is by no means an exhaustive exploration of the factors that lead to performing a chest X-ray; rather, it demonstrates that such an analysis can be performed on entirely artificial data. A total of 864 patients were treated in this project, and 41% of them received a chest X-ray.
The big winner
Who won the $20? Andrew Bui, DO, an assistant in Fort Worth, TX, treated 60 patients with excellent algorithmic agreement. Because he was the biggest user of the app, I asked him what he thought about the potential usefulness of such synthetic data. “Everyone has their own risk tolerance,” he said. “But it might help you see if you’re completely off track with what other people are doing.” He said he sees potential in using such systems to assess and develop clinical decision rules or to assess regional practice differences.
Why did Dr. Bui provide such a large amount of data? “To be honest, I kept going because I didn’t know if there would be an end point,” he said.
Intrigued by synthetic data? Interviews with companies using synthetic data to make a difference in our world will be featured in this column in June.
Dr. Belanger is secretary of the Locum Tenens chapter of the American College of Emergency Physicians and an emergency physician in McKinney, TX. Read his past articles at http://bit.ly/EMN-numbERs.