The testing of AI in medicine is a mess. Here’s how it should be done

When Devin Singh was a paediatric resident, he attended to a young child who had gone into cardiac arrest in the emergency department after a prolonged wait to see a doctor. “I remember doing CPR on this patient and feeling that kiddo slip away,” he says. Devastated by the child’s death, Singh remembers wondering whether a shorter waiting time could have prevented it.
The incident convinced him to combine his paediatric expertise with his other speciality — computer science — to see whether artificial intelligence (AI) might help to cut waiting times. Using emergency-department triage data from the Hospital for Sick Children (SickKids) in Toronto, Canada, where Singh currently works, he and his colleagues built a collection of AI models that provide potential diagnoses and indicate which tests will probably be required. “If we can predict, for example, that a patient has a high likelihood of appendicitis and needs an abdominal ultrasound, we can automate ordering that test almost instantly after a patient arrives, rather than having them wait 6–10 hours to see a doctor,” he says.
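The study doesn’t spell out how the SickKids models are implemented, but the underlying idea can be illustrated with a small, purely hypothetical Python sketch: train a classifier on past triage records to estimate whether a visit will end up needing a particular test, such as an abdominal ultrasound. The column names, the gradient-boosting model and the decision threshold below are illustrative assumptions, not details from the SickKids work.

```python
# Hypothetical sketch only: predict from triage data whether a visit will need a given test.
# Column names, the model choice and the decision threshold are illustrative assumptions,
# not details of the SickKids system.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

FEATURES = ["age_months", "heart_rate", "temperature_c", "pain_score", "triage_level"]
LABEL = "needed_abdominal_ultrasound"  # 1 if the visit ultimately required the test

visits = pd.read_csv("triage_visits.csv")  # hypothetical extract of past emergency visits
X_train, X_test, y_train, y_test = train_test_split(
    visits[FEATURES], visits[LABEL], test_size=0.2, random_state=0, stratify=visits[LABEL]
)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Retrospective check: how well does the model rank visits that needed the test?
print("AUROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# In deployment, a confident prediction could trigger an automatic test order at triage,
# subject to consent and a clinician-approved protocol.
new_visit = X_test.iloc[[0]]
if model.predict_proba(new_visit)[0, 1] > 0.9:  # threshold chosen purely for illustration
    print("Suggest ordering an abdominal ultrasound before the physician review")
```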
A study using retrospective data from more than 77,000 emergency-department visits to SickKids suggested that these models would expedite care for 22.3% of visits, speeding up results by nearly 3 hours for each person requiring medical tests [1]. The success of an AI algorithm in a study such as this, however, is only the first step in verifying whether such an intervention would help people in real life.

Properly testing AI systems for use in a medical setting is a complex multiphase process. But relatively few developers are publishing the results of such analyses. Only 65 randomized controlled trials of AI interventions were published between 2020 and 2022, a review shows [2]. Meanwhile, regulators such as the US Food and Drug Administration (FDA) have approved hundreds of AI-powered medical devices for use in hospitals and clinics.
“Health-care organizations are seeing many approved devices that don’t have clinical validation,” says David Ouyang, a cardiologist at Cedars-Sinai Medical Center in Los Angeles, California. Some hospitals opt to test such equipment themselves.
And although researchers know what an ideal clinical trial for an AI-based intervention should look like [3], in practice, testing these technologies is challenging. Implementation depends on how well health-care professionals interact with the algorithms: a perfectly good tool will fail if humans ignore its suggestions. AI programs can be particularly sensitive to differences between the populations whose data they were trained on and the ones they’re aiming to help. Moreover, it’s not yet clear how best to inform patients and their families about these technologies and ask for their consent to use their data for testing the devices.
Some hospitals and health-care systems are experimenting with ways to use and evaluate AI systems in medicine. And as more AI tools and companies are entering the market, groups are coming together to seek consensus on what kinds of assessment work best and provide the most rigour.
AI-based medical applications, such as the one being built by Singh, are generally considered medical devices by drug regulators, including the US FDA and the UK Medicines and Healthcare products Regulatory Agency. As such, the criteria for reviewing and authorizing them for use are often less rigorous than are those for drugs. Only a small proportion of devices — those that might pose a high risk to patients — require clinical-trial data for approval.
Many think that the bar is too low. When Gary Weissman, a critical-care physician at the University of Pennsylvania in Philadelphia, reviewed the FDA-approved AI devices in his field, he found that, of the ten he identified, only three cited published data in their authorizations. Just four mentioned a safety assessment and none included a bias evaluation, which analyses whether the tool’s outcomes are fair across different patient groups [4]. “What’s concerning is these devices really can and do influence care at the bedside,” he says. “A patient’s life can hinge on those decisions.”
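The review doesn’t prescribe how a bias evaluation should be run, but in practice it typically means comparing a tool’s error rates across patient groups. Below is a minimal, hypothetical sketch, assuming the tool’s predictions and the true outcomes have been collected in a table alongside a demographic column; the file and column names are assumptions.

```python
# Minimal sketch of a subgroup bias check for a binary prediction tool.
# The file name and columns ("sex", "y_true", "y_pred") are illustrative assumptions.
import pandas as pd
from sklearn.metrics import precision_score, recall_score

results = pd.read_csv("model_predictions.csv")  # hypothetical: one row per patient

for group, rows in results.groupby("sex"):
    sensitivity = recall_score(rows["y_true"], rows["y_pred"])   # true-positive rate
    ppv = precision_score(rows["y_true"], rows["y_pred"])        # positive predictive value
    print(f"{group}: sensitivity={sensitivity:.2f}, PPV={ppv:.2f}, n={len(rows)}")

# Large gaps in sensitivity or PPV between groups are the kind of unfairness
# a pre-deployment bias evaluation is meant to surface.
```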
The dearth of data leaves hospitals and health-care systems in a difficult position when deciding whether to use these technologies. In some cases, financial incentives come into play. In the United States, for example, health-insurance programmes already reimburse hospitals for the use of certain medical AI devices [5], making them economically appealing. These institutions might also be inclined to adopt AI tools that promise cost savings, even if they don’t necessarily improve patient care.
Those incentives could discourage AI companies from investing in clinical trials, says Ouyang. “For many commercial enterprises, you can imagine they’re putting more effort in making sure their AI tool is reimbursable and has a good financial outcome, because they see that that drives adoption,” he says.

The situation might be different depending on the market. In the United Kingdom, for example, nationwide government-sponsored health programmes might set a higher evidence bar before medical centres can acquire a given product, says Xiaoxuan Liu, a clinical researcher who studies responsible innovation in AI at the University of Birmingham, UK. “Then, the incentive is there for companies to do clinical trials,” says Liu.
Once hospitals purchase an AI product, they are not required to perform further tests and can use it immediately as they would any other software. Some institutions, however, recognize that regulatory approval does not guarantee that the device is truly beneficial. So they choose to test it themselves. Many of these efforts are currently performed and funded by academic medical centres, Ouyang says.
Alexander Vlaar, the head of intensive-care medicine at Amsterdam University Medical Center, and Denise Veelo, an anaesthesiologist at the same institution, started one such endeavour in 2017. Their goal was to test an algorithm that aims to predict the occurrence of low blood pressure during surgery. This condition, known as intraoperative hypotension, can lead to life-threatening complications, such as myocardial injury, heart attack and acute renal failure, and even death.
The algorithm was developed by Edwards Lifesciences, a company in Irvine, California, and uses arterial waveform data — the red line with peaks and troughs seen on monitors in an emergency department or intensive-care unit. It can predict hypotension minutes before it happens, enabling early intervention.
Vlaar, Veelo and their colleagues conducted a randomized clinical trial to test the tool on 60 patients undergoing non-cardiac surgery. Individuals who had the device running during their surgery experienced a median time of 8 minutes of hypotension compared with nearly 33 minutes for those in the control group [6].
The team ran a second clinical trial, which confirmed that the device, combined with a clear treatment protocol, also works in more-complex settings, including during cardiac surgery and in the intensive-care unit. The results have not yet been published.
The success wasn’t simply because of the precision of the algorithm. How the anaesthesiologists respond to an alert matters. So, the researchers made sure to prepare physicians carefully: “We had a diagnostic flowchart with steps to take when you get an alarm,” says Veelo. The same algorithm failed to show a benefit in a clinical trial performed by another institution [7]. In that case, “there was no compliance by the bedside physicians for doing something when the alarm went off”, says Vlaar.
A perfectly good algorithm might fail because of variability in human behaviour, both by health-care professionals and by people receiving treatments.
When Mayo Clinic in Rochester, Minnesota, tested an algorithm developed in-house to detect a heart condition called low ejection fraction, the centre’s human–computer interaction researcher, Barbara Barry, was in charge of bridging the gap between developers and the primary-care providers using the technology.

The tool was designed to flag individuals who might be at high risk of the condition, which can be a sign of heart failure and is treatable, but often goes undiagnosed. A clinical trial showed that the algorithm did increase diagnosis rates [8]. However, in conversations with providers, Barry found that they wanted further guidance on how to talk to the patients about the algorithm’s findings. This led to the recommendation that the application, if widely implemented, should include bullet points with important information to communicate to the patient so that the health-care provider doesn’t have to consider how to have that conversation each time. “This is one example of how we move from a pragmatic trial to implementation strategies,” Barry says.
Another issue that can limit the success of certain medical AI devices is ‘alert fatigue’ — when clinicians are exposed to a high number of AI-generated warnings, they might become desensitized to them. This should be considered during the testing process, says David Rushlow, chair of the family medicine department at Mayo Clinic.
“We’re already getting alerted many times a day on conditions that our patients may be at risk for. And that’s actually a very difficult task for a busy front-line clinician,” he says. “I think many of these tools will be able to help us. But, if they are not introduced accurately, the default will be to just continue to do things the same way, because we don’t have the bandwidth to learn something new,” Rushlow notes.
Another challenge in testing medical AI is that clinical-trial results are hard to generalize to different population groups. “It’s simply a known fact that AI algorithms are very fragile when they are used on data that is different from the data that it was trained on,” Liu says. Results can be extrapolated safely only if the clinical-trial participants are representative of the population the tool will be used in, she notes.
Furthermore, algorithms trained on data from well-resourced hospitals might not perform well when applied in lower-resource settings. For example, an algorithm developed by Google Health in Palo Alto, California, to detect diabetic retinopathy, a condition that causes vision loss in people with diabetes, was in theory highly accurate [9]. But its performance dropped significantly when the tool was used in clinics in Thailand. An observational study revealed that lighting conditions in the Thai clinics led to low-quality eye images that reduced the tool’s effectiveness [9].
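One practical way to probe the fragility Liu describes, and the kind of site-to-site drop seen in the Thai clinics, is to validate a model on data from a hospital it never saw during training and compare the result with its internal test performance. The sketch below is a generic illustration, not the Google Health pipeline; the file names, features and logistic-regression model are assumptions.

```python
# Generic sketch: compare internal test performance with performance at an external site.
# File names, the feature list and the logistic-regression model are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

FEATURES = ["age", "heart_rate", "systolic_bp"]
LABEL = "outcome"

internal = pd.read_csv("site_a_patients.csv")  # population the model is developed on
external = pd.read_csv("site_b_patients.csv")  # a different hospital's population

int_train, int_test = train_test_split(internal, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(int_train[FEATURES], int_train[LABEL])

auc_internal = roc_auc_score(int_test[LABEL], model.predict_proba(int_test[FEATURES])[:, 1])
auc_external = roc_auc_score(external[LABEL], model.predict_proba(external[FEATURES])[:, 1])

print(f"Internal AUROC: {auc_internal:.2f}")
print(f"External AUROC: {auc_external:.2f}")  # a large drop signals poor generalization
```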
Currently, most medical AI tools assist health-care professionals in screening, diagnosing or planning treatments. Patients might not be aware that such technologies are being tested or routinely used in their care, and there is currently no requirement in any country for providers to disclose this.
There are continuing debates on what to tell patients about AI technologies. And some of these applications are bringing the issue of patient consent to the forefront of developers’ concerns. That’s the case for the AI device being developed by Singh and his colleagues to streamline care for children in the emergency department at SickKids.

What’s markedly different about this technology is that it removes the clinician from the loop, making the child — or their parent or carer — the end user.
“What this tool is going to do is take emergency triage data, make a prediction and have a parent directly approve — yes or no — if the child can be tested,” Singh says. This alleviates the burden on the clinician and accelerates the whole process. But it also creates many unprecedented issues. If something goes wrong with the patient, who is responsible? And if unnecessary tests are done, who will pay for them? “We need to, in an automated way, obtain informed consent from the family,” Singh says. And the consent has to be reliable and authentic. “It can’t be like when you sign up for social media and there are 20 pages of small print and you just hit accept,” Singh says.
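Singh describes the intended workflow only at a high level. As a purely schematic illustration of where a consent gate could sit between the model’s prediction and an automated order, here is a hypothetical sketch; every name in it is a placeholder rather than part of any real SickKids system.

```python
# Purely schematic sketch: model prediction -> parental consent -> automated test order.
# Every name here (TriageVisit, request_parent_consent, maybe_auto_order) is a hypothetical
# placeholder, not part of any real SickKids system.
from dataclasses import dataclass


@dataclass
class TriageVisit:
    visit_id: str
    predicted_test: str   # e.g. "abdominal ultrasound"
    probability: float    # model's estimated probability that the test is needed


def request_parent_consent(visit: TriageVisit) -> bool:
    """Show the prediction to the parent or carer and record an explicit yes/no.
    In practice this step would have to meet informed-consent standards,
    not a click-through agreement."""
    answer = input(f"The model suggests a {visit.predicted_test} "
                   f"(confidence {visit.probability:.0%}). Approve? [y/n] ")
    return answer.strip().lower() == "y"


def maybe_auto_order(visit: TriageVisit, threshold: float = 0.9) -> None:
    if visit.probability >= threshold and request_parent_consent(visit):
        print(f"Order placed: {visit.predicted_test} for visit {visit.visit_id}")
    else:
        print("No automated order; the visit follows the usual clinician workflow")


maybe_auto_order(TriageVisit(visit_id="V123", predicted_test="abdominal ultrasound", probability=0.94))
```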
As Singh and his colleagues await funding to start a trial with patients, the team is partnering with legal experts and involving the country’s regulatory authority, Health Canada, to review its proposal and consider the regulatory implications. Right now, says Anna Goldenberg, a computer scientist and co-chair of the AI in Medicine for Kids initiative at SickKids, “it’s a bit of a Wild West out there in terms of regulation”.
Institutions are coming together to discuss how to address some of these challenges. Some specialists say that the best approach would be for each health-care institution to perform its own tests before adopting medical AI tools. Others note that this is not feasible because of the costs involved, so researchers and health-care organizations are exploring other options.
“It’s already difficult for big organizations and it would be tremendously more difficult for smaller ones,” says Shauna Overgaard, a medical AI specialist. She co-directs Mayo Clinic’s AI Validation and Stewardship Research Program, which aims to test medical AI tools in a standardized and centralized way so that they can be used in community-based health facilities affiliated with Mayo Clinic Health System.
Overgaard is also a member of the Coalition for Health AI, which includes representatives from industry, academia and patient-advocacy groups. The coalition, funded by companies such as Google, Amazon, Microsoft and CVS Health, has proposed the creation of a network of health AI assurance laboratories, which would evaluate models using an agreed set of principles in a centralized way.
Mark Sendak, a clinical data scientist at the Duke Institute for Health Innovation in Durham, North Carolina, says that this centralized approach is not ideal. “Every setting needs to have its own internal capabilities and infrastructure to do that testing as well,” he says.
He’s part of the Health AI Partnership, a group comprising academics and health-care organizations. The collaboration, which has received initial funding from the Gordon and Betty Moore Foundation in Palo Alto, aims to build capabilities and provide technical assistance for any organization to be able to test AI models locally.
Nina Kottler, a radiologist and associate chief medical officer of clinical AI at Radiology Partners, a large group of medical-imaging practices in the United States, agrees that local validation is crucial. And she hopes that the insights from such studies can be used to educate the professionals who will be operating the tools. She says that this human element will be the most important. “There is almost no AI in health care that is autonomous,” she says. “We have to start thinking of how to make sure we’re measuring the accuracy, not just of the AI, but the AI plus the end user.”
