Insufficient evidence that Babylon chatbot can perform better than doctors, experts suggest

by Valeria Fiore
7 November 2018

Private provider Babylon’s chatbot – AI software that triages patients based on their symptoms, without a doctor’s intervention – could offer patients a less effective service than a GP, a letter published in medical journal The Lancet has suggested.
In a study published in June, Babylon, which provides the GP at Hand service to NHS patients, analysed the efficacy of its AI chatbot in primary care triage and diagnosis.

The letter was written by Associate Professor of medical science Hamish Fraser, medical informatics Professor Enrico Coiera and Lecturer in health informatics David Wong. 

While the three authors commend Babylon ‘for releasing a fairly detailed description of the system development and the three evaluation studies,’ they also argue that Babylon’s internal evaluation of its chatbot – which Babylon’s founder and chief executive Ali Parsa said in June is ‘on par’ with doctors – does not provide resounding evidence of this.
Babylon’s findings were ‘met with scepticism because of methodological concerns’, the authors said.
The letter added: ‘Babylon’s study does not offer convincing evidence that its Babylon diagnostic and triage system can perform better than doctors in any realistic situation, and there is a possibility that it might perform significantly worse.
‘Further clinical evaluation is necessary to ensure confidence in patient safety.’
The authors added that similar concerns have been raised in relation to other computerised diagnostic decision support (CDDS) programmes – which are designed to help doctors make decisions on patient care.
CDDS programmes that are ‘poorly designed or lack rigorous clinical evaluation can put patients at risk’ and could put additional pressure on the health system, the authors suggested.
They added that guidelines should be introduced to evaluate CDDS programmes.
Babylon’s AI chatbot evaluation
Babylon’s AI chatbot scored 81% when tested using RCGP exam questions, which the company compared to the average mark of 72% for real-life doctors.
Commenting on The Lancet letter, Babylon said it is ‘unrealistic to believe that AI can compete with all specialist doctors’.
However, the company believes that ‘the power of these gradual advances in healthcare should be celebrated as they will help to make healthcare accessible and affordable to all’.
Babylon’s chief scientist Saurabh Johri said: ‘As we indicated in our original study, our intention was not to demonstrate or claim that our AI system is capable of performing better than doctors in natural settings.
In fact, we stress that our study adopts a ‘semi-naturalistic role-play paradigm’ to simulate a realistic consultation between patient and doctor, and it is in the context of this controlled experiment that we compare AI and doctor performance.
‘We welcome the suggestions of the authors for developing guidelines for robust evaluation of computerised diagnostic decision support systems since they align with our own thinking on how best to perform clinical evaluation.
Together with our academic partners, we are currently in the process of performing a larger, real-world study, which we intend to submit for peer-review.’