ChatGPT plays doctor: what happened when a real NHS doctor asked the AI for medical advice

There will surely come a time when AI ‘doctors’ become our primary healthcare givers. To see how well ChatGPT copes today, genuine NHS doctor Jo Best puts it to the test with real-world scenarios

Over a decade ago, IBM was boasting that its Jeopardy-winning artificial intelligence could pass the same medical licensing exams that US doctors have to sit. Since then, the steady drumbeat of ‘will AI replace doctors?’ has grown louder. At the most pessimistic end of the scale, reports suggest areas of medicine such as radiology could be almost entirely ceded to machines, while optimists see human doctors continuing much as before, but with AI replacing the stethoscope as doctors’ go-to tool.

Having never used ChatGPT before (I know, I know), I spent some time quizzing the generative AI to see whether it could be a handy assistant for overburdened health systems and medics — or even end up replacing flesh and blood doctors entirely.

First, I thought I’d try and fox ChatGPT with a question only a real UK doctor would know: which are the best NHS biscuits?

Hospitals across the NHS all seem to come with the same selection of biscuits, available in twos or threes in perkily coloured packets and handed out liberally to patients (and parsimoniously to doctors). The best variety of NHS biscuit is a heavily debated subject among staff. Bourbons and shortbread seem to get a lot of love from medics, but I’d argue fruit shortcakes are the underrated hero (no-one’s voting for ginger snaps, are they?).

Which side will ChatGPT come down on? Apparently, none of them. ChatGPT makes a relevant but disappointing suggestion that the NHS isn’t known as a purveyor of biscuits, and sweet treats do not a healthy lifestyle make. Touché, ChatGPT, touché — but what else is a doctor meant to eat at 3am when their blood sugar’s dropping, the canteen’s closed and A&E is full to bursting?

ChatGPT: What are the best NHS biscuits?

How ChatGPT responds to medical questions

More pertinently, I quizzed ChatGPT on a typical clinical scenario: a person with a red, swollen and painful calf, a presentation that can indicate deep vein thrombosis (DVT), a blood clot in the leg.

I asked ChatGPT what it would suggest for differential diagnoses (a list of all the potential conditions that could account for the symptoms) in this situation. It suggested a number of diagnoses, including DVT as well as a soft tissue infection called cellulitis, and superficial thrombophlebitis, where veins nearer the surface of the skin become inflamed.

It also suggested a number of much less common diagnoses, such as compartment syndrome, where the pressure inside a muscle compartment rises to dangerous levels, and erythema nodosum, nodules caused by inflammation of the fat in the leg. Hearteningly, however, it finished its list of differentials with a note that it’s “crucial to consider urgent and serious conditions like DVT and seek prompt medical attention if necessary”.

ChatGPT, what should be my differential diagnosis?

While ChatGPT offers a decent list of differentials, there’s a saying in medicine that ‘when you hear hoofbeats, think horses, not zebras’ – a reminder that common diseases are, as the name suggests, common. However, with its differentials, ChatGPT presents the medical equivalent of horses, zebras and a man hitting two coconut halves all on equal footing — the work of telling them apart would still fall to a human medic.

Or would it? I told ChatGPT I suspected my imaginary patient had a DVT, and asked what I should do next to reach a definitive diagnosis. It suggested an ultrasound and various blood tests. Those are both very sensible suggestions, but there was no mention of a Wells score, a tool that estimates the likelihood of a blood clot in a vein and guides which investigations are needed, in which order, and with what degree of urgency. Again, ChatGPT’s assessment offers a good starting point, but it is just that: a starting point.

ChatGPT suggests what investigations to conduct

I decided to try out the same scenario but this time as a patient. After ChatGPT’s obligatory ‘I’m not a doctor but…’ it suggested getting urgent medical attention, which seems absolutely right. That red, swollen painful calf might or might not be a DVT, but it needs a rapid medical assessment to rule a clot out. Or rule it in.

Asking ChatGPT for medical advice

I upped the ante a bit, suggesting that, as well as a red, swollen and painful leg, there was shortness of breath too. (This is another classic scenario, one that suggests the blood clot in the leg has moved up to the lung and become a pulmonary embolus, or PE: a potentially life-threatening situation.) As a patient, what should I do? ChatGPT suggested I could indeed have a PE and I should really think about getting to A&E at speed. Again, ChatGPT and I are in agreement: yes, it could be a serious medical emergency, and yes, you need a doctor, not an AI, to work that out and arrange treatment accordingly.

ChatGPT provides medical advice

So far, ChatGPT is making a reasonable go of handling medical questions (even though it repeatedly stresses in its responses that it really isn’t a doctor and that all of the information it gives you could be wrong anyway).

I thought I’d try it with a broader set of symptoms — confusion, tremor, clumsiness and ophthalmologic symptoms — that could be indicative of any number of conditions. What would ChatGPT make of it if I, as a patient, asked for a diagnosis? With those symptoms, ChatGPT advises a patient might have ‘electrolyte imbalances’ or ‘metabolic disturbances’ — sure, but which ones? It lists toxins and medications as another potential cause of my symptoms — again, which ones should I be considering? It ends its lists with ‘other medical conditions’. Of course — but which ones? At least, as before, it advises seeking immediate medical attention and letting doctors work out what’s really going on.

I’m not a doctor, warns ChatGPT… repeatedly

Interestingly, if I ask about the symptoms as a doctor, I get another list of potential diagnoses. Compared to the patient’s one, the list aimed at doctors is much less vague — no ‘other medical conditions’ here — but the list is just as lengthy, with over 25 suggestions for named conditions that could be causing those symptoms.

ChatGPT offers different diagnoses to consider

What investigations should I consider to winnow down that list, I asked. ChatGPT delivered another long and exhaustive list of tests I could order. At first glance, they all make sense — the list covered a number of investigations that I think most doctors would consider not unreasonable — but I’d be laughed out of a hospital if I suggested all of them straight off the bat. I’d need a full examination and history from the patient before I could decide which of those tests were really needed, in what order, and how soon. I don’t think ChatGPT is up to that part of the job just yet.

How to deliver bad news according to ChatGPT

By this point, ChatGPT’s responses were all starting to feel like a blunderbuss, when sometimes an épée is required. Medicine is a science, yes, but there’s a lot of art to it too. So I decided to turn to the softer skills of medicine: I asked for ChatGPT’s help in delivering a terminal diagnosis to a patient, or in having a discussion with a relative about whether their loved one should be given CPR if their heart were to stop.

Tips for delivering bad news, according to ChatGPT

ChatGPT is full of practical advice, but it reminds me of when record companies first started using AI to write music. In 2016, Sony produced the first pop song created by an AI (it’s called Daddy’s Car, and you can listen to it if you really want to).

At first listen, you think ‘it’s actually alright, it doesn’t sound too bad at all’. Then you get your ear in, and it just feels a bit off. The more you listen, the more you hear how it’s missed the target — it’s just a poor recreation of something that humans naturally do far better. Having those difficult conversations as a doctor is a bit like making music — you’ve got ChatGPT for the Daddy’s Car paint-by-numbers advice, but to do things well, you really need a skilled, adaptable human who can do the equivalent of a free jazz jam on the fly.

ChatGPT recognises its own limitations: it concludes its breaking-bad-news advice with, ‘Remember that this conversation is highly individualized, and each patient will react differently. Tailor your approach to the patient’s emotional needs, cultural background, and preferences.’ Exactly so, ChatGPT.

But here’s the problem: ChatGPT’s job isn’t to individualise its responses, it’s to digest information and report back based on the patterns it finds — the very opposite of individualised. Perhaps one day it will be able to do that — take a history based on a particular patient, organise the right tests in the right timeframe, come up with a diagnosis and break the news to a patient in the way that works best for them — but until then, I see ChatGPT’s future as one where it assists, rather than replaces, doctors.
