Liam McCoy, MD MSc
@LiamGMcCoy
PGY4 @uofa_neurology | Research @mitcriticaldata @BIDMC_medicine | MSc @ihpmeuoft | MD @uoftmedicine | trying to fix the medical knowledge system
How do we surface and interrogate the subtle and complex biases we can see in the freeform generation of LLMs? Out today in @NatureMedicine: I collaborated with a great team @GoogleHealth @GoogleDeepMind on the largest-scale exploration of this question to date.

I think we are also destined, somewhat ironically, for a period of less evidence-based practice. Adherence to opinion-heavy guidelines is the easy proximal target for reasoning systems, before the era of truly high-quality auto-gathered evidence.
We have an upcoming BMJ AI topic collection on exactly this! We know a lot about the ways models perform, but we know so little about why bmjdigitalhealth.bmj.com/pages/topic-co…
A great soft indicator of just how much health information consumption has already shifted to ChatGPT
“Yet as Google does the Googling, humans no longer visit the websites from which the information is gleaned. Similarweb, which measures traffic to more than 100m web domains, estimates that worldwide search traffic … fell by about 15% in the year to June” economist.com/business/2025/…
This is the same with medical data, with the added step that you need an understanding of the underlying clinical context. Only by getting knee-deep in mind-numbing data work do you realize just how significant the gaps are between the data and the reality you hope to model.
For biological data, if you don't have deep expertise in this low-value work called data cleaning, you are lacking a fundamental understanding of the idiosyncrasies of the data. Without this knowledge, it is impossible to seriously model the data.
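A minimal sketch of the kind of idiosyncrasy that only surfaces during cleaning: the same lab reported in different units across sites. The column names and example values here are illustrative assumptions, not from any real dataset.

```python
import pandas as pd

# Illustrative toy data: creatinine recorded in mg/dL at one site
# and umol/L at another. Without the unit column, 97.0 would look
# like renal failure rather than a normal value in different units.
labs = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "creatinine": [1.1, 97.0, 0.9],
    "unit": ["mg/dL", "umol/L", "mg/dL"],
})

# Harmonize to mg/dL (1 mg/dL creatinine is roughly 88.4 umol/L)
# before any modeling step sees the values.
mask = labs["unit"] == "umol/L"
labs.loc[mask, "creatinine"] = labs.loc[mask, "creatinine"] / 88.4
labs["unit"] = "mg/dL"
```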
Amazing work by @PierreEliasMD and team - among the clearest examples of AI analysis surfacing a meaningful and actionable signal. Proper ML for healthcare, not just ML on healthcare data. Was a pleasure to hear about this at SAIL and I am glad to see the final paper!
🧵1/Today, we published a key milestone towards AI-based cardiac screening in Nature. doi.org/10.1038/s41586… EchoNext outperformed cardiologists and found thousands of high-risk patients missed in routine care. We also made a version available to the world.
Add to the annals of "multiple choice questions are bad benchmarks": you don't even need to give the model the question for it to get the answers.
There's been a hole at the heart of #LLM evals, and we can now fix it. 📜New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations. ❗️We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer…
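A minimal sketch of the choices-only probe this describes, assuming a generic chat client; `ask_model` is a hypothetical stand-in, not an API from the paper.

```python
def ask_model(prompt: str) -> str:
    """Hypothetical model call; replace with your LLM client of choice."""
    raise NotImplementedError

def choices_only_accuracy(items: list[dict]) -> float:
    """Ask the model to pick an answer from the choices alone, with the
    question text withheld. Accuracy well above chance suggests the
    choices themselves leak the answer (memorization or distractor
    artifacts), so the benchmark is gameable."""
    correct = 0
    for item in items:
        letters = "ABCD"[: len(item["choices"])]
        listing = "\n".join(
            f"{letter}. {choice}"
            for letter, choice in zip(letters, item["choices"])
        )
        prompt = (
            "The question has been removed. Based only on the answer "
            "choices below, which is most likely correct? "
            f"Reply with a single letter.\n{listing}"
        )
        if ask_model(prompt).strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(items)

# Chance baseline for 4-option MCQs is 0.25; scoring far above that
# with the question withheld is the failure mode described above.
```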
Not only do these cases fail to capture the ambiguity of real clinical scenarios (e.g. contradictory/red herring findings), I worry that this approach enables the LLMs to secretly share the answer with each other. Outputs generated under "don't reveal X" still involve the circuits of X
Hallucinating “Numerically or descriptively consistent” results is … hard. And not how medicine works. Why do we need to draw labs if we can think through what it should be? Tests are meant to make some diagnoses more likely and some less likely. And they can surprise you, making…
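The point can be made with standard pre-/post-test probability arithmetic (the numbers below are made up for the example): a test shifts probability in either direction, so its result cannot be inferred from the prior.

```python
def post_test_prob(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Convert pre-test probability to post-test probability via odds."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# With a 30% pre-test probability, a positive result at LR+ = 6 raises it
# to ~72%, while a negative result at LR- = 0.2 drops it to ~8%. Either
# outcome is possible, which is exactly why the lab has to be drawn
# rather than "reasoned through".
print(post_test_prob(0.30, 6.0))  # ~0.72
print(post_test_prob(0.30, 0.2))  # ~0.08
```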
Our exact point in our @NEJM editorial last fall. Writing notes (and knowing you'll have to write a note) impacts your cognitive process. Further, will fatigued, burnt-out docs really be effectively supervising and reviewing those LLM-driven notes?
Remember: Writing helps doctors think. Automation skips that step—see this story by @adamcifu here x.com/adamcifu/statu… CC'ing folks thinking about LLMs as tools for better cognition: @m_sendhil @keyonV @2plus2make5 @EricTopol
This is also key to our ongoing clinical LLM work at Harvard. An effective prompt is necessary but far from sufficient, and relatively easy compared to the steps of wrangling clinical data streams appropriately into context at the right time
+1 for "context engineering" over "prompt engineering". People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window…
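A minimal sketch of what that context-window-filling looks like in practice: selecting and ordering the right data into a bounded budget, which is where the clinical-data wrangling above actually bites. All names here are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    source: str       # e.g. "latest_note", "med_list", "lab_trend"
    text: str
    relevance: float  # upstream retrieval/recency score, 0..1

def build_context(snippets: list[Snippet], budget_chars: int = 8000) -> str:
    """Greedily pack the most relevant snippets into a character budget,
    most relevant first, each labeled with its source for traceability."""
    packed, used = [], 0
    for s in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        block = f"[{s.source}]\n{s.text}\n"
        if used + len(block) > budget_chars:
            continue
        packed.append(block)
        used += len(block)
    return "\n".join(packed)
```

The packing step is the easy part; the hard part, per the tweet above, is the upstream work of getting the right clinical data streams scored and staged at the right time.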
Claude has a spiritual bliss attractor, Gemini has a suicidal shame attractor inside you there are two models...
Man, what happened to Gemini? This is like the third time I've seen it threaten suicide ("delete my own source code") after making too many coding mistakes.
The fish don't see the water, o3 doesn't smell the slop
o3 couldn't understand the irony: chatgpt.com/share/68598b25…
I, for one, would never use an LLM to draft my tweets—it's not just quality, it's respect for my followers.
ChatGPT and other popular LLMs have too many writing tells. This is why I only use Mistral models for my AI slop.