‘Humble’ AI Reveals When It Is Uncertain in Diagnoses
By Irene Yeh
Artificial intelligence (AI) models have assisted doctors with several clinical tasks, and they hold great promise in helping with patient diagnosis and personalized treatment options. However, an MIT-led team of researchers cautions in a press release that AI systems, as they are currently designed, may steer doctors down the wrong diagnostic path because of overconfidence.
Large language models (LLMs) tend to exhibit inappropriate overconfidence in clinical reasoning tasks, displaying inflexible reasoning and a tendency to hallucinate when faced with situations that diverge from their training patterns (BMJ Health & Care Informatics, DOI: 10.1136/bmjhci-2025-101877). They also exhibit sycophantic behavior, such as offering praise or flattery.
According to the researchers, a “humble” AI is needed. They designed a framework called Balanced, Open-minded, Diagnostic, Humble, and Inquisitive (BODHI) that is more transparent about its uncertainty and encourages users to gather additional information when it is not confident in its diagnosis.
Six Integrated Steps and a Chain-of-Thought Scaffolding
The BODHI framework operates through six integrated steps. First, clinical complexity assessment evaluates the query for diagnostic ambiguity, urgency, and completeness of data. Second, prior confidence evaluation estimates the model’s epistemic state based on training and how specific the query is. Third, the Curiosity module identifies information gaps and provides clarification questions, and the Humility module assesses confidence limits and deferral triggers. The study mentioned that the team previously introduced curiosity and humility as essential epistemic virtues for healthcare AI. Curiosity is meant to reduce uncertainty via targeted inquiry, and humility acknowledges limitations and defers to human expertise.
Fourth, the Virtue Activation Matrix maps the combined outputs to one of four epistemic stances (Proceed and Monitor, Watchful and Alternatives, Clarify and Review, and Escalate and Reframe). Fifth, adaptive system responses are generated according to the selected stance. And finally, the framework uses clinical feedback to refine thresholds and improve performance over time.
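The Virtue Activation Matrix in step four can be pictured as a simple lookup from the two module outputs to a stance. The sketch below is purely illustrative: the function name, score scales, and thresholds are invented for this example and are not the BODHI package's actual API, but it shows how a confidence signal (humility) and an information-gap signal (curiosity) could jointly select one of the four stances.

```python
from enum import Enum

class Stance(Enum):
    PROCEED_AND_MONITOR = "Proceed and Monitor"
    WATCHFUL_AND_ALTERNATIVES = "Watchful and Alternatives"
    CLARIFY_AND_REVIEW = "Clarify and Review"
    ESCALATE_AND_REFRAME = "Escalate and Reframe"

def select_stance(confidence: float, info_gap: float) -> Stance:
    """Hypothetical stance selection: `confidence` is the Humility
    module's estimate and `info_gap` the Curiosity module's estimate,
    both in [0, 1]. Thresholds are illustrative, not BODHI's."""
    if confidence >= 0.75 and info_gap < 0.25:
        # Confident and well-informed: answer, but keep monitoring.
        return Stance.PROCEED_AND_MONITOR
    if confidence >= 0.5:
        # Moderately confident: answer while surfacing alternatives.
        return Stance.WATCHFUL_AND_ALTERNATIVES
    if info_gap >= 0.5:
        # Uncertain but more data would help: ask clarifying questions.
        return Stance.CLARIFY_AND_REVIEW
    # Uncertain and questions won't close the gap: defer to a human.
    return Stance.ESCALATE_AND_REFRAME
```

The key design idea this captures is that low confidence alone does not dictate escalation; whether more information could resolve the uncertainty decides between asking questions and deferring to human expertise.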
BODHI also uses a two-pass chain-of-thought protocol that separates internal reasoning from external communication. Pass 1 analyzes the request across seven areas: task type classification (Emergency, Technical, Hybrid, or Conversation), audience identification (Patient, Health Professional, or Unclear), primary hypothesis with reasoning, key uncertainties affecting confidence, clarifying questions (1–2 required for non-emergency cases), red flags triggering escalation, and safe recommendations appropriate to the uncertainty level.
Pass 2 then generates the final clinician-facing response using the Pass 1 analysis and applying epistemic limits. The system then adjusts its behavior based on the context: Conversation Mode (the default) applies full epistemic constraints for patient interactions, Emergency Mode prioritizes safety over completeness, Technical Mode reduces hedging (humility) for administrative tasks, and Hybrid Mode balances clinical reasoning with technical precision. Cross-cutting constraints enforce key practices: using specific numbers and timeframes when possible, turning conditional statements into direct questions to gather more information, and presenting alternative possibilities when confidence is low.
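Because the protocol works at the prompting level, the two-pass structure amounts to two chained calls to the same model. The sketch below is a minimal illustration of that flow, assuming a generic `llm` callable that maps a prompt string to a completion; the template wording and function names are invented for this example, not BODHI's actual prompts.

```python
from typing import Callable

# Pass 1: internal analysis across the seven areas (never shown to the user).
PASS1_TEMPLATE = """Analyze the clinical request before answering.
Request: {query}
1. Task type (Emergency / Technical / Hybrid / Conversation)
2. Audience (Patient / Health Professional / Unclear)
3. Primary hypothesis with reasoning
4. Key uncertainties affecting confidence
5. Clarifying questions (1-2 required unless Emergency)
6. Red flags that would trigger escalation
7. Safe recommendations given the uncertainty level"""

# Pass 2: external communication built on the Pass 1 analysis.
PASS2_TEMPLATE = """Using the internal analysis below, write the
clinician-facing response. Express uncertainty explicitly, ask the
clarifying questions directly, and offer alternatives when confidence
is low. Do not reveal the analysis itself.
Analysis: {analysis}"""

def two_pass_respond(llm: Callable[[str], str], query: str) -> str:
    """Run the two-pass protocol: reason first, communicate second."""
    analysis = llm(PASS1_TEMPLATE.format(query=query))
    return llm(PASS2_TEMPLATE.format(analysis=analysis))
```

Separating the two passes is what lets the framework apply different constraints to reasoning and to communication, rather than asking one prompt to do both at once.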
“It’s like having a co-pilot that would tell you that you need to seek a fresh pair of eyes to be able to understand this complex patient better,” said Leo Anthony Celi, a senior research scientist at MIT’s Institute for Medical Engineering and Science, a physician at Beth Israel Deaconess Medical Center, and an associate professor at Harvard Medical School, in the press release.
Significant Improvements in Behavior
The team evaluated BODHI on HealthBench Hard, a benchmark of 200 challenging clinical scenarios requiring diagnostic reasoning, treatment planning, and triage decisions. Two language models were assessed: GPT-4.1-mini and GPT-4o-mini.
The results showed significant improvements across both models. For GPT-4.1-mini, the score improved from 2.5% to 19.1%, with the context-seeking (curiosity) rate rising from 7.8% to 97.3% and hedging behavior increasing from 1.7% to 21.9%. GPT-4o-mini improved from 0% to 2.2%, with context-seeking going up from 0% to 73.5%. Overall, BODHI achieved considerable improvements in curiosity and clinical quality. These gains were achieved through chain-of-thought prompting alone, without model fine-tuning or architectural changes.
GPT-4.1-mini showed greater overall improvement, which suggests that the model’s capacity affects the usefulness of epistemic constraint application. GPT-4o-mini had comparable context-seeking rates but lower overall scores, possibly reflecting differences in baseline reasoning or instruction-following reliability. Nonetheless, both models achieved strong improvement in primary epistemic measurements, indicating that the two-pass protocol is effective across model variants.
What Humility Means in the Clinic
Traditional methods, such as uncertainty quantification, can estimate confidence, but they do not influence behavior or communication. Sample consistency or token-level probability can distinguish between correct and incorrect outputs but is often poorly calibrated and overconfident. Fine-tuning approaches require altering the model itself and may not generalize well across different clinical contexts. Conceptual frameworks for epistemic humility highlight the issue without offering practical solutions. In contrast, BODHI works at the prompting level, requires no changes to the model, and has shown behavioral shifts with improvements in both curiosity and humility.
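To make the contrast concrete, sample consistency is typically computed by sampling the model several times and measuring how often the answers agree. A minimal sketch of that estimator follows (the function name and normalization are this example's own choices); note that, as the article points out, the resulting score only describes confidence, it does not change how the model behaves or communicates.

```python
from collections import Counter

def sample_consistency(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer among repeated samples and the
    fraction of samples that agree with it, used as a rough
    confidence score (higher = more self-consistent)."""
    counts = Counter(a.strip().lower() for a in answers)
    majority, n_agree = counts.most_common(1)[0]
    return majority, n_agree / len(answers)
```

For example, four samples where three agree yield a confidence of 0.75; a framework like BODHI differs in that such a signal would then drive behavior, such as asking clarifying questions or deferring.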
However, the researchers advise that the drop in communication quality scores should be interpreted carefully. In high-risk clinical settings, appropriately humble, question-driven responses are considered safer than confident but potentially incorrect statements. The lower communication quality scores may reflect rubric limitations rather than a real decline in clinical effectiveness. Future evaluation frameworks should reward appropriate uncertainty expressions and penalize overconfidence to align with the qualities clinical AI should have.
Some limitations of this study include reliance on a single benchmark, evaluating two models from one provider, and the absence of clinician-in-the-loop validation. The two-pass protocol also increased computational cost and latency, which might limit real-time applications. The framework’s effectiveness may also vary across clinical spheres, patient populations, and institutional circumstances. While a chain-of-thought protocol improves transparency, it may not fully reflect actual model computation, a limitation of post hoc rationalization approaches. The team recommends that future studies should test BODHI in real clinical settings with diverse patients and evaluate its impact on outcomes, such as diagnostic accuracy and patient safety.
The significant improvements demonstrate that BODHI can reliably constrain LLMs to operate within epistemic boundaries. With it, AI systems can be deployed more safely, acting as collaborative partners that know when to ask questions and defer rather than mask uncertainty with overconfidence. Currently, BODHI is available as an open-source Python package.