AI tools vary in predictability based on their design, application, and underlying technology. While some AI systems provide consistent outputs within well-defined parameters, others may show greater variability because of their probabilistic nature, the complexity of their tasks, or frequent updates to their underlying models or knowledge bases. A single assessment of an AI tool’s efficacy and safety therefore does not provide sufficient safeguards against errors, biases, and potential harm to patients or learners. Frequent evaluation, grounded in evidence-based practice frameworks, is essential to ensure these tools remain effective and safe in their intended contexts.
Additionally, fostering research initiatives and promoting collaboration ensure that evidence-based findings inform ongoing practice, policy development, and advances in the field. Together, frequent evaluation and scholarly research promote a culture of continuous improvement, one that encourages open discussion of AI tool performance and supports collective learning.
From Principle to Practice
Apply this principle to your practice using the following strategies:
- Implement assessment programs for AI tools. Convene interdisciplinary committees comprising medical educators, clinicians, AI developers, ethicists, and learners to guide the evaluation process and to conduct regular review cycles that assess performance trends, identify unintended consequences, and recommend ongoing improvements. These committees should develop frameworks that combine quantitative metrics (e.g., accuracy, reliability, user engagement) with qualitative assessments (e.g., impact on clinical reasoning, educational value, patient safety). Begin by incorporating AI tool feedback into existing evaluation processes: use surveys to assess and document AI tool accuracy, bias, and educational value during specific educational experiences (e.g., curricular components such as courses or rotations), and bring the results to committee meetings for brief, regular discussions of how AI tools are performing.
- Build evaluation into implementation. When introducing new AI tools, establish monitoring plans from the start that define success metrics before deployment, identify who will collect and review data, and set regular review intervals (monthly initially, then quarterly). This proactive approach prevents tools from operating without oversight and makes evaluation a natural part of the implementation process.
- Gather and appraise diverse sources of evidence. Evaluation and oversight should draw on a variety of evidence sources. In addition to qualitative surveys, firsthand observations, and ad hoc feedback from users, evidence sources should include published literature in the form of peer-reviewed research or perspectives that document real-world AI performance. Evidence should be critically appraised for reliability, validity, and applicability to ensure robust conclusions and actionable insights.
- Scale monitoring efforts appropriately. Match evaluation intensity to tool impact and risk. For low-stakes tools used occasionally, periodic informal reviews may suffice. For tools affecting assessment or clinical education, implement more structured evaluation including documented review cycles, defined performance thresholds, and clear escalation procedures for concerns. As AI use expands, consider forming a dedicated working group, developing standardized evaluation rubrics, or partnering with other institutions to share monitoring approaches and findings.
- Support research in AI evaluation. Institutions should foster research initiatives aimed at evaluating AI’s impact on medical education by establishing funding mechanisms and encouraging multicenter collaborations to conduct large-scale studies. They should engage educators, staff, and learners in systematic research through protected time and resources, enabling robust data collection on AI’s educational impact. The evaluation process should begin by defining clear, hypothesis-driven questions that assess an AI tool’s utility, risks, and benefits within its intended context. Findings should be disseminated through peer-reviewed publications, conferences, and educational repositories to build an evidence base for best practices in AI integration. Research priorities should address stakeholder needs and emerging challenges in AI-enhanced education.