

AI-Generated vs. Human Feedback in Clinical Education
A Comparative Study of LLM-Generated Narratives vs. Human Faculty Evaluations

Doctoral Research @Columbia University (2024-25)


Executive Summary:

I contributed to a study evaluating a custom AI tool that converts clinical data into Medical Student Performance Evaluation (MSPE) narratives. Using a blinded, human-in-the-loop framework, we found that AI-generated reports were comparable in quality to human-written ones, and that students preferred them for their clarity and actionability.


The MSPE is an institutional, standardized evaluation that synthesizes a student's experiences, attributes, and academic performance for residency programs.


These findings suggest that applying learning science to AI workflows can improve feedback quality while reducing faculty workload in time-constrained clinical settings.


1. The Challenge


  • Scalability: Creating a comprehensive MSPE for every student is labor-intensive for faculty.


  • Consistency: Human feedback varies significantly in clarity and actionability.


  • Trust: Can students trust an AI to summarize their professional performance for high-stakes residency applications?


2. The Methodology (The Evaluation Framework)

We designed a rigorous empirical study to test the efficacy of a custom AI tool:


  • Data Synthesis: The AI analyzed raw end-of-clerkship feedback to generate composite narratives and Individualized Learning Plans (ILPs).


  • Blinded Evaluation: Medical students performed a blinded, "Turing Test"-style review, rating both human- and AI-generated outputs on 7-point Likert scales.


  • Metric Operationalization: "Quality" was decomposed into four measurable dimensions: Clarity, Organization, Readability, and Comprehensiveness (a minimal scoring sketch follows this list).
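To make the framework concrete, here is a minimal sketch of how blinded 7-point Likert ratings could be aggregated by source and dimension. The record structure, field names, and scores below are illustrative placeholders, not the study's actual instruments or data.

```python
# Minimal sketch: aggregating blinded 7-point Likert ratings by source and
# dimension. Records and scores are hypothetical illustrations.
from collections import defaultdict
from statistics import mean

DIMENSIONS = ["clarity", "organization", "readability", "comprehensiveness"]

# Each record represents one rater's scores for one narrative; raters did
# not know whether the narrative was human- or AI-generated (blinded).
ratings = [
    {"source": "ai",    "clarity": 6, "organization": 6, "readability": 5, "comprehensiveness": 6},
    {"source": "human", "clarity": 5, "organization": 5, "readability": 5, "comprehensiveness": 4},
    # ... one record per rater x narrative
]

def summarize(records):
    """Mean Likert score per (source, dimension), plus an overall mean per source."""
    by_key = defaultdict(list)
    for r in records:
        for dim in DIMENSIONS:
            by_key[(r["source"], dim)].append(r[dim])
    summary = {k: mean(v) for k, v in by_key.items()}
    for source in {r["source"] for r in records}:
        summary[(source, "overall")] = mean(
            r[d] for r in records if r["source"] == source for d in DIMENSIONS
        )
    return summary

for (source, dim), score in sorted(summarize(ratings).items()):
    print(f"{source:>5} | {dim:<17} | {score:.2f}")
```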


3. Key Findings


  • Higher Perceived Quality: AI-generated feedback scored higher on the 7-point scale (mean 5.56) than human feedback (mean 5.06); a hypothetical non-inferiority check is sketched after this list.


  • Strong Preference: Students preferred AI-generated narratives for their official MSPE submissions (mean votes: 13.75 vs. 6.5).


  • Actionable Insights: Over 85% of participants agreed that AI-generated ILPs provided superior "actionable goals" for their future clinical practice.
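To show how "comparable" can be tested statistically (the "p-values for non-inferiority" noted under Technical Details), here is a minimal sketch using a normal approximation. Only the two published means come from the study; the standard deviations, sample sizes, and non-inferiority margin are assumed placeholders.

```python
# Minimal sketch of a non-inferiority check on Likert means (normal
# approximation, stdlib only). Only the means are from the study; sd, n,
# and the margin are assumed placeholders.
from math import sqrt
from statistics import NormalDist

mean_ai, sd_ai, n_ai = 5.56, 1.1, 40   # sd and n assumed, not reported here
mean_hu, sd_hu, n_hu = 5.06, 1.2, 40
margin = 0.5                           # non-inferiority margin (assumed)

# H0: mean_ai - mean_hu <= -margin  (AI worse by more than the margin)
# H1: mean_ai - mean_hu >  -margin  (AI is non-inferior)
diff = mean_ai - mean_hu
se = sqrt(sd_ai**2 / n_ai + sd_hu**2 / n_hu)
z = (diff + margin) / se
p_value = 1 - NormalDist().cdf(z)      # one-sided p-value

print(f"diff = {diff:.2f}, z = {z:.2f}, p = {p_value:.4f}")
# p < 0.05 would support rejecting inferiority at the chosen margin.
```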


4. Technical Details & Specifications


  • Model: GPT-4o


  • Prompt Engineering Strategy: few-shot examples (a minimal call sketch follows this list)


  • Evaluation Metrics: p-values from non-inferiority testing (illustrated in the check sketched under Key Findings)
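For illustration, here is a minimal sketch of a few-shot GPT-4o call using the OpenAI Python SDK. The system prompt, worked example, input text, and temperature are hypothetical stand-ins, not the study's actual prompts.

```python
# Minimal few-shot sketch with the OpenAI Python SDK. The prompt text,
# example pair, and parameters are hypothetical illustrations.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "You convert raw end-of-clerkship feedback into a concise MSPE-style "
    "narrative followed by an Individualized Learning Plan (ILP)."
)

# Few-shot: one worked example showing the desired input -> output mapping.
EXAMPLE_INPUT = "Preceptor notes: strong histories; differential reasoning needs work."
EXAMPLE_OUTPUT = (
    "Narrative: The student consistently elicited thorough histories...\n"
    "ILP: 1) Practice building prioritized differentials on each new patient..."
)

def generate_narrative(raw_feedback: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": EXAMPLE_INPUT},       # few-shot example
            {"role": "assistant", "content": EXAMPLE_OUTPUT},
            {"role": "user", "content": raw_feedback},
        ],
        temperature=0.3,  # low temperature for a consistent summative tone
    )
    return response.choices[0].message.content

print(generate_narrative("Preceptor notes: reliable, empathetic; presentations run long."))
```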


Experience Highlights


  • Operationalized complex human interactions into measurable metrics by designing a custom AI evaluation framework for assessing the quality of non-deterministic summative feedback.


  • Led rigorous empirical research comparing AI-generated vs. human-generated narratives, utilizing 7-point Likert scales and blinded comparative analysis to validate AI non-inferiority.


  • Developed "Golden Data Sets" and evaluation rubrics for LLM performance, achieving 85%+ alignment between automated learning plans and human expert standards (a simplified rubric check is sketched after this list).


  • Synthesized qualitative patterns and quantitative data (p-value analysis) to translate research findings into actionable product recommendations for educational AI tools.
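As a simplified illustration of the golden-set idea, here is a minimal sketch that scores a generated ILP against rubric criteria drawn from an expert-written reference. The criteria, keyword matcher, and threshold are hypothetical simplifications; the study relied on expert rubrics and human review.

```python
# Minimal golden-set sketch: score a generated ILP against rubric criteria.
# Criteria and matcher are hypothetical simplifications.
GOLDEN_SET = [
    {
        "input": "Preceptor notes: strong histories; differential reasoning needs work.",
        "required_criteria": [
            "names a specific skill to improve",
            "sets a measurable goal",
            "ties the goal to clinical practice",
        ],
    },
]

def meets(criterion: str, ilp_text: str) -> bool:
    """Placeholder keyword judge; the study used human raters and rubrics."""
    keywords = {
        "names a specific skill to improve": ["differential", "reasoning"],
        "sets a measurable goal": ["each", "per", "weekly"],
        "ties the goal to clinical practice": ["patient", "clinical"],
    }
    return any(k in ilp_text.lower() for k in keywords[criterion])

def alignment(ilp_text: str, case: dict) -> float:
    """Fraction of required rubric criteria the ILP satisfies."""
    hits = sum(meets(c, ilp_text) for c in case["required_criteria"])
    return hits / len(case["required_criteria"])

ilp = "Practice building prioritized differentials on each new patient encounter."
score = alignment(ilp, GOLDEN_SET[0])
print(f"rubric alignment: {score:.0%}")  # e.g., compare against an 85% bar
```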

