UX Research Projects
AI-Generated vs. Human Feedback in Clinical Education
A Comparative Study of LLM-Generated Narratives vs. Human Faculty Evaluations
Doctoral Research @Columbia University (2024-25)

Executive Summary:
I contributed to a study evaluating a custom AI tool that converts clinical data into Medical Student Performance Evaluation (MSPE) narratives. Using a blinded, human-in-the-loop framework, we found AI-generated reports were comparable to human-written ones and were preferred by students for their clarity and actionability.

The MSPE is an institutional, standardized evaluation that synthesizes a student's experiences, attributes, and academic performance for residency programs.

These findings suggest that applying learning science to AI workflows can improve feedback quality while reducing faculty workload in time-constrained clinical settings.

1. The Challenge

- Scalability: Creating comprehensive MSPEs is labor-intensive for faculty.
- Consistency: Human-written feedback varies significantly in clarity and actionability.
- Trust: Can students trust an AI to summarize their professional performance for high-stakes residency applications?

2. The Methodology (The Evaluation Framework)

We designed a rigorous empirical study to test the efficacy of a custom AI tool:

- Data Synthesis: The AI analyzed raw end-of-clerkship feedback to generate composite narratives and Individualized Learning Plans (ILPs).
- Blinded Evaluation: Medical students performed a blinded, "Turing Test"-style review, rating both human and AI outputs on 7-point Likert scales.
- Metric Operationalization: "Quality" was decomposed into four measurable dimensions: Clarity, Organization, Readability, and Comprehensiveness (a sketch of this rating scheme follows below).
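
To make the rating scheme concrete, here is a minimal Python sketch of how a single blinded rating could be recorded and aggregated. The data model and names (BlindedRating, composite_score) are illustrative assumptions, not the study's actual instrument.

```python
from dataclasses import dataclass
from statistics import mean

# The four quality dimensions, each rated on a 7-point Likert scale
# (1 = strongly disagree, 7 = strongly agree).
DIMENSIONS = ("clarity", "organization", "readability", "comprehensiveness")

@dataclass
class BlindedRating:
    """One student's rating of one narrative, with its source hidden."""
    narrative_id: str   # de-identified label, e.g. "N-07" (hypothetical)
    source: str         # "human" or "ai"; revealed only after rating
    scores: dict        # dimension name -> Likert score (1-7)

def composite_score(rating: BlindedRating) -> float:
    """Average the four dimension scores into one quality score."""
    return mean(rating.scores[d] for d in DIMENSIONS)

def group_means(ratings: list) -> dict:
    """Mean composite quality per source, computed after unblinding."""
    by_source = {}
    for r in ratings:
        by_source.setdefault(r.source, []).append(composite_score(r))
    return {src: mean(vals) for src, vals in by_source.items()}

# Example with invented scores: two ratings, one per source.
ratings = [
    BlindedRating("N-01", "ai",
                  {"clarity": 6, "organization": 6, "readability": 5,
                   "comprehensiveness": 6}),
    BlindedRating("N-02", "human",
                  {"clarity": 5, "organization": 5, "readability": 5,
                   "comprehensiveness": 5}),
]
print(group_means(ratings))  # e.g. {'ai': 5.75, 'human': 5.0}
```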

3. Key Findings

- Higher Perceived Quality: AI-generated feedback scored higher on the 7-point scale (mean 5.56) than human feedback (mean 5.06).
- Strong Preference: Students preferred AI-generated narratives for their official MSPE submissions (mean votes 13.75 vs. 6.5).
- Actionable Insights: Over 85% of participants agreed that AI-generated ILPs provided superior "actionable goals" for their future clinical practice.

4. Technical Details & Specifications

- Model Architecture: GPT-4o
- Prompt Engineering Strategy: Few-shot examples (a prompt sketch follows below)
- Evaluation Metrics: p-values from non-inferiority testing (a statistical sketch follows below)
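
As an illustration of the few-shot strategy, the sketch below shows one way to structure a GPT-4o call where a worked example (raw comments paired with a finished narrative) precedes the new case. The prompt text, the helper name generate_mspe_narrative, and the example content are invented for demonstration; the study's actual prompts are not reproduced here.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative few-shot pair: de-identified clerkship comments alongside a
# faculty-approved narrative, so the model can imitate the target tone and
# structure. Both strings are invented for demonstration.
EXAMPLE_COMMENTS = "Strong differential diagnoses; presentations ran long."
EXAMPLE_NARRATIVE = (
    "The student consistently constructed strong differential diagnoses. "
    "A growth area is delivering more concise oral presentations."
)

def generate_mspe_narrative(raw_comments: str) -> str:
    """Generate a composite MSPE-style narrative from raw clerkship feedback."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You write balanced, professional MSPE narratives "
                        "from raw end-of-clerkship feedback."},
            # Few-shot example: input comments -> desired narrative.
            {"role": "user", "content": EXAMPLE_COMMENTS},
            {"role": "assistant", "content": EXAMPLE_NARRATIVE},
            # The new case to summarize.
            {"role": "user", "content": raw_comments},
        ],
    )
    return response.choices[0].message.content
```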
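
The write-up reports p-values for non-inferiority without specifying the exact procedure. One standard approach is a one-sided test against a pre-specified margin; the sketch below applies it to invented Likert composites using SciPy. The 0.5-point margin and the data are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Invented composite Likert scores (1-7), for illustration only.
human_scores = np.array([5.0, 5.5, 4.5, 5.0, 5.5, 5.0, 4.5, 5.5])
ai_scores    = np.array([5.5, 6.0, 5.0, 5.5, 6.0, 5.5, 5.0, 6.0])

# Non-inferiority margin: the largest deficit in mean quality we would
# still accept as "as good as human" (an assumed value).
MARGIN = 0.5

# H0: mean(ai) <= mean(human) - MARGIN  (AI is inferior beyond the margin)
# H1: mean(ai) >  mean(human) - MARGIN  (AI is non-inferior)
# Shifting the AI scores up by the margin turns this into a standard
# one-sided two-sample t-test.
t_stat, p_value = stats.ttest_ind(ai_scores + MARGIN,
                                  human_scores,
                                  alternative="greater")
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.4f}")
if p_value < 0.05:
    print("Non-inferiority supported at the 0.05 level.")
```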

Experience Highlights

- Operationalized complex human interactions into measurable metrics by designing a custom AI evaluation framework to assess non-deterministic summative feedback quality.
- Led rigorous empirical research comparing AI-generated and human-generated narratives, using 7-point Likert scales and blinded comparative analysis to validate AI non-inferiority.
- Developed "Golden Data Sets" and evaluation rubrics for LLM performance, achieving 85%+ alignment between automated learning plans and human expert standards (an alignment-check sketch follows below).
- Synthesized qualitative patterns and quantitative data (p-value analysis) to translate research findings into actionable product recommendations for educational AI tools.
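
As a sketch of how golden-data-set alignment could be computed: each AI-generated learning-plan goal is compared against the expert-written standard for the same case, and the share of matches gives the alignment rate. The labels and helper below are hypothetical, not the study's actual rubric.

```python
# Expert judgments: for each case in the golden data set, did the
# AI-generated ILP goal satisfy the expert-written standard?
# (Hypothetical labels for illustration.)
expert_judgments = {
    "case-01": True,
    "case-02": True,
    "case-03": False,  # AI goal diverged from the rubric standard
    "case-04": True,
    "case-05": True,
    "case-06": True,
    "case-07": True,
}

def alignment_rate(judgments: dict) -> float:
    """Fraction of AI-generated goals judged aligned with expert standards."""
    return sum(judgments.values()) / len(judgments)

print(f"Alignment: {alignment_rate(expert_judgments):.0%}")  # 86%
```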