UX Research Projects
AI-Generated vs. Human Feedback in Clinical Education
A Comparative Study of LLM-Generated Narratives vs. Human Faculty Evaluations
Doctoral Research @Columbia University (2024-25)

Executive Summary:
I contributed to a study evaluating a custom AI tool that converts clinical data into Medical Student Performance Evaluation (MSPE) narratives. Using a blinded, human-in-the-loop framework, we found AI-generated reports were comparable to human-written ones and were preferred by students for their clarity and actionability.

The MSPE is an institutional, standardized evaluation that synthesizes a student's experiences, attributes, and academic performance for residency programs.

These findings suggest that applying learning science to AI workflows can improve feedback quality while reducing faculty workload in time-constrained clinical settings.

1. The Challenge

- Scalability: Creating comprehensive MSPEs is labor-intensive for faculty.
- Consistency: Human-written feedback varies significantly in clarity and actionability.
- Trust: Can students trust an AI to summarize their professional performance for high-stakes residency applications?

2. The Methodology (The Evaluation Framework)

We designed a rigorous empirical study to test the efficacy of a custom AI tool:

- Data Synthesis: The AI analyzed raw end-of-clerkship feedback to generate composite narratives and Individualized Learning Plans (ILPs).
- Blinded Evaluation: Medical students performed a blinded, "Turing Test"-style review, rating both human and AI outputs on 7-point Likert scales.
- Metric Operationalization: "Quality" was decomposed into four measurable dimensions: Clarity, Organization, Readability, and Comprehensiveness (a sketch of this rating scheme follows below).
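
To make the rating scheme concrete, here is a minimal Python sketch of how a single blinded rating could be recorded and aggregated. The data model and names (BlindedRating, composite_score) are illustrative assumptions, not the study's actual instrument.

```python
from dataclasses import dataclass
from statistics import mean

# The four quality dimensions, each rated on a 7-point Likert scale
# (1 = strongly disagree, 7 = strongly agree).
DIMENSIONS = ("clarity", "organization", "readability", "comprehensiveness")

@dataclass
class BlindedRating:
    """One student's rating of one narrative, with its source hidden."""
    narrative_id: str   # de-identified label, e.g. "N-07" (hypothetical)
    source: str         # "human" or "ai"; revealed only after rating
    scores: dict        # dimension name -> Likert score (1-7)

def composite_score(rating: BlindedRating) -> float:
    """Average the four dimension scores into one quality score."""
    return mean(rating.scores[d] for d in DIMENSIONS)

def group_means(ratings: list) -> dict:
    """Mean composite quality per source, computed after unblinding."""
    by_source = {}
    for r in ratings:
        by_source.setdefault(r.source, []).append(composite_score(r))
    return {src: mean(vals) for src, vals in by_source.items()}

# Example with invented scores: two ratings, one per source.
ratings = [
    BlindedRating("N-01", "ai",
                  {"clarity": 6, "organization": 6, "readability": 5,
                   "comprehensiveness": 6}),
    BlindedRating("N-02", "human",
                  {"clarity": 5, "organization": 5, "readability": 5,
                   "comprehensiveness": 5}),
]
print(group_means(ratings))  # e.g. {'ai': 5.75, 'human': 5.0}
```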

3. Key Findings

- Higher Perceived Quality: AI-generated feedback scored higher on the 7-point scale (mean 5.56) than human feedback (mean 5.06).
- Strong Preference: Students preferred AI-generated narratives for their official MSPE submissions (mean votes 13.75 vs. 6.5).
- Actionable Insights: Over 85% of participants agreed that AI-generated ILPs provided superior "actionable goals" for their future clinical practice.

4. Technical Details & Specifications

- Model Architecture: GPT-4o
- Prompt Engineering Strategy: Few-shot examples (a prompt sketch follows below)
- Evaluation Metrics: p-values from non-inferiority testing (a statistical sketch follows below)
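
As an illustration of the few-shot strategy, the sketch below shows one way to structure a GPT-4o call where a worked example (raw comments paired with a finished narrative) precedes the new case. The prompt text, the helper name generate_mspe_narrative, and the example content are invented for demonstration; the study's actual prompts are not reproduced here.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative few-shot pair: de-identified clerkship comments alongside a
# faculty-approved narrative, so the model can imitate the target tone and
# structure. Both strings are invented for demonstration.
EXAMPLE_COMMENTS = "Strong differential diagnoses; presentations ran long."
EXAMPLE_NARRATIVE = (
    "The student consistently constructed strong differential diagnoses. "
    "A growth area is delivering more concise oral presentations."
)

def generate_mspe_narrative(raw_comments: str) -> str:
    """Generate a composite MSPE-style narrative from raw clerkship feedback."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You write balanced, professional MSPE narratives "
                        "from raw end-of-clerkship feedback."},
            # Few-shot example: input comments -> desired narrative.
            {"role": "user", "content": EXAMPLE_COMMENTS},
            {"role": "assistant", "content": EXAMPLE_NARRATIVE},
            # The new case to summarize.
            {"role": "user", "content": raw_comments},
        ],
    )
    return response.choices[0].message.content
```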
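
The write-up reports p-values for non-inferiority without specifying the exact procedure. One standard approach is a one-sided test against a pre-specified margin; the sketch below applies it to invented Likert composites using SciPy. The 0.5-point margin and the data are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Invented composite Likert scores (1-7), for illustration only.
human_scores = np.array([5.0, 5.5, 4.5, 5.0, 5.5, 5.0, 4.5, 5.5])
ai_scores    = np.array([5.5, 6.0, 5.0, 5.5, 6.0, 5.5, 5.0, 6.0])

# Non-inferiority margin: the largest deficit in mean quality we would
# still accept as "as good as human" (an assumed value).
MARGIN = 0.5

# H0: mean(ai) <= mean(human) - MARGIN  (AI is inferior beyond the margin)
# H1: mean(ai) >  mean(human) - MARGIN  (AI is non-inferior)
# Shifting the AI scores up by the margin turns this into a standard
# one-sided two-sample t-test.
t_stat, p_value = stats.ttest_ind(ai_scores + MARGIN,
                                  human_scores,
                                  alternative="greater")
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.4f}")
if p_value < 0.05:
    print("Non-inferiority supported at the 0.05 level.")
```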

Experience Highlights

- Operationalized complex human interactions into measurable metrics by designing a custom AI evaluation framework to assess non-deterministic summative feedback quality.
- Led rigorous empirical research comparing AI-generated and human-generated narratives, using 7-point Likert scales and blinded comparative analysis to validate AI non-inferiority.
- Developed "Golden Data Sets" and evaluation rubrics for LLM performance, achieving 85%+ alignment between automated learning plans and human expert standards (an alignment-check sketch follows below).
- Synthesized qualitative patterns and quantitative data (p-value analysis) to translate research findings into actionable product recommendations for educational AI tools.
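
As a sketch of how golden-data-set alignment could be computed: each AI-generated learning-plan goal is compared against the expert-written standard for the same case, and the share of matches gives the alignment rate. The labels and helper below are hypothetical, not the study's actual rubric.

```python
# Expert judgments: for each case in the golden data set, did the
# AI-generated ILP goal satisfy the expert-written standard?
# (Hypothetical labels for illustration.)
expert_judgments = {
    "case-01": True,
    "case-02": True,
    "case-03": False,  # AI goal diverged from the rubric standard
    "case-04": True,
    "case-05": True,
    "case-06": True,
    "case-07": True,
}

def alignment_rate(judgments: dict) -> float:
    """Fraction of AI-generated goals judged aligned with expert standards."""
    return sum(judgments.values()) / len(judgments)

print(f"Alignment: {alignment_rate(expert_judgments):.0%}")  # 86%
```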