Study Finds AI Can Evaluate Empathy Like an Expert

As artificial intelligence systems become increasingly integrated into the most sensitive areas of human life, from mental health support to professional coaching, a critical question has emerged about the true nature of their capabilities. While modern large language models (LLMs) have demonstrated an uncanny ability to generate responses that feel deeply understanding and affirming, it has remained unclear whether this is merely sophisticated mimicry or a genuine capacity to recognize the complex dynamics of empathic communication. New research from a team led by Matthew Groh, an assistant professor at the Kellogg School, has provided a groundbreaking answer, revealing that AI can indeed evaluate empathy with a level of consistency that rivals trained human experts. This discovery moves beyond the debate over AI sentience and instead points toward a powerful new reality: AI is poised to become an invaluable tool for teaching and scaling one of the most essential human skills. The findings suggest a future where empathy is no longer just an abstract “soft skill” but a measurable, refinable competency that can be systematically developed across professions.

A Rigorous Test of Digital Intuition

To move beyond anecdotal evidence of AI’s conversational skills, the researchers designed a comprehensive study to rigorously compare the evaluative abilities of AI with those of humans. The experiment pitted three of the industry’s leading large language models—Gemini 2.5 Pro, GPT-4o, and Claude 3.7 Sonnet—against two distinct human cohorts: a panel of experts with specialized training in communication and a large group of non-expert crowd workers. The objective was not to see whether the AI could generate an empathic response itself, but to determine whether it could accurately recognize and rate the empathy demonstrated in a conversation between two people. This distinction is crucial, as the ability to evaluate is a prerequisite for providing effective, scalable feedback and training, transforming the technology from a conversational partner into an expert judge. The core of the study involved the annotation of 200 real-world, text-based dialogues in which one person shared a personal challenge and another offered support. All three groups were tasked with systematically rating these interactions, providing a robust dataset for direct comparison between machine and human judgment.

The evaluation process was meticulously structured to ensure objectivity and depth, avoiding a vague, singular notion of empathy. Instead, all annotators—whether human or AI—were guided by four distinct frameworks from the fields of psychology and natural-language processing: three established ones (Empathic Dialogues, Perceived Empathy, and EPITOME) and a fourth developed by the researchers themselves, called Lend-an-Ear Pilot. Each framework broke empathy down into specific, actionable components. Annotators were required to judge conversations based on characteristics such as whether a response “encouraged elaboration,” “demonstrated understanding,” or made an “attempt to explore the seeker’s experiences and feelings.” This structured approach forced a granular analysis far more nuanced than a simple gut reaction. By the end of the experiment, the team had amassed 3,150 annotations from the LLMs, 3,150 from the experts, and 2,844 from the crowd workers, setting the stage for a definitive analysis of the AI’s evaluative prowess. This methodical approach ensured that the AI was being tested not on its ability to charm, but on its capacity to apply complex, scientific criteria to human interaction.
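The study’s exact annotation prompts are not reproduced in this article, but the mechanics of rubric-driven rating are straightforward to sketch. Below is a minimal, hypothetical example of how one dialogue might be scored against individual framework criteria; the criteria strings paraphrase items quoted above, while the 0–2 scale, the prompt wording, and the `call_llm` placeholder are illustrative assumptions rather than the study’s actual protocol.

```python
# Hypothetical sketch of rubric-driven empathy annotation with an LLM judge.
# The criteria paraphrase items quoted in the article; the 0-2 scale, prompt
# wording, and call_llm placeholder are assumptions, not the study's protocol.

CRITERIA = [
    "encouraged elaboration",
    "demonstrated understanding of the seeker's situation",
    "attempted to explore the seeker's experiences and feelings",
]

PROMPT_TEMPLATE = """You are rating empathic communication.

Conversation:
{dialogue}

Criterion: the supporter's responses {criterion}.
Reply with a single integer: 0 (absent), 1 (partial), or 2 (clearly present)."""


def annotate_dialogue(dialogue: str, call_llm) -> dict[str, int]:
    """Rate one dialogue against every rubric criterion.

    call_llm stands in for whatever client function sends a prompt to a
    model and returns its text completion.
    """
    ratings = {}
    for criterion in CRITERIA:
        prompt = PROMPT_TEMPLATE.format(dialogue=dialogue, criterion=criterion)
        ratings[criterion] = int(call_llm(prompt).strip())
    return ratings
```

Scoring each criterion separately, rather than asking for one holistic empathy score, mirrors the study’s emphasis on breaking empathy into specific, checkable components.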

Expert-Level Reliability and the Power of Frameworks

Because there is no single, objective truth for the amount of empathy present in a given statement, the study’s analysis centered on a critical metric: inter-rater reliability. This measures the degree of consistency and agreement among judges within a group. The underlying expectation was that trained experts, sharing a common professional understanding, would exhibit high agreement in their scores, while non-experts would be inconsistent. The results powerfully confirmed this hypothesis, but with a startling addition. The experts’ annotations were highly consistent, as predicted, while the crowd workers’ judgments were, in the researchers’ words, “all over the map.” The most significant finding, however, was that the LLMs demonstrated a level of inter-rater reliability that was remarkably similar to that of the human experts and vastly superior to the non-expert group. This indicated that the AI models were not guessing; they were able to apply the complex criteria from the evaluation frameworks with a high degree of consistency, reliably recognizing the subtle markers of empathic communication in a manner that approached a professional human standard. Aakriti Kumar, the study’s first author, noted the profound implications, stating, “The fact that LLMs can evaluate empathic communication at a level approaching experts suggests promising opportunities to scale training for applications like therapy or customer service, where empathic skills are essential.”
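The article does not specify which reliability statistic the team used, but mean pairwise Cohen’s kappa is one standard way to quantify agreement within a group of raters. The sketch below, using fabricated toy scores, shows how consistent raters earn a high value while scattered ones fall toward zero.

```python
# A minimal sketch of one common inter-rater reliability measure: mean
# pairwise Cohen's kappa across a group of raters. The study's exact
# statistic is unspecified; the scores below are fabricated toy data.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score


def mean_pairwise_kappa(ratings: np.ndarray) -> float:
    """Average Cohen's kappa over all rater pairs.

    ratings has shape (n_raters, n_items); values near 1 mean strong
    agreement, values near 0 mean agreement no better than chance.
    """
    kappas = [
        cohen_kappa_score(ratings[i], ratings[j], weights="quadratic")
        for i, j in combinations(range(ratings.shape[0]), 2)
    ]
    return float(np.mean(kappas))


# Toy data: three consistent raters versus three scattered ones.
consistent = np.array([[2, 1, 0, 2, 1], [2, 1, 0, 2, 2], [2, 1, 1, 2, 1]])
scattered = np.array([[0, 2, 1, 0, 2], [2, 0, 0, 1, 1], [1, 1, 2, 2, 0]])
print(mean_pairwise_kappa(consistent))  # high agreement
print(mean_pairwise_kappa(scattered))   # near zero, possibly negative
```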

The study also unearthed a second, equally important insight: the quality and clarity of the evaluation framework itself are paramount to achieving reliable results. Researchers observed that the level of agreement among all judges—both human and AI—was heavily dependent on the specific framework being used. When a framework was comprehensive, clearly defined, and robust, both the experts and the LLMs produced more consistent and reliable annotations. Conversely, when a framework was ambiguous or less structured, the judgments of both groups became more inconsistent and varied. This highlights a symbiotic relationship: not only can well-defined frameworks help AI evaluate empathy effectively, but AI can also be used as a tool to test, refine, and strengthen these frameworks for human use. By identifying areas where an LLM struggles to find consistency, researchers can pinpoint weaknesses in a framework’s definitions or criteria. This iterative process has the potential to transform empathy from an intangible concept into a measurable “hard skill” with clearly defined components, making it far easier to teach and cultivate. As Groh explained, “LLMs as a judge are only as reliable as the framework is,” underscoring that the technology’s success is intertwined with our own ability to codify human connection.
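That refinement loop can itself be made concrete. As a hypothetical illustration (the exact-agreement metric, the 0.5 threshold, and the data layout are assumptions, not the study’s method), one could compute per-criterion agreement and flag the criteria where judges scatter most, since those are the likeliest candidates for ambiguous wording.

```python
# Illustrative sketch of using judge agreement to debug a framework:
# criteria where raters scatter most are candidates for clearer definitions.
# The exact-agreement metric and the 0.5 threshold are assumptions.
import numpy as np


def exact_agreement(ratings: np.ndarray) -> float:
    """Fraction of items on which every rater gave the identical score.

    ratings has shape (n_raters, n_items).
    """
    return float(np.mean(np.all(ratings == ratings[0], axis=0)))


def flag_ambiguous_criteria(
    scores: dict[str, np.ndarray], threshold: float = 0.5
) -> list[str]:
    """Return criterion names whose agreement falls below the threshold."""
    return [
        name for name, ratings in scores.items()
        if exact_agreement(ratings) < threshold
    ]


# Toy usage: the vaguely worded criterion gets flagged for rewording.
scores = {
    "encouraged elaboration": np.array([[2, 1, 2], [2, 1, 2]]),
    "was appropriately warm": np.array([[0, 2, 1], [2, 0, 2]]),
}
print(flag_ambiguous_criteria(scores))  # ['was appropriately warm']
```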

Beyond Simulation to Scalable Skill Development

The research findings herald a new era for professional development, suggesting that the rigorous evaluation of soft skills at massive scale is now within reach. The practical applications stemming from this work are extensive, promising to reshape training programs across multiple industries. For example, therapists-in-training could use LLMs as a sophisticated coaching tool to receive immediate, nuanced feedback on their practice sessions, helping them hone their empathic responses before engaging with real clients. Similarly, customer service teams could participate in advanced role-playing exercises with AI, using robust empathic frameworks to evaluate and improve their communication strategies, leading to higher customer satisfaction and loyalty. The potential for leadership development is also significant, as empathy is widely recognized as a cornerstone of effective decision-making and team management. An AI-driven tool could give leaders a private, objective means to practice and refine this crucial capability, empowering them to build consensus and maintain morale even when delivering unpopular news. This technology stands to democratize access to high-quality communication training, making it more affordable and accessible than ever before.

Despite the optimistic outlook on AI’s capabilities, the study makes a critical distinction that underscores the enduring value of human connection: the ability to recognize empathy is not synonymous with the ability to feel it. The researchers are clear that these technological advances do not signal the obsolescence of human professionals like therapists or counselors. An AI can articulate a perfect empathic response and provide technically flawless advice, but it lacks genuine consciousness, subjective emotion, and the rich context of lived experience. The “human touch,” with its inherent capacity for shared vulnerability and authentic connection, remains unique and irreplaceable. The technology is therefore positioned not as a replacement for human interaction, but as a powerful assistant designed to help humans become better at connecting with one another. In the end, the study suggests that AI’s greatest contribution to the field of empathy may not be its own simulated feelings, but its remarkable ability to hold up a mirror and teach us more about the intricate art of our own humanity.
