Companies have already capitalized on this progress to deploy AI chatbots in sensitive lines of work—such as medical advice, therapy, and life or career coaching—that are traditionally performed by trained professionals. But does AI merely generate statistically plausible responses, or can it actually recognize when a response expresses empathy?
“There’s a lot of evidence that computers can say or write a response in a way that makes someone feel validated and heard,” says Matthew Groh, assistant professor of management and organizations at the Kellogg School. “What’s less clear is whether they can recognize empathic communication when they see it.”
In new research, Groh and a team of researchers evaluated how artificial intelligence compares with humans at recognizing the kind of empathic communication that is critical for this type of high-stakes work. Specifically, they compared three large language models (LLMs)—Gemini 2.5 Pro, GPT-4o, and Claude 3.7 Sonnet—with experienced and inexperienced people on their ability to judge nuances of empathy in text-based conversations.
Using various frameworks to measure empathic communication, the researchers found that LLMs were nearly as good at identifying empathy as experts—and far more reliable than nonexperts.
The team, which includes first author Aakriti Kumar, Nalin Poungpeth, and Bruce Lambert of Northwestern, Diyi Yang of Stanford, and Erina Farrell of Penn State, also found that evaluating AI models in this way could teach people something new about empathy—both how we measure it and how we apply it.
“Studying how experts and artificial intelligence evaluate empathy forces us to be precise about what effective empathic responses look like in practice,” says Kumar, a postdoctoral researcher at Kellogg and the Northwestern Institute on Complex Systems (NICO). “If we can break down empathy into reliable components, we can give humans and AI clearer feedback about how to make others feel heard and understood.”
Do you know empathy when you see it?
To assess empathic communication, the researchers collected 200 text conversations between a speaker sharing a personal problem and a second person providing support. They then asked three LLMs, three experts, and hundreds of crowd workers to annotate these conversations against four different frameworks used in psychology and natural language processing research: Empathic Dialogues, Perceived Empathy, EPITOME, and a new framework they developed, the Lend-an-Ear Pilot.
Each framework asks observers to judge a conversation based on characteristics such as “encouragement of elaboration” and “evidence of understanding,” or questions such as “Does the response make an effort to explore the seeker’s experiences and feelings?”
In total, the researchers collected 3,150 LLM annotations, 3,150 expert annotations, and 2,844 crowdworker annotations.
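As a rough illustration of what collecting such annotations could look like in practice, here is a minimal sketch in Python using the OpenAI client. The model name, prompt wording, and 1-to-5 scale are illustrative assumptions, not the study’s actual protocol.

```python
# Minimal sketch: eliciting one LLM annotation for one conversation against
# one framework question. The prompt, scale, and model name are assumptions
# for illustration, not the study's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def annotate(conversation: str, question: str) -> int:
    """Ask the model to rate one framework question on a 1-5 scale."""
    prompt = (
        "Read this support conversation, then answer the question below.\n\n"
        f"Conversation:\n{conversation}\n\n"
        f"Question: {question}\n"
        "Reply with a single integer from 1 (not at all) to 5 (very much)."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep ratings as stable as possible across runs
    )
    return int(response.choices[0].message.content.strip())

rating = annotate(
    "Seeker: I completely froze during my presentation today.\n"
    "Supporter: That sounds really stressful. What happened next?",
    "Does the response make an effort to explore the seeker's experiences and feelings?",
)
print(rating)
```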
“We looked at four different [frameworks], or how four independent groups chose to evaluate empathic communication in a variety of ways,” says Groh.
Because there was no objectively “right” answer for how much empathy a communication contained, the researchers were interested in interrater reliability—how much different observers’ ratings varied. For highly trained communication experts, one would expect the variance to be low, which is what the team observed. The amateur judges’ ratings, on the other hand, should be all over the map, another prediction the team confirmed.
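Interrater reliability can be quantified in several ways; Krippendorff’s alpha is one common choice (an assumption here, as the paper may use a different statistic). Below is a minimal sketch with hypothetical ratings, using the krippendorff Python package:

```python
# Minimal sketch: comparing agreement among raters with Krippendorff's alpha.
# The ratings below are hypothetical; alpha near 1 means near-perfect
# agreement, while values near 0 indicate agreement no better than chance.
# Requires: pip install numpy krippendorff
import numpy as np
import krippendorff

# Rows are raters, columns are conversations; values are 1-5 ratings on one
# framework item (use np.nan for conversations a rater did not annotate).
expert_ratings = np.array([
    [4, 5, 2, 3, 4],
    [4, 4, 2, 3, 5],
    [5, 4, 1, 3, 4],
], dtype=float)

crowd_ratings = np.array([
    [5, 2, 4, 1, 3],
    [2, 5, 1, 4, 4],
    [3, 3, 5, 2, 1],
], dtype=float)

for name, data in [("experts", expert_ratings), ("crowd", crowd_ratings)]:
    alpha = krippendorff.alpha(reliability_data=data,
                               level_of_measurement="ordinal")
    print(f"{name}: alpha = {alpha:.2f}")
```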
When they compared the three AI models’ judgments with those of both groups, the models’ ratings were much more similar to the experts’ than to the crowd workers’. In other words, LLMs were able to reliably identify the nuances of empathic communication almost as well as experts—and much more consistently than nonexperts.
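One simple way to make that comparison concrete is to correlate an LLM’s ratings with the average expert and crowd ratings per conversation. The numbers below are hypothetical, and the paper’s actual similarity measure may differ; this is just a sketch of the idea:

```python
# Minimal sketch: does an LLM's rating pattern track experts or the crowd
# more closely? Pearson correlation on hypothetical per-conversation ratings.
import numpy as np

llm_ratings = np.array([4.0, 5.0, 2.0, 3.0, 4.0])
expert_means = np.array([4.3, 4.3, 1.7, 3.0, 4.3])  # averaged over experts
crowd_means = np.array([3.3, 3.3, 3.3, 2.3, 2.7])   # averaged over crowd

r_expert = np.corrcoef(llm_ratings, expert_means)[0, 1]
r_crowd = np.corrcoef(llm_ratings, crowd_means)[0, 1]
print(f"LLM vs. experts: r = {r_expert:.2f}")  # closer to 1 = more similar
print(f"LLM vs. crowd:   r = {r_crowd:.2f}")
```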
“The fact that LLMs can assess empathic communication at a level approaching experts suggests promising opportunities to scale the training for applications such as therapy or customer service, where empathic skills are essential,” says Kumar.
Quality in, quality out
But the study also found that the frameworks themselves mattered. Interrater reliability, even among experts, varied widely across the four frameworks and across the questions or measures within them.
The more complete and precisely defined the framework, the more reliable the annotations were, for both LLMs and experts, according to Groh.
“The quality of the framework really matters,” says Groh. “When experts agree on what empathic communication looks like, LLMs can agree, too. But when experts are inconsistent, the models also struggle. LLMs as judges are only as reliable as the framework.”
The findings suggest that what empathic communication entails is still not a fully settled issue. Through rigorous evaluation and optimization using both human and AI judges, scientists can create stronger frameworks for identifying empathy in conversations and helping people express it better.
“By more accurately characterizing empathic communication, we can turn what was a ‘soft skill’ into a hard skill,” says Groh.
Real-world applications
Researchers and practitioners haven’t paid enough attention to building good frameworks for soft skills like empathy, according to Groh, partly because people didn’t realize those skills could be measured rigorously and at scale. Advances in AI technology may help shift that thinking.
“LLMs have the potential to teach us about the nuances of empathic communication and help us, as humans, communicate to make others feel heard and validated,” says Groh.
For example, therapists in training could rely on an LLM to improve their ability to communicate empathically and, ultimately, better support their clients. Or customer service teams could role-play with LLMs as part of their training, using improved empathic communication frameworks to evaluate their responses.
Improving these skills will be as critical for leaders as for any other group, if not more so, because “leaders are in the business of decision-making, and empathy is at the core of decision-making,” says Groh.
“As any leader knows, there are often times when you have to make a decision where not everyone agrees with you,” says Groh. “If you can show people that you’re listening—if you respond with empathic communication—you’re more likely to bring others along, even if they disagree with your decision.”
However, while the research shows that LLMs are already near expert level at judging empathy, that doesn’t mean they feel it. And that means your therapist doesn’t have to worry about being replaced by AI—at least not yet.
“Just because AI can give you advice—and sometimes give it better than some humans—doesn’t mean the human role goes away,” says Groh. “The human touch is still special.”



