(ELLIS PhD Symposium Selected Presentation) Modeling Human Label Variation in Natural Language Inference: From Human to LLM and Chain-of-Thought Explanations
Understanding human label variation (HLV) is critical in Natural Language Processing (NLP), where multiple plausible annotations reflect the nuanced nature of language interpretation. Traditional approaches to capturing human judgment distributions (HJDs) either aggregate large numbers of crowd-sourced labels or collect detailed expert explanations, both of which are resource-intensive and difficult to scale. This poster explores how large language models (LLMs) can offer a more scalable and efficient way to approximate HJDs. First, we show that a small number of expert-provided explanations significantly enhances LLMs’ ability to estimate HJDs, even in the absence of explicit labels (Chen et al., EMNLP 2024). Next, we demonstrate that LLM-generated explanations, when conditioned on human labels, serve as effective proxies for human rationales, enabling accurate HJD approximation on both in-distribution and out-of-distribution datasets (Chen et al., ACL 2025). Finally, we introduce a novel pipeline that leverages chain-of-thought (CoT) reasoning, augmented with discourse-aware extraction techniques, to recover the implicit rationales embedded in LLM-generated reasoning paths. Paired with a rank-based evaluation framework, this method yields stronger alignment between model outputs and human answer plausibility rankings (Chen et al., EMNLP 2025). Collectively, our findings advance the methodological rigor and practical viability of using LLMs to scale the modeling of human-like label distributions, offering new insights for both AI evaluation and the broader understanding of human reasoning diversity.
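To make the evaluation idea concrete, the sketch below compares a hypothetical LLM-derived label distribution against a human judgment distribution for a single NLI item, using KL divergence for distributional fit and Spearman correlation for ranking alignment. The choice of metrics and the example numbers are illustrative assumptions, not the exact setup of the cited papers.

```python
# Minimal sketch: comparing an LLM-estimated label distribution to a human
# judgment distribution (HJD) for one NLI item. Metrics and numbers are
# illustrative assumptions, not the evaluation protocol of the cited papers.
import numpy as np
from scipy.stats import spearmanr

LABELS = ["entailment", "neutral", "contradiction"]

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> float:
    """KL(p || q) with smoothing to avoid log(0)."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical HJD (e.g., aggregated from many crowd annotations) and a
# model-estimated distribution for the same premise-hypothesis pair.
human_hjd = np.array([0.55, 0.35, 0.10])
model_dist = np.array([0.60, 0.30, 0.10])

# Distributional fit: how far the model's soft labels are from the HJD.
print("KL(human || model):", round(kl_divergence(human_hjd, model_dist), 4))

# Rank-based view: do model and humans order the labels by plausibility alike?
rho, _ = spearmanr(human_hjd, model_dist)
print("Spearman correlation of label rankings:", round(rho, 4))
```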