(ELLIS PhD Symposium Selected Presentation) Modeling Human Label Variation in Natural Language Inference: From Human to LLM and Chain-of-Thought Explanations
Understanding human label variation (HLV) is critical in Natural Language Processing (NLP), where multiple plausible annotations reflect the nuanced nature of language interpretation. Traditional approaches to capturing human judgment distributions (HJDs) either aggregate large numbers of crowd-sourced labels or collect detailed expert explanations, both of which are resource-intensive and difficult to scale. This poster explores how large language models (LLMs) can offer a more scalable and efficient way to approximate HJDs. First, we show that a small number of expert-provided explanations significantly enhances LLMs’ ability to estimate HJDs, even in the absence of explicit labels (Chen et al., EMNLP 2024). Next, we demonstrate that LLM-generated explanations, when conditioned on human labels, serve as effective proxies for human rationales, enabling accurate HJD approximation on both in-distribution and out-of-distribution datasets (Chen et al., ACL 2025). Finally, we introduce a novel pipeline that leverages chain-of-thought (CoT) reasoning, augmented with discourse-aware extraction techniques, to recover the implicit rationales embedded in LLM-generated reasoning paths. Paired with a rank-based evaluation framework, this method yields stronger alignment between model outputs and human answer plausibility rankings (Chen et al., EMNLP 2025). Collectively, our findings advance the methodological rigor and practical viability of using LLMs to scale the modeling of human-like label distributions, offering new insights for both AI evaluation and the broader understanding of human reasoning diversity.
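To make the evaluation idea concrete, the sketch below compares a hypothetical LLM-derived label distribution against a human judgment distribution for a single NLI item, using KL divergence for distributional fit and Spearman correlation for ranking alignment. The choice of metrics and the example numbers are illustrative assumptions, not the exact setup of the cited papers.

```python
# Minimal sketch: comparing an LLM-estimated label distribution to a human
# judgment distribution (HJD) for one NLI item. Metrics and numbers are
# illustrative assumptions, not the evaluation protocol of the cited papers.
import numpy as np
from scipy.stats import spearmanr

LABELS = ["entailment", "neutral", "contradiction"]

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> float:
    """KL(p || q) with smoothing to avoid log(0)."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical HJD (e.g., aggregated from many crowd annotations) and a
# model-estimated distribution for the same premise-hypothesis pair.
human_hjd = np.array([0.55, 0.35, 0.10])
model_dist = np.array([0.60, 0.30, 0.10])

# Distributional fit: how far the model's soft labels are from the HJD.
print("KL(human || model):", round(kl_divergence(human_hjd, model_dist), 4))

# Rank-based view: do model and humans order the labels by plausibility alike?
rho, _ = spearmanr(human_hjd, model_dist)
print("Spearman correlation of label rankings:", round(rho, 4))
```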