Summarization Metrics for Spanish and Basque: Do Automatic Scores and LLM-Judges Correlate with Humans?

Abstract

Studies on evaluation metrics and LLM-as-a-Judge models for automatic text summarization have largely focused on English, limiting our understanding of their effectiveness in other languages. Through our new dataset BASSE (BAsque and Spanish Summarization Evaluation), we address this situation by collecting human judgments on 2,040 abstractive summaries in Basque and Spanish, generated either manually or by five LLMs with four different prompts. For each summary, annotators evaluated five criteria on a 5-point Likert scale: coherence, consistency, fluency, relevance, and 5W1H. We use these data to reevaluate traditional automatic metrics used for evaluating summaries, as well as several LLM-as-a-Judge models that show strong performance on this task in English. Our results show that proprietary judge LLMs currently have the highest correlation with human judgments, followed by criteria-specific automatic metrics, while open-source judge LLMs perform poorly. We release BASSE and our code publicly, along with the first large-scale Basque summarization dataset containing 22,525 news articles with their subheads.
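As an illustration of what correlating a metric with human judgments involves, the following minimal sketch computes a rank correlation between human Likert ratings and automatic scores for a handful of summaries. The abstract does not state which correlation coefficient is used; Spearman's rank correlation is assumed here because it is a common choice for ordinal ratings. The function names and the toy data are hypothetical and are not taken from BASSE.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on tie-aware ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data (NOT from BASSE): one human rating on a 5-point
# Likert scale and one automatic score per summary.
human_coherence = [5, 4, 2, 3, 1, 4]
metric_scores = [0.91, 0.80, 0.35, 0.60, 0.20, 0.75]

rho = spearman(human_coherence, metric_scores)
print(f"Spearman correlation: {rho:.3f}")
```

Under this reading, a metric or judge model that ranks summaries in nearly the same order as the annotators obtains a coefficient close to 1, and the comparison can be repeated separately for each of the five criteria and for each language.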