JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System

Abstract

This paper introduces JuDGE (Judgment Document Generation Evaluation), a novel benchmark for evaluating the performance of judgment document generation in the Chinese legal system. We define the task as generating a complete legal judgment document from the factual description of a case. To facilitate this benchmark, we construct a comprehensive dataset consisting of factual descriptions from real legal cases, paired with their corresponding full judgment documents, which serve as the ground truth for evaluating the quality of generated documents. This dataset is further augmented by two external legal corpora that provide additional legal knowledge for the task: one comprising statutes and regulations, and the other consisting of a large collection of past judgment documents. In collaboration with legal professionals, we establish a comprehensive automated evaluation framework to assess the quality of generated judgment documents across multiple dimensions. We evaluate several baseline approaches, including few-shot in-context learning, fine-tuning, and a multi-source retrieval-augmented generation (RAG) approach, using both general and legal-domain LLMs. The experimental results demonstrate that, while RAG approaches can effectively improve performance on this task, there is still substantial room for further improvement. All code and datasets are available at: https://github.com/oneal2000/JuDGE.