Abstract
Brief hospital course (BHC) summaries are clinical documents that summarize a patient's hospital stay. While large language models (LLMs) exhibit remarkable capabilities in automating real-world tasks, their capabilities for healthcare applications such as synthesizing BHCs from clinical notes have not been shown. We introduce a novel pre-processed dataset, MIMIC-IV-BHC, comprising clinical note and BHC pairs to adapt LLMs for BHC synthesis. Furthermore, we introduce a benchmark of the summarization performance of two general-purpose LLMs and three healthcare-adapted LLMs. Using clinical notes as input, we apply prompting-based (using in-context learning) and fine-tuning-based adaptation strategies to three open-source LLMs (Clinical-T5-Large, Llama2-13B, FLAN-UL2) and two proprietary LLMs (GPT-3.5, GPT-4). We evaluate these LLMs across multiple context-length inputs using natural language similarity metrics. We further conduct a clinical study with five clinicians, comparing clinician-written and LLM-generated BHCs across 30 samples, focusing on their potential to enhance clinical decision-making through improved summary quality. We observe that the fine-tuned Llama2-13B outperforms other domain-adapted models on the quantitative evaluation metrics BLEU and BERT-Score. GPT-4 with in-context learning is more robust to increasing context lengths of clinical note inputs than fine-tuned Llama2-13B. Despite comparable quantitative metrics, the reader study shows a significant preference for summaries generated by GPT-4 with in-context learning over both fine-tuned Llama2-13B summaries and the original summaries, highlighting the need for qualitative clinical evaluation.