Abstract
In this paper, we investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift, i.e., can LLMs generalize to domains that require specific knowledge, such as medicine and law, in a zero-shot fashion without additional in-domain training? To this end, we devise a series of experiments to explain the performance gap empirically. Our findings suggest that: (a) LLMs struggle with the dataset demands of closed domains, such as retrieving long answer spans; (b) certain LLMs, despite showing strong overall performance, display weaknesses in meeting basic requirements such as discriminating between domain-specific senses of words, which we link to pre-processing decisions; (c) scaling model parameters is not always effective for cross-domain generalization; and (d) closed-domain datasets are quantitatively much different from open-domain EQA datasets, and current LLMs struggle to deal with them. Our findings point out important directions for improving existing LLMs.