Abstract
Objectives: Compare qualitative coding by instruction-tuned large language models (IT-LLMs) against human coders in classifying the presence or absence of vulnerability in routinely collected unstructured text that describes police-public interactions. Evaluate potential bias in IT-LLM codings. Methods: Analyzing publicly available text narratives of police-public interactions recorded by the Boston Police Department, we provide humans and IT-LLMs with qualitative labelling codebooks and compare the labels generated by both, seeking to identify situations associated with (i) mental ill health; (ii) substance misuse; (iii) alcohol dependence; and (iv) homelessness. We explore multiple prompting strategies and model sizes, and the variability of labels generated by repeated prompts. Additionally, to explore model bias, we utilize counterfactual methods to assess the impact of two protected characteristics (race and gender) on IT-LLM classification. Results: IT-LLMs can effectively support human qualitative coding of police incident narratives. While there is some disagreement between LLM- and human-generated labels, IT-LLMs are highly effective at screening narratives where no vulnerabilities are present, potentially vastly reducing the requirement for human coding. Counterfactual analyses demonstrate that manipulations to both the gender and race of individuals described in narratives have very limited effects on IT-LLM classifications beyond those expected by chance. Conclusions: IT-LLMs offer an effective means to augment human qualitative coding in a way that requires far fewer resources to analyze large unstructured datasets. Moreover, they encourage specificity in qualitative coding, promote transparency, and provide the opportunity for more standardized, replicable approaches to analyzing large free-text police data sources.