Enhancing Domain-Specific Encoder Models with LLM-Generated Data: How to Leverage Ontologies, and How to Do Without Them

Abstract

We investigate the use of LLM-generated data for continual pretraining of encoder models in specialized domains with limited training data, using the scientific domain of invasion biology as a case study. To this end, we leverage domain-specific ontologies by enriching them with LLM-generated data and pretraining the encoder model as an ontology-informed embedding model for concept definitions. To evaluate the effectiveness of this method, we compile a benchmark specifically designed for assessing model performance in invasion biology. After demonstrating substantial improvements over standard LLM pretraining, we investigate the feasibility of applying the proposed approach to domains without comprehensive ontologies by substituting ontological concepts with concepts automatically extracted from a small corpus of scientific abstracts and establishing relationships between concepts through distributional statistics. Our results demonstrate that this automated approach achieves comparable performance using only a small set of scientific abstracts, resulting in a fully automated pipeline for enhancing domain-specific understanding of small encoder models that is especially suited for application in low-resource settings and achieves performance comparable to masked language modeling pretraining on much larger datasets.