When Text Embedding Meets Large Language Model: A Comprehensive Survey

Abstract

Text embedding has become a foundational technology in natural language processing (NLP) during the deep learning era, driving advancements across a wide array of downstream tasks. While many natural language understanding challenges can now be modeled using generative paradigms and leverage the robust generative and comprehension capabilities of large language models (LLMs), numerous practical applications, such as semantic matching, clustering, and information retrieval, continue to rely on text embeddings for their efficiency and effectiveness. Consequently, how to combine LLMs with text embeddings has become a major focus of academic attention in recent years. In this survey, we categorize the interplay between LLMs and text embeddings into three overarching themes: (1) LLM-augmented text embedding, enhancing traditional embedding methods with LLMs; (2) LLMs as text embedders, adapting their innate capabilities for high-quality embedding; and (3) Text embedding understanding with LLMs, leveraging LLMs to analyze and interpret embeddings. By organizing recent works based on interaction patterns rather than specific downstream applications, we offer a novel and systematic overview of contributions from various research and application domains in the era of LLMs. Furthermore, we highlight the unresolved challenges that persisted in the pre-LLM era with pre-trained language models (PLMs) and explore the emerging obstacles brought forth by LLMs. Building on this analysis, we outline prospective directions for the evolution of text embedding, addressing both theoretical and practical opportunities in the rapidly advancing landscape of NLP.