Clustering Algorithms and RAG Enhancing Semi-Supervised Text Classification with Large LLMs

Abstract

This paper introduces a novel semi-supervised learning framework specifically designed for text classification tasks, addressing the challenge of vast datasets with limited labeled examples. By integrating multi-level, similarity-based data augmentation techniques, ranging from Retrieval-Augmented Generation (RAG) to Large Language Model (LLM) rewriting and traditional word substitution, we constructed an intelligent augmentation pipeline. The framework selects representative landmarks through clustering; these landmarks serve as intermediaries in the retrieval and rewriting processes, ensuring that the augmented data maintains a distribution similar to that of the original dataset. Empirical results show that even in complex text document classification scenarios with over 100 categories, our method achieves state-of-the-art accuracies of 95.41% and 82.43% on the Reuters and Web of Science datasets, respectively. These findings highlight the effectiveness and broad applicability of our semi-supervised learning approach for text classification tasks.
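
To make the landmark idea concrete, the following is a minimal sketch of one way representative landmarks could be chosen through clustering. The abstract does not state which clustering algorithm or text representation is used, so everything here is an assumption for illustration: plain k-means (Lloyd's algorithm), documents already embedded as numeric vectors, and the hypothetical helper names `kmeans` and `select_landmarks`. The landmark for each cluster is taken to be the document closest to the cluster centroid.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def kmeans(vectors, k, iterations=50, seed=0):
    """Plain Lloyd's k-means (an assumed choice, not confirmed by the paper).

    Returns (centroids, assignments), where assignments[i] is the cluster
    index of vectors[i].
    """
    rng = random.Random(seed)
    # Initialise centroids from k distinct data points.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged: no vector changed cluster
        assignments = new_assignments
        # Update step: recompute centroids; keep the old one if a cluster is empty.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return centroids, assignments


def select_landmarks(vectors, k, seed=0):
    """Return, per non-empty cluster, the index of the vector nearest its centroid."""
    centroids, assignments = kmeans(vectors, k, seed=seed)
    landmarks = []
    for c in range(k):
        member_ids = [i for i, a in enumerate(assignments) if a == c]
        if not member_ids:
            continue
        nearest = min(
            member_ids,
            key=lambda i: squared_distance(vectors[i], centroids[c]),
        )
        landmarks.append(nearest)
    return landmarks


# Toy 2-D "embeddings" forming two well-separated groups (illustrative data only).
toy_vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
landmark_ids = select_landmarks(toy_vectors, k=2)
print(sorted(landmark_ids))
```

In a pipeline of the kind the abstract describes, the selected landmark documents would then act as the retrieval keys and rewriting references for augmentation; how that is done is not specified here and is left out of the sketch.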