Abstract
Distributed training methods are crucial for large language models (LLMs). However, existing distributed training methods often suffer from communication bottlenecks, stragglers, and limited elasticity. Local SGD methods have been proposed to address these issues, but their effectiveness remains limited to small-scale training due to additional memory overhead and insufficient attention to efficiency and stability. To tackle these issues, we propose EDiT, an innovative Efficient Distributed Training method that combines a tailored Local SGD approach with model sharding techniques to enhance large-scale training efficiency. EDiT performs layer-wise parameter synchronization during the forward pass, reducing communication and memory overhead and enabling the overlap of computation and communication. In addition, EDiT employs a pseudo gradient penalty strategy to suppress loss spikes, which ensures training stability and improves performance. We also introduce A-EDiT, a fully asynchronous variant of EDiT that accommodates heterogeneous clusters. Building on EDiT/A-EDiT, we conduct a series of experiments to validate large-scale asynchronous training for LLMs, accompanied by comprehensive analyses. Experimental results demonstrate the superior performance of EDiT/A-EDiT, establishing them as robust solutions for distributed LLM training in diverse computational ecosystems.