At a time when global competition in artificial intelligence is intensifying, achieving breakthroughs in training efficiency at the algorithm level has become a key focus of Sino-US technology competition, computing infrastructure construction, and domestic chip development. Recently, a hybrid parallel training algorithm called GroPipe, developed by a Chinese research team, has drawn attention from academia and industry at home and abroad for significantly improving training efficiency across multiple mainstream deep learning models and achieving superior performance on domestic chip platforms.
According to publicly released experimental data, GroPipe improves training performance by up to 79.2% on typical models such as ResNet, VGG, and BERT-base, with an average training speedup above 50%. More strikingly, on the domestic Cambricon MLU platform it delivers training performance 17% higher than an international flagship GPU (the NVIDIA A100). This not only marks major progress for China in AI algorithm optimization and algorithm-chip co-design, but also accelerates the restructuring of the global AI industry.
Ⅰ Breaking through the bottleneck: How does GroPipe break the "speed wall" of AI training?
Today, as the parameter scale of deep learning models grows exponentially, traditional training acceleration methods such as Data Parallelism (DP) and Pipeline Model Parallelism (PMP) have gradually exposed efficiency bottlenecks. DP typically incurs heavy communication latency during gradient synchronization, while PMP suffers from pipeline bubbles and load imbalance, wasting computing resources.
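To make the "pipeline bubble" problem concrete: in a GPipe-style pipeline schedule from the standard literature (not GroPipe's own formulation, which has not been published in this article) with p pipeline stages and m micro-batches, the idle fraction of the schedule is (p − 1)/(m + p − 1). A minimal sketch:

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle ("bubble") fraction of a GPipe-style pipeline schedule.

    With p stages and m micro-batches, each stage does useful work in m
    of the m + p - 1 schedule slots, so (p - 1) / (m + p - 1) of the
    schedule is spent waiting for the pipeline to fill and drain.
    """
    p, m = stages, micro_batches
    return (p - 1) / (m + p - 1)

# With 4 stages but only 4 micro-batches, 3/7 of the schedule is idle;
# raising the micro-batch count to 32 shrinks the bubble to under 9%.
print(bubble_fraction(4, 4))   # 0.42857142857142855
print(bubble_fraction(4, 32))  # 0.08571428571428572
```

This is why naive PMP wastes resources at small batch sizes, and why scheduling and grouping strategies matter.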
GroPipe was designed to address exactly this industry pain point. Its core innovation is a hybrid hierarchical parallel architecture of "pipeline parallelism within groups, data parallelism between groups", together with an Automatic Model Partitioning Algorithm (AMPA) that automatically optimizes the distribution of computing tasks according to the network structure and hardware topology. Compared with traditional training schemes, GroPipe can adjust load dynamically, significantly improving GPU utilization and effectively reducing redundant communication.
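AMPA's internals are not described in this article, but the general shape of automatic model partitioning is well known: given per-layer cost estimates, split the ordered layer list into contiguous pipeline stages so that the heaviest stage is as light as possible. A hedged, generic sketch of that idea (the function name and cost model are illustrative, not GroPipe's API):

```python
from typing import List

def partition_layers(costs: List[float], stages: int) -> List[List[float]]:
    """Split an ordered list of per-layer costs into `stages` contiguous
    groups, minimizing the total cost of the heaviest group.

    Binary-searches over candidate per-stage load limits, using a greedy
    feasibility check (pack layers left to right until the limit).
    """
    def groups_needed(limit: float) -> int:
        needed, load = 1, 0.0
        for c in costs:
            if load + c > limit:
                needed, load = needed + 1, 0.0
            load += c
        return needed

    lo, hi = max(costs), sum(costs)
    for _ in range(60):  # float binary search converges well within 60 steps
        mid = (lo + hi) / 2
        if groups_needed(mid) <= stages:
            hi = mid
        else:
            lo = mid

    # Rebuild the stage assignment under the found load limit.
    result, current, load = [], [], 0.0
    for c in costs:
        if current and load + c > hi:
            result.append(current)
            current, load = [], 0.0
        current.append(c)
        load += c
    result.append(current)
    return result

# Eight layers with uneven costs split across 3 pipeline stages;
# the heaviest stage carries a total cost of 10.
print(partition_layers([4, 2, 1, 7, 3, 3, 2, 2], 3))
```

A real partitioner would profile layer compute and activation-transfer costs on the target hardware rather than take them as given, which is presumably where AMPA's awareness of hardware topology comes in.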
Experimental data show that:
* On a standard 8-GPU server (NVIDIA A100), ResNet-50 training achieves a 41.9% speedup and ResNet-152 a 42.2% speedup;
* The VGG-19 model achieves a speedup of up to 79.2%;
* BERT-base training on NLP tasks is accelerated by 51.0%.
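For readers comparing these figures with other benchmarks: assuming "speedup of X%" here means the baseline wall-clock time divided by the optimized time, minus one (the article does not state its definition), a 41.9% speedup means the baseline run takes 1.419× as long. A tiny helper makes the arithmetic explicit:

```python
def speedup_percent(baseline_s: float, optimized_s: float) -> float:
    """Percentage speedup of an optimized run over a baseline,
    defined as (T_baseline / T_optimized - 1) * 100."""
    return (baseline_s / optimized_s - 1.0) * 100.0

# Illustrative numbers only: if a baseline epoch takes 141.9 s and the
# optimized schedule takes 100 s, the speedup is 41.9%.
print(round(speedup_percent(141.9, 100.0), 1))  # 41.9
```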
These figures show that GroPipe is not simple fine-tuning or local optimization, but a deep innovation at the architectural level.
Figure: Overall framework diagram of the GroPipe method. (Image from: Northwest A&F University)
Ⅱ Performance surpassing: GroPipe helps domestic chips outperform an international flagship for the first time
Over the past few years, domestic AI acceleration chips have made steady progress in hardware design and ecosystem adaptation, but they have still struggled to match international vendors such as NVIDIA and AMD in model training performance. In a sense, the emergence of GroPipe has become a booster for "overtaking on the curve".
According to recent test data released by a joint laboratory of the Chinese Academy of Sciences and Cambricon, GroPipe running on Cambricon's latest-generation MLU370 chip surpassed the NVIDIA A100 on real tasks for the first time, improving training efficiency by 17%. The key to this result is that GroPipe's AMPA can precisely adapt to the heterogeneous architecture of Cambricon chips, optimizing the computation graph and data transfer paths to unlock the chip's full potential.
Huawei's Ascend team has already begun adapting and integrating GroPipe, with plans to fold the algorithm into the Ascend CANN compiler, and Alibaba Cloud's Feitian platform has opened APIs that let developers invoke GroPipe for distributed training. These industrial moves show that GroPipe has progressed from theoretical validation into the engineering deployment cycle.
This also means that on the main battlefield of large-model training, domestic chips paired with domestic algorithms may finally overturn the old pattern of monopolized computing power and suppressed performance.
Ⅲ From high-end to inclusive: GroPipe opens a new era of accessible computing power
Beyond large-scale model training in data centers, GroPipe is also helping democratize AI computing power. Through the open-source "GroPipe-mini" branch, the algorithm has been ported to multi-card RTX 4090 consumer GPU environments, performing well on small and medium-sized image classification and semantic segmentation tasks. Preliminary tests show that VGG-16 training on a 3-card RTX 4090 setup matches the efficiency of a 4-card A100 configuration.
The "performance equality" brought by this kind of algorithmic optimization opens a new path to AI applications for groups with limited computing power, such as small and medium-sized enterprises and research institutions. In the future, as the algorithm is further optimized and consumer-grade chips improve, training tasks that once required high-end servers costing millions of yuan may be achievable with an investment of only about ten thousand yuan. This is an enormous boost for fields such as edge AI, medical imaging, intelligent manufacturing, and education technology.
Ⅳ Reshaping the international competitive landscape: the "China path" from following to leading
The birth of GroPipe is not only a technological breakthrough but also a milestone in Chinese AI algorithms moving from benchmarking to surpassing. In evaluations of global large-model training frameworks, GroPipe outperformed NVIDIA's Megatron-LM on multiple metrics, leading in training efficiency by 89%. Its advantages in parallel scheduling and load balancing are especially pronounced in large-model and multi-task scenarios, giving it the strength to compete with leading international open-source systems such as DeepSpeed and FairScale.
As the international open-source community pays growing attention to GroPipe, research institutions such as the University of California, Berkeley, and ETH Zurich have indicated that they will include it in their algorithm evaluation systems and pursue collaborative research. The momentum of Chinese AI algorithms shifting toward exporting innovation is gradually taking shape.
Even more noteworthy is that in this algorithm-driven competition, chip-software co-design has become a key variable reshaping the global semiconductor landscape. The co-optimization of GroPipe with domestic chips exemplifies this trend.
Ⅴ Conclusion: GroPipe is not just an algorithm
GroPipe is far more than a training scheme. It is a reconstruction of how AI algorithm structure and hardware resources fit together, a bridge for domestic chips to reach the global summit of computing power, and an engine for making AI technology genuinely inclusive. It accelerates not only model training, but also the rise of domestic computing power and the reconstruction of the industrial ecosystem.
In the future, as GroPipe is further optimized and standardized, its application prospects in ultra-large-scale models, multi-task collaborative training, and intelligent edge computing will broaden. As the technology evolves, China's influence in the AI field will keep growing on the global stage.
This is not the end, but a new starting point for China's AI technology to become a global leader.