Inserting lightweight optimization code in high-speed network devices has allowed a KAUST-led collaboration to increase the speed of machine learning on parallelized computing systems five-fold.
This “in-network aggregation” technology is being developed with researchers and systems architects at Intel, Microsoft and the University of Washington. It can provide dramatic speed improvements using readily available programmable network hardware.
What gives artificial intelligence (AI) so much power to “understand” and interact with the world is the machine-learning step, in which the model is trained using large sets of labelled training data. The more data the AI is trained on, the better the model is likely to perform. Marco Canini from the KAUST research team noted:
“How to train deep-learning models at a large scale is a very challenging problem. The AI models can consist of billions of parameters, and we can use hundreds of processors that need to work efficiently in parallel. In such systems, communication among processors during incremental model updates easily becomes a major performance bottleneck.”
The team found a possible solution in new network technology developed by Barefoot Networks, a division of Intel. Amedeo Sapio, a KAUST alumnus who has since joined the Barefoot Networks team at Intel, explained:
“We use Barefoot Networks’ new programmable data plane networking hardware to offload part of the work performed during distributed machine-learning training. Using this new programmable networking hardware, rather than just the network, to move data means that we can perform computations along the network paths.”
Canini also noted that although the programmable switch data plane can perform operations very quickly, the set of operations it supports is limited.
“Our solution had to be simple enough for the hardware and yet flexible enough to solve challenges such as limited onboard memory capacity. SwitchML addresses this challenge by co-designing the communication network and the distributed training algorithm, achieving an acceleration of up to 5.5 times compared to the state-of-the-art approach.”
The fundamental innovation of the team’s SwitchML platform is to allow the network hardware to perform the data aggregation task at each synchronization step during the model update phase of the machine-learning process.
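
To make the idea concrete, the aggregation step can be pictured as a small simulation. The sketch below is illustrative only and is not the SwitchML protocol itself: the function name `aggregate_in_network`, the `slot_size` parameter and the one-slot-at-a-time processing are assumptions chosen to mimic a switch with very limited onboard memory that can do little more than add numbers. Each worker contributes its model update in small chunks, the simulated switch sums the matching chunks from all workers, and the summed chunk is returned to every worker.

```python
from typing import List


def aggregate_in_network(worker_updates: List[List[float]], slot_size: int) -> List[float]:
    """Simulate a switch summing model updates from all workers, chunk by chunk.

    Hypothetical helper for illustration; `slot_size` stands in for the
    limited memory available on the switch.
    """
    if slot_size <= 0:
        raise ValueError("slot_size must be positive")
    length = len(worker_updates[0])
    if any(len(update) != length for update in worker_updates):
        raise ValueError("all workers must send updates of the same length")

    aggregated: List[float] = []
    for start in range(0, length, slot_size):
        # The simulated switch holds only one small slot at a time.
        slot = [0.0] * min(slot_size, length - start)
        for update in worker_updates:
            # Each worker's chunk is added into the slot as it "arrives".
            for i, value in enumerate(update[start:start + slot_size]):
                slot[i] += value
        # The finished slot is "broadcast" back to the workers and reused.
        aggregated.extend(slot)
    return aggregated


# Two workers, three parameters, a switch slot that holds two values.
print(aggregate_in_network([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]], slot_size=2))
# prints [11.0, 22.0, 33.0]
```

In this picture the workers never exchange their full updates with one another; they only send chunks toward the switch and receive sums back, which is the sense in which the computation happens along the network path.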