At its re:Invent conference, Amazon’s AWS cloud arm announced SageMaker HyperPod, a purpose-built service for training and fine-tuning large language models (LLMs). SageMaker HyperPod is now generally available.
Amazon has long relied on SageMaker, its service for building, training, and deploying machine learning models, as the cornerstone of its machine learning strategy. Now, with the emergence of generative AI, it’s no surprise that the company is leaning on SageMaker as its primary product for training and fine-tuning LLMs.
“SageMaker HyperPod provides the capability to create a distributed cluster with optimized accelerated instances for distributed training,” explained Ankur Mehrotra, AWS’ general manager for SageMaker, in an interview prior to the announcement. “It equips you with tools to effectively distribute models and data across your cluster, accelerating the training process.”
Mehrotra also highlighted that SageMaker HyperPod enables users to regularly save checkpoints, allowing them to pause, analyze, and optimize the training process without starting over. The service also includes several fail-safes to prevent the entire training process from failing if a GPU malfunctions.
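To illustrate the general idea of checkpoint-and-resume described above, here is a minimal sketch in plain Python. It is not HyperPod’s API or implementation; the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON file format, and the toy `weight` value standing in for model parameters are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so a crash mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weight": 0.0}


def train(path, total_steps, save_every=10, fail_at=None):
    """Toy training loop (hypothetical): 'weight' stands in for model parameters."""
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["weight"] += 0.5  # placeholder for a real optimizer update
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If a run started with `train(path, 100, fail_at=37)` crashes, calling `train(path, 100)` again picks up from the most recent checkpoint rather than from step zero, so only the work done since that checkpoint is repeated.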
“For an ML team simply focused on model training, it becomes a near-zero-touch experience, and the cluster essentially becomes self-healing,” Mehrotra explained. “Overall, these capabilities can accelerate the training of foundation models by up to 40%, which is a significant advantage in terms of cost and time-to-market.”
Users can choose to train on Amazon’s custom Trainium (and now Trainium 2) chips or on Nvidia-based GPU instances, including those using the H100 processor. Amazon says HyperPod can speed up the training process by up to 40%.
The company already has experience with SageMaker being used to build LLMs: the Falcon 180B model, for example, was trained on SageMaker using a cluster of thousands of A100 GPUs. Mehrotra said AWS drew on that experience, along with its earlier work on scaling SageMaker, to develop HyperPod.
Aravind Srinivas, co-founder and CEO of Perplexity AI, said his company had early access to the service during its private beta. Although the team was initially skeptical about using AWS for model training and fine-tuning, it found it easy to get support from AWS and to access enough GPUs for its use case.
Srinivas emphasized that the AWS HyperPod team focused on optimizing the interconnects that link Nvidia’s graphics cards, significantly speeding up the communication of gradients and parameters across different nodes.