What’s new with AI Hypercomputer?


Open software capabilities for training and inference

The real value of hardware is unlocked by co-designed software. AI Hypercomputer’s software layer helps AI practitioners and engineers move faster with open and popular ML frameworks and libraries such as PyTorch, JAX, vLLM, and Keras. For infrastructure teams, that translates to faster delivery times and more cost-efficient resource utilization. We’ve made significant advances in software for both AI training and inference.

Pathways on Cloud: Pathways, developed by Google DeepMind, is the distributed runtime that powers Google’s internal large-scale training and inference infrastructure, now available on Google Cloud for the first time. For inference, it includes features like disaggregated serving, which runs the prefill and decode stages of an inference workload on separate compute units, each scaling independently to deliver ultra-low latency and high throughput. It is available to customers through JetStream, our high-throughput, low-latency inference library. Pathways also enables elastic training, allowing your training workloads to automatically scale down on failure and scale up on recovery while preserving continuity. To learn more about Pathways on Cloud, including additional use cases for the Pathways architecture, read the documentation.
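To make the disaggregated-serving idea concrete, here is a minimal Python sketch of the pattern: prefill and decode run on separate worker pools that can be sized and scaled independently. This is an illustration of the concept only, not the Pathways or JetStream API; the pool sizes, data structures, and token handling are placeholders.

```python
# Conceptual sketch of disaggregated serving: the prefill stage (compute-bound,
# processes the whole prompt) and the decode stage (memory-bandwidth-bound,
# emits one token at a time) run on separate pools that scale independently.
# Illustration only -- not the Pathways or JetStream API.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class PrefillResult:
    request_id: str
    kv_cache: list     # stand-in for the real attention KV cache
    next_token: str


def prefill(request_id: str, prompt: str) -> PrefillResult:
    # Heavy pass over the full prompt; benefits from large, batched compute.
    return PrefillResult(request_id, kv_cache=prompt.split(), next_token="<tok0>")


def decode(state: PrefillResult, max_new_tokens: int = 4) -> str:
    # Iterative generation loop; typically served by many smaller workers.
    tokens = [state.next_token] + [f"<tok{i}>" for i in range(1, max_new_tokens)]
    return f"{state.request_id}: " + " ".join(tokens)


# Separate pools so each stage can be provisioned and scaled on its own.
prefill_pool = ThreadPoolExecutor(max_workers=2)   # e.g., prefill compute units
decode_pool = ThreadPoolExecutor(max_workers=8)    # e.g., decode compute units

prefill_futures = [
    prefill_pool.submit(prefill, f"req-{i}", "explain disaggregated serving")
    for i in range(4)
]
for future in prefill_futures:
    print(decode_pool.submit(decode, future.result()).result())
```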

Train models with high performance and reliability

Training workloads are highly synchronized jobs that run across thousands of nodes. A single degraded node can disrupt an entire job, resulting in longer time-to-market and higher costs. To provision a cluster quickly, you need VMs tuned for specific model architectures and located in close physical proximity. You also need the ability to predict and troubleshoot node failures quickly and to ensure workload continuity in the event of a failure.

Cluster Director for GKE and Cluster Director for Slurm: Cluster Director (formerly Hypercompute Cluster) lets you deploy and manage a group of accelerators as a single unit with physically colocated VMs, targeted workload placement, advanced cluster maintenance controls, and topology-aware scheduling. Today we are announcing new updates for Cluster Director, coming later this year:

  • Cluster Director for Slurm, a fully managed Slurm offering with a simplified UI and APIs to provision and operate Slurm clusters, including blueprints for common workloads with pre-configured software to make deployments reliable and repeatable.

  • 360° observability features, including dashboards for visibility into cluster utilization, health, and performance, plus advanced capabilities like AI Health Predictor and Straggler Detection to proactively detect and remediate failures, down to individual nodes.

  • Job continuity capabilities like end-to-end automated health checks that continuously monitor the fleet and preemptively replace unhealthy nodes. The result is uninterrupted training even in degraded clusters, with multi-tier checkpointing for faster save and retrieval (a rough sketch of this checkpointing pattern appears below).

Cluster Director for GKE will natively support new Cluster Director features as they become available. Cluster Director for Slurm will be available in the coming months, including support for both GPUs and TPUs. Register for early access.
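The job-continuity bullet above mentions multi-tier checkpointing. As a rough illustration of that general pattern (not the Cluster Director implementation), the sketch below writes a checkpoint to fast local storage first so training can resume quickly, then copies it to Cloud Storage in the background for durability. The bucket name and paths are placeholders, and an authenticated google-cloud-storage client is assumed.

```python
# Generic multi-tier checkpointing sketch: fast local tier for quick save and
# restore, durable object-storage tier filled in asynchronously.
# Placeholder bucket/paths; assumes google-cloud-storage is installed and
# application-default credentials are available.
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from google.cloud import storage  # pip install google-cloud-storage

LOCAL_TIER = Path("/mnt/localssd/ckpt")   # fast tier, e.g., local SSD
BUCKET_NAME = "my-training-checkpoints"   # durable tier (placeholder bucket)

_uploader = ThreadPoolExecutor(max_workers=1)


def _upload(local_path: Path, step: int) -> None:
    # Durable copy; runs in the background so the training loop is not blocked.
    blob = storage.Client().bucket(BUCKET_NAME).blob(f"ckpt/step_{step}.ckpt")
    blob.upload_from_filename(str(local_path))


def save_checkpoint(step: int, state_file: Path) -> None:
    # Fast tier first: training resumes from local disk after most restarts.
    local_path = LOCAL_TIER / f"step_{step}.ckpt"
    local_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(state_file, local_path)
    _uploader.submit(_upload, local_path, step)


# Example usage inside a training loop:
# save_checkpoint(step=1000, state_file=Path("/tmp/model_state.pt"))
```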

Run inference workloads efficiently at any scale

AI inference has evolved rapidly over the last year. Longer, highly variable context windows are enabling more sophisticated interactions, while reasoning and multi-step inference are shifting the incremental demand for compute, and therefore cost, from training to inference time (test-time scaling). To enable useful AI applications for end users, you need software that can efficiently serve both today’s and tomorrow’s interactions.

Announcing AI inference capabilities in GKE: Inference Gateway and Inference Quickstart. 

  • GKE Inference Gateway offers intelligent scaling and load balancing, handling request scheduling and routing with gen AI model-aware techniques (a conceptual sketch of this kind of routing appears below).

  • With GKE Inference Quickstart, you can choose an AI model and your desired performance, and GKE will configure the right infrastructure, accelerators, and Kubernetes resources to match. 

Both features are available in preview today. Together, they reduce serving costs by over 30%, cut tail latency by 60%, and increase throughput by up to 40% compared to other managed and open-source Kubernetes offerings.
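To illustrate what model-aware load balancing means in practice, here is a conceptual Python sketch that routes each request to the replica with the shortest queue and the most KV-cache headroom, rather than round-robin. It is not the GKE Inference Gateway implementation; the Replica fields and scoring weights are assumptions for illustration.

```python
# Conceptual model-aware routing: prefer replicas with spare KV-cache capacity
# and short request queues. Illustration only -- the metric names, fields, and
# weights are assumptions, not the GKE Inference Gateway's actual logic.
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    queue_depth: int              # requests waiting on this model server
    kv_cache_utilization: float   # 0.0 (empty) to 1.0 (full)


def score(replica: Replica) -> float:
    # Lower is better: penalize long queues and nearly full KV caches.
    return replica.queue_depth + 10.0 * replica.kv_cache_utilization


def pick_replica(replicas: list[Replica]) -> Replica:
    return min(replicas, key=score)


replicas = [
    Replica("vllm-pool-0", queue_depth=3, kv_cache_utilization=0.9),
    Replica("vllm-pool-1", queue_depth=5, kv_cache_utilization=0.2),
]
print(pick_replica(replicas).name)  # -> vllm-pool-1
```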

vLLM support for TPUs: vLLM is well known as a fast and efficient library for inference. Starting today, you can easily run inference on TPUs with vLLM and take advantage of their price-performance benefits without changing your software stack beyond a few configuration changes. vLLM is supported in Compute Engine, GKE, Vertex AI, and Dataflow. And with GKE custom compute classes, you can use TPUs and GPUs in tandem within the same vLLM deployment.
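For reference, a minimal vLLM offline-inference script looks like the sketch below. The LLM, SamplingParams, and generate calls are vLLM’s standard Python API; the model name and the tensor_parallel_size value (sized here to a hypothetical 4-chip TPU slice) are illustrative assumptions, so adjust them to your own model and accelerator topology.

```python
# Minimal vLLM offline-inference sketch. The model name and parallelism
# settings are placeholders; on a TPU VM you would typically match
# tensor_parallel_size to the number of TPU chips in the slice.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model; swap in your own
    tensor_parallel_size=4,                    # e.g., a 4-chip accelerator slice
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize what disaggregated serving is."], params)
for output in outputs:
    print(output.outputs[0].text)
```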

Making consumption even more flexible

Dynamic Workload Scheduler (DWS) is a resource management and job scheduling platform that gives you easy, affordable access to accelerators. Today we’re announcing expanded accelerator support in DWS: TPU v5e, Trillium, A3 Ultra (NVIDIA H200), and A4 (NVIDIA B200) VMs are now available in preview via Flex Start mode, with Calendar mode support for TPUs coming later this month. Flex Start mode also supports a new provisioning method in which resources are provisioned immediately and scaled dynamically, making it suitable for long-running inference workloads and a wider range of training workloads; this complements Flex Start’s existing queued provisioning method, which requires all nodes to be provisioned simultaneously.

Learn about AI Hypercomputer at Next ’25

Don’t miss the action. Tune in for all of our announcements and deep dives on the event website. Start with What’s next in compute and AI infrastructure, then check out the AI Hypercomputer breakout sessions.

