Serve an LLM (Llama3.1 405B) using multiple GPU nodes

Overview

This tutorial shows you how to serve Llama 3.1 405B using graphics processing units (GPUs) across multiple nodes on Google Kubernetes Engine (GKE), using the vLLM serving framework and the LeaderWorkerSet (LWS) API.

This document is a good starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when deploying and serving your AI/ML workloads.

LeaderWorkerSet (LWS)

LWS is a Kubernetes deployment API that addresses common deployment patterns of AI/ML multi-node inference workloads. LWS enables treating multiple Pods as a group.

Multi-Host Serving with vLLM

When deploying exceptionally large language models that cannot fit into a single GPU node, use multiple GPU nodes to serve the model. vLLM supports both tensor parallelism and pipeline parallelism to run workloads across GPUs.

Tensor parallelism splits the matrix multiplications in the transformer layer across multiple GPUs. However, this strategy requires a fast network due to the communication needed between the GPUs, making it less suitable for running workloads across nodes.

Pipeline parallelism splits the model by layer, or vertically. This strategy does not require constant communication between GPUs, making it a better option when running models across nodes.
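
As a toy illustration of the two strategies described above, the following sketch uses plain Python lists in place of GPU tensors. The function names are made up for this example, and this is not how vLLM implements either strategy. It only shows that splitting a weight matrix by columns gives the same result as the full multiplication, and that pipeline stages are contiguous blocks of layers.

```python
# Toy illustration only: plain Python lists stand in for GPU tensors.

def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    """Tensor parallelism: give each device a contiguous slice of w's columns."""
    step = len(w[0]) // parts  # assumes the column count is divisible by parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

def tensor_parallel_matmul(x, w, parts):
    """Each device multiplies x by its shard; the partial outputs are concatenated."""
    out = []
    for shard in split_columns(w, parts):
        out.extend(matmul(x, shard))  # on real hardware, gathering these crosses the interconnect
    return out

def split_layers(num_layers, stages):
    """Pipeline parallelism: assign a contiguous block of layers to each stage."""
    base, extra = divmod(num_layers, stages)
    blocks, start = [], 0
    for stage in range(stages):
        end = start + base + (1 if stage < extra else 0)
        blocks.append(list(range(start, end)))
        start = end
    return blocks

x = [1, 2]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, parts=2)
stages = split_layers(num_layers=8, stages=2)
print(full, sharded, stages)
```

The tensor-parallel path has to combine partial results for every multiplication, while a pipeline stage only hands its output to the next stage, which is why the former is more sensitive to network speed.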

You can use both strategies in multi-node serving. For example, when using two nodes with 8 H100 GPUs each, you can use two-way pipeline parallelism to shard the model across the two nodes, and eight-way tensor parallelism to shard the model across the eight GPUs on each node.
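
The arithmetic behind this layout can be sketched as follows. The numbers are rough assumptions rather than measurements: a parameter count of exactly 405 billion (taken from the model name), 2 bytes per parameter for 16-bit weights, and 80 GB of memory per H100. The estimate covers weights only and ignores the KV cache and activations.

```python
PARAMS = 405e9          # assumption: parameter count implied by the model name
BYTES_PER_PARAM = 2     # assumption: 16-bit weights
GPU_MEMORY_GB = 80      # H100 80GB

def weights_gb(params=PARAMS, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of the model weights in GB (weights only)."""
    return params * bytes_per_param / 1e9

def per_gpu_weights_gb(nodes, gpus_per_node):
    """Weights per GPU when tensor parallelism spans the GPUs in a node
    and pipeline parallelism spans the nodes."""
    tensor_parallel = gpus_per_node
    pipeline_parallel = nodes
    return weights_gb() / (tensor_parallel * pipeline_parallel)

total_gb = weights_gb()
single_node_capacity_gb = 8 * GPU_MEMORY_GB
share_gb = per_gpu_weights_gb(nodes=2, gpus_per_node=8)

print(f"weights: {total_gb:.0f} GB, one node holds {single_node_capacity_gb} GB")
print(f"per GPU across 2 nodes x 8 GPUs: {share_gb:.1f} GB of {GPU_MEMORY_GB} GB")
```

Under these assumptions the weights alone exceed the combined memory of one eight-GPU node, while sixteen GPUs leave room on each GPU for the cache and activations.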

Objectives

Prepare a GKE Standard cluster.
Deploy vLLM across multiple nodes in your cluster.
Use vLLM to serve the Llama 3.1 405B model and query it with curl.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Make sure that billing is enabled for your Google Cloud project.

Enable the required API.

Make sure that you have the following roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin

Check for the roles

1. In the Google Cloud console, go to the IAM page.
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

Grant the roles

1. In the Google Cloud console, go to the IAM page.
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
5. In the Select a role list, select a role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.

Create a Hugging Face account, if you don't already have one.
Ensure your project has sufficient quota for GPUs. To learn more, see About GPUs and Allocation quotas.

Get access to the model

Generate an access token

If you don't already have one, generate a new Hugging Face token:

1. Click Your Profile > Settings > Access Tokens.
2. Select New Token.
3. Specify a Name of your choice and a Role of at least Read.
4. Select Generate a token.

Prepare the environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you'll need for this tutorial, including kubectl and gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables:
```
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME=CLUSTER_NAME
export ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
export IMAGE_NAME=IMAGE_NAME
```
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- ZONE: a zone that supports H100 GPUs.
- HUGGING_FACE_TOKEN: the Hugging Face access token that you generated earlier.
- IMAGE_NAME: the vLLM image that includes the Ray script. You can use the provided image, us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240821_1034_RC00, or build your own.
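
A small pre-flight check can catch a variable that was never exported or that still holds its placeholder text. The following is a minimal sketch; the helper name `unset_or_placeholder` is hypothetical, and the variable names and placeholders are copied from the export commands above.

```python
import os

# Variable name -> placeholder text used in the export commands above.
REQUIRED = {
    "PROJECT_ID": "PROJECT_ID",
    "CLUSTER_NAME": "CLUSTER_NAME",
    "ZONE": "ZONE",
    "HF_TOKEN": "HUGGING_FACE_TOKEN",
    "IMAGE_NAME": "IMAGE_NAME",
}

def unset_or_placeholder(env=None):
    """Return the names of variables that are empty, missing, or still a placeholder."""
    env = os.environ if env is None else env
    return [name for name, placeholder in REQUIRED.items()
            if env.get(name, "") in ("", placeholder)]

problems = unset_or_placeholder()
print("Variables to fix:", problems if problems else "none")
```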

Create a GKE cluster

You can serve models using vLLM across multiple GPU nodes in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Autopilot

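In Cloud Shell, run the following command. It reads the REGION and CLUSTER_VERSION environment variables, which the previous section does not set, so export them before you run it: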

  gcloud container clusters create-auto ${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --region=${REGION} \
    --cluster-version=${CLUSTER_VERSION}

Standard

Create a GKE Standard cluster with two CPU nodes:

gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --num-nodes=2 \
    --location=ZONE \
    --machine-type=e2-standard-16

Create an A3 node pool with two nodes, with eight H100s each:

gcloud container node-pools create gpu-nodepool \
    --location=ZONE \
    --num-nodes=2 \
    --machine-type=a3-highgpu-8g \
    --accelerator=type=nvidia-h100-80gb,count=8,gpu-driver-version=LATEST \
    --placement-type=COMPACT \
    --cluster=CLUSTER_NAME

Configure `kubectl` to communicate with your cluster:

gcloud container clusters get-credentials CLUSTER_NAME --location=ZONE

Create a Kubernetes Secret for Hugging Face credentials

Create a Kubernetes Secret that contains the Hugging Face token:

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -
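
For context, Kubernetes stores each value in a Secret's `data` field as a base64-encoded string, which is an encoding and not encryption. The following sketch builds a dictionary with the same core fields as the Secret created above. The function name and the token value `hf_example_token` are made up for illustration.

```python
import base64

def secret_manifest(name, key, value):
    """Build the core fields of a generic (Opaque) Secret as a Python dict."""
    encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {key: encoded},  # values under "data" are base64-encoded
    }

# "hf_example_token" is a made-up value, not a real token.
manifest = secret_manifest("hf-secret", "hf_api_token", "hf_example_token")
decoded = base64.b64decode(manifest["data"]["hf_api_token"]).decode("utf-8")
print(manifest["data"]["hf_api_token"], "->", decoded)
```

Because anyone who can read the Secret can decode the token, access to the Secret should be restricted like access to the token itself.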

(Optional) Create your own vLLM multi-node image

Choose this option if you want more control over the contents of your Docker image or need to include specific dependencies alongside your script. To run vLLM across multiple nodes, you can use Ray for cross-node communication. The LeaderWorkerSet repository contains a Dockerfile that includes a bash script to set up Ray with vLLM.

Build the container

Clone the LeaderWorkerSet repository:

git clone https://github.com/kubernetes-sigs/lws.git

Build the image:

cd lws/docs/examples/vllm/build/ && docker build -f Dockerfile . -t vllm-multihost

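Push the image to Artifact Registry

To ensure that your Kubernetes deployment can access the image, store it in Artifact Registry within your Google Cloud project. In the following commands, replace REGION_NAME with the region for the repository and PROJECT_ID with your project ID: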

gcloud artifacts repositories create vllm-multihost --repository-format=docker --location=REGION_NAME && \
gcloud auth configure-docker REGION_NAME-docker.pkg.dev && \
docker image tag vllm-multihost REGION_NAME-docker.pkg.dev/PROJECT_ID/vllm-multihost/vllm-multihost:latest && \
docker push REGION_NAME-docker.pkg.dev/PROJECT_ID/vllm-multihost/vllm-multihost:latest
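
The image path used in the tag and push commands above follows the pattern REGION_NAME-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG. The following sketch assembles it; the function name is hypothetical, and `us-central1` and `my-project` are example values only.

```python
def artifact_registry_image(region, project_id, repository, image, tag="latest"):
    """Assemble a Docker image path for Artifact Registry."""
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"

# Example values only; substitute your own region and project ID.
image_path = artifact_registry_image(
    region="us-central1",
    project_id="my-project",
    repository="vllm-multihost",
    image="vllm-multihost",
)
print(image_path)
```

If you build your own image, this path is the value to use wherever the tutorial refers to IMAGE_NAME.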

Install LeaderWorkerSet

To install LWS, run the following command:

VERSION=v0.4.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml

Validate that the LeaderWorkerSet controller is running in the lws-system namespace:

kubectl get pod -n lws-system

The output is similar to the following:

NAME                                      READY   STATUS    RESTARTS   AGE
lws-controller-manager-5c4ff67cbd-9jsfc   2/2     Running   0          6d23h
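
If you want to script this check, the table can be parsed with a few lines of Python. This is a minimal sketch that assumes the default column layout shown above; the function name `pods_ready` is hypothetical.

```python
def pods_ready(kubectl_output):
    """Return True if every Pod row shows STATUS Running and READY n/n."""
    rows = [line.split() for line in kubectl_output.strip().splitlines()[1:]]
    if not rows:
        return False
    for _name, ready, status, *_rest in rows:
        have, want = ready.split("/")
        if status != "Running" or have != want:
            return False
    return True

# Sample copied from the output shown in this section.
sample = """\
NAME                                      READY   STATUS    RESTARTS   AGE
lws-controller-manager-5c4ff67cbd-9jsfc   2/2     Running   0          6d23h
"""
print(pods_ready(sample))
```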

Deploy vLLM Model Server

To deploy the vLLM model server, follow these steps:

Inspect the manifest vllm-llama3-405b-A3.yaml:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-leader
            image: IMAGE_NAME
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "/workspace/vllm/examples/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-worker
            image: IMAGE_NAME
            command:
              - sh
              - -c
              - "/workspace/vllm/examples/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm   
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
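
The manifest ties three numbers together: the LeaderWorkerSet group `size` (2), the `nvidia.com/gpu` limit per Pod (8), and the parallelism flags passed to vLLM. The following sketch checks that an edited manifest still matches the layout this tutorial describes, with tensor parallelism inside a Pod and pipeline parallelism across Pods. The function names are hypothetical, and the check reflects this tutorial's layout rather than a general vLLM requirement.

```python
import re

def parallel_sizes(command):
    """Extract the tensor and pipeline parallel sizes from a vLLM command line."""
    sizes = {}
    for flag in ("tensor-parallel-size", "pipeline-parallel-size"):
        match = re.search(rf"--{flag}[ =](\d+)", command)
        sizes[flag] = int(match.group(1)) if match else 1  # vLLM defaults to 1 when omitted
    return sizes["tensor-parallel-size"], sizes["pipeline-parallel-size"]

def check_topology(group_size, gpus_per_pod, command):
    """Return a list of mismatches between the LWS settings and the vLLM flags."""
    tp, pp = parallel_sizes(command)
    problems = []
    if tp != gpus_per_pod:
        problems.append(f"tensor-parallel-size {tp} != GPUs per Pod {gpus_per_pod}")
    if pp != group_size:
        problems.append(f"pipeline-parallel-size {pp} != LWS group size {group_size}")
    return problems

# Command copied from the leader container in the manifest above.
COMMAND = ("python3 -m vllm.entrypoints.openai.api_server --port 8080 "
           "--model meta-llama/Meta-Llama-3.1-405B-Instruct "
           "--tensor-parallel-size 8 --pipeline-parallel-size 2")

print(check_topology(group_size=2, gpus_per_pod=8, command=COMMAND))
```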

Replace IMAGE_NAME in the manifest with the path of your vLLM image, and then apply the manifest by running the following command:
```
kubectl apply -f vllm-llama3-405b-A3.yaml
```

View the logs from the running model server:

kubectl logs vllm-0 -c vllm-leader

The output should look similar to the following:

INFO 08-09 21:01:34 api_server.py:297] Route: /detokenize, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/models, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /version, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/chat/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [7428]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

Serve the model

Run the following command to set up port forwarding to the model:

kubectl port-forward svc/vllm-leader 8080:8080

Interact with the model using curl

In a new terminal, send a request to the server:

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
}'
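
The same request can also be sent from Python using only the standard library. The following is a minimal sketch that assumes the port forward from the previous step is still running on localhost:8080. The function names are hypothetical, and the request fields mirror the curl command above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # port-forward address used in this tutorial
MODEL = "meta-llama/Meta-Llama-3.1-405B-Instruct"

def build_payload(prompt, max_tokens=7, temperature=0):
    """Build the JSON body for the /v1/completions endpoint."""
    return {"model": MODEL, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": temperature}

def first_completion(response_body):
    """Return the generated text of the first choice in a completion response."""
    return json.loads(response_body)["choices"][0]["text"]

def complete(prompt, base_url=BASE_URL, timeout=120):
    """Send the prompt to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return first_completion(response.read())
```

Calling `complete("San Francisco is a")` while the port forward is active returns only the generated text. The curl command returns the full JSON response body.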

The output should be similar to the following:

{"id":"cmpl-0a2310f30ac3454aa7f2c5bb6a292e6c","object":"text_completion","created":1723238375,"model":"meta-llama/Meta-Llama-3.1-405B-Instruct","choices":[{"index":0,"text":" top destination for foodies, with","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}

(Optional) Speed up model load times with Hyperdisk ML

vLLM can take up to 90 minutes to download, load, and warm up Llama 3.1 405B on each new replica. You can reduce this time to 20 minutes by downloading the model directly to a Hyperdisk ML volume and mounting that volume to each Pod.

You can follow the tutorial Accelerate AI/ML data loading with Hyperdisk ML using the following YAML files:

Save the following example manifest as producer-pvc.yaml:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: producer-pvc
spec:
  storageClassName: hyperdisk-ml
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 800Gi
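
As a rough sanity check on the 800Gi request, the following sketch estimates the size of the model weights. It assumes exactly 405 billion parameters stored as 16-bit values; the actual size of the downloaded files may differ, so treat the result as an estimate only.

```python
GIB = 2**30  # bytes per GiB, the unit behind the "Gi" suffix in Kubernetes

def checkpoint_gib(params=405e9, bytes_per_param=2):
    """Estimate the weight size in GiB (assumption: 16-bit weights)."""
    return params * bytes_per_param / GIB

def fits(pvc_gib, params=405e9, bytes_per_param=2):
    """Return True if the estimated weights fit in a volume of pvc_gib GiB."""
    return checkpoint_gib(params, bytes_per_param) <= pvc_gib

estimate = checkpoint_gib()
print(f"estimated weights: {estimate:.0f} GiB, fits in 800Gi: {fits(800)}")
```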

Save the following example manifest as producer-job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: producer-job
spec:
  template:  # Template for the Pods the Job will create
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/machine-family
                operator: In
                values:
                - "c3"
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - "ZONE"
      containers:
      - name: copy
        resources:
          requests:
            cpu: "32"
          limits:
            cpu: "32"
        image: python:3.11-bookworm
        command:
        - bash
        - -c
        - "pip install 'huggingface_hub==0.24.6' && \
          huggingface-cli download meta-llama/Meta-Llama-3.1-405B-Instruct --local-dir-use-symlinks=False --local-dir=/data/Meta-Llama-3.1-405B-Instruct --include *.safetensors *.json"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
          - mountPath: "/data"
            name: volume
      restartPolicy: Never
      volumes:
        - name: volume
          persistentVolumeClaim:
            claimName: producer-pvc
  parallelism: 1         # Run 1 Pod at a time
  completions: 1         # The Job is done once 1 Pod completes successfully
  backoffLimit: 4        # Max retries on failure

Deploy the vLLM Model Server

After you complete the steps in that tutorial, deploy the vLLM multi-node GPU server that consumes the Hyperdisk ML volume. The following manifest mounts the volume through a PersistentVolumeClaim named hdml-static-pvc:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            image: IMAGE_NAME
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "/workspace/vllm/examples/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model /models/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /models
                name: llama3-405b
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: llama3-405b
          persistentVolumeClaim:
            claimName: hdml-static-pvc
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: IMAGE_NAME
            command:
              - sh
              - -c
              - "/workspace/vllm/examples/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /models
                name: llama3-405b
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: llama3-405b
          persistentVolumeClaim:
            claimName: hdml-static-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To delete the cluster and the resources deployed in it, run the following command:

gcloud container clusters delete CLUSTER_NAME \
  --location=ZONE

What's next

Learn more about GPUs in GKE.
Explore the vLLM GitHub repository and documentation.
Explore the LWS GitHub repository.
