
vLLM Server with AWS EKS

Wanted to play around with a specific LLM model, and decided I’ll try hosting it via EKS so I could learn something new…

Key Concepts

EKS is a fully managed Kubernetes control plane

  • A node is a physical/virtual machine that runs pods
  • A pod is the smallest deployable unit. It usually has one container (it can run multiple containers that share the same network namespace + volumes), and always runs fully on one node (it never spans nodes)
  • A node group is a concept from managed Kubernetes (i.e. an implementation detail of cloud providers). It is a group of identical nodes managed together. Each node group has the same instance type, OS image (AMI), scaling settings, & shares labels & taints
  • A deployment ensures the desired number of pods are running at all times
  • The Cluster Autoscaler (CA) controls the launch/tear-down of nodes. It isn't enabled by default on EKS - we'll install it ourselves later
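To make these concrete, here's a minimal pod spec for reference (the names are illustrative - we won't use this pod in the guide):

```yaml
# Minimal pod: one container, scheduled onto exactly one node (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: hello
spec:
  containers:
    - name: main
      image: busybox
      command: ["sh", "-c", "echo hello; sleep 3600"]
```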

Autoscaling in Kubernetes

There are 2 types of autoscaling in Kubernetes

  • Scaling pods: Adjust the no. of pods based on metrics (e.g. More requests -> Spin up more pods to handle the load)
  • Scaling nodes: Adjust the no. of nodes, based on pending pods that can't be scheduled. Request more nodes when pods are pending
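Pod-level scaling is usually driven by a HorizontalPodAutoscaler (HPA). We don't create one in this guide, but a minimal sketch looks like this (it targets our deployment by name; the thresholds are illustrative):

```yaml
# Sketch of an HPA that scales pods on CPU utilization (illustrative values)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa          # hypothetical - not created in this guide
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-gpu        # the deployment whose replica count it adjusts
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale up when average CPU use exceeds 70%
```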

Nodes only react to pods... So if we want to scale resources, we do it through the pods! We don't directly say, "Give me 5 nodes so I can run 5 pods," we say "Give me 5 pods" and the autoscaler ensures the right number of nodes

  • Pods drive scaling; nodes follow
  • If we want more resources: We make the autoscaler scale pods (which go to "Pending" if no nodes exist to run them), and let the Cluster Autoscaler scale nodes to match
  • For our use case, we enforce a "1 pod == 1 machine" requirement for our GPU workload. Each pod must get the node's full GPU capacity, and Kubernetes schedules the pod only where those GPUs are free. We don't run multiple vLLM servers on the same GPU - we don't gain any speedup that way...
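The "1 pod == 1 machine" rule is enforced through the pod's resource requests - a sketch of the relevant fragment (the full deployment spec appears later in this post):

```yaml
# Pod-level GPU reservation: the pod claims the node's full GPU capacity
resources:
  requests:
    nvidia.com/gpu: "4"   # all 4 GPUs of a g4dn.12xlarge node
  limits:
    nvidia.com/gpu: "4"   # GPUs aren't sharable, so request must equal limit
```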

Autoscaling doesn't happen by default. Later, we'll install Cluster Autoscaler to make autoscaling actually happen.

Autoscaling in action:

  • A pod is "pending" when it can't be scheduled due to (a) No existing node satisfies its nodeSelector, (b) Needs more CPU/memory/GPU than available, (c) taints prevent scheduling
  1. The Cluster Autoscaler (CA) continuously watches for pods that are unschedulable for at least N (~10) seconds
  2. When the CA sees pending pods, the CA examines the pod's nodeSelectors, affinity rules, taints & tolerations, resource requests (CPU, memory, GPU), and pod priority, to find which node groups are capable of hosting the pod
  3. Then, the CA requests EKS to scale up the nodegroup. CA calls the AWS EKS API, updating the nodegroup's desiredCapacity (the no. of worker nodes we want our nodegroup to have now) by 1
  4. Now, the nodegroup's Auto Scaling Group (ASG) sees the nodegroup's desiredCapacity has increased. It reacts by booting up a new node (EC2 instance, which has the AMI & instance type specified in the nodegroup's config), which runs the bootstrap script to join the cluster. EKS applies labels/taints from the nodegroup config onto the new node
  5. Once kubelet (the node agent running on the EC2 instance) reports Ready (sends a heartbeat to the control plane), Kubernetes schedules the pending pod to run on that new node

The Cluster Autoscaler relies on the nodegroup's EC2 Auto Scaling Group to scale up the instances in the nodegroup.

Elaborating on (2), no nodegroups are capable of hosting the pod when

  • The pod has a nodeSelector that no nodegroup provides
  • The pod requires more CPU/RAM/GPU than any nodegroup can provide
  • The pod needs a taint/toleration that no nodegroup matches
  • The nodegroups with matching labels/taints are at maxSize

In this case, the CA does nothing. ASG doesn't create any nodes & the pod stays Pending forever
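For example, a pod like this hypothetical one would stay Pending forever, since no nodegroup in our cluster carries the label it selects:

```yaml
# Hypothetical pod that nothing can host: its nodeSelector matches no nodegroup
apiVersion: v1
kind: Pod
metadata:
  name: stuck-pod
spec:
  nodeSelector:
    workload: tpu        # no nodegroup has this label -> CA does nothing, pod stays Pending
  containers:
    - name: app
      image: busybox
```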

Pod Pending
     ↓
Cluster Autoscaler notices
     ↓
Cluster Autoscaler → EKS API: scale nodegroup
     ↓
EKS → ASG: desired capacity +1
     ↓
ASG launches EC2 instance
     ↓
Node joins cluster
     ↓
Kubernetes schedules pod

EKS supports 3 options

  1. Managed Node Groups (EC2)
  2. Self-Managed EC2 nodes
  3. Fargate (Serverless)

However, Fargate doesn't support GPUs - which we need for our vLLM inference. So, we'll use option (1)

Guide

1. Set region

export AWS_REGION=ap-southeast-1
aws configure set region $AWS_REGION

2. Create the EKS Control Plane from the eks-gpu-cluster.yaml config file

Ensure you've given the appropriate AWS permissions for the IAM user that's running these AWS commands.

Add the policies AmazonEC2FullAccess and AWSCloudFormationFullAccess.

Add the below policy for IAM actions

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "AllowEksIAMActions",
			"Effect": "Allow",
			"Action": [
				"iam:CreateRole",
				"iam:GetRole",
				"iam:PassRole",
				"iam:AttachRolePolicy",
				"iam:CreateInstanceProfile",
				"iam:AddRoleToInstanceProfile",
				"iam:CreateServiceLinkedRole",
				"tag:GetResources",
				"tag:TagResources",
				"tag:UntagResources",
				"iam:TagRole",
				"iam:UntagRole",
				"iam:TagUser",
				"iam:UntagUser",
				"iam:TagPolicy",
				"iam:UntagPolicy"
			],
			"Resource": "*"
		}
	]
}

Add the below policy for EKS actions

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "AllowEksActions",
			"Effect": "Allow",
			"Action": [
				"eks:*"
			],
			"Resource": "*"
		}
	]
}

Granting eks:* is bad practice... but I was dealing with many permission issues...

eksctl creates our EKS cluster using **CloudFormation**. CloudFormation lets us deploy IaC using templates (JSON/YAML)

  • A CloudFormation stack is a collection of AWS resources created, updated, or deleted together as a single unit based on the CloudFormation template

eksctl typically creates two stacks

  1. Cluster Stack: Contains the EKS control plane resources, like EKS cluster, VPC, subnet, RTs, IGWs, IAM roles
  2. NodeGroup Stack(s): EC2 instances, Node IAM roles, Auto Scaling Group, etc.
    • Each nodegroup created by eksctl gets its own CloudFormation stack

Create the cluster with this command

eksctl create cluster -f eks-gpu-cluster.yaml

Verify that the nodegroup was created

  • In the console: Go to EKS > Cluster > Compute
  • Via the CLI:

eksctl get nodegroup --cluster=my-vllm --region=ap-southeast-1

Once we've run this command, we've created the EKS cluster itself. We now have a Kubernetes API endpoint and a cluster certificate

  • EKS also creates a dedicated VPC, public + private subnets, route tables, security groups for the control plane, and Nodegroup IAM roles (if we didn't explicitly define the VPC)

To see what nodes our EKS cluster is running:

# Update/create the `~/.kube/config` file so kubectl can talk to our AWS EKS Cluster
# Then, it sets this context as our current context. So when we run kubectl commands, we'll refer to this cluster
aws eks update-kubeconfig --name my-vllm --region $AWS_REGION

# Queries the cluster & list all Kubernetes worker nodes
kubectl get nodes

Note that our ~/.kube/config file can contain multiple clusters. We can switch between them using kubectl config use-context <ctx_name>
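A trimmed sketch of what ~/.kube/config looks like with two clusters (the cluster/user names here are illustrative):

```yaml
# Trimmed ~/.kube/config with two contexts (illustrative names)
contexts:
  - name: my-vllm-ctx
    context:
      cluster: my-vllm.ap-southeast-1.eksctl.io
      user: admin@my-vllm
  - name: other-ctx
    context:
      cluster: other-cluster
      user: admin@other-cluster
current-context: my-vllm-ctx   # the context kubectl uses by default
```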

Once the EC2 node boots (when the Auto Scaling Group decides it should), it bootstraps using the bootstrap.sh script in the instance. It auto-installs the container runtime, kubelet, GPU drivers. It then joins the EKS cluster & gets the labels + taints we've configured in the nodegroup config

3. Configure NVIDIA device plugins for our EKS cluster

Kubernetes has a generic mechanism to support special hardware or resources (GPUs, NICs, etc.) via a framework called "device plugins"

We will apply the NVIDIA device plugin, which makes NVIDIA GPUs on a node visible to Kubernetes as schedulable resources

Once installed and running on GPU-enabled nodes, the plugin reports how many GPUs each node has (capacity), and Kubernetes can then schedule Pods that request GPUs

This file defines a DaemonSet, whose job is to detect Nvidia GPUs on the node & expose them to Kubernetes, so Kubernetes can see & schedule those GPUs.

  • A DaemonSet is a Kubernetes object that creates one device-plugin pod per node. That pod detects the NVIDIA GPUs on that node, and advertises the number of GPUs to the kubelet using the Kubernetes Device Plugin API
  • If we have 3 GPU nodes, then we will have 3 separate device-plugin pods, each one monitoring its own GPU node
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml

Then, edit the DaemonSet to include the taint that our nodegroup specified

kubectl edit ds nvidia-device-plugin-daemonset -n kube-system

Under spec.template.spec.tolerations, add

- effect: NoSchedule
  key: gpu
  operator: Exists

This will tolerate any taint with key gpu regardless of value. Note that a toleration permits scheduling onto tainted nodes rather than restricting the DaemonSet to them - with it, the device-plugin pods can also run on the GPU nodes we tainted (i.e. nodes we marked as gpu-capable)
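To actually restrict the plugin pods to GPU nodes only, a nodeSelector can be added as well - a sketch, assuming the workload: gpu label from our nodegroup config:

```yaml
# Under spec.template.spec, alongside the tolerations:
nodeSelector:
  workload: gpu   # only run the device-plugin pods on nodes labelled workload=gpu
```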

Once done, verify that we can see the NVIDIA-related pods

kubectl get pods -n kube-system | grep nvidia

3. (Alternative) Use AWS-managed NVIDIA device plugin add-on

aws eks create-addon --cluster-name <your-cluster-name> --addon-name aws-nvidia-device-plugin

  • Note: I didn’t test this option, but read that it’s a simpler alternative…

4. Grant AWS root user access to the kubernetes cluster with EKS access entries

To allow our root user to be part of the system:masters (full admin) group in our kubernetes cluster, navigate to the cluster in the AWS Console > Access. We will create an IAM access entry

  • If the IAM principal "arn:aws:iam::<org_id>:root" doesn't have an entry under the IAM access entries, create one
  • Provide it with the AmazonEKSClusterAdminPolicy Access Policy

Go to mapUsers in the aws-auth ConfigMap (in the kube-system namespace) and add the following under data (Replace the [account_id] portion)

mapUsers: |
  - userarn: arn:aws:iam::[account_id]:root
    groups:
    - system:masters

5. Create Kubernetes namespace

A namespace is a logical grouping of Kubernetes resources inside a Kubernetes cluster. Namespaces help with organizing workloads, applying resource quotas/limits, using RBAC permissions, etc.

kubectl create namespace inference

When we use kubectl & want to affect our resources, we MUST reference this “inference” namespace

6. Create Kubernetes Deployment & Service

In our vllm-deployment.yaml file, we specify a deployment & service

A deployment is a Kubernetes object that manages Pods, ensuring the right no. of them are always running. It creates pods, keeps them running, replaces them if they crash, updates them safely, and scales them up or down

  • TLDR: We use a deployment to create pods.

A service exposes our pods to network access (either inside/outside the cluster). A service gives us a stable network endpoint to reach our pods

We create a kubernetes secret to store our HF_TOKEN. Our deployment will reference this secret when creating a pod

Our secret has name hf-token, with the key token. We create it in the inference namespace, so nothing outside that namespace can see/use it by default. Only workloads (pods, deployments, services, etc.) that are running in the "inference" namespace and explicitly referencing the secret can access this secret

kubectl create secret generic hf-token --from-literal=token=<HF_TOKEN_HERE> -n inference

Create the deployment & service

  • In our deployment, we specify how many pods to create, which nodes it can be scheduled on, what container runs in the pod (the image, command, args), and the min/max resources
  • In our service, we specify the loadbalancer configs - which pods it should send traffic to, which port the client should connect to, and which port to send traffic to on the pods
kubectl apply -f vllm-deployment.yaml

Get all the pods that are spun up

kubectl get pods -n inference -w

In our case, since we configured our desiredCapacity=0 in the nodegroup, there are currently 0 worker nodes. Hence, our pod will remain in the pending state.

So, we'll manually scale up the GPU node group. Later, we'll create Autoscaler that spins up nodes based on ALB request metrics

(Note: It'll take some time for the nodes to spin up.. ~3-5 mins)

# we set "--nodes 1" to set desired capacity to 1, so we spin up one GPU instance
eksctl scale nodegroup --cluster my-vllm --name gpu-ng --nodes 1

kubectl get nodes # Check that node status=READY

# To set desired capacity to 0 (i.e. tear down ALL instances)
eksctl scale nodegroup --cluster my-vllm --name gpu-ng --nodes 0

If we want to stop all running pods & prevent new ones from being created, scale all deployments to 0. We basically tell Kubernetes: "This deployment should run 0 pods"

(Note: It'll take some time for the pods to go from Pending to Ready... ~5-7 mins. It needs to pull the container image, download the model, etc..)

# Scale all deployments in the "inference" namespace to 0
kubectl scale deployment --all --replicas=0 -n inference

kubectl get pods -n inference # Check that pod status=READY
kubectl logs -n inference <POD_NAME> -f # Or monitor that pod live

# Scale back to 1...
kubectl scale deployment --all --replicas=1 -n inference

Potential issues that may arise from spinning up instances

  • vCPU limit in EC2: Go to service quota to raise the quota (for "g" family instances in this case... as we're using "g4dn" instances). Note that there's a separate service quota for On-Demand & Spot instances. Since we're using On-Demand EC2... make sure you set the quota for On-Demand!
    • The quota name should be "Running On-Demand G and VT instances". We can set the quota to 64 (defined in total vCPU), which means we can run 64 vCPU worth of G/VT instance capacity.
    • A g4dn.xlarge has 4 vCPUs, so with our quota=64, we can run 16 such instances at once...
    • Quotas are region-specific, so apply it to the region you will use...
  • nvidia.com/gpu isn't showing in the node: Ensure that the NVIDIA device plugin DaemonSet is running, and NVIDIA drivers are installed on the node
  • Insufficient GPU memory error: Check that (1) the instanceType specified in the nodegroup (in the cluster spec) has sufficient GPU memory, (2) in the deployment spec, the requests & limit for nvidia.com/gpu is set to the desired no. of GPUs (request & limit MUST BE THE SAME value for GPU, since GPU isn't a sharable resource), (3) in the deployment spec, you're running the container with the "--tensor-parallel-size" flag set to the no. of GPUs you are using, (4) Also, ensure there is sufficient CPU memory allocated to the deployment (~32GB RAM minimum)!

For debugging, we can get a shell into our pod using

kubectl get pods -n inference # Get the pod names
kubectl -n inference exec -it <POD_NAME> -- bash

If there are deployment issues, inspect the pod to see what's the issues...

kubectl describe pod <POD_NAME> -n inference

Alternatively, view the live logs of the pod

kubectl logs -n inference <POD_NAME> -f

If you have edited the .yaml files, remember to apply them

  • If we're updating our cluster config, we need to delete the old cluster first, and then recreate it from the updated cluster config file...
  • This is unlike updating our deployment config, where we can just apply it & it'll override the old deployment
# If cluster spec nodegroup was changed
eksctl delete nodegroup --cluster my-vllm --name gpu-ng # Delete old nodegroup. Give it some time (~10 mins) to FULLY DELETE before recreating the nodegroup
eksctl create nodegroup --config-file eks-gpu-cluster.yaml # Recreate it. Only nodegroups that don't already exist are created. We have 2 nodegroups here "gpu-ng" and "general-ng", and since we only deleted gpu-ng, only gpu-ng will be recreated (the old general-ng will remain)

# If cluster itself was changed
eksctl upgrade cluster --config-file eks-gpu-cluster.yaml --approve

# If deployment was changed
kubectl apply -f vllm-deployment.yaml
kubectl -n inference delete pod <OLD_POD_NAME> # If no cluster autoscaler is configured, and we only have one running GPU node, then we need to delete the old pod first. This is because the old pod will continue using our sole GPU resource (no new GPU node is created because no CA was setup yet), and our new pod will be stuck in pending state because it doesn't have any GPU nodes to schedule on (the single node already has the GPU fully allocated to the old pod).

When we apply the new vllm-deployment.yaml, it updates the deployment spec. A new ReplicaSet (RS-new) with the updated pod spec is created, which starts creating new pods from RS-new. Once the new pods have status=Ready, the old ReplicaSet is scaled down, which deletes the old pods
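The rollout behavior above can be tuned via the deployment's update strategy; these are the Kubernetes defaults, shown for reference:

```yaml
# Rolling-update knobs on a Deployment (values shown are the Kubernetes defaults)
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # extra pods allowed above `replicas` during a rollout
      maxUnavailable: 25%  # pods allowed to be down during a rollout
```

With a single GPU node, the surge pod is exactly what can get stuck Pending (the old pod still holds the GPU); setting maxSurge: 0 and maxUnavailable: 1 makes Kubernetes kill the old pod first, freeing the GPU for the new one.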

7. Get the Service endpoint

Once the pod is running, get the service with

kubectl get svc -n inference vllm-gpu-svc

Copy the EXTERNAL-IP and test the OpenAI-compatible endpoint

curl aaa5e090e994d4f3db9c49a1d3ba8b49-616128841.ap-southeast-1.elb.amazonaws.com/v1/models

curl "http://aaa5e090e994d4f3db9c49a1d3ba8b49-616128841.ap-southeast-1.elb.amazonaws.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello from EKS vLLM!"}]
      }'

8. Create Cluster Autoscaler

Autoscaling - nodes follow pods... nodes only scale up when there are pending pods... so if we want to scale up resources, we start by scaling the pods ...

  • In our cluster spec, we specified the desiredCapacity, min & max. We merely say that AWS can scale this group to between 0-2 nodes.
  • However, nothing is currently telling it when to change desiredCapacity. We'll need a cluster autoscaler for that. Without a Cluster Autoscaler, even if there are pending pods, no new nodes will be spun up
  • With a Cluster Autoscaler, it will detect pending pods, determine which node groups can satisfy them, and call AWS APIs to raise the desiredCapacity of that ASG

We'll create an IAM policy for our cluster autoscaler. Name this policy cluster-autoscaler-policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeLaunchTemplateVersions"
      ],
      "Resource": "*"
    }
  ]
}

Give the Kubernetes cluster-autoscaler pod permission to call AWS APIs, so it can request AWS to add/remove nodes, etc... We attach the IAM policy we just created to this account

eksctl create iamserviceaccount \
  --cluster=my-vllm \
  --region=ap-southeast-1 \
  --namespace=kube-system \
  --name=cluster-autoscaler \
  --attach-policy-arn=arn:aws:iam::<ACCOUNT_ID>:policy/cluster-autoscaler-policy \
  --override-existing-serviceaccounts \
  --approve

Apply the cluster autoscaler

kubectl apply -f cluster-autoscaler.yaml

The scaling up/down logic is built into the Kubernetes Cluster Autoscaler binary (./cluster-autoscaler) inside the image (registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.7) we're using in the CA deployment

  • It discovers ASGs (each nodegroup has its own ASG) with both the tags "k8s.io/cluster-autoscaler/enabled", "k8s.io/cluster-autoscaler/my-vllm"
  • When the Kubernetes scheduler reports that a pod is pending & unschedulable, the autoscaler
    1. Simulates adding a node from each discovered ASG.
    2. Checks whether the pending pods would fit on such a node.
    3. If yes, it triggers a scale-up on that nodegroup's ASG

The autoscaler also continuously checks for underutilized nodes in the discovered ASGs.

  • If a node is mostly empty, and all pods can be moved elsewhere safely, then the autoscaler will drain the node and AWS will scale the ASG down.
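If a workload shouldn't be evicted during this scale-down consolidation, its pods can be annotated to opt out (our cluster-autoscaler.yaml sets the same annotation on the CA pod itself):

```yaml
# Pod-template annotation that tells the CA not to evict this pod during scale-down
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```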

9. Wrapping up - Scaling down

Delete the cluster. This will delete all the CloudFormation stacks that were spun up

eksctl delete cluster --name my-vllm --region ap-southeast-1

Go to CloudFormation to ensure the stacks were deleted! Check that the cluster stack and the nodegroup stacks are deleted. You may have to manually delete them...

  • If a stack is stuck in the DELETE_IN_PROGRESS state... manually delete the network resources associated with that stack. This includes security groups, subnets, VPCs
  • To see all resources associated with a stack: Select stack > Resources > See the full list of resources...

Go to EKS to ensure the cluster was deleted! You may have to manually delete it...

Additional Notes

We've created our EKS cluster. If I want to run my Docker container, I can do it in 2 ways

  1. Run it as a deployment: Create it as a deployment; it'll be managed by Kubernetes. If the pod dies, it gets recreated. Good for pods we want "always on"
  2. Use kubectl run to run it... If the pod dies, it won't be auto-recreated. Must use kubectl delete pod... when we're done, if not the pod just sits there. Good for quick experiments

Quick cmds (remember to set the namespace to "inference")

  • kubectl get deployment -n inference: Show all deployments in the "inference" namespace
  • kubectl delete pod vllm-debug -n inference: Delete the "vllm-debug" pod in the "inference" namespace. If this pod was created by a deployment, it will be auto re-created. If it was manually created with "kubectl run...", then it won't be auto recreated

YAML configs

vllm-deployment.yaml

apiVersion: apps/v1
kind: Deployment # This is a deployment, which creates pods
metadata:
  name: vllm-gpu
  namespace: inference # It lives in the "inference" namespace which we created earlier
spec:
  replicas: 1 # "Run 1 copy of this application". Create one pod & make sure exactly one stays running - if the pod dies/is deleted, it is recreated.
  selector:
    matchLabels:
      app: vllm-gpu # This deployment manages pods with label "app=vllm-gpu"
  template:
    metadata:
      labels:
        app: vllm-gpu # Pods created by this deployment will have the label "app=vllm-gpu"
    spec:
      # This part controls where Kubernetes is allowed to place (schedule) the pod. It ensures the pods will only run on GPU nodes, not on regular CPU nodes
      tolerations: # This pod can run on nodes with this taint
        - key: "gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      nodeSelector:
        workload: gpu # Only schedule this pod onto nodes that have label "workload=gpu"
      containers: # Define the container that actually runs inside the pod
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Meta-Llama-3-8B-Instruct" # The model we want to run
            - "--dtype"
            - "half"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8000"
            - "--tensor-parallel-size"
            - "4" # We're using 4 gpus
            - "--gpu-memory-utilization"
            - "0.7"
          ports:
            - containerPort: 8000
          env:
            # HF token for private models
            - name: HF_TOKEN
              valueFrom:
                # This reads our hf-token secret, using the key "token"
                secretKeyRef:
                  name: hf-token
                  key: token # We don't store the actual secret here! We create a kubectl secret... and we simply reference it here
          resources:
            # Requests - what the pod needs to be scheduled
            # "To run this pod, you must place it on a node that has at least these specs..."
            # If no node has these available, the pod stays in Pending
            # GPUs are not sharable so MUST be request==limit for GPU
            # CPU & memory is sharable, so can do limit > request for CPU/Mem
            requests:
              nvidia.com/gpu: "4" # 4 GPU. Note that GPUs are not sharable, so request & limit for GPU must be the SAME!
              cpu: "3.5" # 3.5 CPU cores
              memory: "14Gi"
            # Limits - the maximum the pod can use at runtime (not the maximum the node can have)
            # "Don't allow the pod to use more than 4 GPUs, 8 CPU cores, 192GB RAM"
            limits:
              nvidia.com/gpu: "4" # If the pod tries to use 8 GPUs, it only sees 4 GPU
              cpu: "8" # If the pod tries to use 100% of CPU, it's limited to 8 cores
              memory: "192Gi" # If the pod uses >192GB RAM, it is killed

---
apiVersion: v1
kind: Service # Create a Service (specifically a LoadBalancer Service) that creates a stable network endpoint exposing our vLLM pod to the outside world
metadata:
  name: vllm-gpu-svc
  namespace: inference
spec:
  type: LoadBalancer
  selector:
    app: vllm-gpu # Define the pods that the loadbalancer should send traffic to. Send traffic to all pods with the label "app=vllm-gpu".
  ports:
    - port: 80 # Clients connect to the service on port 80
      targetPort: 8000 # Traffic is forwarded inside the pod to port 8000 (our vLLM server is listening on 8000)

eks-gpu-cluster.yaml

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig # This YAML describes a config for creating an EKS cluster

metadata:
  name: my-vllm
  region: ap-southeast-1
  version: "1.30" # EKS control plane will use Kubernetes 1.30

# A node group is a group of nodes managed together by eksctl (EKS in our case)
# Each node group has the same instance type, the same OS image (AMI), scaling settings, & shares labels & taints
# We can autoscale the nodegroup (scale up when pods need GPUs, scale down when GPU pods disappear)
managedNodeGroups:
  - name: gpu-ng # Create a managed node group called "gpu-ng"
    instanceType: g4dn.12xlarge # Ensure this instance type is available in your specified region
    desiredCapacity: 0 # Start with 0 GPU nodes (we scale up manually later)
    # Here, we merely say that AWS can scale this group to between 0-2 nodes
    # However, nothing is currently telling it **when** to change desiredCapacity. We'll need a cluster autoscaler for that
    minSize: 0 # Allow it to scale down to 0
    maxSize: 2
    amiFamily: AmazonLinux2023 # Use the EKS-optimized AMI
    # capacityType: ON_DEMAND # "ON_DEMAND" uses regular on-demand EC2 instances. We can also set it to "SPOT" so it uses EC2 spot instances for this node group
    labels: # Add label to the node group - used when we schedule apps via `nodeSelector`
      # Our pods can target this node group using the "nodeSelector -> workload: GPU" config in their PodSpec
      workload: gpu
    taints: # A taint is a property of a node/nodegroup that repels all pods unless they have explicit permissions. Only pods that tolerate this taint can run on this GPU node group
      # Prevent non-GPU workloads from using GPU nodes. No pod will be able to schedule onto this node group unless that pod has a matching toleration in its PodSpec
      - key: "gpu"
        value: "true"
        effect: "NoSchedule" # Controls what kubernetes does when a pod doesn't tolerate the taint. A taint with `NoSchedule` tells Kubernetes: "Don't put pods here if they don't tolerate this taint"
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/my-vllm: "owned"
  - name: general-ng
    instanceType: t3.medium
    desiredCapacity: 1 # keep at least 1 general node
    minSize: 1
    maxSize: 3
    amiFamily: AmazonLinux2023
    labels:
      workload: general
    # For autoscaler...
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/my-vllm: "owned"

cluster-autoscaler.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      serviceAccountName: cluster-autoscaler
      priorityClassName: system-cluster-critical
      containers:
        - name: cluster-autoscaler
          # Match your k8s minor version (1.30)
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.7
          resources:
            limits:
              cpu: 100m
              memory: 600Mi
            requests:
              cpu: 100m
              memory: 600Mi
          command: # Discover ASGs with the tags "k8s.io/cluster-autoscaler/enabled", "k8s.io/cluster-autoscaler/my-vllm"
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --cluster-name=my-vllm
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-vllm
            - --balance-similar-node-groups
            - --skip-nodes-with-local-storage=false
            - --skip-nodes-with-system-pods=false
          env:
            - name: AWS_REGION
              value: ap-southeast-1
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
      volumes:
        - name: ssl-certs
          hostPath:
            path: /etc/ssl/certs/ca-bundle.crt