vLLM Server with AWS EKS
Wanted to play around with a specific LLM model, and decided I’ll try hosting it via EKS so I could learn something new…
Key Concepts
EKS is a fully managed Kubernetes control plane
- A node is a physical/virtual machine that runs pods
- A pod is the smallest deployable unit. It usually has one container (it can run multiple containers that share the same network namespace + volumes), and always runs fully on one node (it never spans nodes)
- A node group is a concept for managed Kubernetes (i.e. implementation details of cloud providers). It is a group of identical nodes managed together. Each node group has the same instance type, OS image (AMI), scaling settings, & shares labels & taints
- A deployment ensures the desired number of pods are running at all times
- The Cluster Autoscaler (CA) is a component we deploy into the cluster ourselves (it is not enabled by default on EKS); it controls the launch/tear-down of nodes by resizing the nodegroups' Auto Scaling Groups
Autoscaling in Kubernetes
There are 2 types of autoscaling in Kubernetes
- Scaling pods: Adjust the no. of pods based on metrics (e.g. More requests -> Spin up more pods to handle the load)
- Scaling nodes: Adjust the no. of nodes, based on pending pods that can't be scheduled. Request more nodes when pods are pending
Nodes only react to pods... So if we want to scale resources, we do it through the pods! We don't directly say, "Give me 5 nodes so I can run 5 pods," we say "Give me 5 pods" and the autoscaler ensures the right number of nodes
- Pods drive scaling; nodes follow
- If we want more resources: We make the autoscaler scale pods (which go to "Pending" if no nodes exist to run them), and let the Cluster Autoscaler scale nodes to match
- For our use case, we enforce the "1 pod == 1 machine == 1 GPU" requirement for our GPU workload. Each pod must have 1 full GPU, and Kubernetes schedules the pod only where a full GPU exists. We don't run multiple vLLM servers on the same GPU - we don't gain any speedup that way...
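That pinning is expressed in the pod spec's resources stanza; a minimal sketch (values illustrative):

```yaml
# Sketch: request and limit for nvidia.com/gpu are both 1, so the scheduler
# only places this pod on a node with one whole free GPU.
resources:
  requests:
    nvidia.com/gpu: "1"
  limits:
    nvidia.com/gpu: "1"   # GPU request & limit must match - GPUs aren't shared
```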
Autoscaling doesn't happen by default. Later, we'll install Cluster Autoscaler to make autoscaling actually happen.
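The pod-scaling side is typically handled by a HorizontalPodAutoscaler (HPA). A hedged sketch, assuming the metrics-server is installed - the target name, replica bounds, and threshold here are illustrative, not something we deploy in this guide:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-gpu-hpa          # illustrative name
  namespace: inference
spec:
  scaleTargetRef:             # which deployment's replica count to adjust
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-gpu
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when avg CPU exceeds ~70%
```

Pods created beyond current node capacity go Pending, which is exactly the signal the Cluster Autoscaler reacts to.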
Autoscaling in action:
- A pod is "Pending" when it can't be scheduled because (a) no existing node satisfies its nodeSelector, (b) it needs more CPU/memory/GPU than is available, or (c) taints prevent scheduling
- The Cluster Autoscaler (CA) continuously watches for pods that are unschedulable for at least N (~10) seconds
- When the CA sees pending pods, it examines the pod's nodeSelectors, affinity rules, taints & tolerations, resource requests (CPU, memory, GPU), and pod priority, to find which nodegroups are capable of hosting the pod
- Then, the CA requests EKS to scale up the nodegroup. The CA calls the AWS EKS API, incrementing the nodegroup's desiredCapacity (the no. of worker nodes we want our nodegroup to have now) by 1
- Now, the nodegroup's Auto Scaling Group (ASG) sees the nodegroup's desiredCapacity has increased. It reacts by booting up a new node (an EC2 instance with the AMI & instance type specified in the nodegroup's config), which runs the bootstrap script to join the cluster. EKS applies labels/taints from the nodegroup config onto the new node
- Once kubelet (the node agent running on the EC2 instance) reports Ready (sends a heartbeat to the control plane), Kubernetes schedules the pending pod to run on that new node
The Cluster Autoscaler relies on the nodegroup's EC2 Auto Scaling Group to scale up the instances in the nodegroup.
Elaborating on the matching step: no nodegroup is capable of hosting the pod when
- The pod has a nodeSelector that no nodegroup provides
- The pod requires more CPU/RAM/GPU than any nodegroup can provide
- The pod's tolerations don't match any nodegroup's taints
- The nodegroups with matching labels/taints are at maxSize
In this case, the CA does nothing. ASG doesn't create any nodes & the pod stays Pending forever
Pod Pending
↓
Cluster Autoscaler notices
↓
Cluster Autoscaler → EKS API: scale nodegroup
↓
EKS → ASG: desired capacity +1
↓
ASG launches EC2 instance
↓
Node joins cluster
↓
Kubernetes schedules pod
EKS supports 3 options
- Managed Node Groups (EC2)
- Self-Managed EC2 nodes
- Fargate (Serverless)
However, Fargate doesn't support GPUs - which we need for our vLLM inference. So, we'll use option (1)
Guide
1. Set region
export AWS_REGION=ap-southeast-1
aws configure set region $AWS_REGION
2. Create the EKS Control Plane from the eks-gpu-cluster.yaml config file
Ensure you've given the appropriate AWS permissions for the IAM user that's running these AWS commands.
Add the policies AmazonEC2FullAccess and AWSCloudFormationFullAccess.
Add the below policy for IAM actions
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowEksIAMActions",
"Effect": "Allow",
"Action": [
"iam:CreateRole",
"iam:GetRole",
"iam:PassRole",
"iam:AttachRolePolicy",
"iam:CreateInstanceProfile",
"iam:AddRoleToInstanceProfile",
"iam:CreateServiceLinkedRole",
"tag:GetResources",
"tag:TagResources",
"tag:UntagResources",
"iam:TagRole",
"iam:UntagRole",
"iam:TagUser",
"iam:UntagUser",
"iam:TagPolicy",
"iam:UntagPolicy"
],
"Resource": "*"
}
]
}
Add the below policy for EKS actions (eks:* is overly broad - bad practice, but I was dealing with many permission issues...)
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowEksActions",
"Effect": "Allow",
"Action": [
"eks:*"
],
"Resource": "*"
}
]
}
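If you'd rather avoid the wildcard, here is a tighter sketch - the action list is illustrative of what eksctl exercises and may need extending for your setup:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowScopedEksActions",
      "Effect": "Allow",
      "Action": [
        "eks:CreateCluster",
        "eks:DescribeCluster",
        "eks:DeleteCluster",
        "eks:CreateNodegroup",
        "eks:DescribeNodegroup",
        "eks:DeleteNodegroup",
        "eks:ListClusters",
        "eks:ListNodegroups",
        "eks:TagResource"
      ],
      "Resource": "*"
    }
  ]
}
```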
eksctl creates our EKS cluster using CloudFormation. CloudFormation lets us deploy IaC using templates (JSON/YAML)
- A CloudFormation stack is a collection of AWS resources created, updated, or deleted together as a single unit based on the CloudFormation template
eksctl typically creates two stacks
- Cluster Stack: Contains the EKS control plane resources, like EKS cluster, VPC, subnet, RTs, IGWs, IAM roles
- NodeGroup Stack(s): EC2 instances, Node IAM roles, Auto Scaling Group, etc.
- Each nodegroup created by eksctl gets its own CloudFormation stack
Create the cluster with this command
eksctl create cluster -f eks-gpu-cluster.yaml
Verify that the nodegroup was created
- In the console: Go to EKS > Cluster > Compute
eksctl get nodegroup --cluster=my-vllm --region=ap-southeast-1
Once we've run this command, we've created the EKS cluster itself. We now have a Kubernetes API endpoint and a cluster certificate
- EKS also creates a dedicated VPC, public + private subnets, route tables, security groups for the control plane, and Nodegroup IAM roles (if we didn't explicitly define the VPC)
To see what nodes our EKS cluster is running:
# Update/create the `~/.kube/config` file so kubectl can talk to our AWS EKS Cluster
# Then, it sets this context as our current context. So when we run kubectl commands, we'll refer to this cluster
aws eks update-kubeconfig --name my-vllm --region $AWS_REGION
# Queries the cluster & list all Kubernetes worker nodes
kubectl get nodes
Note that our ~/.kube/config file can contain multiple clusters. We can switch between them using kubectl config use-context <ctx_name>
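For intuition, the kubeconfig file is itself just YAML with three lists - clusters, users, and contexts that pair them up. A trimmed, illustrative sketch (server URL and names are placeholders):

```yaml
apiVersion: v1
kind: Config
clusters:                  # endpoints + CA certs of clusters we know about
  - name: my-vllm.ap-southeast-1.eksctl.io
    cluster:
      server: https://EXAMPLE.gr7.ap-southeast-1.eks.amazonaws.com
      certificate-authority-data: <base64-ca-cert>
users:                     # how to authenticate to each cluster
  - name: my-vllm-user
    user:
      exec:                # EKS auth shells out to the aws CLI for a token
        apiVersion: client.authentication.k8s.io/v1beta1
        command: aws
        args: ["eks", "get-token", "--cluster-name", "my-vllm"]
contexts:                  # (cluster, user) pairs we can switch between
  - name: my-vllm
    context:
      cluster: my-vllm.ap-southeast-1.eksctl.io
      user: my-vllm-user
current-context: my-vllm   # what `kubectl config use-context` changes
```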
Once the EC2 node boots (when the Auto Scaling Group decides it should), it bootstraps using the bootstrap.sh script in the instance. It auto-installs the container runtime, kubelet, GPU drivers. It then joins the EKS cluster & gets the labels + taints we've configured in the nodegroup config
3. Configure NVIDIA device plugins for our EKS cluster
Kubernetes has a generic mechanism to support special hardware or resources (GPUs, NICs, etc.) via a framework called "device plugins"
We will apply the NVIDIA device plugin, which makes NVIDIA GPUs on a node visible to Kubernetes as schedulable resources
Once installed and running on GPU-enabled nodes, the plugin reports how many GPUs each node has (capacity), and Kubernetes can then schedule Pods that request GPUs
This file defines a DaemonSet, whose job is to detect Nvidia GPUs on the node & expose them to Kubernetes, so Kubernetes can see & schedule those GPUs.
- A DaemonSet is a Kubernetes object that creates one device-plugin pod per node. That pod detects the NVIDIA GPUs on that node, and advertises the number of GPUs to the kubelet using the Kubernetes Device Plugin API
- If we have 3 GPU nodes, then we will have 3 separate device-plugin pods, each one monitoring its own GPU node
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml
Then, edit the DaemonSet to include the taint that our nodegroup specified
kubectl edit ds nvidia-device-plugin-daemonset -n kube-system
Under spec.template.spec.tolerations, add
- effect: NoSchedule
key: gpu
operator: Exists
This tolerates any taint with key gpu, regardless of value, so the DaemonSet's pods can also be scheduled onto our tainted GPU nodes (i.e. the nodes we marked as gpu-capable)
Once done, verify that we can see the NVIDIA-related pods
kubectl get pods -n kube-system | grep nvidia
3. (Alternative) Use AWS-managed NVIDIA device plugin add-on
aws eks create-addon --cluster-name <your-cluster-name> --addon-name aws-nvidia-device-plugin
- Note: I didn’t test this option, but read that it’s a simpler alternative…
4. Grant AWS root user access to the kubernetes cluster with EKS access entries
To allow our root user to be part of the system:masters (full admin) group in our kubernetes cluster, navigate to the cluster in the AWS Console > Access. We will create an IAM access entry
- If the IAM principal "arn:aws:iam::<org_id>:root" doesn't have an entry under the IAM access entries, create one
- Provide it with the AmazonEKSClusterAdminPolicy access policy
Go to the aws-auth ConfigMap's mapUsers section and add the following under data (Replace the [account_id] portion)
mapUsers: |
- userarn: arn:aws:iam::[account_id]:root
groups:
- system:masters
5. Create Kubernetes namespace
A namespace is a logical grouping of Kubernetes resources inside a Kubernetes cluster. Namespaces help with organizing workloads, applying resource quotas/limits, using RBAC permissions, etc.
kubectl create namespace inference
When we use kubectl & want to affect our resources, we MUST reference this “inference” namespace
6. Create Kubernetes Deployment & Service
In our vllm-deployment.yaml file, we specify a deployment & service
A deployment is a Kubernetes object that manages Pods, ensuring the right no. of them are always running. It creates pods, keeps them running, replaces them if they crash, updates them safely, and scales them up or down
- TLDR: We use a deployment to create pods.
A service exposes our pods to network access (either inside/outside the cluster). A service gives us a stable network endpoint to reach our pods
We create a kubernetes secret to store our HF_TOKEN. Our deployment will reference this secret when creating a pod
Our secret has name hf-token, with the key token. We create it in the inference namespace, so nothing outside that namespace can see/use it by default. Only workloads (pods, deployments, services, etc.) that are running in the "inference" namespace and explicitly referencing the secret can access this secret
kubectl create secret generic hf-token --from-literal=token=<HF_TOKEN_HERE> -n inference
Create the deployment & service
- In our deployment, we specify how many pods to create, which nodes it can be scheduled on, what container runs in the pod (the image, command, args), and the min/max resources
- In our service, we specify the loadbalancer configs - which pods it should send traffic to, which port the client should connect to, and which port to send traffic to on the pods
kubectl apply -f vllm-deployment.yaml
Get all the pods that are spun up
kubectl get pods -n inference -w
In our case, since we configured our desiredCapacity=0 in the nodegroup, there are currently 0 worker nodes. Hence, our pod will remain in the pending state.
So, we'll manually scale up the GPU node group. Later, we'll create Autoscaler that spins up nodes based on ALB request metrics
(Note: It'll take some time for the nodes to spin up.. ~3-5 mins)
# we set "--nodes 1" to set desired capacity to 1, so we spin up one GPU instance
eksctl scale nodegroup --cluster my-vllm --name gpu-ng --nodes 1
kubectl get nodes # Check that node status=READY
# To set desired capacity to 0 (i.e. tear down ALL instances)
eksctl scale nodegroup --cluster my-vllm --name gpu-ng --nodes 0
If we want to stop all running pods & prevent new ones from being created, scale all deployments to 0. We basically tell Kubernetes: "This deployment should run 0 pods"
(Note: It'll take some time for the pods to go from Pending to Ready... ~5-7 mins. It needs to pull the container image, download the model, etc..)
# Scale all deployments in the "inference" namespace to 0
kubectl scale deployment --all --replicas=0 -n inference
kubectl get pods -n inference # Check that pod status=READY
kubectl logs -n inference <POD_NAME> -f # Or monitor that pod live
# Scale back to 1...
kubectl scale deployment --all --replicas=1 -n inference
Potential issues that may arise from spinning up instances
- vCPU limit in EC2: Go to service quota to raise the quota (for "g" family instances in this case... as we're using "g4dn.xlarge"). Note that there's a separate service quota for On-Demand & Spot instances. Since we're using On-Demand EC2... make sure you set the quota for On-Demand!
- The quota name should be "Running On-Demand G and VT instances". We can set the quota to 64 (defined in total vCPU), which means we can run 64 vCPU worth of G/VT instance capacity.
- A g4dn.xlarge has a vCPU capacity of 4, so with our quota=64, we can run 16 such instances at once...
- Quotas are region-specific, so apply it to the region you will use...
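The quota arithmetic as a quick sanity check - the quota is denominated in vCPUs, not instances:

```shell
QUOTA_VCPUS=64                # "Running On-Demand G and VT instances" quota value
VCPUS_PER_INSTANCE=4          # a g4dn.xlarge has 4 vCPUs
MAX_INSTANCES=$(( QUOTA_VCPUS / VCPUS_PER_INSTANCE ))
echo "max concurrent g4dn.xlarge: $MAX_INSTANCES"   # prints 16
```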
- nvidia.com/gpu isn't showing in the node: Ensure that the NVIDIA device plugin DaemonSet is running, and NVIDIA drivers are installed on the node
- Insufficient GPU memory error: Check that (1) the instanceType specified in the nodegroup (in the cluster spec) has sufficient GPU memory, (2) in the deployment spec, the requests & limit for nvidia.com/gpu are set to the desired no. of GPUs (request & limit MUST BE THE SAME value for GPU, since GPU isn't a sharable resource), (3) in the deployment spec, you're running the container with the "--tensor-parallel-size" flag set to the no. of GPUs you are using, (4) there is sufficient CPU memory allocated to the deployment (~32GB RAM minimum)!
For debugging, we can get a shell into our pod's container using
kubectl get pods -n inference # Get the pod names
kubectl -n inference exec -it <POD_NAME> -- bash
If there are deployment issues, inspect the pod to see what's the issues...
kubectl describe pod <POD_NAME> -n inference
Alternatively, view the live logs of the pod
kubectl logs -n inference <POD_NAME> -f
If you have edited the .yaml files, remember to apply them
- If we're updating a nodegroup in our cluster config, we need to delete the old nodegroup first, and then recreate it from the updated cluster config file...
- This is unlike updating our deployment config, where we can just apply it & it'll override the old deployment
# If cluster spec nodegroup was changed
eksctl delete nodegroup --cluster my-vllm --name gpu-ng # Delete old nodegroup. Give it some time (~10 mins) to FULLY DELETE before recreating the nodegroup
eksctl create nodegroup --config-file eks-gpu-cluster.yaml # Recreate it. Only nodegroups that don't already exist are created. We have 2 nodegroups here "gpu-ng" and "general-ng", and since we only deleted gpu-ng, only gpu-ng will be recreated (the old general-ng will remain)
# If cluster itself was changed
eksctl upgrade cluster --config-file eks-gpu-cluster.yaml --approve
# If deployment was changed
kubectl apply -f vllm-deployment.yaml
kubectl -n inference delete pod <OLD_POD_NAME> # If no cluster autoscaler is configured, and we only have one running GPU node, then we need to delete the old pod first. This is because the old pod will continue using our sole GPU resource (no new GPU node is created because no CA was setup yet), and our new pod will be stuck in pending state because it doesn't have any GPU nodes to schedule on (the single node already has the GPU fully allocated to the old pod).
When we apply the new vllm-deployment.yaml, it updates the deployment spec. A new ReplicaSet (RS-new) with the updated pod spec is created, which starts creating new pods. Once the new pods have status=Ready, the old ReplicaSet is scaled down, which deletes the old pods
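One way to avoid the single-GPU deadlock above without manually deleting pods is to switch the deployment's update strategy from the default RollingUpdate to Recreate, which deletes old pods before creating new ones (at the cost of downtime during the rollout). An illustrative fragment for the deployment spec:

```yaml
spec:
  strategy:
    type: Recreate   # kill old pods first, freeing the GPU for the new pods
```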
7. Get the Service endpoint
Once the pod is running, get the service with
kubectl get svc -n inference vllm-gpu-svc
Copy the EXTERNAL-IP and test the OpenAI-compatible endpoint
curl aaa5e090e994d4f3db9c49a1d3ba8b49-616128841.ap-southeast-1.elb.amazonaws.com/v1/models
curl "http://aaa5e090e994d4f3db9c49a1d3ba8b49-616128841.ap-southeast-1.elb.amazonaws.com/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": [{"role": "user", "content": "Hello from EKS vLLM!"}]
}'
8. Create Cluster Autoscaler
Autoscaling - nodes follow pods... nodes only scale up when there are pending pods... so if we want to scale up resources, we start by scaling the pods ...
- In our cluster spec, we specified the desiredCapacity, min & max. We merely say that AWS can scale this group to between 0-2 nodes.
- However, nothing is currently telling it when to change desiredCapacity. We'll need a cluster autoscaler for that. Without a Cluster Autoscaler, even if there are pending pods, no new nodes will be spun up
- With a Cluster Autoscaler, it will detect pending pods, determine which node groups can satisfy them, and call AWS APIs to raise the desiredCapacity of that ASG
We'll create an IAM policy for our cluster autoscaler. Name this policy cluster-autoscaler-policy
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeTags",
"autoscaling:SetDesiredCapacity",
"autoscaling:TerminateInstanceInAutoScalingGroup",
"ec2:DescribeLaunchTemplateVersions"
],
"Resource": "*"
}
]
}
Give the Kubernetes cluster-autoscaler pod permission to call AWS APIs, so it can request AWS to add/remove nodes, etc... We attach the IAM policy we just created to this account
eksctl create iamserviceaccount \
--cluster=my-vllm \
--region=ap-southeast-1 \
--namespace=kube-system \
--name=cluster-autoscaler \
--attach-policy-arn=arn:aws:iam::<ACCOUNT_ID>:policy/cluster-autoscaler-policy \
--override-existing-serviceaccounts \
--approve
Apply the cluster autoscaler
kubectl apply -f cluster-autoscaler.yaml
The scaling up/down logic is built into the Kubernetes Cluster Autoscaler binary (./cluster-autoscaler) inside the image (registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.7) we're using in the CA deployment
- It discovers ASGs (each nodegroup has its own ASG) with both the tags "k8s.io/cluster-autoscaler/enabled", "k8s.io/cluster-autoscaler/my-vllm"
- When the Kubernetes scheduler reports that a pod is pending & unschedulable, the autoscaler
- Simulates adding a node from each discovered ASG.
- Checks whether the pending pods would fit on such a node.
- If yes, it triggers a scale-up on that nodegroup's ASG
The autoscaler also continuously checks for underutilized nodes in the discovered ASGs.
- If a node is mostly empty, and all pods can be moved elsewhere safely, then the autoscaler will drain the node and AWS will scale the ASG down.
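The scale-down side is tunable via flags on the same binary. An illustrative sketch of args that could be appended under the CA container's command (flag names are from the upstream cluster-autoscaler; the values here are assumptions, not what this guide deploys):

```yaml
# Extra args for the cluster-autoscaler container (append under `command:`)
- --scale-down-enabled=true                # allow removing underutilized nodes
- --scale-down-utilization-threshold=0.5   # "underutilized" = <50% of resources requested
- --scale-down-unneeded-time=10m           # node must be unneeded this long before removal
```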
9. Wrapping up - Scaling down
Delete the cluster. This will delete all the CloudFormation stacks that were spun up
eksctl delete cluster --name my-vllm --region ap-southeast-1
Go to CloudFormation to ensure the stacks were deleted! Check that the cluster stack and the nodegroup stacks are deleted. You may have to manually delete them...
- If a stack is stuck in DELETE_IN_PROGRESS... manually delete all the network resources associated with that stack. This includes security groups, subnets, VPCs
- To see all resources associated with a stack: Select stack > Resources > See the full list of resources...
Go to EKS to ensure the cluster was deleted! You may have to manually delete it...
Additional Notes
We create our EKS cluster. If I want to run my docker container, I can do it in 2 ways
- Run it as a deployment: Create it as a deployment; it'll be managed by Kubernetes. If the pod dies, it gets recreated. Good for pods we want "always on"
- Use kubectl run to run it... If the pod dies, it won't be auto-recreated. Must use kubectl delete pod ... when we're done, otherwise the pod just sits there. Good for quick experiments
Quick cmds (remember to set the namespace to "inference")
- kubectl get deployment -n inference: Show all deployments in the "inference" namespace
- kubectl delete pod vllm-debug -n inference: Delete the "vllm-debug" pod in the "inference" namespace. If this pod was created by a deployment, it will be auto re-created. If it was manually created with "kubectl run ...", then it won't be auto-recreated
YAML configs
vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment # This is a deployment, which creates pods
metadata:
name: vllm-gpu
namespace: inference # It lives in the "inference" namespace which we created earlier
spec:
replicas: 1 # "Run 1 copy of this application". Create one pod & make sure exactly one stays running - if the pod dies/is deleted, it is recreated.
selector:
matchLabels:
app: vllm-gpu # This deployment manages pods with label "app=vllm-gpu"
template:
metadata:
labels:
app: vllm-gpu # Pods created by this deployment will have the label "app=vllm-gpu"
spec:
# This part controls where Kubernetes is allowed to place (schedule) the pod. It ensure the pods will only run on GPU nodes, not on regular CPU nodes
tolerations: # This pod can run on nodes with this taint
- key: "gpu"
operator: "Equal"
value: "true"
effect: "NoSchedule"
nodeSelector:
workload: gpu # Only schedule this pod onto nodes that have label "workload=gpu"
containers: # Define the container that actually runs inside the pod
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Meta-Llama-3-8B-Instruct" # The model we want to run
- "--dtype"
- "half"
- "--host"
- "0.0.0.0"
- "--port"
- "8000"
- "--tensor-parallel-size"
- "4" # We're using 4 gpus
- "--gpu-memory-utilization"
- "0.7"
ports:
- containerPort: 8000
env:
# HF token for private models
- name: HF_TOKEN
valueFrom:
# This reads our hf-token secret, using the key "token"
secretKeyRef:
name: hf-token
key: token # We don't store the actual secret here! We create a kubectl secret... and we simply reference it here
resources:
# Requests - what the pod needs to be scheduled
# "To run this pod, you must place it on a node that has at least these specs..."
# If no node has these available, the pod stays in Pending
# GPUs are not sharable so MUST be request==limit for GPU
# CPU & memory is sharable, so can do limit > request for CPU/Mem
requests:
nvidia.com/gpu: "4" # 4 GPU. Note that GPUs are not sharable, so request & limit for GPU must be the SAME!
cpu: "3.5" # 3.5 CPU cores
memory: "14Gi"
# Limits - the maximum the pod can use at runtime (not the maximum the node can have)
# "Don't allow the pod to use more than 4 GPUs, 8 CPU cores, 192GB RAM"
limits:
nvidia.com/gpu: "4" # If the pod tries to use 8 GPUs, it only sees 4 GPU
cpu: "8" # If the pod tries to use 100% of CPU, it's limited to 8 cores
memory: "192Gi" # If the pod uses >192GB RAM, it is killed
---
apiVersion: v1
kind: Service # Create a Service (specifically a LoadBalancer Service) that creates a stable network endpoint exposing our vLLM pod to the outside world
metadata:
name: vllm-gpu-svc
namespace: inference
spec:
type: LoadBalancer
selector:
app: vllm-gpu # Define the pods that the loadbalancer should send traffic to. Send traffic to all pods with the label "app=vllm-gpu".
ports:
- port: 80 # Clients connect to the service on port 80
targetPort: 8000 # Traffic is forwarded inside the pod to port 8000 (our vLLM server is listening on 8000)
eks-gpu-cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig # This YAML describes a config for creating an EKS cluster
metadata:
name: my-vllm
region: ap-southeast-1
version: "1.30" # EKS control plane will use Kubernetes 1.30
# A node group is a group of nodes managed together by eksctl (EKS in our case)
# Each node group has the same instance type, the same OS image (AMI), scaling settings, & shares labels & taints
# We can autoscale the nodegroup (scale up when pods need GPUs, scale down when GPU pods disappear)
managedNodeGroups:
- name: gpu-ng # Create a managed node group called "gpu-ng"
instanceType: g4dn.12xlarge # Ensure this instance type is available in your specified region
desiredCapacity: 1 # start with 1 GPU node
# Here, we merely say that AWS can scale this group to between 0-2 nodes
# However, nothing is currently telling it **when** to change desiredCapacity. We'll need a cluster autoscaler for that
minSize: 0 # Allow it to scale down to 0
maxSize: 2
amiFamily: AmazonLinux2023 # Use the EKS-optimized AMI
# capacityType: ON_DEMAND # "ON_DEMAND" uses regular on-demand EC2 instances. We can also set it to "SPOT" so it uses EC2 spot instances for this node group
labels: # Add label to the node group - used when we schedule apps via `nodeSelector`
# Our pods can target this node group using the "nodeSelector -> workload: gpu" config in their PodSpec
workload: gpu
taints: # A taint is a property of a node/nodegroup that repels all pods unless they have explicit permissions. Only pods that tolerate this taint can run on this GPU node group
# Prevent non-GPU workloads from using GPU nodes. No pod will be able to schedule onto this node group unless that pod has a matching toleration in its PodSpec
- key: "gpu"
value: "true"
effect: "NoSchedule" # Controls what kubernetes does when a pod doesn't tolerate the taint. A taint with `NoSchedule` tells Kubernetes: "Don't put pods here if they don't tolerate this taint"
tags:
k8s.io/cluster-autoscaler/enabled: "true"
k8s.io/cluster-autoscaler/my-vllm: "owned"
- name: general-ng
instanceType: t3.medium
desiredCapacity: 1 # keep at least 1 general node
minSize: 1
maxSize: 3
amiFamily: AmazonLinux2023
labels:
workload: general
# For autoscaler...
tags:
k8s.io/cluster-autoscaler/enabled: "true"
k8s.io/cluster-autoscaler/my-vllm: "owned"
cluster-autoscaler.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
app: cluster-autoscaler
spec:
replicas: 1
selector:
matchLabels:
app: cluster-autoscaler
template:
metadata:
labels:
app: cluster-autoscaler
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
serviceAccountName: cluster-autoscaler
priorityClassName: system-cluster-critical
containers:
- name: cluster-autoscaler
# Match your k8s minor version (1.30)
image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.7
resources:
limits:
cpu: 100m
memory: 600Mi
requests:
cpu: 100m
memory: 600Mi
command: # Discover ASGs with the tags "k8s.io/cluster-autoscaler/enabled", "k8s.io/cluster-autoscaler/my-vllm"
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --cluster-name=my-vllm
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-vllm
- --balance-similar-node-groups
- --skip-nodes-with-local-storage=false
- --skip-nodes-with-system-pods=false
env:
- name: AWS_REGION
value: ap-southeast-1
volumeMounts:
- name: ssl-certs
mountPath: /etc/ssl/certs/ca-certificates.crt
readOnly: true
volumes:
- name: ssl-certs
hostPath:
path: /etc/ssl/certs/ca-bundle.crt