Deploying vLLM with Qwen or Llama on Minikube
In my previous post GPU-Accelerated Kubernetes: Setting up NVIDIA on Minikube, we successfully exposed the GPU to our Minikube pods. Congrats, we survived the hardest part of the WSL2/Docker Desktop stack! 🛠️
Now that your RTX 4060 (8 GB VRAM) is visible, and your NiFi + Kafka + Flink stack is humming, let’s jump into the fun part: Real-time Streaming AI Inference. The 4060 is a sweet spot for local development; it can comfortably run 3B to 7B models with lightning-fast speeds, or even 13B models with some clever quantization. Our goal is to turn Minikube into a private AI endpoint that NiFi can call via the InvokeHTTP processor to summarize streams, classify events, or extract entities in real-time. 🚀
🤖 Deploying the vLLM Inference Server
vLLM is a high-throughput, memory-efficient inference engine. It is a great fit for 40-series cards because it supports advanced quantization and PagedAttention.
We have two great options to get you started:
Option A: Llama 3.2 (The Heavy Hitter)
Llama 3.2-3B is incredibly smart for its size, but it is a “gated” model.
Requirement: You need a Hugging Face (HF) account and an API token to download the weights. Before deploying, create a Kubernetes secret named hf-token containing your token. Last but not least, you must complete the access request form on the model page and wait for approval.
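The hf-token secret the deployment references can be created with a one-liner; the token value below is a placeholder, substitute your real HF token:

```shell
# Create the hf-token secret referenced by the deployment's secretKeyRef.
# Replace the placeholder with your actual Hugging Face token.
kubectl create secret generic hf-token \
  --from-literal=HF_TOKEN="hf_your_token_here"
```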
Create vllm-llama.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:latest
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: HF_TOKEN
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        args:
        - "meta-llama/Llama-3.2-3B-Instruct"
        - "--quantization"
        - "bitsandbytes"
        - "--load-format"
        - "bitsandbytes"
        - "--gpu-memory-utilization"
        - "0.80"
        - "--max-model-len"
        - "4096"
        - "--enforce-eager"
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-server
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
Option B: Qwen 2.5 (The “Easy” Path)
Qwen 2.5-3B-Instruct is a world-class, free, non-gated model that performs exceptionally well. It is roughly comparable to Llama 3.2 in quality, but you can download it without waiting for approval.
ProTip: The manifest below still references the hf-token secret, but since the model is non-gated you won't have to wait for approval.
Create vllm-qwen.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:latest
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: HF_TOKEN
        resources:
          limits:
            nvidia.com/gpu: "1"
        args:
        - "Qwen/Qwen2.5-3B-Instruct"
        - "--quantization"
        - "bitsandbytes"
        - "--load-format"
        - "bitsandbytes"
        - "--gpu-memory-utilization"
        - "0.80" # Fits comfortably in 6.92 GiB
        - "--max-model-len"
        - "2048" # 2k is a solid sweet spot for 3B models
        - "--enforce-eager"
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: default
spec:
  selector:
    app: vllm-server
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
🛠️ How to Apply and Test
1. Fire it up
Apply your chosen YAML file and watch the logs. The first run will download several gigabytes of model weights, so grab a coffee! ☕
kubectl apply -f vllm-qwen.yaml
kubectl logs -f deployment/vllm-server -c vllm-server
2. Connect via Port-Forward
Since we are in Minikube, we need to bridge the gap to our local machine:
kubectl port-forward svc/vllm-service 8000:8000
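With the port-forward running, you can sanity-check that the server is up before sending a real prompt; vLLM's OpenAI-compatible API exposes a /v1/models endpoint that lists the loaded model:

```shell
# Quick health check: lists the model(s) the server is currently serving.
# Requires the port-forward above to be running in another terminal.
curl http://localhost:8000/v1/models
```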
3. Test with curl
vLLM uses an OpenAI-compatible API, making it very easy to test from your terminal.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-3B-Instruct",
"messages": [{"role": "user", "content": "Tell me a short joke about a data engineer."}],
"temperature": 0.7
}'
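The response follows the standard OpenAI chat-completions schema, which is what NiFi (or any client) will need to parse. Here is a minimal Python sketch of extracting the reply; the payload below is a canned example, not real server output, so the joke and token counts are illustrative only:

```python
import json

# A trimmed example of the OpenAI-style response shape vLLM returns;
# actual values (content, token counts, etc.) will differ per request.
sample_response = json.loads("""
{
  "choices": [
    {"index": 0,
     "message": {"role": "assistant",
                 "content": "Why did the data engineer quit? Too many pipelines burst."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 18, "completion_tokens": 14, "total_tokens": 32}
}
""")

# The assistant's reply lives under choices[0].message.content.
reply = sample_response["choices"][0]["message"]["content"]
print(reply)
```

The same JSONPath ($.choices[0].message.content) is what you would use in a NiFi EvaluateJsonPath processor downstream of InvokeHTTP.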
📮 Setting up Postman
Postman works perfectly since vLLM mimics the OpenAI API. Use it to test your prompts before automating them in NiFi.
- New Collection: Create one named vLLM Local Server.
- Environment Variables: Add base_url as http://localhost:8000/v1.
- The Request:
  - Method: POST
  - URL: {{base_url}}/chat/completions
  - Body: Select raw -> JSON and paste this:

{
  "model": "Qwen/Qwen2.5-3B-Instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful NiFi assistant."},
    {"role": "user", "content": "How do I use InvokeHTTP with vLLM?"}
  ],
  "stream": false
}

ProTip: Set "stream": true in the JSON body if you want to see Postman render the response chunk-by-chunk in real-time!
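When "stream": true, the server sends Server-Sent Events: each line is data: {json-chunk} carrying a small delta, and the stream ends with data: [DONE]. A minimal Python sketch of stitching the deltas back together, using canned chunk lines in place of a live HTTP connection:

```python
import json

# Canned SSE lines mimicking a streamed chat-completions response;
# a live client would read these lines from the HTTP response body.
sse_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world!"}}]}',
    'data: [DONE]',
]

pieces = []
for line in sse_lines:
    payload = line.removeprefix("data: ").strip()
    if payload == "[DONE]":          # end-of-stream sentinel
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    if "content" in delta:           # the first chunk may carry only the role
        pieces.append(delta["content"])

full_reply = "".join(pieces)
print(full_reply)  # Hello, world!
```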
Questions?
If you have any questions about the integration between these components or you would like a deeper dive, hands-on experience, or demos, please reach out to schedule a discussion. 🤝