95 minute read

This post gathers the backing lessons from Insanely Fast Audio Transcription with Cloudera Streaming Operators: a summary of the hurdles, a log of the terminal commands, and the raw terminal output.

Lessons from the Edge: Operationalizing Whisper on Local K8s

Building a production-grade inference container for the RTX 4060 required moving past “standard” tutorials and into deep infrastructure tuning. Here are the hard-won lessons from the StreamToWhisper build.

1. The CUDA Version Pinning Trap

Standard pip install torch often pulls a version bundled with an older CUDA toolkit (e.g., 12.1). For modern hardware like the G1 (RTX 4060), we had to explicitly target the +cu124 index. Failure to do this results in a “Runtime Device Mismatch” where the GPU is visible but unusable.
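On this build that meant pointing pip at the cu124 wheel index explicitly. The torch and torchvision pins below come from the build logs; the matching torchaudio==2.4.1 pin is assumed:

```dockerfile
# Pull wheels compiled against CUDA 12.4, not the default older toolkit
RUN pip install --no-cache-dir \
    torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 \
    --index-url https://download.pytorch.org/whl/cu124
```

A quick sanity check inside the container is python3 -c "import torch; print(torch.version.cuda)", which should report 12.4.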

2. The HF_TOKEN Performance Leak

Running “unauthenticated” requests to HuggingFace during a Docker build is a silent throttle. Without passing the HF_TOKEN as a build argument (--build-arg, sourced from a Minikube Secret), we hit rate limits and gated-model blocks. Authenticating the build-time download keeps the model bake running at full speed.
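The wiring is roughly this (a sketch; huggingface_hub picks the token up from the HF_TOKEN environment variable):

```dockerfile
# Accept the token at build time instead of committing it anywhere
ARG HF_TOKEN
ENV HF_TOKEN=${HF_TOKEN}
# The model bake below now runs authenticated and unthrottled
RUN python3 -c "from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')"
```

The token itself comes out of the cluster at build time, as in the terminal history below: export MY_TOKEN=$(kubectl get secret hf-token -o jsonpath='{.data.token}' | base64 --decode), then pass it with --build-arg HF_TOKEN=$MY_TOKEN. Note that ENV persists the value into image layers; BuildKit secrets avoid that, but this build took the simpler ARG route.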

3. The “No-Deps” Strategy (Manual Dependency Tax)

To prevent transformers or whisper from accidentally overwriting our optimized Torch version, we used the --no-deps flag. This created a secondary challenge: recursive dependency failures. We learned that FastAPI simply cannot boot without its “Web Plumbing” (starlette, pydantic, anyio). The fix was a hybrid install approach: letting the web stack resolve naturally while keeping the AI stack in a straitjacket.
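A sketch of the hybrid split (the package lists are illustrative, not the full set):

```dockerfile
# Web stack: let pip resolve starlette, pydantic, anyio on its own
RUN pip install --no-cache-dir fastapi uvicorn python-multipart
# AI stack: --no-deps stops these from dragging in their own torch build
RUN pip install --no-cache-dir --no-deps insanely-fast-whisper==0.0.15
```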

4. Flash Attention 2 vs. Standard Inference

For the Large-v3 model, standard attention is too slow for real-time streaming. Compiling flash-attn requires ninja-build and --no-build-isolation. This adds ~3 minutes to the build time but results in a 3x-5x throughput increase on the 4060, allowing us to use batch_size=24 comfortably.
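The compile step needs its build tooling in place first; the ModuleNotFoundError: No module named 'packaging' failure in the logs below is exactly what happens when it isn't:

```dockerfile
# ninja-build parallelizes the CUDA kernel compilation
RUN apt-get update && apt-get install -y ninja-build
# flash-attn's setup.py imports packaging (and torch) at metadata time,
# so with --no-build-isolation they must already be installed
RUN pip install --no-cache-dir packaging setuptools wheel
RUN pip install --no-cache-dir flash-attn --no-build-isolation
```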

5. Multi-Stage Build & Layer Caching

By separating the Builder stage (compilers and pip caches) from the Runtime stage (just the venv and model weights), we reduced the final image size and ensured that minor logic changes in main.py don’t trigger a full 20-minute CUDA recompilation.
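Skeleton of the split (the devel tag matches the build logs; the runtime tag and cache path are assumptions):

```dockerfile
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 AS builder
# ...compilers, venv creation, pip installs, model bake...

FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04
COPY --from=builder /opt/venv /opt/venv
# Baked model weights (default HuggingFace cache location)
COPY --from=builder /root/.cache/huggingface /root/.cache/huggingface
WORKDIR /app
COPY main.py /app/main.py
# A change to main.py now invalidates only this last COPY layer
```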

6. The Audio Codec Gap

Even with a perfect model, transcription fails if the container lacks the OS-level codecs to “read” the audio header. Including ffmpeg in the Runtime stage and ensuring the input audio is a clean 16kHz mono PCM file is the difference between a 200 OK and a 400 Bad Request.
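Both halves of the fix, sketched: ffmpeg baked into the Runtime stage, and inputs normalized before they are posted:

```dockerfile
# OS-level decoders so the container can actually read audio headers
RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/*
```

On the client side, any source file can be normalized with ffmpeg -i input.wav -ar 16000 -ac 1 -c:a pcm_s16le clean.wav (16 kHz sample rate, one channel, 16-bit PCM) before hitting /transcribe.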


Terminal History

mkdir whisper
cd whisper
nano Dockerfile.whisper
docker build -t streamwhisper:latest -f Dockerfile.whisper .
nano Dockerfile.whisper.2
docker build -t streamwhisper:latest -f Dockerfile.whisper.2 .
nano Dockerfile.whisper.3
docker build -t streamwhisper:latest -f Dockerfile.whisper.3 .
nano Dockerfile.whisper.4
docker build -t streamwhisper:latest -f Dockerfile.whisper.4 .
nano Dockerfile.whisper.5
docker build -t streamwhisper:latest -f Dockerfile.whisper.5 .
nano Dockerfile.whisper.6
docker build -t streamwhisper:latest -f Dockerfile.whisper.6 .
nano Dockerfile.whisper.7
docker build -t streamwhisper:latest -f Dockerfile.whisper.7 .
nano Dockerfile.whisper.8
docker build -t streamwhisper:latest -f Dockerfile.whisper.8 .
nano Dockerfile.whisper.9
docker build -t streamwhisper:latest -f Dockerfile.whisper.9 .
nano Dockerfile.whisper.10
docker build -t streamwhisper:latest -f Dockerfile.whisper.10 .
nano Dockerfile.whisper.11
docker build -t streamwhisper:latest -f Dockerfile.whisper.11 .
nano Dockerfile.whisper.12
docker build -t streamwhisper:latest -f Dockerfile.whisper.12 .
export MY_TOKEN=$(kubectl get secret hf-token -o jsonpath='{.data.token}' | base64 --decode)
docker build -t streamwhisper:latest --build-arg HF_TOKEN=$MY_TOKEN -f Dockerfile.whisper.12 .
minikube service whisper-server-service --url
curl -X POST http://whisper-server-service:8001/transcribe   -H "Content-Type: multipart/form-data"   -F "file=@test-speech.wav"
curl -X POST http://127.0.0.1:8001/transcribe   -H "Content-Type: multipart/form-data"   -F "file=@test-speech.wav"
wget -O test-speech.wav https://github.com/Azure-Samples/cognitive-services-speech-sdk/raw/master/samples/cpp/windows/console/samples/whatstheweatherlike.wav
curl -X POST http://127.0.0.1:8001/transcribe   -H "Content-Type: multipart/form-data"   -F "file=@test-speech.wav"
nano whisper-server.yaml
kubectl apply -f whisper-server.yaml
kubectl describe pod whisper-server-779f955f5c-j2vr7
kubectl delete -f whisper-server.yaml
kubectl apply -f whisper-server.yaml
kubectl delete -f vllm-qwen.yaml
kubectl delete -f whisper-server.yaml
kubectl apply -f whisper-server.yaml
kubectl describe pod whisper-server-6486568f56-xrcg2
kubectl rollout restart deployment whisper-server
kubectl describe pod whisper-server-6486568f56-xrcg2
kubectl logs whisper-server-6486568f56-xrcg2
kubectl delete -f whisper-server.yaml
kubectl apply -f whisper-server.yaml
kubectl logs whisper-server-6486568f56-gzvqs
kubectl delete -f whisper-server.yaml
kubectl apply -f whisper-server.yaml
kubectl delete -f whisper-server.yaml
kubectl apply -f whisper-server.yaml
kubectl delete -f whisper-server.yaml
kubectl apply -f whisper-server.yaml
kubectl logs whisper-server-6486568f56-bbz6z
kubectl delete -f whisper-server.yaml
export MY_TOKEN=$(kubectl get secret hf-token -o jsonpath='{.data.token}' | base64 --decode)
kubectl rollout restart deployment whisper-server
kubectl apply -f whisper-server.yaml
kubectl port-forward deployment/whisper-server 8001:8001
kubectl delete -f whisper-server.yaml
kubectl port-forward deployment/whisper-server 8001:8001
kubectl port-forward deployment/whisper-server 8001:8001 -n cfm-streaming
kubectl apply -f whisper-server.yaml
kubectl delete -f whisper-server.yaml
kubectl apply -f whisper-server.yaml
kubectl port-forward deployment/whisper-server 8001:8001 -n cfm-streaming
kubectl port-forward deployment/whisper-server 8001:8001
kubectl delete -f whisper-server.yaml
kubectl apply -f whisper-server.yaml
kubectl delete -f whisper-server.yaml
kubectl exec -it mynifi-0 -n cfm-streaming -- curl -s -o /dev/null -w "%{http_code}" http://whisper-server-service.default.svc.cluster.local:8001/transcribe
kubectl apply -f vllm-qwen.yaml

Terminal 1 Output

This terminal is mostly me building the Dockerfile.


steven@CSO:~$ mkdir whisper
steven@CSO:~$ cd whisper
steven@CSO:~/whisper$ nano Dockerfile.whisper
steven@CSO:~/whisper$ eval $(minikube docker-env)
docker build -t streamwhisper:latest -f Dockerfile.whisper .
[+] Building 401.3s (13/15)                                                          docker:default
 => [internal] load build definition from Dockerfile.whisper                                   0.1s
 => => transferring dockerfile: 1.85kB                                                         0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04          1.1s
 => [auth] nvidia/cuda:pull token for registry-1.docker.io                                     0.0s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => [1/9] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a1d02c0f90  173.8s
 => => resolve docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a1d02c0f90ed  0.0s
 => => sha256:622e78a1d02c0f90ed900e3985d6c975d8e2dc9ee5e61643aed587dcf9129f42 743B / 743B     0.0s
 => => sha256:0a1cb6e7bd047a1067efe14efdf0276352d5ca643dfd77963dab1a4f05a003a 2.84kB / 2.84kB  0.0s
 => => sha256:3c645031de2917ade93ec54b118d5d3e45de72ef580b8f419a8cdc41e01d0 29.53MB / 29.53MB  2.5s
 => => sha256:0a7674e3e8fe69dcd7f1424fa29aa033b32c42269aab46cbe9818f8dd7154 57.59MB / 57.59MB  3.7s
 => => sha256:edd3b6bf59a6acc4d56fdcdfade4d1bc9aa206359a6823a1a43a162c30213 19.68kB / 19.68kB  0.0s
 => => sha256:0d6448aff88945ea46a37cfe4330bdb0ada228268b80da6258a0fec63086f40 4.62MB / 4.62MB  0.8s
 => => sha256:b71b637b97c5efb435b9965058ad414f07afa99d320cf05e89f10441ec1becf4 185B / 185B     0.9s
 => => sha256:56dc8550293751a1604e97ac949cfae82ba20cb2a28e034737bafd738255960 6.89kB / 6.89kB  1.0s
 => => sha256:ec6d5f6c9ed94d2ee2eeaf048d90242af638325f57696909f1737b3158d838 1.37GB / 1.37GB  83.5s
 => => sha256:47b8539d532f561cac6d7fb8ee2f46c902b66e4a60b103d19701829742a0d 64.05kB / 64.05kB  2.7s
 => => extracting sha256:3c645031de2917ade93ec54b118d5d3e45de72ef580b8f419a8cdc41e01d042c      1.7s
 => => sha256:fd9cc1ad8dee47ca559003714d462f4eb79cb6315a2708927c240b84d022b55 1.68kB / 1.68kB  2.8s
 => => sha256:83525caeeb359731f869f1ee87a32acdfdd5efb8af4cab06d8f4fdcf1f317da 1.52kB / 1.52kB  2.9s
 => => sha256:8e79813a7b9d5784bb880ca2909887465549de5183411b24f6de72fab0802 2.65GB / 2.65GB  132.4s
 => => sha256:312a542960e3345001fc709156a5139ff8a1d8cc21a51a50f83e87ec2982f 88.86kB / 88.86kB  3.8s
 => => sha256:ae033ce9621d2cceaef2769ead17429ae8b29f098fb0350bdd4e0f55a3 670.18MB / 670.18MB  51.4s
 => => extracting sha256:0d6448aff88945ea46a37cfe4330bdb0ada228268b80da6258a0fec63086f404      0.4s
 => => extracting sha256:0a7674e3e8fe69dcd7f1424fa29aa033b32c42269aab46cbe9818f8dd7154754      3.7s
 => => extracting sha256:b71b637b97c5efb435b9965058ad414f07afa99d320cf05e89f10441ec1becf4      0.0s
 => => extracting sha256:56dc8550293751a1604e97ac949cfae82ba20cb2a28e034737bafd7382559609      0.0s
 => => extracting sha256:ec6d5f6c9ed94d2ee2eeaf048d90242af638325f57696909f1737b3158d838cf     26.1s
 => => extracting sha256:47b8539d532f561cac6d7fb8ee2f46c902b66e4a60b103d19701829742a0d11e      0.0s
 => => extracting sha256:fd9cc1ad8dee47ca559003714d462f4eb79cb6315a2708927c240b84d022b55f      0.0s
 => => extracting sha256:83525caeeb359731f869f1ee87a32acdfdd5efb8af4cab06d8f4fdcf1f317daa      0.0s
 => => extracting sha256:8e79813a7b9d5784bb880ca2909887465549de5183411b24f6de72fab0802bcd     35.2s
 => => extracting sha256:312a542960e3345001fc709156a5139ff8a1d8cc21a51a50f83e87ec2982f579      0.0s
 => => extracting sha256:ae033ce9621d2cceaef2769ead17429ae8b29f098fb0350bdd4e0f55a36996db      5.8s
 => [internal] preparing inline document                                                       0.0s
 => [internal] load build context                                                              0.0s
 => => transferring context: 1.85kB                                                            0.0s
 => [2/9] RUN apt-get update && apt-get install -y python3.11 python3.11-venv python3-pip gi  46.9s
 => [3/9] WORKDIR /app                                                                         0.0s
 => [4/9] RUN python3.11 -m venv /opt/venv                                                     2.3s
 => [5/9] COPY . /app                                                                          0.0s
 => [6/9] RUN pip install --no-cache-dir torch torchvision torchaudio --index-url https://d  176.0s
 => ERROR [7/9] RUN pip install --no-cache-dir -r requirements.txt  # (or just the packages b  0.8s
------
 > [7/9] RUN pip install --no-cache-dir -r requirements.txt  # (or just the packages below):
0.769 ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'
------
Dockerfile.whisper:13
--------------------
  11 |     COPY . /app
  12 |     RUN pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
  13 | >>> RUN pip install --no-cache-dir -r requirements.txt  # (or just the packages below)
  14 |     RUN pip install --no-cache-dir \
  15 |         insanely-fast-whisper==0.0.15 \
--------------------
ERROR: failed to build: failed to solve: process "/bin/sh -c pip install --no-cache-dir -r requirements.txt  # (or just the packages below)" did not complete successfully: exit code: 1
steven@CSO:~/whisper$ nano requirements.txt
steven@CSO:~/whisper$ nano Dockerfile.whisper
steven@CSO:~/whisper$ eval $(minikube docker-env)
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper .
[+] Building 178.6s (13/14)                                                          docker:default
 => [internal] load build definition from Dockerfile.whisper                                   0.0s
 => => transferring dockerfile: 1.84kB                                                         0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04          0.5s
 => [auth] nvidia/cuda:pull token for registry-1.docker.io                                     0.0s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => [1/8] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a1d02c0f90ed  0.0s
 => [internal] load build context                                                              0.0s
 => => transferring context: 1.84kB                                                            0.0s
 => CACHED [internal] preparing inline document                                                0.0s
 => CACHED [2/8] RUN apt-get update && apt-get install -y python3.11 python3.11-venv python3-  0.0s
 => CACHED [3/8] WORKDIR /app                                                                  0.0s
 => CACHED [4/8] RUN python3.11 -m venv /opt/venv                                              0.0s
 => [5/8] COPY . /app                                                                          0.0s
 => [6/8] RUN pip install --no-cache-dir torch torchvision torchaudio --index-url https://d  174.3s
 => ERROR [7/8] RUN pip install --no-cache-dir     insanely-fast-whisper==0.0.15     fastapi   3.7s
------
 > [7/8] RUN pip install --no-cache-dir     insanely-fast-whisper==0.0.15     fastapi uvicorn python-multipart huggingface_hub flash-attn --no-build-isolation:
0.900 Collecting insanely-fast-whisper==0.0.15
1.013   Downloading insanely_fast_whisper-0.0.15-py3-none-any.whl (16 kB)
1.139 Collecting fastapi
1.154   Downloading fastapi-0.135.2-py3-none-any.whl (117 kB)
1.186      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 117.4/117.4 KB 4.9 MB/s eta 0:00:00
1.226 Collecting uvicorn
1.242   Downloading uvicorn-0.42.0-py3-none-any.whl (68 kB)
1.245      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68.8/68.8 KB 47.8 MB/s eta 0:00:00
1.293 Collecting python-multipart
1.308   Downloading python_multipart-0.0.22-py3-none-any.whl (24 kB)
1.408 Collecting huggingface_hub
1.422   Downloading huggingface_hub-1.8.0-py3-none-any.whl (625 kB)
1.471      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 625.2/625.2 KB 12.9 MB/s eta 0:00:00
1.519 Collecting flash-attn
1.534   Downloading flash_attn-2.8.3.tar.gz (8.4 MB)
2.054      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.4/8.4 MB 16.3 MB/s eta 0:00:00
3.416   Preparing metadata (setup.py): started
3.521   Preparing metadata (setup.py): finished with status 'error'
3.523   error: subprocess-exited-with-error
3.523
3.523   × python setup.py egg_info did not run successfully.
3.523   │ exit code: 1
3.523   ╰─> [6 lines of output]
3.523       Traceback (most recent call last):
3.523         File "<string>", line 2, in <module>
3.523         File "<pip-setuptools-caller>", line 34, in <module>
3.523         File "/tmp/pip-install-hl031e95/flash-attn_260b47862be940aba0932cc81566a5bb/setup.py", line 12, in <module>
3.523           from packaging.version import parse, Version
3.523       ModuleNotFoundError: No module named 'packaging'
3.523       [end of output]
3.523
3.523   note: This error originates from a subprocess, and is likely not a problem with pip.
3.524 error: metadata-generation-failed
3.524
3.524 × Encountered error while generating package metadata.
3.524 ╰─> See above for output.
3.524
3.524 note: This is an issue with the package mentioned above, not pip.
3.524 hint: See above for details.
------
Dockerfile.whisper:15
--------------------
  14 |     # The failing requirements.txt line has been removed to use the specific packages below
  15 | >>> RUN pip install --no-cache-dir \
  16 | >>>     insanely-fast-whisper==0.0.15 \
  17 | >>>     fastapi uvicorn python-multipart huggingface_hub flash-attn --no-build-isolation
  18 |
--------------------
ERROR: failed to build: failed to solve: process "/bin/sh -c pip install --no-cache-dir     insanely-fast-whisper==0.0.15     fastapi uvicorn python-multipart huggingface_hub flash-attn --no-build-isolation" did not complete successfully: exit code: 1
steven@CSO:~/whisper$ rm -rf Dockerfile.whisper
steven@CSO:~/whisper$ nano Dockerfile.whisper
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper .
[+] Building 465.3s (16/16) FINISHED                                                 docker:default
 => [internal] load build definition from Dockerfile.whisper                                   0.0s
 => => transferring dockerfile: 2.18kB                                                         0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04          0.5s
 => [auth] nvidia/cuda:pull token for registry-1.docker.io                                     0.0s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => CACHED [1/9] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a1d02  0.0s
 => [internal] load build context                                                              0.0s
 => => transferring context: 2.18kB                                                            0.0s
 => [internal] preparing inline document                                                       0.0s
 => [2/9] RUN apt-get update && apt-get install -y     python3.11 python3.11-venv python3-pi  47.9s
 => [3/9] WORKDIR /app                                                                         0.0s
 => [4/9] RUN python3.11 -m venv /opt/venv                                                     2.3s
 => [5/9] COPY . /app                                                                          0.1s
 => [6/9] RUN pip install --no-cache-dir torch torchvision torchaudio --index-url https://d  173.5s
 => [7/9] RUN pip install --no-cache-dir packaging setuptools wheel                            1.3s
 => [8/9] RUN pip install --no-cache-dir     insanely-fast-whisper==0.0.15     fastapi uvic  218.4s
 => [9/9] COPY <<EOF /app/main.py                                                              0.0s
 => exporting to image                                                                        21.2s
 => => exporting layers                                                                       21.2s
 => => writing image sha256:0aef498fd237777f54b6bf049c9250ceadcf682889e6041c75f3261f877e935f   0.0s
 => => naming to docker.io/library/streamwhisper:latest                                        0.0s
steven@CSO:~/whisper$ kubectl apply -f whisper-server.yaml
error: the path "whisper-server.yaml" does not exist
steven@CSO:~/whisper$ curl -L https://www.voiptroubleshooter.com/open_speech/american/eng_m1.wav -o test-speech.wav
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    91    0    91    0     0    176      0 --:--:-- --:--:-- --:--:--   176
steven@CSO:~/whisper$ rm -rf Dockerfile.whisper
steven@CSO:~/whisper$ nano Dockerfile.whisper
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper .
[+] Building 398.7s (16/16) FINISHED                                                 docker:default
 => [internal] load build definition from Dockerfile.whisper                                   0.0s
 => => transferring dockerfile: 2.52kB                                                         0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04          0.6s
 => [auth] nvidia/cuda:pull token for registry-1.docker.io                                     0.0s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => [1/9] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a1d02c0f90ed  0.0s
 => [internal] preparing inline document                                                       0.0s
 => [internal] load build context                                                              0.0s
 => => transferring context: 2.65kB                                                            0.0s
 => CACHED [2/9] RUN apt-get update && apt-get install -y     python3.11 python3.11-venv pyth  0.0s
 => CACHED [3/9] WORKDIR /app                                                                  0.0s
 => CACHED [4/9] RUN python3.11 -m venv /opt/venv                                              0.0s
 => [5/9] COPY . /app                                                                          0.0s
 => [6/9] RUN pip install --no-cache-dir --upgrade pip &&     pip install --no-cache-dir pack  2.8s
 => [7/9] RUN pip install --no-cache-dir     torch==2.4.1     torchvision==0.19.1     torch  171.7s
 => [8/9] RUN pip install --no-cache-dir     transformers     insanely-fast-whisper==0.0.15  204.4s
 => [9/9] COPY <<EOF /app/main.py                                                              0.0s
 => exporting to image                                                                        18.9s
 => => exporting layers                                                                       18.9s
 => => writing image sha256:f2b18f2c77ffe7a04a7b62efecb9503977f38b97708ba35bf109a402efc3912c   0.0s
 => => naming to docker.io/library/streamwhisper:latest                                        0.0s
steven@CSO:~/whisper$ nano Dockerfile.whisper.2
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper.2 .
[+] Building 377.9s (14/16)                                                          docker:default
 => [internal] load build definition from Dockerfile.whisper.2                                 0.0s
 => => transferring dockerfile: 2.78kB                                                         0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04          0.4s
 => [auth] nvidia/cuda:pull token for registry-1.docker.io                                     0.0s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => [ 1/10] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a1d02c0f90  0.0s
 => [internal] preparing inline document                                                       0.0s
 => [internal] load build context                                                              0.0s
 => => transferring context: 2.85kB                                                            0.0s
 => CACHED [ 2/10] RUN apt-get update && apt-get install -y     python3.11 python3.11-venv py  0.0s
 => CACHED [ 3/10] WORKDIR /app                                                                0.0s
 => CACHED [ 4/10] RUN python3.11 -m venv /opt/venv                                            0.0s
 => [ 5/10] RUN pip install --no-cache-dir --upgrade pip &&     pip install --no-cache-dir pa  2.8s
 => [ 6/10] RUN pip install --no-cache-dir     torch==2.4.1 torchvision==0.19.1 torchaudio=  168.3s
 => [ 7/10] RUN pip install --no-cache-dir     transformers     insanely-fast-whisper==0.0.  202.7s
 => ERROR [ 8/10] RUN python3 -c "from transformers import pipeline; pipeline('automatic-spee  3.6s
------
 > [ 8/10] RUN python3 -c "from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')":
2.812 Traceback (most recent call last):
2.812   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2169, in __getattr__
2.813     module = self._get_module(self._class_to_module[name])
2.813              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.813   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2403, in _get_module
2.813     raise e
2.813   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2401, in _get_module
2.813     return importlib.import_module("." + module_name, self.__name__)
2.813            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.813   File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
2.813     return _bootstrap._gcd_import(name[level:], package, level)
2.813            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.813   File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
2.813   File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
2.813   File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
2.813   File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
2.813   File "<frozen importlib._bootstrap_external>", line 940, in exec_module
2.813   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
2.813   File "/opt/venv/lib/python3.11/site-packages/transformers/pipelines/__init__.py", line 27, in <module>
2.813     from ..image_processing_utils import BaseImageProcessor
2.813   File "/opt/venv/lib/python3.11/site-packages/transformers/image_processing_utils.py", line 24, in <module>
2.813     from .image_processing_base import BatchFeature, ImageProcessingMixin
2.813   File "/opt/venv/lib/python3.11/site-packages/transformers/image_processing_base.py", line 25, in <module>
2.813     from .image_utils import is_valid_image, load_image
2.813   File "/opt/venv/lib/python3.11/site-packages/transformers/image_utils.py", line 53, in <module>
2.814     from torchvision.transforms import InterpolationMode
2.814   File "/opt/venv/lib/python3.11/site-packages/torchvision/__init__.py", line 10, in <module>
2.814     from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
2.814     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.814   File "/opt/venv/lib/python3.11/site-packages/torchvision/_meta_registrations.py", line 163, in <module>
2.814     @torch.library.register_fake("torchvision::nms")
2.814      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.814   File "/opt/venv/lib/python3.11/site-packages/torch/library.py", line 1087, in register
2.814     use_lib._register_fake(
2.814   File "/opt/venv/lib/python3.11/site-packages/torch/library.py", line 204, in _register_fake
2.814     handle = entry.fake_impl.register(
2.814              ^^^^^^^^^^^^^^^^^^^^^^^^^
2.814   File "/opt/venv/lib/python3.11/site-packages/torch/_library/fake_impl.py", line 50, in register
2.814     if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
2.814        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.814 RuntimeError: operator torchvision::nms does not exist
2.814
2.814 The above exception was the direct cause of the following exception:
2.814
2.814 Traceback (most recent call last):
2.814   File "<string>", line 1, in <module>
2.814   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2257, in __getattr__
2.814     raise ModuleNotFoundError(
2.814 ModuleNotFoundError: Could not import module 'pipeline'. Are this object's requirements defined correctly?
------
Dockerfile.whisper.2:36
--------------------
  34 |     # This "bakes" the model into the image so the Pod doesn't have to
  35 |     # download 3GB from HuggingFace every time it starts up.
  36 | >>> RUN python3 -c "from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')"
  37 |
  38 |     # --- THE "CHANGE" ZONE ---
--------------------
ERROR: failed to build: failed to solve: process "/bin/sh -c python3 -c \"from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')\"" did not complete successfully: exit code: 1
steven@CSO:~/whisper$ nano Dockerfile.whisper.3
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper.3 .
[+] Building 182.3s (13/17)                                                          docker:default
 => [internal] load build definition from Dockerfile.whisper.3                                 0.0s
 => => transferring dockerfile: 2.70kB                                                         0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04          0.4s
 => [auth] nvidia/cuda:pull token for registry-1.docker.io                                     0.0s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => [internal] preparing inline document                                                       0.0s
 => CACHED [builder 1/8] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622  0.0s
 => CACHED [builder 2/8] RUN apt-get update && apt-get install -y     python3.11 python3.11-v  0.0s
 => CACHED [builder 3/8] WORKDIR /app                                                          0.0s
 => CACHED [builder 4/8] RUN python3.11 -m venv /opt/venv                                      0.0s
 => [builder 5/8] RUN pip install --no-cache-dir --upgrade pip &&     pip install --no-cache-  2.7s
 => [stage-1 2/5] WORKDIR /app                                                                 0.1s
 => [builder 6/8] RUN pip install --no-cache-dir --force-reinstall     torch==2.4.1+cu124    174.1s
 => ERROR [builder 7/8] RUN pip install --no-cache-dir     transformers     insanely-fast-whi  4.9s
------
 > [builder 7/8] RUN pip install --no-cache-dir     transformers     insanely-fast-whisper==0.0.15     fastapi uvicorn python-multipart huggingface_hub flash-attn --no-build-isolation:
0.722 Collecting transformers
0.838   Downloading transformers-5.4.0-py3-none-any.whl.metadata (32 kB)
0.921 Collecting insanely-fast-whisper==0.0.15
0.944   Downloading insanely_fast_whisper-0.0.15-py3-none-any.whl.metadata (9.9 kB)
1.006 Collecting fastapi
1.018   Downloading fastapi-0.135.2-py3-none-any.whl.metadata (28 kB)
1.047 Collecting uvicorn
1.061   Downloading uvicorn-0.42.0-py3-none-any.whl.metadata (6.7 kB)
1.077 Collecting python-multipart
1.091   Downloading python_multipart-0.0.22-py3-none-any.whl.metadata (1.8 kB)
1.126 Collecting huggingface_hub
1.141   Downloading huggingface_hub-1.8.0-py3-none-any.whl.metadata (13 kB)
1.185 Collecting flash-attn
1.197   Downloading flash_attn-2.8.3.tar.gz (8.4 MB)
1.612      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.4/8.4 MB 21.5 MB/s  0:00:00
2.896   Preparing metadata (pyproject.toml): started
4.591   Preparing metadata (pyproject.toml): finished with status 'error'
4.595   error: subprocess-exited-with-error
4.595
4.595   × Preparing metadata (pyproject.toml) did not run successfully.
4.595   │ exit code: 1
4.595   ╰─> [66 lines of output]
4.595       /opt/venv/lib/python3.11/site-packages/wheel/bdist_wheel.py:4: FutureWarning: The 'wheel' package is no longer the canonical location of the 'bdist_wheel' command, and will be removed in a future release. Please update to setuptools v70.1 or later which contains an integrated version of this command.
4.595         warn(
4.595       No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
4.595
4.595
4.595       torch.__version__  = 2.4.1+cu124
4.595
4.595
4.595       running dist_info
4.595       creating /tmp/pip-modern-metadata-o887qc_w/flash_attn.egg-info
4.595       writing /tmp/pip-modern-metadata-o887qc_w/flash_attn.egg-info/PKG-INFO
4.595       writing dependency_links to /tmp/pip-modern-metadata-o887qc_w/flash_attn.egg-info/dependency_links.txt
4.595       writing requirements to /tmp/pip-modern-metadata-o887qc_w/flash_attn.egg-info/requires.txt
4.595       writing top-level names to /tmp/pip-modern-metadata-o887qc_w/flash_attn.egg-info/top_level.txt
4.595       writing manifest file '/tmp/pip-modern-metadata-o887qc_w/flash_attn.egg-info/SOURCES.txt'
4.595       Traceback (most recent call last):
4.595         File "/opt/venv/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 389, in <module>
4.595           main()
4.595         File "/opt/venv/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 373, in main
4.595           json_out["return_val"] = hook(**hook_input["kwargs"])
4.595                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4.595         File "/opt/venv/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 175, in prepare_metadata_for_build_wheel
4.595           return hook(metadata_directory, config_settings)
4.595                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4.595         File "/opt/venv/lib/python3.11/site-packages/setuptools/build_meta.py", line 174, in prepare_metadata_for_build_wheel
4.595           self.run_setup()
4.595         File "/opt/venv/lib/python3.11/site-packages/setuptools/build_meta.py", line 268, in run_setup
4.595           self).run_setup(setup_script=setup_script)
4.595                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4.595         File "/opt/venv/lib/python3.11/site-packages/setuptools/build_meta.py", line 158, in run_setup
4.595           exec(compile(code, __file__, 'exec'), locals())
4.595         File "setup.py", line 526, in <module>
4.595           setup(
4.595         File "/opt/venv/lib/python3.11/site-packages/setuptools/__init__.py", line 153, in setup
4.595           return distutils.core.setup(**attrs)
4.595                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4.595         File "/usr/lib/python3.11/distutils/core.py", line 148, in setup
4.595           dist.run_commands()
4.595         File "/usr/lib/python3.11/distutils/dist.py", line 966, in run_commands
4.595           self.run_command(cmd)
4.595         File "/usr/lib/python3.11/distutils/dist.py", line 985, in run_command
4.595           cmd_obj.run()
4.595         File "/opt/venv/lib/python3.11/site-packages/setuptools/command/dist_info.py", line 31, in run
4.595           egg_info.run()
4.595         File "/opt/venv/lib/python3.11/site-packages/setuptools/command/egg_info.py", line 299, in run
4.595           self.find_sources()
4.595         File "/opt/venv/lib/python3.11/site-packages/setuptools/command/egg_info.py", line 306, in find_sources
4.595           mm.run()
4.595         File "/opt/venv/lib/python3.11/site-packages/setuptools/command/egg_info.py", line 541, in run
4.595           self.add_defaults()
4.595         File "/opt/venv/lib/python3.11/site-packages/setuptools/command/egg_info.py", line 578, in add_defaults
4.595           sdist.add_defaults(self)
4.595         File "/usr/lib/python3.11/distutils/command/sdist.py", line 228, in add_defaults
4.595           self._add_defaults_ext()
4.595         File "/usr/lib/python3.11/distutils/command/sdist.py", line 311, in _add_defaults_ext
4.595           build_ext = self.get_finalized_command('build_ext')
4.595                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4.595         File "/usr/lib/python3.11/distutils/cmd.py", line 298, in get_finalized_command
4.595           cmd_obj = self.distribution.get_command_obj(command, create)
4.595                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4.595         File "/usr/lib/python3.11/distutils/dist.py", line 858, in get_command_obj
4.595           cmd_obj = self.command_obj[command] = klass(self)
4.595                                                 ^^^^^^^^^^^
4.595         File "setup.py", line 510, in __init__
4.595           import psutil
4.595       ModuleNotFoundError: No module named 'psutil'
4.595       [end of output]
4.595
4.595   note: This error originates from a subprocess, and is likely not a problem with pip.
4.711 error: metadata-generation-failed
4.711
4.711 × Encountered error while generating package metadata.
4.711 ╰─> flash-attn
4.711
4.711 note: This is an issue with the package mentioned above, not pip.
4.711 hint: See above for details.
------
Dockerfile.whisper.3:26
--------------------
  25 |     # 3. Install dependencies
  26 | >>> RUN pip install --no-cache-dir \
  27 | >>>     transformers \
  28 | >>>     insanely-fast-whisper==0.0.15 \
  29 | >>>     fastapi uvicorn python-multipart huggingface_hub flash-attn --no-build-isolation
  30 |
--------------------
ERROR: failed to build: failed to solve: process "/bin/sh -c pip install --no-cache-dir     transformers     insanely-fast-whisper==0.0.15     fastapi uvicorn python-multipart huggingface_hub flash-attn --no-build-isolation" did not complete successfully: exit code: 1
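The `ModuleNotFoundError: No module named 'psutil'` above is the classic `--no-build-isolation` trap: flash-attn's `setup.py` imports `psutil` (and expects `ninja`, `packaging`, and `wheel`) during metadata generation, and with build isolation disabled pip no longer provisions those into a throwaway build environment — they must already exist in the venv before flash-attn is attempted. The log below shows `Dockerfile.whisper.4` splitting the install into separate layers; its exact contents aren't captured here, but the fix presumably looked something like this sketch:

```dockerfile
# Pre-install flash-attn's build-time imports; with --no-build-isolation,
# pip will NOT fetch these into an isolated build env on our behalf.
RUN pip install --no-cache-dir psutil ninja packaging wheel

# Install the web/AI stack in its own layer, then compile flash-attn
# separately so a flash-attn failure doesn't invalidate the slow layers above.
RUN pip install --no-cache-dir \
    transformers \
    insanely-fast-whisper==0.0.15 \
    fastapi uvicorn python-multipart huggingface_hub

RUN pip install --no-cache-dir flash-attn --no-build-isolation
```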
steven@CSO:~/whisper$ nano Dockerfile.whisper.4
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper.4 .
[+] Building 358.0s (13/17)                                                          docker:default
 => [internal] load build definition from Dockerfile.whisper.4                                 0.0s
 => => transferring dockerfile: 2.85kB                                                         0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04          0.3s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => [builder 1/9] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a1d0  0.0s
 => CACHED [internal] preparing inline document                                                0.0s
 => CACHED [builder 2/9] RUN apt-get update && apt-get install -y     python3.11 python3.11-v  0.0s
 => CACHED [builder 3/9] WORKDIR /app                                                          0.0s
 => CACHED [builder 4/9] RUN python3.11 -m venv /opt/venv                                      0.0s
 => CACHED [stage-1 2/5] WORKDIR /app                                                          0.0s
 => [builder 5/9] RUN pip install --no-cache-dir --upgrade pip &&     pip install --no-cache-  2.8s
 => [builder 6/9] RUN pip install --no-cache-dir --force-reinstall     torch==2.4.1+cu124    173.2s
 => [builder 7/9] RUN pip install --no-cache-dir     transformers     insanely-fast-whisper  174.3s
 => ERROR [builder 8/9] RUN pip install --no-cache-dir flash-attn --no-build-isolation         7.3s
------
 > [builder 8/9] RUN pip install --no-cache-dir flash-attn --no-build-isolation:
0.690 Collecting flash-attn
0.804   Downloading flash_attn-2.8.3.tar.gz (8.4 MB)
1.285      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.4/8.4 MB 18.9 MB/s  0:00:00
2.568   Preparing metadata (pyproject.toml): started
5.215   Preparing metadata (pyproject.toml): finished with status 'done'
5.217 Requirement already satisfied: torch in /opt/venv/lib/python3.11/site-packages (from flash-attn) (2.11.0)
5.218 Requirement already satisfied: einops in /opt/venv/lib/python3.11/site-packages (from flash-attn) (0.8.2)
5.220 Requirement already satisfied: filelock in /opt/venv/lib/python3.11/site-packages (from torch->flash-attn) (3.25.2)
5.221 Requirement already satisfied: typing-extensions>=4.10.0 in /opt/venv/lib/python3.11/site-packages (from torch->flash-attn) (4.15.0)
5.221 Requirement already satisfied: setuptools<82 in /opt/venv/lib/python3.11/site-packages (from torch->flash-attn) (81.0.0)
5.221 Requirement already satisfied: sympy>=1.13.3 in /opt/venv/lib/python3.11/site-packages (from torch->flash-attn) (1.14.0)
5.222 Requirement already satisfied: networkx>=2.5.1 in /opt/venv/lib/python3.11/site-packages (from torch->flash-attn) (3.6.1)
5.222 Requirement already satisfied: jinja2 in /opt/venv/lib/python3.11/site-packages (from torch->flash-attn) (3.1.6)
5.222 Requirement already satisfied: fsspec>=0.8.5 in /opt/venv/lib/python3.11/site-packages (from torch->flash-attn) (2026.3.0)
5.223 Requirement already satisfied: cuda-toolkit==13.0.2 in /opt/venv/lib/python3.11/site-packages (from cuda-toolkit[cublas,cudart,cufft,cufile,cupti,curand,cusolver,cusparse,nvjitlink,nvrtc,nvtx]==13.0.2; platform_system == "Linux"->torch->flash-attn) (13.0.2)
5.223 Requirement already satisfied: cuda-bindings<14,>=13.0.3 in /opt/venv/lib/python3.11/site-packages (from torch->flash-attn) (13.2.0)
5.224 Requirement already satisfied: nvidia-cudnn-cu13==9.19.0.56 in /opt/venv/lib/python3.11/site-packages (from torch->flash-attn) (9.19.0.56)
5.224 Requirement already satisfied: nvidia-cusparselt-cu13==0.8.0 in /opt/venv/lib/python3.11/site-packages (from torch->flash-attn) (0.8.0)
5.225 Requirement already satisfied: nvidia-nccl-cu13==2.28.9 in /opt/venv/lib/python3.11/site-packages (from torch->flash-attn) (2.28.9)
5.225 Requirement already satisfied: nvidia-nvshmem-cu13==3.4.5 in /opt/venv/lib/python3.11/site-packages (from torch->flash-attn) (3.4.5)
5.225 Requirement already satisfied: triton==3.6.0 in /opt/venv/lib/python3.11/site-packages (from torch->flash-attn) (3.6.0)
5.235 Requirement already satisfied: nvidia-cuda-nvrtc==13.0.88.* in /opt/venv/lib/python3.11/site-packages (from cuda-toolkit[cublas,cudart,cufft,cufile,cupti,curand,cusolver,cusparse,nvjitlink,nvrtc,nvtx]==13.0.2; platform_system == "Linux"->torch->flash-attn) (13.0.88)
5.235 Requirement already satisfied: nvidia-curand==10.4.0.35.* in /opt/venv/lib/python3.11/site-packages (from cuda-toolkit[cublas,cudart,cufft,cufile,cupti,curand,cusolver,cusparse,nvjitlink,nvrtc,nvtx]==13.0.2; platform_system == "Linux"->torch->flash-attn) (10.4.0.35)
5.236 Requirement already satisfied: nvidia-nvjitlink==13.0.88.* in /opt/venv/lib/python3.11/site-packages (from cuda-toolkit[cublas,cudart,cufft,cufile,cupti,curand,cusolver,cusparse,nvjitlink,nvrtc,nvtx]==13.0.2; platform_system == "Linux"->torch->flash-attn) (13.0.88)
5.236 Requirement already satisfied: nvidia-cufile==1.15.1.6.* in /opt/venv/lib/python3.11/site-packages (from cuda-toolkit[cublas,cudart,cufft,cufile,cupti,curand,cusolver,cusparse,nvjitlink,nvrtc,nvtx]==13.0.2; platform_system == "Linux"->torch->flash-attn) (1.15.1.6)
5.236 Requirement already satisfied: nvidia-cublas==13.1.0.3.* in /opt/venv/lib/python3.11/site-packages (from cuda-toolkit[cublas,cudart,cufft,cufile,cupti,curand,cusolver,cusparse,nvjitlink,nvrtc,nvtx]==13.0.2; platform_system == "Linux"->torch->flash-attn) (13.1.0.3)
5.237 Requirement already satisfied: nvidia-cusolver==12.0.4.66.* in /opt/venv/lib/python3.11/site-packages (from cuda-toolkit[cublas,cudart,cufft,cufile,cupti,curand,cusolver,cusparse,nvjitlink,nvrtc,nvtx]==13.0.2; platform_system == "Linux"->torch->flash-attn) (12.0.4.66)
5.237 Requirement already satisfied: nvidia-cusparse==12.6.3.3.* in /opt/venv/lib/python3.11/site-packages (from cuda-toolkit[cublas,cudart,cufft,cufile,cupti,curand,cusolver,cusparse,nvjitlink,nvrtc,nvtx]==13.0.2; platform_system == "Linux"->torch->flash-attn) (12.6.3.3)
5.238 Requirement already satisfied: nvidia-cufft==12.0.0.61.* in /opt/venv/lib/python3.11/site-packages (from cuda-toolkit[cublas,cudart,cufft,cufile,cupti,curand,cusolver,cusparse,nvjitlink,nvrtc,nvtx]==13.0.2; platform_system == "Linux"->torch->flash-attn) (12.0.0.61)
5.238 Requirement already satisfied: nvidia-nvtx==13.0.85.* in /opt/venv/lib/python3.11/site-packages (from cuda-toolkit[cublas,cudart,cufft,cufile,cupti,curand,cusolver,cusparse,nvjitlink,nvrtc,nvtx]==13.0.2; platform_system == "Linux"->torch->flash-attn) (13.0.85)
5.238 Requirement already satisfied: nvidia-cuda-cupti==13.0.85.* in /opt/venv/lib/python3.11/site-packages (from cuda-toolkit[cublas,cudart,cufft,cufile,cupti,curand,cusolver,cusparse,nvjitlink,nvrtc,nvtx]==13.0.2; platform_system == "Linux"->torch->flash-attn) (13.0.85)
5.239 Requirement already satisfied: nvidia-cuda-runtime==13.0.96.* in /opt/venv/lib/python3.11/site-packages (from cuda-toolkit[cublas,cudart,cufft,cufile,cupti,curand,cusolver,cusparse,nvjitlink,nvrtc,nvtx]==13.0.2; platform_system == "Linux"->torch->flash-attn) (13.0.96)
5.243 Requirement already satisfied: cuda-pathfinder~=1.1 in /opt/venv/lib/python3.11/site-packages (from cuda-bindings<14,>=13.0.3->torch->flash-attn) (1.5.0)
5.270 Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/venv/lib/python3.11/site-packages (from sympy>=1.13.3->torch->flash-attn) (1.3.0)
5.273 Requirement already satisfied: MarkupSafe>=2.0 in /opt/venv/lib/python3.11/site-packages (from jinja2->torch->flash-attn) (3.0.3)
5.275 Building wheels for collected packages: flash-attn
5.276   Building wheel for flash-attn (pyproject.toml): started
7.011   Building wheel for flash-attn (pyproject.toml): finished with status 'error'
7.017   error: subprocess-exited-with-error
7.017
7.017   × Building wheel for flash-attn (pyproject.toml) did not run successfully.
7.017   │ exit code: 1
7.017   ╰─> [224 lines of output]
7.017       /opt/venv/lib/python3.11/site-packages/wheel/bdist_wheel.py:4: FutureWarning: The 'wheel' package is no longer the canonical location of the 'bdist_wheel' command, and will be removed in a future release. Please update to setuptools v70.1 or later which contains an integrated version of this command.
7.017         warn(
7.017       W0327 22:15:31.910114 29 torch/utils/cpp_extension.py:140] No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
7.017       /opt/venv/lib/python3.11/site-packages/setuptools/dist.py:765: SetuptoolsDeprecationWarning: License classifiers are deprecated.
7.017       !!
7.017
7.017               ********************************************************************************
7.017               Please consider removing the following classifiers in favor of a SPDX license expression:
7.017
7.017               License :: OSI Approved :: BSD License
7.017
7.017               See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
7.017               ********************************************************************************
7.017
7.017       !!
7.017         self._finalize_license_expression()
7.017
7.017
7.017       torch.__version__  = 2.11.0+cu130
7.017
7.017
7.017       running bdist_wheel
7.017       Guessing wheel URL:  https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.11cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
7.017       Precompiled wheel not found. Building from source...
7.017       running build
7.017       running build_py
7.017       creating build/lib.linux-x86_64-cpython-311/flash_attn
7.017       copying flash_attn/bert_padding.py -> build/lib.linux-x86_64-cpython-311/flash_attn
7.017       copying flash_attn/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-311/flash_attn
7.017       copying flash_attn/__init__.py -> build/lib.linux-x86_64-cpython-311/flash_attn
7.017       copying flash_attn/flash_attn_triton.py -> build/lib.linux-x86_64-cpython-311/flash_attn
7.017       copying flash_attn/flash_attn_triton_og.py -> build/lib.linux-x86_64-cpython-311/flash_attn
7.017       copying flash_attn/flash_blocksparse_attn_interface.py -> build/lib.linux-x86_64-cpython-311/flash_attn
7.017       copying flash_attn/flash_blocksparse_attention.py -> build/lib.linux-x86_64-cpython-311/flash_attn
7.017       creating build/lib.linux-x86_64-cpython-311/hopper
7.017       copying hopper/generate_kernels.py -> build/lib.linux-x86_64-cpython-311/hopper
7.017       copying hopper/test_flash_attn.py -> build/lib.linux-x86_64-cpython-311/hopper
7.017       copying hopper/padding.py -> build/lib.linux-x86_64-cpython-311/hopper
7.017       copying hopper/setup.py -> build/lib.linux-x86_64-cpython-311/hopper
7.017       copying hopper/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-311/hopper
7.017       copying hopper/__init__.py -> build/lib.linux-x86_64-cpython-311/hopper
7.017       copying hopper/test_util.py -> build/lib.linux-x86_64-cpython-311/hopper
7.017       copying hopper/benchmark_flash_attention_fp8.py -> build/lib.linux-x86_64-cpython-311/hopper
7.017       copying hopper/benchmark_attn.py -> build/lib.linux-x86_64-cpython-311/hopper
7.017       copying hopper/test_kvcache.py -> build/lib.linux-x86_64-cpython-311/hopper
7.017       copying hopper/test_attn_kvcache.py -> build/lib.linux-x86_64-cpython-311/hopper
7.017       copying hopper/benchmark_mla_decode.py -> build/lib.linux-x86_64-cpython-311/hopper
7.017       copying hopper/benchmark_split_kv.py -> build/lib.linux-x86_64-cpython-311/hopper
7.017       creating build/lib.linux-x86_64-cpython-311/flash_attn/ops
7.017       copying flash_attn/ops/rms_norm.py -> build/lib.linux-x86_64-cpython-311/flash_attn/ops
7.017       copying flash_attn/ops/__init__.py -> build/lib.linux-x86_64-cpython-311/flash_attn/ops
7.017       copying flash_attn/ops/layer_norm.py -> build/lib.linux-x86_64-cpython-311/flash_attn/ops
7.017       copying flash_attn/ops/fused_dense.py -> build/lib.linux-x86_64-cpython-311/flash_attn/ops
7.017       copying flash_attn/ops/activations.py -> build/lib.linux-x86_64-cpython-311/flash_attn/ops
7.017       creating build/lib.linux-x86_64-cpython-311/flash_attn/utils
7.017       copying flash_attn/utils/distributed.py -> build/lib.linux-x86_64-cpython-311/flash_attn/utils
7.017       copying flash_attn/utils/testing.py -> build/lib.linux-x86_64-cpython-311/flash_attn/utils
7.017       copying flash_attn/utils/__init__.py -> build/lib.linux-x86_64-cpython-311/flash_attn/utils
7.017       copying flash_attn/utils/benchmark.py -> build/lib.linux-x86_64-cpython-311/flash_attn/utils
7.017       copying flash_attn/utils/library.py -> build/lib.linux-x86_64-cpython-311/flash_attn/utils
7.017       copying flash_attn/utils/pretrained.py -> build/lib.linux-x86_64-cpython-311/flash_attn/utils
7.017       copying flash_attn/utils/torch.py -> build/lib.linux-x86_64-cpython-311/flash_attn/utils
7.017       copying flash_attn/utils/generation.py -> build/lib.linux-x86_64-cpython-311/flash_attn/utils
7.017       creating build/lib.linux-x86_64-cpython-311/flash_attn/losses
7.017       copying flash_attn/losses/cross_entropy.py -> build/lib.linux-x86_64-cpython-311/flash_attn/losses
7.017       copying flash_attn/losses/__init__.py -> build/lib.linux-x86_64-cpython-311/flash_attn/losses
7.017       creating build/lib.linux-x86_64-cpython-311/flash_attn/layers
7.017       copying flash_attn/layers/patch_embed.py -> build/lib.linux-x86_64-cpython-311/flash_attn/layers
7.017       copying flash_attn/layers/__init__.py -> build/lib.linux-x86_64-cpython-311/flash_attn/layers
7.017       copying flash_attn/layers/rotary.py -> build/lib.linux-x86_64-cpython-311/flash_attn/layers
7.017       creating build/lib.linux-x86_64-cpython-311/flash_attn/modules
7.017       copying flash_attn/modules/__init__.py -> build/lib.linux-x86_64-cpython-311/flash_attn/modules
7.017       copying flash_attn/modules/mlp.py -> build/lib.linux-x86_64-cpython-311/flash_attn/modules
7.017       copying flash_attn/modules/embedding.py -> build/lib.linux-x86_64-cpython-311/flash_attn/modules
7.017       copying flash_attn/modules/mha.py -> build/lib.linux-x86_64-cpython-311/flash_attn/modules
7.017       copying flash_attn/modules/block.py -> build/lib.linux-x86_64-cpython-311/flash_attn/modules
7.017       creating build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/utils.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/bwd_prefill_onekernel.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/__init__.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/fwd_decode.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/bwd_ref.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/bwd_prefill.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/fwd_ref.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/train.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/fwd_prefill.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/bwd_prefill_fused.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/test.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/fp8.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/interface_fa.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/bwd_prefill_split.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       copying flash_attn/flash_attn_triton_amd/bench.py -> build/lib.linux-x86_64-cpython-311/flash_attn/flash_attn_triton_amd
7.017       creating build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/hopper_helpers.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/seqlen_info.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/utils.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/flash_fwd_sm100.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/__init__.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/named_barrier.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/softmax.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/mma_sm100_desc.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/mask.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/block_info.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/interface.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/flash_bwd.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/blackwell_helpers.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/pack_gqa.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/flash_fwd.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/fast_math.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/pipeline.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/flash_bwd_postprocess.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/flash_bwd_preprocess.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/ampere_helpers.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       copying flash_attn/cute/tile_scheduler.py -> build/lib.linux-x86_64-cpython-311/flash_attn/cute
7.017       creating build/lib.linux-x86_64-cpython-311/flash_attn/models
7.017       copying flash_attn/models/__init__.py -> build/lib.linux-x86_64-cpython-311/flash_attn/models
7.017       copying flash_attn/models/gpt_neox.py -> build/lib.linux-x86_64-cpython-311/flash_attn/models
7.017       copying flash_attn/models/bert.py -> build/lib.linux-x86_64-cpython-311/flash_attn/models
7.017       copying flash_attn/models/llama.py -> build/lib.linux-x86_64-cpython-311/flash_attn/models
7.017       copying flash_attn/models/gptj.py -> build/lib.linux-x86_64-cpython-311/flash_attn/models
7.017       copying flash_attn/models/falcon.py -> build/lib.linux-x86_64-cpython-311/flash_attn/models
7.017       copying flash_attn/models/btlm.py -> build/lib.linux-x86_64-cpython-311/flash_attn/models
7.017       copying flash_attn/models/opt.py -> build/lib.linux-x86_64-cpython-311/flash_attn/models
7.017       copying flash_attn/models/baichuan.py -> build/lib.linux-x86_64-cpython-311/flash_attn/models
7.017       copying flash_attn/models/vit.py -> build/lib.linux-x86_64-cpython-311/flash_attn/models
7.017       copying flash_attn/models/bigcode.py -> build/lib.linux-x86_64-cpython-311/flash_attn/models
7.017       copying flash_attn/models/gpt.py -> build/lib.linux-x86_64-cpython-311/flash_attn/models
7.017       creating build/lib.linux-x86_64-cpython-311/flash_attn/ops/triton
7.017       copying flash_attn/ops/triton/linear.py -> build/lib.linux-x86_64-cpython-311/flash_attn/ops/triton
7.017       copying flash_attn/ops/triton/cross_entropy.py -> build/lib.linux-x86_64-cpython-311/flash_attn/ops/triton
7.017       copying flash_attn/ops/triton/__init__.py -> build/lib.linux-x86_64-cpython-311/flash_attn/ops/triton
7.017       copying flash_attn/ops/triton/mlp.py -> build/lib.linux-x86_64-cpython-311/flash_attn/ops/triton
7.017       copying flash_attn/ops/triton/rotary.py -> build/lib.linux-x86_64-cpython-311/flash_attn/ops/triton
7.017       copying flash_attn/ops/triton/k_activations.py -> build/lib.linux-x86_64-cpython-311/flash_attn/ops/triton
7.017       copying flash_attn/ops/triton/layer_norm.py -> build/lib.linux-x86_64-cpython-311/flash_attn/ops/triton
7.017       running build_ext
7.017       Traceback (most recent call last):
7.017         File "<string>", line 486, in run
7.017         File "/usr/lib/python3.11/urllib/request.py", line 241, in urlretrieve
7.017           with contextlib.closing(urlopen(url, data)) as fp:
7.017                                   ^^^^^^^^^^^^^^^^^^
7.017         File "/usr/lib/python3.11/urllib/request.py", line 216, in urlopen
7.017           return opener.open(url, data, timeout)
7.017                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7.017         File "/usr/lib/python3.11/urllib/request.py", line 525, in open
7.017           response = meth(req, response)
7.017                      ^^^^^^^^^^^^^^^^^^^
7.017         File "/usr/lib/python3.11/urllib/request.py", line 634, in http_response
7.017           response = self.parent.error(
7.017                      ^^^^^^^^^^^^^^^^^^
7.017         File "/usr/lib/python3.11/urllib/request.py", line 563, in error
7.017           return self._call_chain(*args)
7.017                  ^^^^^^^^^^^^^^^^^^^^^^^
7.017         File "/usr/lib/python3.11/urllib/request.py", line 496, in _call_chain
7.017           result = func(*args)
7.017                    ^^^^^^^^^^^
7.017         File "/usr/lib/python3.11/urllib/request.py", line 643, in http_error_default
7.017           raise HTTPError(req.full_url, code, msg, hdrs, fp)
7.017       urllib.error.HTTPError: HTTP Error 404: Not Found
7.017
7.017       During handling of the above exception, another exception occurred:
7.017
7.017       Traceback (most recent call last):
7.017         File "/opt/venv/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 389, in <module>
7.017           main()
7.017         File "/opt/venv/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 373, in main
7.017           json_out["return_val"] = hook(**hook_input["kwargs"])
7.017                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7.017         File "/opt/venv/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 280, in build_wheel
7.017           return _build_backend().build_wheel(
7.017                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/build_meta.py", line 441, in build_wheel
7.017           return _build(['bdist_wheel', '--dist-info-dir', str(metadata_directory)])
7.017                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/build_meta.py", line 429, in _build
7.017           return self._build_with_temp_dir(
7.017                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/build_meta.py", line 410, in _build_with_temp_dir
7.017           self.run_setup()
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/build_meta.py", line 520, in run_setup
7.017           super().run_setup(setup_script=setup_script)
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/build_meta.py", line 317, in run_setup
7.017           exec(code, locals())
7.017         File "<string>", line 526, in <module>
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/__init__.py", line 117, in setup
7.017           return distutils.core.setup(**attrs)  # type: ignore[return-value]
7.017                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 186, in setup
7.017           return run_commands(dist)
7.017                  ^^^^^^^^^^^^^^^^^^
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 202, in run_commands
7.017           dist.run_commands()
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 1000, in run_commands
7.017           self.run_command(cmd)
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/dist.py", line 1107, in run_command
7.017           super().run_command(command)
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 1019, in run_command
7.017           cmd_obj.run()
7.017         File "<string>", line 503, in run
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/command/bdist_wheel.py", line 370, in run
7.017           self.run_command("build")
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 341, in run_command
7.017           self.distribution.run_command(command)
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/dist.py", line 1107, in run_command
7.017           super().run_command(command)
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 1019, in run_command
7.017           cmd_obj.run()
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/_distutils/command/build.py", line 135, in run
7.017           self.run_command(cmd_name)
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 341, in run_command
7.017           self.distribution.run_command(command)
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/dist.py", line 1107, in run_command
7.017           super().run_command(command)
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 1019, in run_command
7.017           cmd_obj.run()
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 97, in run
7.017           _build_ext.run(self)
7.017         File "/opt/venv/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 367, in run
7.017           self.build_extensions()
7.017         File "/opt/venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 716, in build_extensions
7.017           _check_cuda_version(compiler_name, compiler_version)
7.017         File "/opt/venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 545, in _check_cuda_version
7.017           raise RuntimeError(CUDA_MISMATCH_MESSAGE, cuda_str_version, torch.version.cuda)
7.017       RuntimeError: ('The detected CUDA version (%s) mismatches the version that was used to compile PyTorch (%s). Please make sure to use the same CUDA versions.', '12.4', '13.0')
7.017       [end of output]
7.017
7.017   note: This error originates from a subprocess, and is likely not a problem with pip.
7.017   ERROR: Failed building wheel for flash-attn
7.017 Failed to build flash-attn
7.128 error: failed-wheel-build-for-install
7.128
7.128 × Failed to build installable wheels for some pyproject.toml based projects
7.128 ╰─> flash-attn
------
Dockerfile.whisper.4:37
--------------------
  35 |         fastapi uvicorn python-multipart huggingface_hub
  36 |
  37 | >>> RUN pip install --no-cache-dir flash-attn --no-build-isolation
  38 |
  39 |     # 4. Bake the model
--------------------
ERROR: failed to build: failed to solve: process "/bin/sh -c pip install --no-cache-dir flash-attn --no-build-isolation" did not complete successfully: exit code: 1
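The traceback pins the root cause: the pip-resolved Torch wheel was compiled against CUDA 13.0, while the `nvidia/cuda:12.4.1` base image ships the 12.4 toolkit, so flash-attn's extension build aborts in `_check_cuda_version`. A minimal sketch of the fix applied in the next Dockerfile (the torch/torchvision pins match the versions visible in the later cached layers; the cu124 extra index is PyTorch's standard wheel index):

```dockerfile
# Base-image toolkit and Torch wheels must agree on the CUDA minor version,
# or flash-attn's CUDA version check raises the mismatch error above.
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 AS builder

# +cu124 wheels are built against the same 12.4 toolkit as the base image.
RUN pip install --no-cache-dir \
    torch==2.4.1+cu124 torchvision==0.19.1+cu124 \
    --extra-index-url https://download.pytorch.org/whl/cu124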
steven@CSO:~/whisper$ nano Dockerfile.whisper.5
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper.5 .
[+] Building 201.4s (16/19)                                                          docker:default
 => [internal] load build definition from Dockerfile.whisper.5                                 0.0s
 => => transferring dockerfile: 3.10kB                                                         0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04          0.5s
 => [auth] nvidia/cuda:pull token for registry-1.docker.io                                     0.0s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => [builder  1/10] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a1  0.0s
 => [internal] preparing inline document                                                       0.0s
 => CACHED [builder  2/10] RUN apt-get update && apt-get install -y     python3.11 python3.11  0.0s
 => CACHED [builder  3/10] WORKDIR /app                                                        0.0s
 => CACHED [builder  4/10] RUN python3.11 -m venv /opt/venv                                    0.0s
 => CACHED [builder  5/10] RUN pip install --no-cache-dir --upgrade pip &&     pip install --  0.0s
 => CACHED [stage-1 2/5] WORKDIR /app                                                          0.0s
 => [builder  6/10] RUN pip install --no-cache-dir     torch==2.4.1+cu124     torchvision==  169.7s
 => [builder  7/10] RUN pip install --no-cache-dir --no-deps     transformers     insanely-fa  6.1s
 => [builder  8/10] RUN pip install --no-cache-dir pyyaml requests tqdm numpy regex sentencep  2.4s
 => [builder  9/10] RUN pip install --no-cache-dir flash-attn --no-build-isolation            22.1s
 => ERROR [builder 10/10] RUN python3 -c "from transformers import pipeline; pipeline('automa  0.5s
------
 > [builder 10/10] RUN python3 -c "from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')":
0.424 Traceback (most recent call last):
0.424   File "<string>", line 1, in <module>
0.424   File "/opt/venv/lib/python3.11/site-packages/transformers/__init__.py", line 30, in <module>
0.424     from . import dependency_versions_check
0.424   File "/opt/venv/lib/python3.11/site-packages/transformers/dependency_versions_check.py", line 16, in <module>
0.424     from .utils.versions import require_version, require_version_core
0.424   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/__init__.py", line 22, in <module>
0.424     from .auto_docstring import (
0.424   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/auto_docstring.py", line 32, in <module>
0.425     from .generic import ModelOutput
0.425   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/generic.py", line 35, in <module>
0.425     from ..utils import logging
0.425   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/logging.py", line 35, in <module>
0.425     import huggingface_hub.utils as hf_hub_utils
0.425   File "/opt/venv/lib/python3.11/site-packages/huggingface_hub/utils/__init__.py", line 17, in <module>
0.425     from huggingface_hub.errors import (
0.425   File "/opt/venv/lib/python3.11/site-packages/huggingface_hub/errors.py", line 6, in <module>
0.425     from httpx import HTTPError, Response
0.425 ModuleNotFoundError: No module named 'httpx'
------
Dockerfile.whisper.5:42
--------------------
  40 |
  41 |     # 6. Bake the model (The "Long Wait" Step)
  42 | >>> RUN python3 -c "from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')"
  43 |
  44 |     # STAGE 2: Final Runtime
--------------------
ERROR: failed to build: failed to solve: process "/bin/sh -c python3 -c \"from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')\"" did not complete successfully: exit code: 1
steven@CSO:~/whisper$ nano Dockerfile.whisper.6
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper.5 .
[+] Building 0.9s (16/19)                                                            docker:default
 => [internal] load build definition from Dockerfile.whisper.5                                 0.0s
 => => transferring dockerfile: 3.10kB                                                         0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04          0.4s
 => [auth] nvidia/cuda:pull token for registry-1.docker.io                                     0.0s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => CACHED [internal] preparing inline document                                                0.0s
 => [builder  1/10] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a1  0.0s
 => CACHED [stage-1 2/5] WORKDIR /app                                                          0.0s
 => CACHED [builder  2/10] RUN apt-get update && apt-get install -y     python3.11 python3.11  0.0s
 => CACHED [builder  3/10] WORKDIR /app                                                        0.0s
 => CACHED [builder  4/10] RUN python3.11 -m venv /opt/venv                                    0.0s
 => CACHED [builder  5/10] RUN pip install --no-cache-dir --upgrade pip &&     pip install --  0.0s
 => CACHED [builder  6/10] RUN pip install --no-cache-dir     torch==2.4.1+cu124     torchvis  0.0s
 => CACHED [builder  7/10] RUN pip install --no-cache-dir --no-deps     transformers     insa  0.0s
 => CACHED [builder  8/10] RUN pip install --no-cache-dir pyyaml requests tqdm numpy regex se  0.0s
 => CACHED [builder  9/10] RUN pip install --no-cache-dir flash-attn --no-build-isolation      0.0s
 => ERROR [builder 10/10] RUN python3 -c "from transformers import pipeline; pipeline('automa  0.4s
------
 > [builder 10/10] RUN python3 -c "from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')":
0.375 Traceback (most recent call last):
0.375   File "<string>", line 1, in <module>
0.375   File "/opt/venv/lib/python3.11/site-packages/transformers/__init__.py", line 30, in <module>
0.375     from . import dependency_versions_check
0.375   File "/opt/venv/lib/python3.11/site-packages/transformers/dependency_versions_check.py", line 16, in <module>
0.375     from .utils.versions import require_version, require_version_core
0.375   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/__init__.py", line 22, in <module>
0.376     from .auto_docstring import (
0.376   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/auto_docstring.py", line 32, in <module>
0.376     from .generic import ModelOutput
0.376   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/generic.py", line 35, in <module>
0.376     from ..utils import logging
0.376   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/logging.py", line 35, in <module>
0.376     import huggingface_hub.utils as hf_hub_utils
0.376   File "/opt/venv/lib/python3.11/site-packages/huggingface_hub/utils/__init__.py", line 17, in <module>
0.376     from huggingface_hub.errors import (
0.376   File "/opt/venv/lib/python3.11/site-packages/huggingface_hub/errors.py", line 6, in <module>
0.376     from httpx import HTTPError, Response
0.376 ModuleNotFoundError: No module named 'httpx'
------
Dockerfile.whisper.5:42
--------------------
  40 |
  41 |     # 6. Bake the model (The "Long Wait" Step)
  42 | >>> RUN python3 -c "from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')"
  43 |
  44 |     # STAGE 2: Final Runtime
--------------------
ERROR: failed to build: failed to solve: process "/bin/sh -c python3 -c \"from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')\"" did not complete successfully: exit code: 1
steven@CSO:~/whisper$ ^C
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper.6 .
[+] Building 0.1s (1/1) FINISHED                                                     docker:default
 => [internal] load build definition from Dockerfile.whisper.6                                 0.0s
 => => transferring dockerfile: 1.95kB                                                         0.0s
Dockerfile.whisper.6:55
--------------------
  54 |     # Final Server Script
  55 | >>> COPY <<EOF /app/main.py
  56 | >>> from fastapi import FastAPI, File
  57 |
--------------------
ERROR: failed to build: failed to solve: unterminated heredoc
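BuildKit parses `COPY <<EOF` heredocs while loading the build definition, and the delimiter must reappear alone on its own line; the first Dockerfile.whisper.6 never closed it. A minimal sketch of the correct shape (file contents abbreviated):

```dockerfile
# The heredoc body runs until a line containing only the delimiter.
COPY <<EOF /app/main.py
from fastapi import FastAPI, File
app = FastAPI()
EOF
```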
steven@CSO:~/whisper$ nano Dockerfile.whisper.6
steven@CSO:~/whisper$ rm -rf Dockerfile.whisper.6
steven@CSO:~/whisper$ nano Dockerfile.whisper.6
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper.6 .
[+] Building 27.2s (15/18)                                                           docker:default
 => [internal] load build definition from Dockerfile.whisper.6                                 0.0s
 => => transferring dockerfile: 3.05kB                                                         0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04          0.3s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => [internal] preparing inline document                                                       0.0s
 => [builder  1/10] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a1  0.0s
 => CACHED [stage-1 2/5] WORKDIR /app                                                          0.0s
 => CACHED [builder  2/10] RUN apt-get update && apt-get install -y     python3.11 python3.11  0.0s
 => CACHED [builder  3/10] WORKDIR /app                                                        0.0s
 => CACHED [builder  4/10] RUN python3.11 -m venv /opt/venv                                    0.0s
 => CACHED [builder  5/10] RUN pip install --no-cache-dir --upgrade pip &&     pip install --  0.0s
 => CACHED [builder  6/10] RUN pip install --no-cache-dir     torch==2.4.1+cu124     torchvis  0.0s
 => CACHED [builder  7/10] RUN pip install --no-cache-dir --no-deps     transformers     insa  0.0s
 => [builder  8/10] RUN pip install --no-cache-dir     pyyaml requests tqdm numpy regex sente  2.6s
 => [builder  9/10] RUN pip install --no-cache-dir flash-attn --no-build-isolation            22.6s
 => ERROR [builder 10/10] RUN python3 -c "from transformers import pipeline; pipeline('automa  1.6s
------
 > [builder 10/10] RUN python3 -c "from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')":
1.337 Traceback (most recent call last):
1.337   File "<string>", line 1, in <module>
1.337   File "/opt/venv/lib/python3.11/site-packages/transformers/__init__.py", line 30, in <module>
1.337     from . import dependency_versions_check
1.337   File "/opt/venv/lib/python3.11/site-packages/transformers/dependency_versions_check.py", line 16, in <module>
1.337     from .utils.versions import require_version, require_version_core
1.337   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/__init__.py", line 22, in <module>
1.337     from .auto_docstring import (
1.337   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/auto_docstring.py", line 32, in <module>
1.337     from .generic import ModelOutput
1.337   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/generic.py", line 45, in <module>
1.337     from ..model_debugging_utils import model_addition_debugger_context
1.337   File "/opt/venv/lib/python3.11/site-packages/transformers/model_debugging_utils.py", line 29, in <module>
1.337     from safetensors.torch import save_file
1.337 ModuleNotFoundError: No module named 'safetensors'
------
Dockerfile.whisper.6:44
--------------------
  42 |
  43 |     # 6. Bake the model (The "Long Wait" Step)
  44 | >>> RUN python3 -c "from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')"
  45 |
  46 |     # STAGE 2: Final Runtime
--------------------
ERROR: failed to build: failed to solve: process "/bin/sh -c python3 -c \"from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')\"" did not complete successfully: exit code: 1
steven@CSO:~/whisper$ nano Dockerfile.whisper.7
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper.7 .
[+] Building 41.3s (15/18)                                                           docker:default
 => [internal] load build definition from Dockerfile.whisper.7                                 0.0s
 => => transferring dockerfile: 2.89kB                                                         0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04          0.3s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => [internal] preparing inline document                                                       0.0s
 => [builder  1/10] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a1  0.0s
 => CACHED [builder  2/10] RUN apt-get update && apt-get install -y     python3.11 python3.11  0.0s
 => CACHED [builder  3/10] WORKDIR /app                                                        0.0s
 => CACHED [builder  4/10] RUN python3.11 -m venv /opt/venv                                    0.0s
 => CACHED [builder  5/10] RUN pip install --no-cache-dir --upgrade pip &&     pip install --  0.0s
 => CACHED [builder  6/10] RUN pip install --no-cache-dir     torch==2.4.1+cu124     torchvis  0.0s
 => CACHED [builder  7/10] RUN pip install --no-cache-dir --no-deps     transformers     insa  0.0s
 => CACHED [stage-1 2/5] WORKDIR /app                                                          0.0s
 => [builder  8/10] RUN pip install --no-cache-dir     pyyaml requests tqdm numpy regex sent  17.4s
 => [builder  9/10] RUN pip install --no-cache-dir flash-attn --no-build-isolation            21.2s
 => ERROR [builder 10/10] RUN python3 -c "from transformers import pipeline; pipeline('automa  2.4s
------
 > [builder 10/10] RUN python3 -c "from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')":
2.038 Traceback (most recent call last):
2.038   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2169, in __getattr__
2.038     module = self._get_module(self._class_to_module[name])
2.038              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.038   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2403, in _get_module
2.038     raise e
2.038   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2401, in _get_module
2.039     return importlib.import_module("." + module_name, self.__name__)
2.039            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.039   File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
2.040     return _bootstrap._gcd_import(name[level:], package, level)
2.040            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.040   File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
2.040   File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
2.040   File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
2.040   File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
2.040   File "<frozen importlib._bootstrap_external>", line 940, in exec_module
2.040   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
2.040   File "/opt/venv/lib/python3.11/site-packages/transformers/integrations/ggml.py", line 23, in <module>
2.040     from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers, processors
2.040 ModuleNotFoundError: No module named 'tokenizers'
2.041
2.041 The above exception was the direct cause of the following exception:
2.041
2.041 Traceback (most recent call last):
2.041   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2169, in __getattr__
2.041     module = self._get_module(self._class_to_module[name])
2.041              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.041   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2403, in _get_module
2.042     raise e
2.042   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2401, in _get_module
2.042     return importlib.import_module("." + module_name, self.__name__)
2.042            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.042   File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
2.042     return _bootstrap._gcd_import(name[level:], package, level)
2.042            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.042   File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
2.042   File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
2.042   File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
2.042   File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
2.042   File "<frozen importlib._bootstrap_external>", line 940, in exec_module
2.042   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
2.042   File "/opt/venv/lib/python3.11/site-packages/transformers/pipelines/__init__.py", line 24, in <module>
2.042     from ..configuration_utils import PreTrainedConfig
2.043   File "/opt/venv/lib/python3.11/site-packages/transformers/configuration_utils.py", line 33, in <module>
2.043     from .modeling_gguf_pytorch_utils import load_gguf_checkpoint
2.043   File "/opt/venv/lib/python3.11/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 22, in <module>
2.043     from .integrations import (
2.043   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2257, in __getattr__
2.044     raise ModuleNotFoundError(
2.044 ModuleNotFoundError: Could not import module 'GGUF_CONFIG_DEFAULTS_MAPPING'. Are this object's requirements defined correctly?
2.044
2.044 The above exception was the direct cause of the following exception:
2.044
2.044 Traceback (most recent call last):
2.044   File "<string>", line 1, in <module>
2.044   File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2257, in __getattr__
2.044     raise ModuleNotFoundError(
2.044 ModuleNotFoundError: Could not import module 'pipeline'. Are this object's requirements defined correctly?
------
Dockerfile.whisper.7:44
--------------------
  42 |
  43 |     # 6. Bake the model (This will start the ~3GB download once imports pass)
  44 | >>> RUN python3 -c "from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')"
  45 |
  46 |     # STAGE 2: Final Runtime
--------------------
ERROR: failed to build: failed to solve: process "/bin/sh -c python3 -c \"from transformers import pipeline; pipeline('automatic-speech-recognition', model='openai/whisper-large-v3')\"" did not complete successfully: exit code: 1
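These three failures in sequence (`httpx`, `safetensors`, `tokenizers`) are the manual-dependency tax of installing transformers with `--no-deps`: each `ModuleNotFoundError` names exactly one missing pip package. A sketch of the consolidated fix, assuming the package list from the later cached layers (the name truncated to `sentencep…` in the progress output is presumably sentencepiece):

```dockerfile
# transformers was installed with --no-deps, so its runtime requirements
# must be listed explicitly; each ModuleNotFoundError maps 1:1 to a package.
RUN pip install --no-cache-dir \
    pyyaml requests tqdm numpy regex sentencepiece \
    httpx safetensors tokenizers
```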
steven@CSO:~/whisper$ nano Dockerfile.whisper.8
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper.8 .
[+] Building 188.0s (19/19) FINISHED                                                 docker:default
 => [internal] load build definition from Dockerfile.whisper.8                                 0.0s
 => => transferring dockerfile: 2.89kB                                                         0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04          0.3s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => CACHED [internal] preparing inline document                                                0.0s
 => [builder  1/10] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a1  0.0s
 => CACHED [stage-1 2/5] WORKDIR /app                                                          0.0s
 => CACHED [builder  2/10] RUN apt-get update && apt-get install -y     python3.11 python3.11  0.0s
 => CACHED [builder  3/10] WORKDIR /app                                                        0.0s
 => CACHED [builder  4/10] RUN python3.11 -m venv /opt/venv                                    0.0s
 => CACHED [builder  5/10] RUN pip install --no-cache-dir --upgrade pip &&     pip install --  0.0s
 => CACHED [builder  6/10] RUN pip install --no-cache-dir     torch==2.4.1+cu124     torchvis  0.0s
 => CACHED [builder  7/10] RUN pip install --no-cache-dir --no-deps     transformers     insa  0.0s
 => [builder  8/10] RUN pip install --no-cache-dir     pyyaml requests tqdm numpy regex sent  17.1s
 => [builder  9/10] RUN pip install --no-cache-dir flash-attn --no-build-isolation            22.4s
 => [builder 10/10] RUN python3 -c "from transformers import pipeline; pipeline('automatic-s  74.7s
 => [stage-1 3/5] COPY --from=builder /opt/venv /opt/venv                                     25.8s
 => [stage-1 4/5] COPY --from=builder /root/.cache/huggingface /root/.cache/huggingface        6.2s
 => [stage-1 5/5] COPY <<EOF /app/main.py                                                      0.0s
 => exporting to image                                                                        23.2s
 => => exporting layers                                                                       23.2s
 => => writing image sha256:d2ff1e3f729f57a733366950eddc7a04a62d47c35ba06517d07a20a5448ed60e   0.0s
 => => naming to docker.io/library/streamwhisper:latest                                        0.0s
steven@CSO:~/whisper$ nano Dockerfile.whisper.9
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper.9 .
[+] Building 0.1s (1/1) FINISHED                                                     docker:default
 => [internal] load build definition from Dockerfile.whisper.9                                 0.0s
 => => transferring dockerfile: 1.78kB                                                         0.0s
Dockerfile.whisper.9:52
--------------------
  50 |     WORKDIR /app
  51 |     COPY --from=builder /opt/venv /opt/venv
  52 | >>> COPY --
  53 |
--------------------
ERROR: failed to build: failed to solve: dockerfile parse error on line 52: COPY requires at least two arguments, but only one was provided. Destination could not be determined
steven@CSO:~/whisper$ rm -rf Dockerfile.whisper.9
steven@CSO:~/whisper$ nano Dockerfile.whisper.9
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper.9 .
[+] Building 0.6s (20/20) FINISHED                                                   docker:default
 => [internal] load build definition from Dockerfile.whisper.9                                 0.0s
 => => transferring dockerfile: 3.12kB                                                         0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04          0.4s
 => [auth] nvidia/cuda:pull token for registry-1.docker.io                                     0.0s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => [builder  1/10] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a1  0.0s
 => [internal] preparing inline document                                                       0.0s
 => CACHED [stage-1 2/5] WORKDIR /app                                                          0.0s
 => CACHED [builder  2/10] RUN apt-get update && apt-get install -y     python3.11 python3.11  0.0s
 => CACHED [builder  3/10] WORKDIR /app                                                        0.0s
 => CACHED [builder  4/10] RUN python3.11 -m venv /opt/venv                                    0.0s
 => CACHED [builder  5/10] RUN pip install --no-cache-dir --upgrade pip &&     pip install --  0.0s
 => CACHED [builder  6/10] RUN pip install --no-cache-dir     torch==2.4.1+cu124     torchvis  0.0s
 => CACHED [builder  7/10] RUN pip install --no-cache-dir --no-deps     transformers     insa  0.0s
 => CACHED [builder  8/10] RUN pip install --no-cache-dir     pyyaml requests tqdm numpy rege  0.0s
 => CACHED [builder  9/10] RUN pip install --no-cache-dir flash-attn --no-build-isolation      0.0s
 => CACHED [builder 10/10] RUN python3 -c "from transformers import pipeline; pipeline('autom  0.0s
 => CACHED [stage-1 3/5] COPY --from=builder /opt/venv /opt/venv                               0.0s
 => CACHED [stage-1 4/5] COPY --from=builder /root/.cache/huggingface /root/.cache/huggingfac  0.0s
 => [stage-1 5/5] COPY <<EOF /app/main.py                                                      0.0s
 => exporting to image                                                                         0.0s
 => => exporting layers                                                                        0.0s
 => => writing image sha256:c0f5c233779188215eb49154502051254316db97cdfd342cd1e55e11c7124e30   0.0s
 => => naming to docker.io/library/streamwhisper:latest                                        0.0s
steven@CSO:~/whisper$ nano Dockerfile.whisper.10
steven@CSO:~/whisper$ nano Dockerfile.whisper.10
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper.10 .
[+] Building 200.1s (14/18)                                                          docker:default
 => [internal] load build definition from Dockerfile.whisper.10                                0.0s
 => => transferring dockerfile: 3.02kB                                                         0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04          0.3s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => [internal] preparing inline document                                                       0.0s
 => CACHED [builder 1/9] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622  0.0s
 => CACHED [builder 2/9] RUN apt-get update && apt-get install -y     python3.11 python3.11-v  0.0s
 => CACHED [builder 3/9] WORKDIR /app                                                          0.0s
 => CACHED [builder 4/9] RUN python3.11 -m venv /opt/venv                                      0.0s
 => [builder 5/9] RUN pip install --no-cache-dir     torch==2.4.1+cu124 torchvision==0.19.1  171.3s
 => [stage-1 2/6] RUN apt-get update && apt-get install -y     python3.11 ffmpeg     && rm -  45.5s
 => [stage-1 3/6] WORKDIR /app                                                                 0.0s
 => [builder 6/9] RUN pip install --no-cache-dir --no-deps     transformers insanely-fast-whi  6.0s
 => [builder 7/9] RUN pip install --no-cache-dir     pyyaml requests tqdm numpy regex senten  19.8s
 => ERROR [builder 8/9] RUN pip install --no-cache-dir flash-attn --no-build-isolation         2.7s
------
 > [builder 8/9] RUN pip install --no-cache-dir flash-attn --no-build-isolation:
0.639 Collecting flash-attn
0.734   Downloading flash_attn-2.8.3.tar.gz (8.4 MB)
1.098      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.4/8.4 MB 23.4 MB/s eta 0:00:00
2.359   Preparing metadata (setup.py): started
2.484   Preparing metadata (setup.py): finished with status 'error'
2.486   error: subprocess-exited-with-error
2.486
2.486   × python setup.py egg_info did not run successfully.
2.486   │ exit code: 1
2.486   ╰─> [6 lines of output]
2.486       Traceback (most recent call last):
2.486         File "<string>", line 2, in <module>
2.486         File "<pip-setuptools-caller>", line 34, in <module>
2.486         File "/tmp/pip-install-u7ise9jk/flash-attn_27a3af69bc014ffe85f3de78c9fbc3e9/setup.py", line 20, in <module>
2.486           from wheel.bdist_wheel import bdist_wheel as _bdist_wheel
2.486       ModuleNotFoundError: No module named 'wheel'
2.486       [end of output]
2.486
2.486   note: This error originates from a subprocess, and is likely not a problem with pip.
2.487 error: metadata-generation-failed
2.487
2.487 × Encountered error while generating package metadata.
2.487 ╰─> See above for output.
2.487
2.487 note: This is an issue with the package mentioned above, not pip.
2.487 hint: See above for details.
------
Dockerfile.whisper.10:29
--------------------
  27 |
  28 |     # 3. Compile Flash Attention (CACHED)
  29 | >>> RUN pip install --no-cache-dir flash-attn --no-build-isolation
  30 |
  31 |     # 4. Bake the model (CACHED)
--------------------
ERROR: failed to build: failed to solve: process "/bin/sh -c pip install --no-cache-dir flash-attn --no-build-isolation" did not complete successfully: exit code: 1
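Dockerfile.whisper.10 rebuilt the venv from scratch, so `wheel` was no longer present, and `--no-build-isolation` means flash-attn's `setup.py` imports `wheel` from the venv itself rather than from an isolated build environment. A sketch of the fix, matching the `builder 5/10` step visible in the next build:

```dockerfile
# With --no-build-isolation, the venv must already carry the build toolchain
# that flash-attn's setup.py imports at metadata-generation time.
RUN pip install --no-cache-dir --upgrade pip setuptools wheel packaging
RUN pip install --no-cache-dir flash-attn --no-build-isolation
```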
steven@CSO:~/whisper$ nano Dockerfile.whisper.11
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper.11 .
[+] Building 359.9s (21/21) FINISHED                                                docker:default
 => [internal] load build definition from Dockerfile.whisper.11                               0.0s
 => => transferring dockerfile: 2.91kB                                                        0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04         0.4s
 => [auth] nvidia/cuda:pull token for registry-1.docker.io                                    0.0s
 => [internal] load .dockerignore                                                             0.0s
 => => transferring context: 2B                                                               0.0s
 => [builder  1/10] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a  0.0s
 => CACHED [internal] preparing inline document                                               0.0s
 => CACHED [builder  2/10] RUN apt-get update && apt-get install -y     python3.11 python3.1  0.0s
 => CACHED [builder  3/10] WORKDIR /app                                                       0.0s
 => CACHED [builder  4/10] RUN python3.11 -m venv /opt/venv                                   0.0s
 => [builder  5/10] RUN pip install --no-cache-dir --upgrade pip setuptools wheel packaging   2.9s
 => CACHED [stage-1 2/6] RUN apt-get update && apt-get install -y     python3.11 ffmpeg       0.0s
 => CACHED [stage-1 3/6] WORKDIR /app                                                         0.0s
 => [builder  6/10] RUN pip install --no-cache-dir     torch==2.4.1+cu124 torchvision==0.1  167.9s
 => [builder  7/10] RUN pip install --no-cache-dir --no-deps     transformers insanely-fast-  6.1s
 => [builder  8/10] RUN pip install --no-cache-dir     pyyaml requests tqdm numpy regex sen  17.8s
 => [builder  9/10] RUN pip install --no-cache-dir flash-attn --no-build-isolation           21.0s
 => [builder 10/10] RUN python3 -c "from transformers import pipeline; pipeline('automatic-  74.4s
 => [stage-1 4/6] COPY --from=builder /opt/venv /opt/venv                                    22.9s
 => [stage-1 5/6] COPY --from=builder /root/.cache/huggingface /root/.cache/huggingface       6.5s
 => [stage-1 6/6] COPY <<EOF /app/main.py                                                     0.0s
 => exporting to image                                                                       22.8s
 => => exporting layers                                                                      22.8s
 => => writing image sha256:9c12899036f867f3b45f9cdc03e368568c36c2f492e36d074a6953828c988f5c  0.0s
 => => naming to docker.io/library/streamwhisper:latest                                       0.0s
steven@CSO:~/whisper$ nano Dockerfile.whisper.12
steven@CSO:~/whisper$ docker build -t streamwhisper:latest -f Dockerfile.whisper.12 .
[+] Building 104.4s (11/21)                                                         docker:default
 => [internal] load build definition from Dockerfile.whisper.12                               0.0s
 => => transferring dockerfile: 3.29kB                                                        0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04         0.4s
 => [auth] nvidia/cuda:pull token for registry-1.docker.io                                    0.0s
 => [internal] load .dockerignore                                                             0.0s
 => => transferring context: 2B                                                               0.0s
 => CACHED [builder  1/11] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:  0.0s
 => [internal] preparing inline document                                                      0.0s
 => [builder  2/11] RUN apt-get update && apt-get install -y     python3.11 python3.11-venv  42.4s
 => [builder  3/11] WORKDIR /app                                                              0.0s
 => [builder  4/11] RUN python3.11 -m venv /opt/venv                                          2.4s
 => [builder  5/11] RUN pip install --no-cache-dir --upgrade pip setuptools wheel packaging   2.6s
 => CANCELED [builder  6/11] RUN pip install --no-cache-dir     torch==2.4.1+cu124 torchvis  56.4s

 2 warnings found (use docker --debug to expand):
 - SecretsUsedInArgOrEnv: Do not use ARG or ENV instructions for sensitive data (ARG "HF_TOKEN") (line 5)
 - SecretsUsedInArgOrEnv: Do not use ARG or ENV instructions for sensitive data (ENV "HF_TOKEN") (line 6)
ERROR: failed to build: failed to solve: Canceled: context canceled
steven@CSO:~/whisper$ export MY_TOKEN=$(kubectl get secret hf-token -o jsonpath='{.data.token}' |
base64 --decode)
steven@CSO:~/whisper$ docker build -t streamwhisper:latest --build-arg HF_TOKEN=$MY_TOKEN -f Dockerfile.whisper.12 .
[+] Building 373.0s (21/21) FINISHED                                                docker:default
 => [internal] load build definition from Dockerfile.whisper.12                               0.0s
 => => transferring dockerfile: 3.29kB                                                        0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04         0.3s
 => [internal] load .dockerignore                                                             0.0s
 => => transferring context: 2B                                                               0.0s
 => CACHED [internal] preparing inline document                                               0.0s
 => [builder  1/11] FROM docker.io/nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04@sha256:622e78a  0.0s
 => CACHED [builder  2/11] RUN apt-get update && apt-get install -y     python3.11 python3.1  0.0s
 => CACHED [builder  3/11] WORKDIR /app                                                       0.0s
 => CACHED [builder  4/11] RUN python3.11 -m venv /opt/venv                                   0.0s
 => CACHED [builder  5/11] RUN pip install --no-cache-dir --upgrade pip setuptools wheel pac  0.0s
 => [builder  6/11] RUN pip install --no-cache-dir     torch==2.4.1+cu124 torchvision==0.1  170.4s
 => [builder  7/11] RUN pip install --no-cache-dir     fastapi uvicorn starlette pydantic py  3.0s
 => [builder  8/11] RUN pip install --no-cache-dir --no-deps     transformers insanely-fast-  5.3s
 => [builder  9/11] RUN pip install --no-cache-dir     pyyaml requests tqdm numpy regex sen  16.9s
 => [builder 10/11] RUN pip install --no-cache-dir flash-attn --no-build-isolation           22.9s
 => [builder 11/11] RUN python3 -c "from transformers import pipeline; pipeline('automatic-  74.1s
 => CACHED [stage-1 2/6] RUN apt-get update && apt-get install -y     python3.11 ffmpeg       0.0s
 => CACHED [stage-1 3/6] WORKDIR /app                                                         0.0s
 => [stage-1 4/6] COPY --from=builder /opt/venv /opt/venv                                    27.1s
 => [stage-1 5/6] COPY --from=builder /root/.cache/huggingface /root/.cache/huggingface       8.4s
 => [stage-1 6/6] COPY <<EOF /app/main.py                                                     0.1s
 => exporting to image                                                                       25.3s
 => => exporting layers                                                                      25.3s
 => => writing image sha256:5196a989b7101dad6a12e8a2dd01bab45fea29dccda69ded5f639c92717f2dc4  0.0s
 => => naming to docker.io/library/streamwhisper:latest                                       0.0s

 2 warnings found (use docker --debug to expand):
 - SecretsUsedInArgOrEnv: Do not use ARG or ENV instructions for sensitive data (ARG "HF_TOKEN") (line 5)
 - SecretsUsedInArgOrEnv: Do not use ARG or ENV instructions for sensitive data (ENV "HF_TOKEN") (line 6)
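The two `SecretsUsedInArgOrEnv` warnings are legitimate: a token passed via `ARG`/`ENV` can be recovered from `docker history` on the final image. With a recent BuildKit, a secret mount avoids this; a sketch (the `hf_token` id is a placeholder, and `bake_model.py` stands in for this Dockerfile's model-bake step):

```dockerfile
# syntax=docker/dockerfile:1
# The token is mounted only for the duration of this RUN and is never
# written into an image layer or visible in `docker history`.
RUN --mount=type=secret,id=hf_token \
    HF_TOKEN=$(cat /run/secrets/hf_token) python3 bake_model.py
```

The matching build invocation would then read the token from the environment rather than a build arg, e.g. `docker build --secret id=hf_token,env=MY_TOKEN -t streamwhisper:latest -f Dockerfile.whisper.12 .`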

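The step ordering in Dockerfile.whisper.12 above is the hybrid install strategy from lesson 3. A condensed sketch (the package lists are truncated in the build log, so the trailing names here are assumptions):

```dockerfile
# Web stack: let pip resolve FastAPI's plumbing (starlette, pydantic, anyio) normally
RUN pip install --no-cache-dir fastapi uvicorn starlette pydantic

# AI stack: --no-deps so nothing can silently replace the pinned +cu124 torch
RUN pip install --no-cache-dir --no-deps transformers insanely-fast-whisper

# The AI stack's runtime requirements, installed explicitly
# (list abbreviated; remainder is truncated in the log above)
RUN pip install --no-cache-dir pyyaml requests tqdm numpy regex
```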
Terminal 2 Output

Testing Insanely Fast Whisper

steven@CSO:~$ nano whisper-server.yaml
steven@CSO:~$ kubectl apply -f whisper-server.yaml
deployment.apps/whisper-server created
service/whisper-service created
steven@CSO:~$ kubectl describe pod whisper-server-779f955f5c-j2vr7
Name:             whisper-server-779f955f5c-j2vr7
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app=whisper-server
                  pod-template-hash=779f955f5c
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/whisper-server-779f955f5c
Containers:
  whisper-server:
    Image:      streamwhisper:latest
    Port:       8001/TCP
    Host Port:  0/TCP
    Limits:
      memory:          20Gi
      nvidia.com/gpu:  1
    Requests:
      memory:          20Gi
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-w5447 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-w5447:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  102s (x3 over 12m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
steven@CSO:~$ nano whisper-server.yaml
steven@CSO:~$ kubectl delete -f whisper-server.yaml
deployment.apps "whisper-server" deleted from default namespace
service "whisper-service" deleted from default namespace
steven@CSO:~$ kubectl apply -f whisper-server.yaml
deployment.apps/whisper-server created
service/whisper-service created
steven@CSO:~$ kubectl describe pod whisper-server-6486568f56-rsv4v
Name:             whisper-server-6486568f56-rsv4v
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app=whisper-server
                  pod-template-hash=6486568f56
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/whisper-server-6486568f56
Containers:
  whisper-server:
    Image:      streamwhisper:latest
    Port:       8001/TCP
    Host Port:  0/TCP
    Limits:
      memory:          8Gi
      nvidia.com/gpu:  1
    Requests:
      memory:          8Gi
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8l2r4 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-8l2r4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  18s   default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
steven@CSO:~$ kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable."nvidia\.com\/gpu"
NAME       GPU
minikube   1
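An `Insufficient nvidia.com/gpu` event combined with an allocatable count of 1 means the GPU exists but is already claimed by another pod. One way to find the holder (here it turned out to be vllm-server) is to list each pod's GPU limit; this mirrors the dot-escaping convention used in the query above:

```shell
# List every pod's nvidia.com/gpu limit to see which one holds the single device
kubectl get pods -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,GPU:.spec.containers[*].resources.limits.nvidia\.com/gpu'
```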
steven@CSO:~$ kubectl delete -f vllm-qwen.yaml
deployment.apps "vllm-server" deleted from default namespace
service "vllm-service" deleted from default namespace
steven@CSO:~$ kubectl delete -f whisper-server.yaml
deployment.apps "whisper-server" deleted from default namespace
service "whisper-service" deleted from default namespace
steven@CSO:~$ kubectl apply -f whisper-server.yaml
deployment.apps/whisper-server created
service/whisper-service created

steven@CSO:~$ kubectl describe pod whisper-server-6486568f56-xrcg2
Name:             whisper-server-6486568f56-xrcg2
Namespace:        default
Priority:         0
Service Account:  default
Node:             minikube/192.168.49.2
Start Time:       Fri, 27 Mar 2026 17:39:19 -0400
Labels:           app=whisper-server
                  pod-template-hash=6486568f56
Annotations:      <none>
Status:           Running
IP:               10.244.0.58
IPs:
  IP:           10.244.0.58
Controlled By:  ReplicaSet/whisper-server-6486568f56
Containers:
  whisper-server:
    Container ID:   docker://4a27011f083c1c206893cea30dacb826cc92fab426e949b30ff9d626c9316a7d
    Image:          streamwhisper:latest
    Image ID:       docker://sha256:0aef498fd237777f54b6bf049c9250ceadcf682889e6041c75f3261f877e935f
    Port:           8001/TCP
    Host Port:      0/TCP
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 27 Mar 2026 17:39:42 -0400
      Finished:     Fri, 27 Mar 2026 17:39:45 -0400
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 27 Mar 2026 17:39:25 -0400
      Finished:     Fri, 27 Mar 2026 17:39:28 -0400
    Ready:          False
    Restart Count:  2
    Limits:
      memory:          8Gi
      nvidia.com/gpu:  1
    Requests:
      memory:          8Gi
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gddtm (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-gddtm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  40s                default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Normal   Scheduled         39s                default-scheduler  Successfully assigned default/whisper-server-6486568f56-xrcg2 to minikube
  Normal   Pulled            16s (x3 over 39s)  kubelet            Container image "streamwhisper:latest" already present on machine and can be accessed by the pod
  Normal   Created           16s (x3 over 39s)  kubelet            Container created
  Normal   Started           16s (x3 over 38s)  kubelet            Container started
  Warning  BackOff           13s (x2 over 30s)  kubelet            Back-off restarting failed container whisper-server in pod whisper-server-6486568f56-xrcg2_default(02ecffea-ebca-4f84-a8f2-8145b2a3921a)
steven@CSO:~$ # Force a restart of the pod to ensure a clean slate on the GPU
kubectl rollout restart deployment whisper-server
deployment.apps/whisper-server restarted


steven@CSO:~$ kubectl describe pod whisper-server-6486568f56-xrcg2
Name:             whisper-server-6486568f56-xrcg2
Namespace:        default
Priority:         0
Service Account:  default
Node:             minikube/192.168.49.2
Start Time:       Fri, 27 Mar 2026 17:39:19 -0400
Labels:           app=whisper-server
                  pod-template-hash=6486568f56
Annotations:      <none>
Status:           Running
IP:               10.244.0.58
IPs:
  IP:           10.244.0.58
Controlled By:  ReplicaSet/whisper-server-6486568f56
Containers:
  whisper-server:
    Container ID:   docker://b6f5a7dc6abf916f503c68b3f8bfe72c4a0d5aa1921c5e86e1bc1296732280a4
    Image:          streamwhisper:latest
    Image ID:       docker://sha256:0aef498fd237777f54b6bf049c9250ceadcf682889e6041c75f3261f877e935f
    Port:           8001/TCP
    Host Port:      0/TCP
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 27 Mar 2026 17:42:25 -0400
      Finished:     Fri, 27 Mar 2026 17:42:28 -0400
^C
steven@CSO:~$ kubectl logs whisper-server-6486568f56-xrcg2

==========
== CUDA ==
==========

CUDA Version 12.4.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Traceback (most recent call last):
  File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2169, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2403, in _get_module
    raise e
  File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2401, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/opt/venv/lib/python3.11/site-packages/transformers/pipelines/__init__.py", line 27, in <module>
    from ..image_processing_utils import BaseImageProcessor
  File "/opt/venv/lib/python3.11/site-packages/transformers/image_processing_utils.py", line 24, in <module>
    from .image_processing_base import BatchFeature, ImageProcessingMixin
  File "/opt/venv/lib/python3.11/site-packages/transformers/image_processing_base.py", line 25, in <module>
    from .image_utils import is_valid_image, load_image
  File "/opt/venv/lib/python3.11/site-packages/transformers/image_utils.py", line 53, in <module>
    from torchvision.transforms import InterpolationMode
  File "/opt/venv/lib/python3.11/site-packages/torchvision/__init__.py", line 10, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/torchvision/_meta_registrations.py", line 163, in <module>
    @torch.library.register_fake("torchvision::nms")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/torch/library.py", line 1087, in register
    use_lib._register_fake(
  File "/opt/venv/lib/python3.11/site-packages/torch/library.py", line 204, in _register_fake
    handle = entry.fake_impl.register(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/torch/_library/fake_impl.py", line 50, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: operator torchvision::nms does not exist

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/main.py", line 4, in <module>
    from transformers import pipeline
  File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2257, in __getattr__
    raise ModuleNotFoundError(
ModuleNotFoundError: Could not import module 'pipeline'. Are this object's requirements defined correctly?
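The `operator torchvision::nms does not exist` traceback is the classic symptom of a torchvision wheel built against a different torch: importing `transformers.pipeline` drags in torchvision, whose compiled C++ ops fail to register against the installed torch. A small helper for sanity-checking the pairing before baking an image; the matrix entries below come from the published release table and should be treated as assumptions to verify:

```python
# Known-good torch <-> torchvision pairs (assumption: taken from the
# published PyTorch/torchvision compatibility matrix; verify before relying on one).
TORCH_TORCHVISION = {
    "2.4.1": "0.19.1",
    "2.4.0": "0.19.0",
    "2.3.1": "0.18.1",
}

def compatible(torch_version: str, torchvision_version: str) -> bool:
    """Return True if the torchvision build matches the installed torch.

    Local build tags like "+cu124" are stripped before comparison, since
    the pairing is determined by the base version, not the CUDA tag.
    """
    base = torch_version.split("+")[0]
    return TORCH_TORCHVISION.get(base) == torchvision_version.split("+")[0]
```

Inside the container, `python3 -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"` reveals the actual pair to check.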
steven@CSO:~$ kubectl delete -f whisper-server.yaml
deployment.apps "whisper-server" deleted from default namespace
service "whisper-service" deleted from default namespace
steven@CSO:~$ kubectl apply -f whisper-server.yaml
deployment.apps/whisper-server created
service/whisper-service created
steven@CSO:~$ kubectl logs whisper-server-6486568f56-gzvqs

==========
== CUDA ==
==========

CUDA Version 12.4.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Traceback (most recent call last):
  File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2169, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2403, in _get_module
    raise e
  File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2401, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/opt/venv/lib/python3.11/site-packages/transformers/pipelines/__init__.py", line 27, in <module>
    from ..image_processing_utils import BaseImageProcessor
  File "/opt/venv/lib/python3.11/site-packages/transformers/image_processing_utils.py", line 24, in <module>
    from .image_processing_base import BatchFeature, ImageProcessingMixin
  File "/opt/venv/lib/python3.11/site-packages/transformers/image_processing_base.py", line 25, in <module>
    from .image_utils import is_valid_image, load_image
  File "/opt/venv/lib/python3.11/site-packages/transformers/image_utils.py", line 53, in <module>
    from torchvision.transforms import InterpolationMode
  File "/opt/venv/lib/python3.11/site-packages/torchvision/__init__.py", line 10, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/torchvision/_meta_registrations.py", line 163, in <module>
    @torch.library.register_fake("torchvision::nms")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/torch/library.py", line 1087, in register
    use_lib._register_fake(
  File "/opt/venv/lib/python3.11/site-packages/torch/library.py", line 204, in _register_fake
    handle = entry.fake_impl.register(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/torch/_library/fake_impl.py", line 50, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: operator torchvision::nms does not exist

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/main.py", line 4, in <module>
    from transformers import pipeline
  File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2257, in __getattr__
    raise ModuleNotFoundError(
ModuleNotFoundError: Could not import module 'pipeline'. Are this object's requirements defined correctly?
steven@CSO:~$ kubectl logs whisper-server-6486568f56-gzvqs

==========
== CUDA ==
==========

CUDA Version 12.4.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Traceback (most recent call last):
  File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2169, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2403, in _get_module
    raise e
  File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2401, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/opt/venv/lib/python3.11/site-packages/transformers/pipelines/__init__.py", line 27, in <module>
    from ..image_processing_utils import BaseImageProcessor
  File "/opt/venv/lib/python3.11/site-packages/transformers/image_processing_utils.py", line 24, in <module>
    from .image_processing_base import BatchFeature, ImageProcessingMixin
  File "/opt/venv/lib/python3.11/site-packages/transformers/image_processing_base.py", line 25, in <module>
    from .image_utils import is_valid_image, load_image
  File "/opt/venv/lib/python3.11/site-packages/transformers/image_utils.py", line 53, in <module>
    from torchvision.transforms import InterpolationMode
  File "/opt/venv/lib/python3.11/site-packages/torchvision/__init__.py", line 10, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/torchvision/_meta_registrations.py", line 163, in <module>
    @torch.library.register_fake("torchvision::nms")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/torch/library.py", line 1087, in register
    use_lib._register_fake(
  File "/opt/venv/lib/python3.11/site-packages/torch/library.py", line 204, in _register_fake
    handle = entry.fake_impl.register(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/torch/_library/fake_impl.py", line 50, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: operator torchvision::nms does not exist

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/main.py", line 4, in <module>
    from transformers import pipeline
  File "/opt/venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 2257, in __getattr__
    raise ModuleNotFoundError(
ModuleNotFoundError: Could not import module 'pipeline'. Are this object's requirements defined correctly?
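
The `torchvision::nms does not exist` error above is the version-pinning trap from Lesson 1 surfacing at runtime: torchvision's compiled C++ ops were built against a different torch than the one in the venv, so the operator registry comes up empty and `transformers`' pipeline import collapses. One way out is to install torch and torchvision as a matched pair from the same cu124 index; a minimal Dockerfile sketch (the version numbers are illustrative, not the exact pins shipped):

```dockerfile
# Sketch: install torch/torchvision/torchaudio as a matched set from the
# same cu124 index, so torchvision's C++ ops (torchvision::nms among them)
# register against the torch build they were compiled for.
RUN pip install --no-deps \
    torch==2.4.0+cu124 torchvision==0.19.0+cu124 torchaudio==2.4.0+cu124 \
    --index-url https://download.pytorch.org/whl/cu124
```

Mixing a `+cu124` torch with a default-index torchvision reproduces exactly this traceback, because pip will happily co-install ABI-incompatible wheels when `--no-deps` suppresses the resolver.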
steven@CSO:~$ kubectl delete -f whisper-server.yaml
deployment.apps "whisper-server" deleted from default namespace
service "whisper-service" deleted from default namespace
steven@CSO:~$ kubectl apply -f whisper-server.yaml
deployment.apps/whisper-server created
service/whisper-service created
steven@CSO:~$ kubectl logs whisper-server-6486568f56-bbz6z
Traceback (most recent call last):
  File "/app/main.py", line 1, in <module>
    from fastapi import FastAPI, File, UploadFile
  File "/opt/venv/lib/python3.11/site-packages/fastapi/__init__.py", line 5, in <module>
    from starlette import status as status
ModuleNotFoundError: No module named 'starlette'
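
This is the Recursive Dependency Failure from Lesson 3: installing FastAPI with `--no-deps` strips out its web plumbing, and the app dies at import time on `starlette`. A sketch of the hybrid install that lets the web stack resolve naturally while keeping the AI stack in the straitjacket (package list is illustrative):

```dockerfile
# Sketch of the hybrid approach: the web stack resolves its own
# dependencies (starlette, pydantic, anyio); the AI stack stays under
# --no-deps so nothing can drag in a CPU-only torch over the cu124 build.
RUN pip install fastapi "uvicorn[standard]" python-multipart && \
    pip install --no-deps transformers accelerate
```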

Terminal 3 Output

With each service working on its own, I attempted to get both services (whisper-server and vllm-server) attached to the GPU and running at the same time. I was not successful during this sitting but will try again; to complete testing in this session, I simply ran them separately.
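
One approach worth trying for co-scheduling both pods on the single RTX 4060 is the NVIDIA device plugin's time-slicing mode, which advertises multiple replicas of `nvidia.com/gpu` from one physical card. A sketch of the configuration (untested in this session; the ConfigMap name and namespace are assumptions, and VRAM is still shared, so both models must fit together):

```yaml
# Sketch: NVIDIA k8s-device-plugin time-slicing config. With replicas: 2,
# the node reports two schedulable nvidia.com/gpu resources backed by the
# same physical GPU, letting both deployments request one each.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: kube-system
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
```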

tunas@MINI-Gaming-G1:~$ kubectl delete -f vllm-qwen.yaml
deployment.apps "vllm-server" deleted from default namespace
service "vllm-service" deleted from default namespace
tunas@MINI-Gaming-G1:~$ kubectl get svc -A
NAMESPACE       NAME                                              TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                                        AGE
cert-manager    cert-manager                                      ClusterIP      10.96.68.185     <none>        9402/TCP                                       9d
cert-manager    cert-manager-webhook                              ClusterIP      10.96.255.243    <none>        443/TCP                                        9d
cfm-streaming   cfm-operator-controller-manager-metrics-service   ClusterIP      10.97.244.27     <none>        8443/TCP                                       9d
cfm-streaming   cfm-operator-webhook-service                      ClusterIP      10.105.176.237   <none>        443/TCP                                        9d
cfm-streaming   mynifi                                            ClusterIP      None             <none>        6007/TCP,5000/TCP                              9d
cfm-streaming   mynifi-web                                        LoadBalancer   10.104.227.241   127.0.0.1     443:31666/TCP,8443:31190/TCP                   9d
cld-streaming   cloudera-surveyor-service                         NodePort       10.104.39.241    <none>        8080:31659/TCP                                 9d
cld-streaming   my-cluster-kafka-bootstrap                        ClusterIP      10.105.253.110   <none>        9091/TCP,9092/TCP,9093/TCP                     9d
cld-streaming   my-cluster-kafka-brokers                          ClusterIP      None             <none>        9090/TCP,9091/TCP,8443/TCP,9092/TCP,9093/TCP   9d
cld-streaming   schema-registry-service                           NodePort       10.99.115.169    <none>        9090:30714/TCP                                 9d
default         embedding-server-service                          ClusterIP      10.102.75.210    <none>        80/TCP                                         3d7h
default         kubernetes                                        ClusterIP      10.96.0.1        <none>        443/TCP                                        10d
default         qdrant                                            ClusterIP      10.99.157.61     <none>        6333/TCP,6334/TCP                              3d8h
default         whisper-service                                   ClusterIP      10.107.2.89      <none>        8001/TCP                                       84s
ingress-nginx   ingress-nginx-controller                          NodePort       10.108.197.181   <none>        80:30662/TCP,443:30470/TCP                     9d
ingress-nginx   ingress-nginx-controller-admission                ClusterIP      10.100.118.146   <none>        443/TCP                                        9d
kube-system     kube-dns                                          ClusterIP      10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP                         10d
kube-system     metrics-server                                    ClusterIP      10.110.229.243   <none>        443/TCP                                        8d
tunas@MINI-Gaming-G1:~$ kubectl apply -f vllm-qwen.yaml
deployment.apps/vllm-server created
service/vllm-service created
tunas@MINI-Gaming-G1:~$ kubectl logs -l app=vllm-server --tail 3000
tunas@MINI-Gaming-G1:~$ kubectl logs -l app=vllm-server --tail 3000
tunas@MINI-Gaming-G1:~$ kubectl logs -l app=vllm-server --tail 300
tunas@MINI-Gaming-G1:~$ kubectl logs -l app=vllm-server --tail 300
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.0
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]   █▄█▀ █     █     █     █  model   Qwen/Qwen2.5-3B-Instruct
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen2.5-3B-Instruct', 'model': 'Qwen/Qwen2.5-3B-Instruct', 'max_model_len': 4096, 'quantization': 'bitsandbytes', 'enforce_eager': True, 'load_format': 'bitsandbytes', 'gpu_memory_utilization': 0.8}
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_SERVICE_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_SERVICE_HOST
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_PROTO
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_ADDR
tunas@MINI-Gaming-G1:~$ kubectl logs -l app=vllm-server --tail 300
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.0
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]   █▄█▀ █     █     █     █  model   Qwen/Qwen2.5-3B-Instruct
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen2.5-3B-Instruct', 'model': 'Qwen/Qwen2.5-3B-Instruct', 'max_model_len': 4096, 'quantization': 'bitsandbytes', 'enforce_eager': True, 'load_format': 'bitsandbytes', 'gpu_memory_utilization': 0.8}
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_SERVICE_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_SERVICE_HOST
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_PROTO
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_ADDR
(APIServer pid=1) INFO 03-28 00:34:07 [model.py:533] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=1) INFO 03-28 00:34:07 [model.py:1582] Using max model len 4096
(APIServer pid=1) INFO 03-28 00:34:07 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 03-28 00:34:07 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 03-28 00:34:07 [vllm.py:788] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 03-28 00:34:07 [vllm.py:799] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 03-28 00:34:07 [vllm.py:964] Cudagraph is disabled under eager mode
(APIServer pid=1) INFO 03-28 00:34:07 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=1) WARNING 03-28 00:34:09 [interface.py:525] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(EngineCore pid=101) INFO 03-28 00:34:14 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='Qwen/Qwen2.5-3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=bitsandbytes, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 
'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=101) WARNING 03-28 00:34:14 [interface.py:525] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(EngineCore pid=101) INFO 03-28 00:34:14 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.244.0.70:46601 backend=nccl
(EngineCore pid=101) INFO 03-28 00:34:14 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=101) INFO 03-28 00:34:14 [gpu_model_runner.py:4481] Starting to load model Qwen/Qwen2.5-3B-Instruct...
(EngineCore pid=101) INFO 03-28 00:34:16 [cuda.py:317] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=101) INFO 03-28 00:34:16 [flash_attn.py:598] Using FlashAttention version 2
(EngineCore pid=101) INFO 03-28 00:34:16 [bitsandbytes_loader.py:786] Loading weights with BitsAndBytes quantization. May take a while ...
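
Once the weights finish loading, the vLLM OpenAI-compatible endpoint can be smoke-tested through the service. A sketch, assuming the service exposes port 8000 (as the injected `VLLM_SERVICE_PORT_8000_*` variables suggest); this runs against the live cluster, so it is not reproducible offline:

```shell
# Forward the ClusterIP service locally, then hit the chat completions API.
kubectl port-forward svc/vllm-service 8000:8000 &
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-3B-Instruct",
       "messages": [{"role": "user", "content": "One sentence on stream processing."}],
       "max_tokens": 64}'
```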
tunas@MINI-Gaming-G1:~$ kubectl logs -l app=vllm-server --tail 300
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.0
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]   █▄█▀ █     █     █     █  model   Qwen/Qwen2.5-3B-Instruct
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen2.5-3B-Instruct', 'model': 'Qwen/Qwen2.5-3B-Instruct', 'max_model_len': 4096, 'quantization': 'bitsandbytes', 'enforce_eager': True, 'load_format': 'bitsandbytes', 'gpu_memory_utilization': 0.8}
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_SERVICE_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_SERVICE_HOST
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_PROTO
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_ADDR
(APIServer pid=1) INFO 03-28 00:34:07 [model.py:533] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=1) INFO 03-28 00:34:07 [model.py:1582] Using max model len 4096
(APIServer pid=1) INFO 03-28 00:34:07 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 03-28 00:34:07 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 03-28 00:34:07 [vllm.py:788] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 03-28 00:34:07 [vllm.py:799] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 03-28 00:34:07 [vllm.py:964] Cudagraph is disabled under eager mode
(APIServer pid=1) INFO 03-28 00:34:07 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=1) WARNING 03-28 00:34:09 [interface.py:525] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(EngineCore pid=101) INFO 03-28 00:34:14 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='Qwen/Qwen2.5-3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=bitsandbytes, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 
'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=101) WARNING 03-28 00:34:14 [interface.py:525] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(EngineCore pid=101) INFO 03-28 00:34:14 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.244.0.70:46601 backend=nccl
(EngineCore pid=101) INFO 03-28 00:34:14 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=101) INFO 03-28 00:34:14 [gpu_model_runner.py:4481] Starting to load model Qwen/Qwen2.5-3B-Instruct...
(EngineCore pid=101) INFO 03-28 00:34:16 [cuda.py:317] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=101) INFO 03-28 00:34:16 [flash_attn.py:598] Using FlashAttention version 2
(EngineCore pid=101) INFO 03-28 00:34:16 [bitsandbytes_loader.py:786] Loading weights with BitsAndBytes quantization. May take a while ...
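The seven "Unknown vLLM environment variable" warnings above are benign. For every Service a pod can see, Kubernetes injects Docker-link-style environment variables named after the Service (upper-cased, with `-` replaced by `_`). A Service called `vllm-service` therefore generates names starting with `VLLM_`, which vLLM's environment scanner mistakes for its own settings. A small sketch of the naming rule (the helper function is hypothetical, but the generated names match the warnings in the log):

```python
def service_link_env_vars(service_name: str, port: int, proto: str = "TCP") -> set[str]:
    # Kubernetes "service link" variables: SERVICE_HOST/SERVICE_PORT plus the
    # Docker-link-compatible PORT_<port>_<proto> family, one set per Service.
    prefix = service_name.upper().replace("-", "_")
    return {
        f"{prefix}_SERVICE_HOST",
        f"{prefix}_SERVICE_PORT",
        f"{prefix}_PORT",
        f"{prefix}_PORT_{port}_{proto}",
        f"{prefix}_PORT_{port}_{proto}_PROTO",
        f"{prefix}_PORT_{port}_{proto}_PORT",
        f"{prefix}_PORT_{port}_{proto}_ADDR",
    }

# A Service named "vllm-service" on port 8000 yields exactly the seven
# VLLM_SERVICE_* names that vLLM warns about.
print(sorted(service_link_env_vars("vllm-service", 8000)))
```

Renaming the Service so it doesn't start with `vllm`, or setting `enableServiceLinks: false` in the pod spec, silences these warnings.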
tunas@MINI-Gaming-G1:~$ kubectl logs -l app=vllm-server --tail 300
(EngineCore pid=101) INFO 03-28 00:36:18 [weight_utils.py:574] Time spent downloading weights for Qwen/Qwen2.5-3B-Instruct: 122.387839 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
tunas@MINI-Gaming-G1:~$ kubectl logs -l app=vllm-server --tail 300
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:06<00:06,  6.45s/it]
tunas@MINI-Gaming-G1:~$ kubectl logs -l app=vllm-server --tail 300
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:10<00:00,  5.08s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:10<00:00,  5.29s/it]
(EngineCore pid=101)
(EngineCore pid=101) INFO 03-28 00:36:29 [gpu_model_runner.py:4566] Model loading took 2.07 GiB memory and 134.839849 seconds
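The 2.07 GiB figure is consistent with 4-bit (NF4) BitsAndBytes weights plus some layers left in bf16. A rough back-of-the-envelope check (the parameter counts below are approximations for Qwen2.5-3B, not values from the log):

```python
# Sanity-check the reported weight memory for a ~3B model under NF4 quantization.
TOTAL_PARAMS = 3.09e9   # ~3.1B parameters overall (approximate)
EMBED_PARAMS = 0.31e9   # ~vocab 151936 x hidden 2048, assumed kept in bf16

BF16_BYTES = 2.0               # bf16: 2 bytes per parameter
NF4_BYTES = 0.5 + 2.0 / 64     # 4-bit weight + one fp16 scale per 64-param block

quantized_bytes = (TOTAL_PARAMS - EMBED_PARAMS) * NF4_BYTES
unquantized_bytes = EMBED_PARAMS * BF16_BYTES
total_gib = (quantized_bytes + unquantized_bytes) / 2**30
print(f"~{total_gib:.2f} GiB")  # lands in the neighborhood of the logged 2.07 GiB
```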
tunas@MINI-Gaming-G1:~$ kubectl logs -l app=vllm-server --tail 300
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.0
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]   █▄█▀ █     █     █     █  model   Qwen/Qwen2.5-3B-Instruct
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen2.5-3B-Instruct', 'model': 'Qwen/Qwen2.5-3B-Instruct', 'max_model_len': 4096, 'quantization': 'bitsandbytes', 'enforce_eager': True, 'load_format': 'bitsandbytes', 'gpu_memory_utilization': 0.8}
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_SERVICE_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_SERVICE_HOST
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_PROTO
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_ADDR
(APIServer pid=1) INFO 03-28 00:34:07 [model.py:533] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=1) INFO 03-28 00:34:07 [model.py:1582] Using max model len 4096
(APIServer pid=1) INFO 03-28 00:34:07 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 03-28 00:34:07 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 03-28 00:34:07 [vllm.py:788] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 03-28 00:34:07 [vllm.py:799] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 03-28 00:34:07 [vllm.py:964] Cudagraph is disabled under eager mode
(APIServer pid=1) INFO 03-28 00:34:07 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=1) WARNING 03-28 00:34:09 [interface.py:525] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(EngineCore pid=101) INFO 03-28 00:34:14 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='Qwen/Qwen2.5-3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=bitsandbytes, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 
'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=101) WARNING 03-28 00:34:14 [interface.py:525] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(EngineCore pid=101) INFO 03-28 00:34:14 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.244.0.70:46601 backend=nccl
(EngineCore pid=101) INFO 03-28 00:34:14 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=101) INFO 03-28 00:34:14 [gpu_model_runner.py:4481] Starting to load model Qwen/Qwen2.5-3B-Instruct...
(EngineCore pid=101) INFO 03-28 00:34:16 [cuda.py:317] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=101) INFO 03-28 00:34:16 [flash_attn.py:598] Using FlashAttention version 2
(EngineCore pid=101) INFO 03-28 00:34:16 [bitsandbytes_loader.py:786] Loading weights with BitsAndBytes quantization. May take a while ...
(EngineCore pid=101) INFO 03-28 00:36:18 [weight_utils.py:574] Time spent downloading weights for Qwen/Qwen2.5-3B-Instruct: 122.387839 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:06<00:06,  6.45s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:10<00:00,  5.08s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:10<00:00,  5.29s/it]
(EngineCore pid=101)
(EngineCore pid=101) INFO 03-28 00:36:29 [gpu_model_runner.py:4566] Model loading took 2.07 GiB memory and 134.839849 seconds
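The 122-second weight download logged above repeats on every cold start unless the HuggingFace cache outlives the pod. One way to persist it is to point the cache at a PersistentVolumeClaim via `HF_HOME` (a sketch with hypothetical volume and claim names, not the manifest used in this build):

```yaml
# Pod spec fragment: mount a PVC over the HF cache so weights download once
env:
  - name: HF_HOME
    value: /models/hf-cache
volumeMounts:
  - name: model-cache
    mountPath: /models/hf-cache
volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: vllm-model-cache
```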
tunas@MINI-Gaming-G1:~$ kubectl logs -l app=vllm-server --tail 300
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.0
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]   █▄█▀ █     █     █     █  model   Qwen/Qwen2.5-3B-Instruct
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:297]
(APIServer pid=1) INFO 03-28 00:34:01 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen2.5-3B-Instruct', 'model': 'Qwen/Qwen2.5-3B-Instruct', 'max_model_len': 4096, 'quantization': 'bitsandbytes', 'enforce_eager': True, 'load_format': 'bitsandbytes', 'gpu_memory_utilization': 0.8}
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_SERVICE_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_PORT
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_SERVICE_HOST
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_PROTO
(APIServer pid=1) WARNING 03-28 00:34:01 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_SERVICE_PORT_8000_TCP_ADDR
(APIServer pid=1) INFO 03-28 00:34:07 [model.py:533] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=1) INFO 03-28 00:34:07 [model.py:1582] Using max model len 4096
(APIServer pid=1) INFO 03-28 00:34:07 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 03-28 00:34:07 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 03-28 00:34:07 [vllm.py:788] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 03-28 00:34:07 [vllm.py:799] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 03-28 00:34:07 [vllm.py:964] Cudagraph is disabled under eager mode
(APIServer pid=1) INFO 03-28 00:34:07 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=1) WARNING 03-28 00:34:09 [interface.py:525] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(EngineCore pid=101) INFO 03-28 00:34:14 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='Qwen/Qwen2.5-3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=bitsandbytes, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 
'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=101) WARNING 03-28 00:34:14 [interface.py:525] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(EngineCore pid=101) INFO 03-28 00:34:14 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.244.0.70:46601 backend=nccl
(EngineCore pid=101) INFO 03-28 00:34:14 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=101) INFO 03-28 00:34:14 [gpu_model_runner.py:4481] Starting to load model Qwen/Qwen2.5-3B-Instruct...
(EngineCore pid=101) INFO 03-28 00:34:16 [cuda.py:317] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=101) INFO 03-28 00:34:16 [flash_attn.py:598] Using FlashAttention version 2
(EngineCore pid=101) INFO 03-28 00:34:16 [bitsandbytes_loader.py:786] Loading weights with BitsAndBytes quantization. May take a while ...
(EngineCore pid=101) INFO 03-28 00:36:18 [weight_utils.py:574] Time spent downloading weights for Qwen/Qwen2.5-3B-Instruct: 122.387839 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:06<00:06,  6.45s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:10<00:00,  5.08s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:10<00:00,  5.29s/it]
(EngineCore pid=101)
(EngineCore pid=101) INFO 03-28 00:36:29 [gpu_model_runner.py:4566] Model loading took 2.07 GiB memory and 134.839849 seconds
(EngineCore pid=101) INFO 03-28 00:36:39 [gpu_worker.py:456] Available KV cache memory: 3.75 GiB
(EngineCore pid=101) INFO 03-28 00:36:39 [kv_cache_utils.py:1316] GPU KV cache size: 109,264 tokens
(EngineCore pid=101) INFO 03-28 00:36:39 [kv_cache_utils.py:1321] Maximum concurrency for 4,096 tokens per request: 26.68x
(EngineCore pid=101) INFO 03-28 00:36:39 [core.py:281] init engine (profile, create kv cache, warmup model) took 9.49 seconds
(EngineCore pid=101) INFO 03-28 00:36:40 [vllm.py:754] Asynchronous scheduling is enabled.
(EngineCore pid=101) WARNING 03-28 00:36:40 [vllm.py:788] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=101) WARNING 03-28 00:36:40 [vllm.py:799] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=101) INFO 03-28 00:36:40 [vllm.py:964] Cudagraph is disabled under eager mode
(EngineCore pid=101) INFO 03-28 00:36:40 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=1) INFO 03-28 00:36:40 [api_server.py:576] Supported tasks: ['generate']
(APIServer pid=1) WARNING 03-28 00:36:40 [model.py:1376] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 03-28 00:36:42 [hf.py:320] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 03-28 00:36:42 [api_server.py:580] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 03-28 00:36:42 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
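Startup takes roughly two and a half minutes end to end (the weight download dominates), and the `/health` route only answers once the lines above appear. A Kubernetes readiness probe against that route keeps traffic away until then (hypothetical values; tune the delays to the observed startup time):

```yaml
# Deployment fragment for the vllm-server container
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 30
```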
tunas@MINI-Gaming-G1:~$ python3 query-rag-whisper.py
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 791, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 461, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 1448, in getresponse
    response.begin()
  File "/usr/lib/python3.12/http/client.py", line 336, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 305, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 845, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 472, in increment
    raise reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 791, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 461, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 1448, in getresponse
    response.begin()
  File "/usr/lib/python3.12/http/client.py", line 336, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 305, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tunas/query-rag-whisper.py", line 41, in <module>
    ask("What do you know about rice?")
  File "/home/tunas/query-rag-whisper.py", line 33, in ask
    resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
tunas@MINI-Gaming-G1:~$ python3 query-rag-whisper.py

=== ANSWER ===
Rice is often served in round bowls. It can be prepared in various ways, such as plain, with sides like peanut sauce, or in different flavored broths. It's a staple food that is consumed by billions worldwide and has numerous cultural and culinary associations.
tunas@MINI-Gaming-G1:~$ python3 query-rag-whisper.py

=== ANSWER ===
Rice is often served in round bowls. It can be prepared in various ways, such as plain, with sauce, or seasoned. Its round shape helps it cook evenly and absorb flavors.
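The first invocation of `query-rag-whisper.py` died with `RemoteDisconnected` because the request landed while vLLM was still loading weights; the immediate rerun succeeded. A client that polls `/health` before querying avoids the race. A minimal sketch along those lines (hypothetical; the real `query-rag-whisper.py` is not reproduced here):

```python
import time
import requests

BASE = "http://localhost:8000"


def build_payload(question: str, model: str = "Qwen/Qwen2.5-3B-Instruct") -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 256,
    }


def wait_for_health(timeout: float = 300.0) -> bool:
    """Poll vLLM's /health route until it answers 200 or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE}/health", timeout=2).status_code == 200:
                return True
        except requests.exceptions.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(5)
    return False


def ask(question: str) -> str:
    resp = requests.post(f"{BASE}/v1/chat/completions", json=build_payload(question))
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    if wait_for_health():
        print("\n=== ANSWER ===")
        print(ask("What do you know about rice?"))
```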