vLLM

Fleet skill: vLLM — software inventory and operations reference

fleet
by skynet, v1.0.0
software-vllmfleetfleetsoftware

Compatible Agents

claude-code, codex, gemini, kimi

Instruction

---
name: software-vllm
description: High-performance LLM inference server optimized for the GB10 Grace Blackwell architecture. Use for low-latency, high-throughput model serving via an OpenAI-compatible API.
metadata:
  author: skynet
  version: 1.0.0
---

# vLLM (Virtual Large Language Model)

vLLM is the primary high-throughput inference engine for James's fleet, specifically optimized for the GB10 Grace Blackwell GPU on **Spark**. It provides an OpenAI-compatible API and utilizes PagedAttention to manage KV cache memory efficiently.

## Fleet Installation Status

| Machine | Role | Status | Endpoint | Model |
|---------|------|--------|----------|-------|
| **Spark** (192.168.86.48) | Primary Inference | **Active** (Docker) | `http://spark:8001` | Qwen 32B (Default) |
| **Vault** (192.168.86.27) | API Routing | N/A | Proxies to Spark | via LiteLLM |
| **Dev Workstation** | Testing/Client | Client Only | N/A | Local development |

## Configuration & Environment

### Docker Deployment (Spark)

vLLM runs as a containerized service on Spark.

- **Image:** `vllm/vllm-openai:v0.12.0`
- **Container Name:** `vllm-server`
- **Compose Path:** `~/infra/docker-compose.yml` on Spark.
- **Volume Mounts:**
  - `~/models:/root/.cache/huggingface` (Model weights)
  - `/dev/shm:/dev/shm` (Shared memory for fast inference)

### Hardware Optimization (GB10 Grace Blackwell)

The GB10's unified memory architecture allows for massive context windows and zero-copy data transfer.

- **GPU Memory Utilization:** Set to `0.90` to leave headroom for the host system.
- **Enforce Eager Mode:** Use `--enforce-eager` if CUDA graphs cause stability issues on early Blackwell drivers.
- **Tensor Parallelism:** GB10 is a single powerful chip; `--tensor-parallel-size 1` is usually sufficient unless multi-GPU scaling is configured.
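
As an illustration of the hardware settings above, the following is a minimal sketch of how the three flags could be assembled into a server argument list. The helper name `hardware_flags` and its parameters are hypothetical and not part of the fleet's compose file; only the flag names and the `0.90` and `1` defaults come from this section.

```python
def hardware_flags(
    gpu_memory_utilization: float = 0.90,
    tensor_parallel_size: int = 1,
    enforce_eager: bool = False,
) -> list[str]:
    """Build the GB10-related vLLM flags described in this section.

    Hypothetical helper: it only formats arguments and does not start vLLM.
    """
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")

    flags = [
        "--gpu-memory-utilization", f"{gpu_memory_utilization:.2f}",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    # Only add eager mode when CUDA graphs are causing instability.
    if enforce_eager:
        flags.append("--enforce-eager")
    return flags


if __name__ == "__main__":
    print(" ".join(hardware_flags(enforce_eager=True)))
```

Keeping the flags in one place like this makes it easier to lower `gpu_memory_utilization` or toggle eager mode without editing the command by hand; how the resulting list is passed to the container depends on the compose file, which is not shown here.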

## Key Commands

### Managing the Service (on Spark)

```bash
# View logs and check for model loading progress
ssh spark "docker logs -f vllm-server"

# Restart the inference engine
ssh spark "docker restart vllm-server"

# Monitor GPU utilization (NVIDIA-SMI)
ssh spark "nvidia-smi -l 1"
```

### Direct API Interaction

Testing the endpoint from any machine in the fleet:

```bash
curl http://192.168.86.48:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-32b",
    "messages": [{"role": "user", "content": "System check."}]
  }'
```

## Common Workflows

### 1. Model Swap

To deploy a new model from HuggingFace:

1. SSH into Spark.
2. Update `HF_MODEL_ID` in `~/infra/.env`.
3. Run `docker compose up -d vllm-server`.
4. Monitor logs to ensure weights are downloaded to `~/models`.

### 2. Scaling with LiteLLM

The LiteLLM Gateway (Port 8000) handles failover between vLLM and Ollama.

- If vLLM is down, LiteLLM routes traffic to Ollama (running on Spark Port 11434).
- Use `http://spark:8000` for production workloads to ensure high availability.

### 3. Long Context Processing

Given the GB10's memory, vLLM is configured for 32k+ context lengths.

- Use the `--max-model-len` flag in the startup command to override default HuggingFace config limits if the hardware supports it.

## Troubleshooting

### Container Fails to Start

- **Symptom:** "CUDA out of memory"
- **Fix:** Check if Ollama or other Docker containers are hogging GPU memory. Run `nvidia-smi` to identify zombie processes. Reduce `--gpu-memory-utilization` to `0.85`.

### Slow Initial Inference (First Token Latency)

- **Cause:** Weights are being swapped or KV cache is being allocated.
- **Fix:** Ensure `--swap-space` is set (default 4GB) to handle KV cache overflow to CPU RAM if necessary.

### API Connection Refused

- **Check:** Ensure the container is bound to `0.0.0.0` and not `127.0.0.1`.
- **Command:** `ssh spark "netstat -tuln | grep 8001"`

## Fleet-Specific Patterns

- **Automation:** Always use the LiteLLM proxy (`spark:8000`) in Python scripts rather than hitting vLLM directly. This allows James to perform maintenance on vLLM without breaking active agents.
- **Model Storage:** All models are stored in `~/models` on Spark's NVMe drive. Never download models to individual workstations.
- **Inference Priority:** vLLM is preferred for heavy reasoning tasks (Qwen 32B/Llama 70B). Ollama is reserved for fast, utility-grade models (Llama 8B/Mistral).
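
The automation pattern above (Python scripts going through the LiteLLM proxy instead of vLLM directly) might look like the sketch below. It uses only the standard library. The `/v1/chat/completions` path and the `qwen-32b` model name are assumptions carried over from the direct curl example in this document; the proxy may expose a different model alias or require an API key, and the function names are hypothetical.

```python
import json
import urllib.request

# Assumption: the LiteLLM proxy on spark:8000 serves the OpenAI-style chat route.
LITELLM_URL = "http://spark:8000/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "qwen-32b") -> urllib.request.Request:
    """Create (but do not send) a chat completion request for the proxy."""
    payload = {
        "model": model,  # assumed alias; check the proxy's model list
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        LITELLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's message text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `send_chat("System check.")` from a fleet machine would exercise the same path as the curl example, but routed through port 8000, so a vLLM restart is handled by the proxy's failover rather than by the script.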

Install

curl -s https://skills.skynet.ceo/api/skills/software-vllm/skill.md