Introducing InferenceX: Open Source vLLM Benchmarking for Multi-GPU Inference

I built InferenceX, an open-source tool for benchmarking LLM inference across different GPU configurations. It measures throughput, latency, time to first token (TTFT), and real power efficiency (measured with nvidia-smi), and it works with any HuggingFace model that vLLM supports.

GitHub: https://github.com/strangeloopio/inferencex

Why I Built This

When deploying LLMs in production, you face questions that are hard to answer without real data:

  • How many GPUs should I use for my model?
  • What's the optimal number of concurrent users?
  • Is 4-GPU tensor parallelism worth the cost?
  • What's my actual power efficiency?

Existing benchmarks don't let you test YOUR model with YOUR configuration on YOUR cloud provider. I wanted a tool that:

  • works with any HuggingFace model
  • tests multiple GPU configurations automatically
  • measures real power consumption (not estimates)
  • runs on serverless GPUs
  • generates publication-ready comparison charts

Architecture

InferenceX runs on Modal's serverless GPU infrastructure. Here's how it works:

┌─────────────────────────────────────────────────────────────────┐
│                         Modal Cloud                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │   1 GPU     │  │   2 GPU     │  │   4 GPU     │              │
│  │  Endpoint   │  │  Endpoint   │  │  Endpoint   │              │
│  │  (vLLM)     │  │  (vLLM TP)  │  │  (vLLM TP)  │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
│         │                │                │                      │
│         └────────────────┼────────────────┘                      │
│                          │                                       │
│                 ┌────────▼────────┐                              │
│                 │  Power Monitor  │                              │
│                 │  (nvidia-smi)   │                              │
│                 └────────┬────────┘                              │
│                          │                                       │
│                 ┌────────▼────────┐                              │
│                 │  Modal Volume   │                              │
│                 │  (power logs)   │                              │
│                 └─────────────────┘                              │
└─────────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Local / GitHub Actions                        │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ benchmark_      │  │ plot_           │  │ Results:        │  │
│  │ gpu_users.py    │──│ gpu_users.py    │──│ - JSON          │  │
│  │ (async HTTP)    │  │ (matplotlib)    │  │ - PNG chart     │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

The Modal deployment creates three separate vLLM endpoints (a minimal definition sketch follows the list):

  • 1 GPU endpoint - Single GPU inference (baseline)
  • 2 GPU endpoint - Tensor parallelism across 2 GPUs
  • 4 GPU endpoint - Tensor parallelism across 4 GPUs
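
Here is a minimal, illustrative sketch of how one such tensor-parallel endpoint can be defined on Modal. The real deployment lives in vllm_multi_gpu.py in the repo; the GPU string, model name, port, and timeouts below are assumptions for illustration only.

# Illustrative sketch -- see vllm_multi_gpu.py in the repo for the real deployment.
import subprocess
import modal

app = modal.App("vllm-multi-gpu-benchmark-sketch")
image = modal.Image.debian_slim().pip_install("vllm")

@app.function(image=image, gpu="H100:2", timeout=60 * 60)
@modal.web_server(8000, startup_timeout=10 * 60)
def serve_2gpu():
    # Start vLLM's OpenAI-compatible server, sharding the model across 2 GPUs (tensor parallelism).
    subprocess.Popen(
        "python -m vllm.entrypoints.openai.api_server"
        " --model Qwen/Qwen3-8B-FP8"
        " --tensor-parallel-size 2"
        " --port 8000",
        shell=True,
    )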

Each endpoint runs vLLM with an OpenAI-compatible API, making it easy to test with standard HTTP clients.
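
For a quick smoke test of a deployed endpoint, any OpenAI-style client works. A minimal check with the requests library might look like this (the URL placeholder mirrors the deployment URLs shown later, and the model name is the one from the sample run):

# Smoke test of one endpoint; swap in your own Modal URL after deploying.
import requests

BASE_URL = "https://<workspace>--vllm-multi-gpu-benchmark-serve-1gpu.modal.run"

resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={
        "model": "Qwen/Qwen3-8B-FP8",   # must match the model the endpoint serves
        "prompt": "Explain tensor parallelism in one sentence.",
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])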

Power Monitoring

InferenceX captures real nvidia-smi measurements:

  • Samples once per second during inference
  • Captures power draw, temperature, GPU utilization, and memory usage
  • Stores readings as CSV files on a Modal Volume
  • Counts only readings with GPU utilization > 5% toward efficiency

This gives you actual power efficiency numbers (tokens/s/kW).
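
If you want to recompute efficiency yourself from the downloaded logs, a rough post-processing sketch is below. The CSV column names here are assumptions, so adjust them to match the files you pull from the Modal Volume.

# Rough sketch; column names are assumed -- adjust to the actual CSV headers.
import csv

UTIL_THRESHOLD = 5.0  # ignore near-idle samples, mirroring the > 5% utilization filter

def average_active_power_kw(csv_path):
    """Mean power draw (kW) over samples where the GPU was actually busy."""
    watts = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if float(row["utilization_gpu_pct"]) > UTIL_THRESHOLD:
                watts.append(float(row["power_draw_w"]))
    return sum(watts) / len(watts) / 1000.0

# Combine with the throughput the benchmark reports to get tokens/s/kW:
throughput_tok_s = 1255.0  # example value from the sample results below
power_kw = average_active_power_kw("power_log_1gpu_example.csv")
print(f"Power efficiency: {throughput_tok_s / power_kw:.0f} tok/s/kW")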

Configuration Options

InferenceX supports any HuggingFace model that works with vLLM. Supported GPUs: H100 (default), H200, B200.

Three key benchmark parameters control the workload:

  • REQUESTS - Number of inference calls per configuration (default: 50)
  • ISL (Input Sequence Length) - Number of tokens in input prompt (512-1024 for chat, 2048-4096 for document analysis, 8192+ for long context)
  • OSL (Output Sequence Length) - Maximum tokens generated per request (256 for short responses, 1024 for detailed explanations, 4096 for long-form content)

By default, InferenceX tests 6 GPU/user configurations to help identify optimal GPU count, concurrency, and trade-offs between throughput and user experience.
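
For reference, here is the default sweep as I would express it in code. The GPU/user pairs are inferred from the sample results later in this post, and the ISL/OSL values are the ones used in that run; benchmark_gpu_users.py is the source of truth.

# Inferred default sweep -- check benchmark_gpu_users.py for the authoritative values.
REQUESTS = 50     # inference calls per configuration
ISL = 1024        # input sequence length used in the sample run (tokens)
OSL = 256         # max output tokens per request in the sample run

GPU_USER_CONFIGS = [
    (1, 128),   # 1 GPU, 128 concurrent users
    (2, 128),
    (2, 64),
    (2, 32),
    (4, 32),
    (4, 16),
]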

Running InferenceX

Option 1: Local

# Setup
git clone https://github.com/strangeloopio/inferencex.git
cd inferencex
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python -m modal setup

# Deploy endpoints
python -m modal deploy vllm_multi_gpu.py

# Run benchmark
python benchmark_gpu_users.py \
  --url-1gpu "https://<workspace>--vllm-multi-gpu-benchmark-serve-1gpu.modal.run" \
  --url-2gpu "https://<workspace>--vllm-multi-gpu-benchmark-serve-2gpu.modal.run" \
  --url-4gpu "https://<workspace>--vllm-multi-gpu-benchmark-serve-4gpu.modal.run"

# Download power logs
python -m modal run vllm_multi_gpu.py --action logs
python -m modal run vllm_multi_gpu.py --action "download power_log_1gpu_*.csv"

# Generate chart
python plot_gpu_users.py

# Stop (save costs)
python -m modal app stop vllm-multi-gpu-benchmark

Option 2: GitHub Actions

For automated benchmarking without local setup:

  1. Fork the repository
  2. Add secrets: MODAL_TOKEN_ID, MODAL_TOKEN_SECRET, MODAL_WORKSPACE
  3. Go to Actions -> "vLLM Multi-GPU Benchmark" -> Run workflow
  4. Configure ISL, OSL, GPU type, model name
  5. Download results from Artifacts

The workflow automatically deploys the endpoints, runs the benchmarks, downloads the power logs, generates the charts, stops the Modal app to save costs, and uploads the results as artifacts (retained for 90 days).

Metrics Explained

InferenceX measures five key metrics; a short sketch of how they can be computed follows the list:

  1. Throughput Per GPU (tokens/s/GPU) - Total tokens generated divided by (time x GPU count). Higher is better. Shows how efficiently you're using each GPU.
  2. End-to-End Latency (seconds) - Average time from request submission to complete response. Lower is better. What users experience as "response time."
  3. User Interactivity (tokens/s/user) - Tokens generated per second per concurrent user. Higher is better. How fast text streams to each individual user.
  4. Time To First Token (seconds) - Time until the first token is generated. Lower is better. Critical for perceived responsiveness in chat.
  5. Power Efficiency (tokens/s/kW) - Throughput divided by actual measured power consumption. Higher is better. Real cost efficiency metric.
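
To make those definitions concrete, here is a rough sketch of how the five numbers can be derived from per-request measurements. This is not the repo's benchmarking code; the field names and the interactivity formula (decode tokens over decode time) are my reading of the metrics, sanity-checked against the sample results below.

# Sketch of the metric definitions -- not the actual benchmark_gpu_users.py implementation.
from dataclasses import dataclass

@dataclass
class RequestResult:
    ttft_s: float        # time until the first token arrived
    latency_s: float     # time until the full response arrived
    output_tokens: int   # tokens generated for this request

def summarize(results, wall_time_s, gpu_count, avg_power_kw):
    n = len(results)
    total_tokens = sum(r.output_tokens for r in results)
    throughput = total_tokens / wall_time_s                      # endpoint-level tok/s
    return {
        "throughput_per_gpu": throughput / gpu_count,            # 1. tok/s/GPU
        "latency_s": sum(r.latency_s for r in results) / n,      # 2. mean end-to-end latency
        "interactivity": sum(                                    # 3. per-user streaming rate
            r.output_tokens / (r.latency_s - r.ttft_s)
            for r in results
        ) / n,
        "ttft_s": sum(r.ttft_s for r in results) / n,            # 4. mean time to first token
        "power_efficiency": throughput / avg_power_kw,           # 5. tok/s/kW
    }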

Sample Results

Here's what I found benchmarking Qwen3-8B-FP8 on H100 (ISL=1024, OSL=256):

InferenceX Benchmark Results - Qwen3-8B-FP8 on H100
Config       Thru/GPU      Latency   Interactivity   TTFT    Power Eff.
             (tok/s/gpu)   (s)       (tok/s/user)    (s)     (tok/s/kW)
-----------------------------------------------------------------------
1GPU/128U    1,255         10.2      29.2            1.39    6,907
2GPU/128U    653           9.8       30.1            1.23    6,836
2GPU/64U     582           9.1       30.3            0.66    6,087
2GPU/32U     458           6.8       40.5            0.41    4,792
4GPU/32U     165           9.4       34.4            1.78    982
4GPU/16U     105           7.2       37.9            0.39    625

Key findings for this model:

  • 1 GPU is optimal for throughput (1,255 tok/s/GPU)
  • 2 GPU / 32 users is optimal for user experience (6.8s latency)
  • 4 GPU is wasteful—10x worse power efficiency

But YOUR model may behave differently.

Cost & Time

Default benchmark (Qwen3-8B-FP8, 50 requests, 6 configurations):

  • Time: ~13 minutes
  • Cost: ~$12 on Modal (as of November 2024)

Costs scale with model size (larger models = longer load time), request count (more requests = longer benchmark), and ISL/OSL (longer sequences = more compute).
