NVIDIA GPU Management
From cold starts to memory error protection - five essential GPU management concepts every ML engineer should know. Covers persistence mode, MPS, MIG, clock speed management, and ECC memory.
If you're running ML workloads, inference servers, or any GPU-intensive application, understanding how to manage your NVIDIA GPUs can mean the difference between milliseconds and seconds of latency, between money wasted on idle resources and efficient utilization, and between silent data corruption and reliable computation.
Table of Contents
- Persistence Mode - Eliminating Cold Start Latency
- MPS (Multi-Process Service) - Sharing GPUs Between Processes
- MIG (Multi-Instance GPU) - Hardware-Level GPU Partitioning
- Clock Speed Management - Performance vs Power Tradeoffs
- ECC Memory - Protecting Against Silent Data Corruption
1. Persistence Mode: The 3-Second Startup Tax You're Probably Paying
The Problem
Every time you run a CUDA application, there's a hidden cost. By default, the NVIDIA driver tears down GPU state when no GPU applications are running. The first CUDA call after this triggers a full re-initialization.
Without Persistence Mode:

$ time python -c "import torch; torch.cuda.FloatTensor(1)"

real    0m2.847s    <-- Almost 3 seconds just to allocate one float!
For a one-off script, this doesn't matter. For an inference server handling thousands of requests? This cold start penalty hits you on every scale-up, every container restart, every new process.
The Solution
Persistence mode keeps the GPU initialized even when no CUDA applications are running. First CUDA calls drop from seconds to milliseconds.
With Persistence Mode:

$ time python -c "import torch; torch.cuda.FloatTensor(1)"

real    0m0.089s    <-- 30x faster
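To see where the time goes from inside Python, you can time the first CUDA call directly. A minimal sketch, assuming PyTorch is installed and a CUDA device is present:

import time

import torch

# Time only the CUDA initialization, excluding interpreter startup and imports.
start = time.perf_counter()
torch.cuda.init()             # forces driver / context initialization
torch.cuda.synchronize()      # wait for initialization to complete
init_ms = (time.perf_counter() - start) * 1000

# Subsequent CUDA calls reuse the already-initialized context.
start = time.perf_counter()
x = torch.ones(1, device="cuda")
torch.cuda.synchronize()
alloc_ms = (time.perf_counter() - start) * 1000

print(f"CUDA init: {init_ms:.1f} ms, first allocation afterwards: {alloc_ms:.1f} ms")

With persistence mode off, the first number dominates; with it on, both are in the millisecond range.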
How to Enable
The recommended approach uses nvidia-persistenced, a lightweight daemon:
# Enable and start the persistence daemon
sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced

# Verify it's running
nvidia-smi | grep "Persistence-M"   # Should show "On"
Alternative (manual, resets on reboot):
sudo nvidia-smi -pm 1
When to Use
USE PERSISTENCE MODE:
- ML inference servers
- Training clusters
- Any production GPU workload
- Development machines with frequent GPU access
SKIP IT:
- Desktop systems where GPU is rarely used
- When you need the GPU to fully power down for thermal/power reasons
2. MPS: When Multiple Processes Need the Same GPU
The Problem
You have one GPU and multiple processes that need it. Maybe you're running several inference workers, or multiple small training jobs. Without MPS, the driver time-slices the GPU between their CUDA contexts:
Without MPS (Time-Slicing):
GPU Timeline:
[Process A][idle][Process B][idle][Process A][idle][Process B]
             ^                ^                ^
             |                |                |
             Context switch overhead (expensive!)
Each process gets exclusive access, then yields. Context switches are expensive. If your processes use small kernels that don't saturate the GPU, you're wasting compute.
The Solution
MPS (Multi-Process Service) creates a single CUDA context shared by multiple processes. Kernels from different processes can execute concurrently.
With MPS:

GPU Timeline:
[Process A + Process B running simultaneously on different SMs]

No context switches. Better GPU utilization.
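As a concrete (hypothetical) example, this is the kind of worker that benefits from MPS: each process launches small kernels that leave most SMs idle, so several copies can overlap under a shared context instead of being time-sliced. The script name worker.py and the tiny model are illustrative only; this assumes PyTorch and a single visible GPU:

# worker.py - a small inference-style loop that underutilizes the GPU on its own.
# Run several copies (python worker.py &) while the MPS daemon is active and
# their kernels can execute concurrently on different SMs.
import torch

def main():
    device = torch.device("cuda:0")
    model = torch.nn.Linear(512, 512).to(device)   # tiny model: a few small kernels per call
    x = torch.randn(64, 512, device=device)

    with torch.no_grad():
        for _ in range(10_000):
            y = model(x)
    torch.cuda.synchronize()
    print("done", y.shape)

if __name__ == "__main__":
    main()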
How It Works
+--------------+     +--------------+     +--------------+
|  Process A   |     |  Process B   |     |  Process C   |
| (MPS Client) |     | (MPS Client) |     | (MPS Client) |
+------+-------+     +------+-------+     +------+-------+
       |                    |                    |
       +--------------------+--------------------+
                            |
                            v
                   +----------------+
                   |   MPS Server   |
                   |  (Single CUDA  |
                   |    Context)    |
                   +--------+-------+
                            |
                            v
                   +----------------+
                   |      GPU       |
                   +----------------+
Starting MPS
# Set up directories
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
mkdir -p $CUDA_MPS_PIPE_DIRECTORY $CUDA_MPS_LOG_DIRECTORY

# Start MPS daemon
nvidia-cuda-mps-control -d

# Run your applications (they auto-connect to MPS)
python worker1.py &
python worker2.py &

# Stop MPS when done
echo quit | nvidia-cuda-mps-control
Controlling Resource Allocation
You can limit how much GPU each process gets:
# Limit each client to 50% of GPU threads
echo "set_default_active_thread_percentage 50" | nvidia-cuda-mps-control

# Limit specific process
echo "set_active_thread_percentage <PID> 25" | nvidia-cuda-mps-control
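On Volta and newer GPUs you can also set the limit per client at launch time through the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable. A sketch launching two workers with different shares (worker.py is the illustrative script above; exact semantics depend on your driver and MPS version):

import os
import subprocess

def launch(script, percentage):
    # Each client sees only the requested share of GPU threads,
    # assuming the MPS daemon is already running.
    env = os.environ.copy()
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(percentage)
    return subprocess.Popen(["python", script], env=env)

procs = [launch("worker.py", 60), launch("worker.py", 30)]
for p in procs:
    p.wait()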
The Catch
MPS has a critical weakness: NO FAULT ISOLATION.
If Process A crashes or corrupts memory:
MPS Server (shared context) --> CORRUPTED
        |
        +--> Process B: also affected
        +--> Process C: also affected
One bad process can take down all MPS clients. This is why MPS is rarely used in production multi-tenant scenarios.
When to Use MPS
GOOD USE CASES:
- Multiple workers for YOUR OWN code (you trust it)
- MPI applications on single GPU
- Development/testing
- Small inference jobs that don't saturate GPU
BAD USE CASES:
- Multi-tenant (untrusted code)
- Production services needing isolation
- Memory-intensive applications
3. MIG: Hardware-Level GPU Partitioning
The Problem
MPS shares a GPU but provides no isolation. What if you need:
- True memory isolation (one job can't see another's data)
- Fault isolation (one crash doesn't affect others)
- Guaranteed resources (no noisy neighbors)
The Solution
MIG (Multi-Instance GPU) physically partitions a GPU into isolated instances. Each instance has:
- Dedicated Streaming Multiprocessors (compute)
- Dedicated memory with separate bandwidth
- Separate L2 cache
- Full hardware isolation
A100 GPU with MIG:

+----------------------------------------------------------+
|                        A100 80GB                          |
|  +------------+  +------------+  +------------+           |
|  | MIG 3g.40gb|  | MIG 3g.40gb|  | MIG 1g.10gb|           |
|  |            |  |            |  |            |           |
|  |   42 SMs   |  |   42 SMs   |  |   14 SMs   |           |
|  |  40GB VRAM |  |  40GB VRAM |  |  10GB VRAM |           |
|  |            |  |            |  |            |           |
|  | LLM Service|  | Embed Svc  |  | Small Jobs |           |
|  +------------+  +------------+  +------------+           |
|                                                           |
|  Each partition is fully isolated - appears as            |
|  a separate GPU to applications                           |
+----------------------------------------------------------+
Supported GPUs
MIG requires Ampere architecture or newer:
- A100 (40GB and 80GB) - up to 7 instances
- A30 - up to 4 instances
- H100 - up to 7 instances
- H200, B200 - up to 7 instances
Consumer GPUs (GeForce) do NOT support MIG.
How to Use MIG
# Step 1: Enable MIG mode (requires reboot)
sudo nvidia-smi -i 0 -mig 1
sudo reboot

# Step 2: List available partition profiles
nvidia-smi mig -lgip

# Step 3: Create GPU instances
# Example: Create 2x 3g.40gb partitions
sudo nvidia-smi mig -i 0 -cgi 9,9 -C

# Step 4: List created instances
nvidia-smi -L
# Shows:
#   GPU 0: NVIDIA A100
#     MIG 3g.40gb Device 0 (UUID: MIG-xxx)
#     MIG 3g.40gb Device 1 (UUID: MIG-yyy)

# Step 5: Use specific partition
CUDA_VISIBLE_DEVICES=MIG-xxx python my_app.py
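From application code, a MIG instance is addressed like any other device via its UUID. A minimal sketch (MIG-xxx is a placeholder for whatever your own nvidia-smi -L printed); note that CUDA_VISIBLE_DEVICES must be set before CUDA is initialized:

import os

# Must be set before torch (or any CUDA library) initializes the driver.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-xxx"   # placeholder UUID from `nvidia-smi -L`

import torch

# Inside the process, the MIG slice appears as a single ordinary GPU (device 0)
# with only the partition's SMs and memory.
print(torch.cuda.device_count())                                   # 1
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_properties(0).total_memory / 1e9, "GB")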
MIG vs MPS
+------------------+----------------------------+---------------------------+
| Feature          | MPS                        | MIG                       |
+------------------+----------------------------+---------------------------+
| Isolation        | None (shared context)      | Full hardware isolation   |
| Memory isolation | None                       | Yes, separate pools       |
| Fault isolation  | No (one crash affects all) | Yes (partitions isolated) |
| GPU support      | Kepler+ (most GPUs)        | Ampere+ only              |
| Overhead         | Minimal                    | Some (partition boundary) |
| Use case         | Trusted multi-process      | Multi-tenant, production  |
+------------------+----------------------------+---------------------------+
The Catch
MIG reconfiguration requires:
- Stopping ALL processes on the GPU
- Destroying existing partitions
- Creating new partitions
- Restarting processes
This takes 30+ seconds and causes downtime. You can't dynamically resize partitions while workloads are running.
When to Use MIG
USE MIG:
- Multi-tenant GPU sharing (cloud providers)
- Running multiple models on expensive GPUs (A100/H100)
- When you need guaranteed QoS
- Production inference servers with multiple models
SKIP MIG:
- Single workload that needs full GPU
- Consumer GPUs (not supported)
- When you need dynamic partition changes
4. Clock Speed Management: Performance vs Power vs Cost
The Basics
NVIDIA GPUs have dynamic clocking. They don't run at fixed speeds - they adjust based on:
- Temperature
- Power consumption
- Workload demand
- Voltage limits
P-State Levels:

P0  - Maximum Performance (under load)
P2  - Balanced Performance
P8  - Basic Display (desktop idle)
P12 - Minimum Power
GPU Boost automatically scales clocks within these constraints. But sometimes you want manual control.
Querying Clocks
# Current clocks
nvidia-smi --query-gpu=clocks.current.graphics,clocks.current.memory --format=csv

# Maximum supported
nvidia-smi --query-gpu=clocks.max.graphics,clocks.max.memory --format=csv

# Real-time monitoring
watch -n 1 nvidia-smi --query-gpu=clocks.current.graphics,temperature.gpu,power.draw --format=csv
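For longer runs it can help to log these values programmatically. A small sketch that polls nvidia-smi once per second using the query fields above plus pstate (assuming nvidia-smi is on PATH):

import csv
import subprocess
import sys
import time

FIELDS = "clocks.current.graphics,clocks.current.memory,temperature.gpu,power.draw,pstate"

def sample():
    # One CSV row per GPU, values separated by ", "
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [line.split(", ") for line in out.stdout.strip().splitlines()]

writer = csv.writer(sys.stdout)
writer.writerow(["time", "gpu"] + FIELDS.split(","))
while True:
    for gpu_index, values in enumerate(sample()):
        writer.writerow([f"{time.time():.0f}", gpu_index] + values)
    time.sleep(1)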
Locking Clocks
For benchmarking or consistent performance:
# Lock GPU clock to specific frequency
sudo nvidia-smi -lgc 1400,1400

# Lock memory clock
sudo nvidia-smi -lmc 1215,1215

# Reset to default
sudo nvidia-smi -rgc
sudo nvidia-smi -rmc
Power Management
Power limits directly affect clock behavior:
# Check power limits
nvidia-smi --query-gpu=power.min_limit,power.max_limit,power.draw --format=csv

# Set power limit (watts)
sudo nvidia-smi -pl 250

Lower power limit  = Lower sustained clocks  = Less performance, but less heat/cost
Higher power limit = Higher sustained clocks = More performance (if cooling allows)
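To reason about the cost side of a power-limit change, you can integrate sampled power draw over a run. A rough sketch with one-second samples (assuming the GPU reports power.draw; treat the result as an estimate, not a metered figure):

import subprocess
import time

def power_draw_watts(gpu_index=0):
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

# Sample once per second for the duration of a workload and sum watts * seconds.
joules = 0.0
start = time.time()
while time.time() - start < 60:          # measure for 60 seconds (adjust as needed)
    joules += power_draw_watts() * 1.0   # 1-second sample interval
    time.sleep(1.0)

print(f"~{joules:.0f} J over the run ({joules / 3.6e6:.4f} kWh)")

Run it once at each power limit you're considering and compare energy per request or per training step.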
Practical Scenarios
SCENARIO 1: Maximum Performance (Training)

sudo nvidia-smi -pm 1     # Persistence mode
sudo nvidia-smi -pl 400   # Max power
# Let GPU Boost handle clocks

SCENARIO 2: Consistent Benchmarking

sudo nvidia-smi -lgc 1200,1200   # Lock graphics clock
sudo nvidia-smi -lmc 1215,1215   # Lock memory clock
# Eliminates variability from thermal throttling

SCENARIO 3: Power-Efficient Inference

sudo nvidia-smi -pl 200   # Lower power limit
# GPU Boost optimizes within budget
# 30-50% power savings with ~10-20% perf loss
Throttling
If clocks aren't reaching expected values, check throttle reasons:
nvidia-smi -q -d PERFORMANCE

# Common throttle reasons:
# - SW Power Cap: Hitting power limit (raise with -pl)
# - HW Thermal Slowdown: GPU too hot (improve cooling)
# - Applications Clocks Setting: You locked them lower
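If you want this in a script rather than reading the -q output by eye, the throttle reasons are also exposed as query fields. A sketch (field names taken from nvidia-smi --help-query-gpu; they can vary between driver versions, so verify on your system):

import subprocess

# Each field reports "Active" or "Not Active" for one throttle reason.
FIELDS = [
    "clocks_throttle_reasons.sw_power_cap",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
    "clocks_throttle_reasons.applications_clocks_setting",
]

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for gpu_index, line in enumerate(out.stdout.strip().splitlines()):
    for name, value in zip(FIELDS, line.split(", ")):
        if value.strip() == "Active":
            print(f"GPU {gpu_index}: throttling due to {name}")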
5. ECC Memory: Silent Corruption Is Worse Than a Crash
The Problem
Memory errors happen. Cosmic rays, manufacturing defects, electrical noise, aging - all can flip bits in GPU memory.
Without ECC: Your training job runs for 3 days. A bit flips in memory. Your gradients are now slightly wrong. Your model converges to something subtly broken. You don't know until production.
Silent data corruption is insidious. Your program doesn't crash - it just produces wrong results.
The Solution
ECC (Error Correcting Code) memory:
- Detects single-bit and double-bit errors
- Automatically corrects single-bit errors
- Reports uncorrectable double-bit errors
With ECC:

Single-bit error occurs --> ECC detects & corrects --> Job continues
Double-bit error occurs --> ECC reports error      --> You know something's wrong
ECC Support
GPU Type              | ECC Support
----------------------+------------------
GeForce (Consumer)    | No
Quadro                | Yes
Tesla / Data Center   | Yes (default ON)
A100, H100, V100      | Yes (default ON)
Checking ECC Status
# Check if ECC is enabled
nvidia-smi --query-gpu=ecc.mode.current --format=csv

# View error counts
nvidia-smi -q -d ECC

# Key metrics:
# - Correctable (single-bit): Normal, ECC handled it
# - Uncorrectable (double-bit): Serious, data may be corrupted
The Cost
The cost depends on the memory type. On GPUs with GDDR memory, enabling ECC reserves roughly 6.25% of VRAM for check bits and can slightly reduce bandwidth. On HBM-based data-center GPUs such as V100, A100, and H100, ECC is implemented natively in the memory subsystem, so enabling it does not reduce usable capacity.
Where ECC does cost memory, the ~6% loss is almost always worth the data integrity for ML workloads.
Monitoring for Production
Set up alerting on ECC errors:
# Quick check for uncorrectable errors
ERRORS=$(nvidia-smi --query-gpu=ecc.errors.uncorrected.volatile.total \
--format=csv,noheader,nounits)
if [ "$ERRORS" -gt 0 ]; then
echo "ALERT: GPU has uncorrectable memory errors!"
# Send alert, checkpoint training, etc.
fi
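If you'd rather check from Python (for example inside a training loop, so you can checkpoint on the first uncorrectable error), the NVML bindings expose the same counters. A sketch assuming the nvidia-ml-py package (import pynvml) is installed; on GPUs without ECC the query raises a "Not Supported" NVML error:

import pynvml

def uncorrectable_ecc_errors():
    """Return {gpu_index: volatile uncorrectable ECC error count}."""
    pynvml.nvmlInit()
    try:
        counts = {}
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            counts[i] = pynvml.nvmlDeviceGetTotalEccErrors(
                handle,
                pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,  # double-bit / uncorrectable
                pynvml.NVML_VOLATILE_ECC,                   # counts since last driver reload
            )
        return counts
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    for gpu, errors in uncorrectable_ecc_errors().items():
        if errors > 0:
            print(f"ALERT: GPU {gpu} has {errors} uncorrectable memory errors!")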
When to Worry
NORMAL:
- Occasional single-bit correctable errors (ECC is working)
INVESTIGATE:
- High rate of single-bit errors (thousands)
- Any double-bit errors
- Retired pages accumulating
REPLACE GPU:
- Multiple double-bit errors
- Retired pages approaching limit
- Consistent memory problems
ECC Decision Matrix
Scenario                    | ECC Recommendation
----------------------------+--------------------
Production ML training      | ON (always)
Scientific computing        | ON
Financial calculations      | ON
Long training runs (days)   | ON
Development/testing         | Optional
Memory-constrained workload | Consider OFF
Quick experiments           | Optional
Putting It All Together: Production GPU Setup
Here's a typical production configuration for an ML inference server:
# 1. Enable persistence mode (fast startup)
sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced

# 2. Verify ECC is enabled (data integrity)
nvidia-smi --query-gpu=ecc.mode.current --format=csv
# Should show "Enabled"

# 3. Set appropriate power limit (cost/performance balance)
sudo nvidia-smi -pl 300   # Adjust for your GPU

# 4. If running multiple models on A100/H100, configure MIG
sudo nvidia-smi -i 0 -mig 1
# Reboot, then create partitions

# 5. Set up monitoring
# - ECC error alerts
# - Throttle reason monitoring
# - Temperature and power tracking
For development, the setup is simpler:
# Just enable persistence mode
sudo nvidia-smi -pm 1

# Everything else can stay at defaults
Quick Reference
PERSISTENCE MODE
Enable daemon: sudo systemctl enable nvidia-persistenced
Manual enable: sudo nvidia-smi -pm 1
Check status: nvidia-smi | grep Persistence-M
MPS
Start: nvidia-cuda-mps-control -d
Stop: echo quit | nvidia-cuda-mps-control
Limit threads: echo "set_default_active_thread_percentage 50" | nvidia-cuda-mps-control
MIG
Enable: sudo nvidia-smi -i 0 -mig 1 (then reboot)
List profiles: nvidia-smi mig -lgip
Create instance: sudo nvidia-smi mig -cgi <profile_id> -C
List instances: nvidia-smi -L
Destroy all: sudo nvidia-smi mig -dci && sudo nvidia-smi mig -dgi
CLOCKS
Query current: nvidia-smi --query-gpu=clocks.current.graphics --format=csv
Lock clock: sudo nvidia-smi -lgc <min>,<max>
Set power limit: sudo nvidia-smi -pl <watts>
Reset: sudo nvidia-smi -rgc
ECC
Check status: nvidia-smi --query-gpu=ecc.mode.current --format=csv
Enable: sudo nvidia-smi -e 1 (then reboot)
View errors: nvidia-smi -q -d ECC
Clear errors: sudo nvidia-smi -p 0
Conclusion
GPU management isn't glamorous, but it's essential for reliable ML systems:
- PERSISTENCE MODE saves seconds on every startup
- MPS can improve utilization but sacrifices isolation
- MIG provides true hardware isolation for multi-tenant/multi-model scenarios
- CLOCK MANAGEMENT lets you balance performance, power, and cost
- ECC protects against silent data corruption
Most cloud providers and platforms (Modal, AWS, GCP) handle this for you. But if you're managing your own GPUs - or just want to understand what's happening under the hood - these five concepts are fundamental.
The key insight: there's no one-size-fits-all configuration. A training cluster optimizes for maximum throughput. An inference server optimizes for consistent latency. A shared GPU environment needs isolation. Choose the right tools for your workload.
References & Further Reading
Books
- "AI Systems Performance Engineering" by Chris Fregly (O'Reilly, 2025) - Comprehensive coverage of performance optimization for AI/ML systems, including GPU utilization, distributed training, and inference optimization.
Code & Resources
- AI Performance Engineering GitHub Repository - Companion code, notebooks, and examples for hands-on performance engineering with GPUs.