NVIDIA GPU Management

From cold starts to memory error protection - five essential GPU management concepts every ML engineer should know. Covers persistence mode, MPS, MIG, clock speed management, and ECC memory.

If you're running ML workloads, inference servers, or any GPU-intensive application, understanding how to manage your NVIDIA GPUs can mean the difference between milliseconds and seconds of latency, between money wasted on idle resources and money well spent, and between silent data corruption and reliable computation.

Table of Contents

  1. Persistence Mode - Eliminating Cold Start Latency
  2. MPS (Multi-Process Service) - Sharing GPUs Between Processes
  3. MIG (Multi-Instance GPU) - Hardware-Level GPU Partitioning
  4. Clock Speed Management - Performance vs Power Tradeoffs
  5. ECC Memory - Protecting Against Silent Data Corruption

1. Persistence Mode: The 3-Second Startup Tax You're Probably Paying

The Problem

Every time you run a CUDA application, there's a hidden cost. By default, the NVIDIA driver tears down GPU state when the last GPU client exits, so the first CUDA call from the next process triggers a full driver re-initialization.

Without Persistence Mode:

$ time python -c "import torch; torch.cuda.FloatTensor(1)"
real    0m2.847s    <-- Almost 3 seconds just to allocate one float!

For a one-off script, this doesn't matter. For an inference server handling thousands of requests? This cold start penalty hits you on every scale-up, every container restart, every new process.

The Solution

Persistence mode keeps the NVIDIA driver initialized and GPU state resident even when no clients are connected. The first CUDA call drops from seconds to milliseconds.

With Persistence Mode:

$ time python -c "import torch; torch.cuda.FloatTensor(1)"
real    0m0.089s    <-- 30x faster

How to Enable

The recommended approach uses nvidia-persistenced, a lightweight daemon:

# Enable and start the persistence daemon
sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced

# Verify it's enabled
nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
# Should show "Enabled"

Alternative (manual, resets on reboot):

sudo nvidia-smi -pm 1
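
On a multi-GPU machine you can also scope this to a single device with the -i flag (GPU index 0 shown as an example):

# Enable persistence mode on GPU 0 only
sudo nvidia-smi -i 0 -pm 1

# Confirm the setting
nvidia-smi -i 0 --query-gpu=persistence_mode --format=csv,noheader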

When to Use

USE PERSISTENCE MODE:

  • ML inference servers
  • Training clusters
  • Any production GPU workload
  • Development machines with frequent GPU access

SKIP IT:

  • Desktop systems where GPU is rarely used
  • When you need the GPU to fully power down for thermal/power reasons

2. MPS: When Multiple Processes Need the Same GPU

The Problem

You have one GPU and multiple processes that need it. Maybe you're running several inference workers, or multiple small training jobs. Without MPS, NVIDIA uses time-slicing:

Without MPS (Time-Slicing):

GPU Timeline:
[Process A][idle][Process B][idle][Process A][idle][Process B]
             ^                 ^                 ^
             |                 |                 |
        Context switch overhead (expensive!)

Each process gets exclusive access, then yields. Context switches are expensive. If your processes use small kernels that don't saturate the GPU, you're wasting compute.

The Solution

MPS (Multi-Process Service) creates a single CUDA context shared by multiple processes. Kernels from different processes can execute concurrently.

With MPS:

GPU Timeline:
[Process A + Process B running simultaneously on different SMs]

No context switches. Better GPU utilization.

How It Works

+--------------+  +--------------+  +--------------+
|  Process A   |  |  Process B   |  |  Process C   |
| (MPS Client) |  | (MPS Client) |  | (MPS Client) |
+------+-------+  +------+-------+  +------+-------+
       |                 |                 |
       +--------+--------+---------+-------+
                |
                v
        +---------------+
        |  MPS Server   |
        | (Single CUDA  |
        |   Context)    |
        +-------+-------+
                |
                v
        +---------------+
        |      GPU      |
        +---------------+

Starting MPS

# Set up directories
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
mkdir -p $CUDA_MPS_PIPE_DIRECTORY $CUDA_MPS_LOG_DIRECTORY

# Start MPS daemon
nvidia-cuda-mps-control -d

# Run your applications (they auto-connect to MPS)
python worker1.py &
python worker2.py &

# Stop MPS when done
echo quit | nvidia-cuda-mps-control
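
A small wrapper script makes the start/stop lifecycle harder to get wrong. The sketch below is one way to do it, with worker1.py and worker2.py standing in for your own programs:

#!/usr/bin/env bash
set -euo pipefail

export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

# Make sure the MPS daemon is shut down even if a worker fails
trap 'echo quit | nvidia-cuda-mps-control' EXIT

# Start the MPS control daemon
nvidia-cuda-mps-control -d

# Launch workers (placeholders for your own programs) and wait for them
python worker1.py &
python worker2.py &
wait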

Controlling Resource Allocation

You can limit how much GPU each process gets:

# Limit each client to 50% of GPU threads
echo "set_default_active_thread_percentage 50" | nvidia-cuda-mps-control

# Limit the clients of a specific MPS server (PID is the MPS server's PID)
echo "set_active_thread_percentage <PID> 25" | nvidia-cuda-mps-control

The Catch

MPS has a critical weakness: NO FAULT ISOLATION.

If Process A crashes or corrupts memory:

MPS Server (shared context) --> CORRUPTED
     |
     +--> Process B: also affected
     +--> Process C: also affected

One bad process can take down all MPS clients. This is why MPS is rarely used in production multi-tenant scenarios.

When to Use MPS

GOOD USE CASES:

  • Multiple workers for YOUR OWN code (you trust it)
  • MPI applications on single GPU
  • Development/testing
  • Small inference jobs that don't saturate GPU

BAD USE CASES:

  • Multi-tenant (untrusted code)
  • Production services needing isolation
  • Memory-intensive applications

3. MIG: Hardware-Level GPU Partitioning

The Problem

MPS shares a GPU but provides no isolation. What if you need:

  • True memory isolation (one job can't see another's data)
  • Fault isolation (one crash doesn't affect others)
  • Guaranteed resources (no noisy neighbors)

The Solution

MIG (Multi-Instance GPU) physically partitions a GPU into isolated instances. Each instance has:

  • Dedicated Streaming Multiprocessors (compute)
  • Dedicated memory with separate bandwidth
  • Separate L2 cache
  • Full hardware isolation

A100 GPU with MIG:

+----------------------------------------------------------+
|                        A100 80GB                          |
|  +------------+  +------------+  +------------+          |
|  | MIG 3g.40gb|  | MIG 2g.20gb|  | MIG 1g.10gb|          |
|  |            |  |            |  |            |          |
|  | 42 SMs     |  | 28 SMs     |  | 14 SMs     |          |
|  | 40GB VRAM  |  | 20GB VRAM  |  | 10GB VRAM  |          |
|  |            |  |            |  |            |          |
|  | LLM Service|  | Embed Svc  |  | Small Jobs |          |
|  +------------+  +------------+  +------------+          |
|                                                          |
|  Each partition is fully isolated - appears as           |
|  a separate GPU to applications                          |
+----------------------------------------------------------+

Supported GPUs

MIG requires Ampere architecture or newer:

  • A100 (40GB and 80GB) - up to 7 instances
  • A30 - up to 4 instances
  • H100 - up to 7 instances
  • H200, B200 - up to 7 instances

Consumer GPUs (GeForce) do NOT support MIG.

How to Use MIG

# Step 1: Enable MIG mode (requires reboot)
sudo nvidia-smi -i 0 -mig 1
sudo reboot

# Step 2: List available partition profiles
nvidia-smi mig -lgip

# Step 3: Create GPU instances
# Example: Create 2x 3g.40gb partitions
sudo nvidia-smi mig -i 0 -cgi 9,9 -C

# Step 4: List created instances
nvidia-smi -L
# Shows:
# GPU 0: NVIDIA A100
#   MIG 3g.40gb Device 0 (UUID: MIG-xxx)
#   MIG 3g.40gb Device 1 (UUID: MIG-yyy)

# Step 5: Use specific partition
CUDA_VISIBLE_DEVICES=MIG-xxx python my_app.py
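
Each partition gets its own MIG-... UUID, so a small loop can pin one worker per instance. A rough sketch - the grep pattern assumes the UUID format printed by recent drivers, and serve_model.py stands in for your own entrypoint:

# Launch one worker per MIG instance, each seeing only its own partition
for uuid in $(nvidia-smi -L | grep -oE 'MIG-[^)]+'); do
    CUDA_VISIBLE_DEVICES="$uuid" python serve_model.py &
done
wait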

MIG vs MPS

+------------------+---------------------------+---------------------------+
|     Feature      |           MPS             |           MIG             |
+------------------+---------------------------+---------------------------+
| Isolation        | None (shared context)     | Full hardware isolation   |
| Memory isolation | None                      | Yes, separate pools       |
| Fault isolation  | No (one crash affects all)| Yes (partitions isolated) |
| GPU support      | Kepler+ (most GPUs)       | Ampere+ only              |
| Overhead         | Minimal                   | Some (partition boundary) |
| Use case         | Trusted multi-process     | Multi-tenant, production  |
+------------------+---------------------------+---------------------------+

The Catch

MIG reconfiguration requires:

  1. Stopping ALL processes on the GPU
  2. Destroying existing partitions
  3. Creating new partitions
  4. Restarting processes

This takes 30+ seconds and causes downtime. You can't dynamically resize partitions while workloads are running.
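
The drain-and-recreate cycle itself looks like this (profile IDs are examples - check nvidia-smi mig -lgip for the IDs on your GPU):

# 1. Stop every process using the GPU (the commands below fail otherwise)

# 2. Destroy compute instances, then GPU instances
sudo nvidia-smi mig -dci
sudo nvidia-smi mig -dgi

# 3. Create the new layout (example: one 4g.40gb and one 2g.20gb)
sudo nvidia-smi mig -i 0 -cgi 5,14 -C

# 4. Verify, then restart your workloads
nvidia-smi -L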

When to Use MIG

USE MIG:

  • Multi-tenant GPU sharing (cloud providers)
  • Running multiple models on expensive GPUs (A100/H100)
  • When you need guaranteed QoS
  • Production inference servers with multiple models

SKIP MIG:

  • Single workload that needs full GPU
  • Consumer GPUs (not supported)
  • When you need dynamic partition changes

4. Clock Speed Management: Performance vs Power vs Cost

The Basics

NVIDIA GPUs have dynamic clocking. They don't run at fixed speeds - they adjust based on:

  • Temperature
  • Power consumption
  • Workload demand
  • Voltage limits

P-State Levels:

P0  - Maximum Performance (under load)
P2  - Balanced Performance
P8  - Basic Display (desktop idle)
P12 - Minimum Power

GPU Boost automatically scales clocks within these constraints. But sometimes you want manual control.

Querying Clocks

# Current clocks
nvidia-smi --query-gpu=clocks.current.graphics,clocks.current.memory --format=csv

# Maximum supported
nvidia-smi --query-gpu=clocks.max.graphics,clocks.max.memory --format=csv

# Real-time monitoring
watch -n 1 nvidia-smi --query-gpu=clocks.current.graphics,temperature.gpu,power.draw --format=csv

Locking Clocks

For benchmarking or consistent performance:

# Lock GPU clock to specific frequency
sudo nvidia-smi -lgc 1400,1400

# Lock memory clock
sudo nvidia-smi -lmc 1215,1215

# Reset to default
sudo nvidia-smi -rgc
sudo nvidia-smi -rmc

Power Management

Power limits directly affect clock behavior:

# Check power limits
nvidia-smi --query-gpu=power.min_limit,power.max_limit,power.draw --format=csv

# Set power limit (watts)
sudo nvidia-smi -pl 250

Lower power limit = Lower sustained clocks = Less performance but less heat/cost
Higher power limit = Higher sustained clocks = More performance (if cooling allows)
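
If you're not sure where to set the limit, a rough sweep makes the tradeoff concrete. A sketch only - your_benchmark.sh is a placeholder for whatever workload you actually care about:

# Try a few power limits and record clocks/power for each run
for limit in 200 250 300 350; do
    sudo nvidia-smi -pl "$limit"
    echo "=== power limit: ${limit}W ==="
    nvidia-smi --query-gpu=power.limit,clocks.current.graphics,temperature.gpu --format=csv,noheader
    ./your_benchmark.sh    # placeholder: run and time your own workload here
done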

Practical Scenarios

SCENARIO 1: Maximum Performance (Training)

sudo nvidia-smi -pm 1           # Persistence mode
sudo nvidia-smi -pl 400         # Max power
# Let GPU Boost handle clocks


SCENARIO 2: Consistent Benchmarking

sudo nvidia-smi -lgc 1200,1200  # Lock graphics clock
sudo nvidia-smi -lmc 1215,1215  # Lock memory clock
# Eliminates variability from thermal throttling


SCENARIO 3: Power-Efficient Inference

sudo nvidia-smi -pl 200         # Lower power limit
# GPU Boost optimizes within budget
# 30-50% power savings with ~10-20% perf loss

Throttling

If clocks aren't reaching expected values, check throttle reasons:

nvidia-smi -q -d PERFORMANCE

# Common throttle reasons:
# - SW Power Cap: Hitting power limit (raise with -pl)
# - HW Thermal Slowdown: GPU too hot (improve cooling)
# - Applications Clocks Setting: You locked them lower
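
The same information is exposed as query fields, which is easier to feed into scripts or dashboards:

# Poll the most common throttle reasons once per second
watch -n 1 nvidia-smi --query-gpu=clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_thermal_slowdown --format=csv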

5. ECC Memory: Silent Corruption Is Worse Than a Crash

The Problem

Memory errors happen. Cosmic rays, manufacturing defects, electrical noise, aging - all can flip bits in GPU memory.

Without ECC:

Your training job runs for 3 days.
A bit flips in memory.
Your gradients are now slightly wrong.
Your model converges to something subtly broken.
You don't know until production.

Silent data corruption is insidious. Your program doesn't crash - it just produces wrong results.

The Solution

ECC (Error Correcting Code) memory:

  • Detects single-bit and double-bit errors
  • Automatically corrects single-bit errors
  • Reports uncorrectable double-bit errors

With ECC:

Single-bit error occurs --> ECC detects & corrects --> Job continues
Double-bit error occurs --> ECC reports error --> You know something's wrong

ECC Support

GPU Type              | ECC Support
----------------------+-------------
GeForce (Consumer)    | No
Quadro                | Yes
Tesla/Data Center     | Yes (default ON)
A100, H100, V100      | Yes (default ON)

Checking ECC Status

# Check if ECC is enabled
nvidia-smi --query-gpu=ecc.mode.current --format=csv

# View error counts
nvidia-smi -q -d ECC

# Key metrics:
# - Correctable (single-bit): Normal, ECC handled it
# - Uncorrectable (double-bit): Serious, data may be corrupted

The Cost

How much ECC costs depends on the memory type. On GDDR-based boards (many Quadro/RTX workstation cards, for example), enabling ECC reserves roughly 6-7% of VRAM for check bits, so usable memory shrinks by about that much. On HBM-based data-center GPUs (V100, A100, H100), ECC is implemented natively in the memory subsystem, so the capacity penalty is negligible.

For most ML workloads, the data integrity is worth whatever memory overhead your card pays.
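
You can check what your own card reports - total memory alongside the current ECC mode:

nvidia-smi --query-gpu=name,memory.total,ecc.mode.current --format=csv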

Monitoring for Production

Set up alerting on ECC errors:

# Quick check for uncorrectable errors
# (assumes a single GPU with ECC enabled; with multiple GPUs the query returns one line per GPU)
ERRORS=$(nvidia-smi --query-gpu=ecc.errors.uncorrected.volatile.total \
         --format=csv,noheader,nounits)

if [ "$ERRORS" -gt 0 ]; then
    echo "ALERT: GPU has uncorrectable memory errors!"
    # Send alert, checkpoint training, etc.
fi
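
Run a check like this on a schedule; cron is one simple option (the script path is an example - point it at wherever you saved the check above):

# Example crontab entry: run the ECC check every 10 minutes
*/10 * * * * /usr/local/bin/check_gpu_ecc.sh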

When to Worry

NORMAL:

  • Occasional single-bit correctable errors (ECC is working)

INVESTIGATE:

  • High rate of single-bit errors (thousands)
  • Any double-bit errors
  • Retired pages accumulating (see the check below)

REPLACE GPU:

  • Multiple double-bit errors
  • Retired pages approaching limit
  • Consistent memory problems
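
To check the retired pages mentioned above (and, on Ampere and newer GPUs, the row-remapping counters that replace them), nvidia-smi has dedicated report sections:

# Page retirement status (pre-Ampere GPUs)
nvidia-smi -q -d PAGE_RETIREMENT

# Row remapping status (Ampere and newer, e.g. A100/H100)
nvidia-smi -q -d ROW_REMAPPER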

ECC Decision Matrix

Scenario                    | ECC Recommendation
----------------------------+-------------------
Production ML training      | ON (always)
Scientific computing        | ON
Financial calculations      | ON
Long training runs (days)   | ON
Development/testing         | Optional
Memory-constrained workload | Consider OFF
Quick experiments           | Optional

Putting It All Together: Production GPU Setup

Here's a typical production configuration for an ML inference server:

# 1. Enable persistence mode (fast startup)
sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced

# 2. Verify ECC is enabled (data integrity)
nvidia-smi --query-gpu=ecc.mode.current --format=csv
# Should show "Enabled"

# 3. Set appropriate power limit (cost/performance balance)
sudo nvidia-smi -pl 300  # Adjust for your GPU

# 4. If running multiple models on A100/H100, configure MIG
sudo nvidia-smi -i 0 -mig 1
# Reboot, then create partitions

# 5. Set up monitoring
# - ECC error alerts
# - Throttle reason monitoring
# - Temperature and power tracking
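
NVIDIA's DCGM is the usual answer for fleet-level monitoring, but even a plain nvidia-smi loop logging to CSV covers the basics. A minimal sketch (the log path is an example):

# Append one sample per GPU every 30 seconds to a CSV log
nvidia-smi --query-gpu=timestamp,index,temperature.gpu,power.draw,clocks.current.graphics,ecc.errors.uncorrected.volatile.total \
           --format=csv,noheader -l 30 >> /var/log/gpu-metrics.csv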

For development, the setup is simpler:

# Just enable persistence mode
sudo nvidia-smi -pm 1

# Everything else can stay at defaults

Quick Reference

PERSISTENCE MODE
    Enable daemon:     sudo systemctl enable nvidia-persistenced
    Manual enable:     sudo nvidia-smi -pm 1
    Check status:      nvidia-smi --query-gpu=persistence_mode --format=csv

MPS
    Start:            nvidia-cuda-mps-control -d
    Stop:             echo quit | nvidia-cuda-mps-control
    Limit threads:    echo "set_default_active_thread_percentage 50" | nvidia-cuda-mps-control

MIG
    Enable:           sudo nvidia-smi -i 0 -mig 1 (then reboot)
    List profiles:    nvidia-smi mig -lgip
    Create instance:  sudo nvidia-smi mig -cgi <profile_id> -C
    List instances:   nvidia-smi -L
    Destroy all:      sudo nvidia-smi mig -dci && sudo nvidia-smi mig -dgi

CLOCKS
    Query current:    nvidia-smi --query-gpu=clocks.current.graphics --format=csv
    Lock clock:       sudo nvidia-smi -lgc <min>,<max>
    Set power limit:  sudo nvidia-smi -pl <watts>
    Reset:            sudo nvidia-smi -rgc

ECC
    Check status:     nvidia-smi --query-gpu=ecc.mode.current --format=csv
    Enable:           sudo nvidia-smi -e 1 (then reboot)
    View errors:      nvidia-smi -q -d ECC
    Clear errors:     sudo nvidia-smi -p 0

Conclusion

GPU management isn't glamorous, but it's essential for reliable ML systems:

  • PERSISTENCE MODE saves seconds on every startup
  • MPS can improve utilization but sacrifices isolation
  • MIG provides true hardware isolation for multi-tenant/multi-model scenarios
  • CLOCK MANAGEMENT lets you balance performance, power, and cost
  • ECC protects against silent data corruption

Most cloud providers and platforms (Modal, AWS, GCP) handle this for you. But if you're managing your own GPUs - or just want to understand what's happening under the hood - these five concepts are fundamental.

The key insight: there's no one-size-fits-all configuration. A training cluster optimizes for maximum throughput. An inference server optimizes for consistent latency. A shared GPU environment needs isolation. Choose the right tools for your workload.
