NVIDIA AI Releases Dynamo Snapshot: A CRIU-Based Fast Startup System for AI Inference on Kubernetes

In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. Cold-starting inference workloads on Kubernetes can take several minutes. During that time, GPUs are allocated but idle, generating no tokens and serving no requests.

‘Cold start’ means the full sequence a model server must complete before serving any request: pulling the container image, loading model weights into GPU memory, warming up CUDA kernels, compiling or capturing CUDA graphs, and registering with the service discovery layer. This delay increases the risk of SLA violations during traffic spikes, as the system cannot scale quickly enough to absorb sudden increases in demand.

The cold-start latency for a single-GPU vLLM (v0.20.0) workload breaks into three segments: container/image pull, engine initialization (weight loading, kernel warmup, graph compilation), and distributed runtime startup.

To address this, NVIDIA’s AI research team has introduced NVIDIA Dynamo Snapshot: a checkpoint/restore approach for AI inference workloads on Kubernetes.

https://developer.nvidia.com/blog/nvidia-dynamo-snapshot-fast-startup-for-inference-workloads-on-kubernetes/?linkId=100000423964029

What is CRIU and cuda-checkpoint?

A running inference worker’s checkpointable state has two components. Device state (GPU-side) includes CUDA contexts, streams, device memory, and virtual address mappings — this is not visible to the host. To serialize it, cuda-checkpoint uses the checkpointing capability of the CUDA driver to dump the device state to CPU memory of the process owning each CUDA context. Host state (CPU-side) includes CPU memory, threads, file descriptors, and namespaces. CRIU (Checkpoint/Restore in Userspace) walks the Linux kernel’s bookkeeping and serializes the process tree’s state to disk.

When checkpointing, the two tools run in order: cuda-checkpoint dumps all device state into CPU memory first, then CRIU dumps all host-side process tree state to a folder in storage. When restoring on the same or a different node: CRIU restores the process tree from distributed storage such as NFS or SMB first, then cuda-checkpoint restores the GPU state from what is now in CPU memory onto the new GPUs.

CRIU is fundamentally a freeze-and-thaw mechanism. When a process is restored, execution resumes at the exact instruction where it was checkpointed, completely unaware that checkpointing or restoration occurred. Because of this, any coordination required before checkpointing such as quiescing the workload or after restoration such as re-establishing external state — must be handled externally through an orchestrator or workload-specific hooks.

How Dynamo Snapshot Works on Kubernetes

In Kubernetes, workloads run inside containers inside pods. Because CRIU checkpoints contain references to the container’s writable filesystem layer, checkpointing is done at the container level so the process tree state and filesystem travel together.

NVIDIA provides a privileged DaemonSet, snapshot-agent, installable through a Helm chart. An agent runs on every node and handles checkpoint and restore for runc-managed containers without requiring modifications to runc itself. On checkpoint, the agent waits for the workload’s readiness probe, invokes cuda-checkpoint and CRIU from the host side, and writes the artifact to shared storage. The workload may have created or deleted files local to the container (the overlay filesystem), which the agent also checkpoints after the CRIU stage.

On restore, the agent launches a lightweight placeholder pod, restores the overlay filesystem, and restores the CRIU/CUDA checkpoint into its namespaces. Each agent operates independently on its local node, allowing checkpoints and restores to parallelize naturally across the cluster.

This DaemonSet approach was chosen over Kubernetes native checkpoint/restore support in runc for three reasons: it is fully portable without depending on cloud-provider feature gates, it gives tighter control over CRIU for performance tuning, and it allows checkpoint artifacts to live in flexible storage backends rather than being embedded into OCI images.

Quiesce/resume hooks: A Dynamo inference worker initializes in two ordered phases. First, engine initialization: communicators are initialized, weights are loaded, kernels are warmed up, and CUDA graphs are compiled. The worker is fully warm at this point but not yet discoverable outside its pod. Second, distributed runtime startup: the worker connects to the Dynamo control plane and registers with the discovery backend. Open TCP connections to the control plane exist from this point onward.

If checkpoint were taken after distributed runtime startup, there would be active TCP connections that CRIU cannot capture. The solution is quiesce/resume hooks: the worker writes a ‘ready for checkpoint’ signal file after engine initialization but before distributed runtime startup. The worker then enters a polling loop waiting for a ‘restore complete’ signal file while the snapshot agent checkpoints it externally. Because CRIU restores execution at the exact instruction where checkpointing occurred, the worker resumes directly inside the polling loop, detects the signal file, and proceeds with distributed runtime initialization without requiring additional synchronization.

The quiesce/resume pattern is also important for multi-GPU and multi-node checkpoints (planned for a future release): outbound TCP connections used for RPC cannot be checkpointed in an established state because the pod IP changes between checkpoint and restore, and RDMA registrations and NIC state need to be recreated post-restore.

Optimization 1: KV Cache Unmap and Release

After measuring peak GPU memory usage while weights, CUDA graphs, and other buffers are allocated, inference engines allocate the remaining GPU memory as a large KV cache buffer. Since the checkpoint is taken before the replica has served any requests, this KV cache buffer does not need to be checkpointed at all. However, its virtual address must remain stable because it is baked into the CUDA graph.

The solution is to allocate the KV cache via the CUDA Virtual Memory Management API (cuMemCreate and cuMemMap), then free the underlying physical allocation with cuMemUnmap and cuMemRelease — but not cuMemAddressFree. This keeps the virtual address range intact while releasing the physical memory. This functionality is natively available in vLLM via sleep() and wake_up() and in SGLang via torch_memory_saver.

For Qwen3-0.6B on a B200, this reduces the total artifact size from ~190 GiB to ~6 GiB. The wins are most pronounced for large KV cache sizes — that is, smaller model weights relative to GPU size.

Optimization 2: Speeding Up CRIU Memory Restore

Even after the artifact is smaller, upstream CRIU restore time remains a bottleneck. For larger models, restore time actually exceeds cold-start time, which negates the benefit of checkpointing.

Note: The CRIU optimizations described below are not yet shipped as part of Dynamo Snapshot. They may be available once merged into upstream CRIU.

2.1 — Parallel memfd restore: vLLM’s sleep()/wake_up() path and SGLang’s torch_memory_saver move weight-tagged GPU allocations into pinned CPU shadow buffers. CUDA backs these allocations with shared anonymous memory, pinned through the NVIDIA driver. Inside the Linux kernel, these appear as memfds: anonymous, RAM-backed files mapped with MAP_SHARED. For gpt-oss-120b, these buffers consumed more than 120 GiB, split across many independent 2 GiB-or-smaller buffers. Upstream CRIU restores those buffers serially. The modified CRIU enumerates all unique shmem-backed objects, then uses a thread pool to restore them in parallel, allowing restore to use available storage bandwidth and CPU parallelism.

2.2 — Linux native AIO for anonymous memory: In upstream CRIU, the memory restore path is a synchronous preadv loop with exactly one read in flight at any moment, leaving the storage device idle between requests. The replacement uses Linux native AIO: CRIU submits a batch of iocbs via io_submit and keeps a sliding window of up to 128 reads in flight concurrently. As completions arrive via io_getevents, new submissions backfill the window.

Where the storage backend supports it, both anonymous and shared memory reads use O_DIRECT, avoiding unnecessary page cache pressure during the one-pass restore stream. Linux native AIO is only truly asynchronous on files opened with O_DIRECT. On filesystems where O_DIRECT is unavailable — such as some NFS deployments — restore falls back to buffered I/O with sequential readahead, and the gains from AIO are significantly reduced.

Combined results across three models (checkpoint sizes after KV cache unmap):

Model	Checkpoint Size	CRIU (upstream)	CRIU (AIO)	CRIU (AIO + parallel memfd)	Speedup	SOL*
Qwen3-0.6B	6.2 GiB	6.8 s	2.9 s	2.4 s	2.8×	0.95 s
Qwen3-8B	26 GiB	24 s	11 s	4.7 s	5.1×	1.8 s
gpt-oss-120b	129 GiB	119 s	54 s	15 s	7.9×	11 s

*SOL (speed of light) is the theoretical maximum restore speed given available storage bandwidth — the floor below which restore time cannot go.

At this point CRIU restore time is close to SOL, but end-to-end restore is still dominated by moving large model weights sequentially from storage through host memory onto the GPU. This is a serial bottleneck: cuda-checkpoint cannot restore GPU memory until CRIU materializes the weights in host memory.

Optimization 3: GPU Memory Service (GMS)

To eliminate the serial weight-transfer bottleneck, NVIDIA’s research team developed the GPU Memory Service (GMS). GMS uses the CUDA Virtual Memory Management (VMM) API to decouple large model weights from the inference worker’s process lifetime, offloading the majority of process memory into a separate GMS artifact. By removing weights from the core CRIU checkpoint, GMS allows process state restoration and weight restoration to run concurrently using different memory bandwidth channels. Weight restoration can use the fastest available paths such as GPUDirect Storage (GDS) or peer-GPU RDMA/NVLink.

Checkpoint artifact sizes with GMS:

Model	CRIU checkpoint (baseline)	CRIU checkpoint (with GMS)	GMS weight artifact
Qwen3-0.6B	6.2 GiB	4.3 GiB	1.2 GiB
Qwen3-8B	26 GiB	4.8 GiB	15 GiB
gpt-oss-120b	129 GiB	6.7 GiB	74 GiB

In a proof-of-concept weight restoration backend that stripes weights across 8 local NVMe SSDs, weight restoration completes in parallel with CRIU process restore — bringing total end-to-end startup time for gpt-oss-120b under 5 seconds, a 21× reduction. Restore times are measured from a common restore trigger timestamp, excluding container startup time.

Deployment: Kubernetes Resources

The deployment workflow uses three Kubernetes resources. The snapshot-agent DaemonSet is installed via Helm chart. The DynamoCheckpoint custom resource (shortname: dckpt) defines which model configuration to checkpoint. The DynamoGraphDeployment CR references the checkpoint for restore.

Prerequisites from the documentation: x86_64 (amd64) GPU nodes; NVIDIA driver 580.xx or newer on GPU nodes (590.xx or newer for multi-GPU snapshots); ReadWriteMany storage for cross-node restore; current backend support is vLLM only, in limited preview.

The DynamoCheckpoint identity is a 16-character SHA256 hash of fields that affect runtime state: model, backendFramework, dynamoVersion, tensorParallelSize, pipelineParallelSize, dtype, maxModelLen, and extraParameters. Fields that do not affect the hash include replica count, node placement, resource limits, and observability configuration.

Two deployment modes exist. The explicit checkpointRef mode references a ready DynamoCheckpoint by name. Auto mode has the operator compute the identity hash, look for a matching DynamoCheckpoint, and create one only when no match exists — the first worker cold-starts and the checkpoint is created in the background for subsequent scale events.

Current limitations: checkpoint/restore supports vLLM workers only in limited preview; specialized workers (multimodal, embedding, diffusion) are not supported; multi-GPU tensor-parallel configurations have limited validation; GMS restore is not yet available; snapshot-agent must run privileged; and restore is sensitive to live TCP socket state.

Key Takeaways

Dynamo Snapshot uses CRIU and cuda-checkpoint to freeze and restore single-GPU inference workers on Kubernetes, avoiding full cold-start latency.
KV cache unmap via cuMemUnmap and cuMemRelease reduces checkpoint artifact size from ~190 GiB to ~6 GiB for Qwen3-0.6B on a B200.
Linux native AIO and parallel memfd restore cut CRIU restore time by up to 7.9× over upstream CRIU; these optimizations are pending upstream CRIU merge.
The GPU Memory Service (GMS) decouples model weights from the CRIU artifact, enabling concurrent process and weight restoration over channels like GPUDirect Storage.
In a proof-of-concept using 8 striped local NVMe SSDs, gpt-oss-120b startup time is reduced by 21× to under 5 seconds.

Marktechpost’s Visual Explainer

01 — Overview
What Is NVIDIA Dynamo Snapshot?

In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically.
Cold-starting inference workloads on Kubernetes can take several minutes. During that time, GPUs are allocated but idle,
generating no tokens and serving no requests.

NVIDIA Dynamo Snapshot is a checkpoint/restore system for AI inference workloads on Kubernetes.
It serializes the full state of a running inference worker — both GPU-side and CPU-side — and restores it on the same or a different node,
skipping the cold-start sequence entirely.

21xStartup speedup
gpt-oss-120b (PoC)

<5sRestore time
8× NVMe SSDs (PoC)

6 GiBCheckpoint size
vs ~190 GiB baseline

02 — Core Tools
CRIU and cuda-checkpoint

A running inference worker has two types of checkpointable state. Dynamo Snapshot uses one tool per type:

cuda-checkpoint — serializes GPU device state (CUDA contexts, streams, device memory, virtual address mappings) into CPU memory of the process owning each CUDA context. Uses the checkpointing capability of the CUDA driver.
CRIU (Checkpoint/Restore in Userspace) — walks Linux kernel bookkeeping and serializes the host-side process tree (CPU memory, threads, file descriptors, namespaces) to disk.

Checkpoint order: cuda-checkpoint dumps GPU state to CPU memory first, then CRIU dumps all host-side state to storage.
Restore order: CRIU reconstructs the process tree from storage first, then cuda-checkpoint restores GPU state from what is now in CPU memory onto the new GPUs.

03 — Kubernetes Architecture
The snapshot-agent DaemonSet

Dynamo Snapshot is deployed as a privileged DaemonSet called snapshot-agent, installed via a Helm chart.
One agent runs on every node and handles checkpoint and restore for runc-managed containers without modifying runc itself.

On checkpoint: agent waits for the workload readiness probe, invokes cuda-checkpoint and CRIU, then writes the artifact to shared storage. The overlay filesystem is also checkpointed after the CRIU stage.
On restore: agent launches a lightweight placeholder pod, restores the overlay filesystem, then restores the CRIU/CUDA checkpoint into its namespaces.
Parallelism: each agent operates independently on its local node, so checkpoints and restores parallelize naturally across the cluster.
Portability: no cloud-provider feature gate dependency; checkpoint artifacts live in flexible storage backends, not embedded in OCI images.

04 — Workload Coordination
Quiesce/Resume Hooks

A Dynamo inference worker initializes in two ordered phases. The checkpoint must be taken between them:

Phase 1 — Engine initialization: communicators initialized, weights loaded, kernels warmed up, CUDA graphs compiled. Worker is fully warm but not yet discoverable outside its pod.
Phase 2 — Distributed runtime startup: worker connects to the Dynamo control plane and registers with the discovery backend. Open TCP connections exist from this point onward — CRIU cannot capture them.

Implementation: the worker writes a "ready for checkpoint" signal file after Phase 1 but before Phase 2, then enters a polling loop.
The snapshot agent checkpoints it while it waits. On restore, CRIU resumes execution inside the polling loop.
The worker detects the "restore complete" signal file and proceeds with Phase 2 — no extra synchronization needed.

05 — Optimization 1
KV Cache Unmap and Release

Inference engines allocate remaining GPU memory as a large KV cache buffer after weights and CUDA graphs are placed.
Since the checkpoint is taken before any requests are served, the KV cache contents do not need to be checkpointed.
However, its virtual address must stay stable because it is baked into the CUDA graph.

Allocate the KV cache via the CUDA Virtual Memory Management API: cuMemCreate + cuMemMap.
Free physical memory with cuMemUnmap + cuMemRelease — but not cuMemAddressFree. The virtual address range stays intact.
Already natively available in vLLM via sleep() / wake_up() and in SGLang via torch_memory_saver.

~190GiB — before unmap
Qwen3-0.6B on B200

~6GiB — after unmap
same model, same GPU

06 — Optimization 2.1
Parallel memfd Restore

vLLM’s sleep()/wake_up() and SGLang’s torch_memory_saver move weight-tagged GPU allocations
into pinned CPU shadow buffers. CUDA backs these with shared anonymous memory, which appear in the Linux kernel as
memfds: anonymous, RAM-backed files mapped with MAP_SHARED.

For gpt-oss-120b, these buffers consumed more than 120 GiB, split across many independent ≤2 GiB buffers.
Upstream CRIU restores those buffers serially: create one shmem-backed object, resize, map, read, then move to the next.
Modified CRIU enumerates all unique shmem-backed objects, then launches a thread pool to restore them in parallel. Each worker allocates its buffer and reads from the checkpoint independently.

Note: These CRIU optimizations are not yet shipped as part of Dynamo Snapshot. They will be available once merged into upstream CRIU.

07 — Optimization 2.2
Linux Native AIO for Anonymous Memory

After restoring shared resources, CRIU must fill each process’s private memory: heap pages, stacks, anonymous mappings,
and copy-on-write private file mappings — at the exact virtual addresses they had before checkpoint.

Upstream CRIU: synchronous preadv loop — one read in flight at a time. Storage device is idle between requests. Cannot saturate fast NVMe bandwidth.
Modified CRIU: Linux native AIO. Submits batches of iocbs via io_submit, keeps a sliding window of up to 128 reads in flight. Completions arrive via io_getevents; new submissions backfill the window.
Where supported, both anonymous and shared memory reads use O_DIRECT to avoid unnecessary page cache pressure. AIO is only truly asynchronous on O_DIRECT files — on some NFS deployments without O_DIRECT, gains are reduced.

08 — Results
CRIU Restore Time Comparison
Combined results across three models after KV cache unmap. SOL = speed of light (theoretical maximum restore speed given available storage bandwidth).

Model	Ckpt Size	Upstream	AIO only	AIO + memfd	Speedup	SOL
Qwen3-0.6B	6.2 GiB	6.8 s	2.9 s	2.4 s	2.8x	0.95 s
Qwen3-8B	26 GiB	24 s	11 s	4.7 s	5.1x	1.8 s
gpt-oss-120b	129 GiB	119 s	54 s	15 s	7.9x	11 s

Note: These CRIU optimizations are pending upstream CRIU merge and are not yet shipped in Dynamo Snapshot.

09 — Optimization 3
GPU Memory Service (GMS)

Even after optimizing CRIU, a hard bottleneck remained: cuda-checkpoint cannot restore GPU memory until CRIU
fully materializes the weights in host memory — a serial dependency. GMS eliminates it.

GMS uses the CUDA Virtual Memory Management (VMM) API to decouple large model weights from the inference worker’s process lifetime into a separate GMS artifact.
Process state (CRIU) and weight restoration (GMS) run concurrently using different memory bandwidth channels.
Weight restoration can use the fastest available paths: GPUDirect Storage (GDS) or peer-GPU RDMA/NVLink.

Model	CRIU (baseline)	CRIU + GMS	GMS artifact
Qwen3-0.6B	6.2 GiB	4.3 GiB	1.2 GiB
Qwen3-8B	26 GiB	4.8 GiB	15 GiB
gpt-oss-120b	129 GiB	6.7 GiB	74 GiB

10 — Deployment & Roadmap
Deploying Snapshot & What’s Next
Prerequisites (from docs.nvidia.com/dynamo v1.1.1):

x86_64 (amd64) GPU nodes; NVIDIA driver 580.xx+ (590.xx+ for multi-GPU snapshots)
ReadWriteMany storage for cross-node restore
vLLM backend only — limited preview; specialized workers (multimodal, embedding, diffusion) not supported
Checkpoint identity is a 16-character SHA256 hash of: model, backendFramework, tensorParallelSize, dtype, maxModelLen, dynamoVersion, pipelineParallelSize, extraParameters

Roadmap (in progress):

GMS restore path with pluggable backends (GDS, UCX) — gated on pending CUDA driver patch
TensorRT-LLM support
Multi-GPU and multi-node support via quiesce/resume hooks for PyTorch, NCCL, NIXL

Check out the Technical details. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us