
In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. Cold-starting inference workloads on Kubernetes can take several minutes. During that time, GPUs are allocated but idle, generating no tokens and serving no requests.
‘Cold start’ means the full sequence a model server must complete before serving any request: pulling the container image, loading model weights into GPU memory, warming up CUDA kernels, compiling or capturing CUDA graphs, and registering with the service discovery layer. This delay increases the risk of SLA violations during traffic spikes, as the system cannot scale quickly enough to absorb sudden increases in demand.
The cold-start latency for a single-GPU vLLM (v0.20.0) workload breaks into three segments: container/image pull, engine initialization (weight loading, kernel warmup, graph compilation), and distributed runtime startup.
To address this, NVIDIA’s AI research team has introduced NVIDIA Dynamo Snapshot: a checkpoint/restore approach for AI inference workloads on Kubernetes.

What is CRIU and cuda-checkpoint?
A running inference worker’s checkpointable state has two components. Device state (GPU-side) includes CUDA contexts, streams, device memory, and virtual address mappings — this is not visible to the host. To serialize it, cuda-checkpoint uses the checkpointing capability of the CUDA driver to dump the device state to CPU memory of the process owning each CUDA context. Host state (CPU-side) includes CPU memory, threads, file descriptors, and namespaces. CRIU (Checkpoint/Restore in Userspace) walks the Linux kernel’s bookkeeping and serializes the process tree’s state to disk.
When checkpointing, the two tools run in order: cuda-checkpoint dumps all device state into CPU memory first, then CRIU dumps all host-side process tree state to a folder in storage. When restoring on the same or a different node: CRIU restores the process tree from distributed storage such as NFS or SMB first, then cuda-checkpoint restores the GPU state from what is now in CPU memory onto the new GPUs.
CRIU is fundamentally a freeze-and-thaw mechanism. When a process is restored, execution resumes at the exact instruction where it was checkpointed, completely unaware that checkpointing or restoration occurred. Because of this, any coordination required before checkpointing such as quiescing the workload or after restoration such as re-establishing external state — must be handled externally through an orchestrator or workload-specific hooks.
How Dynamo Snapshot Works on Kubernetes
In Kubernetes, workloads run inside containers inside pods. Because CRIU checkpoints contain references to the container’s writable filesystem layer, checkpointing is done at the container level so the process tree state and filesystem travel together.
NVIDIA provides a privileged DaemonSet, snapshot-agent, installable through a Helm chart. An agent runs on every node and handles checkpoint and restore for runc-managed containers without requiring modifications to runc itself. On checkpoint, the agent waits for the workload’s readiness probe, invokes cuda-checkpoint and CRIU from the host side, and writes the artifact to shared storage. The workload may have created or deleted files local to the container (the overlay filesystem), which the agent also checkpoints after the CRIU stage.
On restore, the agent launches a lightweight placeholder pod, restores the overlay filesystem, and restores the CRIU/CUDA checkpoint into its namespaces. Each agent operates independently on its local node, allowing checkpoints and restores to parallelize naturally across the cluster.
This DaemonSet approach was chosen over Kubernetes native checkpoint/restore support in runc for three reasons: it is fully portable without depending on cloud-provider feature gates, it gives tighter control over CRIU for performance tuning, and it allows checkpoint artifacts to live in flexible storage backends rather than being embedded into OCI images.
Quiesce/resume hooks: A Dynamo inference worker initializes in two ordered phases. First, engine initialization: communicators are initialized, weights are loaded, kernels are warmed up, and CUDA graphs are compiled. The worker is fully warm at this point but not yet discoverable outside its pod. Second, distributed runtime startup: the worker connects to the Dynamo control plane and registers with the discovery backend. Open TCP connections to the control plane exist from this point onward.
If checkpoint were taken after distributed runtime startup, there would be active TCP connections that CRIU cannot capture. The solution is quiesce/resume hooks: the worker writes a ‘ready for checkpoint’ signal file after engine initialization but before distributed runtime startup. The worker then enters a polling loop waiting for a ‘restore complete’ signal file while the snapshot agent checkpoints it externally. Because CRIU restores execution at the exact instruction where checkpointing occurred, the worker resumes directly inside the polling loop, detects the signal file, and proceeds with distributed runtime initialization without requiring additional synchronization.
The quiesce/resume pattern is also important for multi-GPU and multi-node checkpoints (planned for a future release): outbound TCP connections used for RPC cannot be checkpointed in an established state because the pod IP changes between checkpoint and restore, and RDMA registrations and NIC state need to be recreated post-restore.
Optimization 1: KV Cache Unmap and Release
After measuring peak GPU memory usage while weights, CUDA graphs, and other buffers are allocated, inference engines allocate the remaining GPU memory as a large KV cache buffer. Since the checkpoint is taken before the replica has served any requests, this KV cache buffer does not need to be checkpointed at all. However, its virtual address must remain stable because it is baked into the CUDA graph.
The solution is to allocate the KV cache via the CUDA Virtual Memory Management API (cuMemCreate and cuMemMap), then free the underlying physical allocation with cuMemUnmap and cuMemRelease — but not cuMemAddressFree. This keeps the virtual address range intact while releasing the physical memory. This functionality is natively available in vLLM via sleep() and wake_up() and in SGLang via torch_memory_saver.
For Qwen3-0.6B on a B200, this reduces the total artifact size from ~190 GiB to ~6 GiB. The wins are most pronounced for large KV cache sizes — that is, smaller model weights relative to GPU size.
Optimization 2: Speeding Up CRIU Memory Restore
Even after the artifact is smaller, upstream CRIU restore time remains a bottleneck. For larger models, restore time actually exceeds cold-start time, which negates the benefit of checkpointing.
Note: The CRIU optimizations described below are not yet shipped as part of Dynamo Snapshot. They may be available once merged into upstream CRIU.
2.1 — Parallel memfd restore: vLLM’s sleep()/wake_up() path and SGLang’s torch_memory_saver move weight-tagged GPU allocations into pinned CPU shadow buffers. CUDA backs these allocations with shared anonymous memory, pinned through the NVIDIA driver. Inside the Linux kernel, these appear as memfds: anonymous, RAM-backed files mapped with MAP_SHARED. For gpt-oss-120b, these buffers consumed more than 120 GiB, split across many independent 2 GiB-or-smaller buffers. Upstream CRIU restores those buffers serially. The modified CRIU enumerates all unique shmem-backed objects, then uses a thread pool to restore them in parallel, allowing restore to use available storage bandwidth and CPU parallelism.
2.2 — Linux native AIO for anonymous memory: In upstream CRIU, the memory restore path is a synchronous preadv loop with exactly one read in flight at any moment, leaving the storage device idle between requests. The replacement uses Linux native AIO: CRIU submits a batch of iocbs via io_submit and keeps a sliding window of up to 128 reads in flight concurrently. As completions arrive via io_getevents, new submissions backfill the window.
Where the storage backend supports it, both anonymous and shared memory reads use O_DIRECT, avoiding unnecessary page cache pressure during the one-pass restore stream. Linux native AIO is only truly asynchronous on files opened with O_DIRECT. On filesystems where O_DIRECT is unavailable — such as some NFS deployments — restore falls back to buffered I/O with sequential readahead, and the gains from AIO are significantly reduced.
Combined results across three models (checkpoint sizes after KV cache unmap):
| Model | Checkpoint Size | CRIU (upstream) | CRIU (AIO) | CRIU (AIO + parallel memfd) | Speedup | SOL* |
|---|---|---|---|---|---|---|
| Qwen3-0.6B | 6.2 GiB | 6.8 s | 2.9 s | 2.4 s | 2.8× | 0.95 s |
| Qwen3-8B | 26 GiB | 24 s | 11 s | 4.7 s | 5.1× | 1.8 s |
| gpt-oss-120b | 129 GiB | 119 s | 54 s | 15 s | 7.9× | 11 s |
*SOL (speed of light) is the theoretical maximum restore speed given available storage bandwidth — the floor below which restore time cannot go.
At this point CRIU restore time is close to SOL, but end-to-end restore is still dominated by moving large model weights sequentially from storage through host memory onto the GPU. This is a serial bottleneck: cuda-checkpoint cannot restore GPU memory until CRIU materializes the weights in host memory.
Optimization 3: GPU Memory Service (GMS)
To eliminate the serial weight-transfer bottleneck, NVIDIA’s research team developed the GPU Memory Service (GMS). GMS uses the CUDA Virtual Memory Management (VMM) API to decouple large model weights from the inference worker’s process lifetime, offloading the majority of process memory into a separate GMS artifact. By removing weights from the core CRIU checkpoint, GMS allows process state restoration and weight restoration to run concurrently using different memory bandwidth channels. Weight restoration can use the fastest available paths such as GPUDirect Storage (GDS) or peer-GPU RDMA/NVLink.
Checkpoint artifact sizes with GMS:
| Model | CRIU checkpoint (baseline) | CRIU checkpoint (with GMS) | GMS weight artifact |
|---|---|---|---|
| Qwen3-0.6B | 6.2 GiB | 4.3 GiB | 1.2 GiB |
| Qwen3-8B | 26 GiB | 4.8 GiB | 15 GiB |
| gpt-oss-120b | 129 GiB | 6.7 GiB | 74 GiB |
In a proof-of-concept weight restoration backend that stripes weights across 8 local NVMe SSDs, weight restoration completes in parallel with CRIU process restore — bringing total end-to-end startup time for gpt-oss-120b under 5 seconds, a 21× reduction. Restore times are measured from a common restore trigger timestamp, excluding container startup time.
Deployment: Kubernetes Resources
The deployment workflow uses three Kubernetes resources. The snapshot-agent DaemonSet is installed via Helm chart. The DynamoCheckpoint custom resource (shortname: dckpt) defines which model configuration to checkpoint. The DynamoGraphDeployment CR references the checkpoint for restore.
Prerequisites from the documentation: x86_64 (amd64) GPU nodes; NVIDIA driver 580.xx or newer on GPU nodes (590.xx or newer for multi-GPU snapshots); ReadWriteMany storage for cross-node restore; current backend support is vLLM only, in limited preview.
The DynamoCheckpoint identity is a 16-character SHA256 hash of fields that affect runtime state: model, backendFramework, dynamoVersion, tensorParallelSize, pipelineParallelSize, dtype, maxModelLen, and extraParameters. Fields that do not affect the hash include replica count, node placement, resource limits, and observability configuration.
Two deployment modes exist. The explicit checkpointRef mode references a ready DynamoCheckpoint by name. Auto mode has the operator compute the identity hash, look for a matching DynamoCheckpoint, and create one only when no match exists — the first worker cold-starts and the checkpoint is created in the background for subsequent scale events.
Current limitations: checkpoint/restore supports vLLM workers only in limited preview; specialized workers (multimodal, embedding, diffusion) are not supported; multi-GPU tensor-parallel configurations have limited validation; GMS restore is not yet available; snapshot-agent must run privileged; and restore is sensitive to live TCP socket state.
Key Takeaways
- Dynamo Snapshot uses CRIU and cuda-checkpoint to freeze and restore single-GPU inference workers on Kubernetes, avoiding full cold-start latency.
- KV cache unmap via
cuMemUnmapandcuMemReleasereduces checkpoint artifact size from ~190 GiB to ~6 GiB for Qwen3-0.6B on a B200. - Linux native AIO and parallel memfd restore cut CRIU restore time by up to 7.9× over upstream CRIU; these optimizations are pending upstream CRIU merge.
- The GPU Memory Service (GMS) decouples model weights from the CRIU artifact, enabling concurrent process and weight restoration over channels like GPUDirect Storage.
- In a proof-of-concept using 8 striped local NVMe SSDs, gpt-oss-120b startup time is reduced by 21× to under 5 seconds.
Marktechpost’s Visual Explainer
Check out the Technical details. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us






