Baidu Releases Unlimited OCR, a 3B Model That Keeps the KV Cache Flat for Long-Document Parsing






Most end-to-end OCR models slow down as output grows. Each generated token adds to the KV cache. Memory rises and generation drags. Parsing dozens of pages becomes impractical. Baidu’s Unlimited OCR addresses this directly. It swaps the decoder’s attention for a design that keeps memory constant.

TL;DR

  • Unlimited OCR is a 3B-parameter Mixture-of-Experts model, with only 500M parameters active.
  • It replaces decoder attention with Reference Sliding Window Attention (R-SWA), keeping the KV cache constant.
  • The model parses dozens of pages in one forward pass under a 32K maximum length.
  • It scores 93.23 on OmniDocBench v1.5, beating the DeepSeek OCR baseline by 6.22 points.
  • It builds on DeepSeek OCR via continue-training, not a from-scratch run.

What is Unlimited OCR?

Unlimited OCR takes DeepSeek OCR as its baseline. It keeps the DeepEncoder and the Mixture-of-Experts decoder. The MoE design holds 3B total parameters but activates only 500M at inference.

The DeepEncoder is the compression engine. It cascades a SAM-ViT under window attention with a CLIP-ViT under global attention. At the bridge, it applies 16× token compression. A 1024×1024 PDF image becomes just 256 visual tokens. Fewer input tokens mean a smaller prefill.

DeepEncoder natively supports five resolution modes, and Unlimited OCR keeps two. ‘Base’ mode runs at 1024×1024 for multi-page work. ‘Gundam’ mode uses dynamic resolution for single pages.

https://arxiv.org/pdf/2606.23050

How R-SWA Keeps the Cache Constant

The contribution is Reference Sliding Window Attention. Standard Multi-Head Attention stores a key and value for every token. As output length T grows, the cache grows with it. The size is CMHA(T) = Lm + T. Memory and latency climb without bound.

R-SWA breaks that link. Each generated token attends to all reference tokens, meaning the visual tokens and the prompt. It also attends to the preceding n output tokens, where n defaults to 128. Everything older is evicted. The cache becomes a fixed queue of size m + n.

The size is CR-SWA(T) = Lm + min(n, T) ≤ Lm + n. It is bounded by a constant. As T grows far beyond n, the cache ratio trends toward zero. So memory stays flat and per-step latency stays flat.

The research team compare this to soft forgetting. A person copying a book glances at the source and the last few words. They do not re-read everything transcribed so far. Visual tokens never undergo state updates. That avoids the progressive blurring seen in linear attention. The interactive simulator below lets you vary T and watch both caches respond.




Source link

  • Related Posts

    Gradium Launches stt-translate and s2s-translate, Real-Time Speech Translation Models Beating gpt-realtime-translate on Accuracy and Latency

    Gradium today released two real-time speech translation models: stt-translate and s2s-translate. Both run across five languages and stream results live in the browser. Gradium claims a better accuracy-latency tradeoff than…

    How to Design an OpenHarness Style Agent Runtime with Tools, Memory, Permissions, Skills, and Multi-Agent Coordination

    async def demo_memory(): explain( “DEMO 4 — Memory: persistent MEMORY.md across sessions”, “””Long-term memory survives between runs by persisting to MEMORY.md. In session 1 the agent records a user preference;…

    Leave a Reply

    Your email address will not be published. Required fields are marked *