Baidu Qianfan Team Releases Qianfan-OCR: A 4B-Parameter Unified Document Intelligence Model






The Baidu Qianfan Team introduced Qianfan-OCR, a 4B-parameter end-to-end model designed to unify document parsing, layout analysis, and document understanding within a single vision-language architecture. Unlike traditional multi-stage OCR pipelines that chain separate modules for layout detection and text recognition, Qianfan-OCR performs direct image-to-Markdown conversion and supports prompt-driven tasks like table extraction and document question answering.
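Because the same checkpoint handles parsing, table extraction, and question answering, task selection happens purely through the instruction text. A minimal sketch of what that routing could look like (the prompt strings and helper below are hypothetical illustrations, not taken from the paper):

```python
# Hypothetical prompt templates for a unified, prompt-driven OCR interface.
# The actual instruction wording used by Qianfan-OCR is not given in the
# article; these strings are illustrative placeholders.
TASK_PROMPTS = {
    "parse": "Convert this document image to Markdown.",
    "table": "Extract every table in this image as a Markdown table.",
    "qa":    "Answer based on this document: {question}",
}

def build_prompt(task: str, **kwargs) -> str:
    """Select and fill the instruction for a given document task."""
    return TASK_PROMPTS[task].format(**kwargs)

print(build_prompt("qa", question="What is the invoice total?"))
```

The point of the sketch is the design choice: one model, many tasks, with no pipeline reconfiguration between them.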

https://arxiv.org/pdf/2603.13398

Architecture and Technical Specifications

Qianfan-OCR utilizes the multimodal bridging architecture from the Qianfan-VL framework. The system consists of three primary components:

  • Vision Encoder (Qianfan-ViT): Employs an Any Resolution design that tiles images into 448 x 448 patches. It supports variable-resolution inputs up to 4K, producing up to 4,096 visual tokens per image to maintain spatial resolution for small fonts and dense text.
  • Cross-Modal Adapter: A lightweight two-layer MLP with GELU activation that projects visual features into the language model’s embedding space.
  • Language Model Backbone (Qwen3-4B): A 4.0B-parameter model with 36 layers and a native 32K context window. It utilizes Grouped-Query Attention (GQA) to reduce KV cache memory usage by 4x.
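Under the Any Resolution scheme, the visual token count follows directly from the tile grid. A back-of-the-envelope sketch (the tokens-per-tile value of 256 is an assumption for illustration; the article only states the 448-pixel tile size and the 4,096-token cap):

```python
import math

TILE = 448        # tile side (pixels) in the Any Resolution design
TOKEN_CAP = 4096  # maximum visual tokens per image

def tile_grid(width: int, height: int, tile: int = TILE) -> tuple:
    """Tiles needed to cover an image along each axis."""
    return math.ceil(width / tile), math.ceil(height / tile)

def visual_tokens(width: int, height: int, tokens_per_tile: int = 256) -> int:
    # tokens_per_tile = 256 is an assumed figure, not from the paper
    cols, rows = tile_grid(width, height)
    return min(cols * rows * tokens_per_tile, TOKEN_CAP)

# A 4K page (3840x2160) needs a 9x5 tile grid and saturates the token cap.
print(tile_grid(3840, 2160), visual_tokens(3840, 2160))
```

The cap matters because those visual tokens all land in the Qwen3-4B backbone's 32K context, leaving room for the prompt, the optional thinking phase, and the Markdown output.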

‘Layout-as-Thought’ Mechanism

The model's defining feature is Layout-as-Thought, an optional thinking phase triggered by special control tokens. During this phase, the model generates structured layout representations, including bounding boxes, element types, and reading order, before producing the final output.

  • Functional Utility: This process recovers explicit layout analysis capabilities (element localization and type classification) often lost in end-to-end paradigms.
  • Performance Characteristics: Evaluation on OmniDocBench v1.5 indicates that enabling the thinking phase provides a consistent advantage on documents with high “layout label entropy”—those containing heterogeneous elements like mixed text, formulas, and diagrams.
  • Efficiency: Bounding box coordinates are represented as dedicated special tokens rather than plain digit sequences, reducing thinking output length by approximately 50%.
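The saving from dedicated coordinate tokens can be illustrated directly: a plain tokenizer typically spends roughly one token per digit and separator, while a coordinate vocabulary spends one token per value. The token names below are hypothetical, since the article does not name the actual special tokens:

```python
def bbox_as_digit_tokens(box):
    """Plain-text encoding: roughly one token per character."""
    return list(",".join(str(v) for v in box))

def bbox_as_special_tokens(box):
    # Hypothetical dedicated coordinate vocabulary, one token per value.
    return [f"<coord_{v}>" for v in box]

box = (102, 57, 890, 440)
print(len(bbox_as_digit_tokens(box)), len(bbox_as_special_tokens(box)))  # 14 4
```

For a single box the ratio is better than 3:1; the reported ~50% figure applies to the whole thinking output, which also contains element types and reading-order structure that do not compress.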

Empirical Performance and Benchmarks

Qianfan-OCR was evaluated against both specialized OCR systems and general vision-language models (VLMs).

Document Parsing and General OCR

The model ranks first among end-to-end models on several key benchmarks:

  • OmniDocBench v1.5: Achieved a score of 93.12, surpassing DeepSeek-OCR-v2 (91.09) and Gemini-3 Pro (90.33).
  • OlmOCR Bench: Scored 79.8, leading the end-to-end category.
  • OCRBench: Achieved a score of 880, ranking first among all tested models.

On public KIE benchmarks, Qianfan-OCR achieved the highest average score (87.9), outperforming significantly larger models.

| Model | Overall Mean (KIE) | OCRBench KIE | Nanonets KIE (F1) |
| --- | --- | --- | --- |
| Qianfan-OCR (4B) | 87.9 | 95.0 | 86.5 |
| Qwen3-4B-VL | 83.5 | 89.0 | 83.3 |
| Qwen3-VL-235B-A22B | 84.2 | 94.0 | 83.8 |
| Gemini-3.1-Pro | 79.2 | 96.0 | 76.1 |

Document Understanding

Comparative testing revealed that two-stage OCR+LLM pipelines often fail on tasks requiring spatial reasoning. For instance, all tested two-stage systems scored 0.0 on CharXiv benchmarks, as the text extraction phase discards the visual context (axis relationships, data point positions) necessary for chart interpretation.


Deployment and Inference

Inference efficiency was measured in Pages Per Second (PPS) using a single NVIDIA A100 GPU.

  • Quantization: With W8A8 (AWQ) quantization, Qianfan-OCR achieved 1.024 PPS, a 2x speedup over the W16A16 baseline with negligible accuracy loss.
  • Architecture Advantage: Unlike pipeline systems that rely on CPU-based layout analysis—which can become a bottleneck—Qianfan-OCR is GPU-centric. This avoids inter-stage processing delays and allows for efficient large-batch inference.
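The reported numbers let us back out the implied baseline throughput and per-page latency. Note the 0.512 PPS baseline below is derived from the stated 2x speedup, not a figure reported separately:

```python
W8A8_PPS = 1.024  # reported throughput with W8A8 quantization (pages/sec)
SPEEDUP = 2.0     # reported speedup over the W16A16 baseline

w16a16_pps = W8A8_PPS / SPEEDUP  # implied baseline: 0.512 pages/sec
latency_s = 1.0 / W8A8_PPS       # ~0.977 seconds per page at W8A8

print(f"{w16a16_pps:.3f} PPS baseline, {latency_s:.3f} s/page quantized")
```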

Check out the Paper, Repo, and Model on Hugging Face.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.






