One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing

Building a single model that can both understand and generate images and videos is harder than it sounds. The two tasks pull in opposite directions. Understanding benefits from high-level semantic features tightly aligned with language. Generation needs low-level continuous representations that preserve texture, geometry, and temporal dynamics. Most systems handle this tension by separating the two into distinct architectures, then bridging them post-hoc.

The ByteDance research team took a different approach with Lance. Rather than assembling separate components, the team designed a model that natively integrates understanding, generation, and editing across both image and video modalities, trained jointly from the start.

What Lance Can Do

Lance organizes its capabilities into three output families: text (X2T), images (X2I), and videos (X2V). On the understanding side, this covers image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing — including multi-turn consistency editing across both modalities.

This breadth is what sets Lance apart. Most unified architectures stop at basic image understanding and text-to-image generation; Lance is among the few to cover both images and videos across understanding and generation tasks.

How the Architecture Works

The architecture is based on two principles: unified context modeling and decoupled capability pathways.

For unified context, Lance converts all inputs — text, images, and videos — into a single shared interleaved multimodal sequence. Text tokens come from the Qwen2.5-VL embedding layer. For understanding-oriented visual inputs, the Qwen2.5-VL ViT encoder produces compact semantic visual tokens. For generation-oriented visual inputs, the Wan2.2 3D causal VAE encoder encodes images and videos into continuous latent representations, applying 16× spatial downsampling and 4× temporal downsampling. All these heterogeneous token types — text, semantic visual, and latent visual — live in the same sequence. The model then runs generalized 3D causal attention over the full context, with text tokens using causal attention and visual tokens using bidirectional attention.
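The mixed attention pattern described above can be illustrated with a small mask builder. This is a minimal sketch, not Lance's implementation: the function name `build_attention_mask`, the segment format, and the exact rule (every token sees all earlier tokens, and tokens inside the same visual segment also see each other) are assumptions made for illustration.

```python
from typing import List, Tuple


def build_attention_mask(segments: List[Tuple[str, int]]) -> List[List[bool]]:
    """Return mask[q][k] = True when query token q may attend to key token k.

    segments: ordered (kind, length) pairs, with kind in {"text", "visual"}.
    Assumed rule (hypothetical): attention is causal across the sequence,
    but tokens inside the same visual segment attend to each other in
    both directions. Text tokens stay strictly causal.
    """
    seg_id: List[int] = []
    kinds: List[str] = []
    for index, (kind, length) in enumerate(segments):
        if kind not in ("text", "visual"):
            raise ValueError(f"unknown segment kind: {kind}")
        seg_id.extend([index] * length)
        kinds.extend([kind] * length)

    n = len(seg_id)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q
            same_visual = kinds[q] == "visual" and seg_id[q] == seg_id[k]
            mask[q][k] = causal or same_visual
    return mask


# Two text tokens, a three-token visual segment, then two more text tokens.
mask = build_attention_mask([("text", 2), ("visual", 3), ("text", 2)])
```

In this toy layout, the first visual token (index 2) can see the last visual token (index 4), while no text token can see a later position.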

For decoupled pathways, Lance uses a dual-stream mixture-of-experts architecture initialized from Qwen2.5-VL 3B. The understanding expert (LLM_UND) handles text and semantic visual tokens, producing outputs for multimodal reasoning and text generation. The generation expert (LLM_GEN) handles VAE latent tokens for visual synthesis and editing. Crucially, both experts operate over the same shared interleaved sequence: they share context but do not compete for the same parameters. The understanding expert is trained with a next-token prediction loss; the generation expert is trained with a flow matching objective in continuous latent space. The two losses are combined with configurable weights throughout training.
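To make the two training objectives concrete, here is a minimal pure-Python sketch of a weighted sum of a next-token cross-entropy loss and a flow matching loss. The article does not give the exact formulation, so several things are assumptions: the linear interpolation path between noise and clean latents, the velocity target `clean - noise`, and the names `w_und` and `w_gen` for the configurable weights.

```python
import math
from typing import Sequence


def next_token_loss(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean cross-entropy between per-position logits and target token ids."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
    return total / len(targets)


def interpolate(noise: Sequence[float], clean: Sequence[float], t: float) -> list:
    """Assumed linear path: x_t = (1 - t) * noise + t * clean."""
    return [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]


def flow_matching_loss(pred_velocity: Sequence[float],
                       noise: Sequence[float],
                       clean: Sequence[float]) -> float:
    """Mean squared error against the assumed velocity target (clean - noise)."""
    errors = [(p - (c - n)) ** 2 for p, n, c in zip(pred_velocity, noise, clean)]
    return sum(errors) / len(errors)


def combined_loss(loss_und: float, loss_gen: float,
                  w_und: float = 1.0, w_gen: float = 1.0) -> float:
    """Weighted sum of both objectives; the weight names are hypothetical."""
    return w_und * loss_und + w_gen * loss_gen


# Toy values: uniform logits over 4 tokens, and a perfect velocity prediction.
ce = next_token_loss([[0.0, 0.0, 0.0, 0.0]], [2])
noise, clean = [0.5, -1.0], [1.5, 2.0]
x_half = interpolate(noise, clean, 0.5)
fm = flow_matching_loss([1.0, 3.0], noise, clean)
total = combined_loss(ce, fm, w_und=1.0, w_gen=0.5)
```

With uniform logits over four tokens the cross-entropy equals ln 4, and a velocity prediction that matches `clean - noise` exactly gives a flow matching loss of zero.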

Modality-Aware Rotary Positional Encoding (MaPE)

Running ViT semantic tokens, clean VAE condition tokens, and noisy VAE target tokens through the same sequence creates a subtle problem. Standard 3D-RoPE encodes positions based on spatiotemporal layout alone — it has no way to tell these token groups apart. When multiple visual token groups occupy the same sequence, their positional boundaries become ambiguous, which can hurt cross-task alignment.

Lance introduces Modality-Aware Rotary Positional Encoding (MaPE) to fix this. MaPE applies a fixed temporal offset to each modality group based on its index in the sequence. Spatial coordinates stay unchanged, so the intrinsic layout within images and videos is preserved. The temporal offset alone is enough to separate the token groups in the global positional space without disrupting temporal ordering within any individual video.
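The offset idea can be sketched in a few lines. This is an illustration only: the article says the offset is fixed and depends on the group's index in the sequence, but the specific form `index * stride`, the `stride` value, and the function names below are assumptions.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (t, h, w) coordinates fed to 3D-RoPE


def grid_positions(frames: int, height: int, width: int) -> List[Position]:
    """Plain spatiotemporal coordinates for one visual token group."""
    return [(t, h, w)
            for t in range(frames)
            for h in range(height)
            for w in range(width)]


def apply_modality_offset(groups: List[List[Position]], stride: int) -> List[List[Position]]:
    """Shift the temporal index of group i by i * stride (assumed form).

    Spatial coordinates are left untouched. The stride should exceed the
    longest group's frame count so temporal ranges cannot overlap.
    """
    shifted = []
    for index, group in enumerate(groups):
        offset = index * stride
        shifted.append([(t + offset, h, w) for (t, h, w) in group])
    return shifted


# Hypothetical layout: ViT tokens, a clean VAE condition image, a 2-frame noisy target.
groups = [grid_positions(1, 2, 2), grid_positions(1, 2, 2), grid_positions(2, 2, 2)]
shifted = apply_modality_offset(groups, stride=8)

before = [p for g in groups for p in g]
after = [p for g in shifted for p in g]
```

Before the shift, the three groups share coordinates such as (0, 0, 0); afterwards every token has a distinct position while the (h, w) layout and the frame order inside each group are unchanged.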

Removing MaPE drops GenEval from 80.94 to 80.56 (reported here on a 0-100 scale), GEdit-Bench from 6.86 to 6.30, and VBench from 81.81 to 80.95, a consistent degradation across image generation, image editing, and video generation.

Training: Four Stages, One Unified Framework

Lance is trained through four sequential stages, each building on the last.

Pre-Training (PT) lays the foundation using approximately 1B image-text and 140M video-text pairs, covering 1.5T training tokens. This stage establishes basic multimodal alignment and generation capability. The VAE and ViT encoders are frozen here; only the backbone and connectors are trained.

Continual Training (CT) expands the task space by introducing interleaved multi-task data — editing samples, subject-driven generation samples, and multimodal understanding data — across approximately 300B tokens. A progressive data-mixture schedule gradually increases the proportion of harder tasks like editing as training proceeds.
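A progressive data-mixture schedule of this kind can be sketched as an interpolation between a starting and an ending mixture. The article does not publish the actual proportions or the shape of the schedule, so the linear ramp, the task names, and every number below are hypothetical.

```python
from typing import Dict


def mixture_at(progress: float,
               start: Dict[str, float],
               end: Dict[str, float]) -> Dict[str, float]:
    """Linearly interpolate sampling weights between two mixtures.

    progress: fraction of the stage completed, clamped to [0, 1].
    The result is renormalized so the weights sum to 1.
    """
    if start.keys() != end.keys():
        raise ValueError("start and end mixtures must cover the same tasks")
    p = min(max(progress, 0.0), 1.0)
    raw = {task: (1.0 - p) * start[task] + p * end[task] for task in start}
    total = sum(raw.values())
    return {task: weight / total for task, weight in raw.items()}


# Hypothetical proportions: editing grows as training proceeds.
START = {"understanding": 0.50, "subject_driven": 0.30, "editing": 0.20}
END = {"understanding": 0.40, "subject_driven": 0.25, "editing": 0.35}

halfway = mixture_at(0.5, START, END)
```

A data loader could call `mixture_at(step / total_steps, START, END)` at each step and sample a task according to the returned weights.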

Supervised Fine-Tuning (SFT) tightens instruction following, editing accuracy, and identity consistency using curated high-quality data across 72B tokens.

Reinforcement Learning (RL) uses Group Relative Policy Optimization (GRPO), with PaddleOCR serving as the reward model, to further sharpen text rendering accuracy and image-text alignment.
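The core of GRPO is scoring each sample relative to the other samples drawn for the same prompt. The sketch below shows that step in its commonly used form (reward minus group mean, divided by group standard deviation). The reward function is a stand-in: it compares the requested text with a string that an OCR system is assumed to have read from the generated image, and it does not call PaddleOCR. The names `text_reward` and `group_relative_advantages` are hypothetical.

```python
import statistics
from difflib import SequenceMatcher
from typing import List, Sequence


def text_reward(target_text: str, ocr_text: str) -> float:
    """Stand-in reward: string similarity in [0, 1] between the requested
    text and the text an OCR system read back (OCR itself is not run here)."""
    return SequenceMatcher(None, target_text, ocr_text).ratio()


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical OCR readings from four images sampled for one prompt.
target = "OPEN 24 HOURS"
readings = ["OPEN 24 HOURS", "OPEN 24 HOUR", "OPEM 2A HOURS", ""]
rewards = [text_reward(target, r) for r in readings]
advantages = group_relative_advantages(rewards)
```

Samples whose rendered text reads back correctly receive positive advantages and the rest receive negative ones, which is the signal the policy update then uses.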

Everything fits within a maximum training budget of 128 GPUs.

Results

Image Generation. On GenEval, Lance scores 0.90 overall, matching TUNA for the top spot among unified models. Subcategory scores include counting (0.84), colors (0.97), and spatial position (0.87). On DPG-Bench, Lance scores 84.67 overall, with particularly strong relation modeling — though TUNA (86.76) and TUNA-2 (86.54) lead that benchmark. To put the parameter efficiency in perspective: Janus-Pro-7B scores 0.80 on GenEval; Show-o2 (7B) scores 0.76. Lance matches the top unified model score at 3B activated parameters.

Video Generation. On VBench, Lance achieves a Total Score of 85.11 (using LLM rewriting), the highest among unified models. The next-best unified model, TUNA, scores 84.06. Lance also outscores dedicated generation-only models including HunyuanVideo (83.43) and Wan2.1-T2V (83.69).

Image Editing. On GEdit-Bench, Lance scores 7.30 Avg/G_O, the highest among unified models. It leads in background change, material modification, motion change, portrait beautification, subject removal, subject replacement, and tone transfer. Text modification is flagged as a remaining weakness.

Video Understanding. On MVBench, Lance achieves a 62.0 overall score, the highest among unified models. Show-o2 (7B), the next-best unified model, scores 55.7. Lance also outperforms several understanding-only models with more parameters — notable given that it is simultaneously trained for generation and editing.

Key Takeaways

Lance is a native unified multimodal model with 3B activated parameters that handles image and video understanding, generation, and editing within a single jointly trained framework.
A dual-stream mixture-of-experts architecture with Modality-Aware Rotary Positional Encoding (MaPE) decouples understanding and generation pathways while keeping them in shared interleaved multimodal context.
Lance achieves 0.90 on GenEval and a VBench Total Score of 85.11, the highest among unified models, and was trained within a maximum budget of 128 GPUs.
On MVBench, Lance scores 62.0, the highest among unified models — outperforming Show-o2 (7B) at 55.7, while also supporting generation and editing.
Lance is open-source under Apache 2.0, with weights available on Hugging Face.
