Apple Introduces DiffuCoder: A 7B Diffusion LLM Tailored for Code Generation


Diffusion LLMs as a Paradigm Shift in Code Generation

LLMs have revolutionized natural language processing, delivering impressive results across tasks from dialogue to code generation. Masked diffusion models have emerged as an alternative and have been scaled up into diffusion-based LLMs such as LLaDA and Dream. These models iteratively refine the entire sequence in parallel, allowing global planning of content. This approach is a natural fit for code generation, since writing code often involves non-sequential, back-and-forth refinement. However, it remains unclear how well open-source diffusion LLMs perform on coding tasks, because existing post-training efforts either show only marginal gains or depend on semi-autoregressive decoding, which deviates from the global planning nature of diffusion.
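
To make the contrast with left-to-right decoding concrete, here is a minimal, self-contained sketch of iterative parallel denoising. The toy model below returns random logits as a stand-in; a real diffusion LLM would score all masked positions with a bidirectional transformer, and the decoder would unmask the most confident positions at each step, in no fixed left-to-right order.

```python
import numpy as np

MASK, VOCAB, SEQ_LEN, STEPS = -1, 32, 16, 4
rng = np.random.default_rng(0)

def model_logits(tokens):
    # Stand-in for a bidirectional transformer: logits for every position.
    return rng.normal(size=(len(tokens), VOCAB))

tokens = np.full(SEQ_LEN, MASK)
for step in range(STEPS):
    logits = model_logits(tokens)
    pred = logits.argmax(-1)                        # most likely token per position
    conf = logits.max(-1)                           # its (unnormalized) confidence
    masked = np.where(tokens == MASK)[0]
    k = int(np.ceil(len(masked) / (STEPS - step)))  # how many slots to unmask this step
    chosen = masked[np.argsort(-conf[masked])[:k]]  # most confident masked slots
    tokens[chosen] = pred[chosen]                   # filled in parallel, in any order
print(tokens)                                       # fully denoised toy sequence
```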

Evolution of Text Diffusion Models and Their Impact on Code Synthesis

Early text diffusion work centered on masked diffusion models, and recent scaling efforts have produced diffusion LLMs such as DiffuLLaMA, LLaDA, and Dream. Block diffusion proposes a hybrid approach that decodes blocks autoregressively while applying diffusion within each block. Multimodal models such as LaViDa, MMaDA, and Dimple combine text diffusion models with vision models. In code generation, CodeFusion was the first to combine diffusion models with code generation, but it is limited to small-scale models and simple tasks. Recent commercial-scale diffusion LLMs such as Mercury and Gemini Diffusion show performance comparable to leading autoregressive code models. However, current RL methods for dLLMs, such as d1 and MMaDA (both built on GRPO), depend on block diffusion decoding during rollout and evaluation.

Apple and HKU Introduce DiffuCoder: A Specialized Diffusion Model for Code

Researchers from Apple and the University of Hong Kong propose DiffuCoder, a 7B-scale masked diffusion model specialized for code generation and trained on 130B effective tokens, making it a valuable testbed for exploring diffusion-based LLM behaviors and advancing post-training methods. The researchers introduce local and global autoregressive-ness metrics to measure how closely generation follows a left-to-right pattern. Their analysis reveals that diffusion LLMs exhibit an entropy sink effect, which causes a strong causal bias during conditional generation. As the sampling temperature increases from 0.2 to 1.2, DiffuCoder becomes more flexible in its token generation order, freeing itself from strict left-to-right constraints and achieving higher pass@10 accuracy.
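
The exact autoregressive-ness definitions are given in the paper; the snippet below is only a simplified proxy for the same idea. Given the order in which positions were unmasked during decoding, it measures how often the model filled the leftmost remaining masked slot, so 1.0 means strictly left-to-right generation and lower values indicate a more flexible order.

```python
def leftmost_fill_ratio(decode_order):
    """decode_order: sequence positions in the order they were unmasked."""
    remaining = sorted(decode_order)      # positions still masked, ascending
    hits = 0
    for pos in decode_order:
        if pos == remaining[0]:           # filled the leftmost masked slot
            hits += 1
        remaining.remove(pos)
    return hits / len(decode_order)

print(leftmost_fill_ratio([0, 1, 2, 3]))  # 1.0 -> strictly left-to-right
print(leftmost_fill_ratio([3, 0, 2, 1]))  # 0.5 -> more order flexibility
```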

A Four-Stage Training Pipeline Leveraging RefineCode and Coupled-GRPO

The researchers adapt Qwen-2.5-Coder as the base model and perform continual pre-training on a 400B-token code corpus drawn from RefineCode and Stack v2. Training consists of four stages: adaptation pre-training, mid-training with 16B tokens of annealing code data, instruction tuning with 436K SFT samples, and post-training using coupled-GRPO with 21K hard samples from Acecoder-87K. Stage 1 is stopped early after processing 65B tokens, and Stage 2 runs for 4 epochs over the annealing data, for a total of 65B tokens. The evaluation environments are constructed from three code benchmarks—HumanEval, MBPP, and EvalPlus—along with BigCodeBench, which includes both full and hard subsets and covers completion and instruction-based query types.
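
The outline below restates that four-stage recipe as an illustrative configuration; the structure and numbers follow the description above, while the field names are our own shorthand rather than the authors' actual training scripts.

```python
# Illustrative summary of the four-stage DiffuCoder training recipe.
TRAINING_PIPELINE = [
    {"stage": 1, "name": "adaptation_pretraining",
     "data": "RefineCode + Stack v2 code corpus (400B-token pool)",
     "tokens_processed": "65B (early-stopped)"},
    {"stage": 2, "name": "mid_training",
     "data": "16B tokens of annealing code data", "epochs": 4,
     "tokens_processed": "~65B"},
    {"stage": 3, "name": "instruction_tuning",
     "data": "436K SFT samples"},
    {"stage": 4, "name": "post_training_coupled_grpo",
     "data": "21K hard samples from Acecoder-87K"},
]

for s in TRAINING_PIPELINE:
    print(f"Stage {s['stage']}: {s['name']} -- {s['data']}")
```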

Benchmark Results: DiffuCoder’s Performance and Optimization Insights

DiffuCoder, trained on 130B code tokens, achieves performance on par with Qwen2.5-Coder and OpenCoder. However, all dLLMs show only marginal improvements over their base models after instruction tuning, in contrast to Qwen2.5-Coder+SFT, which gains significantly from instruction tuning on the same data. Moreover, coupled-GRPO training proves highly effective, whereas baseline variants such as d1, full-mask completion, and decoupled sampling tend to exhibit unstable reward learning. RL fine-tuning also raises the optimal sampling temperature at evaluation from 0.2 to higher values, suggesting that training sharpens the per-token distribution; this reduces the model's reliance on strict autoregressive decoding and enhances its ability to generate tokens in parallel.
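
A small numerical illustration (with made-up logits, not DiffuCoder's) of why a sharper per-token distribution shifts the optimal temperature upward: a confident distribution keeps most of its probability mass on the top token even when the temperature is raised, while a flatter one loses it quickly.

```python
import numpy as np

def softmax(logits, temperature):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

flat_logits  = [2.0, 1.5, 1.2, 1.0]   # mass spread over several options
sharp_logits = [6.0, 1.5, 1.2, 1.0]   # one dominant option

for T in (0.2, 1.0):
    print(f"T={T}: flat top-1 p={softmax(flat_logits, T)[0]:.2f}, "
          f"sharp top-1 p={softmax(sharp_logits, T)[0]:.2f}")
```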

Coupled-GRPO and the Future of Diffusion-Based Code Models

In this paper, the researchers present DiffuCoder, a 7B-scale open-source diffusion model for code with strong performance, along with its complete training recipe and a detailed analysis of dLLMs for code generation. They further introduce coupled-GRPO, an RL algorithm that respects the non-autoregressive nature of dLLMs through a coupled-sampling technique for more accurate likelihood estimation. Coupled-GRPO improves DiffuCoder's performance, demonstrating the effectiveness of RL methods aligned with diffusion principles. This work gives the community deeper insight into dLLMs and establishes a solid foundation for future research into their applications in complex reasoning and generative tasks.
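
As a rough picture of the coupled-sampling idea described here, the sketch below pairs a random mask over the completion with its complement, so every token is masked, and therefore scored, in exactly one of the two forward passes; the 50% mask ratio and function names are illustrative assumptions rather than the paper's exact procedure.

```python
import random

def coupled_masks(num_tokens: int, seed: int = 0):
    # Draw a random mask and pair it with its complement over the completion.
    rng = random.Random(seed)
    positions = list(range(num_tokens))
    mask_a = set(rng.sample(positions, num_tokens // 2))
    mask_b = set(positions) - mask_a          # complement of mask_a
    return mask_a, mask_b

mask_a, mask_b = coupled_masks(10)
# Every position is covered exactly once across the pair.
assert mask_a | mask_b == set(range(10)) and not (mask_a & mask_b)
# In training, each token's log-likelihood comes from the pass in which it was
# masked; combining both passes yields a full-coverage sequence estimate.
print("pass 1 masks:", sorted(mask_a))
print("pass 2 masks:", sorted(mask_b))
```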


Check out the Paper and Code. All credit for this research goes to the researchers of this project.



Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.


