
Most AI models today are not designed for sustained, multi-step autonomous execution. Tasks like running hundreds of iterative code modifications, or chaining tool calls across hours without human intervention, require a different kind of model architecture and training focus.
Alibaba’s Qwen team formally announced Qwen3.7-Max at the 2026 Alibaba Cloud Summit on May 20. Two preview versions of the Qwen3.7 series had already appeared quietly on Arena AI’s leaderboard, with no press release and no official API announcement.
Two Preview Models Released Simultaneously
Alibaba previewed two models simultaneously: Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview. They ranked 13th globally in text capabilities and 16th in vision capabilities, respectively, according to LM Arena.
In Text Arena, Qwen3.7-Max-Preview ranked #13 overall, placing Alibaba as the #6 lab in text. In Vision Arena, Qwen3.7-Plus-Preview ranked #16 overall, placing Alibaba as the #5 lab in vision. The model rank and the lab rank are separate figures.
Qwen3.7-Plus-Preview is described as a high-performance balanced version preview, focusing on reasoning and logical expression, with its toolchain to be gradually opened in the future. It handles vision and multimodal inputs. Qwen3.7-Max is the text-only reasoning flagship. This article covers Qwen3.7-Max, as it is the model Alibaba formally announced with API access.
What is Qwen3.7-Max Designed For
Alibaba’s Qwen team described Qwen3.7-Max as its most advanced and comprehensive agent model to date. The model is proprietary and closed-weight. It is designed to handle coding and debugging, office workflow automation, and long-horizon tasks spanning hundreds or even thousands of steps.
Extended-Thinking Mode
Qwen3.7-Max is a reasoning model. The model generates a chain of thought first — an internal sequence of steps where it plans, checks its work, and corrects course before committing to a final answer. On interfaces like Qwen Chat, this shows up as a ‘Thinking’ mode you can switch on to see the model’s reasoning trace.
Reasoning models produce significantly more output tokens than standard completions. When Artificial Analysis ran its Intelligence Index evaluation, Qwen3.7-Max generated about 97 million tokens, compared to an average of 24 million for models on that benchmark. For short or simple tasks, this overhead adds latency without improving output quality. For multi-step planning, code refactoring, or long agent chains, extended-thinking mode is where the model’s strength applies.
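To put the token overhead in rough numbers, the sketch below computes the ratio between the two reported figures and a ballpark for the extra output cost. The price is an assumption: Qwen3.7-Max pricing has not been announced, so the Qwen3.6 Max Preview output price is reused purely as a placeholder, and the function names are illustrative.

```python
# Figures reported by Artificial Analysis, as quoted in this article.
QWEN37_MAX_TOKENS_M = 97.0  # million output tokens used on the Intelligence Index
AVERAGE_TOKENS_M = 24.0     # average for models on the same benchmark

# ASSUMPTION: Qwen3.7-Max pricing is unannounced. The Qwen3.6 Max Preview
# output price is used here only as a placeholder.
ASSUMED_OUTPUT_PRICE_PER_M = 7.80  # USD per million output tokens


def token_overhead_ratio(model_tokens_m: float, baseline_tokens_m: float) -> float:
    """How many times more output tokens the model used than the baseline."""
    return model_tokens_m / baseline_tokens_m


def output_cost_usd(tokens_m: float, price_per_m: float) -> float:
    """Output-token cost in USD for a token count given in millions."""
    return tokens_m * price_per_m


ratio = token_overhead_ratio(QWEN37_MAX_TOKENS_M, AVERAGE_TOKENS_M)
extra_cost = output_cost_usd(
    QWEN37_MAX_TOKENS_M - AVERAGE_TOKENS_M, ASSUMED_OUTPUT_PRICE_PER_M
)

print(f"Output-token ratio vs. average: {ratio:.1f}x")
print(f"Extra output cost at the assumed price: ${extra_cost:,.2f}")
```

Under these assumptions the model used about four times the average number of output tokens on the benchmark run. The dollar figure changes with whatever price Alibaba eventually sets.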
Context Window
The model features a 1M token context window, up from 256K on Qwen3.6 Max Preview. It supports text input and output only. Pricing has not yet been announced. Qwen3.6 Max Preview was priced at $1.30/$7.80 per million input/output tokens on Alibaba Cloud.
A million-token context window can hold a full mid-sized code repository or a large stack of documents in a single request. Models often reason less reliably as the context window fills. Independent long-context testing for Qwen3.7-Max is not yet available.
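A quick way to check whether a repository or document set plausibly fits in a 1M-token window is a character-based estimate. The sketch below is a rough heuristic, not the model's tokenizer: the 4-characters-per-token ratio, the file extensions, and the output reserve are all assumptions chosen for illustration.

```python
import os

CONTEXT_WINDOW_TOKENS = 1_000_000  # Qwen3.7-Max context window, per the article

# ASSUMPTION: roughly 4 characters per token is a common rule of thumb for
# English text and code. The real ratio depends on the model's tokenizer.
ASSUMED_CHARS_PER_TOKEN = 4.0


def estimate_tokens(text: str, chars_per_token: float = ASSUMED_CHARS_PER_TOKEN) -> int:
    """Rough token estimate derived from the character count."""
    return int(len(text) / chars_per_token)


def estimate_repo_tokens(root: str, extensions: tuple = (".py", ".md", ".txt")) -> int:
    """Sum the estimated tokens of all matching text files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total += estimate_tokens(handle.read())
    return total


def fits_in_context(token_count: int, reserve_for_output: int = 50_000) -> bool:
    """True if the input still leaves the reserved room for the model's output.

    ASSUMPTION: input and output share the same window, and 50,000 tokens
    is an arbitrary illustrative reserve.
    """
    return token_count + reserve_for_output <= CONTEXT_WINDOW_TOKENS
```

For example, `fits_in_context(estimate_repo_tokens("."))` gives a first-pass answer for the current directory. Fitting in the window says nothing about how reliably the model reasons over that much input, which is the open question until independent long-context tests appear.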
Benchmark Results
Qwen3.7-Max scored 56.6 on the Artificial Analysis Intelligence Index, placing it fifth overall. That represents a 4.8-point gain over its predecessor Qwen3.6 Max Preview (51.8), and puts it ahead of Google’s Gemini 3.5 Flash (55.3). GPT-5.5 (60.2), Claude Opus 4.7 (57.3), and Gemini 3.1 Pro Preview (57.2) still lead the overall rankings.
The Intelligence Index v4.0 aggregates ten evaluations, including GDPval-AA, Terminal-Bench Hard, SciCode, AA-Omniscience, Humanity’s Last Exam, and GPQA Diamond.

The improvement over Qwen3.6 Max Preview is not uniform. Most of the Index gains are concentrated in scientific reasoning, agentic capability, and coding. CritPt rose 9.7 percentage points (from 3.7% to 13.4%), Humanity’s Last Exam jumped 9.2 points (from 28.9% to 38.1%), and Terminal-Bench Hard climbed 6.9 points (from 43.9% to 50.8%). GDPval-AA added 42 Elo points (from 1504 to 1546). Scores on other benchmarks are largely flat compared to Qwen3.6 Max Preview.
One result on the Index requires careful reading. On AA-Omniscience, Qwen3.7-Max’s raw accuracy actually dropped 7.6 percentage points (from 37.7% to 30.1%), while its hallucination rate fell 21.3 points (from 44.2% to 22.9%). The model is choosing to say “I don’t know” more often rather than recalling more facts. Its attempt rate fell from 67.3% to 48.0%, the lowest among frontier models in the comparison. The AA-Omniscience benchmark rewards correct answers and penalizes hallucinations but has no penalty for refusing to answer. For use cases that depend on broad factual recall, this is a meaningful limitation to test against your workload.
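The trade-off can be illustrated with a toy scoring scheme. The sketch below assumes a simple +1 / -1 / 0 rule for correct, wrong, and abstained answers, and assumes every attempt that is not correct counts as wrong. Both are simplifications for illustration and are not Artificial Analysis's published formula; the reported hallucination rate is a separate metric that this sketch does not reproduce.

```python
from dataclasses import dataclass


@dataclass
class OmniscienceResult:
    accuracy_pct: float      # share of all questions answered correctly
    attempt_rate_pct: float  # share of all questions the model tried to answer

    @property
    def wrong_pct(self) -> float:
        # ASSUMPTION: every attempt that is not correct counts as wrong.
        return self.attempt_rate_pct - self.accuracy_pct

    @property
    def abstain_pct(self) -> float:
        return 100.0 - self.attempt_rate_pct

    def net_score(self) -> float:
        # ASSUMPTION: +1 per correct answer, -1 per wrong answer, 0 per abstention.
        return self.accuracy_pct - self.wrong_pct


# Accuracy and attempt rates as quoted in this article.
qwen36 = OmniscienceResult(accuracy_pct=37.7, attempt_rate_pct=67.3)
qwen37 = OmniscienceResult(accuracy_pct=30.1, attempt_rate_pct=48.0)

for name, result in (("Qwen3.6 Max Preview", qwen36), ("Qwen3.7-Max", qwen37)):
    print(
        f"{name}: wrong {result.wrong_pct:.1f}%, "
        f"abstained {result.abstain_pct:.1f}%, net {result.net_score():.1f}"
    )
```

Under this toy rule the net score rises from 8.1 to 12.2 even though accuracy falls, because wrong answers drop faster than correct ones. A workload that needs an answer every time would not see that benefit.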
In Text Arena, Qwen3.7-Max-Preview ranked #13 overall with an Elo score of 1,475. Category rankings include #7 in Math, #9 in Expert Prompts, #9 in Software and IT, and #10 in Coding.
All benchmark numbers are preliminary. The model carries a ‘Preview’ label, indicating Alibaba considers it an early build.
Agentic Performance — Internal Test
In an internal Alibaba test on a new chip platform, the model autonomously performed more than 1,000 tool calls and iterative code modifications to optimize a key kernel. Alibaba claimed the process improved inference speed by roughly 10x compared with the previous version.
Key Takeaways:
- Alibaba released two Qwen3.7 preview models: Max (text/reasoning) and Plus (multimodal).
- Qwen3.7-Max scored 56.6 on the Artificial Analysis Intelligence Index, ranking #5 overall — a 4.8-point gain over Qwen3.6 Max Preview.
- The 1M-token context window is roughly four times the 256K limit of Qwen3.6 Max Preview; text only, no image input.
- On AA-Omniscience, raw accuracy dropped while abstention rose — worth testing for knowledge-recall use cases.
- The model sustained 1,000+ tool calls and 35-hour autonomous execution in Alibaba’s internal testing only; no independent verification yet.





