Meet AntAngelMed: A 103B-Parameter Open-Source Medical Language Model Built on a 1/32 Activation-Ratio MoE Architecture


A team of researchers from China has released AntAngelMed, a large open-source medical language model that the team describes as the largest and most capable of its kind currently available.

What Is AntAngelMed?

AntAngelMed is a medical-domain language model with 103 billion total parameters, but it does not activate all of those parameters during inference. Instead, it uses a Mixture-of-Experts (MoE) architecture with a 1/32 activation ratio, meaning only 6.1 billion parameters are active at any given time when processing a query.

To understand the design, it helps to know how MoE architectures work. In a standard dense model, every parameter participates in processing every token. In an MoE model, the network is divided into many ‘expert’ sub-networks, and a routing mechanism selects only a small subset of them to handle each input. This allows a very large total parameter count — which typically correlates with strong knowledge capacity — while keeping the actual compute cost of inference proportional to the much smaller active parameter count.
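To make the routing idea concrete, here is a minimal, illustrative PyTorch sketch of a top-k MoE layer. It is not AntAngelMed's implementation; the dimensions, expert count, gate type, and the slow reference loop are all assumptions chosen for readability.

import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    # Illustrative only: routing 1 of 32 experts gives a 1/32 activation
    # ratio in the spirit of the article; real models use far more experts.
    def __init__(self, d_model=64, n_experts=32, top_k=1):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                                # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)          # gate probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)   # pick top-k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):    # slow reference loop
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(10, 64)     # 10 tokens
y = TinyMoELayer()(x)       # each token touches only 1 of 32 experts' parameters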

AntAngelMed inherits this design from Ling-flash-2.0, a base model developed by inclusionAI and guided by what the team calls Ling Scaling Laws. The specific optimizations layered on top include refined expert granularity, a tuned shared-expert ratio, attention balance mechanisms, sigmoid routing without an auxiliary loss, an MTP (Multi-Token Prediction) layer, QK-Norm, and Partial-RoPE (Rotary Position Embedding applied to a subset of attention heads rather than all of them). According to the research team, these design choices together allow small-activation MoE models to deliver up to 7× the efficiency of similarly sized dense architectures, which means that with only 6.1B activated parameters, AntAngelMed can match the performance of a roughly 40B dense model. Separately, as output length grows during inference, the relative speed advantage can also reach 7× or more over dense models of comparable size.
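Of these components, sigmoid routing is the easiest to illustrate. Instead of a softmax over experts, whose probabilities compete and are typically balanced with an added auxiliary loss, each expert receives an independent sigmoid affinity score, and the top-k scores are renormalized into mixing weights; published aux-loss-free designs then keep experts balanced by other means, such as per-expert bias terms adjusted during training. A minimal sketch, not the model's actual router (the expert counts below are assumed):

import torch

def sigmoid_route(router_logits: torch.Tensor, top_k: int = 8):
    # router_logits: (tokens, n_experts) raw scores from the router projection.
    scores = torch.sigmoid(router_logits)                  # independent per-expert affinities
    weights, expert_idx = scores.topk(top_k, dim=-1)       # choose top-k experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize mixing weights
    return weights, expert_idx

# e.g. 8 of 256 experts routed would give a 1/32 expert activation ratio
weights, idx = sigmoid_route(torch.randn(4, 256))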

Model page on ModelScope: https://modelscope.cn/models/MedAIBase/AntAngelMed

Training Pipeline

AntAngelMed uses a three-stage training process designed to layer general language understanding on top of deep medical domain adaptation.

The first stage is continual pre-training on large-scale medical corpora, including encyclopedias, web text, and academic publications. This phase is built on top of the Ling-flash-2.0 checkpoint, giving the model a strong general reasoning foundation before medical specialization begins.

The second stage is Supervised Fine-Tuning (SFT), where the model is trained on a multi-source instruction dataset. This dataset mixes general reasoning tasks — math, programming, logic — to preserve chain-of-thought capabilities, alongside medical scenarios such as doctor–patient Q&A, diagnostic reasoning, and safety and ethics cases.
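To picture what such a mixture looks like in practice, the records below sketch a mixed SFT corpus. The schema, field names, and samples are hypothetical illustrations, not AntAngelMed's actual data format.

# Hypothetical records illustrating a mixed general + medical SFT corpus.
sft_examples = [
    {
        # General reasoning sample, kept to preserve chain-of-thought skills.
        "domain": "math",
        "messages": [
            {"role": "user", "content": "If 3x + 5 = 20, what is x?"},
            {"role": "assistant", "content": "3x = 20 - 5 = 15, so x = 5."},
        ],
    },
    {
        # Medical scenario sample for clinical adaptation.
        "domain": "medical_qa",
        "messages": [
            {"role": "user", "content": "I've had a dry cough for two weeks. Should I be worried?"},
            {"role": "assistant", "content": "A two-week dry cough has many possible causes... (structured, safety-aware clinical answer)"},
        ],
    },
]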

The third stage is Reinforcement Learning using the GRPO (Group Relative Policy Optimization) algorithm, combined with task-specific reward models. GRPO, originally introduced in the DeepSeekMath paper, is a variant of PPO that estimates baselines from group scores rather than a separate critic model, making it computationally lighter. Here, reward signals are designed to shape model behavior toward empathy, structured clinical responses, safety boundaries, and evidence-based reasoning — all with the goal of reducing hallucinations on medical questions.
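The core of GRPO is easy to state: for each prompt, sample a group of candidate responses, score them with the reward model, and use the group's own mean and standard deviation as the baseline, removing the need for a learned critic. A minimal sketch of the group-relative advantage computation (the epsilon constant is an assumption for numerical safety):

import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    # rewards: (groups, samples_per_group) reward-model scores,
    # with samples_per_group > 1. Returns group-relative advantages
    # used in the PPO-style policy update.
    mean = rewards.mean(dim=1, keepdim=True)   # per-group baseline
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)      # normalize within each group

# Four sampled responses to one prompt, scored by the reward model:
print(grpo_advantages(torch.tensor([[0.2, 0.9, 0.5, 0.4]])))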

Inference Performance

On NVIDIA H20 hardware, AntAngelMed exceeds 200 tokens per second, which the research team reports is approximately 3× faster than a 36-billion-parameter dense model. With YaRN (Yet Another RoPE extensioN) extrapolation, it supports a 128K context length — long enough to handle full clinical documents, extended patient histories, or multi-turn medical dialogues.
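For reference, YaRN-style context extension is usually expressed through the rope_scaling field of a Hugging Face model config. The sketch below shows the general pattern only; the factor and original_max_position_embeddings values are placeholders, and a released checkpoint like this one normally ships the correct settings in its own config.

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("MedAIBase/AntAngelMed", trust_remote_code=True)
# Placeholder values: the real scaling factor and base length
# come from the released checkpoint's config.
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
model = AutoModelForCausalLM.from_pretrained(
    "MedAIBase/AntAngelMed", config=config,
    device_map="auto", trust_remote_code=True,
)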

The research team has also released an FP8 quantized version of the model. When this quantization is combined with EAGLE3 speculative decoding, inference throughput at a concurrency of 32 improves significantly over FP8 alone: by 71% on HumanEval, by 45% on GSM8K, and by 94% on Math-500. These benchmarks measure coding and math reasoning rather than medical tasks, but they serve as proxies for the model's throughput across different output types.
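As a rough illustration of how such a setup is launched with vLLM's offline API, the sketch below pairs an FP8 checkpoint with an EAGLE3 draft model. The repository name, draft-model path, and speculative token count are all hypothetical, and the exact speculative-config keys can vary across vLLM versions.

from vllm import LLM, SamplingParams

# Hypothetical setup: FP8 weights plus an EAGLE3 draft head for speculation.
llm = LLM(
    model="MedAIBase/AntAngelMed-FP8",        # placeholder FP8 repo name
    trust_remote_code=True,
    speculative_config={
        "method": "eagle3",
        "model": "path/to/eagle3-draft-head",  # hypothetical draft weights
        "num_speculative_tokens": 3,           # assumed value
    },
)
out = llm.generate(["What are common migraine triggers?"],
                   SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)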

Benchmark Results

On HealthBench, the open-source medical evaluation benchmark from OpenAI that uses simulated multi-turn medical dialogues to measure real-world clinical performance, AntAngelMed ranks first among all open-source models and surpasses a range of top proprietary models as well, with a particularly significant advantage on the HealthBench-Hard subset.

On MedAIBench, an evaluation system maintained by China’s National Artificial Intelligence Medical Industry Pilot Facility, AntAngelMed ranks at the top level, with particularly strong scores in medical knowledge Q&A and medical ethics and safety categories.

On MedBench, a benchmark for Chinese healthcare LLMs covering 36 independently curated datasets and approximately 700,000 samples across five dimensions — medical knowledge question answering, medical language understanding, medical language generation, complex medical reasoning, and safety and ethics — AntAngelMed ranks first overall.

Marktechpost’s Visual Explainer

Technical Guide: AntAngelMed


01 — Overview
What Is AntAngelMed?
Jointly developed by the Health Information Center of Zhejiang Province, Ant Healthcare, and Zhejiang Anzhen’er Medical AI Technology Co., Ltd.

103B Total Params
6.1B Active at Inference
128K Context Length

AntAngelMed is a medical-domain LLM built on a 1/32 activation-ratio MoE architecture. With 103B total parameters and only 6.1B active at inference time, it matches the performance of roughly 40B dense models at a fraction of the compute cost.

Model weights are released under Apache 2.0. The code repository is licensed under MIT.

02 — Architecture
MoE Architecture & Base Model
Built on Ling-flash-2.0 by inclusionAI, guided by Ling Scaling Laws.

AntAngelMed uses a 1/32 activation-ratio MoE with optimizations across all core components. These choices enable small-activation MoE models to deliver up to 7× efficiency over similarly sized dense architectures — and as output length grows, relative speedups can reach 7× or more.

Key architectural components (a QK-Norm sketch follows this list):

Expert Granularity
Shared Expert Ratio
Sigmoid Routing
No Auxiliary Loss
MTP Layer
QK-Norm
Partial-RoPE
YaRN Extrapolation
Attention Balance
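Of the components above, QK-Norm is the simplest to show: query and key vectors are normalized before the attention dot product, which keeps attention logits bounded at scale. Implementations differ (LayerNorm, RMSNorm, or plain L2 normalization, often with a learned temperature); the sketch below uses L2 normalization and a fixed scale for brevity, and is not the model's exact implementation.

import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, scale: float = 10.0):
    # q, k, v: (heads, tokens, head_dim). Normalizing q and k bounds the
    # attention logits; `scale` stands in for the learned temperature that
    # real implementations typically use (illustrative value).
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    attn = torch.softmax(scale * (q @ k.transpose(-2, -1)), dim=-1)
    return attn @ v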

03 — Training
Three-Stage Training Pipeline
Designed to layer general language understanding on top of deep medical domain adaptation.

Stage 01
Continual Pre-Training
Built on Ling-flash-2.0, trained on large-scale medical corpora — encyclopedias, web text, and academic publications — to inject deep domain and world knowledge.

Stage 02
Supervised Fine-Tuning (SFT)
Multi-source instruction data mixing general tasks (math, programming, logic) for chain-of-thought, plus medical scenarios (doctor–patient Q&A, diagnostic reasoning, safety/ethics) for clinical adaptation.

Stage 03
Reinforcement Learning via GRPO
Group Relative Policy Optimization with task-specific reward models. Shapes model behavior toward empathy, structural clarity, safety boundaries, and evidence-based reasoning to reduce hallucinations.

04 — Inference
Inference Performance
Hardware benchmarks on H20 and throughput improvements from FP8 + EAGLE3 optimization.

>200 tok/s
On H20 hardware. Approximately 3× faster than a comparable 36B dense model.

7× efficiency
MoE vs. dense at equivalent size. Speedup increases further as output length grows.

+71% / +45% / +94%
FP8 + EAGLE3 throughput gains over FP8 alone on HumanEval / GSM8K / Math-500 at concurrency 32.

128K context
Supported via YaRN extrapolation. Handles full clinical documents and extended multi-turn dialogues.

05 — Benchmarks
Benchmark Results
Evaluated across three authoritative medical LLM benchmarks.

• HealthBench (OpenAI): simulated multi-turn medical dialogues measuring real-world clinical performance. Result: #1 among open-source models, surpassing several proprietary models, with the largest lead on HealthBench-Hard.
• MedAIBench (National AI Medical Industry Pilot Facility): Chinese authority benchmark covering medical knowledge Q&A and medical ethics/safety. Result: top level, strongest in knowledge Q&A and ethics/safety.
• MedBench (Chinese healthcare domain): 36 datasets, ~700K samples across 5 clinical dimensions. Result: #1 overall across all 5 dimensions.

06 — Quickstart
Run with Hugging Face Transformers
Requires trust_remote_code=True for the MoE routing code.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model; trust_remote_code=True pulls in the custom MoE routing code.
model = AutoModelForCausalLM.from_pretrained(
    "MedAIBase/AntAngelMed",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("MedAIBase/AntAngelMed")

messages = [
    {"role": "system", "content": "You are AntAngelMed, a helpful medical assistant."},
    {"role": "user", "content": "What should I do if I have a headache?"},
]

# Render the chat template into a single prompt string.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt",
                   return_token_type_ids=False).to(model.device)

# Generate, then strip the prompt tokens so only the new completion remains.
out = model.generate(**inputs, max_new_tokens=16384)
out = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])

Also supports: vLLM v0.11.0 (4-GPU tensor parallel), SGLang with FlashAttention-3, and vLLM-Ascend for Huawei Ascend 910B NPUs.
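For reference, a minimal 4-GPU tensor-parallel run with vLLM's offline API might look like the following; everything beyond the flags named above is an assumption.

from vllm import LLM, SamplingParams

llm = LLM(
    model="MedAIBase/AntAngelMed",
    tensor_parallel_size=4,        # 4-GPU tensor parallel, per the article
    trust_remote_code=True,
)
out = llm.generate(["What should I do if I have a headache?"],
                   SamplingParams(max_tokens=512))
print(out[0].outputs[0].text)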

07 — Access
Resources & Links
Model weights Apache 2.0 — Code repository MIT — FP8 quantized variant available separately.

Developed by the Health Information Center of Zhejiang Province, Ant Healthcare, and Zhejiang Anzhen’er Medical AI Technology Co., Ltd.

Key Takeaways

  • AntAngelMed is a 103B-parameter open-source medical LLM that activates only 6.1B parameters at inference time using a 1/32 activation-ratio MoE architecture inherited from Ling-flash-2.0.
  • It uses a three-stage training pipeline: continual pre-training on medical corpora, SFT with mixed general and clinical instruction data, and GRPO-based reinforcement learning for safety and diagnostic reasoning.
  • On H20 hardware, the model exceeds 200 tokens/s and supports 128K context length via YaRN extrapolation — roughly 3× faster than a comparable 36B dense model.
  • AntAngelMed ranks first among open-source models on OpenAI’s HealthBench, surpasses several proprietary models, and tops both MedAIBench and MedBench leaderboards.
  • The model is available on Hugging Face, ModelScope, and GitHub; model weights are Apache 2.0, code is MIT, and an FP8 quantized version is also released.

Check out the model weights on Hugging Face, the GitHub repo, and the technical details.

