
In this tutorial, we walk through an advanced, practical implementation of the NVIDIA Transformer Engine in Python, focusing on how mixed-precision acceleration fits into a realistic deep learning workflow. We set up the environment, verify GPU and CUDA readiness, attempt to install the required Transformer Engine components, and handle compatibility issues gracefully so that the notebook remains runnable even when the full extension cannot be built. Along the way, we build teacher and student networks, compare a baseline PyTorch path with a Transformer Engine-enabled path, train both models, benchmark their speed and memory usage, and visualize the results, giving us a clear, hands-on understanding of how performance-oriented training workflows are structured in practice.
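Before diving into the full pipeline, the core idea of mixed-precision autocast can be shown in isolation. This is a minimal CPU-side sketch with illustrative sizes (the tutorial itself targets CUDA): parameters stay in FP32 while eligible ops inside the context run in a lower-precision dtype.

```python
import torch
import torch.nn as nn

# Autocast keeps master weights in FP32 but runs eligible ops
# (e.g. matmul in nn.Linear) in a lower-precision dtype inside the context.
model = nn.Linear(16, 8)
x = torch.randn(4, 16)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(model.weight.dtype)  # -> torch.float32 (master weights stay FP32)
print(y.dtype)             # -> torch.bfloat16 (activation inside autocast)
```

The same pattern with `device_type="cuda"` and an FP16/BF16 dtype is what the baseline path in this tutorial uses.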
import os
import sys
import json
import time
import math
import random
import shutil
import platform
import subprocess
import statistics
def run(cmd, check=True):
    print("\n[RUN]", " ".join(cmd))
    result = subprocess.run(cmd, text=True, capture_output=True)
    if result.stdout.strip():
        print(result.stdout[-4000:])
    if result.returncode != 0 and result.stderr.strip():
        print(result.stderr[-4000:])
    if check and result.returncode != 0:
        raise subprocess.CalledProcessError(result.returncode, cmd)
    return result

def has_cmd(name):
    return shutil.which(name) is not None
run([sys.executable, "-m", "pip", "install", "-q", "--upgrade", "pip"])
run([sys.executable, "-m", "pip", "install", "-q", "ninja", "packaging", "matplotlib"])
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
assert torch.cuda.is_available(), "This notebook needs a GPU runtime in Colab."
gpu_name = torch.cuda.get_device_name(0)
cc_major, cc_minor = torch.cuda.get_device_capability(0)
cuda_runtime = torch.version.cuda
python_version = sys.version.split()[0]
torch_version = torch.__version__
cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
nvcc_path = shutil.which("nvcc") or os.path.join(cuda_home, "bin", "nvcc")
cudnn_header_candidates = [
    os.path.join(cuda_home, "include", "cudnn.h"),
    "/usr/include/cudnn.h",
    "/usr/local/include/cudnn.h",
]
nvcc_exists = os.path.exists(nvcc_path)
cudnn_header_exists = any(os.path.exists(p) for p in cudnn_header_candidates)
print("=" * 120)
print("ENVIRONMENT CHECK")
print("=" * 120)
print(json.dumps({
    "python": python_version,
    "platform": platform.platform(),
    "torch": torch_version,
    "torch_cuda": cuda_runtime,
    "gpu_name": gpu_name,
    "compute_capability": f"{cc_major}.{cc_minor}",
    "cuda_home": cuda_home,
    "nvcc_exists": nvcc_exists,
    "nvcc_path": nvcc_path if nvcc_exists else None,
    "cudnn_header_exists": cudnn_header_exists,
}, indent=2))
print("=" * 120)

We prepare the Colab environment by importing the required Python libraries, defining a helper function for executing shell commands, and installing the core dependencies for the tutorial. We then import PyTorch and Matplotlib, verify that a GPU is available, and collect key environment details, including the GPU name, CUDA version, Python version, and toolkit paths. This gives us a clear view of the system state before we attempt any Transformer Engine installation or model execution.
te_available = False
te_mode = "fallback"
te_import_error = None
try:
    run([sys.executable, "-m", "pip", "install", "-q", "transformer_engine[core_cu12]"])
except Exception as e:
    print("Core wheel install failed:", repr(e))
can_try_te_torch = nvcc_exists and cudnn_header_exists
if can_try_te_torch:
    env = os.environ.copy()
    env["NVTE_FRAMEWORK"] = "pytorch"
    env["MAX_JOBS"] = "1"
    env["NVTE_BUILD_THREADS_PER_JOB"] = "1"
    env["CUDA_PATH"] = cuda_home
    env["CUDA_HOME"] = cuda_home
    try:
        print("\nAttempting to build the PyTorch extension for Transformer Engine...")
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-q", "--no-build-isolation", "transformer_engine[pytorch]"],
            text=True,
            capture_output=True,
            env=env,
        )
        if result.stdout.strip():
            print(result.stdout[-4000:])
        if result.returncode != 0 and result.stderr.strip():
            print(result.stderr[-4000:])
        if result.returncode == 0:
            import transformer_engine.pytorch as te
            from transformer_engine.common import recipe
            te_available = True
            te_mode = "transformer_engine"
        else:
            te_import_error = result.stderr[-4000:] if result.stderr else "Unknown pip build error"
    except Exception as e:
        te_import_error = repr(e)
else:
    te_import_error = "Missing nvcc or cuDNN headers in this Colab runtime, so the TE PyTorch extension cannot be built here."
if te_available:
    try:
        # TE exposes capability checks under te.fp8; this returns (supported, reason).
        fp8_available, fp8_reason = te.fp8.check_fp8_support()
    except Exception as e:
        fp8_available, fp8_reason = False, f"Could not query FP8 availability: {e}"
else:
    fp8_available = False
    fp8_reason = "Transformer Engine not installed; using fallback PyTorch path."
# BF16 support is a hardware property, so PyTorch can report it in either case.
bf16_available = torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if bf16_available else torch.float16
print("\n" + "=" * 120)
print("INSTALL STATUS")
print("=" * 120)
print(json.dumps({
    "te_available": te_available,
    "te_mode": te_mode,
    "fp8_available": fp8_available,
    "fp8_reason": fp8_reason,
    "te_import_error": te_import_error,
    "amp_dtype": str(amp_dtype),
}, indent=2))
print("=" * 120)
device = "cuda"
random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
if te_available:
    fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

def baseline_autocast():
    return torch.autocast(device_type="cuda", dtype=amp_dtype)

def te_forward_context(use_fp8):
    if te_available and use_fp8:
        # Transformer Engine's FP8 context manager is fp8_autocast.
        return te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe)
    return baseline_autocast()

We attempt to install the Transformer Engine core package and then check whether the Colab runtime can build the PyTorch extension by verifying the presence of nvcc and cuDNN headers. If the environment supports it, we try to install the Transformer Engine PyTorch backend and then inspect whether FP8 and BF16 are available on the current hardware. We also configure the precision mode and define the autocast contexts that later allow us to switch between standard mixed precision and Transformer Engine execution.
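The graceful-fallback logic above can be distilled into a small, framework-agnostic pattern: pick the richer context manager when the optional dependency is present, otherwise degrade to a no-op. This is a hypothetical sketch (`precision_context` and the `te_module` parameter are illustrative, not part of the tutorial's code):

```python
from contextlib import nullcontext

def precision_context(te_module=None, use_fp8=False):
    # Mirrors te_forward_context above: use TE's FP8 autocast when the
    # module is importable and FP8 is requested, else a no-op context.
    if te_module is not None and use_fp8:
        return te_module.fp8_autocast(enabled=True)
    return nullcontext()

# On a machine without Transformer Engine this degrades to a no-op:
with precision_context():
    result = 2 + 2
print(result)  # -> 4
```

Because the call sites only ever see a context manager, none of the downstream training or benchmarking code needs to branch on whether Transformer Engine was actually installed.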
class TeacherNet(nn.Module):
    def __init__(self, hidden_size=512, intermediate_size=2048, num_layers=3, vocab_size=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.LayerNorm(hidden_size),
                nn.Linear(hidden_size, intermediate_size),
                nn.GELU(),
                nn.Linear(intermediate_size, hidden_size),
            ) for _ in range(num_layers)
        ])
        self.head = nn.Linear(hidden_size, hidden_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for layer in self.layers:
            x = x + layer(x)
        return self.head(x)
class BaselineStudent(nn.Module):
    def __init__(self, hidden_size=512, intermediate_size=2048, num_layers=3, vocab_size=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.norms = nn.ModuleList([nn.LayerNorm(hidden_size) for _ in range(num_layers)])
        self.fc1 = nn.ModuleList([nn.Linear(hidden_size, intermediate_size) for _ in range(num_layers)])
        self.fc2 = nn.ModuleList([nn.Linear(intermediate_size, hidden_size) for _ in range(num_layers)])
        self.head = nn.Linear(hidden_size, hidden_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for ln, fc1, fc2 in zip(self.norms, self.fc1, self.fc2):
            residual = x
            x = ln(x)
            x = fc1(x)
            x = F.gelu(x, approximate="tanh")
            x = fc2(x)
            x = x + residual
        return self.head(x)
if te_available:
    class TEStudent(nn.Module):
        def __init__(self, hidden_size=512, intermediate_size=2048, num_layers=3, vocab_size=4096):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            self.norms = nn.ModuleList([te.LayerNorm(hidden_size) for _ in range(num_layers)])
            self.fc1 = nn.ModuleList([te.Linear(hidden_size, intermediate_size, bias=True) for _ in range(num_layers)])
            self.fc2 = nn.ModuleList([te.Linear(intermediate_size, hidden_size, bias=True) for _ in range(num_layers)])
            self.head = te.Linear(hidden_size, hidden_size, bias=True)

        def forward(self, token_ids, use_fp8=False):
            x = self.embed(token_ids)
            with te_forward_context(use_fp8):
                for ln, fc1, fc2 in zip(self.norms, self.fc1, self.fc2):
                    residual = x
                    x = ln(x)
                    x = fc1(x)
                    x = F.gelu(x, approximate="tanh")
                    x = fc2(x)
                    x = x + residual
                x = self.head(x)
            return x
else:
    class TEStudent(nn.Module):
        def __init__(self, hidden_size=512, intermediate_size=2048, num_layers=3, vocab_size=4096):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            self.norms = nn.ModuleList([nn.LayerNorm(hidden_size) for _ in range(num_layers)])
            self.fc1 = nn.ModuleList([nn.Linear(hidden_size, intermediate_size) for _ in range(num_layers)])
            self.fc2 = nn.ModuleList([nn.Linear(intermediate_size, hidden_size) for _ in range(num_layers)])
            self.head = nn.Linear(hidden_size, hidden_size)

        def forward(self, token_ids, use_fp8=False):
            x = self.embed(token_ids)
            with baseline_autocast():
                for ln, fc1, fc2 in zip(self.norms, self.fc1, self.fc2):
                    residual = x
                    x = ln(x)
                    x = fc1(x)
                    x = F.gelu(x, approximate="tanh")
                    x = fc2(x)
                    x = x + residual
                x = self.head(x)
            return x
def count_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def format_millions(n):
    return f"{n / 1e6:.2f}M"

We define the neural network architectures used throughout the tutorial, including the teacher model, the baseline student model, and the Transformer Engine student path. We keep the model structures aligned so that the comparison remains meaningful while allowing the TE path to swap in Transformer Engine layers when the extension is available. We also define small utility functions for counting parameters and formatting model size, which help us inspect the scale of the models before training begins.
hidden_size = 512
intermediate_size = 2048
num_layers = 3
vocab_size = 4096
seq_len = 128
batch_size = 8
steps = 25
benchmark_iters = 20
lr = 2e-4
weight_decay = 1e-2
teacher = TeacherNet(hidden_size, intermediate_size, num_layers, vocab_size).to(device).eval()
baseline_model = BaselineStudent(hidden_size, intermediate_size, num_layers, vocab_size).to(device)
te_model = TEStudent(hidden_size, intermediate_size, num_layers, vocab_size).to(device)
optimizer_baseline = torch.optim.AdamW(baseline_model.parameters(), lr=lr, weight_decay=weight_decay)
optimizer_te = torch.optim.AdamW(te_model.parameters(), lr=lr, weight_decay=weight_decay)
print("Teacher params :", format_millions(count_params(teacher)))
print("Baseline params:", format_millions(count_params(baseline_model)))
print("TE-path params :", format_millions(count_params(te_model)))
def make_batch(batch_size, seq_len, vocab_size, device):
    tokens = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
    with torch.no_grad():
        target = teacher(tokens)
    return tokens, target

def peak_mem_mb():
    return torch.cuda.max_memory_allocated() / (1024 ** 2)

def train_baseline_step():
    baseline_model.train()
    optimizer_baseline.zero_grad(set_to_none=True)
    tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
    with baseline_autocast():
        pred = baseline_model(tokens)
        loss = F.mse_loss(pred, target)
    loss.backward()
    optimizer_baseline.step()
    return float(loss.detach().item())

def train_te_step(use_fp8):
    te_model.train()
    optimizer_te.zero_grad(set_to_none=True)
    tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
    pred = te_model(tokens, use_fp8=use_fp8)
    loss = F.mse_loss(pred, target)
    loss.backward()
    optimizer_te.step()
    return float(loss.detach().item())

We set the main experiment hyperparameters, instantiate all models on the GPU, and create the optimizers that will be used during training. We also print the parameter counts to confirm that the baseline and TE paths are comparable in terms of model size. In addition, we define the batch-generation logic, memory tracking function, and the individual training-step functions that execute one optimization step for each model path.
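The distillation-style training step above boils down to a few lines that run anywhere. Here is a tiny CPU sketch with illustrative sizes (not the tutorial's models): the frozen teacher's output serves as a regression target for the student, and one AdamW step minimizes the MSE between them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative teacher/student pair; sizes are arbitrary for the sketch.
teacher = nn.Linear(8, 8).eval()
student = nn.Linear(8, 8)
opt = torch.optim.AdamW(student.parameters(), lr=1e-3)

x = torch.randn(4, 8)
with torch.no_grad():
    target = teacher(x)            # teacher output is the regression target

loss = F.mse_loss(student(x), target)
opt.zero_grad(set_to_none=True)
loss.backward()
opt.step()
print(loss.item() >= 0.0)  # -> True
```

The tutorial's `train_baseline_step` and `train_te_step` wrap exactly this loop, differing only in which model runs and which precision context surrounds the forward pass.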
baseline_losses = []
te_losses = []
mode_name = "TE-FP8" if (te_available and fp8_available) else ("TE-BF16/FP16" if te_available else "Fallback-PyTorch")
print("\n" + "=" * 120)
print("TRAINING")
print("=" * 120)
for step in range(1, steps + 1):
    b_loss = train_baseline_step()
    t_loss = train_te_step(use_fp8=fp8_available)
    baseline_losses.append(b_loss)
    te_losses.append(t_loss)
    if step == 1 or step % 5 == 0 or step == steps:
        print(f"step={step:02d} | baseline_loss={b_loss:.6f} | te_path_loss={t_loss:.6f} | mode={mode_name}")
@torch.no_grad()
def evaluate_model(model, is_te=False, use_fp8=False, eval_batches=8):
    model.eval()
    vals = []
    for _ in range(eval_batches):
        tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
        if is_te:
            pred = model(tokens, use_fp8=use_fp8)
        else:
            with baseline_autocast():
                pred = model(tokens)
        vals.append(float(F.mse_loss(pred, target).item()))
    return sum(vals) / len(vals)

baseline_eval = evaluate_model(baseline_model, is_te=False)
te_eval = evaluate_model(te_model, is_te=True, use_fp8=fp8_available)
def benchmark_train_step(model, optimizer, is_te=False, use_fp8=False, warmup=5, iters=20):
    times_ms = []
    mems_mb = []
    for _ in range(warmup):
        optimizer.zero_grad(set_to_none=True)
        tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
        if is_te:
            pred = model(tokens, use_fp8=use_fp8)
        else:
            with baseline_autocast():
                pred = model(tokens)
        loss = F.mse_loss(pred, target)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    for _ in range(iters):
        torch.cuda.reset_peak_memory_stats()
        optimizer.zero_grad(set_to_none=True)
        tokens, target = make_batch(batch_size, seq_len, vocab_size, device)
        start = time.perf_counter()
        if is_te:
            pred = model(tokens, use_fp8=use_fp8)
        else:
            with baseline_autocast():
                pred = model(tokens)
        loss = F.mse_loss(pred, target)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()
        end = time.perf_counter()
        times_ms.append((end - start) * 1000.0)
        mems_mb.append(peak_mem_mb())
    return {
        "mean_ms": statistics.mean(times_ms),
        "median_ms": statistics.median(times_ms),
        "max_memory_mb": max(mems_mb),
    }

baseline_bench = benchmark_train_step(baseline_model, optimizer_baseline, is_te=False, use_fp8=False, iters=benchmark_iters)
te_bench = benchmark_train_step(te_model, optimizer_te, is_te=True, use_fp8=fp8_available, iters=benchmark_iters)

We run the main training loop for both the baseline model and the TE path, tracking their losses over multiple steps. We then define and execute the evaluation function to measure how well each model matches the teacher’s outputs after training. Finally, we implement the benchmarking routine to measure per-step runtime and peak CUDA memory usage, enabling quantitative comparison of performance characteristics.
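The benchmarking routine follows a standard warmup-then-measure pattern that is worth isolating. Here is a device-agnostic sketch (the `bench` helper is illustrative; on CUDA you would additionally call `torch.cuda.synchronize()` before reading the clock, as the tutorial's version does, because kernel launches are asynchronous):

```python
import time
import statistics

def bench(fn, warmup=3, iters=10):
    # Warmup iterations absorb one-time costs (JIT, cache fills, allocator
    # growth) so they don't pollute the measured statistics.
    for _ in range(warmup):
        fn()
    times_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return {"mean_ms": statistics.mean(times_ms),
            "median_ms": statistics.median(times_ms)}

stats = bench(lambda: sum(i * i for i in range(10_000)))
print(sorted(stats))  # -> ['mean_ms', 'median_ms']
```

Reporting the median alongside the mean, as both this sketch and the tutorial's `benchmark_train_step` do, guards against occasional outlier iterations skewing the comparison.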
summary = {
    "gpu_name": gpu_name,
    "compute_capability": f"{cc_major}.{cc_minor}",
    "te_available": te_available,
    "fp8_available": fp8_available,
    "fp8_reason": fp8_reason,
    "mode": mode_name,
    "baseline_eval_mse": baseline_eval,
    "te_path_eval_mse": te_eval,
    "baseline_mean_step_ms": baseline_bench["mean_ms"],
    "te_path_mean_step_ms": te_bench["mean_ms"],
    "baseline_peak_mem_mb": baseline_bench["max_memory_mb"],
    "te_path_peak_mem_mb": te_bench["max_memory_mb"],
}
print("\n" + "=" * 120)
print("SUMMARY")
print("=" * 120)
print(json.dumps(summary, indent=2))
plt.figure(figsize=(10, 5))
plt.plot(baseline_losses, label="Baseline loss")
plt.plot(te_losses, label=f"{mode_name} loss")
plt.xlabel("Training step")
plt.ylabel("MSE loss")
plt.title("Training Loss Comparison")
plt.legend()
plt.grid(True)
plt.show()
plt.figure(figsize=(8, 5))
plt.bar(["Baseline", mode_name], [baseline_bench["mean_ms"], te_bench["mean_ms"]])
plt.ylabel("Mean train step time (ms)")
plt.title("Speed Comparison")
plt.grid(True, axis="y")
plt.show()
plt.figure(figsize=(8, 5))
plt.bar(["Baseline", mode_name], [baseline_bench["max_memory_mb"], te_bench["max_memory_mb"]])
plt.ylabel("Peak memory (MB)")
plt.title("Peak CUDA Memory Comparison")
plt.grid(True, axis="y")
plt.show()

We gather all final metrics into a summary dictionary and print the experiment’s consolidated results in a structured format. We then generate visualizations of training loss, mean training-step time, and peak memory usage to more intuitively interpret the differences between the baseline and TE paths. This final section helps us move from raw numbers to practical insights by showing how the two implementations behave across accuracy, speed, and memory.
In conclusion, we built far more than a simple installation walkthrough; we created a complete experimental pipeline that helps us understand how the NVIDIA Transformer Engine fits into modern GPU-accelerated model training. We tested the runtime environment, adapted to Colab limitations, preserved a working fallback path, and then trained, evaluated, and benchmarked two implementations side by side to observe practical differences in efficiency, precision behavior, and resource usage. At the end, we understood how to use the Transformer Engine in a Colab-friendly setting and gained a reusable foundation that we can extend to larger transformer architectures, richer benchmarking scenarios, and more production-oriented optimization workflows.