Building a Context-Folding LLM Agent for Long-Horizon Reasoning with Memory Compression and Tool Use


In this tutorial, we explore how to build a Context-Folding LLM Agent that efficiently solves long, complex tasks by intelligently managing limited context. We design the agent to break down a large task into smaller subtasks, perform reasoning or calculations when needed, and then fold each completed sub-trajectory into concise summaries. By doing this, we preserve essential knowledge while keeping the active memory small. Check out the FULL CODES here.

import os, re, sys, math, random, json, textwrap, subprocess, shutil, time
from typing import List, Dict, Tuple
try:
   import transformers
except:
   subprocess.run([sys.executable, "-m", "pip", "install", "-q", "transformers", "accelerate", "sentencepiece"], check=True)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
MODEL_NAME = os.environ.get("CF_MODEL", "google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
llm = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device_map="auto")
def llm_gen(prompt: str, max_new_tokens=160, temperature=0.0) -> str:
   out = llm(prompt, max_new_tokens=max_new_tokens, do_sample=temperature>0.0, temperature=temperature)[0]["generated_text"]
   return out.strip()

We begin by setting up our environment and loading a lightweight Hugging Face model. We use this model to generate and process text locally, ensuring the agent runs smoothly on Google Colab without any API dependencies. Check out the FULL CODES here.

import ast, operator as op
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg, ast.FloorDiv: op.floordiv, ast.Mod: op.mod}
def _eval_node(n):
   if isinstance(n, ast.Num): return n.n
   if isinstance(n, ast.UnaryOp) and type(n.op) in OPS: return OPS[type(n.op)](_eval_node(n.operand))
   if isinstance(n, ast.BinOp) and type(n.op) in OPS: return OPS[type(n.op)](_eval_node(n.left), _eval_node(n.right))
   raise ValueError("Unsafe expression")
def calc(expr: str):
   node = ast.parse(expr, mode="eval").body
   return _eval_node(node)
class FoldingMemory:
   def __init__(self, max_chars:int=800):
       self.active=[]; self.folds=[]; self.max_chars=max_chars
   def add(self,text:str):
       self.active.append(text.strip())
       while len(self.active_text())>self.max_chars and len(self.active)>1:
           popped=self.active.pop(0)
           fold=f"- Folded: {popped[:120]}..."
           self.folds.append(fold)
   def fold_in(self,summary:str): self.folds.append(summary.strip())
   def active_text(self)->str: return "\n".join(self.active)
   def folded_text(self)->str: return "\n".join(self.folds)
   def snapshot(self)->Dict: return {"active_chars":len(self.active_text()),"n_folds":len(self.folds)}

We define a simple calculator tool for basic arithmetic and create a memory system that dynamically folds past context into concise summaries. This helps us maintain a manageable active memory while retaining essential information. Check out the FULL CODES here.

SUBTASK_DECOMP_PROMPT="""You are an expert planner. Decompose the task below into 2-4 crisp subtasks.
Return each subtask as a bullet starting with '- ' in priority order.
Task: "{task}" """
SUBTASK_SOLVER_PROMPT="""You are a precise problem solver with minimal steps.
If a calculation is needed, write one line 'CALC(expr)'.
Otherwise write 'ANSWER: '.
Think briefly; avoid chit-chat.


Task: {task}
Subtask: {subtask}
Notes (folded context):
{notes}


Now respond with either CALC(...) or ANSWER: ..."""
SUBTASK_SUMMARY_PROMPT="""Summarize the subtask outcome in <=3 bullets, total <=50 tokens.
Subtask: {name}
Steps:
{trace}
Final: {final}
Return only bullets starting with '- '."""
FINAL_SYNTH_PROMPT="""You are a senior agent. Synthesize a final, coherent solution using ONLY:
- The original task
- Folded summaries (below)
Avoid repeating steps. Be concise and actionable.


Task: {task}
Folded summaries:
{folds}


Final answer:"""
def parse_bullets(text:str)->List[str]:
   return [ln[2:].strip() for ln in text.splitlines() if ln.strip().startswith("- ")]

We design prompt templates that guide the agent in decomposing tasks, solving subtasks, and summarizing outcomes. These structured prompts enable clear communication between reasoning steps and the model’s responses. Check out the FULL CODES here.

def run_subtask(task:str, subtask:str, memory:FoldingMemory, max_tool_iters:int=3)->Tuple[str,str,List[str]]:
   notes=(memory.folded_text() or "(none)")
   trace=[]; final=""
   for _ in range(max_tool_iters):
       prompt=SUBTASK_SOLVER_PROMPT.format(task=task,subtask=subtask,notes=notes)
       out=llm_gen(prompt,max_new_tokens=96); trace.append(out)
       m=re.search(r"CALC\((.+?)\)",out)
       if m:
           try:
               val=calc(m.group(1))
               trace.append(f"TOOL:CALC -> {val}")
               out2=llm_gen(prompt+f"\nTool result: {val}\nNow produce 'ANSWER: ...' only.",max_new_tokens=64)
               trace.append(out2)
               if out2.strip().startswith("ANSWER:"):
                   final=out2.split("ANSWER:",1)[1].strip(); break
           except Exception as e:
               trace.append(f"TOOL:CALC ERROR -> {e}")
       if out.strip().startswith("ANSWER:"):
           final=out.split("ANSWER:",1)[1].strip(); break
   if not final:
       final="No definitive answer; partial reasoning:\n"+"\n".join(trace[-2:])
   summ=llm_gen(SUBTASK_SUMMARY_PROMPT.format(name=subtask,trace="\n".join(trace),final=final),max_new_tokens=80)
   summary_bullets="\n".join(parse_bullets(summ)[:3]) or f"- {subtask}: {final[:60]}..."
   return final, summary_bullets, trace
class ContextFoldingAgent:
   def __init__(self,max_active_chars:int=800):
       self.memory=FoldingMemory(max_chars=max_active_chars)
       self.metrics={"subtasks":0,"tool_calls":0,"chars_saved_est":0}
   def decompose(self,task:str)->List[str]:
       plan=llm_gen(SUBTASK_DECOMP_PROMPT.format(task=task),max_new_tokens=96)
       subs=parse_bullets(plan)
       return subs[:4] if subs else ["Main solution"]
   def run(self,task:str)->Dict:
       t0=time.time()
       self.memory.add(f"TASK: {task}")
       subtasks=self.decompose(task)
       self.metrics["subtasks"]=len(subtasks)
       folded=[]
       for st in subtasks:
           self.memory.add(f"SUBTASK: {st}")
           final,fold_summary,trace=run_subtask(task,st,self.memory)
           self.memory.fold_in(fold_summary)
           folded.append(f"- {st}: {final}")
           self.memory.add(f"SUBTASK_DONE: {st}")
       final=llm_gen(FINAL_SYNTH_PROMPT.format(task=task,folds=self.memory.folded_text()),max_new_tokens=200)
       t1=time.time()
       return {"task":task,"final":final.strip(),"folded_summaries":self.memory.folded_text(),
               "active_context_chars":len(self.memory.active_text()),
               "subtask_finals":folded,"runtime_sec":round(t1-t0,2)}

We implement the agent’s core logic, in which each subtask is executed, summarized, and folded back into memory. This step demonstrates how context folding enables the agent to reason iteratively without losing track of prior reasoning. Check out the FULL CODES here.

DEMO_TASKS=[
   "Plan a 3-day study schedule for ML with daily workouts and simple meals; include time blocks.",
   "Compute a small project budget with 3 items (laptop 799.99, course 149.5, snacks 23.75), add 8% tax and 5% buffer, and present a one-paragraph recommendation."
]
def pretty(d): return json.dumps(d, indent=2, ensure_ascii=False)
if __name__=="__main__":
   agent=ContextFoldingAgent(max_active_chars=700)
   for i,task in enumerate(DEMO_TASKS,1):
       print("="*70)
       print(f"DEMO #{i}: {task}")
       res=agent.run(task)
       print("\n--- Folded Summaries ---\n"+(res["folded_summaries"] or "(none)"))
       print("\n--- Final Answer ---\n"+res["final"])
       print("\n--- Diagnostics ---")
       diag={k:res[k] for k in ["active_context_chars","runtime_sec"]}
       diag["n_subtasks"]=len(agent.decompose(task))
       print(pretty(diag))

We run the agent on sample tasks to observe how it plans, executes, and synthesizes final results. Through these examples, we see the complete context-folding process in action, producing concise and coherent outputs.

In conclusion, we demonstrate how context folding enables long-horizon reasoning while avoiding memory overload. We see how each subtask is planned, executed, summarized, and distilled into compact knowledge, mimicking how an intelligent agent would handle complex workflows over time. By combining decomposition, tool use, and context compression, we create a lightweight yet powerful agentic system that scales reasoning efficiently.


Check out the FULL CODES here and Paper . Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.


Asif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.



Source link

  • Related Posts

    QeRL: NVFP4-Quantized Reinforcement Learning (RL) Brings 32B LLM Training to a Single H100—While Improving Exploration

    What would you build if you could run Reinforcement Learning (RL) post-training on a 32B LLM in 4-bit NVFP4—on a single H100—with BF16-level accuracy and 1.2–1.5× step speedups? NVIDIA researchers…

    Anthropic Launches Claude Haiku 4.5: Small AI Model that Delivers Sonnet-4-Level Coding Performance at One-Third the Cost and more than Twice the Speed

    Anthropic released Claude Haiku 4.5, a latency-optimized “small” model that delivers similar levels of coding performance to Claude Sonnet 4 while running more than twice as fast at one-third the…

    Leave a Reply

    Your email address will not be published. Required fields are marked *