ACON: Optimizing Context Compression for Long-horizon LLM Agents

Authors: Minki Kang, Wei-Ning Chen, Dongge Han et al.

arXiv 2025

TL;DR

ACON uses failure-driven compression guideline optimization plus compressor distillation to cut peak tokens by up to 54% while preserving long-horizon agent accuracy.

THE PROBLEM

Long-horizon agents hit a memory wall as context grows unbounded

LLM agents must accumulate long histories of actions and observations, and ACON reports 26–54% peak token savings from compressing that context, keeping costs manageable.

In AppWorld, OfficeBench, and Multi-objective QA, uncompressed histories dilute relevant information, distract decision making, and make long-horizon task success brittle.

HOW IT WORKS

Agent Context Optimization (ACON) — failure-driven compression guidelines

ACON orchestrates History Compression, Observation Compression, Compression Guideline Optimization, and Compressor Distillation around a fixed LLM agent to minimize context cost without retraining.

You can think of ACON as a smart cache between RAM and disk: it rewrites sprawling trajectories into compact, state-preserving summaries before the agent reads them.

ACON's failure-driven guideline optimization lets it keep the causal state, variables, and decision cues that a plain context window or naive summarization would silently drop.
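
To make the orchestration concrete, here is a minimal Python sketch of compression middleware wrapped around a fixed agent loop. Everything in it is an illustrative assumption (the `call_agent`/`call_compressor` callables, the token thresholds, and the prompt wording), not ACON's actual API:

```python
# Illustrative sketch of ACON-style compression around a fixed agent loop.
# All names, thresholds, and prompts are assumptions, not the paper's API.

HISTORY_BUDGET = 8_000       # assumed token threshold for history compression
OBSERVATION_BUDGET = 1_000   # assumed per-observation threshold


def num_tokens(text: str) -> int:
    # Crude stand-in; a real system would use the model's tokenizer.
    return len(text.split())


def compress(text: str, guideline: str, call_compressor) -> str:
    # The guideline is a natural-language prompt that ACON later optimizes.
    prompt = (f"{guideline}\n\nRewrite the context below compactly, keeping "
              f"task state, variables, and decision-critical details:\n\n{text}")
    return call_compressor(prompt)


def run_episode(task, call_agent, execute, call_compressor, guidelines):
    history = []
    while True:
        action = call_agent(task, history)
        if action == "DONE":
            return history
        observation = execute(action)
        # Observation compression: shrink long tool outputs before storing them.
        if num_tokens(observation) > OBSERVATION_BUDGET:
            observation = compress(observation, guidelines["observation"],
                                   call_compressor)
        history.append((action, observation))
        # History compression: rewrite the trajectory once it exceeds the budget.
        if num_tokens(str(history)) > HISTORY_BUDGET:
            summary = compress(str(history), guidelines["history"], call_compressor)
            history = [("<summary>", summary)]
```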

DIAGRAM

ACON Trajectory Compression and Feedback Loop

This diagram shows how ACON runs paired trajectories with and without compression, then uses failures to refine compression guidelines.

DIAGRAM

ACON Evaluation Pipeline Across Benchmarks

This diagram shows how ACON is evaluated on AppWorld, OfficeBench, and 8-objective QA with different compression and distillation settings.

PROCESS

How ACON Handles a Long-horizon Agent Task

  1. History Compression

    ACON monitors the interaction history length and, when it exceeds a threshold, uses History Compression to rewrite the history h_t into a compact h′_t that preserves key state.

  2. Observation Compression

    For long tool outputs, ACON applies Observation Compression to map the observation o_t and prior history h_{t−1} into a condensed o′_t, stripping redundancy while keeping decision-critical details.

  3. Compression Guideline Optimization

    ACON runs paired trajectories with full and compressed contexts, then uses the contrastive failures to update natural-language compression guidelines for the compressor (see the sketch after this list).

  4. Distilling Context Compression into Small Models

    ACON trains smaller LMs like Qwen3-14B on teacher compressor outputs so they can perform the same compression with much lower overhead.
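
Step 3 is the piece that makes compression task-aware. Below is a minimal sketch of how contrastive rollouts could drive guideline updates; `run_agent` and `call_optimizer` are assumed interfaces, not ACON's actual code:

```python
# Sketch of failure-driven compression guideline optimization.
# run_agent returns an object with .success and .trace (assumed interface).

def optimize_guideline(tasks, guideline, run_agent, call_optimizer, rounds=3):
    for _ in range(rounds):
        failures = []
        for task in tasks:
            full = run_agent(task, guideline=None)       # uncompressed rollout
            comp = run_agent(task, guideline=guideline)  # compressed rollout
            # A contrastive failure: the full-context run succeeds but the
            # compressed run fails, so compression dropped something critical.
            if full.success and not comp.success:
                failures.append((task, full.trace, comp.trace))
        if not failures:
            break
        # Ask an optimizer LLM to diagnose what was lost and rewrite the
        # natural-language guideline; the agent itself is never retrained.
        report = "\n\n".join(
            f"Task: {t}\nFull-context trace:\n{f}\nCompressed trace:\n{c}"
            for t, f, c in failures
        )
        guideline = call_optimizer(
            f"Current guideline:\n{guideline}\n\nThese tasks failed only under "
            f"compression:\n{report}\n\nRevise the guideline so the compressor "
            f"preserves the missing information."
        )
    return guideline
```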

KEY CONTRIBUTIONS

Key Contributions

  • Agent Context Optimization framework

    ACON introduces a unified framework for History Compression and Observation Compression in long-horizon agents, reducing peak tokens by 26–54% across three benchmarks.

  • Failure-driven compression guideline optimization

    ACON uses Compression Guideline Optimization with contrastive trajectories to refine prompts, enabling task-aware compression without updating LLM parameters.

  • Distilling context compression into small models

    ACON distills optimized compressors into smaller LMs, preserving over 95% of the teacher’s accuracy while cutting compressor overhead for deployment.
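
As a rough sketch of the distillation step, one could log the optimized teacher compressor's outputs as supervised fine-tuning data for a smaller model. The interfaces and data format below are assumptions, not the paper's recipe:

```python
# Sketch of compressor distillation: log teacher compressions as SFT data.
# call_teacher is an assumed interface to the large optimized compressor.

def build_distillation_data(contexts, guideline, call_teacher):
    examples = []
    for context in contexts:
        prompt = f"{guideline}\n\nCompress the following context:\n\n{context}"
        # Record the teacher's compression as the training target.
        examples.append({"prompt": prompt, "completion": call_teacher(prompt)})
    return examples
```

The resulting prompt/completion pairs can then be fed to any standard supervised fine-tuning recipe (for example, LoRA on a smaller checkpoint such as Qwen3-14B) so the student reproduces the teacher compressor's behavior at much lower inference cost.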

RESULTS

By the Numbers

Accuracy (AppWorld average)

56.5%

+0.5 points over No compression

Peak input tokens (AppWorld)

7.33k

-2.60k vs No compression

Accuracy (OfficeBench)

74.74%

-2.10 points vs No compression with 32.2% fewer peak tokens

EM (8-objective QA)

0.373

+0.007 over Prompting with similar peak tokens

On AppWorld, OfficeBench, and 8-objective QA, ACON is evaluated with gpt-4.1 agents under history and observation compression settings. These results show that ACON preserves or slightly improves accuracy while substantially reducing peak context length, compared with the No compression and Prompting baselines.

BENCHMARK

AppWorld test-normal: History Compression with gpt-4.1 Agent

Accuracy (%) on AppWorld (Average over 168 tasks) under different history compression strategies.

BENCHMARK

8-objective QA: History Compression with gpt-4.1 Agent

Exact Match (EM) on 8-objective QA under different history compression strategies.

KEY INSIGHT

The Counterintuitive Finding

On 8-objective QA, ACON UT reaches 0.373 EM and 0.494 F1 while reducing peak tokens from 10.35k to 4.71k.

This is surprising because aggressive compression is usually expected to hurt accuracy, yet ACON’s optimized summaries slightly improve EM and F1 over No compression.

WHY IT MATTERS

What this unlocks for the field

ACON makes it practical to run long-horizon LLM agents with compressed histories and observations while keeping task performance near uncompressed baselines.

Builders can now deploy smaller or cheaper agents like Qwen3-14B on complex multi-step workflows, using distilled compressors to handle context growth without rewriting agent policies.

Related papers

Memory Architecture

A Control Architecture for Training-Free Memory Use

Yanzhen Lu, Muchen Jiang et al. · 2026

TAG routes low-confidence steps via uncertainty-based routing, filters them with guarded acceptance and rollback, selects between rule and exemplar memory banks, and prunes via evidence-based retirement, all inside a unified control loop. On SVAMP and ASDiv, TAG reaches 81.0% and 85.2% accuracy, improving over the 74.0% and 77.5% no-memory baselines, while a compute-matched Retry baseline stays flat.
