HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model

Authors: Mengkang Hu, Tianxing Chen, Qiguang Chen, et al.

2024

TL;DR

HiAgent uses subgoal-based hierarchical working memory with observation summarization and trajectory retrieval to double the success rate (42% vs. 21%) on long-horizon agent tasks.



THE PROBLEM

Long-horizon agents drown in redundant context and stall at a 21.00% success rate

STANDARD agents push all historical action-observation pairs into context, creating a long, redundant working memory that harms reasoning on long-horizon tasks.

On AgentBoard tasks like Blocksworld and Tyreworld, this STANDARD strategy yields only a 21.00% overall success rate, plus long trajectories that waste context and runtime.

HOW IT WORKS

HiAgent — subgoal-based hierarchical working memory

HiAgent’s core mechanism combines Subgoal-based Hierarchical Working Memory, Observation Summarization, and Trajectory Retrieval to chunk trajectories by subgoal and compress past details.

You can think of HiAgent like a programmer using RAM for the current function and a log file for completed functions, only reopening logs when debugging.

This hierarchical working memory lets HiAgent keep context focused on the current subgoal while still recalling detailed past trajectories on demand, something a flat context window cannot do.

DIAGRAM

HiAgent in trial interaction and memory update flow

This diagram shows how HiAgent alternates between generating subgoals, executing actions, summarizing observations, and retrieving trajectories within a single trial.

DIAGRAM

Evaluation pipeline and ablation design for HiAgent

This diagram shows how HiAgent is evaluated on AgentBoard tasks and how ablations remove Observation Summarization and Trajectory Retrieval.

PROCESS

How HiAgent Handles a Long Horizon Agent Task

  1. Subgoal-based Hierarchical Working Memory

    HiAgent first prompts the LLM to formulate a subgoal g_i as a milestone for the task.

  2. Generate precise actions

    Conditioned on the current subgoal, HiAgent generates precise actions and collects the resulting action-observation pairs into a memory chunk for that subgoal.

  3. Observation Summarization

    When HiAgent determines the subgoal is fulfilled, Observation Summarization compresses the chunk into a summarized observation s_i and replaces the detailed pairs with (g_i, s_i).

  4. Trajectory Retrieval

    If HiAgent later needs details from a past subgoal, Trajectory Retrieval recalls that subgoal's full action-observation trajectory back into working memory.
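The four steps above can be sketched as a single control loop. This is an illustrative sketch under assumptions: the `llm` and `env` interfaces (`propose_subgoal`, `subgoal_done`, `next_action`, `summarize`, `step`, `task_done`) are hypothetical stand-ins, not the paper's API.

```python
def run_episode(llm, env, max_subgoals=30):
    """Minimal HiAgent-style loop: one memory chunk per subgoal;
    each finished chunk collapses to (subgoal, summary)."""
    memory = []  # list of {"goal": g_i, "pairs": [...], "summary": s_i or None}
    for _ in range(max_subgoals):
        # Step 1: formulate the next subgoal g_i from the compressed history.
        goal = llm.propose_subgoal(memory)
        chunk = {"goal": goal, "pairs": [], "summary": None}
        memory.append(chunk)
        # Step 2: act until the LLM judges the subgoal fulfilled.
        while not llm.subgoal_done(goal, chunk["pairs"]):
            action = llm.next_action(goal, memory)   # may trigger Step 4 (retrieval)
            obs = env.step(action)
            chunk["pairs"].append((action, obs))
        # Step 3: Observation Summarization replaces detail with s_i.
        chunk["summary"] = llm.summarize(goal, chunk["pairs"])
        if env.task_done():
            return memory
    return memory
```

Step 4 (Trajectory Retrieval) happens inside `next_action` when the LLM asks to expand a past chunk's full pairs; the loop itself only ever grows one detailed chunk at a time.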

KEY CONTRIBUTIONS

Key Contributions

  • Hierarchical working memory management

    HiAgent introduces Subgoal-based Hierarchical Working Memory that chunks trajectories by subgoal, yielding a 42.00% overall success rate versus 21.00% for STANDARD.

  • Observation Summarization and Trajectory Retrieval

    HiAgent combines Observation Summarization and Trajectory Retrieval so past subgoals are stored as concise summaries but can be expanded into detailed trajectories when needed.

  • Comprehensive evaluation on long-horizon tasks

    HiAgent is evaluated on five AgentBoard tasks requiring more than 20 steps, reducing average steps by 3.80 and context length by 35.02% compared to STANDARD.

RESULTS

By the Numbers

Success Rate (SR)

42.00%

+21.00 over STANDARD

Progress Rate (PR)

62.55%

+23.94 over STANDARD

Average Steps

22.61 steps

3.80 fewer steps than STANDARD

Context Efficiency

64.98% of STANDARD's tokens

35.02% fewer context tokens than STANDARD

On five AgentBoard long-horizon tasks (Blocksworld, Gripper, Tyreworld, Barman, Jericho), HiAgent doubles the overall success rate and shortens trajectories versus STANDARD. These numbers show that hierarchical working memory with subgoal chunks materially improves both the effectiveness and efficiency of LLM-based agents.


BENCHMARK

Overall performance of STANDARD and HiAgent on five long-horizon agent tasks

Overall Success Rate (SR) across Blocksworld, Gripper, Tyreworld, Barman, and Jericho.

BENCHMARK

Ablation study of HiAgent on Tyreworld

Success Rate (SR) on Tyreworld for HiAgent and its ablations: w/o OS, w/o TR, and w/o OS & TR.

KEY INSIGHT

The Counterintuitive Finding

On Tyreworld, HiAgent with full hierarchical memory reaches a 60.0% success rate, while task decomposition without hiding past trajectories reaches only 40.0%.

This is surprising because many assume subgoal generation alone is enough, but HiAgent shows that without memory compression, context length and runtime actually increase.

WHY IT MATTERS

What this unlocks for the field

HiAgent unlocks long-horizon agents that maintain high executability and progress even as step counts exceed 20, by keeping working memory focused and structured.

Builders can now design LLM agents that tackle complex multi-step environments like Jericho or Tyreworld without hitting context limits or drowning in redundant history.


