Self-evolving Agents with reflective and memory-augmented abilities

Authors: Xuechen Liang, Yangfan He, Yinghui Xia et al.

2024

TL;DR

SAGE uses iterative feedback, reflection, and MemorySyntax based on the Ebbinghaus forgetting curve to boost AgentBench scores up to 2.26X over closed-source baselines.



THE PROBLEM

LLM agents lack long-term memory and struggle with continuous decision-making

In dynamic environments, LLM agents must make continuous decisions while lacking long-term memory and working within limited context windows, especially over long time spans.

These weaknesses hurt multi-source QA, code generation, and multi-agent systems, where missing long-span information leads to basic errors, context-limit-exceeded failures, and poor task completion.

HOW IT WORKS

SAGE — iterative feedback, reflection, and MemorySyntax

SAGE combines Iterative Feedback, Reflection, Short-Term Memory, Long-Term Memory, and MemorySyntax so the assistant adapts policies and memory contents using checker feedback.

You can think of Short-Term Memory as RAM for recent trajectories and Long-Term Memory as a disk of distilled reflections, with MemorySyntax acting like a smart cache eviction policy.

By embedding the Ebbinghaus-based MemorySyntax into SAGE, the system selectively strengthens or discards information in ways a plain context window cannot, enabling long-span reasoning without context overflow.
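The retention-based eviction idea can be sketched in a few lines. This is an illustrative assumption, not the paper's exact formulation: it uses the classic Ebbinghaus retention curve R = exp(-t/S), where t is time since last access and S is memory strength, and the threshold values for θ1 and θ2 are made up for the example.

```python
import math

# Assumed thresholds with theta1 > theta2; the paper's values are not given here.
THETA1, THETA2 = 0.7, 0.3

def retention(age: float, strength: float) -> float:
    """Ebbinghaus forgetting curve: retention decays exponentially with time."""
    return math.exp(-age / strength)

def memory_syntax(item: dict) -> str:
    """Decide an item's fate from its current retention, like a cache eviction policy."""
    r = retention(item["age"], item["strength"])
    if r >= THETA1:
        return "retain_in_stm"    # still fresh: keep in Short-Term Memory
    if r >= THETA2:
        return "transfer_to_ltm"  # fading but valuable: distill into Long-Term Memory
    return "discard"              # below theta2: forget

print(memory_syntax({"age": 0.1, "strength": 1.0}))
print(memory_syntax({"age": 1.0, "strength": 1.0}))
print(memory_syntax({"age": 5.0, "strength": 1.0}))
```

The three-way split mirrors the RAM/disk analogy above: high-retention items stay "in RAM", mid-retention items are written "to disk", and the rest are evicted outright.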

DIAGRAM

Assistant-checker interaction and the self-evolution loop

This diagram shows how SAGE runs the actual interaction phase: the assistant iteratively updates its outputs based on checker feedback until validation succeeds or the iteration cap is reached.
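The interaction loop in the diagram can be sketched as below. The `assistant` and `checker` callables are hypothetical stand-ins for LLM calls, and the toy usage at the end is an assumption for illustration only.

```python
# Generate, get checker feedback, revise -- until validated or the cap is hit.
def interaction_loop(task, assistant, checker, max_iters=3):
    feedback = None
    for t in range(max_iters):
        output = assistant(task, feedback)            # o_t under policy pi_theta
        ok, feedback, reward = checker(task, output)  # f_t and R_t at step t
        if ok:
            return output, t + 1                      # validated early
    return output, max_iters                          # iteration cap reached

# Toy usage: the checker accepts once the output contains "fixed".
out, iters = interaction_loop(
    "demo",
    assistant=lambda task, fb: "fixed" if fb else "draft",
    checker=lambda task, o: ("fixed" in o, "add fix", 1.0 if "fixed" in o else 0.0),
)
```

The key design point is that feedback flows back into the next generation call, so each iteration conditions on the checker's critique rather than starting from scratch.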

DIAGRAM

Evaluation pipeline across AgentBench and long-context tasks

This diagram shows how SAGE is evaluated on AgentBench, long-form QA, code completion, and RAG agents, including memory-optimization ablations.

PROCESS

How SAGE Handles a Task Session

  1. Initialization Phase

    SAGE assigns roles to the user, assistant, and checker, building the input set I_A = (d_U, i_U) and initializing Short-Term Memory and Long-Term Memory.

  2. Actual Interaction Phase

    SAGE uses Iterative Feedback: the assistant generates outputs o_t under policy π_θ, while the checker returns feedback f_t and rewards R_t at each time step.

  3. Reflection

    SAGE computes the self-reflection r_t = ref(o_1:t, R_1:t) from trajectories and rewards, then stores r_t in Long-Term Memory M_L for future decisions.

  4. MemorySyntax

    SAGE applies MemorySyntax with the Ebbinghaus forgetting curve to update M_S and M_L, using thresholds θ1 and θ2 to retain, transfer, or discard information I_t.
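The reflection step (phase 3) can be sketched as follows. The `reflect` summarizer here is a hypothetical stand-in for the LLM call that computes r_t = ref(o_1:t, R_1:t); its string format and the toy trajectory are assumptions, not the paper's implementation.

```python
# Distill a trajectory and its rewards into a self-reflection, then store it
# in Long-Term Memory so future sessions can condition on the lesson.
def reflection_phase(outputs, rewards, long_term_memory, reflect):
    r_t = reflect(outputs, rewards)      # r_t = ref(o_1:t, R_1:t)
    long_term_memory.append(r_t)         # stored in M_L for future decisions
    return r_t

ltm = []  # Long-Term Memory M_L
lesson = reflection_phase(
    outputs=["draft", "fixed"],
    rewards=[0.0, 1.0],
    long_term_memory=ltm,
    # Assumed summarizer: name the highest-reward step in the trajectory.
    reflect=lambda o, r: f"step {r.index(max(r))} succeeded: {o[r.index(max(r))]}",
)
```

This is what turns sparse scalar rewards into a reusable artifact: the stored reflection carries far more signal than the raw reward alone.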

KEY CONTRIBUTIONS

Key Contributions

  • Self-evolving reflection mechanism

    SAGE introduces Reflection, which turns sparse rewards and trajectories into rich self-reflections r_t stored in Long-Term Memory, improving decision-making without extra training.

  • MemorySyntax based on the Ebbinghaus curve

    SAGE proposes MemorySyntax, combining the Ebbinghaus forgetting curve with linguistic optimization to manage Short-Term Memory and Long-Term Memory using thresholds θ1 and θ2.

  • Improved benchmarks on AgentBench and QA

    SAGE achieves up to a 2.26X improvement on closed-source models and gains of 57.7% to 100% on open-source models, with GPT-3.5's HotpotQA accuracy rising from 48.5% to 68.3%.

RESULTS

By the Numbers

  • Database score: 37.6 (+11.7 over GPT-3.5 Base on AgentBench DB, 25.9 to 37.6)

  • HotpotQA answer accuracy: 68.3% (+19.8 points over the GPT-3.5 baseline on long-form QA, 48.5% to 68.3%)

  • ALFWorld task completion: 73.8% (+17.3 points for Mistral-7b with SAGE on the Sequential Task)

  • HotpotQA QA accuracy: 74.8% (+4.7 points for ChatGPT-4 SAGE over FiD on RAG multi-document QA)

On AgentBench, SAGE raises GPT-3.5's Database score from 25.9 to 37.6 and lifts smaller models like Llama2-7B Chat from 0.0 to 25.0 on DB. On HotpotQA, SAGE raises GPT-3.5's answer accuracy from 48.5% to 68.3%, showing that reflective memory and MemorySyntax substantially improve long-context reasoning.


BENCHMARK

Baseline and SAGE Framework Performance on AgentBench — Database task

Scores on the AgentBench Database task for GPT-4, GPT-3.5, and two open-source models with and without SAGE.

BENCHMARK

Ablation study for memory optimization on AgentBench

Effect of SAGE memory optimization on Qwen-1.8B and CodeLlama-7B for the Knowledge Graph task.

KEY INSIGHT

The Counterintuitive Finding

With memory optimization, SAGE enables the small Qwen-1.8B Chat to reach 45.3 on the AgentBench KG task, up from only 6.8 without it.

This is surprising because parameter count, not memory management, is usually expected to dominate performance; yet SAGE's MemorySyntax narrows the gap toward GPT-3.5-level agents.

WHY IT MATTERS

What this unlocks for the field

SAGE shows that reflective memory and Ebbinghaus-based MemorySyntax can turn weaker LLM agents into competitive multi-task, long-context reasoners.

Builders can now design agent systems in which small open-source models, equipped with SAGE, handle complex multi-step tasks and RAG workflows with far lower memory usage and cost.


Related papers

Memory Architecture

A Control Architecture for Training-Free Memory Use

Yanzhen Lu, Muchen Jiang et al.

· 2026

TAG routes low-confidence steps via uncertainty-based routing, filters them with guarded acceptance and rollback, selects between rule and exemplar memory banks, and prunes via evidence-based retirement inside a unified control loop. On SVAMP and ASDiv, TAG reaches 81.0% and 85.2% accuracy, improving over the 74.0% and 77.5% no-memory baselines, while a compute-matched Retry baseline stays flat.
