APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Authors: Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha

2026

TL;DR

APEX-EM uses structured Procedural Knowledge Graph experience replay to push KGQAGen-10k accuracy to 89.6% (95.3% CSR), +48.3pp over a 41.3% no-memory baseline.



THE PROBLEM

LLM agents stay stateless despite repeated, similar tasks (41.3% vs 89.6% on KGQAGen-10k)

APEX-EM highlights that a Claude Sonnet 4.5 agent without memory reaches only 41.3% accuracy on KGQAGen-10k, despite repeated exposure to structurally similar questions.

This means LLM-based autonomous agents repeatedly re-derive solutions from scratch, wasting prior work and leaving large performance gains on the table for code, queries, and reasoning.

HOW IT WORKS

APEX-EM — Procedural Knowledge Graphs plus PRGII experience replay

APEX-EM centers on a Procedural Knowledge Graph, Experience Memory store, PRGII workflow, Task Verifiers, and StructuralSignatureExtractor that encode full procedural-episodic traces as reusable experiences.

You can think of APEX-EM as a hybrid between an RL experience replay buffer and a card catalog, where each card is a structured plan with rich metadata.

This design lets APEX-EM replay and adapt entire procedures using structural signatures and dual-outcome indexing, instead of relying on a flat context window of unstructured reflections.
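To make the card-catalog analogy concrete, here is a minimal sketch of what one stored experience could look like. The field names and schema below are illustrative assumptions, not the paper's actual data model; the paper only tells us that each experience carries a structural signature, an artifact, reflections, and a success/failure outcome.

```python
from dataclasses import dataclass, field

# Hypothetical "card" in the Experience Memory. Field names are
# illustrative assumptions, not APEX-EM's actual schema.
@dataclass
class Experience:
    task_text: str                   # original task description
    structural_signature: list[str]  # abstract operation sequence
    artifact: str                    # generated code or SPARQL query
    success: bool                    # dual-outcome indexing: positive or negative example
    reflections: list[str] = field(default_factory=list)  # goal/procedure reflections

card = Experience(
    task_text="Which films released after 2010 won the most awards?",
    structural_signature=["entity_resolution", "temporal_filter", "aggregation"],
    artifact="SELECT ?film WHERE { ... }",
    success=True,
)
print(card.structural_signature)
```

Indexing by the signature rather than the raw task text is what lets the "catalog" match structurally similar tasks even when their wording differs.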

DIAGRAM

PRGII workflow: Plan–Retrieve–Generate–Iterate–Ingest loop

This diagram shows how APEX-EM runs the PRGII workflow to solve a task and commit a structured experience back into memory.

DIAGRAM

Evaluation setup across BigCodeBench, KGQAGen-10k, and HLE

This diagram shows how APEX-EM is evaluated on three benchmarks with frozen backbones and shared baselines like MemRL and RAG.

PROCESS

How APEX-EM Handles a Task via the PRGII Workflow

  1. Plan Phase

    APEX-EM parses the task into Task Understanding, runs Entity Discovery and Schema Discovery, and uses the StructuralSignatureExtractor to hypothesize an abstract operation sequence.

  2. Retrieve Phase

    APEX-EM queries the Experience Memory store using semantic search, structural signature matching, and PKG traversal to collect successful and failed experiences.

  3. Generate Phase

    Conditioned on Goal Reflections, Procedure Reflections, and negative examples from the Error Registry, APEX-EM generates an executable artifact such as code or a SPARQL query.

  4. Iterate and Ingest Phases

    Task Verifiers validate each artifact and drive refinement across iterations; the Teacher and quality gate then decide whether to store the run as a successful or failed Experience in the Procedural Knowledge Graph.
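The four phases above can be sketched as a single loop. Every helper below is a toy stand-in for the paper's components (planner, retriever, generator, verifier, Teacher), stubbed with trivial logic so the control flow is runnable; none of it is APEX-EM's actual implementation.

```python
# Hedged sketch of the PRGII loop (Plan-Retrieve-Generate-Iterate-Ingest).
# All helpers are hypothetical stand-ins, not the paper's components.

def plan(task):
    # Plan: hypothesize an abstract operation sequence (structural signature).
    return ["entity_resolution", "aggregation"]

def retrieve(memory, signature):
    # Retrieve: dual-outcome lookup, splitting matching experiences into
    # positive (successful) and negative (failed) in-context examples.
    hits = [e for e in memory if e["signature"] == signature]
    return ([e for e in hits if e["success"]],
            [e for e in hits if not e["success"]])

def generate(task, positives, negatives):
    # Generate: produce an executable artifact conditioned on reflections.
    return f"QUERY({task})"

def verify(artifact):
    # Iterate: a Task Verifier returns (passed, feedback).
    return artifact.startswith("QUERY"), ""

def quality(artifact):
    # Ingest: Teacher + quality gate score for the finished run.
    return 1.0

def prgii(task, memory, max_iters=3, theta=0.8):
    signature = plan(task)
    positives, negatives = retrieve(memory, signature)
    artifact = generate(task, positives, negatives)
    for _ in range(max_iters):
        passed, feedback = verify(artifact)
        if passed:
            break
        artifact = generate(task, positives, negatives + [{"feedback": feedback}])
    q = quality(artifact)
    memory.append({"signature": signature, "artifact": artifact,
                   "success": q >= theta})
    return artifact

memory = []
prgii("count awards per film", memory)
print(memory[0]["success"])  # the run clears the gate and is stored as positive
```

The key structural point survives the stubbing: both successful and failed runs flow back into memory, so the next retrieval sees positives and negatives alike.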

KEY CONTRIBUTIONS

Key Contributions

  • Procedural Knowledge Graph

    APEX-EM introduces a Procedural Knowledge Graph that stores Experiences, Entities, Sub-Tasks, Operations, and TaskTopic nodes with structural signatures like [entity_resolution → temporal_filter → aggregation].

  • PRGII workflow with Task Verifiers

    APEX-EM defines the Plan-Retrieve-Generate-Iterate-Ingest workflow where Task Verifiers and a Teacher provide multi-dimensional scores c, η, κ and an overall quality q with threshold θ.

  • Dual-outcome Experience Memory

    APEX-EM builds a dual-outcome Experience Memory that treats successful experiences as positive in-context examples and failed ones as negative examples with structured Error Registry and Patch Reflections.
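As a toy illustration of the Ingest-side quality gate, the dimension scores c, η, κ could be combined into an overall quality q and compared against θ. The weighted-mean aggregation and the weights below are assumptions made for illustration; the paper only names the symbols, not their combination rule.

```python
# Hypothetical quality gate: combine verifier/Teacher scores c, eta, kappa
# into an overall quality q and compare against threshold theta. The
# weighted-mean aggregation and the weights are illustrative assumptions.

def quality_gate(c, eta, kappa, theta=0.75, weights=(0.5, 0.25, 0.25)):
    q = weights[0] * c + weights[1] * eta + weights[2] * kappa
    outcome = "success" if q >= theta else "failure"
    return outcome, q

outcome, q = quality_gate(0.9, 0.8, 0.7)
print(outcome)  # this run clears theta and is stored as a positive example
```

Either way the gate resolves, the run is kept: a "failure" outcome routes the experience into the negative side of the dual-outcome memory rather than discarding it.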

RESULTS

By the Numbers

KGQAGen-10k LASM Accuracy

89.6%

+48.3pp over No Memory (41.3%)

KGQAGen-10k CSR

95.3%

vs No Memory baseline (CSR not reported)

BigCodeBench Last SR

83.3%

+29.4pp over No Memory Sonnet 4.5 (53.9%)

HLE Last SR (500q)

48.0%

+22.8pp over No Memory Opus 4.5 (25.2%)

On KGQAGen-10k, a structured query benchmark, APEX-EM reaches 89.6% accuracy and 95.3% CSR, beating the 41.3% no-memory baseline and the GPT-4o w/ SP oracle at 84.9%. On BigCodeBench and HLE, APEX-EM delivers +29.4pp and +22.8pp gains in success rate, showing that structured procedural replay scales across code and multi-domain reasoning.


BENCHMARK

KGQAGen-10k: APEX-EM vs LLM and KG-RAG baselines

LASM Accuracy on KGQAGen-10k test split and training sample for APEX-EM and key baselines.

BENCHMARK

BigCodeBench: APEX-EM vs MemRL and no-memory baselines

Last Epoch Success Rate on BigCodeBench train split for APEX-EM and MemRL baselines.

KEY INSIGHT

The Counterintuitive Finding

On BigCodeBench, APEX-EM’s rich judge feedback barely helps over binary success signals, with A1≈A2 despite adding detailed qualitative evaluations.

Yet on KGQAGen-10k, the same rich feedback boosts accuracy by +10.3pp over binary-only memory, contradicting the intuition that more detailed supervision always helps code more than symbolic queries.

WHY IT MATTERS

What this unlocks for the field

APEX-EM shows that non-parametric online learning with structured procedural-episodic memory can rival or beat oracle retrieval systems using only past executions.

Builders can now deploy frozen LLM backbones that still learn new procedures over time, transferring skills across domains with zero lexical overlap via structural signatures like entity_resolution → temporal_filter → aggregation.


Related papers

Agent MemoryLong-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Agent Memory

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang, Dongyu Ru et al.

2026

AMemGym combines Structured Data Generation, On-Policy Interaction, Evaluation Metrics, and Meta-Evaluation to script user state trajectories, drive LLM-simulated role-play, and score write–read–utilization behavior. On AMemGym’s base configuration, AWE-(2,4,30) reaches a 0.291 normalized memory score on interactive evaluation, while native gpt-4.1-mini only achieves 0.203, exposing substantial gaps between memory agents and plain long-context LLMs.

Agent Memory

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Emmanuel Bamidele

2026

AMV-L manages agent memory using a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to decouple retention from retrieval eligibility. Under a 70k-request long-running workload, AMV-L improves throughput from 9.027 to 36.977 req/s over TTL and reduces p99 latency from 5398.167 ms to 1233.430 ms while matching LRU’s retrieval quality.
