APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Authors: Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha

2026

TL;DR

APEX-EM uses structured Procedural Knowledge Graph experience replay to push KGQAGen-10k accuracy to 89.6% (95.3% CSR), +48.3pp over a 41.3% no-memory baseline.



THE PROBLEM

LLM agents stay stateless despite repeated, similar tasks (41.3% vs 89.6% on KGQAGen-10k)

APEX-EM highlights that a Claude Sonnet 4.5 agent without memory reaches only 41.3% accuracy on KGQAGen-10k, despite repeated exposure to structurally similar questions.

This means LLM-based autonomous agents repeatedly re-derive solutions from scratch, wasting prior work and leaving large performance gains on the table for code, queries, and reasoning.

HOW IT WORKS

APEX-EM — Procedural Knowledge Graphs plus PRGII experience replay

APEX-EM centers on a Procedural Knowledge Graph, Experience Memory store, PRGII workflow, Task Verifiers, and StructuralSignatureExtractor that encode full procedural-episodic traces as reusable experiences.

You can think of APEX-EM as a hybrid between an RL experience replay buffer and a card catalog, where each card is a structured plan with rich metadata.

This design lets APEX-EM replay and adapt entire procedures using structural signatures and dual-outcome indexing, instead of relying on a flat context window of unstructured reflections.
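To make the card-catalog analogy concrete, here is a minimal sketch of what one stored experience could look like. The field names and schema below are illustrative assumptions, not the paper's actual data model; the paper only tells us that each experience carries a structural signature, an artifact, reflections, and a success/failure outcome.

```python
from dataclasses import dataclass, field

# Hypothetical "card" in the Experience Memory. Field names are
# illustrative assumptions, not APEX-EM's actual schema.
@dataclass
class Experience:
    task_text: str                   # original task description
    structural_signature: list[str]  # abstract operation sequence
    artifact: str                    # generated code or SPARQL query
    success: bool                    # dual-outcome indexing: positive or negative example
    reflections: list[str] = field(default_factory=list)  # goal/procedure reflections

card = Experience(
    task_text="Which films released after 2010 won the most awards?",
    structural_signature=["entity_resolution", "temporal_filter", "aggregation"],
    artifact="SELECT ?film WHERE { ... }",
    success=True,
)
print(card.structural_signature)
```

Indexing by the signature rather than the raw task text is what lets the "catalog" match structurally similar tasks even when their wording differs.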

DIAGRAM

PRGII workflow: Plan–Retrieve–Generate–Iterate–Ingest loop

This diagram shows how APEX-EM runs the PRGII workflow to solve a task and commit a structured experience back into memory.

DIAGRAM

Evaluation setup across BigCodeBench, KGQAGen-10k, and HLE

This diagram shows how APEX-EM is evaluated on three benchmarks with frozen backbones and shared baselines like MemRL and RAG.

PROCESS

How APEX-EM Handles a Task via the PRGII Workflow

  1. Plan Phase

    APEX-EM parses the task into Task Understanding, runs Entity Discovery and Schema Discovery, and uses the StructuralSignatureExtractor to hypothesize an abstract operation sequence.

  2. Retrieve Phase

    APEX-EM queries the Experience Memory store using semantic search, structural signature matching, and PKG traversal to collect successful and failed experiences.

  3. Generate Phase

    Conditioned on Goal Reflections, Procedure Reflections, and negative examples from the Error Registry, APEX-EM generates an executable artifact such as code or a SPARQL query.

  4. Iterate and Ingest Phases

    Task Verifiers validate each artifact and drive refinement across iterations; the Teacher and quality gate then decide whether to store the run as a successful or failed Experience in the Procedural Knowledge Graph.
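The four phases above can be sketched as a single loop. Every helper below is a toy stand-in for the paper's components (planner, retriever, generator, verifier, Teacher), stubbed with trivial logic so the control flow is runnable; none of it is APEX-EM's actual implementation.

```python
# Hedged sketch of the PRGII loop (Plan-Retrieve-Generate-Iterate-Ingest).
# All helpers are hypothetical stand-ins, not the paper's components.

def plan(task):
    # Plan: hypothesize an abstract operation sequence (structural signature).
    return ["entity_resolution", "aggregation"]

def retrieve(memory, signature):
    # Retrieve: dual-outcome lookup, splitting matching experiences into
    # positive (successful) and negative (failed) in-context examples.
    hits = [e for e in memory if e["signature"] == signature]
    return ([e for e in hits if e["success"]],
            [e for e in hits if not e["success"]])

def generate(task, positives, negatives):
    # Generate: produce an executable artifact conditioned on reflections.
    return f"QUERY({task})"

def verify(artifact):
    # Iterate: a Task Verifier returns (passed, feedback).
    return artifact.startswith("QUERY"), ""

def quality(artifact):
    # Ingest: Teacher + quality gate score for the finished run.
    return 1.0

def prgii(task, memory, max_iters=3, theta=0.8):
    signature = plan(task)
    positives, negatives = retrieve(memory, signature)
    artifact = generate(task, positives, negatives)
    for _ in range(max_iters):
        passed, feedback = verify(artifact)
        if passed:
            break
        artifact = generate(task, positives, negatives + [{"feedback": feedback}])
    q = quality(artifact)
    memory.append({"signature": signature, "artifact": artifact,
                   "success": q >= theta})
    return artifact

memory = []
prgii("count awards per film", memory)
print(memory[0]["success"])  # the run clears the gate and is stored as positive
```

The key structural point survives the stubbing: both successful and failed runs flow back into memory, so the next retrieval sees positives and negatives alike.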

KEY CONTRIBUTIONS

Key Contributions

  • Procedural Knowledge Graph

    APEX-EM introduces a Procedural Knowledge Graph that stores Experiences, Entities, Sub-Tasks, Operations, and TaskTopic nodes with structural signatures like [entity_resolution → temporal_filter → aggregation].

  • PRGII workflow with Task Verifiers

    APEX-EM defines the Plan-Retrieve-Generate-Iterate-Ingest workflow where Task Verifiers and a Teacher provide multi-dimensional scores c, η, κ and an overall quality q with threshold θ.

  • Dual-outcome Experience Memory

    APEX-EM builds a dual-outcome Experience Memory that treats successful experiences as positive in-context examples and failed ones as negative examples with structured Error Registry and Patch Reflections.
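As a toy illustration of the Ingest-side quality gate, the dimension scores c, η, κ could be combined into an overall quality q and compared against θ. The weighted-mean aggregation and the weights below are assumptions made for illustration; the paper only names the symbols, not their combination rule.

```python
# Hypothetical quality gate: combine verifier/Teacher scores c, eta, kappa
# into an overall quality q and compare against threshold theta. The
# weighted-mean aggregation and the weights are illustrative assumptions.

def quality_gate(c, eta, kappa, theta=0.75, weights=(0.5, 0.25, 0.25)):
    q = weights[0] * c + weights[1] * eta + weights[2] * kappa
    outcome = "success" if q >= theta else "failure"
    return outcome, q

outcome, q = quality_gate(0.9, 0.8, 0.7)
print(outcome)  # this run clears theta and is stored as a positive example
```

Either way the gate resolves, the run is kept: a "failure" outcome routes the experience into the negative side of the dual-outcome memory rather than discarding it.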

RESULTS

By the Numbers

KGQAGen-10k LASM Accuracy

89.6%

+48.3pp over No Memory (41.3%)

KGQAGen-10k CSR

95.3%

vs No Memory baseline (CSR not reported)

BigCodeBench Last SR

83.3%

+29.4pp over No Memory Sonnet 4.5 (53.9%)

HLE Last SR (500q)

48.0%

+22.8pp over No Memory Opus 4.5 (25.2%)

On KGQAGen-10k, a structured query benchmark, APEX-EM reaches 89.6% accuracy and 95.3% CSR, beating the 41.3% no-memory baseline and the GPT-4o w/ SP oracle at 84.9%. On BigCodeBench and HLE, APEX-EM delivers +29.4pp and +22.8pp gains in success rate, showing that structured procedural replay scales across code and multi-domain reasoning.


BENCHMARK

KGQAGen-10k: APEX-EM vs LLM and KG-RAG baselines

LASM Accuracy on KGQAGen-10k test split and training sample for APEX-EM and key baselines.

BENCHMARK

BigCodeBench: APEX-EM vs MemRL and no-memory baselines

Last Epoch Success Rate on BigCodeBench train split for APEX-EM and MemRL baselines.

KEY INSIGHT

The Counterintuitive Finding

On BigCodeBench, APEX-EM’s rich judge feedback barely helps over binary success signals, with A1≈A2 despite adding detailed qualitative evaluations.

Yet on KGQAGen-10k, the same rich feedback boosts accuracy by +10.3pp over binary-only memory, contradicting the intuition that more detailed supervision always helps code more than symbolic queries.

WHY IT MATTERS

What this unlocks for the field

APEX-EM shows that non-parametric online learning with structured procedural-episodic memory can rival or beat oracle retrieval systems using only past executions.

Builders can now deploy frozen LLM backbones that still learn new procedures over time, transferring skills across domains with zero lexical overlap via structural signatures like entity_resolution → temporal_filter → aggregation.


Related papers

Agent MemoryLong-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Agent Memory

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang, Dongyu Ru et al.

2026

AMemGym combines Structured Data Generation, On-Policy Interaction, Evaluation Metrics, and Meta-Evaluation to script user state trajectories, drive LLM-simulated role-play, and score write–read–utilization behavior. On AMemGym’s base configuration, AWE-(2,4,30) reaches a 0.291 normalized memory score on interactive evaluation, while native gpt-4.1-mini only achieves 0.203, exposing substantial gaps between memory agents and plain long-context LLMs.

Agent Memory

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Emmanuel Bamidele

2026

AMV-L manages agent memory using a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to decouple retention from retrieval eligibility. Under a 70k-request long-running workload, AMV-L improves throughput from 9.027 to 36.977 req/s over TTL and reduces p99 latency from 5398.167 ms to 1233.430 ms while matching LRU’s retrieval quality.
