MemFactory: Unified Inference & Training Framework for Agent Memory

Authors: Ziliang Guo, Ziheng Li, Bo Tang et al.

2026

TL;DR

MemFactory uses a modular Module–Agent–Environment–Trainer stack with built-in GRPO to improve MemAgent-style recurrent memory by up to a 14.8% relative gain in average score.



THE PROBLEM

Memory RL research is fragmented and hard to reproduce

MemFactory targets fragmented Memory RL implementations that are highly customized, task-specific, and scattered across isolated repositories, making reproduction and extension difficult.

This fragmentation means systems like Memory-R1, MemAgent, and RMM cannot easily swap modules, slowing progress on long-term memory, retrieval, and policy optimization research.

HOW IT WORKS

MemFactory framework — modular layers plus GRPO

MemFactory’s core mechanism is a four-layer stack: Module Layer, Agent Layer, Environment Layer, and Trainer Layer, with a RecurrentMemoryModule for end-to-end memory.

You can think of MemFactory like Lego blocks plus an RL engine: modules are bricks, the Agent Layer is the assembly, and GRPO is the tuning knob.

This layered design lets MemFactory learn extraction, updating, retrieval, and recurrent memory policies that a plain context window or static RAG pipeline cannot express.
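To make the module abstraction concrete, here is a minimal sketch of what the Module Layer could look like. The class and method names (Extractor-style modules, RecurrentMemoryModule, and the generate/rollout/inference interfaces) come from the article's description, but the signatures and the toy memory-truncation logic are illustrative assumptions, not MemFactory's actual API.

```python
# Hypothetical sketch of MemFactory's Module Layer; signatures are assumed.
from abc import ABC, abstractmethod


class MemoryModule(ABC):
    """Base brick: every module exposes the same three entry points."""

    @abstractmethod
    def generate(self, prompt: str) -> str: ...

    @abstractmethod
    def rollout(self, state: dict) -> list: ...

    @abstractmethod
    def inference(self, query: str, memory: str) -> str: ...


class RecurrentMemoryModule(MemoryModule):
    """MemAgent-style module: fold each text chunk into a bounded memory."""

    def __init__(self, max_memory_chars: int = 1024):
        self.max_memory_chars = max_memory_chars

    def generate(self, prompt: str) -> str:
        # Placeholder for an LLM call that rewrites memory given a new chunk;
        # here we just keep the most recent characters.
        return prompt[-self.max_memory_chars:]

    def rollout(self, state: dict) -> list:
        # Process chunks sequentially, carrying the memory forward each step.
        memory = ""
        trajectory = []
        for chunk in state["chunks"]:
            memory = self.generate(memory + "\n" + chunk)
            trajectory.append(memory)
        return trajectory

    def inference(self, query: str, memory: str) -> str:
        # Placeholder: answer the query conditioned on the final memory.
        return f"answer({query!r}) using {len(memory)} chars of memory"
```

The point of the shared base class is the Lego-block property described above: an agent assembled against `MemoryModule` can swap in any extractor, updater, retriever, or recurrent module without changing the rollout loop.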

DIAGRAM

Agent rollout and environment interaction flow

This diagram shows how MemFactory agents roll out trajectories, interact with the Environment Layer, and receive GRPO rewards during training.

DIAGRAM

MemFactory training and evaluation pipeline

This diagram shows how MemFactory loads MemAgent data, trains with GRPO, and evaluates on eval_50, eval_100, and eval_fwe_16384.

PROCESS

How MemFactory Handles a Memory-Augmented Training Session

  1. Module Layer

    MemFactory configures Extractor, Updater, Retriever, and RecurrentMemoryModule implementations, exposing generate, rollout, and inference interfaces for memory operations.

  2. Agent Layer

    MemFactory assembles modules into an agent that executes policies, performs rollouts, and loads pre-trained Qwen3 checkpoints with FlashAttention-2.

  3. Environment Layer

    MemFactory converts MemAgent datasets into standardized states, maintains MemoryBankEnv or LongcontextEnv, and computes Format and LLM-as-a-Judge rewards.

  4. Trainer Layer

    MemFactory runs GRPO, sampling grouped trajectories, computing advantages without a critic, and updating the agent policy while logging metrics via SwanLab.
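The Trainer Layer's "advantages without a critic" step can be sketched with the standard GRPO formulation: sample a group of trajectories for the same prompt, then normalize each trajectory's reward against the group mean and standard deviation. This is the generic GRPO recipe, not MemFactory-specific code; reward weighting and SwanLab logging are omitted.

```python
# Minimal sketch of GRPO's critic-free, group-relative advantage computation.
from statistics import mean, stdev


def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each trajectory relative to its sampling group."""
    mu = mean(group_rewards)
    # Sample std of the group; degenerate single-trajectory groups get 0.
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]


# Example: four rollouts of the same prompt with mixed rewards.
# Above-average trajectories get positive advantage, below-average negative.
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the baseline is the group mean rather than a learned value function, no critic network needs to be trained or stored, which is what keeps the trainer single-GPU friendly.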

KEY CONTRIBUTIONS

Key Contributions

  • Unified Memory RL Infrastructure

    MemFactory standardizes MemoryBankEnv, LongcontextEnv, and RecurrentMemoryModule within a four-layer stack, unifying training, evaluation, and inference for memory-augmented agents.

  • Highly Modular and Extensible Design

    MemFactory exposes Extractor, Updater, Retriever, and Agent Module classes with generate, rollout, and inference interfaces, enabling Lego-like assembly of agents in the style of Memory-R1, MemAgent, and RMM.

  • Empirical Validation on MemAgent

    MemFactory trains MemAgent-style agents on Qwen3-1.7B and Qwen3-4B-Instruct, achieving up to a 14.8% relative improvement in average score over base checkpoints.

RESULTS

By the Numbers

eval_50

0.5684 score

+0.0957 over Qwen3-1.7B Base checkpoint

eval_100

0.4863 score

+0.0566 over Qwen3-1.7B Base checkpoint

eval_fwe_16384

0.6426 score

+0.0156 over Qwen3-4B-Instruct Base checkpoint

Average

0.3581 score

+0.0463 over Qwen3-1.7B Base checkpoint

On MemAgent eval_50, eval_100, and eval_fwe_16384, MemFactory improves both Qwen3-1.7B and Qwen3-4B-Instruct, showing that MemFactory effectively optimizes recurrent memory policies via GRPO.
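The headline 14.8% figure follows directly from the numbers above: a trained average of 0.3581 with a +0.0463 gain implies a base average of 0.3118, and the gain relative to that base works out to roughly 14.8%.

```python
# Arithmetic check of the headline number from the results table.
trained_avg = 0.3581   # MemFactory average score (Qwen3-1.7B)
gain = 0.0463          # absolute gain over the base checkpoint
base_avg = trained_avg - gain        # 0.3118
relative = gain / base_avg           # ≈ 0.148
print(f"{relative:.1%}")             # → 14.8%
```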

BENCHMARK

Performance of MemoryAgent trained via MemFactory on three test sets

Average score (avg@4) on MemAgent eval_50, eval_100, and eval_fwe_16384 for Qwen3 base checkpoints versus MemFactory RL.

KEY INSIGHT

The Counterintuitive Finding

MemFactory slightly reduces Qwen3-1.7B performance on the out-of-distribution (OOD) eval_fwe_16384 set, from 0.0332 to 0.0195, despite large gains on the main tasks.

This is surprising because RL-tuned recurrent memory policies might be expected to generalize better; instead, MemFactory shows that stronger in-distribution optimization can hurt OOD robustness.

WHY IT MATTERS

What this unlocks for the field

MemFactory unlocks a reusable GRPO-based stack where modules in the style of Memory-R1, MemAgent, and RMM can be mixed, matched, and trained without bespoke pipelines.

Builders can now prototype new memory extractors, updaters, retrievers, or recurrent modules and benchmark them end-to-end on MemAgent-style datasets using a single, GPU-friendly framework.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Agent Memory

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang, Dongyu Ru et al.

2026

AMemGym combines Structured Data Generation, On-Policy Interaction, Evaluation Metrics, and Meta-Evaluation to script user state trajectories, drive LLM-simulated role-play, and score write–read–utilization behavior. On AMemGym’s base configuration, AWE-(2,4,30) reaches a 0.291 normalized memory score on interactive evaluation, while native gpt-4.1-mini only achieves 0.203, exposing substantial gaps between memory agents and plain long-context LLMs.

Agent Memory

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Emmanuel Bamidele

2026

AMV-L manages agent memory using a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to decouple retention from retrieval eligibility. Under a 70k-request long-running workload, AMV-L improves throughput from 9.027 to 36.977 req/s over TTL and reduces p99 latency from 5398.167 ms to 1233.430 ms while matching LRU’s retrieval quality.
