TeleMem: Building Long-Term and Multimodal Memory for Agentic AI

Authors: Chunliang Chen, Ming Guan, Xiao Lin, et al.

2025

TL;DR

TeleMem uses a threaded memory DAG with closure-based retrieval and a batched consolidation pipeline to reach 86.33% QA accuracy on ZH-4O (+16.13 points over RAG).



THE PROBLEM

Long-horizon agents forget and fragment context, leaving flat RAG at 62.45% accuracy

TeleMem targets long-term interactive settings where flat RAG only reaches 62.45% QA Accuracy on ZH-4O, far below structured memory systems.

When dialogue histories grow to hundreds of turns, LLM agents lose causal structure, causing unstable long-horizon reasoning and degraded personalization.

HOW IT WORKS

TeleMem — Threaded Memory DAG with Closure-based Retrieval

TeleMem centers on a representation layer, graph layer, and memory reading mechanism that together maintain a structured, evolvable memory graph with Insert and ReInsert operators.

You can think of TeleMem like RAM plus a versioned filesystem, where recent dialogue is cached, then periodically consolidated into a threaded history of semantic snapshots.

By organizing memory as a minimal causal DAG and using closure-based retrieval, TeleMem restores prerequisite context that a plain context window or flat Top K retrieval would miss.
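As a sketch, closure-based retrieval can be viewed as a reverse reachability pass over the DAG followed by a chronological sort. The `parents` and `timestamps` structures below are illustrative stand-ins, not TeleMem's actual data model:

```python
from collections import deque

def closure_retrieve(parents, timestamps, seeds):
    """Expand seed nodes to their full ancestor closure, then
    linearize chronologically for the LLM prompt.

    parents: dict mapping node -> list of parent (prerequisite) nodes
    timestamps: dict mapping node -> effective timestamp
    seeds: nodes returned by an initial similarity search
    """
    closed, frontier = set(seeds), deque(seeds)
    while frontier:
        node = frontier.popleft()
        for parent in parents.get(node, []):
            if parent not in closed:
                closed.add(parent)
                frontier.append(parent)
    # Ancestors come before descendants once sorted by timestamp
    return sorted(closed, key=lambda n: timestamps[n])

# Toy memory DAG: C depends on B, which depends on A
parents = {"C": ["B"], "B": ["A"], "A": []}
ts = {"A": 1, "B": 2, "C": 3}
print(closure_retrieve(parents, ts, ["C"]))  # -> ['A', 'B', 'C']
```

Note how a flat top-K retriever that only surfaced node C would drop A and B, exactly the prerequisite context the closure restores.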

DIAGRAM

Closure-based Retrieval over the Memory DAG

This diagram shows how TeleMem performs closure-based retrieval by expanding from seed nodes to ancestors and linearizing them for LLM reasoning.

DIAGRAM

ReAct-style Multimodal Memory Reading Loop

This diagram shows how TeleMem's ReAct-style multimodal agent iteratively calls video tools and updates history during memory reading.
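The observe-think-act loop can be sketched as below. The `llm` and `tools` interfaces are hypothetical stand-ins for whatever model and video tools a deployment wires in; the paper's actual agent interface may differ:

```python
def react_memory_read(question, memory_context, llm, tools, max_steps=5):
    """Minimal ReAct-style loop over retrieved memory.

    llm(prompt) -> dict with 'thought' and either a final 'answer'
    or an 'action' name with 'args' (hypothetical contract).
    tools: dict of callables, e.g. video frame or caption tools.
    """
    history = [f"Context:\n{memory_context}", f"Question: {question}"]
    for _ in range(max_steps):
        step = llm("\n".join(history))
        history.append(f"Thought: {step['thought']}")
        if "answer" in step:
            return step["answer"]  # the agent decided it has enough evidence
        # Act: call the chosen tool, then fold the observation into history
        observation = tools[step["action"]](**step.get("args", {}))
        history.append(f"Action: {step['action']} -> Observation: {observation}")
    return None  # budget exhausted without an answer
```

Each iteration appends the tool observation to the running history, so later "think" steps condition on everything observed so far.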

PROCESS

How TeleMem Handles a Long-Horizon Dialogue Session

  1. Offline Batch Updates

    TeleMem processes historical dialogue turns with summarization, retrieval alignment, global semantic clustering, and LLM-based consolidation to build high-quality nodes for the memory graph.

  2. Online Incremental Updates

    TeleMem applies summarization, retrieval alignment, and LLM-based decision per new turn, then uses Insert and ReInsert to maintain the DAG under temporal and pruning constraints.

  3. Memory Graph Updating

    TeleMem enforces the minimal causal skeleton by pruning redundant edges and maintaining memory threads as root to node paths ordered by effective timestamps.

  4. Memory Reading

    TeleMem retrieves minimal closed subgraphs via closure-based retrieval, linearizes them chronologically, and feeds them to a ReAct-style multimodal agent for observe think act reasoning.
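Step 3's edge pruning resembles transitive reduction: a direct edge is redundant when the same prerequisite is already reachable through another parent. A minimal sketch, assuming the whole graph fits in memory and `parents` maps each node to its parent set (an illustrative structure, not TeleMem's storage format):

```python
def prune_redundant_edges(parents):
    """Keep only the minimal causal skeleton: drop edge node<-p
    when p is already an ancestor of some other parent of node."""
    def ancestors(node):
        seen, stack = set(), list(parents.get(node, ()))
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(parents.get(n, ()))
        return seen

    reduced = {}
    for node, ps in parents.items():
        keep = set(ps)
        for p in ps:
            # p is redundant if another parent already reaches it
            if any(p in ancestors(q) for q in ps if q != p):
                keep.discard(p)
        reduced[node] = keep
    return reduced

# A -> B -> C plus a redundant shortcut edge A -> C:
# pruning keeps only C<-B and B<-A
graph = {"C": {"B", "A"}, "B": {"A"}, "A": set()}
reduced = prune_redundant_edges(graph)
```

After pruning, every remaining root-to-node path is a memory thread with no skippable hops, which keeps the closures retrieved in step 4 small.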

KEY CONTRIBUTIONS

Key Contributions

  • Unified long-term and multimodal memory framework

    TeleMem uses the representation layer and memory reading mechanism to maintain coherent narrative-driven user profiles and multimodal event states across text and video streams.

  • Structured memory graph with closure-based retrieval

    TeleMem organizes memory as a threaded directed acyclic graph with Insert and ReInsert, enabling dependency-aware closure-based retrieval for stable long-horizon reasoning.

  • Efficient memory writing pipeline

    TeleMem's offline batch updates perform summarization, retrieval alignment, global clustering, and LLM-based consolidation, reducing token usage by 43% and achieving a 2.1× speedup over Mem0.
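The savings come from batching: grouping turns by semantic cluster and making one consolidation LLM call per cluster rather than one per turn. A minimal sketch where `embed`, `assign_cluster`, and `llm_consolidate` are all hypothetical stand-ins:

```python
def batched_write(turns, embed, assign_cluster, llm_consolidate):
    """Offline batch write sketch: cluster turns semantically, then
    consolidate each cluster with a single LLM call, so the number
    of calls scales with clusters rather than turns."""
    clusters = {}
    for turn in turns:
        clusters.setdefault(assign_cluster(embed(turn)), []).append(turn)
    # len(clusters) consolidation calls instead of len(turns)
    return [llm_consolidate(group) for group in clusters.values()]
```

With hundreds of turns collapsing into a handful of clusters, the per-turn LLM overhead of a naive writer is amortized away, which is the shape of the reported token and latency savings.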

RESULTS

By the Numbers

QA Accuracy (%) on ZH-4O:

  • TeleMem: 86.33% (+16.13 points over RAG)
  • Long-context LLM with full history: 84.92%
  • Mem0 baseline: 70.20%
  • Flat RAG without temporal or causal structure: 62.45%

On the ZH-4O benchmark, which probes memory recall in 28 long Chinese role playing sessions, TeleMem reaches 86.33% QA Accuracy, showing that threaded closure-based memory beats both Mem0 and long context baselines.


BENCHMARK

Performance comparison on the ZH-4O benchmark

QA Accuracy (%) across 1,068 probing questions on ZH-4O.

KEY INSIGHT

The Counterintuitive Finding

TeleMem reduces token usage by 43% and achieves a 2.1× speedup while still improving ZH-4O accuracy from 70.20% for Mem0 to 86.33%.

This is surprising because many developers assume better long-term memory requires feeding more tokens, but TeleMem shows smarter structuring and consolidation can yield both efficiency and accuracy.

WHY IT MATTERS

What this unlocks for the field

TeleMem unlocks dependency-aware long-term memory where agents can reconstruct coherent causal context instead of stitching isolated Top K fragments.

Builders can now deploy agentic systems that maintain evolving user profiles and multimodal memories over hundreds of turns without exploding context windows or sacrificing latency.


Related papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa

· 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

RAG · Long-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang

· 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
