Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Authors: Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, et al.

2025

TL;DR

Zep uses the Graphiti temporal knowledge graph, with episodic, semantic, and community subgraphs, to reach 71.2% on LongMemEval versus 60.2% for the full-context baseline, while cutting response latency by roughly 90%.



THE PROBLEM

LLM agents fail on long conversations despite 115k-token contexts

LongMemEval shows that LLM performance drops even when the full ~115k-token conversation is stuffed into the context window.

This hurts enterprise chat assistants that must track cross-session memory and perform temporal reasoning, where naive full-context RAG becomes slow, costly, and unreliable.

HOW IT WORKS

Zep and Graphiti — temporal graph memory with episodes, entities, and communities

Zep centers on an Episode Subgraph, a Semantic Entity Subgraph, a Community Subgraph, and a Search–Reranker–Constructor pipeline inside the Graphiti temporal knowledge graph.

You can think of Zep like a brain with episodic memory for events, semantic memory for facts, and a card catalog that clusters related concepts into communities.

This design lets Zep maintain bi-temporal edges, invalidate outdated facts, and retrieve compact, time-aware context that plain context windows and static RAG cannot express.
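Bi-temporal bookkeeping like this amounts to two timelines per fact edge: when the fact was true in the world (valid/invalid) and when the system learned or superseded it (created/expired). A minimal Python sketch, with hypothetical field and function names (the paper does not specify Graphiti's internal API):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class FactEdge:
    """A fact between two entities with bi-temporal bookkeeping."""
    source: str
    target: str
    fact: str
    valid_at: datetime                      # when the fact became true in the world
    invalid_at: Optional[datetime] = None   # when it stopped being true (None = still valid)
    created_at: datetime = field(default_factory=datetime.utcnow)  # when the system learned it
    expired_at: Optional[datetime] = None   # when a newer fact superseded it

def invalidate(old: FactEdge, new: FactEdge) -> None:
    """Mark an old fact invalid when a contradicting fact arrives."""
    old.invalid_at = new.valid_at
    old.expired_at = new.created_at

def facts_valid_at(edges: list[FactEdge], t: datetime) -> list[FactEdge]:
    """Return the facts that were true in the world at time t."""
    return [e for e in edges
            if e.valid_at <= t and (e.invalid_at is None or t < e.invalid_at)]
```

Keeping the world timeline separate from the system timeline is what lets the graph answer "what did we believe was true in 2021?" without deleting superseded facts.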

DIAGRAM

Temporal Memory Retrieval Flow in Zep

This diagram shows how Zep processes a query through search, reranking, and construction over Graphiti's temporal knowledge graph.

DIAGRAM

Evaluation Pipeline for Zep on DMR and LongMemEval

This diagram shows how Zep ingests datasets, builds Graphiti, and runs memory benchmarks against full-context baselines.

PROCESS

How Zep Handles a Memory Retrieval Query

  1. Knowledge Graph Construction

    Zep ingests Episodes into the Episode Subgraph, extracts entities and facts into the Semantic Entity Subgraph, and clusters them into the Community Subgraph using Graphiti.

  2. Search

    Given a query, Zep applies cosine semantic similarity, BM25 full-text search, and breadth-first search over Graphiti to collect candidate edges, entities, and communities.

  3. Reranker

    Zep applies Reciprocal Rank Fusion, Maximal Marginal Relevance, and cross-encoder rerankers to prioritize candidates and emphasize frequently mentioned or nearby nodes.

  4. Constructor

    Zep's constructor formats the selected facts, with their t_valid and t_invalid timestamps, plus entity and community summaries into a compact context string for the LLM agent.
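Steps 3 and 4 above can be sketched end to end. Reciprocal Rank Fusion is a standard formula, score(d) = Σᵢ 1/(k + rankᵢ(d)) over the rankers; the constructor below is a hypothetical illustration of a "compact context string", not Zep's actual output format:

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1/(k + rank(d))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def build_context(facts: list[dict],
                  entity_summaries: list[str],
                  community_summaries: list[str]) -> str:
    """Format fused results into a compact, time-annotated context string."""
    lines = ["FACTS (with validity intervals):"]
    lines += [f"- {f['fact']} (valid {f['valid']} .. {f['invalid'] or 'present'})"
              for f in facts]
    lines += ["ENTITIES:"] + [f"- {s}" for s in entity_summaries]
    lines += ["COMMUNITIES:"] + [f"- {s}" for s in community_summaries]
    return "\n".join(lines)
```

With k = 60, a document ranked near the top by several rankers outscores one ranked first by only a single ranker, which is why RRF is a robust way to fuse the semantic, BM25, and graph-search candidate lists.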

KEY CONTRIBUTIONS

Key Contributions

  • Temporally aware Graphiti knowledge graph

    Zep introduces Graphiti with an Episode Subgraph, a Semantic Entity Subgraph, and a Community Subgraph, tracking t_valid, t_invalid, t′_created, and t′_expired for each fact edge.

  • State-of-the-art memory benchmarks

    Zep reaches 94.8% on Deep Memory Retrieval with gpt-4-turbo and 98.2% with gpt-4o-mini, slightly exceeding MemGPT and full-conversation baselines.

  • LongMemEval accuracy and latency gains

    On LongMemEval, Zep with gpt-4o improves accuracy from 60.2% to 71.2% and cuts average latency from 28.9 s to 2.58 s while using only ~1.6k context tokens.

RESULTS

By the Numbers

DMR Score (gpt-4-turbo): 94.8% (+1.4 over MemGPT)

DMR Score (gpt-4o-mini): 98.2% (+0.2 over full conversation)

LongMemEval Score (gpt-4o): 71.2% (+11.0 over full context)

LongMemEval Latency (gpt-4o): 2.58 s (vs 28.9 s full context)

Deep Memory Retrieval tests multi-session chat recall, while LongMemEval stresses long-term interactive memory with ~115k-token conversations. Zep shows that temporal graph memory can beat full-context baselines in both accuracy and latency.
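The reported figures imply roughly an order-of-magnitude latency improvement; a quick sanity check of the headline claims:

```python
# Reported LongMemEval figures for gpt-4o: full-context vs Zep.
full_latency, zep_latency = 28.9, 2.58   # seconds
full_tokens, zep_tokens = 115_000, 1_600  # context tokens

latency_reduction = 1 - zep_latency / full_latency  # fraction of latency removed
speedup = full_latency / zep_latency                # how many times faster
token_reduction = 1 - zep_tokens / full_tokens      # fraction of context removed

print(f"latency reduction: {latency_reduction:.1%}")  # 91.1%
print(f"speedup: {speedup:.1f}x")                     # 11.2x
print(f"token reduction: {token_reduction:.1%}")      # 98.6%
```

This is where the "~90% latency cut" in the TL;DR comes from: 2.58 s is about 8.9% of 28.9 s, on roughly 1.4% of the tokens.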

BENCHMARK

Deep Memory Retrieval and LongMemEval Performance

Accuracy on Deep Memory Retrieval and LongMemEval comparing Zep to key baselines.

KEY INSIGHT

The Counterintuitive Finding

On LongMemEval, Zep with gpt-4o uses only about 1.6k context tokens yet scores 71.2%, versus 60.2% for the 115k-token full-context baseline.

This is surprising because many assume feeding the entire conversation to an LLM is always best, but Zep shows structured temporal memory can be both shorter and more accurate.

WHY IT MATTERS

What this unlocks for the field

Zep unlocks scalable, temporally aware memory where agents can reason over years of conversations and updates without ballooning context windows.

Builders can now deploy chat systems that maintain cross session preferences and temporal facts with low latency, instead of relying on brittle summaries or expensive full history prompts.
