ATANT v1.1: Positioning Continuity Evaluation Against Memory, Long-Context, and Agentic-Memory Benchmarks

Author: Samuel Sameer Tanguturi

2026

TL;DR

ATANT v1.1 applies a 7-property continuity rubric and a coverage matrix to seven existing memory benchmarks, finding a mean property coverage of only 0.43 / 7, even though the Kenotic v1.0 reference implementation reaches 96% on ATANT v1.0's own cumulative scale.


THE PROBLEM

Continuity claims built on non-continuity benchmarks with mean 0.43 / 7 coverage

ATANT v1.1 shows that across seven existing evaluations the mean continuity property coverage is only 0.43 / 7, with a median of 1.0 / 7.

Systems are being called continuous based on benchmarks like LOCOMO and LongMemEval, so downstream agents inherit brittle memory behavior and undetected regressions in persistence, update handling, and reconstruction.

HOW IT WORKS

ATANT v1.1 — structural continuity positioning

ATANT v1.1 uses the 7 v1.0 continuity properties, the 10 checkpoints, the property coverage matrix, and the Kenotic v1.0 reference implementation to compare seven benchmarks cell by cell.

You can think of ATANT v1.1 as a compatibility matrix between continuity and existing tests, like a hardware spec sheet mapping which ports each device actually exposes.

This lets ATANT v1.1 separate continuity from long-context or retrieval evaluation, enabling tests that detect regressions and architectural changes that a plain context window or substring scoring cannot see.
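The coverage matrix can be sketched in a few lines. The per-cell marks and the ⊚ = 0.5 weighting below are illustrative assumptions for this explainer, not the actual Table 1 values or the paper's exact scoring rule.

```python
# Sketch of an ATANT-style property coverage matrix.
# Assumed scoring convention: "✓" = 1.0 (covered), "⊚" = 0.5 (partial),
# "×" = 0.0 (not covered). Per-cell marks are placeholders.

MARK_WEIGHT = {"✓": 1.0, "⊚": 0.5, "×": 0.0}

# Rows: benchmarks; columns: the 7 continuity properties (P1..P7).
coverage = {
    "LOCOMO":      ["⊚", "×", "×", "×", "×", "×", "×"],
    "LongMemEval": ["✓", "×", "×", "×", "×", "×", "×"],
}

def benchmark_score(marks):
    """Properties covered out of 7, weighting partial coverage at 0.5."""
    return sum(MARK_WEIGHT[m] for m in marks)

for name, marks in coverage.items():
    print(f"{name}: {benchmark_score(marks)} / 7")
```

Summary statistics like the reported mean and median then fall out of the per-benchmark scores directly.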

DIAGRAM

Continuity property taxonomy versus existing benchmarks

This diagram shows how ATANT v1.1 maps each continuity property against LOCOMO, LongMemEval, BEAM, MemoryBench, Zep eval, MemGPT Letta, and RULER.

DIAGRAM

Evaluation pipeline for positioning ATANT v1.1

This diagram shows how ATANT v1.1 evaluates benchmarks, computes coverage scores, and calibrates against LOCOMO and the ATANT cumulative scale.

PROCESS

How ATANT v1.1 handles continuity evaluation positioning

  1. Structural Analysis of Existing Memory Evaluations

    ATANT v1.1 inspects LOCOMO, LongMemEval, BEAM, MemoryBench, Zep eval, MemGPT Letta, and RULER against the 7 v1.0 continuity properties and 10 checkpoints.

  2. Property Coverage Matrix

    ATANT v1.1 builds a 7 × 7 property coverage matrix, assigning ✓, ⊚, or × per cell and computing per-benchmark scores (median 1.0 / 7, mean 0.43 / 7).

  3. Kenotic on LOCOMO Result

    ATANT v1.1 runs the Kenotic v1.0 reference implementation on LOCOMO, obtaining 8.8% substring accuracy on 476 items from 3 conversations.

  4. Reporting Standard Recommendation

    ATANT v1.1 recommends reporting ATANT compliance levels (e.g., Cumulative scale: 96%) alongside retrieval and long-context benchmarks for any continuity claim.
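The substring metric in step 3 can be illustrated with a minimal sketch. This is an assumed reading of LOCOMO-style substring scoring (case-insensitive containment), not its exact harness, but it shows why substring accuracy and continuity scoring can diverge so sharply: a semantically correct reconstruction that rephrases the answer scores zero.

```python
# Minimal sketch of substring-accuracy scoring, as assumed here:
# an item counts as correct when the gold answer appears verbatim
# (case-insensitive) inside the model's response.

def substring_correct(response: str, gold: str) -> bool:
    return gold.lower() in response.lower()

def substring_accuracy(pairs):
    hits = sum(substring_correct(r, g) for r, g in pairs)
    return hits / len(pairs)

# A reconstructed answer can be semantically right yet still miss:
pairs = [
    ("She moved to Lisbon in the spring.", "Lisbon"),    # hit
    ("She relocated to Portugal's capital.", "Lisbon"),  # miss
]
print(substring_accuracy(pairs))  # 0.5
```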

KEY CONTRIBUTIONS

Key Contributions

  • Structural critique of seven memory evaluations

    ATANT v1.1 analyzes LOCOMO, LongMemEval, BEAM, MemoryBench, Zep eval, MemGPT Letta, and RULER using the 7 v1.0 continuity properties and 10 checkpoints, finding no benchmark above 2 / 7 coverage.

  • Property coverage matrix

    ATANT v1.1 introduces a 7 × 7 property coverage matrix with ✓, ⊚, × scoring, yielding a median of 1.0 / 7 and a mean of 0.43 / 7 across existing evaluations.

  • Kenotic on LOCOMO calibration pair

    ATANT v1.1 reports the Kenotic v1.0 reference implementation scoring 8.8% on LOCOMO versus 96% on the ATANT cumulative scale, an 87.2-point divergence demonstrating that the two measure different properties.

RESULTS

By the Numbers

  • ATANT cumulative scale: 96% (1,761/1,835), +87.2 points over LOCOMO substring accuracy (8.8%)

  • LOCOMO substring score: 8.8%, versus 96% on the ATANT cumulative scale

  • Median coverage score: 1.0/7, the benchmark median across seven evaluations

  • Mean coverage score: 0.43/7, average properties covered per benchmark

ATANT v1.1 evaluates the Kenotic v1.0 reference implementation on both the ATANT cumulative scale and LOCOMO, and computes property coverage scores across seven benchmarks, showing that continuity requires ATANT v1.0-style evaluation rather than retrieval or long-context metrics.
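A quick arithmetic check confirms the headline numbers above are internally consistent:

```python
# Sanity-checking the reported figures.
atant = 1761 / 1835   # ATANT cumulative scale (items passed / total)
locomo = 0.088        # LOCOMO substring accuracy

print(round(atant * 100, 1))                  # 96.0 (matches "96%")
print(round(atant * 100 - locomo * 100, 1))   # 87.2 (matches "+87.2 points")
```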

BENCHMARK

Property coverage of existing memory evaluations vs. the 7 v1.0 continuity properties

Score column from Table 1 (properties covered out of 7).

KEY INSIGHT

The Counterintuitive Finding

ATANT v1.1 shows the Kenotic v1.0 reference implementation scoring 96% on ATANT cumulative scale yet only 8.8% on LOCOMO substring accuracy.

This is counterintuitive because many practitioners assumed LOCOMO measured continuity, but the 87.2-point divergence reveals that LOCOMO and ATANT v1.0 target fundamentally different properties.

WHY IT MATTERS

What this unlocks for the field

ATANT v1.1 gives researchers a concrete way to see which continuity properties are actually measured when they cite LOCOMO, LongMemEval, BEAM, MemoryBench, Zep eval, MemGPT Letta, or RULER.

Builders can now separate continuity claims from retrieval or long context claims, designing architectures and benchmarks that explicitly target missing properties like update handling, reconstruction, and model independence.


Related papers

Agent MemoryLong-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Agent Memory

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang, Dongyu Ru et al.

· 2026

AMemGym combines Structured Data Generation, On-Policy Interaction, Evaluation Metrics, and Meta-Evaluation to script user state trajectories, drive LLM-simulated role-play, and score write–read–utilization behavior. On AMemGym’s base configuration, AWE-(2,4,30) reaches a 0.291 normalized memory score on interactive evaluation, while native gpt-4.1-mini only achieves 0.203, exposing substantial gaps between memory agents and plain long-context LLMs.

Agent Memory

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Emmanuel Bamidele

· 2026

AMV-L manages agent memory using a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to decouple retention from retrieval eligibility. Under a 70k-request long-running workload, AMV-L improves throughput from 9.027 to 36.977 req/s over TTL and reduces p99 latency from 5398.167 ms to 1233.430 ms while matching LRU’s retrieval quality.
