ATANT v1.1: Positioning Continuity Evaluation Against Memory, Long-Context, and Agentic-Memory Benchmarks

Author: Samuel Sameer Tanguturi

2026

TL;DR

ATANT v1.1 applies a 7-property continuity rubric and a coverage matrix to seven existing memory benchmarks, finding a mean property coverage of only 0.43 / 7, even though the Kenotic v1.0 reference implementation reaches 96% on ATANT v1.0's own cumulative scale.


THE PROBLEM

Continuity claims built on non-continuity benchmarks with mean 0.43 / 7 coverage

ATANT v1.1 shows that across seven existing evaluations the mean continuity property coverage is only 0.43 / 7, with a median of 1.0 / 7.

Systems are being called continuous based on benchmarks like LOCOMO and LongMemEval, so downstream agents inherit brittle memory behavior and undetected regressions in persistence, update handling, and reconstruction.

HOW IT WORKS

ATANT v1.1 — structural continuity positioning

ATANT v1.1 uses the 7 v1.0 continuity properties, the 10 checkpoints, the property coverage matrix, and the Kenotic v1.0 reference implementation to compare seven benchmarks cell by cell.

You can think of ATANT v1.1 as a compatibility matrix between continuity and existing tests, like a hardware spec sheet mapping which ports each device actually exposes.

This lets ATANT v1.1 separate continuity from long-context or retrieval evaluation, enabling tests that detect regressions and architectural changes that a plain context window or substring scoring cannot see.
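The coverage matrix can be sketched in a few lines. The per-cell marks and the ⊚ = 0.5 weighting below are illustrative assumptions for this explainer, not the actual Table 1 values or the paper's exact scoring rule.

```python
# Sketch of an ATANT-style property coverage matrix.
# Assumed scoring convention: "✓" = 1.0 (covered), "⊚" = 0.5 (partial),
# "×" = 0.0 (not covered). Per-cell marks are placeholders.

MARK_WEIGHT = {"✓": 1.0, "⊚": 0.5, "×": 0.0}

# Rows: benchmarks; columns: the 7 continuity properties (P1..P7).
coverage = {
    "LOCOMO":      ["⊚", "×", "×", "×", "×", "×", "×"],
    "LongMemEval": ["✓", "×", "×", "×", "×", "×", "×"],
}

def benchmark_score(marks):
    """Properties covered out of 7, weighting partial coverage at 0.5."""
    return sum(MARK_WEIGHT[m] for m in marks)

for name, marks in coverage.items():
    print(f"{name}: {benchmark_score(marks)} / 7")
```

Summary statistics like the reported mean and median then fall out of the per-benchmark scores directly.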

DIAGRAM

Continuity property taxonomy versus existing benchmarks

This diagram shows how ATANT v1.1 maps each continuity property against LOCOMO, LongMemEval, BEAM, MemoryBench, Zep eval, MemGPT Letta, and RULER.

DIAGRAM

Evaluation pipeline for positioning ATANT v1.1

This diagram shows how ATANT v1.1 evaluates benchmarks, computes coverage scores, and calibrates against LOCOMO and the ATANT cumulative scale.

PROCESS

How ATANT v1.1 handles continuity evaluation positioning

  1. Structural Analysis of Existing Memory Evaluations

    ATANT v1.1 inspects LOCOMO, LongMemEval, BEAM, MemoryBench, Zep eval, MemGPT Letta, and RULER against the 7 v1.0 continuity properties and 10 checkpoints.

  2. Property Coverage Matrix

    ATANT v1.1 builds a 7 × 7 property coverage matrix, assigning ✓, ⊚, or × per cell and computing per-benchmark scores (median 1.0 / 7, mean 0.43 / 7).

  3. Kenotic on LOCOMO Result

    ATANT v1.1 runs the Kenotic v1.0 reference implementation on LOCOMO, obtaining 8.8% substring accuracy on 476 items from 3 conversations.

  4. Reporting Standard Recommendation

    ATANT v1.1 recommends reporting ATANT compliance levels (e.g., Cumulative scale: 96%) alongside retrieval and long-context benchmarks for any continuity claim.
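The substring metric in step 3 can be illustrated with a minimal sketch. This is an assumed reading of LOCOMO-style substring scoring (case-insensitive containment), not its exact harness, but it shows why substring accuracy and continuity scoring can diverge so sharply: a semantically correct reconstruction that rephrases the answer scores zero.

```python
# Minimal sketch of substring-accuracy scoring, as assumed here:
# an item counts as correct when the gold answer appears verbatim
# (case-insensitive) inside the model's response.

def substring_correct(response: str, gold: str) -> bool:
    return gold.lower() in response.lower()

def substring_accuracy(pairs):
    hits = sum(substring_correct(r, g) for r, g in pairs)
    return hits / len(pairs)

# A reconstructed answer can be semantically right yet still miss:
pairs = [
    ("She moved to Lisbon in the spring.", "Lisbon"),    # hit
    ("She relocated to Portugal's capital.", "Lisbon"),  # miss
]
print(substring_accuracy(pairs))  # 0.5
```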

KEY CONTRIBUTIONS

Key Contributions

  • Structural critique of seven memory evaluations

    ATANT v1.1 analyzes LOCOMO, LongMemEval, BEAM, MemoryBench, Zep eval, MemGPT Letta, and RULER using the 7 v1.0 continuity properties and 10 checkpoints, finding no benchmark above 2 / 7 coverage.

  • Property coverage matrix

    ATANT v1.1 introduces a 7 × 7 property coverage matrix with ✓, ⊚, × scoring, yielding a median of 1.0 / 7 and a mean of 0.43 / 7 across existing evaluations.

  • Kenotic on LOCOMO calibration pair

    ATANT v1.1 reports the Kenotic v1.0 reference implementation scoring 8.8% on LOCOMO versus 96% on the ATANT cumulative scale, an 87.2-point divergence demonstrating that the two measure different properties.

RESULTS

By the Numbers

  • ATANT cumulative scale: 96% (1,761/1,835), +87.2 points over LOCOMO substring accuracy (8.8%)

  • LOCOMO substring score: 8.8%, versus 96% on the ATANT cumulative scale

  • Median coverage score: 1.0/7, the benchmark median across seven evaluations

  • Mean coverage score: 0.43/7, average properties covered per benchmark

ATANT v1.1 evaluates the Kenotic v1.0 reference implementation on both the ATANT cumulative scale and LOCOMO, and computes property coverage scores across seven benchmarks, showing that continuity requires ATANT v1.0-style evaluation rather than retrieval or long-context metrics.
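A quick arithmetic check confirms the headline numbers above are internally consistent:

```python
# Sanity-checking the reported figures.
atant = 1761 / 1835   # ATANT cumulative scale (items passed / total)
locomo = 0.088        # LOCOMO substring accuracy

print(round(atant * 100, 1))                  # 96.0 (matches "96%")
print(round(atant * 100 - locomo * 100, 1))   # 87.2 (matches "+87.2 points")
```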

BENCHMARK

Property coverage of existing memory evaluations vs. the 7 v1.0 continuity properties

Score column from Table 1 (properties covered out of 7).

KEY INSIGHT

The Counterintuitive Finding

ATANT v1.1 shows the Kenotic v1.0 reference implementation scoring 96% on ATANT cumulative scale yet only 8.8% on LOCOMO substring accuracy.

This is counterintuitive because many practitioners assumed LOCOMO measured continuity, but the 87.2-point divergence reveals that LOCOMO and ATANT v1.0 target fundamentally different properties.

WHY IT MATTERS

What this unlocks for the field

ATANT v1.1 gives researchers a concrete way to see which continuity properties are actually measured when they cite LOCOMO, LongMemEval, BEAM, MemoryBench, Zep eval, MemGPT Letta, or RULER.

Builders can now separate continuity claims from retrieval or long context claims, designing architectures and benchmarks that explicitly target missing properties like update handling, reconstruction, and model independence.


Related papers

Agent MemoryLong-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Agent Memory

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang, Dongyu Ru et al.

· 2026

AMemGym combines Structured Data Generation, On-Policy Interaction, Evaluation Metrics, and Meta-Evaluation to script user state trajectories, drive LLM-simulated role-play, and score write–read–utilization behavior. On AMemGym’s base configuration, AWE-(2,4,30) reaches a 0.291 normalized memory score on interactive evaluation, while native gpt-4.1-mini only achieves 0.203, exposing substantial gaps between memory agents and plain long-context LLMs.

Agent Memory

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Emmanuel Bamidele

· 2026

AMV-L manages agent memory using a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to decouple retention from retrieval eligibility. Under a 70k-request long-running workload, AMV-L improves throughput from 9.027 to 36.977 req/s over TTL and reduces p99 latency from 5398.167 ms to 1233.430 ms while matching LRU’s retrieval quality.
