Understanding the Impact of Long-Term Memory on Self-Disclosure with Large Language Model-Driven Chatbots for Public Health Intervention

Authors: Eunkyung Jo, Yuin Jeong, SoHyun Park et al.

2024

TL;DR

CareCall uses a long-term memory layer over HyperCLOVA to recall users’ health histories, increasing detailed health disclosure and positive reactions across 1,252 real check-up calls.


THE PROBLEM

LLM health chatbots forget users’ histories and flatten self-disclosure

CareCall’s deployment showed that public health monitoring needs recurrent data collection, yet LLM chatbots “rarely preserve the knowledge gained about individuals across repeated interactions.”

Without long-term memory, CareCall could not follow up on prior health issues, leading to repetitive generic questions, reduced engagement, and shallow self-disclosure over time.

HOW IT WORKS

CareCall with Long-Term Memory for public health check-up calls

CareCall adds a memory management layer, LLM summarizer, and memory-augmented input around HyperCLOVA, storing per-session summaries for Health, Meals, Sleep, Visited Places, and Pets.

You can think of CareCall like a person keeping a health diary: each call is compressed into notes, and future conversations start by rereading those notes instead of relying on short-term memory.

This long-term memory lets CareCall ask targeted follow-up questions such as “How is your leg pain now?” which a plain context-window-only HyperCLOVA session cannot do across weekly calls.
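As a rough sketch of how such a memory-augmented input could be assembled, the snippet below prepends stored per-topic summaries to the current dialogue before it reaches the LLM. The function name, field layout, and prompt wording are illustrative assumptions, not CareCall's actual implementation.

```python
# Hypothetical sketch of memory-augmented prompt construction; names and
# prompt format are assumptions based on the paper's description.

LTM_TOPICS = ["Health", "Meals", "Sleep", "Visited Places", "Pets"]

def build_memory_augmented_input(memory_store: dict, dialogue_history: list) -> str:
    """Prepend stored per-topic summaries to the current dialogue history."""
    memory_lines = [
        f"[{topic}] {memory_store[topic]}"
        for topic in LTM_TOPICS
        if topic in memory_store  # only inject topics that have stored summaries
    ]
    sections = []
    if memory_lines:
        sections.append(
            "Known about this user from previous calls:\n" + "\n".join(memory_lines)
        )
    sections.append("Current conversation:\n" + "\n".join(dialogue_history))
    return "\n\n".join(sections)

prompt = build_memory_augmented_input(
    {"Health": "Reported leg pain two weeks ago."},
    ["AI: Hello! How have you been?"],
)
```

With a stored Health summary like the one above, the generated prompt carries the prior leg-pain note into the new session, which is what enables a follow-up such as "How is your leg pain now?"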

DIAGRAM

Turn-by-turn interaction with CareCall and its long-term memory

This diagram shows how a user, CareCall, and the long-term memory layer interact over two weekly check-up calls.

DIAGRAM

CareCall evaluation pipeline across cities and LTM conditions

This diagram shows how the authors sampled municipalities, grouped users into LTM and no-LTM conditions, and analyzed 1,252 call logs.

PROCESS

How CareCall Handles a Weekly Check-Up Call Session

  1. Current call session

     CareCall receives user speech, performs transcription, and builds the current dialogue history for the weekly check-up call.

  2. Memory-augmented input

     CareCall’s memory management layer retrieves stored summaries for Health, Meals, Sleep, Visited Places, and Pets and injects them into the LLM input.

  3. New AI message

     HyperCLOVA generates responses from the memory-augmented input, sometimes producing LTM-triggered follow-up questions about prior health issues.

  4. Summarizer

     After the call, an LLM summarizer produces topic-wise summaries, and the memory management layer updates or removes items in the long-term memory store.
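The four steps above can be sketched as a single session loop. Everything here is an assumption for illustration: `MemoryStore`, `transcribe`, `generate`, and `summarize` stand in for CareCall's actual ASR, HyperCLOVA, and summarizer components, whose real interfaces the paper does not specify.

```python
# Illustrative end-to-end loop for one weekly check-up call; all class and
# function names are hypothetical stand-ins for CareCall's components.

class MemoryStore:
    """Per-user long-term memory keyed by topic (Health, Meals, Sleep, ...)."""

    def __init__(self):
        self.items = {}

    def retrieve(self):
        return dict(self.items)

    def update(self, topic_summaries):
        # A None summary signals that the item should be removed,
        # e.g. a previously reported issue that is now resolved.
        for topic, summary in topic_summaries.items():
            if summary is None:
                self.items.pop(topic, None)
            else:
                self.items[topic] = summary

def run_call_session(store, transcribe, generate, summarize, audio_turns):
    history = []
    for audio in audio_turns:                 # step 1: current call session
        history.append("User: " + transcribe(audio))
        memories = store.retrieve()           # step 2: memory-augmented input
        reply = generate(memories, history)   # step 3: new AI message (may
        history.append("AI: " + reply)        #   include LTM-triggered follow-ups)
    store.update(summarize(history))          # step 4: summarizer updates LTM
    return history
```

Note that memory is read on every turn but written only once, after the call ends, matching the paper's per-session summarization design.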

KEY CONTRIBUTIONS

Key Contributions

  • Empirical understanding of LTM impact in CareCall

    CareCall provides an empirical comparison of 576 calls with LTM and 676 calls without LTM, showing higher Health-detail and Clinical-detail disclosure when long-term memory is enabled.

  • Design of LTM topics and memory pipeline

    CareCall specifies five LTM topics and implements a memory management layer plus LLM summarizer that dynamically stores, updates, and injects summaries into HyperCLOVA.

  • Design implications for public health chatbots

    CareCall surfaces tensions between health-monitoring utility and privacy, and recommends careful topic selection and phrasing of LTM-triggered questions for public health interventions.

RESULTS

By the Numbers

Health detail code count per call

0.31–0.76 higher

+0.31 to +0.76 over CareCall without LTM

Clinical detail code count per call

0.18–0.63 higher

+0.18 to +0.63 over CareCall without LTM

Meals detail code count per call

0.25–0.60 higher

+0.25 to +0.60 over CareCall without LTM

Average call duration

87.89 s vs 75.48 s

+12.41 s over CareCall without LTM

On 1,252 calls from 147 users across two Korean cities, these metrics show that CareCall with long-term memory elicits richer health-related disclosure and sustains longer engagement than CareCall without memory.


BENCHMARK

Code counts per call: CareCall with vs without Long-Term Memory

Average increase ranges in detailed health related code counts per call when CareCall uses long-term memory.

KEY INSIGHT

The Counterintuitive Finding

Despite Sleep being an LTM topic, CareCall users with long-term memory showed lower Sleep-simple counts, with 0.13–0.44 fewer codes per call than users without memory.

This is surprising because one might expect remembering sleep issues to increase sleep talk, but CareCall’s design shifted disclosure toward detailed Health and Clinical categories instead.

WHY IT MATTERS

What this unlocks for the field

CareCall shows that a lightweight long-term memory layer over an LLM can steer real-world health conversations toward richer, clinically relevant self-disclosure over months.

Builders of public health and wellbeing agents can now design memory-augmented chatbots that feel more familiar and caring, while explicitly tuning which topics are remembered to balance monitoring goals and privacy.


Related papers

Agent Memory · Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

2026

A-MAC scores candidate memories using Utility, Confidence, Novelty, Recency, and Type Prior combined by a learned linear admission policy with Algorithm 1 A-MAC Memory Admission. On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.

Long-Term Memory

Advancing Open-source World Models

Robbyant Team, Zelin Gao et al.

arXiv 2026

LingBot-World combines a Data Engine, Fundamental World Model, Action-Conditioned World Model, and Post-Training causal adaptation to turn a 28B-parameter video generator into a real-time interactive world simulator. On the VBench benchmark, LingBot-World achieves a dynamic degree of 0.8857 versus 0.7612 for Yume-1.5, while also improving imaging quality to 0.6683.

Benchmark · Long-Term Memory

AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

Manoj Madushanka Perera, Adnan Mahmood et al.

· 2026

AgenticAI-DialogGen chains ChatPreprocessor, KnowledgeExtractor, TopicAnalyzer, KnowledgeGraphBuilder, PersonaGenerator, DuelingChat Agent, ConversationValidator, ConversationRefiner, QAGeneration, and PostProcessing to turn raw multi-session chats into topic-guided, persona-grounded conversations with explicit short- and long-term memories. On the TGC / KG memory QA benchmark, Mistral-7B fine-tuned within AgenticAI-DialogGen achieves 87.36 F1, compared to GPT-4’s 83.77 F1 in a zero-shot setting on the same task.
