LLM-based Medical Assistant Personalization with Short- and Long-Term Memory Coordination

Authors: Kai Zhang, Yangyang Kang, Fubang Zhao, Xiaozhong Liu

2023

TL;DR

MaLP uses a Dual-Process enhanced Memory plus PEFT to personalize medical assistants, reaching 91.53% win rate on response generation with LLaMA-7B (+13.12 over LoRA).



THE PROBLEM

Dictionary memories and prompt-only personalization fall short for medical assistants

Existing memory-based assistants rely on rigid dictionary-based memory, which MaLP shows is inflexible and heavily dependent on retriever quality.

Without retraining, these systems struggle to provide personalized and engaging experiences, so patients with different dialogue preferences receive mismatched advice, degrading user satisfaction.

HOW IT WORKS

Dual-Process enhanced Memory and the MaLP framework

MaLP centers on Dual-Process enhanced Memory (DPeM) coordinating Working Memory, Short-Term Memory (STM), Long-Term Memory (LTM), Coordinator C, and Retriever R around a PEFT-tuned LLM.

DPeM mirrors human cognition: Working Memory acts like RAM, STM like a scratchpad, and LTM like disk, with rehearsal and executive processes moving knowledge between them.

This dual-process structure lets MaLP store user-specific and common-sense knowledge beyond a fixed context window, enabling retrieval-aware personalization that static prompts and single-layer memories cannot match.
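The memory hierarchy described above can be sketched as a small Python class. This is a minimal illustration, not MaLP's implementation: the class name, the `promote_after` threshold, and the dictionary layout are all assumptions made for clarity.

```python
from collections import Counter

class DPeMSketch:
    """Toy sketch of a dual-process memory: working -> STM -> LTM.

    The promotion threshold and data layout are illustrative
    assumptions, not MaLP's actual mechanism.
    """

    def __init__(self, promote_after=3):
        self.working = []          # transient notes from the current turn
        self.stm = {}              # filtered, session-scoped knowledge
        self.ltm = {}              # durable user-specific / common-sense facts
        self.access = Counter()    # flag table: how often each STM key is used
        self.promote_after = promote_after

    def observe(self, key, note):
        """Rehearsal: write a raw note into working memory."""
        self.working.append((key, note))

    def summarize(self):
        """Executive: filter working memory into STM, then clear it."""
        for key, note in self.working:
            self.stm[key] = note
        self.working.clear()

    def retrieve(self, key):
        """Look up STM first; frequently accessed items move to LTM."""
        if key in self.stm:
            self.access[key] += 1
            if self.access[key] >= self.promote_after:
                self.ltm[key] = self.stm.pop(key)   # promote to long-term
                return self.ltm[key]
            return self.stm[key]
        return self.ltm.get(key)

mem = DPeMSketch()
mem.observe("diet", "prefers low-sodium meal plans")
mem.summarize()
for _ in range(3):
    mem.retrieve("diet")
print("diet" in mem.ltm)  # True: accessed often enough to be promoted
```

The key design idea this mirrors is that nothing enters LTM directly: items must survive the working-memory filter and accumulate access counts in the flag table before promotion.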

DIAGRAM

Dual-process rehearsal and executive flow in DPeM

This diagram shows how MaLP’s DPeM runs Learning, Summarizing, and Memorizing via Rehearsal and Executive processes over dialogue iterations.

DIAGRAM

MaLP training and evaluation pipeline

This diagram shows how MaLP injects medical knowledge, builds DPeM memory, applies LoRA, and evaluates on QA, preference classification, and response generation.

PROCESS

How MaLP Handles a Medical Dialogue Session

  1. Medical Knowledge Adaptation

    MaLP first trains a domain adapter on HealthCareMagic-100k and iCliniq, constraining changes with knowledge loss and sample loss to preserve the base LLM’s capabilities.

  2. DPeM Mechanism

    MaLP applies the Dual-Process enhanced Memory, where Coordinator C writes notes into Working Memory, filters them into STM, and promotes frequent items into LTM.

  3. Memory Generation

    Using user dialogue D, MaLP forms M_working, M_STM, and M_LTM tables, storing user-specific and common-sense knowledge, with flag tables tracking access frequency.

  4. Memory Utilization

    At query time, Retriever R applies R_c to STM and R_s to LTM to build prompt p, then feeds the query x and p into the LoRA-tuned LLM Φ̂ to generate a personalized response y.
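The memory-utilization step can be sketched as follows. The word-overlap scorer below is a deliberately simple stand-in for MaLP's retrievers R_c and R_s (whose actual scoring functions are not described here), and the prompt wording and example memories are invented for illustration.

```python
def score(query, entry):
    """Toy relevance score: word overlap between query and a memory entry.
    A stand-in for MaLP's retrievers R_c / R_s, not the paper's method.
    """
    q, e = set(query.lower().split()), set(entry.lower().split())
    return len(q & e)

def build_prompt(query, stm, ltm, k=2):
    """Assemble prompt p from top-k STM entries plus top-k LTM entries."""
    def pick(mem):
        ranked = sorted(mem, key=lambda e: score(query, e), reverse=True)
        return ranked[:k]
    context = pick(stm) + pick(ltm)
    lines = "\n".join(f"- {c}" for c in context)
    return f"Known about this user:\n{lines}\n\nQuestion: {query}"

stm = ["user asked about headache triggers today",
       "user dislikes long clinical jargon"]
ltm = ["user is allergic to penicillin",
       "user prefers short, step-by-step advice"]

p = build_prompt("what can I take for a headache", stm, ltm)
print(p)
# p would then be passed, together with the query x, to the LoRA-tuned LLM.
```

The point of the two-source lookup is that session-fresh STM context and durable LTM facts (allergies, stable preferences) land in the same prompt, so the tuned model conditions on both.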

KEY CONTRIBUTIONS

Key Contributions

  • Dual-Process enhanced Memory mechanism

    MaLP introduces DPeM with Working Memory, STM, LTM, Coordinator C, and Retriever R, yielding about 7% relative improvement over dictionary-based memory on key metrics.

  • MaLP unified framework with PEFT

    MaLP combines DPeM with LoRA-based PEFT, letting the LLM internalize user preferences while DPeM supplies structured user-specific and common-sense memory.

  • Personalized medical dialogue dataset

    MaLP constructs a self-chat medical dataset with user profiles and preferences, achieving an average quality score of 5.27 and a 94% safety ratio on 100 samples.
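MaLP's PEFT component is LoRA, which freezes the pretrained weights and trains only a low-rank update. A minimal NumPy sketch of one LoRA-adapted linear layer follows; the sizes, rank, and scaling value are illustrative assumptions, not MaLP's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 8, 2   # illustrative sizes; r is the LoRA rank

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable low-rank factor
B = np.zeros((d_out, r))                  # zero-init so the update starts at 0

def lora_forward(x, alpha=4.0):
    """y = W x + (alpha / r) * B A x  -- only A and B are trained."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapted layer matches the frozen layer exactly,
# so training starts from the base model's behavior.
print(np.allclose(lora_forward(x), W @ x))  # True
```

This is why LoRA works as a "lightweight personalization handle": per user, only the small A and B matrices need storing and training, while W stays shared.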

RESULTS

By the Numbers

ROUGE-L Profile QA

33.91

+4.25 over LLaMA-7B w/ LoRA

ROUGE-L Knowledge QA

36.37

+2.77 over LLaMA-7B w/ LoRA

Preference Classification Accuracy %

69.95

+8.90 over LLaMA-7B w/ LoRA

Response Generation Win Rate %

91.53

+19.52 over LLaMA-7B w/ LoRA

On the MaLP medical dialogue benchmark with LLaMA-7B, MaLP is evaluated on Profile QA, Knowledge QA, Preference Classification, and Response Generation. The gains over the LoRA baseline show that combining DPeM with PEFT substantially improves both factual correctness and alignment with user preferences.


BENCHMARK

Main results on LLaMA-7B Response Generation Win Rate

Win Rate % for response generation with different configurations of LLaMA-7B on the MaLP medical dialogue benchmark.

BENCHMARK

Ablation study on LLaMA-7B Preference Classification

Preference Classification Accuracy % for LLaMA-7B with different MaLP modules enabled.

KEY INSIGHT

The Counterintuitive Finding

LLaMA-7B with LoRA alone reaches 61.05% preference classification accuracy, yet still trails full MaLP's 69.95% despite PEFT's strong user preference modeling.

This is surprising because many assume PEFT alone suffices for personalization, yet MaLP shows that structured DPeM memory adds a further 8.90 percentage points.

WHY IT MATTERS

What this unlocks for the field

MaLP unlocks a way to coordinate short- and long-term personalization, combining DPeM memory with PEFT so assistants remember both preferences and stable medical facts.

Builders can now create medical agents that adapt to each user over many sessions without retraining full LLMs, using MaLP’s memory tables and LoRA layers as lightweight personalization handles.


Related papers

Agent Memory · Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

· 2026

A-MAC scores candidate memories using Utility, Confidence, Novelty, Recency, and Type Prior combined by a learned linear admission policy with Algorithm 1 A-MAC Memory Admission. On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.

Long-Term Memory

Advancing Open-source World Models

Robbyant Team, Zelin Gao et al.

arXiv · 2026

LingBot-World combines a Data Engine, Fundamental World Model, Action-Conditioned World Model, and Post-Training causal adaptation to turn a 28B-parameter video generator into a real-time interactive world simulator. On the VBench benchmark, LingBot-World achieves a dynamic degree of 0.8857 versus 0.7612 for Yume-1.5, while also improving imaging quality to 0.6683.

Benchmark · Long-Term Memory

AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

Manoj Madushanka Perera, Adnan Mahmood et al.

· 2026

AgenticAI-DialogGen chains ChatPreprocessor, KnowledgeExtractor, TopicAnalyzer, KnowledgeGraphBuilder, PersonaGenerator, DuelingChat Agent, ConversationValidator, ConversationRefiner, QAGeneration, and PostProcessing to turn raw multi-session chats into topic-guided, persona-grounded conversations with explicit short- and long-term memories. On the TGC / KG memory QA benchmark, Mistral-7B fine-tuned within AgenticAI-DialogGen achieves 87.36 F1, compared to GPT-4’s 83.77 F1 in a zero-shot setting on the same task.
