Memory-Augmented Log Analysis with Phi-4-mini: Enhancing Threat Detection in Structured Security Logs

Authors: Anbi Guo, Mahfuza Farooque

2025

arXiv PDF

TL;DR

DM-RAG uses dual memories plus Bayesian fusion to push recall to 98.70% on UNSW-NB15 with Phi-4-mini.

THE PROBLEM

LLMs Miss Multistage Threats in Long Logs (DM-RAG reaches 98.70% recall)

Structured security logs used for advanced persistent threat (APT) detection span long periods. They exceed both the context window and the domain knowledge of a typical LLM, so attacks are missed.

Without DM-RAG, Phi-4-mini-based systems struggle on UNSW-NB15: zero-shot Phi-4-mini reaches just 0.20% recall, so multistage attacks go undetected, which undermines intrusion detection and threat response.

HOW IT WORKS

DM-RAG — Dual-Memory Retrieval-Augmented Generation

DM-RAG’s core mechanism combines Short-Term Memory (STM), Long-Term Memory (LTM), a logistic regression confidence model, and Bayesian fusion around an instruction-tuned Phi-4-mini.

You can think of STM as RAM and LTM as disk, with FAISS acting like an indexed card catalog for historical attack patterns.

This dual-memory design lets DM-RAG reason over days of logs, combining recent and historical context in ways a plain context window cannot.
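The RAM-and-disk analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the `build_context` helper, and the toy two-dimensional vectors are hypothetical, and a brute-force cosine search stands in for the FAISS index that the paper uses.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ShortTermMemory:
    """Fixed-size sliding window of recent summaries (the "RAM")."""

    def __init__(self, capacity=10):
        self.items = deque(maxlen=capacity)  # oldest entry is dropped automatically

    def add(self, summary):
        self.items.append(summary)


class LongTermMemory:
    """Brute-force similarity store standing in for a FAISS index (the "disk")."""

    def __init__(self):
        self.vectors = []
        self.summaries = []

    def add(self, vector, summary):
        self.vectors.append(vector)
        self.summaries.append(summary)

    def search(self, query, k=3):
        scored = [(cosine(query, v), s) for v, s in zip(self.vectors, self.summaries)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [summary for _, summary in scored[:k]]


def build_context(stm, ltm, query_vector, k=2):
    """Combine recent (STM) and historical (LTM) context for one prompt."""
    return {"recent": list(stm.items), "historical": ltm.search(query_vector, k)}


stm = ShortTermMemory(capacity=3)
for i in range(5):
    stm.add(f"summary {i}")

ltm = LongTermMemory()
ltm.add([1.0, 0.0], "port scan pattern")
ltm.add([0.0, 1.0], "brute-force login pattern")

context = build_context(stm, ltm, query_vector=[0.9, 0.1], k=1)
print(context)
```

The window keeps only the three most recent summaries here, while the long-term store returns whichever historical pattern lies closest to the query vector.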

DIAGRAM

Online Log Analysis and Memory Update Flow

This diagram shows how DM-RAG processes each UNSW-NB15 log entry, updates STM and LTM, and applies Bayesian fusion when STM is full.

DIAGRAM

UNSW-NB15 Evaluation Pipeline for DM-RAG

This diagram shows how DM-RAG is trained and evaluated on UNSW-NB15 with logistic regression, instruction tuning, and test-time dual-memory prompting.

PROCESS

How DM-RAG Handles Online Log Analysis with Memory-Augmented RAG and Bayesian Fusion

01
Step 1: Confidence Model Preparation
DM-RAG trains the logistic regression confidence model on normalized UNSW-NB15 features, defining score(x) as P(y = 1 | x) for anomaly confidence.
02
Step 2: Online Log Analysis
DM-RAG encodes each log with the Encoder E, retrieves from Long-Term Memory (LTM) via FAISS, merges STM and LTM into a prompt, and queries Phi-4-mini.
03
Step 3: Memory Generation
DM-RAG parses summary, confidence, and label, then appends them into Short-Term Memory (STM) as a sliding window of K = 10 recent summaries.
04
Step 4: Memory Compression and Promotion
When STM is full, DM-RAG compresses summaries with Phi-4-mini, applies Bayesian fusion to compute the fused confidence conf_fused, and promotes high-confidence summaries into Long-Term Memory (LTM).
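The four steps above can be sketched as a small loop. Several parts are assumptions rather than the paper's method: the fusion rule (a Beta posterior mean with each confidence treated as a soft pseudo-count), the promotion threshold of 0.8, clearing STM after a promotion check, and the string join that stands in for Phi-4-mini compression. Only score(x) = P(y = 1 | x) and K = 10 come from the text.

```python
import math
from collections import deque

K = 10                   # STM window size stated in the text
PROMOTE_THRESHOLD = 0.8  # hypothetical cut-off, not taken from the paper


def score(x, weights, bias):
    """Logistic regression confidence: P(y = 1 | x)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def beta_fuse(confidences, prior_alpha=1.0, prior_beta=1.0):
    """Assumed fusion rule: posterior mean of a Beta distribution.

    Each confidence c adds c to alpha and (1 - c) to beta.
    """
    alpha = prior_alpha + sum(confidences)
    beta = prior_beta + sum(1.0 - c for c in confidences)
    return alpha / (alpha + beta)


def process_stream(entries, weights, bias):
    """Run (features, summary) pairs through STM and promote to LTM when STM fills."""
    stm = deque(maxlen=K)
    ltm = []
    for features, summary in entries:
        stm.append({"summary": summary, "confidence": score(features, weights, bias)})
        if len(stm) == K:
            conf_fused = beta_fuse([m["confidence"] for m in stm])
            if conf_fused >= PROMOTE_THRESHOLD:
                # Placeholder for LLM compression: just join the summaries.
                compressed = " | ".join(m["summary"] for m in stm)
                ltm.append({"summary": compressed, "conf_fused": conf_fused})
            stm.clear()  # assumption: start a fresh window after each check
    return stm, ltm


# Ten strongly anomalous entries fill STM once and trigger one promotion.
entries = [([2.0], f"suspicious flow {i}") for i in range(10)]
stm, ltm = process_stream(entries, weights=[5.0], bias=0.0)
print(len(stm), len(ltm), round(ltm[0]["conf_fused"], 3))
```

With a uniform Beta(1, 1) prior, the fused value is pulled slightly toward 0.5, so ten near-certain confidences fuse to about 0.917 rather than 1.0.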

KEY CONTRIBUTIONS

01
Dual-Memory Retrieval-Augmented Generation (DM-RAG)
DM-RAG introduces interacting Short-Term Memory (STM) and Long-Term Memory (LTM) modules around Phi-4-mini, enabling high-recall log anomaly detection with 98.70% recall and 69.59% F1 on UNSW-NB15.
02
Bayesian Fusion for Memory Promotion
DM-RAG uses Bayesian fusion over Beta-modeled features to compute fused anomaly confidence, deciding which STM summaries are compressed and promoted into LTM.
03
Instruction-Tuned Phi-4-mini for Structured Logs
DM-RAG instruction-tunes Phi-4-mini with a strict JSON prompt design, combining STM and LTM context to produce interpretable summaries and attack labels for structured security logs.
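A strict JSON contract is easiest to appreciate from the consumer side. The sketch below assumes three output fields named summary, confidence, and label, and a confidence in [0, 1]. The prompt wording, the field names, and both helper functions are hypothetical, not the paper's actual prompt.

```python
import json

# Assumed output schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"summary": str, "confidence": (int, float), "label": str}


def build_prompt(log_entry, stm_summaries, ltm_summaries):
    """Assemble a prompt that merges STM and LTM context with the current log entry."""
    lines = [
        "Respond with a single JSON object with keys: summary, confidence, label.",
        "Recent context (STM):",
        *stm_summaries,
        "Historical context (LTM):",
        *ltm_summaries,
        "Log entry:",
        json.dumps(log_entry, sort_keys=True),
    ]
    return "\n".join(lines)


def parse_response(text):
    """Return the parsed dict, or None if the reply violates the assumed schema."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    for key, expected in REQUIRED_FIELDS.items():
        value = obj.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return None
    if not 0.0 <= obj["confidence"] <= 1.0:
        return None
    return obj


good = parse_response('{"summary": "repeated failed logins", "confidence": 0.91, "label": "attack"}')
bad = parse_response("The traffic looks fine to me.")
print(good, bad)
```

Rejecting malformed replies outright, rather than trying to repair them, keeps free-form text out of memory; whether DM-RAG retries or discards such replies is not stated in this summary.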

RESULTS

By the Numbers

Accuracy: 53.64%, -3.60 vs. Phi-4 + RAG (MITRE)

Precision: 53.74%, +8.82 vs. LoRA Fine-tuned

Recall: 98.70%, +57.13 vs. Phi-4 + RAG (MITRE)

F1 Score: 69.59%, +17.89 vs. Phi-4 + RAG (MITRE)

On the UNSW-NB15 intrusion detection benchmark, DM-RAG is evaluated on accuracy, precision, recall, and F1. The 98.70% recall and 69.59% F1 show that DM-RAG captures almost all attacks and improves F1 by 17.89 over Phi-4 + RAG (MITRE), while precision (53.74%) and accuracy (53.64%) remain moderate.

BENCHMARK

Performance Comparison on UNSW-NB15 Test Set

F1 Score on UNSW-NB15 for DM-RAG and Phi-4-mini baselines.

KEY INSIGHT

The Counterintuitive Finding

DM-RAG achieves 98.70% recall with only 53.74% precision, while zero-shot Phi-4-mini reaches 98.91% precision but just 0.20% recall.

This is surprising because high precision is usually desirable, yet DM-RAG shows that for security logs, aggressively maximizing recall is more valuable than ultra-conservative precision.
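The trade-off is easy to check with the standard F1 formula, the harmonic mean of precision and recall. The snippet below plugs in the precision and recall figures quoted above; the function name is ours.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


dm_rag = f1(0.5374, 0.9870)     # DM-RAG: 53.74% precision, 98.70% recall
zero_shot = f1(0.9891, 0.0020)  # zero-shot Phi-4-mini: 98.91% precision, 0.20% recall

print(f"DM-RAG F1:    {dm_rag:.2%}")     # 69.59%
print(f"Zero-shot F1: {zero_shot:.2%}")  # 0.40%
```

Because the harmonic mean is dominated by the smaller of the two values, a detector with near-perfect precision but 0.20% recall scores an F1 of about 0.40%.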

WHY IT MATTERS

What this unlocks for the field

DM-RAG unlocks high-recall, interpretable, long-horizon reasoning over structured security logs using a compact Phi-4-mini backbone and dual-memory RAG.

Builders can now deploy lightweight, memory-augmented LLM detectors that track multistage attacks over time without massive external corpora or heavy fine-tuning.
