Ayush Gupta - AI Agent Engineer

Every AI agent needs memory. Without it, each conversation starts from zero -- no learned preferences, no context from yesterday, no understanding of what failed last time. The problem is deceptively hard: it is not just storage and retrieval. It is deciding what to keep, what to forget, how to handle contradictions, and how to reason across memories that span weeks or months.

2025-2026 saw an explosion of approaches. Research labs published taxonomies and benchmarks. Startups shipped products. Open-source projects emerged. But the landscape is fragmented, marketing claims are generous, and it is genuinely hard to tell what works from what sounds good in a blog post.

This post cuts through that. It covers every major system, how each actually works under the hood, benchmark numbers where they exist, and an honest assessment of what is open-source versus what is a cloud product with an open-source wrapper.

01 Three Memory Strategies

Despite the variety of products, every agent memory system uses one of three fundamental strategies -- or a combination.

Vector Embeddings

Facts become vectors. Retrieval is "find what looks similar."

+ Fast, simple, no schema

- No relationships, no time, no multi-hop

Knowledge Graphs

Entities as nodes, relationships as edges. Multi-hop traversal.

+ Causal reasoning, auditable, temporal

- Schema overhead, depends on LLM extraction quality

LLM-Controlled

The model itself decides what to store, retrieve, and forget.

+ Flexible, adaptive per task

- Unpredictable, not auditable

The 2026 Consensus

The best-performing systems are hybrids: vectors for breadth (find roughly relevant memories fast), graphs for depth (traverse relationships from those entry points). In the LoCoMo scores reported in the next section, the mostly-vector systems land around 60-70%, and the pure LLM-controlled system trails the top hybrids by a few points.
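To make the breadth-then-depth idea concrete, here is a minimal sketch. It is not any vendor's implementation: the function name `hybrid_retrieve`, the hand-made 2-d "embeddings", and the edge list are all assumptions chosen for illustration. Vector similarity picks the entry points, then a breadth-first walk over graph edges pulls in connected memories that would never match on similarity alone.

```python
import math
from collections import deque

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_retrieve(query_vec, embeddings, edges, top_k=1, max_hops=2):
    """Vector search picks entry points; breadth-first traversal expands them."""
    # Breadth: rank every memory by similarity to the query, keep the top_k.
    ranked = sorted(embeddings, key=lambda m: cosine(query_vec, embeddings[m]), reverse=True)
    seeds = ranked[:top_k]
    # Depth: follow graph edges outward from the seeds, up to max_hops.
    seen = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen  # memory id -> hop distance from the nearest seed

# Toy data (hypothetical): hand-made vectors and relationships.
embeddings = {
    "deploy_failed": [0.9, 0.1],
    "disk_full": [0.2, 0.8],
    "log_rotation_off": [0.1, 0.9],
    "user_likes_tea": [-0.7, 0.1],
}
edges = {
    "deploy_failed": ["disk_full"],
    "disk_full": ["log_rotation_off"],
}
result = hybrid_retrieve([1.0, 0.0], embeddings, edges)
```

With this toy data, only `deploy_failed` resembles the query, but the traversal also surfaces `disk_full` and `log_rotation_off`, which is the multi-hop behavior pure vector search lacks.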

02 The Systems, Compared

LoCoMo Benchmark Scores

LoCoMo (Long Conversation Memory) is the primary benchmark. It tests factual recall, preferences, temporal events, and relationships across long multi-session conversations. Higher is better.

System	Strategy	LoCoMo	Open Source	Launched
SuperLocalMemory V3	4-channel hybrid fusion	87.7%	Full OSS	2025
Zep (Graphiti)	Temporal knowledge graph + vector	85.0%	Partial	2024
Letta (MemGPT)	LLM-controlled tiered memory	83.2%	Full OSS	2023
Supermemory	Vector + light graph edges	~70%	Partial	2024
Mem0	Vector similarity (+ optional graph)	~60%	Partial	2024

03 How Each Engine Actually Works

Mem0 -- The Simple Pipeline

Mem0 architecture: single-pass extraction, 3-signal retrieval, optional graph layer

Extraction: One LLM call reads the conversation and extracts facts. Agent-generated facts are treated as first-class alongside user statements.

Storage: Each fact becomes a vector embedding. Entities are indexed separately.

Retrieval: Semantic similarity, BM25, and entity matching each score the candidates, and the scores are fused.
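The fusion step can be sketched as follows. This uses reciprocal rank fusion, a common general-purpose technique for combining ranked lists; whether Mem0 fuses its three signals this way is an assumption, and the function name and memory ids below are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of memory ids into one combined ranking.

    Each list contributes 1 / (k + rank) per item, so memories that rank
    well on several signals rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] = scores.get(memory_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda m: (-scores[m], m))

# Hypothetical per-signal rankings for one query.
semantic = ["m2", "m1", "m3"]   # embedding similarity
bm25 = ["m1", "m2", "m4"]       # keyword match
entity = ["m1", "m3"]           # entity overlap
fused = reciprocal_rank_fusion([semantic, bm25, entity])
```

Here `m1` wins because it appears near the top of all three lists, even though it is not the best semantic match.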

Why it scores lower

Single-pass extraction misses nuance. No temporal reasoning -- cannot tell if a fact was superseded. "I live in Mumbai" and "I moved to Delhi" coexist with no invalidation. No multi-hop across connected facts.

Zep (Graphiti) -- The Temporal Knowledge Graph

Zep/Graphiti: multi-stage extraction, 3-tier temporal graph, hybrid retrieval with reranking

The bitemporal model is Zep's defining feature. Every fact carries four timestamps on two timelines: t_valid / t_invalid for when the fact was actually true, and t'_created / t'_expired for when the system recorded it. When a new fact contradicts an existing one, the old fact is formally invalidated. New information always wins.
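A minimal sketch of the four-timestamp idea, using the Mumbai/Delhi example from the Mem0 section. This is not Graphiti's code: the `Fact` class, `add_fact`, and `current` are hypothetical names, and contradiction detection is reduced to "same subject and predicate, different value", where a real system would rely on an LLM or richer rules.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    t_valid: datetime                      # when the fact became true in the world
    t_created: datetime                    # when the system recorded it
    t_invalid: Optional[datetime] = None   # when it stopped being true
    t_expired: Optional[datetime] = None   # when the system retired the record

def add_fact(store, new):
    """Append a fact, invalidating any live fact it contradicts."""
    for old in store:
        same_slot = (old.subject, old.predicate) == (new.subject, new.predicate)
        if same_slot and old.t_invalid is None and old.value != new.value:
            old.t_invalid = new.t_valid    # world time: superseded when the new fact began
            old.t_expired = new.t_created  # system time: retired when the new fact arrived
    store.append(new)

def current(store, subject, predicate):
    """Return the values that are still valid for a subject/predicate pair."""
    return [f.value for f in store
            if (f.subject, f.predicate) == (subject, predicate) and f.t_invalid is None]

store = []
add_fact(store, Fact("user", "lives_in", "Mumbai",
                     t_valid=datetime(2024, 1, 1), t_created=datetime(2024, 1, 5)))
add_fact(store, Fact("user", "lives_in", "Delhi",
                     t_valid=datetime(2025, 3, 1), t_created=datetime(2025, 3, 2)))
```

After the second call, `current(store, "user", "lives_in")` returns only Delhi, while the Mumbai fact stays in the store with its validity window closed, so "where did the user live in 2024?" remains answerable.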

Why it scores 85%

The temporal model handles "what changed" and "when did X happen." The episode layer preserves everything as ground truth. Graph traversal enables multi-hop reasoning. Formal contradiction resolution means stale facts do not pollute results.

Letta (MemGPT) -- The OS Approach

Letta/MemGPT: the LLM controls its own memory -- paging between "RAM" and "disk"

No fixed extraction pipeline. The model decides what to remember through tool calls. Flexible and adaptive, but you cannot audit why it chose to remember X and forget Y. Scores 83.2% -- solid, but below systems with explicit graph structure.
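The "RAM" and "disk" split can be sketched as a small class that exposes memory operations as tools. Everything here is a simplified assumption rather than Letta's actual API: the `TieredMemory` class, the tool names `remember` and `recall`, the character budget, and the oldest-first eviction are all hypothetical. In a real system the model would emit the tool calls; here they are hard-coded.

```python
class TieredMemory:
    def __init__(self, core_budget_chars=40):
        self.core = []       # always in the prompt ("RAM")
        self.archival = []   # searched on demand ("disk")
        self.core_budget_chars = core_budget_chars

    def remember(self, text):
        """Tool: store text in core memory, paging out the oldest entries if over budget."""
        self.core.append(text)
        while sum(len(t) for t in self.core) > self.core_budget_chars and len(self.core) > 1:
            self.archival.append(self.core.pop(0))

    def recall(self, keyword):
        """Tool: case-insensitive substring search over archival memory."""
        return [t for t in self.archival if keyword.lower() in t.lower()]

def dispatch(memory, call):
    """Execute one tool call of the form {"name": ..., "args": {...}}."""
    tools = {"remember": memory.remember, "recall": memory.recall}
    return tools[call["name"]](**call["args"])

memory = TieredMemory(core_budget_chars=40)
# Stand-ins for tool calls the model would generate during a conversation.
calls = [
    {"name": "remember", "args": {"text": "User prefers dark mode"}},
    {"name": "remember", "args": {"text": "User is based in Delhi"}},
    {"name": "recall", "args": {"keyword": "dark"}},
]
results = [dispatch(memory, call) for call in calls]
```

The second `remember` pushes core memory over its budget, so the older entry is paged out to archival storage and is only reachable again through an explicit `recall`. The auditability problem is visible even in this toy: nothing records why the model issued a given call.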

04 The Open-Source Reality

"Open source" in the memory space is complicated. Most companies open-source the client SDK while keeping the core engine proprietary.

Open-source spectrum: what is actually open vs. what requires the vendor's cloud

The honest summary: Letta and SuperLocalMemory are genuinely open-source. Mem0 and Zep open-source the interface but not the brain. If you need self-hosting with full control, Letta is the production-ready option.

05 Engine Internals Side-by-Side

Capability comparison: how each system handles key memory challenges

06 The Research Landscape

Beyond products, the academic side has been prolific. Here is the timeline of papers that shaped the field:

2023

Generative Agents (Park et al., Stanford)

The foundational paper. Observation-reflection-planning loop producing months of coherent social behavior.

ICLR 2024

MemGPT: Towards LLMs as Operating Systems

First to treat memory as a systems problem. Virtual memory paging for LLMs. Became Letta.

Jan 2025

Zep/Graphiti -- Temporal Knowledge Graph

Bitemporal model with 4 timestamps per fact. 18.5% accuracy improvement over baselines.

Feb 2025

A-Mem -- Agentic Memory

Agents manage their own memory through structured notes rather than passive processes.

ECAI 2025

Mem0 -- Production Memory

First broad comparison of 10 memory approaches. 26% improvement over OpenAI memory.

Dec 2025

Memory in the Age of AI Agents

Unified taxonomy by form, function, and dynamics. Curated paper list on GitHub.

Jan 2026

MAGMA -- Multi-Graph Architecture

4 parallel graphs (semantic, temporal, causal, entity). Policy-guided traversal.

Jan 2026

AgeMem -- RL-Trained Memory

3-stage progressive reinforcement learning. Agent learns WHEN to store, retrieve, discard.

Feb 2026

Graph-Based Agent Memory -- Survey

Taxonomy of 5 graph types. Full lifecycle: extraction, storage, retrieval, evolution.

Mar 2026

Memory for Autonomous LLM Agents -- Definitive Survey

Three-dimensional taxonomy. Key finding: extended context windows cannot replace memory.

Apr 2026

GAM -- Hierarchical Graph Memory

Dual-layer: global topic network + local event graphs. State-based consolidation prevents contamination.

Benchmarks

Benchmark	What It Tests	Limitation
LoCoMo	Long conversation memory: factual recall, preferences, temporal, relationships	Chatbot-focused, not task-grounded
LongMemEval	Long-horizon with temporal and multi-hop questions	Still conversational, not agentic
MemoryArena	Agentic memory + action. Near-perfect recall drops to 40-60% on tasks	Newer, less widely adopted
MemoryAgentBench	Cognitive science-grounded: retrieval, learning, forgetting	Academic, limited validation

The benchmark gap

Every existing benchmark tests chatbot memory -- "what did the user say 50 turns ago?" None test what matters for production agents: did memory help complete a task, avoid a past mistake, or diagnose a recurring problem? This is a wide-open research opportunity.

07 Traction and Adoption

Community size vs. benchmark performance

Mem0 dominates adoption (25k stars) despite the weakest benchmark score. Zep and SuperLocalMemory lead on performance but have small communities. The simplest tool wins adoption, not the most capable one.

08 What Is Still Unsolved

The surveys and benchmark papers converge on the same open problems. These represent real research opportunities:

Principled Consolidation

When should an agent merge similar memories into a generalized fact? When should it keep individual episodes? Too aggressive and you lose detail. Too conservative and memory bloats with near-duplicates. Nobody has a good answer beyond heuristics.
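The trade-off shows up even in a toy heuristic. The sketch below merges memories whose token overlap (Jaccard similarity) exceeds a threshold; the function names, the similarity measure, and the example memories are assumptions for illustration, not a method taken from any of the systems above.

```python
def jaccard(a, b):
    """Token-set overlap between two strings, from 0.0 to 1.0."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def consolidate(memories, threshold):
    """Greedily fold each memory into the first kept memory it resembles."""
    kept = []  # each entry is [representative_text, merged_count]
    for text in memories:
        for entry in kept:
            if jaccard(text, entry[0]) >= threshold:
                entry[1] += 1
                break
        else:
            kept.append([text, 1])
    return kept

memories = [
    "deploy failed because disk was full",
    "deploy failed because disk was full again",
    "deploy failed because of a bad config",
]
balanced = consolidate(memories, threshold=0.8)       # merges the two disk-full episodes
aggressive = consolidate(memories, threshold=0.25)    # also swallows the bad-config episode
conservative = consolidate(memories, threshold=0.9)   # keeps all three near-duplicates
```

A threshold of 0.25 collapses everything into one "deploy failed" memory and loses the distinct cause, while 0.9 keeps every near-duplicate. Picking the number is exactly the kind of heuristic the paragraph above describes.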

Selective Forgetting

The ACT-R approach from cognitive psychology (memories decay with time and disuse) is a starting point, but it is too simple for agents whose tasks change week to week. A memory that has been irrelevant for a month might become critical again.
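For reference, ACT-R's base-level learning equation scores a memory as the log of a sum over its past uses, each discounted by a power of its age: B = ln(sum of age^(-d)), with d commonly set to 0.5. A minimal sketch, where the function name and the use of days as the time unit are assumptions:

```python
import math

def base_level_activation(access_times, now, decay=0.5):
    """ACT-R base-level activation: ln(sum over past accesses of age ** -decay)."""
    ages = [now - t for t in access_times if now > t]
    if not ages:
        return float("-inf")  # never accessed: no activation at all
    return math.log(sum(age ** -decay for age in ages))

accesses = [1, 2, 3]  # days on which the memory was used (hypothetical)
fresh = base_level_activation(accesses, now=4)    # one day after the last use
stale = base_level_activation(accesses, now=34)   # a month of disuse later
```

`fresh` is higher than `stale`, so a forgetting rule based on an activation cutoff would drop this memory after a quiet month. That is the failure mode described above: the score knows about recency and frequency, but nothing about whether next week's task will need the memory again.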

Causal Retrieval

Current retrieval is semantic similarity or keyword matching. What agents need is causal: "what caused this?" That requires traversing cause-effect chains, not finding similar-looking text. MAGMA's causal graph is a step forward, but extraction quality is still unreliable.
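What causal retrieval means in practice can be sketched as a walk over explicit cause-effect edges. This is not MAGMA's algorithm: the `root_causes` function, the `caused_by` mapping, and the incident names are hypothetical, and the sketch assumes the edges were already extracted correctly, which is the unreliable part.

```python
def root_causes(effect, caused_by):
    """Follow caused_by edges from an effect back to events with no recorded cause."""
    roots, stack, seen = [], [effect], {effect}
    while stack:
        event = stack.pop()
        causes = caused_by.get(event, [])
        if not causes and event != effect:
            roots.append(event)  # nothing upstream of this event: treat it as a root cause
        for cause in causes:
            if cause not in seen:  # guard against cycles in noisy extracted graphs
                seen.add(cause)
                stack.append(cause)
    return sorted(roots)

# Hypothetical cause-effect edges stored in memory: effect -> direct causes.
caused_by = {
    "checkout_errors": ["db_timeouts"],
    "db_timeouts": ["connection_pool_exhausted"],
    "connection_pool_exhausted": ["leaked_connections", "traffic_spike"],
}
causes = root_causes("checkout_errors", caused_by)
```

The query "checkout_errors" shares almost no vocabulary with "leaked_connections", so similarity search would not connect them; the chain of edges does.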

Self-Reinforcing Errors

When agents reflect on behavior and store reflections as memories, errors compound. A wrong diagnosis becomes a "learned pattern" that biases future diagnoses. No current system detects or breaks these feedback loops.

Cross-Agent Memory

Every system is designed for an agent remembering its own history. Multi-agent systems need agents to maintain structured memories about other agents -- capabilities, past behavior, failure patterns. This is essentially unstudied.

Task-Grounded Evaluation

LoCoMo tests "can you recall what was said?" MemoryArena showed near-perfect recall drops to 40-60% on actual tasks. The field needs benchmarks measuring whether memory helps agents do their jobs better, not just remember conversation facts.

09 Where This Is Headed

The agent memory space in 2026 is where vector databases were in 2023: lots of products, unclear differentiation, no standard benchmark everyone trusts. The trajectory is clear:

Hybrid is the future. Pure vector memory will become the "SQLite of agent memory" -- fine for simple cases, insufficient for production agents that run for weeks.
Temporal reasoning is table stakes. Any system that cannot answer "when did this change?" will fall behind.
Consolidation and forgetting are the hard problems. Storing everything is easy. Knowing what to keep is the research frontier.
Benchmarks need to get real. Chatbot recall tests do not predict agent task performance.
Half of these products will consolidate or die in 2-3 years. The survivors will have hybrid memory, temporal reasoning, and auditable retrieval by default.

If you are building agents today: start with Mem0 for simplicity, move to Letta if you need self-hosting, and evaluate Zep if temporal reasoning is critical. If you are doing research: consolidation, forgetting, cross-agent memory, and task-grounded evaluation are all wide open.

References

Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers. arXiv 2603.07670, Mar 2026
Graph-Based Agent Memory: Taxonomy, Techniques, and Applications. arXiv 2602.05665, Feb 2026
MAGMA: A Multi-Graph based Agentic Memory Architecture. arXiv 2601.03236, Jan 2026
GAM: Hierarchical Graph-based Agentic Memory. arXiv 2604.12285, Apr 2026
Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management. arXiv 2601.01885, Jan 2026
A-Mem: Agentic Memory for LLM Agents. arXiv 2502.12110, Feb 2025
Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv 2501.13956, Jan 2025
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv 2504.19413, ECAI 2025
MemGPT: Towards LLMs as Operating Systems. ICLR 2024
Generative Agents: Interactive Simulacra of Human Behavior. Park et al., 2023
5 AI Agent Memory Systems Compared. DEV Community, 2026
State of AI Agent Memory 2026. Mem0 Blog