Every AI agent needs memory. Without it, each conversation starts from zero -- no learned preferences, no context from yesterday, no understanding of what failed last time. The problem is deceptively hard: it is not just storage and retrieval. It is deciding what to keep, what to forget, how to handle contradictions, and how to reason across memories that span weeks or months.
2025-2026 saw an explosion of approaches. Research labs published taxonomies and benchmarks. Startups shipped products. Open-source projects emerged. But the landscape is fragmented, marketing claims are generous, and it is genuinely hard to tell what works from what sounds good in a blog post.
This post cuts through that. It covers every major system, how each actually works under the hood, benchmark numbers where they exist, and an honest assessment of what is open source versus what is a cloud product with an open-source wrapper.
01 Three Memory Strategies
Despite the variety of products, every agent memory system uses one of three fundamental strategies -- or a combination.
Vector Embeddings
Facts become vectors. Retrieval is "find what looks similar."
Knowledge Graphs
Entities as nodes, relationships as edges. Multi-hop traversal.
LLM-Controlled
The model itself decides what to store, retrieve, and forget.
The best-performing systems are hybrids: vectors for breadth (find roughly relevant memories fast), graphs for depth (traverse relationships from those entry points). In the comparison below, mostly-vector systems score markedly lower on benchmarks, while the LLM-controlled approach comes close to the hybrids but still trails them.
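To make the "breadth, then depth" idea concrete, here is a minimal sketch assuming toy two-dimensional embeddings and a plain adjacency list. The function name `hybrid_retrieve`, the memory ids, and the parameters `top_k` and `max_hops` are all hypothetical and do not come from any of the systems discussed.

```python
import math
from collections import deque

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_retrieve(query_vec, embeddings, graph, top_k=2, max_hops=1):
    """Vector search picks entry points; graph traversal expands from them."""
    # Breadth: rank every memory by similarity to the query, keep the top_k.
    ranked = sorted(embeddings, key=lambda m: cosine(query_vec, embeddings[m]), reverse=True)
    seeds = ranked[:top_k]

    # Depth: breadth-first expansion along graph edges, up to max_hops away.
    hops = dict.fromkeys(seeds, 0)  # memory id -> hop distance from a seed
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if hops[node] >= max_hops:
            continue
        for neighbor in graph.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return list(hops)

# Toy data (assumed): hand-written embeddings and one relationship edge.
embeddings = {
    "alice_likes_tea": [1.0, 0.0],
    "alice_works_at_acme": [0.9, 0.1],
    "acme_is_in_berlin": [0.0, 1.0],
}
graph = {"alice_works_at_acme": ["acme_is_in_berlin"]}

results = hybrid_retrieve([1.0, 0.05], embeddings, graph, top_k=2, max_hops=1)
print(results)
```

The third memory looks nothing like the query in vector space, yet it is returned because it is one hop away from a seed. That is the part a pure vector lookup would miss.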
02 The Systems, Compared
LoCoMo (Long Conversation Memory) is the primary benchmark. It tests factual recall, preferences, temporal events, and relationships across long multi-session conversations. Higher is better.
| System | Strategy | LoCoMo | Open Source | Launched |
|---|---|---|---|---|
| SuperLocalMemory V3 | 4-channel hybrid fusion | 87.7% | Full OSS | 2025 |
| Zep (Graphiti) | Temporal knowledge graph + vector | 85.0% | Partial | 2024 |
| Letta (MemGPT) | LLM-controlled tiered memory | 83.2% | Full OSS | 2023 |
| Supermemory | Vector + light graph edges | ~70% | Partial | 2024 |
| Mem0 | Vector similarity (+ optional graph) | ~60% | Partial | 2024 |
03 How Each Engine Actually Works
Mem0 -- The Simple Pipeline
- Extraction: One LLM call reads the conversation and extracts facts. Agent-generated facts are treated as first-class alongside user statements.
- Storage: Each fact becomes a vector embedding. Entities are indexed separately.
- Retrieval: Semantic similarity, BM25, and entity matching, with the scores fused.
The weaknesses follow from the simplicity. Single-pass extraction misses nuance. There is no temporal reasoning, so the system cannot tell whether a fact was superseded: "I live in Mumbai" and "I moved to Delhi" coexist with no invalidation. There is also no multi-hop reasoning across connected facts.
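The text does not say how Mem0 combines its three retrieval signals, so the sketch below swaps in reciprocal rank fusion (RRF), a common generic way to merge ranked lists. It is not Mem0's actual formula. The fact ids and the constant `k=60` are illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists: each item earns 1 / (k + rank) per list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-channel results for one query (best match first).
semantic = ["f2", "f1", "f3"]   # embedding similarity
keyword = ["f1", "f2"]          # BM25
entity = ["f1", "f3"]           # entity match

fused = reciprocal_rank_fusion([semantic, keyword, entity])
print(fused)  # ['f1', 'f2', 'f3']
```

Fact `f1` wins because all three channels return it, even though the semantic channel alone ranks it second. Note that nothing here addresses the supersession problem: a stale fact that still matches well is fused just like a current one.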
Zep (Graphiti) -- The Temporal Knowledge Graph
The bi-temporal (dual-timeline) model is Zep's defining feature. Every fact carries four timestamps, two on each timeline:
- t_valid / t_invalid for when the fact was actually true
- t'_created / t'_expired for when the system recorded it
When a new fact contradicts an existing one, the old fact is formally invalidated. New information always wins.
The temporal model handles "what changed" and "when did X happen." The episode layer preserves everything as ground truth. Graph traversal enables multi-hop reasoning. Formal contradiction resolution means stale facts do not pollute results.
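A minimal sketch of the four-timestamp idea, reusing the Mumbai and Delhi example from the Mem0 section. This is not Graphiti's schema or API: the `Fact` class and the helpers `add_fact` and `current_value` are hypothetical. Contradiction detection is reduced to "same subject and predicate, different value", which is a strong simplifying assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    t_valid: datetime                     # when the fact became true in the world
    t_invalid: Optional[datetime] = None  # when it stopped being true
    t_created: Optional[datetime] = None  # when the system recorded it
    t_expired: Optional[datetime] = None  # when the system retired the record

def add_fact(store, new, now):
    """Record a fact; invalidate (but keep) any live fact it contradicts."""
    new.t_created = now
    for old in store:
        same_slot = (old.subject, old.predicate) == (new.subject, new.predicate)
        if same_slot and old.t_invalid is None and old.value != new.value:
            old.t_invalid = new.t_valid  # world timeline: true until the new fact took over
            old.t_expired = now          # system timeline: retired at ingestion time
    store.append(new)

def current_value(store, subject, predicate):
    """Return the value of the fact that has not been invalidated, if any."""
    for fact in store:
        if (fact.subject, fact.predicate) == (subject, predicate) and fact.t_invalid is None:
            return fact.value
    return None

store = []
add_fact(store, Fact("user", "lives_in", "Mumbai", t_valid=datetime(2024, 1, 1)),
         now=datetime(2024, 1, 2))
add_fact(store, Fact("user", "lives_in", "Delhi", t_valid=datetime(2025, 6, 1)),
         now=datetime(2025, 6, 3))

print(current_value(store, "user", "lives_in"))  # Delhi
print(store[0].t_invalid)                        # 2025-06-01 00:00:00
```

The old fact is never deleted. It stays in the store with its validity window closed, so "where did the user live in 2024?" remains answerable.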
Letta (MemGPT) -- The OS Approach
No fixed extraction pipeline. The model decides what to remember through tool calls. Flexible and adaptive, but you cannot audit why it chose to remember X and forget Y. Scores 83.2% -- solid, but below systems with explicit graph structure.
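As a rough illustration of memory driven by tool calls, the sketch below replays a scripted list of calls in place of a real model. The tool names `remember`, `recall`, and `forget`, the call format, and the `MemoryTools` class are all hypothetical. They are not Letta's actual tool set.

```python
class MemoryTools:
    """A tiny key-value memory that is only ever modified through tool calls."""

    def __init__(self):
        self.store = {}
        self.log = []  # which calls were made, in order (not why the model made them)

    def remember(self, key, value):
        self.store[key] = value
        return "ok"

    def recall(self, key):
        return self.store.get(key, "not found")

    def forget(self, key):
        self.store.pop(key, None)
        return "ok"

    def dispatch(self, call):
        """Route one tool call of the form {"name": ..., "arguments": {...}}."""
        tools = {"remember": self.remember, "recall": self.recall, "forget": self.forget}
        result = tools[call["name"]](**call["arguments"])
        self.log.append(call["name"])
        return result

# Scripted stand-in for the tool calls a model might emit during a conversation.
calls = [
    {"name": "remember", "arguments": {"key": "editor", "value": "vim"}},
    {"name": "remember", "arguments": {"key": "lunch", "value": "pizza"}},
    {"name": "forget", "arguments": {"key": "lunch"}},
    {"name": "recall", "arguments": {"key": "editor"}},
]

memory = MemoryTools()
results = [memory.dispatch(call) for call in calls]
print(results)       # ['ok', 'ok', 'ok', 'vim']
print(memory.store)  # {'editor': 'vim'}
```

The log shows that `lunch` was forgotten, but not the reasoning behind it. That gap is the auditability problem described above.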
04 The Open-Source Reality
"Open source" in the memory space is complicated. Most companies open-source the client SDK while keeping the core engine proprietary.
The honest summary: Letta and SuperLocalMemory are genuinely open-source. Mem0 and Zep open-source the interface but not the brain. If you need self-hosting with full control, Letta is the production-ready option.
05 Engine Internals Side-by-Side
06 The Research Landscape
Beyond products, the academic side has been prolific. The papers that shaped the field are listed under References at the end of this post.
Benchmarks
| Benchmark | What It Tests | Limitation |
|---|---|---|
| LoCoMo | Long conversation memory: factual recall, preferences, temporal, relationships | Chatbot-focused, not task-grounded |
| LongMemEval | Long-horizon with temporal and multi-hop questions | Still conversational, not agentic |
| MemoryArena | Agentic memory + action. Near-perfect recall drops to 40-60% on tasks | Newer, less widely adopted |
| MemoryAgentBench | Cognitive science-grounded: retrieval, learning, forgetting | Academic, limited validation |
Nearly every existing benchmark tests chatbot memory -- "what did the user say 50 turns ago?" Few test what matters for production agents: did memory help complete a task, avoid a past mistake, or diagnose a recurring problem? MemoryArena is an early step in that direction, but it is not yet widely adopted. This is a wide-open research opportunity.
07 Traction and Adoption
Mem0 dominates adoption (25k stars) despite the weakest benchmark score. Zep and SuperLocalMemory lead on performance but have small communities. The simplest tool wins adoption, not the most capable one.
08 What Is Still Unsolved
The surveys and benchmark papers converge on the same open problems. These represent real research opportunities:
Principled Consolidation
When should an agent merge similar memories into a generalized fact? When should it keep individual episodes? Too aggressive and you lose detail. Too conservative and memory bloats with near-duplicates. Nobody has a good answer beyond heuristics.
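The trade-off can be seen in even the simplest heuristic: greedy merging of near-duplicates above a cosine similarity threshold. The sketch below is only that heuristic, not a principled answer. The function name `consolidate`, the toy vectors, and both threshold values are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def consolidate(memories, threshold):
    """Greedy clustering: a memory joins the first cluster whose representative it resembles."""
    clusters = []  # list of (representative_vector, [texts])
    for text, vec in memories:
        for rep, texts in clusters:
            if cosine(vec, rep) >= threshold:
                texts.append(text)
                break
        else:
            # No existing cluster is close enough, so this memory starts a new one.
            clusters.append((vec, [text]))
    return [texts for _, texts in clusters]

# Toy memories with hand-written 2-d embeddings (assumed values).
memories = [
    ("likes tea", [1.0, 0.0]),
    ("enjoys tea", [0.99, 0.05]),
    ("prefers green tea in the morning", [0.8, 0.6]),
]

conservative = consolidate(memories, threshold=0.95)
aggressive = consolidate(memories, threshold=0.75)
print(len(conservative))  # 2
print(len(aggressive))    # 1
```

At 0.95 the two near-duplicates merge and the specific preference survives as its own memory. At 0.75 everything collapses into one cluster and the "green tea in the morning" detail is at risk of being generalized away.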
Selective Forgetting
The decay model from the ACT-R cognitive architecture (memories fade with time and disuse) is a starting point, but it is too simple for agents whose tasks change week to week. A memory that has been irrelevant for a month might become critical again.
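For reference, ACT-R's base-level activation is B = ln(sum over past accesses of age^(-d)), where d is a decay rate conventionally set to 0.5. The sketch below computes it for access times measured in days. The function name and the example timelines are illustrative assumptions.

```python
import math

def base_level_activation(access_times, now, decay=0.5):
    """ACT-R base-level activation: ln of the summed power-law decay of each past access."""
    ages = [now - t for t in access_times if now > t]
    if not ages:
        return float("-inf")  # never accessed before `now`
    return math.log(sum(age ** -decay for age in ages))

# Times are in days. One memory is used repeatedly, the other only once.
frequent = base_level_activation([1, 5, 9], now=10)
stale = base_level_activation([1], now=10)
dormant = base_level_activation([0], now=30)

print(frequent > stale)  # True
print(stale > dormant)   # True
```

The `dormant` memory scores lowest purely because a month has passed. The formula has no way to know that the task it belongs to is about to come back, which is the limitation described above.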
Causal Retrieval
Current retrieval is semantic similarity or keyword matching. What agents need is causal: "what caused this?" That requires traversing cause-effect chains, not finding similar-looking text. MAGMA's causal graph is a step forward, but extraction quality is still unreliable.
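A toy version of "traverse cause-effect chains" looks like the sketch below. It assumes the causal edges already exist and are correct, which is exactly the extraction step that remains unreliable. It is not MAGMA's method. The event names and the `root_causes` helper are hypothetical.

```python
def root_causes(event, caused_by):
    """Walk cause edges backwards from an event; return ancestors with no recorded cause."""
    roots, stack, seen = [], [event], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guards against cycles in a noisy graph
        seen.add(node)
        parents = caused_by.get(node, [])
        if not parents and node != event:
            roots.append(node)
        stack.extend(parents)
    return roots

# Hypothetical causal memory: effect -> list of direct causes.
caused_by = {
    "checkout_errors": ["db_timeouts"],
    "db_timeouts": ["connection_pool_exhausted"],
    "connection_pool_exhausted": ["config_change"],
}

print(root_causes("checkout_errors", caused_by))  # ['config_change']
```

A similarity search for "checkout errors" would be unlikely to surface a configuration change, because the two descriptions share almost no vocabulary. The link only exists through the chain.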
Self-Reinforcing Errors
When agents reflect on behavior and store reflections as memories, errors compound. A wrong diagnosis becomes a "learned pattern" that biases future diagnoses. No current system detects or breaks these feedback loops.
Cross-Agent Memory
Every system is designed for an agent remembering its own history. Multi-agent systems need agents to maintain structured memories about other agents -- capabilities, past behavior, failure patterns. This is essentially unstudied.
Task-Grounded Evaluation
LoCoMo tests "can you recall what was said?" MemoryArena showed that systems with near-perfect recall drop to 40-60% on actual tasks. The field needs benchmarks that measure whether memory helps agents do their jobs better, not just whether they remember conversation facts.
09 Where This Is Headed
The agent memory space in 2026 is where vector databases were in 2023: lots of products, unclear differentiation, no standard benchmark everyone trusts. The trajectory is clear:
- Hybrid is the future. Pure vector memory will become the "SQLite of agent memory" -- fine for simple cases, insufficient for production agents that run for weeks
- Temporal reasoning is table stakes. Any system that cannot answer "when did this change?" will fall behind
- Consolidation and forgetting are the hard problems. Storing everything is easy. Knowing what to keep is the research frontier
- Benchmarks need to get real. Chatbot recall tests do not predict agent task performance
- Half of these products will consolidate or die in 2-3 years. The survivors will have hybrid memory, temporal reasoning, and auditable retrieval by default
If you are building agents today: start with Mem0 for simplicity, move to Letta if you need self-hosting, and evaluate Zep if temporal reasoning is critical. If you are doing research: consolidation, forgetting, cross-agent memory, and task-grounded evaluation are all wide open.
References
- Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers. arXiv 2603.07670, Mar 2026
- Graph-Based Agent Memory: Taxonomy, Techniques, and Applications. arXiv 2602.05665, Feb 2026
- MAGMA: A Multi-Graph based Agentic Memory Architecture. arXiv 2601.03236, Jan 2026
- GAM: Hierarchical Graph-based Agentic Memory. arXiv 2604.12285, Apr 2026
- Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management. arXiv 2601.01885, Jan 2026
- A-Mem: Agentic Memory for LLM Agents. arXiv 2502.12110, Feb 2025
- Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv 2501.13956, Jan 2025
- Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv 2504.19413, ECAI 2025
- MemGPT: Towards LLMs as Operating Systems. ICLR 2024
- Generative Agents: Interactive Simulacra of Human Behavior. Park et al., 2023
- 5 AI Agent Memory Systems Compared. DEV Community, 2026
- State of AI Agent Memory 2026. Mem0 Blog