
You picked vector vs graph for agent memory — the empirical answer is neither, pick compression

Published 2026-04-27 · Last verified 2026-04-27 · Fact-checked, 0 corrections · 12 claims, 0 independently tested · Confidence: secondary-research
Staleness risk: high — provider APIs in this area change frequently. Test specific limits and failure modes in your environment before acting.


You are choosing an agent memory layer. Mem0 (~48K stars) markets a vector + graph hybrid. Zep/Graphiti markets temporal knowledge graphs as the answer to “facts that change over time.” The framing the ecosystem hands you is vector vs graph: pick vector for semantic recall, pick graph for temporal reasoning.

The benchmarks contradict that framing. The architecture that wins is in neither category.

What you expect

Vector memory (mem0) will get you broad semantic recall but struggles with relational structure. Graph memory (mem0 graph mode, Zep/Graphiti) will get you temporal reasoning — knowing that “user preferred Python” was superseded by “user now prefers Rust” — at the cost of operational complexity. The choice is a trade-off between recall quality and infrastructure overhead.

What actually happens

On the same model, compression-based memory beats graph-based memory by 13 points. The scores below come from the LongMemEval leaderboard on Mastra’s research page: same model (gpt-4o), the 500-question LongMemEval set, an architecture-only comparison:

| System | Architecture | LongMemEval (gpt-4o) |
| --- | --- | --- |
| Mastra Observational Memory | Compression (Observer/Reflector) | 84.23% |
| gpt-4o Oracle (cheat upper bound — relevant 1–3 sessions only) | Filtered context | 82.40% |
| Supermemory | Memory graph + RAG | 81.60% |
| Zep | Temporal knowledge graph | 71.20% |
| gpt-4o full context (all ~50 sessions stuffed in) | None — raw context | 60.20% |
| Mem0 (independent test, not on Mastra leaderboard) | Vector + graph | ~49% |

Two baselines matter here. Oracle (82.4%) is a cheat configuration — it filters the input down to only the 1–3 conversations that contain the answer. It’s an upper bound; you cannot run an oracle in production without already knowing the answer. Full context (60.2%) is the realistic baseline — stuff all ~50 conversations in and let the model figure it out. These are different scores measuring different things; collapsing them is the kind of error that makes graph memory look more competitive than it actually is.

The picture that survives the same-model comparison: Mastra OM (compression) beats both the oracle and full context. Zep (temporal KG) beats full context by 11 points but loses to compression by 13. Mem0 underperforms even the full-context baseline. Cross-model results in the agent-memory-benchmarks-2026 block widen the gap further (Mastra OM reaches 94.87% on gpt-5-mini — a newer 2026-class OpenAI model; OMEGA reports 95.4% on the same model, self-reported and not independently reproduced). Scores across different models are not directly comparable, but the same-model gap above is.

Mem0’s graph layer used to defeat the temporal-reasoning promise; that bug has since been fixed. The Mem0 paper described soft-delete with temporal reasoning — old facts marked superseded but retained for temporal queries. The implementation did a destructive delete instead. PR #4188 (merged 2026-03-21) switched the graph delete path to r.valid = false, r.invalidated_at = datetime(). (mem0 PR #4188 closed the underlying mem0 Issue #4187.)
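For concreteness, here is what the two delete semantics look like as Cypher run through the official neo4j Python driver. This is an illustrative sketch, not mem0’s actual code path: the node labels, relationship type, and connection details are placeholders; only the `r.valid` / `r.invalidated_at` pair is quoted from the PR.

```python
# Contrast of the two graph-delete semantics described above, executed via
# the official neo4j Python driver. Labels (User, PREFERS, Fact) and the
# connection details are illustrative placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

DESTRUCTIVE_DELETE = """
MATCH (u:User {id: $user_id})-[r:PREFERS]->(f:Fact {name: $fact})
DELETE r
"""  # pre-patch behavior: the old fact is gone, temporal queries cannot see it

SOFT_DELETE = """
MATCH (u:User {id: $user_id})-[r:PREFERS]->(f:Fact {name: $fact})
SET r.valid = false, r.invalidated_at = datetime()
"""  # post-PR-#4188 behavior: the edge survives, marked superseded

with driver.session() as session:
    session.run(SOFT_DELETE, user_id="u42", fact="python")
    # A temporal query can still ask "what did u42 prefer before date X?"
    rows = session.run(
        "MATCH (:User {id: $user_id})-[r:PREFERS]->(f:Fact) "
        "WHERE r.invalidated_at IS NOT NULL "
        "RETURN f.name AS fact, r.invalidated_at AS superseded_at",
        user_id="u42",
    )
    for record in rows:
        print(record["fact"], record["superseded_at"])
driver.close()
```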

The benchmark gap above predates that patch and was not closed by it. Mastra OM scored 84.23% on gpt-4o in February 2026; the patch landed five weeks later. A delete-semantics fix would not lift mem0’s ~49% LongMemEval score onto compression’s curve — the underperformance sits in extraction and retrieval, not in delete behavior.

Mem0’s TypeScript SDK has graph features locked to OpenAI. Issue #3711 (label: sdk-typescript) documents MemoryGraph.structuredLlm hardcoded to "openai_structured" in the TypeScript SDK, with Anthropic, Groq, and other providers failing on the graph pipeline. The issue was closed as duplicate in March 2026; the underlying fix may track under another issue.

The Python SDK’s status on the same constraint has not been independently verified — re-test against your provider before relying on graph memory with a non-OpenAI model. The cloud product (Mem0g) requires a $249/month Pro tier, also OpenAI-only.

Self-hosted Graphiti has a user-reported async event loop conflict. Embedding graphiti-core directly in FastAPI or LangGraph — the most common production Python agent stack — has been reported to produce RuntimeError: Future attached to a different loop under real async load. (User-reported, no public tracking issue. The reported workaround is to run graphiti-core in its own subprocess with HTTP/queue communication.) This is also not in the README. (covered in detail in graph-memory-self-hosted-not-production-ready)
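A minimal sketch of that subprocess workaround, under stated assumptions: the graphiti-core entry points used below (`Graphiti(uri, user, password)` and `add_episode`) should be checked against the version you run, and the queue protocol is illustrative. The load-bearing part is the process boundary, which gives graphiti-core its own interpreter and its own event loop.

```python
# Subprocess isolation: run graphiti-core in its own process with its own
# asyncio loop, fed over a multiprocessing queue from the FastAPI/LangGraph
# process. The graphiti-core calls (Graphiti, add_episode) are assumptions
# about its API -- verify against your installed version.
import asyncio
import multiprocessing as mp
from datetime import datetime, timezone

def graphiti_worker(queue: mp.Queue) -> None:
    from graphiti_core import Graphiti  # imported only in the child process

    async def run() -> None:
        client = Graphiti("bolt://localhost:7687", "neo4j", "password")
        while True:
            episode = queue.get()  # blocking is fine: this loop owns the process
            if episode is None:    # shutdown sentinel
                break
            await client.add_episode(
                name=episode["name"],
                episode_body=episode["body"],
                source_description="agent session",
                reference_time=datetime.now(timezone.utc),
            )
    asyncio.run(run())  # the only event loop graphiti-core ever sees

if __name__ == "__main__":
    q: mp.Queue = mp.Queue()
    worker = mp.Process(target=graphiti_worker, args=(q,), daemon=True)
    worker.start()
    # In a FastAPI handler, enqueue instead of awaiting graphiti directly:
    q.put({"name": "session-123", "body": "user now prefers Rust over Python"})
    q.put(None)
    worker.join()
```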

Coding agents have already voted with their architectures. Every shipping coding agent surveyed here uses compression-based memory. The table below lists architectural patterns only; the underlying agents run on different LLMs, so this is a cross-model comparison of architecture choices, not a same-model benchmark:

| Agent | Memory pattern | Graph store? |
| --- | --- | --- |
| Claude Code | CLAUDE.md + MEMORY.md (flat files, 200-line load limit) | No |
| Windsurf | Auto-generated memories from ~48 hours of codebase analysis | No |
| Cursor | Rules files + context injection | No |
| claude-mem (~26K stars) | Session compression + injection | No |

Four independently developed coding agents — built by teams that compete on agent memory quality — converged on the same architectural pattern: compress past sessions, inject relevant fragments into the next context. None of them use a graph store. None of them use a vector DB. (source: agent-memory-landscape)
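The shared pattern is small enough to sketch. A minimal version in Python, assuming an OpenAI-compatible chat client; the model name, file layout, digest prompt, and the Observer/Reflector framing borrowed from Mastra’s naming are all illustrative, not any agent’s shipping implementation.

```python
# Minimal compression-memory loop: after each session, distill a digest;
# before the next session, inject stored digests into the system prompt.
# No vector index, no graph store -- flat files, like CLAUDE.md/MEMORY.md.
from pathlib import Path
from openai import OpenAI

client = OpenAI()
MEMORY_FILE = Path("MEMORY.md")
MAX_MEMORY_LINES = 200  # Claude Code enforces a similar flat-file load limit

def compress_session(transcript: str) -> str:
    """Observer step: reduce a finished session to a durable digest."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any capable model works
        messages=[
            {"role": "system", "content": "Summarize this agent session into "
             "durable facts, decisions, and user preferences. Note superseded "
             "facts explicitly (e.g. 'previously preferred Python, now Rust')."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content

def remember(transcript: str) -> None:
    with MEMORY_FILE.open("a") as f:
        f.write(compress_session(transcript) + "\n")

def build_context(task: str) -> list[dict]:
    """Reflector step: the next agent reads digests, not raw history."""
    text = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""
    memory = "\n".join(text.splitlines()[-MAX_MEMORY_LINES:])
    return [
        {"role": "system", "content": f"Prior session memory:\n{memory}"},
        {"role": "user", "content": task},
    ]
```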

Token efficiency points in the same direction. Zep’s temporal-KG retrieval uses ~1.6K tokens of context to score 71.2% on LongMemEval (gpt-4o), versus the full ~115K-token context baseline at 60.2%. Both numbers come from the same Mastra leaderboard above. The lesson is not that graphs are good — it is that compressing the right context beats stuffing all the context in. Compression architectures generalize this without paying for a graph store.
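The arithmetic, with an assumed input price for illustration (check current rates before budgeting):

```python
# Rough cost arithmetic for the two retrieval strategies above.
# The per-token price is an assumption for illustration, not a quoted rate.
PRICE_PER_M_INPUT = 2.50           # USD per 1M input tokens, assumed

full_context_tokens = 115_000      # all ~50 sessions stuffed in (scores 60.2%)
zep_retrieval_tokens = 1_600       # temporal-KG retrieval (scores 71.2%)

print(full_context_tokens / zep_retrieval_tokens)            # ~72x fewer tokens
print(full_context_tokens * PRICE_PER_M_INPUT / 1_000_000)   # ~$0.29 per question
print(zep_retrieval_tokens * PRICE_PER_M_INPUT / 1_000_000)  # ~$0.004 per question
```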

What this means for you

If you picked Mem0 for “vector + graph hybrid”: With a non-OpenAI provider in the TypeScript SDK, the graph pipeline returns HTTP 401 — graph mode is effectively off. With OpenAI, the graph layer now soft-deletes (the hard-delete defect was fixed in March 2026), but mem0’s overall LongMemEval score (~49%) is well below the compression baseline. You are paying the operational cost of running Neo4j to get a vector store with worse semantics than treating the conversation as a single context.

If you picked Zep/Graphiti for “real” temporal reasoning: Self-hosted Graphiti has a user-reported async event loop bug that surfaces in production, not in development. Local tests pass; the production deploy degrades under concurrent async load. Zep Cloud avoids this but adds a vendor dependency, and it changed its default OpenAI model from gpt-4o to gpt-4o-mini in v0.27.1 without announcement — pin your model explicitly or accept silent quality regression on upgrade.

If you are choosing now: The vector-vs-graph dichotomy is a category error. The empirically validated answer is compression: maintain a per-session digest, inject relevant fragments into the next agent context, do not run a graph store. Mastra OM, OMEGA, Hindsight, and the entire coding-agent cohort are independent confirmations of the same pattern.

If your use case actually requires temporal reasoning (legal, audit, “what did the user used to believe”): Mem0’s graph delete is no longer a hard wipe (PR #4188), but no shipping OSS implementation has been independently benchmarked on the temporal-reasoning sub-task. Graphiti has the data model but the operational hazard. Zep Cloud is the most plausible path, with a vendor commitment and a model-pinning gotcha.

What to do

For most agent memory use cases: Use compression. Mastra OM if you are in TypeScript and want a framework. claude-mem or session-injection patterns if you are building on Claude Code. The compression target is a per-session digest the next agent reads — not a vector index, not a graph.

If you are committed to a retrieval-based architecture: Use Mem0 self-hosted in vector-only mode against Qdrant or PgVector. It works with any LLM provider in vector mode. Do not enable graph features on the TypeScript SDK with a non-OpenAI provider until you have re-verified the fix status against your provider; on the Python SDK, test before relying on it.
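A sketch of that vector-only configuration for the Python SDK, assuming mem0’s documented `Memory.from_config` dict shape; key names drift between mem0 versions, so treat this as the shape to verify rather than copy. Note that mem0’s default embedder is OpenAI, so a fully non-OpenAI stack needs the embedder set explicitly too.

```python
# mem0 self-hosted in vector-only mode: no graph_store key, Qdrant as the
# vector backend, a non-OpenAI LLM and embedder. Config keys follow the
# documented Memory.from_config dict shape -- verify against your version.
from mem0 import Memory

config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {"host": "localhost", "port": 6333},
    },
    "llm": {
        "provider": "anthropic",   # vector mode works with non-OpenAI providers
        "config": {"model": "claude-sonnet-4-20250514"},
    },
    "embedder": {
        "provider": "huggingface",  # avoids the OpenAI-embeddings default
        "config": {"model": "sentence-transformers/all-MiniLM-L6-v2"},
    },
    # Deliberately no "graph_store" block: graph mode carries the OpenAI
    # lock-in and the delete-semantics history discussed above.
}

m = Memory.from_config(config)
m.add("User now prefers Rust over Python", user_id="u42")
print(m.search("what language does the user prefer?", user_id="u42"))
```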

If temporal reasoning is genuinely a requirement: Use Zep Cloud, accept the vendor dependency, and pin your OpenAI model explicitly to avoid silent quality regression. Do not embed graphiti-core directly in FastAPI or LangGraph — run it in a subprocess. And before relying on temporal queries, verify that your chosen implementation actually soft-deletes.
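One way to run that verification is a three-step probe: assert a fact, supersede it, then ask whether the superseded fact is still reachable. The interface below is a hypothetical stand-in for whichever tool you evaluate; in particular, the `as_of` temporal argument is an assumption, not a real API.

```python
# Soft-delete probe: assert a fact, supersede it, then confirm a temporal
# query can still reach the superseded fact. The `memory` object and its
# methods (add, search, the as_of argument) are hypothetical stand-ins for
# whichever memory layer you are evaluating.
from datetime import datetime, timezone

def probe_soft_delete(memory) -> bool:
    memory.add("user prefers Python", user_id="probe")
    t_mid = datetime.now(timezone.utc)   # moment between the two assertions
    memory.add("user now prefers Rust", user_id="probe")  # should supersede, not wipe

    current = memory.search("preferred language", user_id="probe")
    past = memory.search("preferred language", user_id="probe", as_of=t_mid)

    # Pass iff the current answer is Rust AND the pre-supersession answer is
    # still reachable. A destructive delete (pre-PR-#4188 behavior) fails
    # the second check.
    return "rust" in str(current).lower() and "python" in str(past).lower()
```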

Do not pick based on stars. Mem0 is the star leader and scores ~49% on LongMemEval — below the 60.2% baseline of just dumping the whole conversation into a gpt-4o context window. Star count measures awareness; benchmark scores measure capability; convergent architectures across independent teams measure what works.

The evidence table below collects scores reported by each project; some entries are across models (gpt-4o, gpt-5-mini) and not directly comparable, which is called out in the affected rows.

Evidence

| Claim | Source | Verified |
| --- | --- | --- |
| Mastra OM 84.23% on LongMemEval (gpt-4o) | Mastra LongMemEval leaderboard | 2026-04-27 |
| Zep 71.20% on LongMemEval (gpt-4o) | Mastra LongMemEval leaderboard — same source as Mastra OM and the baseline rows, ensuring same-model comparison | 2026-04-27 |
| gpt-4o Oracle 82.40% (filtered to relevant 1–3 sessions only — upper-bound cheat) | Mastra LongMemEval leaderboard | 2026-04-27 |
| gpt-4o Full context 60.20% (all ~50 sessions stuffed in — realistic baseline) | Mastra LongMemEval leaderboard | 2026-04-27 |
| Mem0 ~49% on LongMemEval (independent test, not on Mastra leaderboard) | LongMemEval independent reproductions; vectorize.io and dev.to comparisons. Different evaluation run than the Mastra leaderboard rows above; treat the cross-source comparison as directional, not exact | 2026-03-21 |
| OMEGA 95.4% (self-reported, gpt-5-mini) | OMEGA project documentation; single developer, not independently reproduced; cross-model, not directly comparable to gpt-4o rows | 2026-03-21 |
| Mem0 graph delete semantics patched (soft-delete via PR #4188) | Mem0 Issue #4187 (closed-completed 2026-03-21) | 2026-04-27 |
| Mem0 graph in TypeScript SDK locked to OpenAI provider | Mem0 Issue #3711 (label sdk-typescript; closed as duplicate Mar 2026; Python SDK status unverified) | 2026-04-27 |
| Graphiti async event loop conflict | User-reported, no public tracking issue; covered in graph-memory-self-hosted-not-production-ready | 2026-04-20 |
| Coding agents converged on compression | Claude Code (CLAUDE.md/MEMORY.md), Windsurf (auto-memories), Cursor (rules), claude-mem (session compression) | 2026-04-25 |

Confidence: secondary-research — based on public benchmark leaderboards, vendor-reported scores, and source-reviewed GitHub issues. LongMemEval scores are vendor-reported; OMEGA’s 95.4% claim has not been independently reproduced. The ~49% mem0 score comes from independent reproductions but is itself secondary to those reports.

Open questions (Apr 2026): Now that mem0 graph soft-deletes (post-PR #4188), does its LongMemEval score change materially? Has the Python SDK been verified for non-OpenAI graph providers, or only the TypeScript SDK? Is there a public benchmark on which graph-based memory beats compression-based memory on the same model? Does Graphiti’s mcp-v1.0.2 release fix the user-reported async event loop conflict?

Falsification criterion: A LongMemEval, LoCoMo, or MemoryBench result where a graph-based memory tool (mem0 graph mode, Zep/Graphiti, Cognee) beats Mastra OM, OMEGA, or another compression-based system on the same model and dataset would disprove this finding; an independent benchmark of mem0 post-PR #4188 showing graph memory decisively beats compression for temporal-reasoning queries would partially falsify it.

Seen different? Contribute your evidence
