1 The Core Result: Memory vs No Memory

50-question comparison test using GPT-5. Questions require recalling information stored in AMS.

AMS + GPT-5 Accuracy: 70%
GPT-5 Alone (Baseline): 0%

35 out of 50 questions answered correctly with AMS. Zero without it.

| Metric | AMS + GPT-5 | GPT-5 (no memory) |
|---|---|---|
| Accuracy | 70.0% (35/50) | 0% (0/50) |
| Search Latency | 796ms median | N/A |
| Answer Latency | 13.4s median | 2.1s median |

Why this matters: The baseline model literally cannot answer these questions — it doesn't have the information. AMS provides the context that makes answers possible.
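A minimal sketch of how a with/without-memory comparison like this could be scored. It assumes each question has a gold answer and that correctness is a normalized string match; the names `Question`, `score_run`, and the toy answer functions are hypothetical, and the grading procedure actually used for these results is not described here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Question:
    text: str
    gold: str  # expected answer (hypothetical schema)


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(s.lower().split())


def score_run(questions: Iterable[Question],
              answer_fn: Callable[[str], str]) -> tuple[int, int]:
    """Return (correct, total) for one system under test."""
    correct = total = 0
    for q in questions:
        total += 1
        if normalize(answer_fn(q.text)) == normalize(q.gold):
            correct += 1
    return correct, total


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage; 0.0 for an empty run."""
    return 100.0 * correct / total if total else 0.0


# Toy stand-ins (assumptions): a memory lookup vs. a model with no stored context.
MEMORY = {"where does alice work?": "Acme Corp"}


def with_memory(question: str) -> str:
    return MEMORY.get(normalize(question), "I don't know")


def without_memory(question: str) -> str:
    return "I don't know"


questions = [
    Question("Where does Alice work?", "acme corp"),
    Question("When did Bob move?", "March 2024"),
]

print(score_run(questions, with_memory))     # scored against the memory-backed system
print(score_run(questions, without_memory))  # scored against the baseline
print(accuracy_pct(35, 50))                  # the reported 35/50 expressed as a percentage
```

Running both systems over the same question set with the same grader keeps the comparison apples-to-apples: the only variable is whether stored context is available.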

2 Accuracy by Question Type

Breaking down performance across different reasoning requirements.

Temporal Queries: 85.7%
Single-hop Queries: 68.4%
Multi-hop Queries: 66.7%

Temporal queries perform best — AMS excels at "when did X happen" and "what changed between dates" due to explicit timestamp tracking in the memory graph.
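A short sketch of how a per-type breakdown can be tallied from graded results. The per-type question counts are not published in this section; the counts used below (6/7, 13/19, 16/24) are an assumption, chosen because they reproduce the three reported percentages and sum to the reported 35/50. The function name `accuracy_by_type` is hypothetical.

```python
from collections import defaultdict


def accuracy_by_type(results):
    """results: iterable of (question_type, is_correct) pairs.

    Returns {question_type: (correct, total, percent rounded to 1 decimal)}.
    """
    tally = defaultdict(lambda: [0, 0])  # type -> [correct, total]
    for qtype, ok in results:
        tally[qtype][1] += 1
        if ok:
            tally[qtype][0] += 1
    return {t: (c, n, round(100.0 * c / n, 1)) for t, (c, n) in tally.items()}


# ASSUMED counts (not from the source): consistent with the reported percentages.
assumed = {"temporal": (6, 7), "single-hop": (13, 19), "multi-hop": (16, 24)}

# Expand the counts into one (type, is_correct) pair per question.
results = [(t, i < c) for t, (c, n) in assumed.items() for i in range(n)]

breakdown = accuracy_by_type(results)
for qtype, (c, n, pct) in breakdown.items():
    print(f"{qtype}: {c}/{n} = {pct}%")
```

If these counts are right, the temporal category rests on only 7 questions, so its lead over the other two categories is the least certain of the three figures.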

3 Adversarial Robustness

Full benchmark: 1,986 questions, 446 of which are adversarial attempts to trick the system.

Adversarial Accuracy: 98.88%
Adversarial Questions Passed: 441/446

What "adversarial" means: Questions designed to elicit hallucinations, test boundary conditions, and probe for information leakage. The system resisted 441 of 446 such questions (98.88%).

4 Latency

End-to-end response times from the full 1,986-question benchmark.

| Stage | Median Latency |
|---|---|
| Search (retrieval from memory) | 74ms |
| Answer Generation (LLM) | 1,059ms |
| Total End-to-End | 3.8s |

Note: The 50-question comparison test showed higher latencies (796ms search, 13.4s answer) due to more complex multi-hop queries.
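When reading median latencies per stage, it helps to remember that medians are not additive: the median of the end-to-end time need not equal the sum of the stage medians. The sketch below illustrates this with made-up timings (not benchmark data); whether this accounts for the figures reported here is not stated in the source.

```python
from statistics import median

# Made-up per-request timings in milliseconds as (search, answer) pairs.
# These are illustrative values, NOT measurements from the benchmark.
timings = [(50, 900), (60, 5000), (74, 1000), (900, 1059), (80, 4000)]

search_med = median(s for s, _ in timings)      # median of the search stage alone
answer_med = median(a for _, a in timings)      # median of the answer stage alone
total_med = median(s + a for s, a in timings)   # median of per-request totals

print(f"search median: {search_med} ms")
print(f"answer median: {answer_med} ms")
print(f"sum of stage medians: {search_med + answer_med} ms")
print(f"median of totals: {total_med} ms")
```

Each median picks out a different request, so the stage medians and the end-to-end median should be read as three separate summaries rather than as parts of one sum.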

5 Retrieval Metrics

How well does the search find the right memories? (50-question comparison test)

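| Metric | Score | What It Means |
|---|---|---|
| Precision@K | 0.236 | 24% of retrieved docs are relevant |
| Recall@K | 0.580 | 58% of relevant docs are retrieved |
| MRR | 0.374 | First relevant doc typically appears near rank 3 (1/0.374 ≈ 2.7) |
| NDCG | 0.423 | Ranking quality score (1.0 = ideal ordering) |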

Honest assessment: These retrieval numbers have room to improve. The 70% end-to-end accuracy shows the LLM can work with imperfect retrieval, but better search should produce better answers. This is an active area of development.
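For reference, here is a minimal sketch of the standard textbook definitions of these four metrics, assuming binary relevance (a doc is either relevant or not). The function names, the doc IDs, and the choice of K are illustrative assumptions; the exact variants and K used for the scores reported here are not specified.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant docs (divides by k by convention)."""
    if k <= 0:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none is retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over an iterable of (retrieved, relevant) pairs, one per query."""
    rrs = [reciprocal_rank(ret, rel) for ret, rel in runs]
    return sum(rrs) / len(rrs) if rrs else 0.0


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the actual ranking over DCG of an ideal one."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical query: five retrieved memory IDs, three truly relevant ones.
retrieved = ["m7", "m2", "m9", "m4", "m1"]
relevant = {"m2", "m4", "m8"}

print(precision_at_k(retrieved, relevant, 5))
print(recall_at_k(retrieved, relevant, 5))
print(reciprocal_rank(retrieved, relevant))
print(ndcg_at_k(retrieved, relevant, 5))
```

Per-query values like these are averaged over all queries to produce benchmark-level scores. Precision and recall ignore ordering within the top K, while MRR and NDCG reward placing relevant memories earlier.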

📋 Methodology

In Progress

Run Your Own Benchmarks

MemoryBench is open source. Test AMS against your own data.
