1 The Core Result: Memory vs No Memory
50-question comparison test using GPT-5. Questions require recalling information stored in AMS.
35 out of 50 questions answered correctly with AMS. Zero without it.
| Metric | AMS + GPT-5 | GPT-5 (no memory) |
|---|---|---|
| Accuracy | 70.0% (35/50) | 0% (0/50) |
| Search Latency | 796ms median | N/A |
| Answer Latency | 13.4s median | 2.1s median |
Why this matters: The baseline model literally cannot answer these questions — it doesn't have the information. AMS provides the context that makes answers possible.
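As a sanity check on the headline number, a 95% Wilson score interval for 35/50 can be computed directly. This is a standard binomial interval added here for illustration, not part of the benchmark's own reporting:

```python
from math import sqrt

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - margin, center + margin

lo, hi = wilson_interval(35, 50)
print(f"70.0% accuracy, 95% CI: [{lo:.1%}, {hi:.1%}]")
# → 70.0% accuracy, 95% CI: [56.2%, 80.9%]
```

With only 50 questions, the interval is wide; the zero-memory baseline still sits far outside it.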
2 Accuracy by Question Type
Breaking down performance across different reasoning requirements.
Temporal queries perform best — AMS excels at "when did X happen" and "what changed between dates" due to explicit timestamp tracking in the memory graph.
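The timestamp tracking described above can be sketched roughly as follows. `MemoryNode` and `events_between` are illustrative names for this sketch, not the AMS API:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MemoryNode:
    text: str
    timestamp: datetime  # when the remembered event occurred

def events_between(nodes: list[MemoryNode], start: datetime, end: datetime) -> list[MemoryNode]:
    """Return memories whose timestamps fall in [start, end), oldest first."""
    hits = [n for n in nodes if start <= n.timestamp < end]
    return sorted(hits, key=lambda n: n.timestamp)

# Toy memory graph contents (hypothetical data)
memories = [
    MemoryNode("switched DB to Postgres", datetime(2025, 3, 2)),
    MemoryNode("launched v2 API", datetime(2025, 6, 14)),
    MemoryNode("hired first SRE", datetime(2025, 9, 1)),
]

# "What changed between May and August?" becomes a range filter
changed = events_between(memories, datetime(2025, 5, 1), datetime(2025, 8, 1))
print([m.text for m in changed])  # → ['launched v2 API']
```

Because timestamps are explicit fields rather than text to be inferred, "when" and "what changed between" questions reduce to exact range filters instead of fuzzy semantic matching.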
3 Adversarial Robustness
Full benchmark: 1,986 questions including adversarial attempts to trick the system.
What "adversarial" means: Questions designed to elicit hallucinations, test boundary conditions, and probe for information leakage. The system resisted 98.88% of these attacks.
4 Latency
End-to-end response times from the full 1,986-question benchmark.
| Stage | Median Latency |
|---|---|
| Search (retrieval from memory) | 74ms |
| Answer Generation (LLM) | 1,059ms |
| Total End-to-End | 3.8s |
Note: The 50-question comparison test showed higher latencies (796ms search, 13.4s answer) due to more complex multi-hop queries.
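One minimal way to collect per-stage medians like those in the table is to wrap each stage in a timer. The `timed` helper and the `search`/`generate` stand-ins below are hypothetical, not AMS internals; note that stage medians need not sum to the end-to-end median, which also includes network overhead and queueing (consistent with 74ms + 1,059ms < 3.8s above):

```python
import statistics
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds)."""
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, (time.perf_counter() - t0) * 1000.0

# Hypothetical stand-ins for the real retrieval and LLM calls.
def search(query):
    return ["doc1", "doc2"]

def generate(query, docs):
    return "answer"

samples = []
for q in ["q1", "q2", "q3"]:
    docs, t_search = timed(search, q)
    _, t_answer = timed(generate, q, docs)
    samples.append((t_search, t_answer))

median_search = statistics.median(t for t, _ in samples)
median_answer = statistics.median(t for _, t in samples)
print(f"median search: {median_search:.2f}ms, median answer: {median_answer:.2f}ms")
```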
5 Retrieval Metrics
How well does the search find the right memories? (50-question comparison test)
| Metric | Score | What It Means |
|---|---|---|
| Precision@K | 0.236 | 24% of retrieved docs are relevant |
| Recall@K | 0.580 | 58% of relevant docs are retrieved |
| MRR | 0.374 | Relevant doc usually in top 3 |
| NDCG | 0.423 | Ranking quality score |
Honest assessment: These retrieval numbers have room to improve. The 70% end-to-end accuracy shows the LLM can work with imperfect retrieval, but better search yields better answers. This is an active area of development.
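All four metrics in the table can be computed from a ranked result list and a ground-truth set of relevant memories. A minimal sketch on toy data (not benchmark output), using binary relevance:

```python
from math import log2

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs found in the top-k results."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Discounted gain of the actual ranking vs the ideal ranking."""
    dcg = sum(1.0 / log2(i + 1) for i, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["m3", "m7", "m1", "m9", "m2"]  # ranked search results
relevant = {"m1", "m2"}                     # ground-truth memories
print(precision_at_k(retrieved, relevant, 5))  # → 0.4
print(recall_at_k(retrieved, relevant, 5))     # → 1.0
print(mrr(retrieved, relevant))                # first hit at rank 3 → 0.333...
```

The toy numbers mirror the table's shape: recall can be high while precision stays low, and the LLM then has to sift the relevant memories out of a noisy top-k.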
📋 Methodology
- MemoryBench v1.0 — standardized benchmark suite for memory systems
- GPT-5 as the base LLM for all tests
- 50-question comparison: hand-crafted questions requiring stored context
- 1,986-question full benchmark: includes adversarial, temporal, multi-hop categories
- All latencies measured end-to-end including network overhead
- Results from January 2026
In Progress
- Token efficiency measurements
- Citation accuracy validation
- Tracking of skill improvement over time
- Multi-model comparison (Claude, Gemini, etc.)