1 The Core Result: Memory vs No Memory
50-question comparison test using GPT-5. Questions require recalling information stored in AMS.
35 out of 50 questions answered correctly with AMS. Zero without it.
| Metric | AMS + GPT-5 | GPT-5 (no memory) |
|---|---|---|
| Accuracy | 70.0% (35/50) | 0% (0/50) |
| Search Latency | 796ms median | N/A |
| Answer Latency | 13.4s median | 2.1s median |
Why this matters: The baseline model literally cannot answer these questions — it doesn't have the information. AMS provides the context that makes answers possible.
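As a sanity check on the headline number, a 95% Wilson score interval for 35/50 can be computed directly. This is a standard binomial interval added here for illustration, not part of the benchmark's own reporting:

```python
from math import sqrt

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - margin, center + margin

lo, hi = wilson_interval(35, 50)
print(f"70.0% accuracy, 95% CI: [{lo:.1%}, {hi:.1%}]")
# → 70.0% accuracy, 95% CI: [56.2%, 80.9%]
```

With only 50 questions, the interval is wide; the zero-memory baseline still sits far outside it.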
2 Accuracy by Question Type
Breaking down performance across different reasoning requirements.
Temporal queries perform best — AMS excels at "when did X happen" and "what changed between dates" due to explicit timestamp tracking in the memory graph.
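The timestamp tracking described above can be sketched roughly as follows. `MemoryNode` and `events_between` are illustrative names for this sketch, not the AMS API:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MemoryNode:
    text: str
    timestamp: datetime  # when the remembered event occurred

def events_between(nodes: list[MemoryNode], start: datetime, end: datetime) -> list[MemoryNode]:
    """Return memories whose timestamps fall in [start, end), oldest first."""
    hits = [n for n in nodes if start <= n.timestamp < end]
    return sorted(hits, key=lambda n: n.timestamp)

# Toy memory graph contents (hypothetical data)
memories = [
    MemoryNode("switched DB to Postgres", datetime(2025, 3, 2)),
    MemoryNode("launched v2 API", datetime(2025, 6, 14)),
    MemoryNode("hired first SRE", datetime(2025, 9, 1)),
]

# "What changed between May and August?" becomes a range filter
changed = events_between(memories, datetime(2025, 5, 1), datetime(2025, 8, 1))
print([m.text for m in changed])  # → ['launched v2 API']
```

Because timestamps are explicit fields rather than text to be inferred, "when" and "what changed between" questions reduce to exact range filters instead of fuzzy semantic matching.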
3 Adversarial Robustness
Full benchmark: 1,986 questions including adversarial attempts to trick the system.
What "adversarial" means: Questions designed to elicit hallucinations, test boundary conditions, and probe for information leakage. The system resisted 98.88% of these attacks.
4 Latency
End-to-end response times from the full 1,986-question benchmark.
| Stage | Median Latency |
|---|---|
| Search (retrieval from memory) | 74ms |
| Answer Generation (LLM) | 1,059ms |
| Total End-to-End | 3.8s |
Note: The 50-question comparison test showed higher latencies (796ms search, 13.4s answer) due to more complex multi-hop queries.
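One minimal way to collect per-stage medians like those in the table is to wrap each stage in a timer. The `timed` helper and the `search`/`generate` stand-ins below are hypothetical, not AMS internals; note that stage medians need not sum to the end-to-end median, which also includes network overhead and queueing (consistent with 74ms + 1,059ms < 3.8s above):

```python
import statistics
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds)."""
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, (time.perf_counter() - t0) * 1000.0

# Hypothetical stand-ins for the real retrieval and LLM calls.
def search(query):
    return ["doc1", "doc2"]

def generate(query, docs):
    return "answer"

samples = []
for q in ["q1", "q2", "q3"]:
    docs, t_search = timed(search, q)
    _, t_answer = timed(generate, q, docs)
    samples.append((t_search, t_answer))

median_search = statistics.median(t for t, _ in samples)
median_answer = statistics.median(t for _, t in samples)
print(f"median search: {median_search:.2f}ms, median answer: {median_answer:.2f}ms")
```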
5 Retrieval Metrics
How well does the search find the right memories? (50-question comparison test)
| Metric | Score | What It Means |
|---|---|---|
| Precision@K | 0.236 | 24% of retrieved docs are relevant |
| Recall@K | 0.580 | 58% of relevant docs are retrieved |
| MRR | 0.374 | Relevant doc usually in top 3 |
| NDCG | 0.423 | Ranking quality score |
Honest assessment: These retrieval numbers have room to improve. The 70% end-to-end accuracy shows the LLM can work with imperfect retrieval, but better search yields better answers. This is an active area of development.
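All four metrics in the table can be computed from a ranked result list and a ground-truth set of relevant memories. A minimal sketch on toy data (not benchmark output), using binary relevance:

```python
from math import log2

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs found in the top-k results."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Discounted gain of the actual ranking vs the ideal ranking."""
    dcg = sum(1.0 / log2(i + 1) for i, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["m3", "m7", "m1", "m9", "m2"]  # ranked search results
relevant = {"m1", "m2"}                     # ground-truth memories
print(precision_at_k(retrieved, relevant, 5))  # → 0.4
print(recall_at_k(retrieved, relevant, 5))     # → 1.0
print(mrr(retrieved, relevant))                # first hit at rank 3 → 0.333...
```

The toy numbers mirror the table's shape: recall can be high while precision stays low, and the LLM then has to sift the relevant memories out of a noisy top-k.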
📋 Methodology
- MemoryBench v1.0 — standardized benchmark suite for memory systems
- GPT-5 as the base LLM for all tests
- 50-question comparison: hand-crafted questions requiring stored context
- 1,986-question full benchmark: includes adversarial, temporal, multi-hop categories
- All latencies measured end-to-end including network overhead
- Results from January 2026
In Progress
- Token efficiency measurements
- Citation accuracy validation
- Tracking of skill improvement over time
- Multi-model comparison (Claude, Gemini, etc.)