LLM + INDB Benchmark Notes (Technical)
This document explains the practical uplift coefficients used in public-facing materials. It is not a synthetic leaderboard; it is an architecture-level performance model for LLM + INDB vs LLM-only deployments.
Scope
Compared families:
- ChatGPT-class models
- Claude-class models
- Gemini-class models
- Grok-class models
- DeepSeek-class models
Target workload:
- multi-session assistants
- long-context support workflows
- retrieval-heavy reasoning with memory reuse
Not targeted:
- single-turn chat
- short, stateless Q&A
Measurement Model
Baseline normalization:
1.00 = same model family without the INDB memory layer
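A minimal sketch of this normalization, assuming each KPI is measured as a positive number for both the LLM-only baseline and the LLM + INDB run. The function name `normalize` and the sample latency values are hypothetical and do not come from a measured run.

```python
def normalize(with_indb: float, baseline: float) -> float:
    """Return a KPI coefficient relative to the LLM-only baseline (1.00)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return with_indb / baseline


# Hypothetical p95 latencies in milliseconds (illustrative only).
coefficient = normalize(with_indb=1150.0, baseline=1000.0)
print(f"x{coefficient:.2f}")
```

A coefficient above 1.00 is an improvement for gain-type KPIs (consistency, memory reuse) and a cost for latency overhead; for the hallucination factor, lower is better.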
Primary KPIs:
- long-run consistency
- hallucination reduction factor
- memory reuse factor
- latency overhead factor
- integrated quality/cost uplift
The integrated score is interpreted as:
(quality_gain * consistency_gain * hallucination_penalty_reduction) / latency_and_cost_penalty
This is intentionally operational, not academic.
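The formula above can be sketched directly in code. This is an illustration under stated assumptions: the notes do not specify how each input is derived from the per-KPI coefficients, so the function takes the four terms exactly as named in the formula, and the sample inputs are hypothetical.

```python
def integrated_score(
    quality_gain: float,
    consistency_gain: float,
    hallucination_penalty_reduction: float,
    latency_and_cost_penalty: float,
) -> float:
    """Combine the gain terms and divide by the latency/cost penalty."""
    if latency_and_cost_penalty <= 0:
        raise ValueError("latency_and_cost_penalty must be positive")
    gains = quality_gain * consistency_gain * hallucination_penalty_reduction
    return gains / latency_and_cost_penalty


# Hypothetical inputs, not measured values.
score = integrated_score(
    quality_gain=1.10,
    consistency_gain=1.30,
    hallucination_penalty_reduction=1.05,
    latency_and_cost_penalty=1.15,
)
print(f"integrated uplift: x{score:.2f}")
```

With every term at 1.00 the score is 1.00, which matches the baseline normalization.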
Coefficient Ranges
| Metric | ChatGPT | Claude | Gemini | Grok | DeepSeek |
|---|---|---|---|---|---|
| Long-run consistency | x1.25-x1.45 | x1.20-x1.40 | x1.25-x1.50 | x1.20-x1.45 | x1.30-x1.55 |
| Hallucination factor (lower is better) | x0.65-x0.85 | x0.70-x0.88 | x0.60-x0.82 | x0.65-x0.86 | x0.60-x0.80 |
| Memory reuse | x1.6-x2.4 | x1.5-x2.2 | x1.7-x2.5 | x1.5-x2.3 | x1.8-x2.6 |
| Latency overhead | x1.08-x1.22 | x1.10-x1.25 | x1.07-x1.20 | x1.06-x1.18 | x1.08-x1.22 |
| Integrated quality/cost uplift | x1.30-x1.60 | x1.25-x1.55 | x1.35-x1.70 | x1.25-x1.60 | x1.35-x1.75 |
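One way to use these ranges in practice is to check whether a coefficient observed in your own deployment falls inside the published range. The sketch below copies the long-run consistency row from the table; the helper name `within_range` and the observed values are hypothetical.

```python
# Long-run consistency ranges from the table above: (low, high).
LONG_RUN_CONSISTENCY = {
    "ChatGPT": (1.25, 1.45),
    "Claude": (1.20, 1.40),
    "Gemini": (1.25, 1.50),
    "Grok": (1.20, 1.45),
    "DeepSeek": (1.30, 1.55),
}


def within_range(family: str, observed: float, ranges=LONG_RUN_CONSISTENCY) -> bool:
    """Return True if an observed coefficient lies inside the published range."""
    low, high = ranges[family]
    return low <= observed <= high


# Hypothetical observed coefficients from a local run.
print(within_range("Claude", 1.30))
print(within_range("Claude", 1.45))
```

An out-of-range result is not necessarily an error, since the coefficients are deployment-profile ranges, but it is worth reporting alongside the run configuration.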
Why INDB Moves These Metrics
INDB shifts memory work from token-window pressure to an interpretational memory path:
- anchor-based ingestion (events)
- horizontal retrieval (echo, subwave)
- read-time interpretation (slice, what-if, Prism overlay)
- signed memory contract (response integrity)
This generally improves cross-session stability and memory reuse while adding moderate read-path overhead (roughly x1.06-x1.25 latency across the families in the table above).
Caveats
- Coefficients are deployment-profile ranges, not fixed constants.
- Ranking can change by domain (support vs coding vs legal).
- mode=llm / mode=both in what-if adds model latency; mode=core isolates the INDB path.
- For fair comparisons, disable unrelated external connectors and keep prompt policy fixed.
Reproducibility Checklist
- Pin model family/version for each run
- Fix test set and random seed policy
- Log INDB mode (core, llm, both)
- Capture p50/p95 latency
- Track contradiction rate and factual rollback rate
- Report both absolute values and normalized coefficients
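The checklist can be captured as a per-run record. The sketch below is one possible shape, not an INDB schema: the `RunRecord` class, its field names, and the sample values are all hypothetical, and the percentile uses the nearest-rank method as an assumption since the notes do not name one.

```python
import math
from dataclasses import dataclass

VALID_MODES = {"core", "llm", "both"}


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass
class RunRecord:
    model_version: str          # pinned model family/version
    test_set_id: str            # fixed test set
    seed: int                   # random seed policy
    indb_mode: str              # one of: core, llm, both
    latencies_ms: list          # raw per-request latencies
    contradiction_rate: float
    factual_rollback_rate: float

    def __post_init__(self):
        if self.indb_mode not in VALID_MODES:
            raise ValueError(f"unknown INDB mode: {self.indb_mode}")

    def latency_summary(self):
        """Return the p50 and p95 latency for this run."""
        return {
            "p50": percentile(self.latencies_ms, 50),
            "p95": percentile(self.latencies_ms, 95),
        }


# Hypothetical run; every value here is illustrative.
record = RunRecord(
    model_version="example-model-v1",
    test_set_id="support-set-a",
    seed=42,
    indb_mode="core",
    latencies_ms=[100.0, 120.0, 110.0, 300.0, 105.0,
                  115.0, 130.0, 125.0, 118.0, 112.0],
    contradiction_rate=0.06,
    factual_rollback_rate=0.02,
)
print(record.latency_summary())
```

Keeping the raw latencies rather than only the summary lets absolute values and normalized coefficients be recomputed later.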
Recommended Reporting Format
For each model family, publish:
- baseline (LLM-only)
- LLM + INDB (core)
- LLM + INDB (both)
- delta and coefficient per KPI
This keeps architecture decisions transparent and auditable.
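A minimal sketch of how such a report could be assembled, assuming each configuration is summarized as a mapping from KPI name to absolute value. The function name `kpi_report`, the KPI keys, and all numbers are hypothetical.

```python
def kpi_report(baseline: dict, variants: dict) -> list:
    """Build one row per (configuration, KPI) with delta and coefficient."""
    rows = []
    for config, values in variants.items():
        for kpi, base in baseline.items():
            value = values[kpi]
            rows.append({
                "config": config,
                "kpi": kpi,
                "baseline": base,
                "value": value,
                "delta": value - base,          # absolute difference
                "coefficient": value / base,    # normalized to baseline 1.00
            })
    return rows


# Hypothetical absolute measurements for one model family.
baseline = {"p95_latency_ms": 1000.0, "contradiction_rate": 0.08}
variants = {
    "LLM + INDB (core)": {"p95_latency_ms": 1100.0, "contradiction_rate": 0.06},
    "LLM + INDB (both)": {"p95_latency_ms": 1200.0, "contradiction_rate": 0.05},
}

rows = kpi_report(baseline, variants)
for row in rows:
    print(f'{row["config"]:<18} {row["kpi"]:<20} '
          f'delta={row["delta"]:+.3f} coeff=x{row["coefficient"]:.2f}')
```

Publishing both the delta and the coefficient per row makes it possible to audit the normalized figures against the absolute measurements.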