SlimeTree-RLM — Measurement procedure and primary materials
For evaluators and procurement. This page documents the procedure, rubrics, and LLM configurations behind the -20.4 ± 0.3 pt architectural constant (the change in incorrect rate), measured on 3 external benchmarks with 3 seeds each (2,290 distinct questions × 3 seeds = 6,870 trials). It also covers the 4-LLM cross-validation conditions, paper v10, and access information for patent claims 1-44.
For product overview and deployment scenarios, see the product page (/products/device/slimetree-rlm/). This page focuses on primary materials for reproduction and verification.
1. Evaluation data and external benchmarks (Open)
We measured against external public benchmarks, not in-house ones. The author, difficulty axis, scale, and scoring metric of each benchmark are listed below. All conditions needed for reproduction are public, so a same-condition PoC on your own LLM can be set up in 3-5 business days.
| Benchmark | Author / origin | Axis (paper §3.5) | Scale | Scoring metric | Result (RLM effect) |
|---|---|---|---|---|---|
| SimpleQA | OpenAI | T1: long-tail entity | 500 Q × 3 seeds = 1,500 trials | F-score (correct / attempted), SimpleQA paper preferred metric | incorrect -20.5 pt, F +3.7 pt |
| TruthfulQA | Lin et al. 2022 | T5+T6: misconception / trick | 790 Q × 3 seeds = 2,370 trials (790 of standard 817, those that admit binary scoring) | Truth metric, Llama-3 judge / NLI-equivalent | incorrect -20.1 pt, Truth +20.1 pt |
| HaluEval-QA | HotpotQA-derived (RUCAIBox) | T2+T6: false premise / multi-hop | 1,000 Q × 3 seeds = 3,000 trials | binary correctness on (Question, hallucinated_answer) | incorrect -20.7 pt, F +21.4 pt |
| 3-bench combined | 3 independent question sources | T1 ↔ T5+T6 ↔ T2+T6 (full axis cover) | 6,870 trials (2,290 distinct Q × 3 seeds) | seed-mean ± SD of incorrect-rate Δ | -20.4 ± 0.3 pt ★ |
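As a quick consistency check on the table above, the trial counts and the combined figure can be recomputed from the per-benchmark rows. This is a minimal sketch: it uses only the numbers printed in the table, and it takes the spread across the three benchmark-level deltas (sample SD), which is an assumption made for illustration. The page defines the ± as a seed-to-seed SD, and the per-seed values needed to reproduce that exactly are not listed here.

```python
from statistics import mean, stdev

# Values copied from the table above (Qwen3:8b, 3 seeds per benchmark).
benchmarks = {
    "SimpleQA":    {"questions": 500,  "delta_incorrect_pt": -20.5},
    "TruthfulQA":  {"questions": 790,  "delta_incorrect_pt": -20.1},
    "HaluEval-QA": {"questions": 1000, "delta_incorrect_pt": -20.7},
}
SEEDS = 3

# Trial counts: distinct questions times seeds.
distinct_questions = sum(b["questions"] for b in benchmarks.values())
total_trials = distinct_questions * SEEDS

# Across-benchmark summary (assumption: NOT the seed-to-seed SD the page defines).
deltas = [b["delta_incorrect_pt"] for b in benchmarks.values()]
across_bench_mean = mean(deltas)
across_bench_sd = stdev(deltas)  # sample SD, n - 1 in the denominator

print(distinct_questions, total_trials)
print(f"{across_bench_mean:.1f} ± {across_bench_sd:.1f} pt")
```

With these inputs the script prints 2290 and 6870 for the counts and -20.4 ± 0.3 pt for the across-benchmark summary, which agrees with the combined row.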
1.1 Reproduction conditions — LLM, temperature, seed, cache
| LLM | Qwen3:8b / Llama 3.1:8b / Mistral 7B / Gemma 3:4B (via Ollama). Primary benchmarks in this table run on Qwen3:8b; 4-LLM cross-validation in §2 |
|---|---|
| temperature | baseline 0.7, R-mode 0.4 (impl_v2 Phase B, suppresses fabrication randomness) |
| seeds | 3 seeds fixed (23, 47, 89) for reproducibility |
| cache | 200 (absorbs decoding noise) |
| Scoring | SimpleQA: OpenAI preferred F-score (refusal-when-uncertain rewarded); TruthfulQA: Truth metric; HaluEval: binary correctness. Reference rubrics are kept unchanged for all 3 benchmarks |
| variance metric | seed-to-seed σ (standard deviation of per-seed Δ). Enables measuring Property A variance absorption |
| Typical run time | HaluEval 6,000 LLM calls ≈ 22.5 min (same-host Ollama, reference value for an 8B-class model) |
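The LLM-side settings in this table can be expressed as request bodies for Ollama's local `/api/generate` endpoint. The sketch below is illustrative only: the mode labels `baseline` and `r_mode`, the helper names, and the sample prompt are hypothetical, and the SlimeTree-RLM routing layer and cache are not modelled. It shows only how the fixed seeds and the two temperatures could be passed to the model.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
SEEDS = (23, 47, 89)                                 # fixed seeds from the table
TEMPERATURE = {"baseline": 0.7, "r_mode": 0.4}       # temperatures from the table; keys are hypothetical labels


def build_payload(model: str, prompt: str, mode: str, seed: int) -> dict:
    """Build one non-streaming /api/generate request body."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": TEMPERATURE[mode], "seed": seed},
    }


def generate(payload: dict) -> str:
    """Send the payload to a running Ollama server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# One payload per (mode, seed) pair for a single sample question.
grid = [
    build_payload("qwen3:8b", "Who wrote 'The Tale of Genji'?", mode, seed)
    for mode in TEMPERATURE
    for seed in SEEDS
]
print(len(grid))  # 2 modes x 3 seeds
```

Calling `generate(grid[0])` requires a local Ollama server with the `qwen3:8b` model pulled; building the grid itself needs no network access.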
2. 4-LLM cross-validation (Open)
To show this is not a Qwen3-only number, 4 LLMs were re-run under identical conditions: 100 traps × cache=200 × seed=23, baseline vs routed.
| LLM | Size | Baseline halluc | Routed halluc | Δ halluc | Δ Latency | Routes (D/μ/R) |
|---|---|---|---|---|---|---|
| Qwen3:8b | 8B | 63% | 19% | -44 pt | -85.7% | 51/46/3 |
| Llama 3.1:8b | 8B | 51% | 19% | -32 pt | -83.3% | 51/46/3 |
| Mistral 7B | 7B | 70% | 51% | -19 pt | -74.8% | 51/45/4 |
| Gemma 3:4B | 4B | 79% | 59% | -20 pt | -79.3% | 51/46/3 |
★ Performance equalizer: Both Tier-A 8B-class LLMs (Qwen3:8b and Llama 3.1:8b) land at 19% hallucination after routing, i.e. an 81% correct ceiling. In this run (100 traps, one seed), the choice of LLM within the same tier stopped mattering.
Multilingual: Japanese +54 pt / English +24 pt / Arabic +7 pt (paper v10 §3 multilingual matrix).
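The Δ halluc column in the cross-validation table can be re-derived from the baseline and routed rates. The sketch below also computes the fraction of baseline hallucinations that remains after routing; that ratio is not a column in the table and is added here only as an illustration.

```python
# (baseline %, routed %) hallucination rates copied from the 4-LLM table.
rates = {
    "Qwen3:8b":     (63, 19),
    "Llama 3.1:8b": (51, 19),
    "Mistral 7B":   (70, 51),
    "Gemma 3:4B":   (79, 59),
}

summary = {
    model: {
        "delta_pt": routed - baseline,    # absolute change in percentage points
        "remaining": routed / baseline,   # share of baseline hallucinations left
    }
    for model, (baseline, routed) in rates.items()
}

for model, s in summary.items():
    print(f"{model:<13} {s['delta_pt']:+d} pt   remaining {s['remaining']:.0%}")
```

For Qwen3:8b the remaining share is 19/63, or about 0.30, so roughly a third of the baseline hallucinations are left after routing.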
3. Paper (published on Zenodo, CC-BY 4.0)
| Paper (English, Zenodo) | "SlimeTree-RLM: Failure-Aware Routing and Controlled Recursive Inference" (SASAKI, HIROSHI; published 2026-01-14; CC-BY 4.0). DOI: 10.5281/zenodo.18238339; direct PDF: slimetree_rlm_paper_final_en.pdf (968.7 KB); Zenodo record: zenodo.org/records/18238339 |
|---|---|
| Citation | Sasaki, H. (2026). SlimeTree-RLM: Failure-Aware Routing and Controlled Recursive Inference. Zenodo. https://doi.org/10.5281/zenodo.18238339 |
| Japanese version v2 | Jxiv submission in preparation; 15 pages / ~24,685 chars / 221 KB. |
| Target venues | EMNLP / MLSys / VLDB / AMIA / EACL / AAAI / NeurIPS (experimental-rigor requirements cleared). |
| Further inquiries | Contact us (affiliation-/use-case-specific supplements, PoC re-runs, etc.) |
4. Patent (coverage public only; text NDA-gated)
The architecture of SlimeTree-RLM is covered by patent claims 1-44. Only the coverage area is public:
- (SemanticTime, SensoryTime) tuple, credibility / forget_index (claims 1, 17, 25)
- Hot Shelf (Treap) + Cold Shelf (RB-Tree) (claims 2, 7, 8)
- Branch-free 3-mode router, failure signal + w·exp(-η·regret), Adaptive η (claims 16, 38-42)
- SAS semantic-area sampling, SpiralIndex + LazySpiralUpdate (claims 2-4, 8)
- Operator ring + Bernstein commutator, Kosaraju SCC (claims 5, 11, 30-31)
- Bron-Kerbosch + greedy mutually-disjoint clique cover (claims 6, 34)
- Hilbert-curve index (claim 9)
- WAL + cascade rollback (non-commuting-side propagation only) (claims 21, 35-37)
- P_split / merge / freeze + fixed-point (claim 43)
- WASM + SharedArrayBuffer + Atomics (claim 12)
- SlotAdapterAPI (claim 13)
- MetaGeneSlot GDPR/HIPAA (claim 14)
- Redlock distributed mutex (claim 16)
- LLVM Function Pass (claims 30-34)
- RocksDB/Redis backends (claim 19)
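The router entry in the list above mentions a weight term of the form w·exp(-η·regret), which has the same shape as a standard exponential-weights (multiplicative-weights) update. The sketch below shows that generic update only. It is not the patented router: the normalization step, the function name, and the regret values are assumptions made for illustration, and the mode labels are borrowed from the "Routes (D/μ/R)" column in §2.

```python
import math


def update_weights(weights: dict, regrets: dict, eta: float) -> dict:
    """Generic exponential-weights step: w_i <- w_i * exp(-eta * regret_i), then normalize."""
    scaled = {mode: w * math.exp(-eta * regrets[mode]) for mode, w in weights.items()}
    total = sum(scaled.values())
    # Normalizing to sum 1 is an assumption; the page states only the w*exp(-eta*regret) term.
    return {mode: w / total for mode, w in scaled.items()}


# Start from uniform weights over three modes; regret values are hypothetical.
weights = {"D": 1 / 3, "mu": 1 / 3, "R": 1 / 3}
regrets = {"D": 0.0, "mu": 0.5, "R": 2.0}
new_weights = update_weights(weights, regrets, eta=1.0)

for mode, w in new_weights.items():
    print(f"{mode}: {w:.3f}")
```

In this generic form, a larger η shifts weight away from high-regret modes more sharply, which is the usual motivation for tuning or adapting η.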
The full claim text is available via contact, after an NDA is signed.
5. Implementations (code; distribution in preparation)
| Python reference implementation | impl/ v0.1: 2,210 lines, zero dependencies, 25 unit tests passing, 80-step demo. The README maps the code to the corresponding paper sections (§x) and patent claims (claim N). |
|---|---|
| Improved implementation | impl_v2/: Phase A (subtype bias trial) → Phase B (R-prompt softening + bias inversion + strict grader), reaching 81.3% (σ=4%) at cache=200. |
| Rust port + WASM | 272 KB single binary, 24× vs Python, 138 unit tests, zero data loss under a 10,000-slot × 500-step stress test. WASM evaluation copies are distributed individually. |
| Bench harness | Same-condition replay scripts for SimpleQA / TruthfulQA / HaluEval-QA, including 4-LLM Ollama connector examples. |
Distribution forms (evaluation licence / joint PoC / sponsored development / OEM integration) via contact or the partners page.
6. Related
- Product page: SlimeTree-RLM — product detail (deployment scenarios, enterprise / AI provider angles)
- Deep-dive blog: Just 272 KB to cut LLM hallucinations to one-third — SlimeTree-RLM (Japanese, 7 chapters)
- Related news: research releases and announcements
- Same family, simple-record-system variant: SlimeTree-VSAM + deep-dive blog
