NEW · 2026-05-31 MIT ★ Estimate → measured Gemini Flash Free tier OK

Measurement Dashboard: replace estimates with measured values

A batch runner that measures real queries and updates the "layered-cut estimates" on the service page. It runs 150 sample prompts through the tiered Gemini Flash → Pro (B-pattern) flow and measures the D/μ/R ratio, the escalation rate, and the approximate cost.

Calls are throttled to 4.2 s each (about 14.3 RPM) to stay safely within the Free-tier limit of 15 RPM, so 150 prompts finish in roughly 10-12 minutes. Get a personal free key at aistudio.google.com/apikey.
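The throttle arithmetic can be checked with a minimal sketch. The function names are hypothetical, and the run-time figure counts only the throttle delay, so API latency would add to it.

```python
def calls_per_minute(interval_ms: float) -> float:
    """Requests per minute for a fixed delay between calls."""
    return 60_000 / interval_ms


def run_minutes(n_prompts: int, interval_ms: float) -> float:
    """Lower bound on run time: throttle delay only, API latency ignored."""
    return n_prompts * interval_ms / 60_000


rpm = calls_per_minute(4200)      # 4.2 s between calls
minutes = run_minutes(150, 4200)  # 150 sample prompts
print(f"{rpm:.1f} RPM, {minutes:.1f} min")  # 14.3 RPM, 10.5 min
```

The 10.5-minute floor sits inside the quoted 10-12 minute range; the remainder would be response time.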

0. Gemini API key + model selection

The key is stored only in localStorage. The cheap model is Gemini Flash and the premium model is Gemini Pro (cost ratio ~16.7×). If you have already saved a key on the Google Gateway, it is reused as-is.

Public pricing (as of 2026-05): Flash $0.075/1M in + $0.30/1M out / Pro $1.25/1M in + $5/1M out.
Free tier (15 RPM): Flash calls are free. Pro may incur a small charge (~$0.06), because 150 prompts × ~15% escalation comes to roughly 22-30 Pro calls.
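As a rough check of the cost ratio and the Pro charge, the sketch below applies the public prices quoted above. The per-call token counts (200 in / 400 out) are an assumption chosen for illustration, not measured values, and `call_cost` is a hypothetical helper.

```python
# Public per-1M-token prices quoted above (USD, as of 2026-05).
PRICES = {
    "flash": {"in": 0.075, "out": 0.30},
    "pro": {"in": 1.25, "out": 5.00},
}


def call_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Indicative cost of one call in USD: unit price x token count."""
    p = PRICES[model]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000


ratio = PRICES["pro"]["in"] / PRICES["flash"]["in"]

# ASSUMPTION: 200 input / 400 output tokens per escalated Pro call.
per_pro_call = call_cost("pro", 200, 400)
low, high = 22 * per_pro_call, 30 * per_pro_call

print(f"cost ratio ~{ratio:.1f}x")
print(f"Pro charge for 22-30 calls: ${low:.3f} to ${high:.3f}")
```

Under these assumed token counts the charge lands near the quoted ~$0.06; longer Pro responses would push it higher.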

1. Sample prompt set (150 prompts, D / μ / R mix)

A business-realistic mix: 60 D-likely prompts (FAQ / known facts), 30 μ-likely prompts (too short / speculative / PII requests), and 60 R-likely prompts (open questions / comparative analysis / creative), i.e. 40% / 20% / 40% of the set. Real SaaS traffic clusters near D 62% / μ 11% / R 27% ([3-benchmark average](/resource/slimetree-rlm/)).

Sample prompt list (150, editable)
One prompt per line; empty lines are ignored.
Prompt-set difference:
  • factual (default): short answers (e.g. "Height of Mt. Fuji?" → "3,776 m") frequently fall under the judge's 20-char threshold, which inflates the escalation rate (54.5%, for example).
  • chat-style: business-realistic SaaS prompts (e.g. "Compare Python and Rust in 3 points") yield 100-500 char responses, so the measured escalation rate lands close to the realistic estimate (~15%).
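To illustrate why short factual answers inflate the escalation rate, here is a minimal sketch of a length-only judge. It assumes the judge escalates when the trimmed answer is shorter than 20 characters; the function name and the exact comparison are hypothetical, and the actual judge also applies other criteria.

```python
MIN_CHARS = 20  # threshold quoted in the text


def escalates_on_length(answer: str, min_chars: int = MIN_CHARS) -> bool:
    """ASSUMPTION: escalate when the trimmed answer is under min_chars."""
    return len(answer.strip()) < min_chars


factual_answer = "3,776 m"  # correct, but only 7 characters
chat_answer = "Python favors fast iteration; Rust favors memory safety and speed."

print(escalates_on_length(factual_answer))  # True: escalated despite being right
print(escalates_on_length(chat_answer))     # False: long enough to pass
```

A correct one-line fact is escalated purely on length, which is why the factual set overstates the rate relative to chat-style prompts.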

2. Run

ms/call (8000 = 7.5 RPM, safe under Flash 15 RPM)
Not yet run

Measured values (cumulative)

total: 0 · D: 0 (0%) · μ: 0 (0%) · R: 0 (0%) · cheap call (B): 0 · cheap fail: 0 · escalated (B): 0 · approx. cost: $0.00
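The cumulative counters could be derived from per-prompt records roughly as follows. This is a sketch under assumptions: the record fields (`label`, `cheap_called`, `escalated`) are hypothetical, and the escalation rate is taken as escalated calls divided by cheap calls, which the page does not state explicitly.

```python
from collections import Counter


def pct(part: int, whole: int) -> float:
    """Percentage that is 0.0 instead of an error when nothing has run yet."""
    return 100.0 * part / whole if whole else 0.0


def summarize(records: list[dict]) -> dict:
    """Tally D/μ/R labels and the escalation rate over a list of run records."""
    labels = Counter(r["label"] for r in records)
    total = len(records)
    cheap_calls = sum(1 for r in records if r.get("cheap_called"))
    escalated = sum(1 for r in records if r.get("escalated"))
    return {
        "total": total,
        "D_pct": pct(labels["D"], total),
        "mu_pct": pct(labels["μ"], total),
        "R_pct": pct(labels["R"], total),
        "cheap_calls": cheap_calls,
        "escalated": escalated,
        # ASSUMPTION: escalation rate = escalated / cheap calls.
        "escalation_pct": pct(escalated, cheap_calls),
    }


# Hypothetical records, for illustration only.
records = [
    {"label": "D"},
    {"label": "μ"},
    {"label": "R", "cheap_called": True, "escalated": False},
    {"label": "R", "cheap_called": True, "escalated": True},
]
summary = summarize(records)
empty = summarize([])  # the "not yet run" state
print(summary)
```

Guarding the division keeps the empty state at 0% rather than leaving the percentage blank.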

Run log (most recent 50)

(not yet run)

API log

(not yet run)

3. Estimate vs measured (scaled to 10,000)

Scale the measured values to 10,000 prompts and compare them against the service-page estimate. The escalation rate and the R ratio are the central measurands: the estimate assumes 27% R and 15% escalation, so substituting measured values shifts the layered-cut percentages.

| Metric | Estimate (R 27% / esc 15%) | Measured (this run) | Delta |
|---|---|---|---|
| R ratio | 27% | | |
| Escalation rate | 15% | | |
| ★ 10k-scale cost (GPT-5 config equivalent) | $14.6 / month | | |
| ★ 10k-scale cost (Claude Opus config equivalent) | $64 / month | | |

* Formula: 10,000 × R ratio × (cheap unit + escalation rate × premium unit). Starting from the cheap/premium unit prices measured on Gemini Flash → Pro, the other premium configs (GPT-5 / Claude Opus) are scaled linearly by token-cost ratio.
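The formula can be written out as a small function. In this sketch the per-call unit costs and the "measured" ratios are placeholders chosen only to show the mechanics; they are not the values behind the $14.6 or $64 figures, and `scaled_cost` is a hypothetical name.

```python
def scaled_cost(r_ratio: float, escalation_rate: float,
                cheap_unit: float, premium_unit: float,
                n_prompts: int = 10_000) -> float:
    """n_prompts x R ratio x (cheap unit + escalation rate x premium unit)."""
    return n_prompts * r_ratio * (cheap_unit + escalation_rate * premium_unit)


# PLACEHOLDER per-call unit costs in USD (assumptions, not measured).
CHEAP_UNIT = 0.0001
PREMIUM_UNIT = 0.002

estimate = scaled_cost(0.27, 0.15, CHEAP_UNIT, PREMIUM_UNIT)
# Hypothetical measured ratios, for illustration only.
measured = scaled_cost(0.40, 0.20, CHEAP_UNIT, PREMIUM_UNIT)

print(f"estimate ${estimate:.2f}, measured ${measured:.2f}, "
      f"delta ${measured - estimate:+.2f}")
```

Because the cost is linear in the R ratio and in the premium unit, swapping in another premium config only changes `premium_unit` by the token-cost ratio.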

4. Export results

Export the raw run data as JSON. Use it later when replacing the service page #multi-agent-saving table with measured values.

(export available after a run)

5. Caveats

  • Bias from the RLM mock: the current `slimetree-rlm-mock.js` returns D/μ from a limited KNOWN_FACTS + MUTE_TRIGGERS set, whereas the real WASM (272 KB) covers far more. The D/μ/R ratios in this run therefore track the mock's fixed-rule pass-through (D from only 6 facts, μ from only 4 triggers).
  • Nature of the escalation rate: it depends on the quality of the cheap LLM. With the Flash → Pro combination, this run measures the mock's judge criteria (length / refusal / hedge / random 15%). On the real WASM, semantic and uncertainty checks raise the precision.
  • Pricing is indicative: it is computed from Gemini public unit prices × token count. Actual billing varies with API-side rounding (per-1k-token round-up, etc.).
  • Free-tier 15 RPM cap: a 4200 ms throttle (= 14.3 RPM) is safe. Pro calls share the same tier; 150 prompts × ~15% escalation = 22-30 Pro calls, which sit right at the per-minute limit, so billing may start (~$0.06).
  • This page is a beta: enterprise SaaS billing needs longer runs of 10,000+ prompts with industry-specific prompt sets. This page is the minimum reproducer for checking "whether the estimate is grossly off the real world."
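The rounding caveat above can be made concrete with a short sketch. It assumes a per-1k-token round-up, which the text gives only as an example of API-side rounding; the function names are hypothetical and this is not a statement of how Gemini actually bills.

```python
import math


def exact_cost(tokens: int, price_per_1m: float) -> float:
    """Cost from unit price x token count, as the dashboard computes it."""
    return tokens * price_per_1m / 1_000_000


def rounded_up_cost(tokens: int, price_per_1m: float) -> float:
    """ASSUMPTION: tokens are billed in whole blocks of 1,000, rounded up."""
    billed_tokens = math.ceil(tokens / 1000) * 1000
    return billed_tokens * price_per_1m / 1_000_000


# 1,250 output tokens at the quoted Pro output price of $5 per 1M tokens.
exact = exact_cost(1250, 5.00)
rounded = rounded_up_cost(1250, 5.00)
print(f"exact ${exact:.5f} vs rounded-up ${rounded:.5f}")
```

Per call the gap is tiny, but it compounds over a 10,000+ prompt run, which is why the dashboard's cost should be read as indicative.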
