How It Works
This page documents exactly how each prompt becomes a structured observation, how observations become aggregates, and how aggregates become the public numbers we publish. It is deliberately detailed so that a hostile reviewer can reproduce and audit every claim.
What we capture on every prompt
When you submit a prompt on the advisor, we synchronously run a PII redaction pipeline and a Tier 1 heuristic classifier, then write a single row to the prompt_events table.
- An anonymous user id and session id (never linked across devices).
- The SHA-256 hash of the original prompt (stable across redaction versions).
- The redacted prompt (emails, keys, SSNs, cards, phones, URLs stripped).
- Structural features — length, language, contains_code, contains_url, prompt shape.
- A multi-axis classification — category (16), subcategory (~60), intent, goal, domain, output type, task structure, reasoning intensity, creativity, precision, latency sensitivity, cost sensitivity, risk class, complexity, ambiguity, craft.
- The advisor's routing decision — recommended model, candidate list with scores, confidence, tradeoff profile.
- The outcome — which model you selected, whether you copied/exported/abandoned/re-routed, time-to-decision.
We do not capture IP, device fingerprint, saved credentials, full model completions, or location beyond the timezone you already share with every website.
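The synchronous capture path can be sketched as follows. This is a minimal, hypothetical simplification: the pattern set and field names are illustrative, not the exact rules in backend/app/core/redaction.py, and only a few of the stored fields are shown.

```python
import hashlib
import re

# Illustrative redaction patterns; the production rule set is larger
# and lives in backend/app/core/redaction.py.
REDACTION_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "url": re.compile(r"https?://\S+"),
}

def capture_event(prompt: str) -> dict:
    # Hash the ORIGINAL prompt, so the hash stays stable across
    # redaction versions, as described above.
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    redacted = prompt
    for label, pattern in REDACTION_PATTERNS.items():
        redacted = pattern.sub(f"[{label.upper()}]", redacted)
    return {
        "prompt_hash": prompt_hash,
        "redacted_prompt": redacted,
        "length": len(prompt),
        "contains_code": "```" in prompt,
        "contains_url": bool(REDACTION_PATTERNS["url"].search(prompt)),
    }
```

Note the ordering: the hash is computed before redaction, then only the redacted text is kept alongside structural features.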
The taxonomy
Every prompt is scored on eight orthogonal axes. The axes are deliberately independent, so a single prompt can be meaningful along several dimensions at once.
Labels are additive only. Every event stores the classifier version that produced it.
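A minimal sketch of the additive-only, versioned labeling rule. The Category values and the helper are hypothetical; the authoritative label sets live in backend/app/core/taxonomy.py.

```python
from enum import Enum

TAXONOMY_VERSION = "1.0.0"  # assumed value, stamped on every label set

class Category(str, Enum):
    # Hypothetical subset of the 16 top-level categories.
    CODING = "coding"
    WRITING = "writing"
    ANALYSIS = "analysis"

def label_event(labels: dict) -> dict:
    # Additive-only: labels are never renamed or removed, and every
    # stored label set carries the taxonomy version that produced it,
    # so old events stay interpretable after upgrades.
    return {**labels, "taxonomy_version": TAXONOMY_VERSION}
```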
The classifier pipeline
Classification runs in three tiers so cost per insight stays low:
Tier 1: Keyword-weighted classifier with structural feature extraction. Runs in the hot path for every prompt. Produces a full set of labels with a calibrated confidence score.
Tier 2: Events whose heuristic confidence falls below the escalation threshold are enriched asynchronously by a batched OpenAI classifier. Output is strictly validated against the taxonomy.
Tier 3: Reserved for a small rotating sample (<5%), low-confidence outliers, and disagreement cases. Produces gold-quality labels used to monitor Tier 2 drift.
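The escalation logic can be sketched like this. The threshold and sample-rate constants are assumptions for illustration (the text only states that the Tier 3 sample is under 5%), not production values.

```python
import random

ESCALATION_THRESHOLD = 0.7   # assumed Tier 1 -> Tier 2 confidence cutoff
TIER3_SAMPLE_RATE = 0.05     # rotating gold sample, <5% per the methodology

def route_classification(confidence: float, rng: random.Random) -> str:
    # A small rotating sample always gets gold labels, regardless of
    # confidence, so Tier 2 drift can be monitored.
    if rng.random() < TIER3_SAMPLE_RATE:
        return "tier3"
    # Low-confidence heuristic labels are enriched asynchronously.
    if confidence < ESCALATION_THRESHOLD:
        return "tier2"
    # High-confidence heuristic labels are accepted as-is.
    return "tier1"
```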
Router accuracy and calibration
- Router Accuracy@1: share of events where the user's selected model equals the recommended model.
- Override rate: share of events where the user picked a different model. Tracked per category, per domain, per complexity bucket.
- Calibration: routing confidence vs. actual acceptance curve. Published quarterly.
- Inferred satisfaction: composite of copied / exported / session-continued / not-abandoned / not-reformulated signals, clamped to 0–1.
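The metrics above can be computed as sketched below. Field names are assumptions rather than the production schema, and the equal weighting of the satisfaction signals is an assumption: the text names the signals but not how they are combined.

```python
def router_metrics(events: list[dict]) -> dict:
    # Accuracy@1 counts events where the user's selected model equals
    # the recommended model; overrides are the complement.
    n = len(events)
    hits = sum(e["selected_model"] == e["recommended_model"] for e in events)
    return {
        "n": n,
        "accuracy_at_1": hits / n,
        "override_rate": (n - hits) / n,
    }

def inferred_satisfaction(signals: dict) -> float:
    # Equal-weight composite of the behavioral signals (assumed
    # weighting), clamped to 0-1 as the methodology requires.
    score = (
        signals.get("copied", 0)
        + signals.get("exported", 0)
        + signals.get("session_continued", 0)
        + (1 - signals.get("abandoned", 0))
        + (1 - signals.get("reformulated", 0))
    ) / 5
    return max(0.0, min(1.0, score))
```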
Sample sizes and caveats
Every public claim derived from this data will carry:
- Exact sample size per cell. Cells with n < 100 are never published; cells with n < 500 are flagged as preliminary.
- The exact data window (start → end timestamps).
- User-base composition (current panel skews developer/startup-heavy).
- The taxonomy version and classifier version in effect.
- A link back to this methodology page and to any retractions.
If a published finding later turns out to be wrong, we publish a public retraction with the before/after numbers rather than silently deleting the original claim.
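The sample-size gating reduces to a simple threshold check; the thresholds come straight from the caveats above, while the function itself is an illustrative sketch.

```python
def publication_status(n: int) -> str:
    # Cells with n < 100 are never published; 100 <= n < 500 ships
    # with a "preliminary" flag; larger cells are published normally.
    if n < 100:
        return "suppressed"
    if n < 500:
        return "preliminary"
    return "published"
```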
Versioning
We version four things independently and stamp every event with the version in effect at write time:
- taxonomy_version: 1.0.0
- classifier_version: heuristic-1.0.0
- routing_strategy_version: advisor-v3.0.0
- redaction_version: 1.0.0
Reproducibility
The instrumentation layer is part of this repository. The endpoints that power this methodology page and the internal dashboard are open source.
- backend/app/core/taxonomy.py: authoritative label sets
- backend/app/core/redaction.py: exact patterns run on every prompt
- backend/app/core/intent_classifier.py: Tier 1 heuristic classifier
- backend/app/api/routes/insights.py: dashboard aggregation queries
Found a flaw in the methodology? Good. That is exactly what this page exists for. Open an issue on the repository and we will publish the fix and any resulting corrections.