How It Works
This page documents exactly how each prompt becomes a structured observation, how observations become aggregates, and how aggregates become the public numbers we publish. It is deliberately detailed so that a hostile reviewer can reproduce and audit every claim.
What we capture on every prompt
When you submit a prompt on the advisor, we synchronously run a PII redaction pipeline and a Tier 1 heuristic classifier, then write a single row to the prompt_events table.
- An anonymous user id and session id (never linked across devices).
- The SHA-256 hash of the original prompt (stable across redaction versions).
- The redacted prompt (emails, keys, SSNs, cards, phones, URLs stripped).
- Structural features — length, language, contains_code, contains_url, prompt shape.
- A multi-axis classification — category (16), subcategory (~60), intent, goal, domain, output type, task structure, reasoning intensity, creativity, precision, latency sensitivity, cost sensitivity, risk class, complexity, ambiguity, craft.
- The advisor's routing decision — recommended model, candidate list with scores, confidence, tradeoff profile.
- The outcome — which model you selected, whether you copied/exported/abandoned/re-routed, time-to-decision.
We do not capture IP, device fingerprint, saved credentials, full model completions, or location beyond the timezone you already share with every website.
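The synchronous capture path can be sketched as follows. This is a minimal, hypothetical simplification: the pattern set and field names are illustrative, not the exact rules in backend/app/core/redaction.py, and only a few of the stored fields are shown.

```python
import hashlib
import re

# Illustrative redaction patterns; the production rule set is larger
# and lives in backend/app/core/redaction.py.
REDACTION_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "url": re.compile(r"https?://\S+"),
}

def capture_event(prompt: str) -> dict:
    # Hash the ORIGINAL prompt, so the hash stays stable across
    # redaction versions, as described above.
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    redacted = prompt
    for label, pattern in REDACTION_PATTERNS.items():
        redacted = pattern.sub(f"[{label.upper()}]", redacted)
    return {
        "prompt_hash": prompt_hash,
        "redacted_prompt": redacted,
        "length": len(prompt),
        "contains_code": "```" in prompt,
        "contains_url": bool(REDACTION_PATTERNS["url"].search(prompt)),
    }
```

Note the ordering: the hash is computed before redaction, then only the redacted text is kept alongside structural features.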
The taxonomy
Every prompt is scored on eight orthogonal axes. The axes are deliberately independent, so a single prompt can be meaningful along several dimensions at once.
Labels are additive only. Every event stores the classifier version that produced it.
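A minimal sketch of the additive-only, versioned labeling rule. The Category values and the helper are hypothetical; the authoritative label sets live in backend/app/core/taxonomy.py.

```python
from enum import Enum

TAXONOMY_VERSION = "1.0.0"  # assumed value, stamped on every label set

class Category(str, Enum):
    # Hypothetical subset of the 16 top-level categories.
    CODING = "coding"
    WRITING = "writing"
    ANALYSIS = "analysis"

def label_event(labels: dict) -> dict:
    # Additive-only: labels are never renamed or removed, and every
    # stored label set carries the taxonomy version that produced it,
    # so old events stay interpretable after upgrades.
    return {**labels, "taxonomy_version": TAXONOMY_VERSION}
```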
The classifier pipeline
Classification runs in three tiers so cost per insight stays low:
Tier 1: Keyword-weighted classifier with structural feature extraction. Runs in the hot path for every prompt. Produces a full set of labels with a calibrated confidence score.
Tier 2: Events whose heuristic confidence falls below the escalation threshold are enriched asynchronously by a batched OpenAI classifier. Output is strictly validated against the taxonomy.
Tier 3: Reserved for a small rotating sample (<5%), low-confidence outliers, and disagreement cases. Produces gold-quality labels used to monitor Tier 2 drift.
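The escalation logic can be sketched like this. The threshold and sample-rate constants are assumptions for illustration (the text only states that the Tier 3 sample is under 5%), not production values.

```python
import random

ESCALATION_THRESHOLD = 0.7   # assumed Tier 1 -> Tier 2 confidence cutoff
TIER3_SAMPLE_RATE = 0.05     # rotating gold sample, <5% per the methodology

def route_classification(confidence: float, rng: random.Random) -> str:
    # A small rotating sample always gets gold labels, regardless of
    # confidence, so Tier 2 drift can be monitored.
    if rng.random() < TIER3_SAMPLE_RATE:
        return "tier3"
    # Low-confidence heuristic labels are enriched asynchronously.
    if confidence < ESCALATION_THRESHOLD:
        return "tier2"
    # High-confidence heuristic labels are accepted as-is.
    return "tier1"
```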
Router accuracy and calibration
- Router Accuracy@1: share of events where the user's selected model equals the recommended model.
- Override rate: share of events where the user picked a different model. Tracked per category, per domain, per complexity bucket.
- Calibration: routing confidence vs. actual acceptance curve. Published quarterly.
- Inferred satisfaction: composite of copied / exported / session-continued / not-abandoned / not-reformulated signals, clamped to 0–1.
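The metrics above can be computed as sketched below. Field names are assumptions rather than the production schema, and the equal weighting of the satisfaction signals is an assumption: the text names the signals but not how they are combined.

```python
def router_metrics(events: list[dict]) -> dict:
    # Accuracy@1 counts events where the user's selected model equals
    # the recommended model; overrides are the complement.
    n = len(events)
    hits = sum(e["selected_model"] == e["recommended_model"] for e in events)
    return {
        "n": n,
        "accuracy_at_1": hits / n,
        "override_rate": (n - hits) / n,
    }

def inferred_satisfaction(signals: dict) -> float:
    # Equal-weight composite of the behavioral signals (assumed
    # weighting), clamped to 0-1 as the methodology requires.
    score = (
        signals.get("copied", 0)
        + signals.get("exported", 0)
        + signals.get("session_continued", 0)
        + (1 - signals.get("abandoned", 0))
        + (1 - signals.get("reformulated", 0))
    ) / 5
    return max(0.0, min(1.0, score))
```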
Sample sizes and caveats
Every public claim derived from this data will carry:
- Exact sample size per cell. Cells with n < 100 are never published; cells with n < 500 are flagged as preliminary.
- The exact data window (start → end timestamps).
- User-base composition (current panel skews developer/startup-heavy).
- The taxonomy version and classifier version in effect.
- A link back to this methodology page and to any retractions.
If a published finding later turns out to be wrong, we publish a public retraction with the before/after numbers rather than silently deleting the original claim.
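The sample-size gating reduces to a simple threshold check; the thresholds come straight from the caveats above, while the function itself is an illustrative sketch.

```python
def publication_status(n: int) -> str:
    # Cells with n < 100 are never published; 100 <= n < 500 ships
    # with a "preliminary" flag; larger cells are published normally.
    if n < 100:
        return "suppressed"
    if n < 500:
        return "preliminary"
    return "published"
```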
Versioning
We version four things independently and stamp every event with the version in effect at write time:
- taxonomy_version: 1.0.0
- classifier_version: heuristic-1.0.0
- routing_strategy_version: advisor-v3.0.0
- redaction_version: 1.0.0
Reproducibility
The instrumentation layer is part of this repository. The endpoints that power this methodology page and the internal dashboard are open source.
- backend/app/core/taxonomy.py: authoritative label sets
- backend/app/core/redaction.py: exact patterns run on every prompt
- backend/app/core/intent_classifier.py: Tier 1 heuristic classifier
- backend/app/api/routes/insights.py: dashboard aggregation queries
Found a flaw in the methodology? Good. That is exactly what this page exists for. Open an issue on the repository and we will publish the fix and any resulting corrections.