Write the rule in English.
Get a score you can build on.
A promptable semantic scoring function. Describe any criterion in plain language and get a calibrated 0–1 probability that a piece of content satisfies it, reproducibly and in milliseconds. The artifact is fixed bytes you own, so a cached score stays valid and doesn't shift under you on a vendor model update. No per-criterion fine-tuning; no labelled examples per task.
// one model, any criterion, no fine-tuning per task
score(
  content   = "I don't see the point in any of it anymore",
  criterion = "the speaker is expressing hopelessness"
) → 0.95
What it is
Think of it as a scoring function, not a classifier: score(content, criterion) → probability.
You don't pick from a fixed label set — you write an arbitrary natural-language criterion and
get back a number you can threshold, route on, monitor, and compose.
That shape is the design intent: decompose a complex judgment into orthogonal axes, score each
independently, combine the results.
It reads compositional structure — subject (whose attribute is this?), tense (now or resolved?), polarity, and negation — not just topic. That's the difference between "is this about debt" and "is the user, right now, in debt." It's a small probe reading an instruction-tuned LLM's hidden state, so it inherits the model's comprehension at small-LLM-prefill speed (~200ms) with a single scalar output.
In practice
Drop in a piece of content and a criterion in plain English. A few of the everyday shapes this slots into:
Illustrative scores — try yours in the sandbox.
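Two of those shapes, a threshold gate and a triage queue, can be sketched in a few lines. This is a minimal sketch under assumptions: `Scorer` is a hypothetical signature for whatever client you use to call the scorer, the 0.8 threshold is arbitrary, and `keyword_stub` is a stand-in for offline runs that bears no relation to how the real model scores.

```python
from typing import Callable, List, Tuple

# Hypothetical signature: any callable mapping (content, criterion) to a 0-1 score.
Scorer = Callable[[str, str], float]

def gate(score: Scorer, content: str, criterion: str, threshold: float = 0.8) -> bool:
    """Threshold shape: True when the score clears the chosen threshold."""
    return score(content, criterion) >= threshold

def triage(score: Scorer, contents: List[str], criterion: str) -> List[Tuple[float, str]]:
    """Ranking shape: order contents from most to least likely to satisfy the criterion."""
    scored = [(score(c, criterion), c) for c in contents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def keyword_stub(content: str, criterion: str) -> float:
    """Stand-in scorer for offline runs only; NOT how the real model scores."""
    return 0.9 if "refund" in content.lower() else 0.1
```

Swapping `keyword_stub` for a real client call leaves `gate` and `triage` unchanged, which is the point of treating the model as a plain scoring function.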
Why not just…
It sits in the gap between cheap topic matchers and expensive LLM judges — judge-grade reading of structure, with the speed, cost, and reproducibility of a classifier.
Embeddings · GLiClass
fast, cheap, zero-shot
Match topic and lexical similarity. Blind to who/when/whether-negated. "horrible" looks the same to "is this positive?" and "is this negative?"
structure: ✗
LLM-as-judge
deep, compositional
Reads structure fine. Harder to keep reproducible in practice: vendor APIs are best-effort even at temp=0, verdicts flip on prompt-format perturbations, and the underlying model can shift silently when the vendor pushes an update. No raw calibratable score exposed.
reproducible: ✗ · ~$/call · ~seconds
predicate
the empty cell
Judge-grade structural reading, returned as a stable, calibrated number in milliseconds. Same input → same score (within bf16 numerical noise) — safe to cache, threshold, monitor for drift, A/B, and put in a control loop.
structure ✓ · reproducible ✓ · ~ms
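Because the same input gives the same score, a cache in front of the scorer is straightforward. The sketch below is one way to do it, not the product's own caching layer: `artifact_id` is a hypothetical identifier for the pinned artifact, included in the key so that swapping artifacts never serves stale scores.

```python
import hashlib
from typing import Callable, Dict

def cache_key(artifact_id: str, content: str, criterion: str) -> str:
    """Hash the three things that determine a score: artifact, content, criterion."""
    h = hashlib.sha256()
    for part in (artifact_id, content, criterion):
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()

class CachedScorer:
    """Wraps a scorer so repeated (content, criterion) pairs are served from memory."""

    def __init__(self, score: Callable[[str, str], float], artifact_id: str):
        self._score = score
        self._artifact_id = artifact_id  # hypothetical name for the pinned artifact
        self._cache: Dict[str, float] = {}
        self.misses = 0

    def __call__(self, content: str, criterion: str) -> float:
        key = cache_key(self._artifact_id, content, criterion)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._score(content, criterion)
        return self._cache[key]
```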
It reads the sentence, not the keywords — same content, the score moves with the structure of the criterion (illustrative, directionally measured):
Quickstart
No key required against the sandbox proxy. One criterion against one content:
curl -s https://internal-criterion-probe-sandbox-production.up.railway.app/api/classify \
  -H 'content-type: application/json' \
  -d '{
    "criterion": "the speaker is asking for help",
    "content": "I'\''m really struggling with this and would appreciate any advice."
  }'
# → { "score": 0.966, "score_raw": 1.000, "calibration_type": "isotonic", ... }
Or score many criteria against one content in a single pass. The content is encoded once and each criterion is scored against it, so adding more criteria is cheap:
curl -s https://internal-criterion-probe-sandbox-production.up.railway.app/api/classify_multi_criteria \
  -H 'content-type: application/json' \
  -d '{
    "content": "The package arrived crushed, but support sorted a refund in minutes.",
    "criteria": [
      "the content is about a shipping or delivery problem",
      "the content praises good customer service",
      "the writer is angry at the company"
    ]
  }'
# → one content prefill, K criteria scored in a single batched pass
Every response carries both score (isotonic-calibrated) and score_raw (the head's raw sigmoid) so you can apply your own calibration downstream. Endpoints: /classify (single pair) and /classify_multi_criteria (one content × K criteria, batched).
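The same two calls from Python, as a minimal standard-library sketch. The URL and the request fields are copied from the curl examples; the helper names (`build_request`, `classify`, `classify_multi`) and the 10-second timeout are assumptions made here, not part of any published client.

```python
import json
import urllib.request

# Base URL copied from the curl examples.
BASE_URL = "https://internal-criterion-probe-sandbox-production.up.railway.app/api"

def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )

def classify(content: str, criterion: str, timeout: float = 10.0) -> dict:
    """Score one criterion against one content; returns the parsed JSON response."""
    req = build_request("classify", {"criterion": criterion, "content": content})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

def classify_multi(content: str, criteria: list, timeout: float = 10.0) -> dict:
    """Score K criteria against one content in a single request."""
    payload = {"content": content, "criteria": criteria}
    req = build_request("classify_multi_criteria", payload)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

`classify(...)["score"]` and `["score_raw"]` follow the single-pair response shown above. The multi-criteria response shape is not shown here, so inspect the returned dict before indexing into it.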
What it's good at — and where to reach for a judge
A 3B probe won't win a benchmark war against a 70B judge, and we don't pretend otherwise. What it wins is the speed/cost/reproducibility tradeoff — and a calibrated score whose distance from 0.5 honestly reflects when it's unsure.
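One way to use that distance from 0.5 is an abstain band: act on confident scores and hand the uncertain middle to a slower judge. A minimal sketch; the 0.2 and 0.8 cut-offs are placeholders to tune on your own data, and `decide` is a hypothetical helper name.

```python
def decide(score: float, low: float = 0.2, high: float = 0.8) -> str:
    """Map a calibrated score to 'yes', 'no', or 'escalate' for the uncertain middle."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= high:
        return "yes"
    if score <= low:
        return "no"
    return "escalate"  # hand this item to an LLM judge or a human reviewer
```

Widening the band trades more escalations for fewer confident mistakes, so the cut-offs are a cost decision rather than a model property.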
Strong (AUC ~0.9+)
- Topic, lexical, sentiment
- Subject / attribution (self vs third party)
- Tense × polarity ("used to, now…")
- Multi-clause AND / OR, deontic, temporal
- Counterfactual facts, implicature, sarcasm
- Cross-language content
- Source-grounded fact-checking (RAG-style: supports vs irrelevant, ~0.97 on the groundedness suites)
Reach for an LLM judge
- Quantifier scope ("not all"), XOR / parity
- Abstract negation in isolation
- "Is this factually true?" (world knowledge)
- Exact numeric comparison — a wrong-but-close number leaks; post-check with code
- Bare fragments / keyword lists — feed whole sentences
- Multi-axis "vibe" criteria — decompose into atoms instead
- Structural precision buried in long content — chunk first
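For the exact-numeric case in the list above, the post-check can be a few lines of code run alongside the score. This is a sketch under assumptions: it handles only plain non-negative decimals (no thousands separators, units, or signs), and both function names are hypothetical.

```python
import re
from typing import List

# Plain non-negative decimals only, e.g. "12" or "12.4".
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def numbers_in(text: str) -> List[float]:
    """Extract plain decimal numbers from text, in order of appearance."""
    return [float(m) for m in NUMBER.findall(text)]

def states_number(content: str, expected: float, tol: float = 0.0) -> bool:
    """True only if the content contains the expected number within a tolerance."""
    return any(abs(n - expected) <= tol for n in numbers_in(content))
```

A combined gate would then require both a high score for the semantic criterion and `states_number(content, expected)` for the exact value.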
"Calibrated" means calibrated against a reference distribution — re-calibrate to your domain for best results (raw scores are exposed for exactly this).
How it works
The probe
Qwen2.5-3B-Instruct + a frozen domain LoRA + a small MLP head reading the hidden state at a fixed seed-token position. The base model already comprehends structure; the head reads a calibrated decision off its activations. No per-criterion training — the criterion is an input, not a trained class.
Multi-criteria, one pass
Encode the content once, then score K criteria against that frozen cache in one batched pass — cheap multi-label by design. Cost grows slowly with K: you can ask many orthogonal questions of the same content (sentiment, topic, intent, subject attribution, frame…) for almost the cost of asking one. The recommended pattern is to decompose a complex judgment into atomic criteria and combine the scores algebraically.
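What "algebraically" means is left open; one common choice, assumed here, is to treat atomic scores as independent probabilities, so AND becomes a product and OR a noisy-OR. The helper names and the three scores are hypothetical. If your atoms are strongly correlated, `min` and `max` may be a better fit.

```python
from math import prod
from typing import Iterable

def p_not(p: float) -> float:
    """Negation: probability the criterion is not satisfied."""
    return 1.0 - p

def p_and(ps: Iterable[float]) -> float:
    """Conjunction under an independence assumption: product of the scores."""
    return prod(ps)

def p_or(ps: Iterable[float]) -> float:
    """Disjunction under the same assumption: 1 minus the product of complements."""
    return 1.0 - prod(1.0 - p for p in ps)

# Hypothetical scores for three atomic criteria about one piece of content.
shipping_problem, praises_service, writer_angry = 0.9, 0.8, 0.1

# "A shipping problem that was handled well and left the writer calm."
resolved_complaint = p_and([shipping_problem, praises_service, p_not(writer_angry)])
```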
Calibrated & reproducible
Isotonic calibration is baked into the artifact, so scores express observed frequencies. Same input → same score within bf16 numerical noise — the artifact is canary-pinned bytes you own, so the scoring function doesn't drift on you between calls or after a vendor model update. That stability is what makes caching, thresholding, drift monitoring, and A/B testing actually work.
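If you re-calibrate to your own domain from `score_raw`, an isotonic fit over a labelled sample is one option. The sketch below is a plain pool-adjacent-violators implementation written for illustration; it is not the calibrator shipped in the artifact, and it returns a step function rather than an interpolated curve. A library such as scikit-learn's `IsotonicRegression` is the usual choice in production.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple

def fit_isotonic(raw: Sequence[float], labels: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Pool-adjacent-violators fit; returns (upper bounds, values) of a non-decreasing step function."""
    if len(raw) == 0 or len(raw) != len(labels):
        raise ValueError("raw and labels must be non-empty and the same length")
    # Each block is [sum_of_labels, count, largest_raw_score_in_block].
    blocks: List[List[float]] = []
    for x, y in sorted(zip(raw, labels)):
        blocks.append([float(y), 1.0, x])
        # Merge backwards on tied raw scores, or while the previous block's mean is larger.
        while len(blocks) > 1 and (
            blocks[-2][2] == blocks[-1][2]
            or blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, n, hi = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2] = hi
    bounds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return bounds, values

def apply_isotonic(bounds: Sequence[float], values: Sequence[float], x: float) -> float:
    """Look up the first block whose upper bound is >= x; clamp above the last block."""
    i = bisect_left(bounds, x)
    return values[min(i, len(values) - 1)]
```

Fit on raw scores paired with 0/1 labels from your own traffic, then apply the mapping to `score_raw` instead of reading `score`.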
Canary-verified loads
Every artifact ships with reference (criterion, content) pairs and a hash of their extracted hidden states. At load, the runtime re-extracts those states and verifies the hash; on a mismatch it refuses to serve, catching LoRA/layout/kernel drift before it reaches a response.
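The load-time check can be pictured roughly as follows. Everything concrete here is an assumption for illustration: the float32 packing, the SHA-256 choice, and the `extract` callable are hypothetical, and the real artifact format and comparison rule are not described on this page.

```python
import hashlib
import struct
from typing import Callable, List, Sequence, Tuple

Canary = Tuple[str, str]  # (criterion, content)

def fingerprint(states: Sequence[Sequence[float]]) -> str:
    """SHA-256 over hidden-state vectors, each packed as little-endian float32."""
    h = hashlib.sha256()
    for vec in states:
        h.update(len(vec).to_bytes(4, "big"))  # keep vector boundaries unambiguous
        h.update(struct.pack(f"<{len(vec)}f", *vec))
    return h.hexdigest()

def verify_canaries(
    extract: Callable[[str, str], List[float]],
    canaries: Sequence[Canary],
    expected_hash: str,
) -> None:
    """Re-extract states for the reference pairs and refuse to serve on any mismatch."""
    states = [extract(criterion, content) for criterion, content in canaries]
    if fingerprint(states) != expected_hash:
        raise RuntimeError("canary mismatch: refusing to serve")
```

Running `verify_canaries` once at startup, before any request is accepted, is what turns silent drift into a hard failure.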
Where it fits
If you've used the HuggingFace zero-shot-classification pipeline
(NLI / BART-MNLI) or GLiClass, this is the same interface — arbitrary criteria,
no training — on an instruction-tuned backbone, with calibration and a reproducibility contract
those don't give you. The novelty isn't the interface; it's the combination: LLM-grade
comprehension, calibrated and reproducible, with an honest map of its own limits.
Try a criterion of your own.
No setup, no key — type content, write a criterion, watch the score move.