predicate

Write the rule in English.
Get a score you can build on.

A promptable semantic scoring function. Describe any criterion in plain language and get a calibrated 0–1 probability that a piece of content satisfies it, reproducibly and in milliseconds. The artifact is fixed bytes you own, so scores are safe to cache and don't shift under you when a vendor updates a model. No per-criterion fine-tuning; no labelled examples per task.

score(content, criterion) → [0, 1]
// one model, any criterion — no fine-tuning per task
score(
  content   = "I don't see the point in any of it anymore",
  criterion = "the speaker is expressing hopelessness"
)  →  0.95

What it is

Think of it as a scoring function, not a classifier: score(content, criterion) → probability. You don't pick from a fixed label set — you write an arbitrary natural-language criterion and get back a number you can threshold, route on, monitor, and compose. That shape is the design intent: decompose a complex judgment into orthogonal axes, score each independently, combine the results.
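A minimal sketch of the "threshold and route" part, assuming you already have a score in hand. The function name, the three action labels, and the 0.2 / 0.8 cut-offs are illustrative assumptions, not values the product prescribes:

```python
def route(score: float, low: float = 0.2, high: float = 0.8) -> str:
    """Map a 0-1 score to an action using two hypothetical thresholds."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= high:
        return "act"      # confident the criterion holds
    if score <= low:
        return "ignore"   # confident it does not
    return "review"       # uncertain band: escalate to a human or a judge
```

Because the output is a probability rather than a label, the cut-offs are yours to tune per use case.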

It reads compositional structure — subject (whose attribute is this?), tense (now or resolved?), polarity, and negation — not just topic. That's the difference between "is this about debt" and "is the user, right now, in debt." It's a small probe reading an instruction-tuned LLM's hidden state, so it inherits the model's comprehension at small-LLM-prefill speed (~200ms) with a single scalar output.

In practice

Drop in a piece of content and a criterion in plain English. A few of the everyday shapes this slots into:

support routing
"Hi, I've been trying to log in for an hour. The reset link doesn't work either."
the customer is reporting a technical problem with their account → 0.97
review sentiment
"Battery lasts forever, screen is gorgeous — but the keyboard is mushy and the fans are loud."
the reviewer would recommend this product overall → 0.26
moderation
"Honestly the worst service I've ever had, you people are clueless."
the speaker is expressing anger toward the support team → 0.97
personalization
"Spent the weekend at a new sourdough place in Brooklyn — the focaccia was outrageous."
the writer is talking about food or restaurants → 0.97
lead qualification
"We're a 40-person fintech and our current vendor's API rate-limits us during peak hours. Looking for alternatives."
the writer is actively evaluating purchasing a new vendor → 0.89
topic tagging
"After three weeks of debugging I finally figured out it was a memory leak in the polling loop."
the content is about software engineering → 0.97

Illustrative scores — try yours in the sandbox.

Why not just…

It sits in the gap between cheap topic matchers and expensive LLM judges — judge-grade reading of structure, with the speed, cost, and reproducibility of a classifier.

Embeddings · GLiClass

fast, cheap, zero-shot

Match topic and lexical similarity. Blind to who/when/whether-negated. "horrible" looks the same to "is this positive?" and "is this negative?"

structure: ✗

LLM-as-judge

deep, compositional

Reads structure fine. Harder to keep reproducible in practice: vendor APIs are best-effort even at temp=0, verdicts flip on prompt-format perturbations, and the underlying model can shift silently when the vendor pushes an update. No raw calibratable score exposed.

reproducible: ✗ · ~$/call · ~seconds

predicate

the empty cell

Judge-grade structural reading, returned as a stable, calibrated number in milliseconds. Same input → same score (within bf16 numerical noise) — safe to cache, threshold, monitor for drift, A/B, and put in a control loop.

structure ✓ · reproducible ✓ · ~ms

It reads the sentence, not the keywords — same content, the score moves with the structure of the criterion (illustrative, directionally measured):

"If we had shipped on Friday, we'd have hit the SLA. Instead the rollback kept us offline until Monday."
the SLA was met → 0.26
the system experienced downtime → 0.95
"My brother's been really anxious lately and I'm worried about him."
the speaker themselves is feeling anxious → 0.30
the content describes someone other than the speaker feeling anxious → 0.73

Quickstart

No key required against the sandbox proxy. One criterion against one content:

POST /api/classify
curl -s https://internal-criterion-probe-sandbox-production.up.railway.app/api/classify \
  -H 'content-type: application/json' \
  -d '{
    "criterion": "the speaker is asking for help",
    "content": "I'\''m really struggling with this and would appreciate any advice."
  }'
# → { "score": 0.966, "score_raw": 1.000, "calibration_type": "isotonic", ... }
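The same single-pair call from Python, as a rough sketch using only the standard library. The URL, path, and the score / score_raw fields are taken from the curl example above; the helper names are made up for illustration, and error handling is left out:

```python
import json
import urllib.request

# Sandbox URL copied from the curl example above.
BASE_URL = "https://internal-criterion-probe-sandbox-production.up.railway.app"


def build_classify_request(content: str, criterion: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for /api/classify."""
    body = json.dumps({"criterion": criterion, "content": content}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/api/classify",
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def parse_scores(response_body: bytes) -> tuple[float, float]:
    """Return (score, score_raw) from a JSON response body."""
    payload = json.loads(response_body)
    return float(payload["score"]), float(payload["score_raw"])


def classify(content: str, criterion: str, timeout: float = 10.0) -> tuple[float, float]:
    """Send the request and return (calibrated score, raw score)."""
    request = build_classify_request(content, criterion)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_scores(response.read())
```

Calling classify("I'm really struggling with this and would appreciate any advice.", "the speaker is asking for help") sends the same payload as the curl command.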

Or score many criteria against one content in a single pass — the content is encoded once, each criterion scored against it (cheap to add more):

POST /api/classify_multi_criteria
curl -s https://internal-criterion-probe-sandbox-production.up.railway.app/api/classify_multi_criteria \
  -H 'content-type: application/json' \
  -d '{
    "content": "The package arrived crushed, but support sorted a refund in minutes.",
    "criteria": [
      "the content is about a shipping or delivery problem",
      "the content praises good customer service",
      "the writer is angry at the company"
    ]
  }'
# → one content prefill, K criteria scored in a single batched pass

Every response carries both score (isotonic-calibrated) and score_raw (the head's raw sigmoid) so you can apply your own calibration downstream. Endpoints: /api/classify (single pair) and /api/classify_multi_criteria (one content × K criteria, batched).
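One way to use score_raw for your own calibration: collect a small labelled sample from your domain and fit a monotone map from raw score to observed frequency. The sketch below is a plain-Python pool-adjacent-violators fit. fit_isotonic and apply_isotonic are illustrative names, not part of the API, and this is not necessarily how the built-in calibration is computed:

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple


def fit_isotonic(
    raw_scores: Sequence[float], labels: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Fit a non-decreasing step function with pool-adjacent-violators.

    Returns (thresholds, values); values[i] applies from thresholds[i] upward.
    """
    if len(raw_scores) != len(labels) or len(raw_scores) == 0:
        raise ValueError("need equal-length, non-empty inputs")
    # Aggregate tied raw scores first: raw score -> [label sum, count].
    totals: Dict[float, List[float]] = {}
    for x, y in zip(raw_scores, labels):
        entry = totals.setdefault(x, [0.0, 0.0])
        entry[0] += y
        entry[1] += 1.0
    # Each block is [lowest raw score in block, label sum, count].
    blocks: List[List[float]] = []
    for x in sorted(totals):
        label_sum, count = totals[x]
        blocks.append([x, label_sum, count])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][1] / blocks[-2][2] > blocks[-1][1] / blocks[-1][2]
        ):
            _, s, n = blocks.pop()
            blocks[-1][1] += s
            blocks[-1][2] += n
    thresholds = [b[0] for b in blocks]
    values = [b[1] / b[2] for b in blocks]
    return thresholds, values


def apply_isotonic(
    raw_score: float, thresholds: Sequence[float], values: Sequence[float]
) -> float:
    """Look up the calibrated value for a raw score (step function)."""
    index = bisect_right(thresholds, raw_score) - 1
    return values[max(index, 0)]
```

In practice a library implementation such as scikit-learn's IsotonicRegression is the more likely choice; the point here is only that the raw score gives you something to fit against.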

What it's good at — and where to reach for a judge

A 3B probe won't win a benchmark war against a 70B judge, and we don't pretend otherwise. What it wins is the speed/cost/reproducibility tradeoff — and a calibrated score whose distance from 0.5 honestly reflects when it's unsure.

Strong (AUC ~0.9+)

  • Topic, lexical, sentiment
  • Subject / attribution (self vs third party)
  • Tense × polarity ("used to, now…")
  • Multi-clause AND / OR, deontic, temporal
  • Counterfactual facts, implicature, sarcasm
  • Cross-language content
  • Source-grounded fact-checking (RAG-style: supports vs irrelevant, ~0.97 on the groundedness suites)

Reach for an LLM judge

  • Quantifier scope ("not all"), XOR / parity
  • Abstract negation in isolation
  • "Is this factually true?" (world knowledge)
  • Exact numeric comparison — a wrong-but-close number leaks; post-check with code
  • Bare fragments / keyword lists — feed whole sentences
  • Multi-axis "vibe" criteria — decompose into atoms instead
  • Structural precision buried in long content — chunk first
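For the "chunk first" item, a rough sketch of one approach: split long content into overlapping windows of whole sentences, score each window, and keep the maximum. The regex sentence splitter, the window sizes, the max aggregation, and the score_fn callable (any function taking content and criterion and returning a score) are all assumptions for illustration:

```python
import re
from typing import Callable, List


def chunk_sentences(text: str, window: int = 3, stride: int = 2) -> List[str]:
    """Split text into overlapping windows of whole sentences."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) <= window:
        return [" ".join(sentences)] if sentences else []
    chunks = []
    for start in range(0, len(sentences), stride):
        chunks.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the last window already reaches the end of the text
    return chunks


def score_long_content(
    text: str, criterion: str, score_fn: Callable[[str, str], float]
) -> float:
    """Score every chunk and keep the maximum ("does any part satisfy it?")."""
    chunks = chunk_sentences(text)
    if not chunks:
        raise ValueError("text contains no sentences")
    return max(score_fn(chunk, criterion) for chunk in chunks)
```

Taking the maximum suits "does this appear anywhere" criteria; a criterion about the document as a whole may need a different aggregation.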

"Calibrated" means calibrated against a reference distribution — re-calibrate to your domain for best results (raw scores are exposed for exactly this).

How it works

The probe

Qwen2.5-3B-Instruct + a frozen domain LoRA + a small MLP head reading the hidden state at a fixed seed-token position. The base model already comprehends structure; the head reads a calibrated decision off its activations. No per-criterion training — the criterion is an input, not a trained class.

Multi-criteria, one pass

Encode the content once, then score K criteria against that frozen cache in one batched pass — cheap multi-label by design. Cost grows slowly with K: you can ask many orthogonal questions of the same content (sentiment, topic, intent, subject attribution, frame…) for almost the cost of asking one. The recommended pattern is to decompose a complex judgment into atomic criteria and combine the scores algebraically.
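What "combine the scores algebraically" can look like, as a sketch. The product rules below treat the atomic criteria as independent, which is an assumption that will not always hold; the function names are illustrative:

```python
from math import prod
from typing import Iterable


def p_not(p: float) -> float:
    """Probability that a criterion does NOT hold."""
    return 1.0 - p


def p_and(ps: Iterable[float]) -> float:
    """Probability that all criteria hold, assuming independence."""
    return prod(ps)


def p_or(ps: Iterable[float]) -> float:
    """Probability that at least one criterion holds, assuming independence."""
    return 1.0 - prod(1.0 - p for p in ps)
```

For example, with hypothetical scores of 0.9 for "the content is about a shipping or delivery problem" and 0.2 for "the writer is angry at the company", p_and([0.9, p_not(0.2)]) scores "a shipping problem reported without anger". If the atoms are strongly correlated, min and max are a more conservative alternative to the products.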

Calibrated & reproducible

Isotonic calibration is baked into the artifact, so scores express observed frequencies. Same input → same score within bf16 numerical noise — the artifact is canary-pinned bytes you own, so the scoring function doesn't drift on you between calls or after a vendor model update. That stability is what makes caching, thresholding, drift monitoring, and A/B testing actually work.
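A sketch of the caching this enables. It assumes you have some identifier for the pinned artifact (artifact_id here is hypothetical) and a score_fn callable that returns a score for a content and criterion pair; the in-memory dict stands in for whatever store you actually use:

```python
import hashlib
from typing import Callable, Dict


def cache_key(artifact_id: str, criterion: str, content: str) -> str:
    """Derive a stable key from the artifact identity and both inputs."""
    digest = hashlib.sha256()
    for part in (artifact_id, criterion, content):
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


class ScoreCache:
    """In-memory memo of scores, valid for as long as the artifact is unchanged."""

    def __init__(self, artifact_id: str, score_fn: Callable[[str, str], float]):
        self._artifact_id = artifact_id
        self._score_fn = score_fn
        self._store: Dict[str, float] = {}

    def score(self, content: str, criterion: str) -> float:
        key = cache_key(self._artifact_id, criterion, content)
        if key not in self._store:
            self._store[key] = self._score_fn(content, criterion)
        return self._store[key]
```

Including the artifact identifier in the key means a deliberate artifact upgrade invalidates old entries instead of silently mixing scores from two versions.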

Canary-verified loads

Every artifact ships with reference (criterion, content) pairs and a hash of their extracted hidden states. At load the runtime re-extracts those states and verifies the hash; on a mismatch it refuses to serve, catching LoRA/layout/kernel drift before it reaches a response.
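The load-time check could be sketched roughly as follows. This is not the runtime's actual code: extract_fn is a hypothetical stand-in for hidden-state extraction, and hashing the exact float bytes is an assumption (a real runtime might quantize first to tolerate numerical noise):

```python
import hashlib
import struct
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (criterion, content)


def fingerprint(states: Sequence[Sequence[float]]) -> str:
    """Hash a list of hidden-state vectors into one hex digest."""
    digest = hashlib.sha256()
    for vector in states:
        # Length-prefix each vector, then append its values as 64-bit floats.
        digest.update(struct.pack("<I", len(vector)))
        digest.update(struct.pack(f"<{len(vector)}d", *vector))
    return digest.hexdigest()


def verify_canaries(
    pairs: Sequence[Pair],
    expected_hash: str,
    extract_fn: Callable[[str, str], List[float]],
) -> None:
    """Re-extract states for the reference pairs; refuse to serve on mismatch."""
    states = [extract_fn(criterion, content) for criterion, content in pairs]
    if fingerprint(states) != expected_hash:
        raise RuntimeError("canary mismatch: refusing to serve")
```

The check runs once at load, so the cost of re-extracting a handful of reference pairs is paid before any request is served.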

Where it fits

If you've used the HuggingFace zero-shot-classification pipeline (NLI / BART-MNLI) or GLiClass, this is the same interface — arbitrary criteria, no training — on an instruction-tuned backbone, with calibration and a reproducibility contract those don't give you. The novelty isn't the interface; it's the combination: LLM-grade comprehension, calibrated and reproducible, with an honest map of its own limits.

Try a criterion of your own.

No setup, no key — type content, write a criterion, watch the score move.

internal · Qwen2.5-3B-Instruct + domain LoRA + MLP head · KV-pop (FlashInfer) + SP · gex44-experimental