Three tickets, different words, the same underlying issue.
Built for whoever ends up triaging feedback — support lead, PM, or a founder doing both — not a specific role or team size.
Eight checked steps turn a raw, messy inbox into work packs a human can act on.
The hard part isn't sorting feedback. It's drafting a reply about someone's money without confidently saying something false.
So the answer isn't “trust the model.” It's three guarantees the pipeline enforces — each one inspectable in the product itself.
The definition of “correct” — a 20-item rubric and a hand-labeled golden set — was written before any output was generated. Every version since is scored against that standard. One prompt iteration made accuracy worse; it was caught by the eval, reverted, and kept in the log rather than hidden.
Each quote is verbatim from a real feedback item; each policy statement cites a clause ID you can open. If the context to back a claim isn't loaded, the draft says so rather than inventing a policy.
Tasks are recommendations, not filed tickets. Any reply touching money, timing, or policy is blocked by a review flag until a person verifies it. Nothing sends itself.
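A minimal sketch of how a review flag like this could block a draft, assuming a simple keyword detector. `ReplyDraft`, `SENSITIVE_PATTERNS`, `flag_for_review`, and `is_blocked` are hypothetical names for illustration, not the product's actual code.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical keyword patterns per sensitive topic (an assumption for this
# sketch; a real detector would likely be broader).
SENSITIVE_PATTERNS = {
    "money": re.compile(r"\b(refund|fee|charge|payout|invoice)s?\b", re.IGNORECASE),
    "timing": re.compile(r"\b(within \d+|business days?|deadline)\b", re.IGNORECASE),
    "policy": re.compile(r"\b(policy|terms|clause|eligible|eligibility)\b", re.IGNORECASE),
}


@dataclass
class ReplyDraft:
    text: str
    review_flags: List[str] = field(default_factory=list)
    verified_by: Optional[str] = None  # set only by a human reviewer


def flag_for_review(draft: ReplyDraft) -> ReplyDraft:
    """Attach one flag per sensitive topic the draft text touches."""
    draft.review_flags = [
        topic for topic, pattern in SENSITIVE_PATTERNS.items()
        if pattern.search(draft.text)
    ]
    return draft


def is_blocked(draft: ReplyDraft) -> bool:
    """A flagged draft stays blocked until a person has verified it."""
    return bool(draft.review_flags) and draft.verified_by is None


draft = flag_for_review(ReplyDraft("Your refund will arrive within 5 business days."))
print(draft.review_flags, is_blocked(draft))
```

Even an unblocked draft is only a proposal in this sketch; sending remains a human action.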
* 65% overall on a 20-item golden set with strict multi-axis scoring: intent, dimension, impact, and urgency must all match for an item to count. Individual axes score 75–90%. Scores are always shown with their evaluation set, never as a bare percentage.
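Strict multi-axis scoring can be sketched as follows. The four axis names come from the text above; the function name, the label values, and the four-item toy data are made up for illustration and are not the real golden set.

```python
AXES = ("intent", "dimension", "impact", "urgency")


def score(predictions, gold):
    """Return (strict_accuracy, per_axis_accuracy) over paired label dicts."""
    n = len(gold)
    pairs = list(zip(predictions, gold))
    per_axis = {
        axis: sum(p[axis] == g[axis] for p, g in pairs) / n
        for axis in AXES
    }
    # An item counts only if every axis matches.
    strict = sum(all(p[a] == g[a] for a in AXES) for p, g in pairs) / n
    return strict, per_axis


# Toy data (invented labels): four items sharing the same gold labels.
label = {"intent": "bug", "dimension": "billing", "impact": "high", "urgency": "now"}
gold = [dict(label) for _ in range(4)]
predictions = [
    dict(label),                        # all four axes match
    dict(label),                        # all four axes match
    dict(label, intent="question"),     # wrong on intent only
    dict(label, urgency="later"),       # wrong on urgency only
]

strict, per_axis = score(predictions, gold)
print(strict, per_axis)
```

Because a single wrong axis fails the whole item, the strict score can never exceed the weakest axis and is often lower, which is how per-axis scores of 75–90% can coexist with a lower overall number.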
Honest about what it protects — and what it doesn't yet.
Feedback is messy and often carries personal data. Here is what the pipeline does about that, stated plainly, including the limits it hasn't closed.
Emails, phone numbers, and account identifiers are stripped before any model sees the text. The honest limit: human names aren't caught by regex in v1 — a documented gap, not a hidden one. You can see it directly: open any quote's source.
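A rough sketch of regex-based redaction, including the gap described above. The patterns are assumptions for illustration: the phone pattern is deliberately loose, and the `acct_` identifier format is hypothetical, not the product's real account ID scheme.

```python
import re

# Order matters: emails and account IDs are replaced before the loose
# phone pattern runs, so their digits are not mistaken for phone numbers.
REDACTIONS = [
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[ACCOUNT_ID]", re.compile(r"\bacct_[A-Za-z0-9]+\b")),  # hypothetical format
    ("[PHONE]", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def redact(text: str) -> str:
    """Replace each matched identifier with a placeholder token."""
    for placeholder, pattern in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text


raw = "Hi, I'm Dana Smith, reach me at dana@example.com or +1 (415) 555-0132 about acct_9f3k2."
print(redact(raw))
# The name "Dana Smith" is left in place: regex alone does not catch names.
```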
The pipeline never files a ticket or sends a message on its own. It proposes; a human disposes. Every consequential step has a person in the loop by design.
Nothing is asserted without a source ref. When no supporting context is loaded, the run says so on the results — it does not paper over the gap with a plausible-sounding policy.
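One way to picture this rule is a small audit pass over drafted claims. This is a sketch under stated assumptions: the `Claim` type, the `audit_claims` function, the notice wording, and the `POL-4.2` clause ID are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Claim:
    text: str
    source_ref: Optional[str]  # e.g. a clause ID; None if nothing backs it


def audit_claims(claims: List[Claim], loaded_clause_ids: Set[str]) -> dict:
    """Keep only claims whose source_ref points at a loaded clause."""
    if not loaded_clause_ids:
        # No context at all: say so instead of asserting anything.
        return {
            "notice": "No supporting context loaded; policy claims omitted.",
            "kept": [],
            "dropped": [c.text for c in claims],
        }
    kept = [c for c in claims if c.source_ref in loaded_clause_ids]
    dropped = [c.text for c in claims if c.source_ref not in loaded_clause_ids]
    return {"notice": None, "kept": kept, "dropped": dropped}


claims = [
    Claim("Refunds settle in 5 days", "POL-4.2"),
    Claim("Fees are waived", None),
]
print(audit_claims(claims, {"POL-4.2"}))
print(audit_claims(claims, set()))
```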
Vela Pay is the built-in demo dataset — a synthetic B2B payments company with realistic feedback, policies, and known issues, no real customers. The pipeline runs identically when you upload your own data; the context docs are what change.
v1 limitations — regex PII redaction, direct RAG retrieval, manual orchestration — are documented in the repo, not hidden.
See it run on real feedback.
From a raw inbox to a stack of reviewable work packs.
The same depth you'd expect from a real product's explainer: the full pipeline, every field in a work pack, how the evaluation was designed, and what changed across iterations.
01 — The pipeline, one beat each
02 — What's in a work pack
Every field explained in plain terms. The structure is deliberate — some fields are computed, some block sending, some are only recommendations.
03 — How the evaluation was designed
Before any work pack was generated at scale, twenty real-looking feedback items were hand-labeled against a written rubric — what a correct classification looks like, what a passing work pack has to include.
That standard came first; everything since is measured against it, not the other way around. The rubric mixes deterministic checks the code runs automatically (13 rules) with judgment calls scored by a human reviewer (7 rules) — including whether tasks are correctly scoped and whether the reply tone matches the situation.
04 — Iteration evidence
The first version of the classifier got about 40% of the golden set right. Four rounds of changing the prompt brought that to about 65% — including one round that made things worse and was reverted, kept in the log rather than hidden.
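The revert-but-keep-in-the-log pattern can be sketched like this. The function names and log shape are assumptions, and the accuracy values are made-up numbers for illustration, not the project's actual per-round results.

```python
def record_iteration(log, version, accuracy):
    """Append every attempt to the log; mark regressions as reverted."""
    best = max((e["accuracy"] for e in log if e["status"] == "kept"), default=None)
    status = "kept" if best is None or accuracy >= best else "reverted"
    log.append({"version": version, "accuracy": accuracy, "status": status})
    return status


def active_version(log):
    """The most recent version that was not reverted."""
    kept = [e["version"] for e in log if e["status"] == "kept"]
    return kept[-1] if kept else None


log = []
# Illustrative scores only (invented for this sketch).
for version, accuracy in [("v1", 0.40), ("v2", 0.50), ("v3", 0.45), ("v4", 0.65)]:
    record_iteration(log, version, accuracy)

print([e["status"] for e in log], active_version(log))
```

The reverted entry stays in `log`, so the regression remains visible to anyone reading the history.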
Nine prompt versions across three sessions, validated by four rounds of human eval sampling (12 clusters total). Four code bugs in the automated quality checks were found and fixed during iteration — two from full-output scans, two surfaced during human review. None were visible from spot-checking alone.
Pick a source of feedback.
Paste or upload your own feedback to test how it handles arbitrary input — that's the real test. The built-in packs are a fast preview, not a proof of anything.
Upload a .md or .pdf — the model uses it to ground policy references and reply drafts. Without one, source_refs will be empty.
Runs from this browser tab only. Gone on refresh — there's no account and no saved history.
Run the pipeline to see results here.