Portfolio project  ·  AI PM / TPM

Feedback in.
Traceable work
packs out.

Asterline reads raw, unstructured feedback and turns it into structured work packs — one per underlying issue, each with the quotes it came from, the tasks it implies, and a drafted reply waiting on a human's sign-off.

01 Eval design

02 Iteration

03 Traceability

View the code

From noise to one artifact

Three tickets, different words, the same underlying issue.

Raw feedback in

FB-01 · support_ticket

"no error, no confirmation, page just sat there for 10+ minutes."

FB-26 · support_ticket

"We had no way of knowing whether the payments had gone out."

FB-27 · support_ticket

"Uploaded a 540-row file, nothing happened. Is there a size limit nobody told us about?"

One work pack out

CLU-001 ● actionable bug


Built for whoever ends up triaging feedback — support lead, PM, or a founder doing both — not a specific role or team size.

Eight steps, none skipped

Eight checked steps turn a raw, messy inbox into work packs a human can act on.


Why trust the output

The hard part isn't sorting feedback. It's drafting a reply about someone's money without confidently saying something false.

So the answer isn't “trust the model.” It's three guarantees the pipeline enforces — each one inspectable in the product itself.

Eval-first, not vibes

20-item golden set · 65%* overall

The definition of “correct” — a 20-item rubric and a hand-labeled golden set — was written before any output was generated. Every version since is scored against that standard. One prompt iteration made accuracy worse; it was caught by the eval, reverted, and kept in the log rather than hidden.

Every claim is cited

100% traceable to a source

Each quote is verbatim from a real feedback item; each policy statement cites a clause ID you can open. If the context to back a claim isn't loaded, the draft says so rather than inventing a policy.
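As a rough illustration, the verbatim-quote guarantee can be checked mechanically. The sketch below assumes each quote records the ID of the feedback item it came from; the field names `source_id` and `text` and the function `find_untraceable_quotes` are hypothetical and not taken from the Asterline code.

```python
def find_untraceable_quotes(quotes, feedback_by_id):
    """Return quotes that are not exact substrings of their cited feedback item."""
    failures = []
    for quote in quotes:
        source_text = feedback_by_id.get(quote["source_id"])
        # A quote fails if its source is missing or its text is not verbatim.
        if source_text is None or quote["text"] not in source_text:
            failures.append(quote)
    return failures


feedback_by_id = {
    "FB-26": "We had no way of knowing whether the payments had gone out.",
}
quotes = [
    {"source_id": "FB-26", "text": "no way of knowing whether the payments had gone out"},
    {"source_id": "FB-26", "text": "payments never went out"},  # paraphrase, not verbatim
]
untraceable = find_untraceable_quotes(quotes, feedback_by_id)
```

Here the second quote is returned as untraceable because it paraphrases the ticket instead of copying it.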

A human holds the trigger

0 replies auto-sent

Tasks are recommendations, not filed tickets. Any reply touching money, timing, or policy is blocked by a review flag until a person verifies it. Nothing sends itself.
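One way to picture the review gate is as a lock that only a person can release. This is a minimal sketch under assumed names (`ReviewFlag`, `ReplyDraft`, `is_locked`, `cleared_by` are all hypothetical), and the sample reply body is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReviewFlag:
    topic: str                        # e.g. "money", "timing", "policy"
    cleared_by: Optional[str] = None  # who verified the claim, if anyone


@dataclass
class ReplyDraft:
    body: str
    review_flags: List[ReviewFlag] = field(default_factory=list)

    def is_locked(self) -> bool:
        # Locked while any blocking flag has not been cleared by a person.
        return any(flag.cleared_by is None for flag in self.review_flags)


draft = ReplyDraft(
    body="Your payout batch is being investigated.",
    review_flags=[ReviewFlag(topic="money")],
)
locked_before = draft.is_locked()   # flag not yet cleared
draft.review_flags[0].cleared_by = "support-lead"
locked_after = draft.is_locked()    # a person has signed off
```

Unlocking a draft only makes it sendable by a human; nothing in this sketch sends anything.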

* 65% overall on a 20-item golden set with strict multi-axis scoring: intent, dimension, impact, and urgency must all match for an item to count. Individual axes score 75–90%. Scores are always shown with their evaluation set, never as a bare percentage.
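The gap between the overall score and the per-axis scores follows from the strict rule. A small sketch, using four toy items with made-up label values (not the real golden set), shows how an item counts only when all four axes match:

```python
AXES = ("intent", "dimension", "impact", "urgency")


def score(predictions, gold):
    """Return (strict accuracy, per-axis accuracy) for paired label dicts."""
    n = len(gold)
    pairs = list(zip(predictions, gold))
    per_axis = {
        axis: sum(p[axis] == g[axis] for p, g in pairs) / n for axis in AXES
    }
    # Strict: an item counts only if every axis matches the gold label.
    strict = sum(all(p[a] == g[a] for a in AXES) for p, g in pairs) / n
    return strict, per_axis


# Toy labels for illustration only.
gold = [
    {"intent": "bug", "dimension": "payments", "impact": "high", "urgency": "now"},
    {"intent": "bug", "dimension": "uploads", "impact": "medium", "urgency": "soon"},
    {"intent": "question", "dimension": "limits", "impact": "low", "urgency": "later"},
    {"intent": "noise", "dimension": "none", "impact": "low", "urgency": "later"},
]
predictions = [dict(item) for item in gold]
predictions[1]["urgency"] = "now"     # one axis wrong on item 2
predictions[2]["impact"] = "medium"   # one axis wrong on item 3

strict, per_axis = score(predictions, gold)
```

In this toy case every axis scores 75% or better, yet strict accuracy is only 50%, because a single wrong axis fails the whole item.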

Safety & data handling

Honest about what it protects — and what it doesn't yet.

Feedback is messy and often carries personal data. Here is what the pipeline does about that, stated plainly, including the limits it hasn't closed.

PII redaction runs first

Emails, phone numbers, and account identifiers are stripped before any model sees the text. The honest limit: human names aren't caught by regex in v1, which is a documented gap, not a hidden one. You can check this directly by opening any quote's source.
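A minimal sketch of regex-based redaction, and of why names slip through it. The patterns, the `ACC-` account format, the placeholder tokens, and the sample sentence are all assumptions for illustration; the repo's actual rules may differ.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
ACCOUNT = re.compile(r"\bACC-\d{4,}\b")  # hypothetical account ID format


def redact(text):
    """Replace emails, account IDs, and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = ACCOUNT.sub("[ACCOUNT]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


sample = "Dana Reyes (dana@example.com, +1 415 555 0100) on ACC-204981 saw no confirmation."
redacted = redact(sample)
```

The structured identifiers are replaced, but the name at the start of the sentence is left untouched, since a name has no fixed shape for a pattern to match.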

No silent automation

The pipeline never files a ticket or sends a message on its own. It proposes; a human disposes. Every consequential step has a person in the loop by design.

Traceable by construction

Nothing is asserted without a source ref. When no supporting context is loaded, the run says so on the results — it does not paper over the gap with a plausible-sounding policy.

Synthetic by design

Vela Pay is the built-in demo dataset — a synthetic B2B payments company with realistic feedback, policies, and known issues, no real customers. The pipeline runs identically when you upload your own data; the context docs are what change.

v1 limitations — regex PII redaction, direct RAG retrieval, manual orchestration — are documented in the repo, not hidden.

See it run on real feedback.

View the code on GitHub

The built-in demo uses Vela Pay — a synthetic B2B payments company with realistic policies, known issues, and customer feedback. No real customers, no real data. Upload your own CSV or paste raw feedback to run the same pipeline on your own input.

How it works

From a raw inbox to a stack of reviewable work packs.

The same depth you'd expect from a real product's explainer: the full pipeline, every field in a work pack, how the evaluation was designed, and what changed across iterations.

01 — The pipeline, one beat each


02 — What's in a work pack

Every field explained in plain terms. The structure is deliberate — some fields are computed, some block sending, some are only recommendations.

03 — How the evaluation was designed

Before any work pack was generated at scale, twenty real-looking feedback items were hand-labeled against a written rubric — what a correct classification looks like, what a passing work pack has to include.

That standard came first; everything since is measured against it, not the other way around. The rubric mixes deterministic checks the code runs automatically (13 rules) with judgment calls scored by a human reviewer (7 rules) — including whether tasks are correctly scoped and whether the reply tone matches the situation.

What counts as a pass · 5 of 20 · see all →

04 — Iteration evidence

The first version of the classifier got about 40% of the golden set right. Four rounds of changing the prompt brought that to about 65% — including one round that made things worse and was reverted, kept in the log rather than hidden.

Classification accuracy — 20-item golden set


Work pack generation — generate-v1 through v9 · 4 auto-check bugs fixed

Nine prompt versions across three sessions, validated by four rounds of human eval sampling (12 clusters total). Four code bugs in the automated quality checks were found and fixed during iteration — two from full-output scans, two surfaced during human review. None were visible from spot-checking alone.


New run

Pick a source of feedback.

Paste or upload your own feedback to test how it handles arbitrary input — that's the real test. The built-in packs are a fast preview, not a proof of anything.

One item per block. Prefilled with Vela Pay sample tickets.

Prototype note: This demo uses pre-generated results, so your pasted text isn't sent to a model. A real deployment would run the full pipeline on your input.

Drop a CSV of feedback rows

Columns like text, channel, account — mapped on import.
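A sketch of what mapping columns on import could look like, assuming headers are matched against a small alias table. The alias lists and the function names `map_columns` and `load_feedback` are hypothetical.

```python
import csv
import io

# Hypothetical aliases for the three canonical columns named above.
COLUMN_ALIASES = {
    "text": {"text", "body", "message", "feedback"},
    "channel": {"channel", "source"},
    "account": {"account", "account_id", "customer"},
}


def map_columns(header):
    """Map raw CSV header names to canonical field names, first match wins."""
    mapping = {}
    for column in header:
        normalized = column.strip().lower()
        for canonical, aliases in COLUMN_ALIASES.items():
            if normalized in aliases and canonical not in mapping.values():
                mapping[column] = canonical
    return mapping


def load_feedback(csv_text):
    """Parse CSV text into rows keyed by canonical field names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    mapping = map_columns(reader.fieldnames or [])
    return [
        {mapping[key]: value for key, value in row.items() if key in mapping}
        for row in reader
    ]


csv_text = (
    "Message,Source,Account_ID\n"
    '"Uploaded a 540-row file, nothing happened.",support_ticket,A-1\n'
)
rows = load_feedback(csv_text)
```

Columns that match no alias are dropped in this sketch; a real importer would more likely ask the user to map them.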

Context document (optional)

Upload a .md or .pdf — the model uses it to ground policy references and reply drafts. Without one, source_refs will be empty.

Upload .md or .pdf

source_refs will reflect your document's structure — results may vary from the demo dataset

Prototype note: This demo runs a simulated pipeline against pre-generated results, and no model is called. A production deployment would run the full 8-stage pipeline on your input.

This session

Runs from this browser tab only. Gone on refresh — there's no account and no saved history.

No runs yet this session.
Run the pipeline to see results here.

Running · {{ runLabel }}


Results · {{ runLabel }}


No product context loaded for this run — classification and clustering ran without Vela Pay's policy docs or known-issues list.

Intent · Signal · Flags


No clusters match the selected filters.


Problem brief

Key quotes


Tasks · recommendations, not auto-filed tickets


Done when: {{ t.acc }}

Reply draft · body only; the CRM adds salutation and sign-off

🔒 Locked — a blocking review flag must be cleared before this can be sent.

Reply draft

None: this cluster is classified as noise, so no reply is drafted and no tasks are recommended.

Signal strength

Computed from member count, account diversity, and severity — not a model opinion.
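To make "computed, not a model opinion" concrete, here is a deterministic sketch that uses the three inputs named above. The formula, the weights, and the severity scale are illustrative assumptions only; the page does not state how the inputs are actually combined.

```python
SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 3}  # hypothetical scale


def signal_strength(members, severity):
    """Combine member count, account diversity, and severity into one score."""
    member_count = len(members)
    distinct_accounts = len({member["account"] for member in members})
    # Illustrative weighting: diversity and severity count more than raw volume.
    return member_count + 2 * distinct_accounts + 3 * SEVERITY_WEIGHT[severity]


members = [
    {"id": "FB-01", "account": "acct-a"},
    {"id": "FB-26", "account": "acct-b"},
    {"id": "FB-27", "account": "acct-a"},
]
strength = signal_strength(members, "high")
```

Because the score is a pure function of the cluster's contents, the same cluster always gets the same value, and anyone can recompute it by hand.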

Cluster members · the evidence

Source refs

No source refs — {{ sel.noRefsReason }}

Review flags · block sending

Quality flags · don't block, for awareness


Export

Markdown for people, JSON shaped for Jira or Linear.
