01 · The brief

"Tell us which accounts are about to churn — and why."

The customer was a Series-C SaaS company with ~800 mid-market accounts. CS team of 12. Existing churn signal was a single product-usage dashboard plus a CSM's gut feel during weekly QBR prep. Result: churn signals showed up two weeks before renewal, which is two weeks too late.

The mandate: a daily Watch co-worker that surfaces 3–5 accounts per day with materially elevated risk, with the evidence trail, routed to the assigned CSM via Slack DM. Approval gate: the CSM decides whether to act. No autonomous outreach.

Diagram · 01 The agent loop — signals in, scored draft out
SIGNALS (usage delta · 14d, ticket sentiment, login frequency, exec contact gap, renewal proximity) → SCORER (per-signal z-score vs account history, weighted sum with CSM tuning, threshold gate, top-K only) → EXPLAINER (LLM call, 3 highest contributors, plain-English why that cites raw rows) → Slack DM to assigned CSM (FLAGGED). Deterministic scoring · LLM only writes the explanation, never the score.

Critical separation: the scorer is deterministic (math you can read). The LLM only writes the explanation — never the decision.

02 · The signals

Five inputs. Each measured against the account's own history.

Industry-baseline anomaly detection performs badly on heterogeneous customer bases: an enterprise account's "normal" looks like an SMB account's "high alert." We score each signal as a z-score against the same account's prior 90 days, not against the cohort.
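
A minimal sketch of scoring one signal against an account's own history, assuming the prior 90 daily values are already loaded into a list. The function name `account_zscore` and the two "return 0.0" guards are illustrative assumptions, not the production scorer.

```python
from statistics import mean, stdev

def account_zscore(history, current):
    """Z-score of `current` against the same account's own history.

    `history` is assumed to hold the prior 90 daily values of one signal
    for one account; the cohort is never consulted.
    """
    if len(history) < 2:
        return 0.0  # too little history to estimate spread (assumed policy)
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 0.0  # perfectly flat history: treat as no anomaly (assumed policy)
    return (current - mu) / sigma

# Made-up data: two accounts at very different scales.
enterprise_history = [1000, 1100, 900, 1050, 950] * 18  # 90 days
smb_history = [10, 12, 8, 11, 9] * 18                    # 90 days

print(round(account_zscore(enterprise_history, 700), 2))  # large drop for this account
print(round(account_zscore(smb_history, 10), 2))          # typical day for this account
```

Because each account is its own baseline, a value of 700 can be alarming for one account while 10 is unremarkable for another.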

  • Usage delta (14-day window): rolling sum of product events per workspace. Sensitive to seasonality — we deseasonalize by day-of-week.
  • Ticket sentiment: support ticket count and average severity, scored against the account's median. A single P0 outweighs five P3s.
  • Login frequency: distinct daily logins per account. Drop-off is more predictive than total volume.
  • Exec contact gap: days since last meaningful interaction with a named exec sponsor. CRM activity + calendar.
  • Renewal proximity: not a signal itself — a weight multiplier. The same 2σ anomaly matters more 60 days before renewal than 360 days before.
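
One way the weighted sum and the renewal multiplier could fit together is sketched below. The weights, the linear shape of `renewal_multiplier`, and the convention that every z-score is pre-signed so that positive means more risk are all assumptions for illustration; in the engagement described here the real weights came from the CSM workshop.

```python
# Hypothetical weights; the real ones were tuned by the CSM team.
WEIGHTS = {
    "usage_delta": 0.35,
    "ticket_sentiment": 0.25,
    "login_frequency": 0.25,
    "exec_contact_gap": 0.15,
}

def renewal_multiplier(days_to_renewal):
    """Assumed shape: 2.0 at renewal, decaying linearly to 1.0 at 365+ days."""
    clamped = max(0, min(days_to_renewal, 365))
    return 2.0 - clamped / 365

def composite_score(risk_z, days_to_renewal):
    """Weighted sum of risk-oriented z-scores, scaled by renewal proximity.

    `risk_z` maps signal name to a z-score already signed so that positive
    means "more risk" (for example, a usage drop is positive).
    Returns (score, contributions) so the top contributors can be reported.
    """
    contributions = {name: w * risk_z.get(name, 0.0) for name, w in WEIGHTS.items()}
    score = sum(contributions.values()) * renewal_multiplier(days_to_renewal)
    return score, contributions

z = {"usage_delta": 2.0, "ticket_sentiment": 0.5,
     "login_frequency": 1.0, "exec_contact_gap": 0.0}
near, parts = composite_score(z, days_to_renewal=60)
far, _ = composite_score(z, days_to_renewal=360)
print(near > far)  # same anomaly, higher score closer to renewal
print(sorted(parts, key=parts.get, reverse=True)[:3])  # top 3 contributors
```

Keeping the per-signal contributions alongside the score is what lets a later step name the three highest contributors without re-deriving anything.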

The CSM team tuned the weights in a half-day workshop. Not the engineers. That step mattered — they own the output, they own the math.

03 · The prompts

The LLM has a small, specific job.

The scoring is pure SQL + Python. The LLM has exactly one task: given the top 3 contributing signals (with raw numbers) and a small slice of supporting CRM context, write a 2–3 sentence plain-English summary the CSM can read in 4 seconds before deciding whether to act.

The prompt shape (paraphrased)

You are summarizing an anomaly alert for a CSM. You will receive: account name, top 3 signal deltas (signal name, current value, baseline, z-score), and the renewal window. Write 2–3 sentences explaining what changed and why it might matter. Cite the specific numbers. Do not speculate about causes. Do not recommend an action. Output plain text.
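
As a rough sketch, the inputs listed in that prompt could be rendered into a plain-text message like this. The `SignalDelta` dataclass, the field names, and the choice to rank by absolute z-score are hypothetical; the instruction text is the paraphrase above, and no LLM client is called here.

```python
from dataclasses import dataclass

# Instruction text paraphrased from the prompt shape above.
SYSTEM_PROMPT = (
    "You are summarizing an anomaly alert for a CSM. "
    "Write 2-3 sentences explaining what changed and why it might matter. "
    "Cite the specific numbers. Do not speculate about causes. "
    "Do not recommend an action. Output plain text."
)

@dataclass
class SignalDelta:
    name: str
    current: float
    baseline: float
    z_score: float

def build_user_message(account_name, deltas, days_to_renewal):
    """Render the top 3 signal deltas as plain text for the explainer call."""
    # Assumption: "top 3" means the three largest absolute z-scores.
    top3 = sorted(deltas, key=lambda d: abs(d.z_score), reverse=True)[:3]
    lines = [
        f"Account: {account_name}",
        f"Renewal in: {days_to_renewal} days",
        "Signals:",
    ]
    for d in top3:
        lines.append(
            f"- {d.name}: current={d.current:g}, "
            f"baseline={d.baseline:g}, z={d.z_score:+.1f}"
        )
    return "\n".join(lines)

example = [
    SignalDelta("usage_delta", 120, 400, -3.1),
    SignalDelta("login_frequency", 4, 9, -2.2),
    SignalDelta("ticket_sentiment", 3, 1, 1.8),
    SignalDelta("exec_contact_gap", 40, 35, 0.4),
]
print(build_user_message("Acme", example, 60))
```

The model only ever sees numbers the scorer already produced, which keeps the explanation tied to the raw values it is asked to cite.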

"Do not recommend an action" is load-bearing. The first draft of this co-worker had the LLM recommending outreach playbooks. CSMs hated it — the model didn't know the relationship history, and the recommendations felt presumptuous. We pulled it. Trust improved overnight.

04 · The tool calls

What the agent can actually do.

  • Read-only SQL queries against four pre-approved warehouse tables. Connections are pooled, queries are parameterized, and each query times out after 30s.
  • Slack DM to a specific user (the assigned CSM) via a Slack app with `chat:write` scope only.
  • Audit-log write to a Riyalabs-managed log table. Every alert is recorded with its score, contributors, and the CSM's action.
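
The read path can be sketched as an allowlist check plus bound parameters. This example uses an in-memory SQLite database as a stand-in for the warehouse, and the four table names are hypothetical; connection pooling and the 30s timeout are not shown because they depend on the warehouse driver.

```python
import sqlite3

# Hypothetical names for the four pre-approved tables.
ALLOWED_TABLES = {"usage_events", "support_tickets", "logins", "crm_activity"}

def read_rows(conn, table, account_id, limit=100):
    """Read rows for one account from an allowlisted table.

    A table name cannot be passed as a bound parameter, so it is checked
    against the allowlist; values are always bound, never interpolated.
    """
    if table not in ALLOWED_TABLES:
        raise PermissionError(f"table not pre-approved: {table}")
    sql = f"SELECT * FROM {table} WHERE account_id = ? LIMIT ?"
    return conn.execute(sql, (account_id, limit)).fetchall()

# Stand-in warehouse with one made-up row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (account_id TEXT, day TEXT, distinct_users INTEGER)")
conn.execute("INSERT INTO logins VALUES (?, ?, ?)", ("acct_1", "2024-01-01", 7))

print(read_rows(conn, "logins", "acct_1"))
```

Anything outside the allowlist fails before a query is built, which keeps the agent's reach as narrow as the tool list describes.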

Notice what's missing: no email send, no CRM update, no calendar invite. The CSM does those things from Slack as a follow-up. We considered "draft an email" as a second-stage agent — punted to a future engagement after the CSMs trusted the scoring.

"The LLM doesn't decide who's at risk. It explains the decision the math already made — in language the CSM can act on in 30 seconds."

05 · The false-positive math

Daily K = 5, but only because we measured.

Early prototype: alerts for any account with a composite score above 2.5. Result: 18–25 alerts a day. CSMs ignored them by week two. Classic alert fatigue.

We re-scoped to top-K-per-day, K=5. Daily, the agent ranks all accounts and only DMs the top 5. An account ranked below the cutoff waits for tomorrow's run. An account that stays in the top 5 on consecutive days gets re-DM'd with a note that it was also flagged yesterday.
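
A minimal sketch of the daily top-K selection with the repeat note, assuming composite scores are already computed per account. The function name, the dictionary shape, and the wording of the note are illustrative.

```python
def select_daily_alerts(scores, flagged_yesterday, k=5):
    """Rank all accounts by composite score and keep only the top k.

    `scores` maps account id to composite score; `flagged_yesterday` is the
    set of account ids that were DM'd on the previous run.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    alerts = []
    for account_id, score in ranked[:k]:
        repeat = account_id in flagged_yesterday
        alerts.append({
            "account_id": account_id,
            "score": score,
            "note": "Also flagged yesterday." if repeat else "",
        })
    return alerts

# Made-up scores for seven accounts.
scores = {"a": 3.1, "b": 2.9, "c": 4.0, "d": 1.2, "e": 2.0, "f": 0.5, "g": 3.5}
for alert in select_daily_alerts(scores, flagged_yesterday={"g", "d"}):
    print(alert)
```

There is no score threshold here: the cap on volume comes from the ranking alone, so alert count stays fixed even when many accounts score high.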

The trade-off: in weeks with broad risk (post-incident, end-of-quarter), real risk slid out of the daily top 5. We added an escalation rule: if an account has been flagged three days running, the alert also goes to a CSM-manager Slack channel, not just to the individual CSM.
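
The escalation rule can be sketched as a streak count over recent runs. The history format (a list of per-day sets of flagged account ids, oldest first, with today's run last) and the channel and user identifiers are assumptions.

```python
def consecutive_flag_days(history, account_id):
    """Count how many of the most recent runs in a row flagged this account.

    `history` is a list of sets of flagged account ids, oldest first,
    with today's run as the last element.
    """
    streak = 0
    for day in reversed(history):
        if account_id not in day:
            break
        streak += 1
    return streak

def route_alert(history, account_id, csm_id, manager_channel, threshold=3):
    """Always DM the assigned CSM; add the manager channel on a long streak."""
    targets = [csm_id]
    if consecutive_flag_days(history, account_id) >= threshold:
        targets.append(manager_channel)
    return targets

# Made-up history: "c" has been flagged on the last three runs.
history = [{"a", "b"}, {"a", "c"}, {"a", "c"}, {"a", "c"}]
print(route_alert(history, "c", "U_CSM", "#cs-managers"))
print(route_alert(history, "b", "U_CSM", "#cs-managers"))
```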

Iteration | Daily alerts | CSM trust score | Verdict
v1 · score > 2.5 | 18–25 | 2/10 by week 2 | Killed
v2 · top 10 daily | 10 | 5/10 by week 2 | Better, still noisy
v3 · top 5 + escalation | 5 (+ rare manager escalation) | 8/10 by week 2 | Shipped

06 · What we kept learning

Two months in, three lessons.

  • The model is the smallest part of the system. 80% of the work was on signals and thresholds. Swapping the LLM out for a different one would change nothing material in the output.
  • Boring deterministic logic beats clever model behavior. Anywhere we had the LLM "decide" something, we eventually rewrote it as plain SQL with clearer error modes.
  • The CSM tuning workshop should happen on day three, not day twenty-one. Tuning belongs to the people whose names are on the alerts.

Want this exact pattern against your data?

If you have a customer base, a CS team, and a churn signal that arrives too late, this pattern is a 2–3 week build. The first 45 minutes is the cheapest version of the conversation.
