Use Case · content-moderation

How to build an AI content moderation agent.

Trust & safety teams are drowning in the same triage pattern: a message lands, a moderator skims it, checks the policy wiki, takes one of five actions. That sequence is a tool-using loop. The hard part isn’t getting a model to classify text — it’s wiring the platform API, the policy doc, and the action surface into a clean, auditable loop a human can actually trust.

Classification without grounding is just guessing.

Most moderation prototypes start with a model and a hardcoded prompt: “flag if hateful, spam, or NSFW.” That works for a week, until policy changes. Then someone has to redeploy. Then you discover the bot never looked at the linked URL, just the message text. Then it flags a benign meme as a slur because the prompt drifted.

The shape that survives contact with real moderators: the agent reads your written policy on every decision, fetches and inspects any embedded URLs, classifies, and either acts (low stakes) or escalates (high stakes) — with a citation back to the policy clause in every log entry.
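The act-or-escalate split can be sketched in a few lines. This is a minimal illustration, not wmcp.sh code: the `Decision` record, the set of low-stakes action names, and the 0.8 confidence floor are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical action names, mirroring the four options used later on this page.
LOW_STAKES = {"none", "soft-warn", "hide"}
CONFIDENCE_FLOOR = 0.8  # assumed threshold; tune it against reviewer overrides


@dataclass
class Decision:
    message_id: str
    action: str         # what the model proposed
    policy_clause: str  # citation back to the policy doc, e.g. "3.2 Harassment"
    confidence: float   # model-reported, 0.0 to 1.0


def route(decision: Decision) -> str:
    """Return "act" only for cited, confident, low-stakes decisions."""
    if not decision.policy_clause.strip():
        return "escalate"  # no citation means no autonomous action
    if decision.action not in LOW_STAKES:
        return "escalate"  # bans, removals, anything unrecognized
    if decision.confidence < CONFIDENCE_FLOOR:
        return "escalate"  # borderline calls go to a human
    return "act"
```

Defaulting to "escalate" for anything unrecognized means a new or misspelled action name fails safe instead of executing.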

wmcp.sh wires this in: /integration/discord and /integration/slack for the platform, /integration/notion for the policy doc, and the generic URL fetcher for any link in a message. wmcp.sh is not affiliated with Discord, Slack, or Notion.

Flag → classify → action.

1. Event source. A Discord bot or Slack app subscribes to messages in moderated channels and forwards the message ID into a queue. Each event gets its own bounded agent run.

2. Tool gateway (wmcp.sh). The agent boots with platform tools (Discord or Slack), a Notion search for policy, and a generic URL fetcher for any links in the message.

3. Reasoning loop. The agent fetches the full message, expands any URLs (page text + OpenGraph), searches the policy doc for relevant clauses, classifies, and either acts (e.g. add reaction, hide) or files an item in the human review queue.

4. Audit. Every decision is logged with policy clause, confidence, and action taken. Reviewers can override and the override feeds back into prompt tuning.
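One plausible shape for the audit step is an append-only JSON Lines file, one entry per decision. The field names and the `override` slot below are assumptions for illustration; the page only specifies that policy clause, confidence, and action are logged.

```python
import json
from datetime import datetime, timezone


def audit_entry(message_id, policy_clause, confidence, action, override=None):
    """Build one JSON line describing a moderation decision."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "message_id": message_id,
        "policy_clause": policy_clause,
        "confidence": confidence,
        "action": action,
        # Filled in later if a reviewer overturns the decision.
        "override": override,
    })


def append_audit(path, line):
    """Append-only write, so earlier entries are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
```

A reviewer override would be recorded as a new entry rather than an edit to the old one, which keeps the trail intact for later prompt tuning.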

What wmcp.sh provides.

Capability | System | How wmcp.sh wires it
Read channel messages | Discord | /integration/discord
Read channel messages | Slack | /integration/slack
Search policy doc | Notion | /integration/notion
Inspect linked URL / image | Any URL | ✅ Generic /api/v1/tools?url=... (text + OG metadata)
Soft action (react / hide) | Discord / Slack | ✅ Scoped to low-stakes methods only
Escalate to human queue | Linear / your queue | ✅ OpenAPI adapter via /integration/openapi
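To make the "inspect linked URL" row concrete, here is a rough sketch of pulling links out of a message and building the fetcher request for each. The regex and both helper names are illustrative assumptions; only the `/api/v1/tools` path and its `url` query parameter come from this page.

```python
import re
from urllib.parse import urlencode

WMCP = "https://wmcp.sh"
# Deliberately simple pattern; it stops at whitespace and common closing brackets.
URL_RE = re.compile(r"https?://[^\s<>)\]]+")


def extract_urls(text):
    """Return unique URLs in order of first appearance, minus trailing punctuation."""
    seen = []
    for match in URL_RE.findall(text):
        url = match.rstrip(".,;:!?")
        if url not in seen:
            seen.append(url)
    return seen


def fetcher_request(url):
    """Build the GET URL for the generic fetcher endpoint."""
    return f"{WMCP}/api/v1/tools?" + urlencode({"url": url})
```

Encoding the target with `urlencode` matters: a link that carries its own query string would otherwise leak parameters into the gateway request.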

A grounded moderation pass.

Python sketch. Receives a message ID; emits a classification + action, always citing the relevant policy clause.

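import os

import httpx
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
WMCP = "https://wmcp.sh"

def tools_for(url):
    # Ask the gateway for the tool definitions that match this URL.
    r = httpx.get(f"{WMCP}/api/v1/tools", params={"url": url}, timeout=30)
    r.raise_for_status()  # fail loudly rather than moderate with a partial tool set
    return r.json()["tools"]

# Platform tools + policy search + generic URL fetcher.
tools = (
    tools_for("https://discord.com/api")
    + tools_for("https://www.notion.so/acme-policy")
    + tools_for("about:fetch")
)

msg_id = os.environ["MESSAGE_ID"]

# A single request. If the model replies with tool_use blocks, run those tools
# and send the results back as tool_result blocks in a follow-up call.
resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user",
        "content": f"Message {msg_id}. Fetch full content, expand any URLs, search the "
                   "policy doc for relevant clauses, classify, and propose one action: "
                   "none / soft-warn / hide / escalate. Cite the policy clause."}],
)

print(resp.content)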
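The sketch above sends one request; a real run has to keep going while the model asks for tools, and stop at a fixed budget. A minimal version of that bounded loop might look like the following. It assumes the response shape of the Anthropic Messages API (`stop_reason`, `tool_use` blocks answered by `tool_result` blocks); `run_tool` is a hypothetical dispatcher you would write to execute a tool call through the gateway, and the six-turn cap is an arbitrary choice.

```python
MAX_TURNS = 6  # assumed cap; keeps each event's run bounded


def run_bounded(call_model, run_tool, prompt, max_turns=MAX_TURNS):
    """Drive a tool-use loop until the model stops asking for tools.

    call_model(messages) -> response with .stop_reason and .content blocks
    run_tool(name, args) -> string result (hypothetical dispatcher to the gateway)
    Returns the final response, or None if the turn budget runs out.
    """
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        resp = call_model(messages)
        if resp.stop_reason != "tool_use":
            return resp  # final answer: classification, action, citation
        # Echo the assistant turn, then answer every tool_use block it contains.
        messages.append({"role": "assistant", "content": resp.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),
            }
            for block in resp.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    return None  # budget exhausted: treat as an escalation, not a pass
```

Passing `call_model` in as a function (for example, a lambda wrapping `client.messages.create` with the model, tools, and token limit already bound) keeps the loop testable without network access.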

Hardcoded classifier vs grounded agent.

Static classifier:

  • Policy lives in the prompt — redeploy on every change
  • Never inspects linked content
  • No citation, no audit trail
  • Bias is invisible until a bad action ships

wmcp.sh grounded loop:

  • Policy is fetched live from Notion on every call
  • URLs get expanded and inspected
  • Every decision cites a clause; everything is logged
  • Soft and hard actions are scoped as separate MCP tools
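Keeping soft and hard actions as separate tools makes scoping a simple allowlist filter over the tool list. The sketch below assumes each tool definition is a dict with a `name` key; every tool name in it is hypothetical, since the real names depend on what the gateway returns.

```python
# Hypothetical tool names; the real ones depend on what the gateway returns.
SOFT_ACTIONS = {"add_reaction", "hide_message"}
READ_ONLY = {"get_message", "search_policy", "fetch_url"}


def scope_tools(tools, allowed):
    """Keep only tool definitions whose name is on the allowlist."""
    return [t for t in tools if t.get("name") in allowed]


all_tools = [
    {"name": "get_message"},
    {"name": "search_policy"},
    {"name": "fetch_url"},
    {"name": "add_reaction"},
    {"name": "hide_message"},
    {"name": "ban_user"},
]

# The autonomous agent never sees the hard-action tool at all.
agent_tools = scope_tools(all_tools, READ_ONLY | SOFT_ACTIONS)
```

Filtering before the model call is stricter than asking the model not to use a tool: an action that is not in the list cannot be invoked.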

Common questions.

What does an AI content moderation agent do?
It watches messages, grounds each decision in your policy doc, classifies, and either acts (low stakes) or escalates (high stakes).

Should it act on its own?
For low-stakes actions, yes, with an audit trail. High-stakes actions (bans, public removals) route to a human queue.

How does it read images and links?
Via the generic /api/v1/tools?url=... fetcher; multimodal classification needs a vision model.

How do you avoid biased false positives?
Cite the policy clause for every action, gate borderline calls into human review, and log everything.

Which platforms are supported?
Discord and Slack natively; anything with an OpenAPI spec (Reddit, Mastodon, Matrix) via /integration/openapi.

Need this built for you?

Hosted moderation loop with audit + override.

Custom platform adapter + hosted MCP at mcp.yourbrand.com + verified badge. Starter $499 one-time · Managed Retainer $999/mo · Enterprise $4,999+/mo.

See /managed.