Q: Should an AI agent take moderation actions on its own?

For low-stakes actions (adding a soft-warn reaction, hiding spam links): yes, with an audit trail. For high-stakes actions (bans, public removals): no, route them to a human queue. wmcp.sh exposes high-stakes and low-stakes methods as separate MCP tools so you can scope agent permissions tightly.

Use Case · content-moderation

How to build an AI content moderation agent.

Q: What does an AI content moderation agent do?

It watches messages on a platform (Discord channel, Slack workspace), classifies them against your policy, and proposes an action: leave alone, soft-warn, hide, or escalate to a human. The agent reads the policy doc directly, so when the rules change it adapts without retraining.

Q: How does it read images and links?

Discord and Slack messages embed URLs. The agent calls a generic /api/v1/tools?url=... fetcher to retrieve and inspect the destination — page text, OpenGraph metadata, image hashes. For multimodal classification of images themselves, pair with a vision-capable model.
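Before the fetcher can be called, the URLs have to be pulled out of the message text. A minimal sketch, assuming plain http(s) links; the pattern is deliberately simple and the helper name extract_urls is hypothetical. It also stops at the `|` that Slack uses inside `<url|label>` links.

```python
import re

# Simple pattern: an http(s) URL runs until whitespace, an angle bracket, or "|".
URL_RE = re.compile(r"https?://[^\s<>|]+")


def extract_urls(text):
    """Return unique URLs in order of appearance, trimming trailing punctuation."""
    urls = []
    for match in URL_RE.findall(text):
        url = match.rstrip(".,;:!?)\"'")
        if url and url not in urls:
            urls.append(url)
    return urls


message = (
    "Free nitro at https://example.com/claim?id=1, "
    "details: <https://example.org/a|click here>."
)
print(extract_urls(message))
```

Trimming a trailing ")" will clip the rare URL that legitimately ends in one, so treat this as a starting point rather than a complete URL parser.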

Q: How do you avoid biased false positives?

Ground every decision in your written policy (stored in Notion or a wiki), require the agent to cite the specific policy clause for each action, and put any borderline call (confidence below a threshold) into a human review queue. Log everything to a warehouse for audit. wmcp.sh /managed bundles this audit layer.
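The gating described here can be expressed as a small routing function. This is a sketch under assumptions: the Proposal shape, the 0.85 threshold, and the action names are illustrative, not part of wmcp.sh.

```python
from dataclasses import dataclass
from typing import Optional

# Low-stakes actions the agent may take itself (assumed set).
ALLOWED_AUTO = {"none", "soft-warn", "hide"}
# Assumed threshold; tune it against how often reviewers override the agent.
CONFIDENCE_FLOOR = 0.85


@dataclass
class Proposal:
    action: str
    confidence: float
    policy_clause: Optional[str]


def route(p: Proposal) -> str:
    """Return the action to execute, or 'escalate' to send it to human review."""
    if not p.policy_clause:             # no citation, no automatic action
        return "escalate"
    if p.action not in ALLOWED_AUTO:    # high-stakes or unknown action
        return "escalate"
    if p.confidence < CONFIDENCE_FLOOR:  # borderline call
        return "escalate"
    return p.action


print(route(Proposal("hide", 0.95, "2.1")))
print(route(Proposal("hide", 0.60, "2.1")))
```

Every failure mode falls through to "escalate", so a malformed or uncited proposal can never trigger an action by accident.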

Q: Which platforms are supported?

Discord via /integration/discord, Slack via /integration/slack, Notion for policy via /integration/notion, plus generic URL fetching. Anything with an OpenAPI spec — Reddit, Mastodon, Matrix — works through /integration/openapi. wmcp.sh is not affiliated with Discord, Slack, or Notion.

Trust & safety teams are drowning in the same triage pattern: a message lands, a moderator skims it, checks the policy wiki, takes one of five actions. That sequence is a tool-using loop. The hard part isn’t getting a model to classify text — it’s wiring the platform API, the policy doc, and the action surface into a clean, auditable loop a human can actually trust.

The gap

Classification without grounding is just guessing.

Most moderation prototypes start with a model and a hardcoded prompt: “flag if hateful, spam, or NSFW.” That works for a week, until policy changes. Then someone has to redeploy. Then you discover the bot never looked at the linked URL, just the message text. Then it flags a benign meme as a slur because the prompt drifted.

The shape that survives contact with real moderators: the agent reads your written policy on every decision, fetches and inspects any embedded URLs, classifies, and either acts (low stakes) or escalates (high stakes) — with a citation back to the policy clause in every log entry.
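A citation is only useful if it points at a clause that exists. One way to check that, sketched under the assumption that the policy doc numbers its clauses like "2.1 No unsolicited advertising." (a hypothetical format; adapt the pattern to your own wiki):

```python
import re

# Assumed clause format: a dotted number, whitespace, then the clause text.
CLAUSE_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$", re.MULTILINE)


def parse_clauses(policy_text):
    """Map clause number -> clause text for every numbered line in the policy."""
    return {num: text.strip() for num, text in CLAUSE_RE.findall(policy_text)}


def citation_is_valid(cited, clauses):
    """True only if the cited clause number appears in the fetched policy."""
    return cited in clauses


policy = """1.1 No slurs or hate speech.
2.1 No unsolicited advertising.
2.2 No link shorteners that hide the destination."""

clauses = parse_clauses(policy)
print(citation_is_valid("2.1", clauses))
print(citation_is_valid("9.9", clauses))
```

A decision whose citation fails this check is a good candidate for the human queue, since the model may have invented the clause.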

wmcp.sh wires this in: /integration/discord and /integration/slack for the platform, /integration/notion for the policy doc, and the generic URL fetcher for any link in a message.

Architecture

Flag → classify → action.

1. Event source. A Discord bot or Slack app subscribes to messages in moderated channels and forwards the message ID into a queue. Each event gets its own bounded agent run.

2. Tool gateway (wmcp.sh). The agent boots with platform tools (Discord or Slack), a Notion search for policy, and a generic URL fetcher for any links in the message.

3. Reasoning loop. The agent fetches the full message, expands any URLs (page text + OpenGraph), searches the policy doc for relevant clauses, classifies, and either acts (e.g. add reaction, hide) or files an item in the human review queue.

4. Audit. Every decision is logged with the cited policy clause, the confidence score, and the action taken. Reviewers can override a decision, and each override feeds back into prompt tuning.
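The audit step can be sketched as an append-only record per decision, with overrides written as new records rather than edits. The field names and the AuditRecord type are assumptions for illustration, not the wmcp.sh log schema.

```python
import json
from dataclasses import asdict, dataclass, field, replace
from datetime import datetime, timezone
from typing import Optional


def _now():
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class AuditRecord:
    message_id: str
    action: str
    policy_clause: str
    confidence: float
    decided_by: str = "agent"
    overrides: Optional[str] = None  # the action a reviewer replaced, if any
    timestamp: str = field(default_factory=_now)


def to_json_line(record):
    """Serialize one record as a single JSON line for a log file or warehouse load."""
    return json.dumps(asdict(record), sort_keys=True)


def override(record, reviewer, new_action):
    """Return a new record for the reviewer's decision; the original stays untouched."""
    return replace(record, action=new_action, decided_by=reviewer,
                   overrides=record.action, timestamp=_now())


original = AuditRecord("msg-123", "hide", "2.1", 0.91)
corrected = override(original, "reviewer:alice", "none")
for rec in (original, corrected):
    print(to_json_line(rec))
```

Keeping both records means the override history is itself auditable, and the pairs of agent action and reviewer action are the raw material for prompt tuning.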

Tools the agent needs

What wmcp.sh provides.

Capability	System	How wmcp.sh wires it
Read channel messages	Discord	✅ /integration/discord
Read channel messages	Slack	✅ /integration/slack
Search policy doc	Notion	✅ /integration/notion
Inspect linked URL / image	Any URL	✅ Generic /api/v1/tools?url=... fetcher: text + OG metadata
Soft action (react / hide)	Discord / Slack	✅ Scoped to low-stakes methods only
Escalate to human queue	Linear / your queue	✅ OpenAPI adapter via /integration/openapi
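The "scoped to low-stakes methods only" row can also be enforced on the agent side with an allowlist applied before the tool list reaches the model. A minimal sketch, assuming each tool is a dict with a "name" key; every tool name below is a hypothetical placeholder, not an actual wmcp.sh identifier.

```python
# Hypothetical tool names; substitute the names your gateway actually returns.
READ_ONLY = {"get_message", "search_policy", "fetch_url"}
LOW_STAKES = {"add_reaction", "hide_message"}
ESCALATION = {"create_review_item"}


def scope_tools(tools, allowed):
    """Keep only tools whose name is explicitly allowlisted."""
    return [t for t in tools if t.get("name") in allowed]


all_tools = [
    {"name": "get_message"},
    {"name": "add_reaction"},
    {"name": "ban_user"},        # high stakes: never handed to the agent
    {"name": "delete_message"},  # high stakes: never handed to the agent
    {"name": "create_review_item"},
]

agent_tools = scope_tools(all_tools, READ_ONLY | LOW_STAKES | ESCALATION)
print([t["name"] for t in agent_tools])
```

An allowlist fails closed: a tool added to the gateway later stays invisible to the agent until someone deliberately adds its name.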

Code

A grounded moderation pass.

Python sketch. Receives a message ID; emits a classification + action, always citing the relevant policy clause.

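import os

import httpx
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
WMCP = "https://wmcp.sh"


def tools_for(url):
    """Ask the wmcp.sh gateway for the tool definitions it exposes for `url`."""
    r = httpx.get(f"{WMCP}/api/v1/tools", params={"url": url}, timeout=30)
    r.raise_for_status()  # fail loudly instead of parsing an error body
    return r.json()["tools"]


# Platform tools, the policy doc, and the generic URL fetcher.
tools = (
    tools_for("https://discord.com/api")
    + tools_for("https://www.notion.so/acme-policy")
    + tools_for("about:fetch")
)

msg_id = os.environ["MESSAGE_ID"]

resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user",
        "content": f"Message {msg_id}. Fetch full content, expand any URLs, search the "
                   "policy doc for relevant clauses, classify, and propose one action: "
                   "none / soft-warn / hide / escalate. Cite the policy clause."}],
)

# First model turn only; a full run would execute any tool_use blocks and loop.
print(resp.content)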
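The call above returns after a single model turn. To let the agent actually fetch the message and search the policy, the tool requests have to be executed and fed back until the model stops asking. A bounded loop might look like the sketch below. It relies on the Anthropic Messages API conventions (stop_reason of "tool_use", tool_use blocks answered by tool_result blocks); run_agent and call_tool are hypothetical names, and how a tool call is dispatched to the gateway is left as an injected function because the source does not specify it. The demo uses stand-in objects so it runs without network access.

```python
from types import SimpleNamespace


def run_agent(create, call_tool, prompt, max_turns=8):
    """Drive a tool-use loop until the model stops requesting tools.

    create:    callable(messages=...) returning an Anthropic-style response
    call_tool: callable(name, input) -> str; hypothetical executor for gateway tools
    """
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        resp = create(messages=messages)
        if resp.stop_reason != "tool_use":
            return "".join(b.text for b in resp.content if b.type == "text")
        # Echo the assistant turn back, then answer every tool_use block.
        messages.append({"role": "assistant", "content": resp.content})
        results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": call_tool(b.name, b.input)}
            for b in resp.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent exceeded max_turns without a final answer")


# Stand-ins for the model and the gateway, for demonstration only.
def fake_create(messages):
    if len(messages) == 1:
        return SimpleNamespace(stop_reason="tool_use", content=[
            SimpleNamespace(type="tool_use", id="tu_1",
                            name="get_message", input={"id": "42"})])
    return SimpleNamespace(stop_reason="end_turn", content=[
        SimpleNamespace(type="text", text="action=hide clause=2.1")])


calls = []


def fake_call_tool(name, args):
    calls.append((name, args))
    return "buy cheap followers http://spam.example"


final = run_agent(fake_create, fake_call_tool, "Message 42. Classify.")
print(final)
```

With the real client, create could be functools.partial(client.messages.create, model=..., max_tokens=..., tools=tools). The max_turns cap is what keeps each event's run bounded.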

Where we win

Hardcoded classifier vs grounded agent.

Static classifier:

Policy lives in the prompt — redeploy on every change
Never inspects linked content
No citation, no audit trail
Bias is invisible until a bad action ships

wmcp.sh grounded loop:

Policy is fetched live from Notion on every call
URLs get expanded and inspected
Every decision cites a clause; everything is logged
Soft and hard actions are scoped as separate MCP tools

FAQ

Common questions.

What does an AI content moderation agent do?

It watches messages, grounds each decision in your policy doc, classifies, and either acts (low stakes) or escalates (high stakes).

Should it act on its own?

For low-stakes actions, yes, with an audit trail. High-stakes actions (bans, public removals) go to a human queue.

How does it read images and links?

Via the generic /api/v1/tools?url=... fetcher; multimodal classification needs a vision model.

How do you avoid biased false positives?

Cite the policy clause for every action, gate borderline calls into human review, and log everything.

Which platforms are supported?

Discord and Slack natively; anything with an OpenAPI spec (Reddit, Mastodon, Matrix) via /integration/openapi.