
LLM-Fallback Economics: The Real Cost of Extracting Tools from Any URL via Haiku

Why hauling 50KB of raw HTML to Sonnet is a financial disaster, and how a dual-model dynamic adapter architecture saves up to 90% on token overhead.

2026-05-27

The $400 API Bill That Scraped 100 Pages

Let's say you're building a shopper-side AI agent. You want this agent to go to any web page, find the checkout or add-to-cart action, and call it programmatically. Unlike a seller-side admin tool that relies on a structured, private API key, your agent operates on behalf of the customer. It has to look at a public storefront, understand the page structure, and execute transactional actions in real time.

If you take a standard, developer-first approach, you might do this:

Fetch the raw HTML of the product page (often 50KB to 150KB of dense, script-laden junk).
Clean it slightly (converting to Markdown or extracting visible text).
Send it directly to a premium model like Claude 3.5 Sonnet with a system prompt like "You are a web shopping assistant. Extract the add-to-cart tool definition from this page and execute it."

This works. Sonnet is extremely smart. It figures out the selectors, parses the JSON-LD, identifies the storefront API calls, and outputs a clean JSON schema representing the tool.

But then your credit card statement arrives.

At $3.00 per million input tokens and $15.00 per million output tokens for Claude 3.5 Sonnet, a single product page with a 60KB payload costs approximately $0.05 to $0.08 per run. If your agent traverses a user’s checkout flow across a few pages, that's $0.25. If 10,000 active users run 10 shopping queries a day, your daily LLM bill sits at $5,000+.
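The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate, not a billing tool: the 4-characters-per-token ratio and the 500 output tokens are assumptions for illustration, and `page_cost` and `daily_bill` are hypothetical helper names.

```python
# Prices quoted above for Claude 3.5 Sonnet, in USD per million tokens.
SONNET_INPUT_PER_M = 3.00
SONNET_OUTPUT_PER_M = 15.00

# Assumption (not from the post): roughly 4 characters per token.
CHARS_PER_TOKEN = 4


def page_cost(payload_bytes: int, output_tokens: int) -> float:
    """Estimate the USD cost of one parse of a page of the given size."""
    input_tokens = payload_bytes / CHARS_PER_TOKEN
    input_cost = input_tokens * SONNET_INPUT_PER_M / 1_000_000
    output_cost = output_tokens * SONNET_OUTPUT_PER_M / 1_000_000
    return input_cost + output_cost


def daily_bill(users: int, queries_per_user: int, cost_per_query: float) -> float:
    """Daily spend if every query parses exactly one page."""
    return users * queries_per_user * cost_per_query


if __name__ == "__main__":
    # 60KB payload, assumed 500 output tokens for the extracted schema.
    per_page = page_cost(payload_bytes=60_000, output_tokens=500)
    print(f"per page: ${per_page:.4f}")
    print(f"per day:  ${daily_bill(10_000, 10, per_page):,.0f}")
```

Under these assumptions a 60KB page lands at the low end of the $0.05 to $0.08 range, and the agent's own system prompt and tool declarations push it higher.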

The economics are simple: you cannot scale a shopper-side AI agent using premium models as the frontline parser for arbitrary URLs. The input token bloat will crush your unit margins before you ever reach product-market fit.

Understanding the Token Bloat Pipeline

Why is the cost so high? Two things drive it: Model Context Protocol (MCP) overhead and raw payload size.

When an agent operates, it doesn’t just see the HTML. It carries the system context, the agent loop history, and the declarations of every single tool it can call. If your shopper agent is equipped with a broad set of browser, search, and transactional tools, the base prompt can start at around 10,000 tokens before it even looks at a webpage.

When you inject a raw URL payload, you are compounding this overhead. The HTML contains:

Redundant tracking scripts (GTM, TikTok pixel data).
Deeply nested <div> wrappers and inline CSS classes.
Massive JSON-LD arrays containing unrelated store policies.

By our estimate, roughly 95% of the tokens a model like Sonnet reads on such a page are noise, and you pay for every one of them. About 5% of the page data is enough to work out that the storefront expects a POST request to /cart/add.js with a variant_id and quantity.
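If you want to check the noise ratio on your own pages, a rough measurement needs only the standard library. The sketch below compares the length of the visible text with the length of the raw HTML; `VisibleTextExtractor` and `noise_fraction` are hypothetical names, and character counts are only a proxy for tokens.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not inside <script> or <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def noise_fraction(raw_html: str) -> float:
    """Share of the raw HTML's characters that is not visible text."""
    if not raw_html:
        return 0.0
    parser = VisibleTextExtractor()
    parser.feed(raw_html)
    parser.close()
    visible = " ".join(parser.chunks)
    return 1 - len(visible) / len(raw_html)
```

This is a measurement only. Visible text drops the form action and input names, so a cleaner that feeds the extraction step has to keep those tags.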

Enter the Dual-Model Dynamic Adapter Pattern

To solve the economics of arbitrary URL tool extraction, we need a fallback architecture that splits the cognitive load.

Instead of routing raw HTML directly to a premium model, we employ a three-tier parser stack:

Static Storefront Parser: Instantly extracts known schemas (like Shopify’s /products.json or standard Schema.org JSON-LD). This runs locally in microseconds and costs $0.00.
Semantic Fallback Parser: If the site is a custom SPA or obscure framework with no structured markup, we route the page text to a fast, cheap model—Claude 3.5 Haiku.
Orchestrator: Only the final, extracted tool schema (which is tiny, usually <500 tokens) is passed back to the premium model (Sonnet) guiding the main agent loop.
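The routing between the first two tiers can be sketched as follows. This is a minimal illustration, not the wmcp.sh implementation: `static_parse` and `resolve_tool` are hypothetical names, the static tier only looks for a Schema.org Product in JSON-LD, and the LLM tier is passed in as a callable so the dispatch logic stays testable without an API key.

```python
import json
import re
from typing import Callable, Optional

# Matches the body of <script type="application/ld+json"> blocks.
JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def static_parse(raw_html: str) -> Optional[dict]:
    """Tier 1: return the first Schema.org Product JSON-LD object, if any."""
    for body in JSON_LD_RE.findall(raw_html):
        try:
            data = json.loads(body)
        except json.JSONDecodeError:
            continue  # Malformed JSON-LD is common; skip it and keep looking.
        if isinstance(data, dict) and data.get("@type") == "Product":
            return data
    return None


def resolve_tool(
    url: str,
    raw_html: str,
    llm_fallback: Callable[[str, str], dict],
) -> dict:
    """Try the free static tier first; call the cheap model only on a miss."""
    product = static_parse(raw_html)
    if product is not None:
        return {"tier": "static", "url": url, "product": product}
    # Tier 2: semantic fallback (for example, a Haiku-backed extractor).
    return {"tier": "llm_fallback", "url": url, "tool": llm_fallback(url, raw_html)}
```

Whatever `resolve_tool` returns is the small payload handed to the orchestrator in tier 3, so the premium model never sees the raw page.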

By using Claude 3.5 Haiku ($0.80 per million input tokens and $4.00 per million output tokens) as the frontline semantic fallback, we cut our token cost by 73% compared to Sonnet.

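But we can go further. Combining Haiku with prompt caching cuts the cost of the static part of the prompt by up to 90%. Because the system prompt, tool schemas, and core parser definitions remain static across runs, they can be cached on Anthropic’s servers, and cached input tokens are billed at $0.08/M instead of $0.80/M, an order of magnitude drop. Two caveats apply: the page content changes on every run, so it is still billed at the full input rate, and caching only takes effect once the static prefix exceeds the model's minimum cacheable length.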
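How much of that 90% you actually see depends on how the prompt splits between static prefix and page content. A rough sketch, using the two Haiku input prices quoted above and ignoring cache writes and output tokens; `input_cost` and `saving` are hypothetical helper names and the token counts are illustrative:

```python
HAIKU_INPUT_PER_M = 0.80        # USD per million uncached input tokens
HAIKU_CACHED_READ_PER_M = 0.08  # USD per million cached input tokens


def input_cost(static_tokens: int, page_tokens: int, cached: bool) -> float:
    """USD input cost of one call; only the static prefix can be read from cache."""
    prefix_rate = HAIKU_CACHED_READ_PER_M if cached else HAIKU_INPUT_PER_M
    return (static_tokens * prefix_rate + page_tokens * HAIKU_INPUT_PER_M) / 1_000_000


def saving(static_tokens: int, page_tokens: int) -> float:
    """Fraction of input cost saved by a cache hit on the static prefix."""
    uncached = input_cost(static_tokens, page_tokens, cached=False)
    cached = input_cost(static_tokens, page_tokens, cached=True)
    return 1 - cached / uncached


if __name__ == "__main__":
    # Illustrative split: 10,000 static tokens and 3,000 page tokens.
    print(f"mixed prompt:  {saving(10_000, 3_000):.0%}")
    # The full 90% only appears when the whole input is cached prefix.
    print(f"all prefix:    {saving(10_000, 0):.0%}")
```

The larger the static prefix is relative to the page snippet, the closer the saving gets to 90%.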

Here is the estimated difference between the two architectures in dollars per page:

Step	Premium Direct (Sonnet)	Dual-Model Fallback (Haiku Cached)
Frontline Parse	$0.060 (Sonnet)	$0.005 (Haiku with Cache)
Agent Action	$0.015 (Sonnet)	$0.015 (Sonnet with small schema)
Total Per Page	$0.075	$0.020
1M Page Runs	$75,000	$20,000 (Save $55,000!)
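The table's totals are easy to reproduce, and doing so shows where the savings come from. The sketch below uses only the per-page figures from the table; `per_page` and `savings_fraction` are hypothetical helper names.

```python
# Per-page figures from the table above, in USD.
SONNET_DIRECT = {"frontline_parse": 0.060, "agent_action": 0.015}
HAIKU_FALLBACK = {"frontline_parse": 0.005, "agent_action": 0.015}


def per_page(costs: dict[str, float]) -> float:
    """Total cost of one page run."""
    return sum(costs.values())


def savings_fraction(baseline: float, candidate: float) -> float:
    """Fraction of the baseline cost that the candidate avoids."""
    return 1 - candidate / baseline


if __name__ == "__main__":
    total = savings_fraction(per_page(SONNET_DIRECT), per_page(HAIKU_FALLBACK))
    parse_only = savings_fraction(
        SONNET_DIRECT["frontline_parse"], HAIKU_FALLBACK["frontline_parse"]
    )
    print(f"whole page run:  {total:.0%} cheaper")
    print(f"frontline parse: {parse_only:.0%} cheaper")
```

The frontline parse gets more than 90% cheaper, but the Sonnet agent action costs the same in both columns, so the saving on a whole page run is closer to 73%.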

Implementing the Fallback Engine in Python

This is where open-source tools like wmcp.sh prove their worth. Instead of each developer building these parser chains and caches by hand, wmcp.sh wraps the entire three-part extraction (Shopify storefront, JSON-LD, and LLM fallback) into a single, highly optimized endpoint that resolves any URL into agent-callable tools in under 500ms.

If you want to build a raw fallback parser yourself, the core is straightforward. Here is a minimal Python script that implements the LLM-fallback pattern using Claude Haiku with custom tool schema output:

import json
import os
import re

from anthropic import Anthropic

# Initialize the Anthropic client (API key is read from the environment)
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# Lightweight JSON Schema for the expected output
EXTRACTION_SCHEMA = {
    "name": "add_to_cart_tool",
    "description": "JSON schema for the dynamic tool required to add this product to the cart",
    "input_schema": {
        "type": "object",
        "properties": {
            "endpoint_url": {"type": "string", "description": "Absolute URL to POST/GET payload"},
            "method": {"type": "string", "enum": ["POST", "GET"]},
            "payload_format": {"type": "string", "enum": ["json", "form"]},
            "required_fields": {
                "type": "object",
                "description": "Key-value pair of fields. Specify defaults or types."
            }
        },
        "required": ["endpoint_url", "method", "payload_format", "required_fields"]
    }
}

def extract_url_tools(url: str, raw_html: str) -> dict:
    # 1. Strip <script> and <style> blocks locally, collapse whitespace, and
    #    truncate to save input tokens. Tags such as <form> and <input> are kept
    #    on purpose: the model needs them to find the add-to-cart action.
    cleaned_text = re.sub(r'<script\b.*?</script>', '', raw_html, flags=re.DOTALL | re.IGNORECASE)
    cleaned_text = re.sub(r'<style\b.*?</style>', '', cleaned_text, flags=re.DOTALL | re.IGNORECASE)
    cleaned_text = " ".join(cleaned_text.split())[:12000]  # Roughly 3,000 tokens at ~4 chars/token
    
    # 2. Query Claude Haiku, marking the system instructions as a cache breakpoint
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=1000,
        temperature=0.0,
        system=[
            {
                "type": "text",
                "text": (
                    "You are a low-level, high-efficiency web semantic parser. "
                    "Analyze the provided web text and extract the exact form submission action "
                    "or API endpoint needed to perform an 'add to cart' action on this storefront. "
                    "You must output ONLY a valid JSON object matching the provided schema. "
                    "Do not include markdown packaging, code fences, or chat preamble."
                ),
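                # Mark the end of the static prefix as a cache breakpoint.
                # Note: prefixes shorter than the model's minimum cacheable length
                # (2,048 tokens for Claude 3.5 Haiku per Anthropic's docs) are not
                # cached, so this only pays off once the system block is much larger.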
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {
                "role": "user",
                "content": f"URL: {url}\nPage Content Snippet:\n{cleaned_text}\n\nSchema:\n{json.dumps(EXTRACTION_SCHEMA)}"
            }
        ]
    )
    
    # Parse the output safely
    try:
        tool_definition = json.loads(response.content[0].text.strip())
        return {
            "status": "success",
            "model_used": "claude-3-5-haiku",
            "tool": tool_definition
        }
    except json.JSONDecodeError:
        return {
            "status": "error",
            "message": "Model failed to output valid JSON",
            "raw_output": response.content[0].text
        }

# Example usage (mocking a public storefront HTML stream)
if __name__ == "__main__":
    sample_url = "https://example-dtc-brand.com/products/sleek-wool-socks"
    sample_html = """
    <html>
      <body>
        <h1>Sleek Wool Socks</h1>
        <form action="/cart/add" method="POST" id="AddToCartForm">
          <input type="hidden" name="id" value="4829103952" />
          <input type="number" name="quantity" value="1" min="1" />
          <button type="submit">Add to Cart</button>
        </form>
      </body>
    </html>
    """
    
    result = extract_url_tools(sample_url, sample_html)
    print(json.dumps(result, indent=2))

The Broader Landscape: Shopper-Side vs Owner-Side

When you look at modern tool-calling platforms like Composio or Shopify’s official AI Toolkit, they focus heavily on owner-side integrations. They want you to supply an OAuth credential or an admin API token so that the agent can read dashboard settings or process orders in the background.

This is highly effective for internal administration, but it fails completely for user-facing shopping assistants.

A shopper-side agent must be able to navigate the web exactly like a consumer. It cannot ask a merchant for an admin API token just to add a pair of shoes to a cart. It needs to look at the storefront interface and generate dynamic adapters instantly.

By shifting to a shopper-side architecture powered by wmcp.sh, developers can build agents that operate across millions of merchant storefronts out of the box, with zero manual setup. And by combining deterministic static parsing with a cached, low-cost Claude Haiku fallback layer, we ensure that the extraction step runs at a fraction of a cent per page.

Unit economics are what separate toys from sustainable, scalable AI applications. Don't waste your Sonnet tokens on basic HTML cleaning—let Haiku and local parser fallback chains handle the grunt work, and save your premium reasoning for the strategic orchestration.

