The 3 Patterns of E-Commerce Parsing: One Weekend, Two Adapters
Shopify product JSON, JSON-LD markup, and LLM-based fallbacks: how to structure a resilient storefront parser for transactional AI agents.
The Scraping Fragility Trap
If you have ever built an e-commerce shopping agent, you have faced the Scraping Fragility Trap.
You write a custom scraping script utilizing Playwright or Puppeteer for a popular online shop. You locate the DOM selectors for the product title (.product-title), price (span.price), and sizes (select#sizes). Your script works flawlessly on your machine. You push it to production.
Three days later, the merchant updates their CSS framework. Suddenly, .product-title is now .product__title-wrapper, and your selector-bound parser returns None. Your agent crashes.
To build a reliable consumer-side agent that operates across millions of digital storefronts, you cannot rely on DOM selectors tied to a merchant's current styling. You need a structured, tiered approach to storefront extraction.
In e-commerce parsing, there are three primary patterns of data extraction, ordered from fastest and cheapest to slowest and most expensive:
- Direct Storefront API Extraction (Shopify products.json)
- Structured Semantic Tag Extraction (JSON-LD markup)
- Heuristics-Based LLM Fallback (Claude 3.5 Haiku)
By combining these three patterns into a unified adapter architecture, you can build a storefront parser in a single weekend that works on 99% of e-commerce web pages without constant maintenance.
Pattern 1: Direct Storefront API Extraction
The easiest way to parse a storefront is to bypass the HTML entirely.
Over 2 million e-commerce brands run on Shopify. One of Shopify's best-kept secrets is that almost every store exposes a public, unauthenticated endpoint that returns product data as raw JSON (this is separate from Shopify's token-authenticated GraphQL Storefront API):
https://<store-domain>/products/<product-handle>.json
If you append .json to any standard Shopify product URL, you will receive a clean, structured JSON payload containing the product ID, description, title, vendor, full variant details (including SKU, inventory state, and price), and high-resolution image links.
This endpoint is incredibly fast (typically resolving in under 80ms) and bypasses heavy DOM processing entirely.
Below is a production-grade TypeScript implementation of a Shopify Storefront API Adapter designed to run inside edge environments like Cloudflare Workers:
export interface ShopifyProduct {
  id: number;
  title: string;
  body_html: string;
  vendor: string;
  variants: Array<{
    id: number;
    title: string;
    price: string; // decimal string, e.g. "19.99"
    sku: string;
    available?: boolean; // treated as optional: not every payload includes it
  }>;
}

export class ShopifyAdapter {
  static async extract(url: string): Promise<ShopifyProduct | null> {
    try {
      const parsedUrl = new URL(url);
      // Drop a trailing slash or an existing ".json" so the suffix is added exactly once
      const path = parsedUrl.pathname.replace(/\/+$/, '').replace(/\.json$/, '');
      // Ensure the URL matches the standard Shopify product path
      if (!/^\/products\/[^/]+$/.test(path)) {
        return null;
      }
      // Construct the JSON endpoint URL (query string and fragment are discarded)
      const apiEndpoint = `${parsedUrl.origin}${path}.json`;
      const response = await fetch(apiEndpoint, {
        headers: {
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36',
          'Accept': 'application/json'
        }
      });
      if (!response.ok) {
        return null;
      }
      const payload = (await response.json()) as { product?: ShopifyProduct };
      // A 200 response is not guaranteed to be a product payload
      return payload.product ?? null;
    } catch {
      // Invalid URL, network failure, or non-JSON body: let the next adapter try
      return null;
    }
  }
}
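The adapter above only accepts paths that begin with /products/, but links to Shopify products often carry a collection prefix (/collections/<collection>/products/<handle>) or a ?variant= query string. A minimal Python sketch of URL normalization is shown below. The function name shopify_json_url is hypothetical, and the sketch assumes that rewriting to the canonical /products/<handle>.json form is acceptable, which also drops any locale prefix.

```python
import re
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Matches ".../products/<handle>" at the end of the path, tolerating an
# existing ".json" / ".js" suffix and a trailing slash.
_PRODUCT_PATH = re.compile(r"/products/([^/?#]+?)(?:\.json|\.js)?/?$")


def shopify_json_url(url: str) -> Optional[str]:
    """Return the canonical product JSON URL, or None if this is not a product URL.

    Hypothetical helper: the name and the decision to discard collection and
    locale prefixes are assumptions made for this sketch.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    match = _PRODUCT_PATH.search(parts.path)
    if match is None:
        return None
    handle = match.group(1)
    # Query string and fragment are dropped on purpose.
    return urlunsplit((parts.scheme, parts.netloc, f"/products/{handle}.json", "", ""))
```

Returning None for anything that does not look like a product URL lets the caller skip the network request entirely and move straight to the next pattern.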
Pattern 2: Structured Semantic Tag Extraction
Not every storefront runs on Shopify. Magento, WooCommerce, BigCommerce, and custom-built headless frameworks represent a massive chunk of the e-commerce market.
To scrape these sites without fragile selector paths, we utilize JSON-LD (JavaScript Object Notation for Linked Data).
Google, Bing, and other search engines encourage e-commerce stores to embed structured metadata within their HTML, and reward it with rich results: product ratings, prices, and stock indicators displayed directly in search listings. The most common format for this metadata is the schema.org vocabulary serialized as JSON-LD, which is wrapped in a <script type="application/ld+json"> tag.
Because broken markup costs merchants those rich results, they have a strong incentive to keep it correct, so this data tends to be far more stable than CSS class names.
Below is a Python implementation of a JSON-LD parser that extracts e-commerce schemas using BeautifulSoup:
import json

import requests
from bs4 import BeautifulSoup


def _is_product(item) -> bool:
    """Return True for a JSON-LD node whose @type is, or includes, "Product"."""
    if not isinstance(item, dict):
        return False
    types = item.get("@type")
    # @type may be a single string or a list of strings
    return "Product" in (types if isinstance(types, list) else [types])


class JsonLdAdapter:
    @staticmethod
    def extract(url: str) -> dict | None:
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36"
        }
        try:
            r = requests.get(url, headers=headers, timeout=8)
        except requests.RequestException:
            return None
        if r.status_code != 200:
            return None
        soup = BeautifulSoup(r.text, 'html.parser')
        for script in soup.find_all('script', type='application/ld+json'):
            try:
                data = json.loads(script.string or '')
            except json.JSONDecodeError:
                continue  # malformed block: try the next script tag
            # Normalize a single top-level object and a top-level array
            items = data if isinstance(data, list) else [data]
            for item in items:
                if _is_product(item):
                    return item
                # Traverse @graph if present
                if isinstance(item, dict):
                    for sub_item in item.get("@graph", []):
                        if _is_product(sub_item):
                            return sub_item
        return None
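The parser returns the raw Product object, so the agent still has to read the price out of it. In schema.org markup the price normally sits under offers, which may be a single Offer, a list of offers, or an AggregateOffer that carries lowPrice instead of price. The sketch below flattens those cases. The function names and the output keys (name, price, currency, in_stock) are assumptions made for illustration, not part of any standard.

```python
from typing import Optional


def first_offer(product: dict) -> Optional[dict]:
    """Return the first offer object, whether "offers" is a dict or a list."""
    offers = product.get("offers")
    if isinstance(offers, list):
        offers = next((o for o in offers if isinstance(o, dict)), None)
    return offers if isinstance(offers, dict) else None


def summarize_product(product: dict) -> dict:
    """Flatten a JSON-LD Product into a small dict (hypothetical output shape)."""
    offer = first_offer(product) or {}
    # An AggregateOffer exposes lowPrice/highPrice rather than a single price.
    price = offer.get("price", offer.get("lowPrice"))
    availability = offer.get("availability")
    in_stock = None  # None means the page did not say
    if isinstance(availability, str):
        # Values look like "https://schema.org/InStock" or just "InStock".
        in_stock = availability.rstrip("/").rsplit("/", 1)[-1] == "InStock"
    return {
        "name": product.get("name"),
        "price": str(price) if price is not None else None,
        "currency": offer.get("priceCurrency"),
        "in_stock": in_stock,
    }
```

Keeping the price as a string avoids floating-point rounding before the agent decides how to handle currency arithmetic.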
Pattern 3: Heuristics-Based LLM Fallback
If both the direct API lookup and JSON-LD extraction fail, the third and final line of defense is the LLM Fallback.
Some custom headless storefronts expose neither JSON-LD markup nor a product JSON endpoint. In this case, we fetch the page HTML, strip out boilerplate (script blocks, style tags, the document head), and pass the remaining text to a lightweight LLM, such as claude-3-5-haiku, with a strict system prompt that demands JSON-only output.
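The two deterministic halves of this step, stripping boilerplate before the model call and validating the reply after it, can be sketched with the standard library alone. The model call itself is deliberately left out. The prompt wording, the key set, and the 12,000-character cap are assumptions chosen for illustration.

```python
import json
from html.parser import HTMLParser
from typing import Optional

# Elements whose contents carry no visible product information.
SKIP_TAGS = {"script", "style", "head", "noscript", "svg", "template"}


class _TextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(" ".join(data.split()))


def page_text(html: str, max_chars: int = 12000) -> str:
    """Return visible page text, one chunk per line, truncated to max_chars."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)[:max_chars]


# Hypothetical prompt: the exact wording and key names are assumptions.
SYSTEM_PROMPT = (
    "Extract the product from the page text. Reply with one JSON object and "
    'nothing else, using exactly these keys: "name" (string), "price" '
    '(string or null), "currency" (string or null), "in_stock" (boolean or null).'
)
REQUIRED_KEYS = {"name", "price", "currency", "in_stock"}


def parse_llm_reply(reply: str) -> Optional[dict]:
    """Accept the reply only if it is a JSON object with exactly the expected keys."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return None
    if not isinstance(data["name"], str) or not data["name"].strip():
        return None
    return data
```

Rejecting any reply that is not exactly the expected object means a chatty or malformed answer becomes a clean miss instead of bad data reaching the agent.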
While this pattern is highly resilient, it is also the slowest (roughly 1.5s of latency, versus well under 100ms for the deterministic adapters) and the most expensive. It should be used exclusively as a fallback when the deterministic adapters fail.
At wmcp.sh, we combine these three patterns in a unified dynamic router. By checking Shopify first and JSON-LD second, we handle 90% of requests deterministically at the edge, maintaining sub-100ms response times while falling back to Haiku only when both deterministic adapters come up empty.
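The routing idea reduces to trying adapters in cost order and stopping at the first hit. The sketch below is not the wmcp.sh implementation. It assumes each adapter is a plain callable that takes a URL and returns a product dict or None, and the stub adapters stand in for the three patterns.

```python
from typing import Callable, Iterable, Optional, Tuple

# An adapter takes a product URL and returns a product dict, or None on a miss.
Adapter = Callable[[str], Optional[dict]]


def route(
    url: str, adapters: Iterable[Tuple[str, Adapter]]
) -> Tuple[Optional[str], Optional[dict]]:
    """Try adapters in order and return (adapter_name, product) for the first hit."""
    for name, adapter in adapters:
        try:
            result = adapter(url)
        except Exception:
            # A crashing adapter counts as a miss so the next tier still runs.
            result = None
        if result:
            return name, result
    return None, None


# Stand-in adapters for the demo; real ones would wrap the three patterns.
def shopify_stub(url: str) -> Optional[dict]:
    return None  # pretend this is not a Shopify store


def jsonld_stub(url: str) -> Optional[dict]:
    return {"name": "Trail Shoe"}


def llm_stub(url: str) -> Optional[dict]:
    raise RuntimeError("not reached: an earlier tier already answered")


TIERS = [("shopify", shopify_stub), ("json-ld", jsonld_stub), ("llm", llm_stub)]
print(route("https://shop.example/products/trail-shoe", TIERS))
# ('json-ld', {'name': 'Trail Shoe'})
```

Returning the adapter name alongside the product makes it easy to log how often each tier answers, which is how a figure like the deterministic hit rate can be measured.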