
The 3 Patterns of E-Commerce Parsing: One Weekend, Two Adapters

Shopify JSON-LD, raw HTML scraping, and LLM-based fallbacks—how to structure a resilient storefront parser for transactional AI agents.

2026-05-27

The Scraping Fragility Trap

If you have ever built an e-commerce shopping agent, you have faced the Scraping Fragility Trap.

You write a custom scraping script utilizing Playwright or Puppeteer for a popular online shop. You locate the DOM selectors for the product title (.product-title), price (span.price), and sizes (select#sizes). Your script works flawlessly on your machine. You push it to production.

Three days later, the merchant updates their CSS framework. Suddenly, .product-title is now .product__title-wrapper, and your selector-bound parser returns None. Your agent crashes.

To build a reliable consumer-side agent that operates across millions of storefronts, you cannot rely on DOM selectors tied to each site's visual markup. You need a structured, tiered approach to storefront extraction.

In e-commerce parsing, there are three primary patterns of data extraction, ordered from fastest and cheapest to slowest and most expensive:

  1. Direct Storefront API Extraction (Shopify per-product .json endpoint)
  2. Structured Semantic Tag Extraction (JSON-LD markup)
  3. Heuristics-Based LLM Fallback (Claude 3.5 Haiku)

By combining these three patterns into a unified adapter architecture, you can build a storefront parser in a single weekend that works on 99% of e-commerce web pages without constant maintenance.


Pattern 1: Direct Storefront API Extraction

The easiest way to parse a storefront is to bypass the HTML entirely.

Over 2 million e-commerce brands run on Shopify. One of Shopify's best-kept secrets is that almost every store exposes a public, unauthenticated endpoint that returns a product's structured data as raw JSON:

https://<store-domain>/products/<product-handle>.json

If you append .json to any standard Shopify product URL, you will receive a clean, structured JSON payload containing the product ID, description, title, vendor, full variant details (including SKU, inventory state, and price), and high-resolution image links.

This endpoint is incredibly fast (typically resolving in under 80ms) and bypasses heavy DOM processing entirely.

Below is a TypeScript implementation of a Shopify adapter for this endpoint, designed to run inside edge environments like Cloudflare Workers. (It targets the public .json route, not Shopify's separate GraphQL Storefront API.)

export interface ShopifyProduct {
  id: number;
  title: string;
  body_html: string;
  vendor: string;
  variants: Array<{
    id: number;
    title: string;
    price: string;
    sku: string;
    available: boolean;
  }>;
}

export class ShopifyAdapter {
  static async extract(url: string): Promise<ShopifyProduct | null> {
    try {
      const parsedUrl = new URL(url);
      
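      // Match /products/<handle>, also when prefixed by /collections/<name>,
      // and tolerate a trailing slash or an existing .json suffix
      const match = parsedUrl.pathname.match(/\/products\/([^/]+?)(?:\.json)?\/?$/);
      if (!match) {
        return null;
      }
      
      // Construct the product JSON endpoint URL from the handle
      const apiEndpoint = `${parsedUrl.origin}/products/${match[1]}.json`;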
      
      const response = await fetch(apiEndpoint, {
        headers: {
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36',
          'Accept': 'application/json'
        }
      });
      
      if (!response.ok) {
        return null;
      }
      
      const payload = await response.json() as { product: ShopifyProduct };
      return payload.product;
    } catch (e) {
      return null;
    }
  }
}
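
The adapter returns Shopify's raw shape, in which each price is a string. Before comparing it with results from other adapters, a caller will usually want a store-agnostic record. The following is a minimal Python sketch of that step, assuming only the fields listed in the ShopifyProduct interface above. NormalizedProduct, NormalizedVariant, and normalize_shopify are hypothetical names, and the sample payload is invented.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class NormalizedVariant:
    sku: str
    title: str
    price: Decimal
    available: bool


@dataclass
class NormalizedProduct:
    title: str
    vendor: str
    variants: list


def normalize_shopify(product: dict) -> NormalizedProduct:
    """Convert a ShopifyProduct-shaped dict into a store-agnostic record."""
    variants = [
        NormalizedVariant(
            sku=v.get("sku") or "",
            title=v.get("title", ""),
            # Prices arrive as strings; Decimal avoids float rounding errors.
            price=Decimal(v["price"]),
            # Treat a missing flag as unavailable rather than guessing.
            available=bool(v.get("available", False)),
        )
        for v in product.get("variants", [])
    ]
    return NormalizedProduct(
        title=product.get("title", ""),
        vendor=product.get("vendor", ""),
        variants=variants,
    )


# Invented sample payload, for illustration only.
sample = {
    "id": 1,
    "title": "Canvas Tote",
    "body_html": "<p>Sturdy.</p>",
    "vendor": "Acme",
    "variants": [
        {"id": 10, "title": "Natural", "price": "24.50", "sku": "TOTE-NAT", "available": True},
        {"id": 11, "title": "Black", "price": "26.00", "sku": "TOTE-BLK", "available": False},
    ],
}

normalized = normalize_shopify(sample)
cheapest_in_stock = min(
    (v for v in normalized.variants if v.available), key=lambda v: v.price
)
```

Keeping prices as Decimal rather than float matters once an agent starts comparing or summing amounts across stores.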

Pattern 2: Structured Semantic Tag Extraction

Not every storefront runs on Shopify. Magento, WooCommerce, BigCommerce, and custom-built headless frameworks represent a massive chunk of the e-commerce market.

To scrape these sites without fragile selector paths, we use JSON-LD (JavaScript Object Notation for Linked Data).

Google, Bing, and other search engines encourage e-commerce stores to embed structured metadata within their HTML, and reward it with richer listings. This metadata allows Google to display product ratings, prices, and stock indicators directly in search results. The vocabulary comes from schema.org, and the most common serialization is JSON-LD, embedded in a <script type="application/ld+json"> tag.

Because merchants must keep this markup correct to stay eligible for those rich results, the data is far more stable than CSS class names and rarely changes shape.

Below is a Python implementation of a JSON-LD parser that extracts e-commerce schemas using BeautifulSoup:

import json
from bs4 import BeautifulSoup
import requests

class JsonLdAdapter:
    @staticmethod
    def extract(url: str) -> dict | None:
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36"
        }
        try:
            r = requests.get(url, headers=headers, timeout=8)
            if r.status_code != 200:
                return None
                
            soup = BeautifulSoup(r.text, 'html.parser')
            scripts = soup.find_all('script', type='application/ld+json')
            
            for script in scripts:
                try:
                    data = json.loads(script.string or '')
                except (json.JSONDecodeError, TypeError):
                    continue

                # Normalize a single object or a top-level array into a list
                items = data if isinstance(data, list) else [data]
                for item in items:
                    if not isinstance(item, dict):
                        continue
                    # Check the node itself, then any nodes nested under @graph
                    graph = item.get("@graph")
                    candidates = [item] + (graph if isinstance(graph, list) else [])
                    for node in candidates:
                        if not isinstance(node, dict):
                            continue
                        # @type may be a string or a list of strings
                        types = node.get("@type")
                        types = types if isinstance(types, list) else [types]
                        if "Product" in types:
                            return node
            return None
        except Exception:
            return None
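
The adapter returns the raw Product node, so a caller still has to dig the price out of it. In schema.org markup, offers may be a single Offer object or a list of them, and availability is usually a URL such as https://schema.org/InStock. Here is a hedged sketch of that step; extract_offer is a hypothetical helper, and the sample node is invented.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def extract_offer(product_node: dict) -> Optional[dict]:
    """Pull price, currency, and stock state from a schema.org Product node."""
    offers = product_node.get("offers")
    # `offers` may be one Offer object or a list of them.
    if isinstance(offers, dict):
        offers = [offers]
    if not isinstance(offers, list):
        return None

    for offer in offers:
        if not isinstance(offer, dict):
            continue
        # AggregateOffer nodes carry lowPrice instead of price.
        raw_price = offer.get("price", offer.get("lowPrice"))
        try:
            price = Decimal(str(raw_price))
        except InvalidOperation:
            # Missing or unparseable price: try the next offer.
            continue
        availability = str(offer.get("availability", ""))
        return {
            "price": price,
            "currency": offer.get("priceCurrency"),
            # Values look like "https://schema.org/InStock" or plain "InStock".
            "in_stock": availability.rsplit("/", 1)[-1] == "InStock",
        }
    return None


# Invented sample node, for illustration only.
node = {
    "@type": "Product",
    "name": "Trail Runner",
    "offers": {
        "@type": "Offer",
        "price": "89.00",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    },
}
offer = extract_offer(node)
```

Returning the first parseable offer is a simplification; a store with several variants may list one offer per variant, and a real caller would likely want to match on SKU.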

Pattern 3: Heuristics-Based LLM Fallback

If both the direct API lookup and JSON-LD extraction fail, the third and final line of defense is the LLM Fallback.

Some custom headless storefronts ship no JSON-LD at all and block the API paths. In this case, we fetch the page HTML, strip the parts that carry no product information (script blocks, style tags, the head element), and pass the remaining visible text to a lightweight LLM, such as claude-3-5-haiku, with a strict system prompt that demands JSON-only output.
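
The cleanup and validation around the model call can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline. It uses the standard library's HTMLParser instead of BeautifulSoup to stay dependency-free. The prompt wording, the output keys, and the complete(system, user) callable that stands in for the actual model request are all hypothetical.

```python
import json
from html.parser import HTMLParser
from typing import Callable, Optional

SKIPPED_TAGS = {"script", "style", "head", "noscript", "template"}
REQUIRED_KEYS = {"title", "price", "currency", "in_stock"}

# Hypothetical prompt; the wording and key names are assumptions.
SYSTEM_PROMPT = (
    "Extract the product from the page text. Reply with one JSON object "
    'with exactly these keys: "title", "price", "currency", "in_stock". '
    "No prose."
)


class _VisibleText(HTMLParser):
    """Collects text that sits outside script, style, and head blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str, max_chars: int = 8000) -> str:
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    # Truncate so the prompt stays within a predictable size.
    return "\n".join(parser.chunks)[:max_chars]


def llm_extract(html: str, complete: Callable[[str, str], str]) -> Optional[dict]:
    """`complete(system, user)` is a stand-in for the real model call."""
    raw = complete(SYSTEM_PROMPT, html_to_text(html))
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Reject replies that are not an object or that miss a required key.
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


page = (
    "<html><head><title>Shop</title><style>.a{}</style></head>"
    "<body><script>var x = 1;</script><h1>Blue Mug</h1><p>$12.00</p></body></html>"
)
page_text = html_to_text(page)
```

Validating the reply before returning it keeps a malformed or chatty model response from leaking into downstream code as if it were product data.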

While this pattern is highly resilient, it is also the slowest (around 1.5 s of latency versus roughly 50 ms for the deterministic adapters) and the most expensive, since every call consumes model tokens. It should be used exclusively as a fallback when the deterministic adapters fail.

At wmcp.sh, we combine these three patterns in a unified dynamic router. By checking Shopify first and JSON-LD second, we handle 90% of requests deterministically at the edge, keeping those responses under 100 ms and falling back to Haiku only when both deterministic adapters come back empty.
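
The routing idea can be sketched in a few lines of Python. This is an assumption about the general shape, not the wmcp.sh implementation: route and the stub adapters below are hypothetical, and each adapter is simply any callable that takes a URL and returns a dict or None.

```python
from typing import Callable, Optional

Adapter = Callable[[str], Optional[dict]]


def route(url: str, adapters: list) -> Optional[dict]:
    """Try (name, adapter) pairs in order; return the first non-empty result."""
    for name, adapter in adapters:
        try:
            result = adapter(url)
        except Exception:
            # One failing tier must not block the tiers after it.
            result = None
        if result:
            # Record which tier answered, useful for cost and hit-rate metrics.
            return {"source": name, "product": result}
    return None


# Stub adapters standing in for the three tiers.
calls = []


def shopify_stub(url: str) -> Optional[dict]:
    calls.append("shopify")
    return None  # not a Shopify store


def jsonld_stub(url: str) -> Optional[dict]:
    calls.append("jsonld")
    return {"name": "Blue Mug"}


def llm_stub(url: str) -> Optional[dict]:
    calls.append("llm")
    return {"name": "never reached in this example"}


result = route(
    "https://example.com/p/blue-mug",
    [("shopify", shopify_stub), ("jsonld", jsonld_stub), ("llm", llm_stub)],
)
```

Because the list is ordered cheapest first, the expensive LLM tier runs only when every tier before it returns nothing.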
