Scoring Methodology
The Serge score measures one thing: how easy it is for an AI agent to find a product on your site and complete the next meaningful step. A single number from 0 to 100. The scanner is deterministic; live agent replay (Investigate Mode) is the higher-fidelity product surface when you need to inspect a specific failure.
This page explains what the scan actually does, how the score is computed, and how the methodology evolves over time. We publish this openly because trust matters more than pretending the model is static.
The scanner is live today and the scoring model is versioned in the open. The product is still early, and the check library will keep tightening as we validate against more real-world replay scenarios.
The core scanner engine already runs on every domain: crawl, static analysis, deterministic checks, and a calibrated 0–100 output. What continues to evolve is the exact weighting, the benchmark distribution, and how aggressively structural failures translate into score movement.
What follows is the current version. We expect to revise it as replay evidence accumulates, and we will keep documenting those changes here.
What we measure
One question: how easily can a shopper working through an AI agent find a product on this site and place it in a cart?
The scenario: a shopper asks Claude, ChatGPT, or Operator to buy something on your site. The agent fires up its browser (via a Model Context Protocol server, a built-in computer-use tool, or a headless automation layer), navigates to your site, tries to find the right product, tries to add it to the cart, and either succeeds or gives up. If it gives up, the user goes to a competitor — and you never see the lost sale in GA4.
Serge measures, deterministically and without running a live agent at scan time, whether the structural conditions for that journey to succeed are in place. We look at things an agent would need: machine-readable product data, navigable DOM, accessible interactive elements, unblocked crawler access, consistent URL structure, clear inventory and pricing signals, and a handful of other signals that correlate with agent traversal success.
How we measure it
Deterministic. The public scanner does not call an LLM at scan time. Running a real agent against every domain submitted for scanning would make the scanner slow and unaffordable. Instead, Serge crawls a representative sample of the site and inspects structural properties that predict whether an agent would be able to complete the task.
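To make "representative sample" concrete, here is a minimal sketch of one way a sitemap-driven sampler could work. The URL heuristics, sample size, and function names are illustrative assumptions, not the production crawler:

```typescript
// Minimal sketch of representative-page sampling (illustrative only).
// Assumes a conventional sitemap.xml; a real crawler also handles sitemap
// indexes, pagination, and non-standard URL schemes.

async function sampleProductUrls(origin: string, limit = 25): Promise<string[]> {
  const res = await fetch(new URL("/sitemap.xml", origin));
  if (!res.ok) return []; // a missing sitemap is itself a finding, not a crash

  const xml = await res.text();
  // Pull every <loc> entry out of the sitemap.
  const urls = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);

  // Heuristic: URLs that look like product detail pages.
  const productLike = urls.filter((u) => /\/(product|products|p|item)\//i.test(u));

  // Spread the sample across the list rather than taking the first N.
  const step = Math.max(1, Math.floor(productLike.length / limit));
  return productLike.filter((_, i) => i % step === 0).slice(0, limit);
}
```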
Live agent runs happen in the paid product. Investigate Mode (our enterprise deliverable) runs a real AI agent — Claude Desktop via MCP, and later Operator and GPT Agent — against a specific URL to capture the actual session, reasoning, DOM interactions, and failure points. That's a separate surface, gated behind the paid tier. The public scanner is the deterministic cousin that gives you a fast, cheap, shareable answer to the same question.
The specific checks and weights continue to evolve. The current research direction includes:
- Can the agent reach the site at all? Bot protection posture, robots.txt, WAF behavior.
- Can the agent find products? Sitemap, navigation semantics, product URL discoverability, internal link structure.
- Can the agent parse product data? Schema.org Product / Offer / Availability, structured pricing, variant metadata.
- Can the agent interact with the page? Real buttons vs. `<div onClick>`, accessible names, ARIA roles, keyboard semantics (see the sketch after this list).
- Can the agent add to cart? Cart state exposure, variant selection mechanics, inventory visibility.
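To show the flavour of these checks, here is a hedged sketch of two of them. It assumes a parsed DOM `Document` (for example from jsdom or a headless browser); the selectors, thresholds, and function names are illustrative, not the production rules:

```typescript
// Illustrative sketches of two structural checks. Assumes a parsed Document
// (e.g. via jsdom); the production rules are broader and weighted differently.

type CheckState = "pass" | "partial" | "fail";

// Check: is there machine-readable Product data (JSON-LD) on the page?
function checkProductJsonLd(doc: Document): CheckState {
  const scripts = doc.querySelectorAll('script[type="application/ld+json"]');
  for (const script of scripts) {
    try {
      const data = JSON.parse(script.textContent ?? "");
      const nodes = Array.isArray(data) ? data : [data];
      const product = nodes.find((n) => n?.["@type"] === "Product");
      if (product) {
        // Product exists; partial if no Offer exposes a price.
        const offers = Array.isArray(product.offers) ? product.offers : [product.offers];
        const hasPrice = offers.some((o: any) => o?.price != null);
        return hasPrice ? "pass" : "partial";
      }
    } catch {
      // Malformed JSON-LD falls through to fail.
    }
  }
  return "fail";
}

// Check: are interactive elements real buttons/links, or click-handler divs?
// Inline handlers only; listener-attached divs need deeper static analysis.
function checkInteractiveSemantics(doc: Document): CheckState {
  const fakeButtons = doc.querySelectorAll("div[onclick], span[onclick]").length;
  const realButtons = doc.querySelectorAll('button, a[href], [role="button"]').length;
  if (fakeButtons === 0) return "pass";
  return realButtons > fakeButtons ? "partial" : "fail";
}
```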
We will keep publishing revisions here as the check list, weights, and benchmark calibration tighten.
How the score is calculated
Check states
Each individual check returns one of four states:
| State | Description |
|---|---|
| Pass | Full points. The agent would be able to traverse this successfully. |
| Partial | Half points. The agent can get through but something is incomplete or fragile. |
| Fail | Zero points. The agent would likely fail here. |
| Blocked | Excluded from scoring. Bot protection prevented the check from being evaluated. |
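In code, the state-to-points mapping and the exclusion of blocked checks from the denominator look roughly like this. A minimal sketch: the check shape, point values, and weights are placeholders, not the production model:

```typescript
// Minimal sketch of raw score aggregation. Check names and weights are
// illustrative; the production model weights checks individually.

type CheckState = "pass" | "partial" | "fail" | "blocked";

interface CheckResult {
  id: string;
  state: CheckState;
  weight: number; // relative importance of the check
}

const POINTS: Record<Exclude<CheckState, "blocked">, number> = {
  pass: 1,      // full points
  partial: 0.5, // half points
  fail: 0,      // zero points
};

// Blocked checks are excluded from both numerator and denominator:
// we only score what we can actually see.
function rawScore(results: CheckResult[]): number {
  const scorable = results.filter((r) => r.state !== "blocked");
  const maxPoints = scorable.reduce((sum, r) => sum + r.weight, 0);
  if (maxPoints === 0) return 0; // everything blocked: no meaningful score
  const earned = scorable.reduce(
    (sum, r) => sum + r.weight * POINTS[r.state as Exclude<CheckState, "blocked">],
    0
  );
  return (earned / maxPoints) * 100; // raw 0–100, before calibration
}
```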
Calibration curve
Raw check scores are passed through a piecewise calibration curve that maps the theoretical 0–100 range to a practical distribution. The curve is calibrated against benchmark data from scanned domains and is designed so that:
- Early wins are rewarding. Fixing a single blocking issue on a site scoring 20 produces a visible score increase — so early adopters see progress fast.
- The middle range has clear separation. A score of 50 is meaningfully different from a score of 65.
- The top is hard to reach. Moving from 90 to 95 requires substantially more engineering investment than moving from 40 to 55.
This follows the same principle used by Google Lighthouse, which calibrates scores against real-world distributions so they reflect practical benchmarks rather than theoretical maximums.
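One way to implement such a curve is piecewise-linear interpolation over calibration anchor points. The anchors below are placeholders chosen only to show the mechanics (steep at the low end, flatter toward the top); they are not the published calibration data:

```typescript
// Sketch of a piecewise-linear calibration curve. Anchor values are
// illustrative placeholders, not the real calibration.

const ANCHORS: Array<[raw: number, calibrated: number]> = [
  [0, 0],
  [10, 25],   // steep at the low end: early fixes move the score quickly
  [40, 60],   // clear separation through the middle
  [70, 82],   // flatter toward the top
  [100, 100],
];

function calibrate(raw: number): number {
  const x = Math.min(100, Math.max(0, raw));
  for (let i = 1; i < ANCHORS.length; i++) {
    const [x0, y0] = ANCHORS[i - 1];
    const [x1, y1] = ANCHORS[i];
    if (x <= x1) {
      // Linear interpolation between the two surrounding anchors.
      return Math.round(y0 + ((x - x0) / (x1 - x0)) * (y1 - y0));
    }
  }
  return 100;
}
```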
Score bands
| Score | Label | Meaning |
|---|---|---|
| 0–24 | Agents cannot complete a purchase | Agents may not be able to find or buy your products |
| 25–44 | Agents struggle to buy | Agents can reach the site but often fail before adding to cart |
| 45–64 | Agents will hit gaps | Agents can browse but may fail at variant selection or cart |
| 65–84 | Mostly works for agents | Agents can find and add products with minor friction |
| 85–100 | Agents can buy here | Agents can reliably find products and add them to the cart |
The band labels and meanings are being updated as part of the methodology rebuild to describe outcomes in product-findability terms (e.g. “agents can find most products but may fail at variant selection”) rather than the earlier generic readiness terms.
How we handle edge cases
Bot protection and browser fallback
If the Serge crawler is blocked by a WAF or challenge page, we mark affected checks as “blocked” rather than “fail.” Blocked checks are excluded from the score denominator — we only score what we can actually see.
When SergeBot's default user-agent is silently blocked, we re-fetch key resources using a standard browser user-agent. This lets us deliver real results for sites behind aggressive CDN bot protection. The scan report discloses when this fallback was used.
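Conceptually, the fallback looks something like the sketch below. The user-agent strings, the placeholder URL in the bot UA, and the blocked-response detection are illustrative assumptions; detecting a "silent" block in practice also means inspecting response bodies for challenge pages, which this sketch omits:

```typescript
// Sketch of the user-agent fallback. UA strings, status handling, and the
// disclosure flag are illustrative, not the production behaviour.

const SERGE_UA = "SergeBot/1.0 (+https://example.com/bot)"; // placeholder UA
const BROWSER_UA =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/124.0 Safari/537.36";

interface FetchOutcome {
  html: string | null;
  usedBrowserFallback: boolean; // disclosed in the scan report
}

async function fetchWithFallback(url: string): Promise<FetchOutcome> {
  const first = await fetch(url, { headers: { "User-Agent": SERGE_UA } });
  if (first.ok) {
    return { html: await first.text(), usedBrowserFallback: false };
  }

  // 403/429 responses suggest the bot UA was blocked; retry as a browser.
  const second = await fetch(url, { headers: { "User-Agent": BROWSER_UA } });
  if (second.ok) {
    return { html: await second.text(), usedBrowserFallback: true };
  }

  // Still blocked: the affected checks are marked "blocked", not "fail".
  return { html: null, usedBrowserFallback: true };
}
```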
Bot protection blocking is itself a finding. If our deterministic crawler can't reach your site, real AI agents using headless browsers may face similar restrictions. This is reported as a finding with specific recommendations for allowing legitimate agent access without opening the door to abusive scrapers.
Sites without e-commerce
Serge's scoring is designed for e-commerce sites — sites that exist to let a customer find a product and buy it. If the scanner finds no product pages, no cart mechanic, and no pricing, the score becomes a degenerate case and we say so explicitly rather than inventing a meaningless number. Non-commerce sites are not our target ICP in the current cohort.
Score stability during the rebuild
Because the methodology is actively being rebuilt, scores generated during the transition period may shift as checks are added, removed, or reweighted. Once the new methodology is locked, we commit to keeping it stable so that a score of 65 measured in month A means the same thing as a score of 65 measured in month B. Progress must be measured against a stable ruler.
Prior work and influences
The Serge scoring methodology draws on established practices from several domains:
Scoring curve design
- Google Lighthouse — Uses log-normal curves calibrated against real-world web performance data (HTTP Archive). Serge follows the same principle: scores are calibrated against the real distribution of scanned domains, not theoretical maximums.
- SSL Labs — Moved from 0–100 numeric scores to A–F letter grades for communication clarity. Also pioneered hard caps: a single critical failure overrides the total score.
- SecurityScorecard — Uses a logarithmic scale and demonstrated that external, non-intrusive scanning can produce scores that correlate with real-world outcomes.
Agent protocols and standards
Where relevant, the scoring references emerging standards and protocols that help agents interact with web platforms. These are input signals to the scoring, not frameworks we invented:
- WCAG 2.2 — Accessibility standards for assistive technology. Heavy overlap with what agents need to interact with a page.
- Schema.org Product — Structured product data that agents parse to understand inventory, price, availability, and variants.
- Robots Exclusion Protocol (RFC 9309) — How sites declare crawler permissions.
- Model Context Protocol — Anthropic's open standard for agent-to-tool integration. Real AI agents use browser MCP servers (Playwright MCP, Browser MCP, PageBolt) to drive browsers today.
Why the scanner is free
The Serge scanner follows a model proven by:
- HubSpot Website Grader (2007) — Free website scoring tool that graded 4 million websites and became HubSpot's primary lead generation mechanism, generating 40,000+ organic backlinks.
- SecurityScorecard (2014) — Free security score widget used by 880,000+ companies and scaled to $140M ARR as a category-defining lead generator.
Both demonstrated that a free, shareable scoring tool can define a category and build authority through transparency and consistency.
Limitations
We believe in being transparent about what the score can and cannot tell you.
What the score measures: Whether the structural conditions for an AI agent to find a product and add it to a cart are in place on your site, based on externally observable signals.
What the score does not measure:
- The actual volume of agent-driven sessions you receive (that's the passive tracking snippet's job)
- Whether a specific agent on a specific day actually completed a specific task (that's Investigate Mode's job)
- Whether your site appears in ChatGPT's answers or Google AI Overviews (that's the GEO layer, covered by other tools)
- Checkout completion beyond the add-to-cart step (future expansion)
- Your backend API reliability or latency (we can't measure these externally)
- The quality of your merchandising or product descriptions from a marketing standpoint
The score is not a guarantee. A high score means the structural conditions for agent success are in place. It doesn't guarantee every agent on every task will succeed. For specific, observable task-completion data, use Investigate Mode (run real agents against your site) or the passive tracking snippet (capture real agent sessions as they happen).
The score is a point-in-time measurement. The agent ecosystem is evolving rapidly. We commit to keeping the framework current and, once the rebuild is complete, to maintaining score stability so that progress can be measured against a stable ruler.
Feedback
We welcome feedback on the methodology. If you believe a check is producing inaccurate results, if we're missing an important signal, or if you have research that should inform the framework, please contact us.
The goal is to get this right — for the e-commerce teams being scored and for the agents trying to shop on their sites.