Building SEO Agent Skills, Day 60: 6 Things I Got Wrong

Six load-bearing edits to my Day 30 guide on building SEO agent skills, after twelve more audits and one near-disaster from a Cloudflare blind spot.

Author: Itay Malinski

Thirty days ago I published How to Build SEO Agent Skills That Actually Work. It was an honest snapshot of what I'd learned in 34 days of building. I stand behind every word of it.

I'd also write it differently today.

Twelve more audits. Several thousand more dollars in API spend, half of it wasted on the failure modes below. One audit shipped a "Grade C+" report on 9% of the actual site because Cloudflare blocked the rest and the crawler didn't notice. One regression made every ticket read identically across two completely different sites. One reviewer recommendation almost cost a client their footer architecture for the wrong reason.

Six things from the original article are now load-bearing edits. Three claims doubled down. This post is the diff.

If you haven't read the original, start there. The architecture in it is sound, and this update assumes you know it. This is not a rewrite. It's the next 30 days of bruises.

For busy SEOs, founders, and AI agent builders
  • Crawl once. Save the bytes. Check many. The original treated crawling and analysis as one job. They're two. Conflating them is what produces hallucinated findings.
  • Python checks first. LLM second. 91 deterministic checks now. The LLM never decides "is the canonical broken." Python does. The LLM decides "given that it is, what's the right severity for THIS site at THIS scale?"
  • The validation framework grew from 4 tests to 6. Page Value test rejects findings on low-value pages. Harm test asks "what's the worst that could happen if the client implements this?"
  • Confidence isn't correctness. A recent audit shipped a Grade C+ report on 9% of the site because Cloudflare's bot challenge blocked the rest. The crawler counted 651 of 718 fetches as "successful." Now there's a guardrail.
  • Don't silently drop bad tickets. Tag them. Audit trail beats clean dashboards.
  • Templates are for structure only. Pre-filling prose creates output that reads identically across two completely different sites. Specialists must author per finding.

1. The Article Missed the Most Important Architectural Shift: Crawl Once, Check Many

The original frames each agent as a unit that crawls AND analyzes. That's wrong. Crawling and analysis are two different jobs that should never share a process.

What we run today: Phase 1 is pure Python. It crawls the site, follows redirects, fetches headers, runs JS rendering on detected SPAs, and saves every byte to disk:

audits/<id>/raw/
├── html/<url-hash>.html              HTML at fetch time
├── rendered/<url-hash>.html          post-JS render (if SPA)
├── headers/<url-hash>.json           full HTTP headers
├── redirects/<url-hash>.json         redirect chain + final URL
├── timing/<url-hash>.json            TTFB, total ms
├── ua-variants/<url-hash>/           multi-UA fetches for cloaking
└── ...
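
A minimal sketch of what the Phase 1 "save every byte" step can look like. The helper names and hashing scheme here are illustrative, not our exact crawler code; the point is that every artifact is keyed by URL hash and written once:

import hashlib
import json
from pathlib import Path

def url_hash(url: str) -> str:
    # Stable key so every artifact for a given URL lands under the same name.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]

def save_artifacts(audit_dir: Path, url: str, html: bytes,
                   headers: dict, redirect_chain: list, timing: dict) -> None:
    # Persist everything a later check could possibly need. Write once, read many.
    key = url_hash(url)
    raw = audit_dir / "raw"
    for sub in ("html", "headers", "redirects", "timing"):
        (raw / sub).mkdir(parents=True, exist_ok=True)
    (raw / "html" / f"{key}.html").write_bytes(html)
    (raw / "headers" / f"{key}.json").write_text(json.dumps(headers, indent=2))
    (raw / "redirects" / f"{key}.json").write_text(json.dumps(redirect_chain, indent=2))
    (raw / "timing" / f"{key}.json").write_text(json.dumps(timing, indent=2))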

Phase 3 (the specialist agents) reads from disk. Zero live network calls during analysis. Re-running a check on the same saved artifacts produces byte-identical Python output. Re-running the LLM layer on the same raw data produces consistent verdicts.

The original article doesn't mention this because v3 of our crawler was still doing live fetches inside the agent's main loop. By v7 we'd separated them. The cost reduction was real: every analysis pass after the first is free, because the raw inputs are already saved. Every dashboard render of past audits, every replay of "what would the new version of CHK-074 say about last month's audit," every cross-audit diff. All of it is reading from disk, not re-fetching the web.

The deeper consequence: continuous monitoring becomes economically viable. When a daily-monitoring crawl can re-use yesterday's raw artifacts as a baseline and only fetch what changed, you can monitor 50 client sites for the cost of one full audit per day. Without crawl-once, you're paying full freight every morning.

If you're building agents that touch the web, save the bytes. Always.

2. Python Checks First. LLM Second. Not the Other Way Around.

The original says "give agents tools, not instructions." That's right but incomplete. The deeper version: the tools should produce 100% of the deterministic findings. The LLM's job is judgment on top of facts, not detection.

We have 91 Python checks now. Each one is a pure function:

def check(crawl_data: CrawlData, audit_dir: Path) -> Finding:
    # read saved artifacts, decide pass/fail, return structured Finding
    ...

The LLM never decides "is the canonical broken." Python does. By parsing the HTML, resolving the canonical, fetching the target's headers, checking for noindex. The LLM decides "given that the canonical is broken on these 26 pages, what's the right severity for THIS site at THIS scale, and how do I explain it to a non-technical CMO?"
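
To make the split concrete, here's roughly what a deterministic canonical check looks like when it reads only saved artifacts. This is an illustrative sketch, not our exact CHK implementation (the parser choice and field names are assumptions), but the shape is the point: bytes on disk in, structured fact out, no LLM anywhere:

import hashlib
import json
from pathlib import Path
from urllib.parse import urljoin

from bs4 import BeautifulSoup  # parser choice is illustrative

def _key(url: str) -> str:
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]

def check_canonical(audit_dir: Path, url: str) -> dict:
    # Decide pass/fail for one page's canonical using only saved artifacts.
    raw = audit_dir / "raw"
    html = (raw / "html" / f"{_key(url)}.html").read_text(errors="ignore")
    link = BeautifulSoup(html, "html.parser").find("link", rel="canonical")
    if link is None or not link.get("href"):
        return {"check": "canonical", "url": url, "status": "fail",
                "reason": "no <link rel=canonical> in saved HTML"}

    target = urljoin(url, link["href"])
    # The target's headers were fetched in Phase 1 and saved under its own hash.
    headers_path = raw / "headers" / f"{_key(target)}.json"
    if headers_path.exists():
        headers = {k.lower(): v for k, v in json.loads(headers_path.read_text()).items()}
        if "noindex" in str(headers.get("x-robots-tag", "")).lower():
            return {"check": "canonical", "url": url, "status": "fail",
                    "reason": f"canonical target {target} is served with noindex"}
    return {"check": "canonical", "url": url, "status": "pass", "canonical": target}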

That split is the architecture. Detection is deterministic. Judgment is LLM. Different jobs. Conflating them is what produces the hallucinated findings the original article warned about.

Practical consequence: we now have a --python-only mode that runs all 91 checks in 90 seconds for $0 and exits with a CI-gateable code. Same engine, no LLM. Three product surfaces unlock instantly:

  • Per-PR CI checks. The dev pushes a branch, the CI bot fetches the preview URL, runs Python-only mode, comments "this PR introduced 3 new SEO regressions: missing canonical on /auto-quote, accidental noindex on /products/v2, LCP cliff on /pricing (3.4s vs main 1.8s). Block merge?" in 90 seconds, for free.
  • Daily regression monitoring. Re-crawl, run Python-only, diff against yesterday. Alert on changes.
  • Free public lead-magnet tools. "Is your site invisible to ChatGPT, Claude, Perplexity?" → robots.txt parse + UA probe + chart. Costs $0.05 per submission. Public web form. No LLM in the path.
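
The CI-gateable part of --python-only mode is deliberately boring. A sketch, assuming each check returns a dict with status, severity, and reason fields (names illustrative):

import sys

def run_python_only(findings: list[dict]) -> int:
    # Summarize deterministic findings and return a CI-gateable exit code.
    regressions = [f for f in findings if f["status"] == "fail"]
    for f in regressions:
        print(f"[{f.get('severity', 'UNKNOWN')}] {f['check']}: {f['url']} - {f['reason']}")
    # 0 on a clean run, 1 if anything failed; a non-zero exit blocks the merge.
    return 1 if regressions else 0

if __name__ == "__main__":
    findings = []  # in practice: run all 91 checks against the saved crawl
    sys.exit(run_python_only(findings))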

None of those need agency-grade prose. All of them need deterministic checks. The original article's "give agents tools" advice points at this; it just doesn't go far enough.

Build the tools so they can run alone. The LLM layer is the upsell, not the engine.

3. The Validation Framework Grew From 4 Tests to 6, and One Is "What's the Worst That Could Happen?"

The original article describes 4 tests (Google engineer / developer / agency reputation / implementation). Two more shipped after we tripped over their absence:

  • The Page Value Test. Does this finding's affected page actually matter? We were shipping "Add H1 to privacy policy" tickets. Embarrassing. Now: every page in the crawl gets classified high / medium / low / skip, and any finding whose affected_urls are 100% low-value gets rejected at the gate. Privacy policy doesn't drive traffic. Don't ship a ticket about it. (The gate itself is a few lines of Python; see the sketch after this list.)

  • The Harm Test. Could implementing this fix break the site? "Block pagination via robots.txt" sounds clean until you realize it orphans every page Google reaches via pagination. "Add canonical sitewide pointing to homepage" de-indexes the rest of the site. "Add noindex to admin paths". Fine, until your blog is at /admin/blog because of a CMS quirk. Every recommendation needs a "what's the worst that could happen" check, with the risk named explicitly in the ticket if non-zero.
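
A minimal sketch of that Page Value gate, assuming the classifier has already labeled every crawled URL (field names and labels are illustrative):

def page_value_gate(finding: dict, page_values: dict[str, str]) -> dict:
    # Reject any finding whose affected URLs are all low-value or skipped pages.
    values = [page_values.get(url, "low") for url in finding["affected_urls"]]
    if all(v in ("low", "skip") for v in values):
        finding["verdict"] = "rejected"
        finding.setdefault("quality_flags", []).append("low_value_only")
    return finding

# "Add H1 to privacy policy" dies here: /privacy-policy gets classified low,
# so the finding never reaches the Lead.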

Test 1 (the Google Engineer Test) also got a sharp extension: scale matching. We caught ourselves applying crawl-budget arguments to 12-page Webflow sites. Google's crawl-budget guidance is explicit: "If your site doesn't have a large number of pages that change rapidly … you don't need to read this guide." The thresholds it cites are roughly 10K+ pages that change daily, or 1M+ pages that change moderately often. On a 50-page brochure site, "wastes crawl budget" framing is technically wrong even if it sounds professional.

Reframe in terms that actually apply at the site's actual scale: page-load latency, repeat-visitor caching, conversion friction, vendor security review, brand consistency, specific user-experience harm. The pattern is the same (broken canonical, slow page, missing schema) but the why it matters sentence has to match the site's reality.

Same applies to PageRank-sculpting recommendations. Cutts deprecated nofollow PageRank redistribution in 2009. The equity allocated to a nofollow link evaporates. It does NOT shift to dofollow links on the same page. We had agents recommending "nofollow these footer links to redirect equity to your products." Wrong since 2009. Now in the gotchas file.

The lesson: the framework being right doesn't mean the agent applying it is right. Every recommendation gets cross-checked against Google's own published myths/facts page before shipping.

4. Confidence Isn't Correctness, and the Sandbox Can't Catch It

Real failure from a recent audit. We pointed the engine at a Cloudflare-fronted site with --max-pages 2000. The audit completed. The Lead synthesized 22 tickets. The dashboard showed Grade C+ (79/100). Looked clean.

Buried in the executive summary, in prose, as the LAST sentence: "Critical limitation: Cloudflare's bot challenge blocked our crawler from 80% of sitemap URLs (only 139 of 652 pages crawled)."

Underneath that summary, in the raw crawl log: status codes 403: 651, 200: 64, 404: 3. 90% of all fetches were 403'd. The crawler treated 403 as a successful HTTP transaction and kept going. The "errors" counter said zero. Every one of the 22 tickets was based on the 9% sample that slipped through.

The original article tells you to test agents on sandboxes with planted bugs. That's right. But sandboxes don't simulate WAFs. Real sites do. The sandbox training loop catches the failures you can imagine; production catches the ones you can't.

The fix wasn't a smarter prompt. It was a Python guardrail in the crawler:

WAF_PROBE_WINDOW = 30        # first 30 fetches
WAF_BLOCK_THRESHOLD = 0.30   # 30% 403/429/503 → halt

If the threshold trips, the crawler aborts, writes a crawl_aborted.json marker, and exits non-zero. The Lead's prompt now reads: "If crawl_aborted.json exists, retry with playwright_crawler.py (real-browser fingerprint bypasses Cloudflare). Do NOT proceed to specialists with a bot-blocked sample."
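
A simplified sketch of that guardrail, using the constants above (only the constants and the abort behavior come from the real thing; the rest is illustrative):

import json
import sys
from pathlib import Path

WAF_PROBE_WINDOW = 30        # inspect the first 30 fetches
WAF_BLOCK_THRESHOLD = 0.30   # 30%+ 403/429/503 in that window → halt
BLOCK_CODES = {403, 429, 503}

def waf_guardrail(status_codes: list[int], audit_dir: Path) -> None:
    # Abort the crawl early if the probe window looks bot-blocked.
    window = status_codes[:WAF_PROBE_WINDOW]
    if len(window) < WAF_PROBE_WINDOW:
        return  # not enough data yet to judge
    blocked_ratio = sum(code in BLOCK_CODES for code in window) / len(window)
    if blocked_ratio >= WAF_BLOCK_THRESHOLD:
        marker = {"reason": "waf_block_suspected", "blocked_ratio": blocked_ratio,
                  "probe_window": WAF_PROBE_WINDOW, "sampled_codes": window}
        (audit_dir / "crawl_aborted.json").write_text(json.dumps(marker, indent=2))
        sys.exit(2)  # non-zero so the Lead (and CI) can see the crawl never finished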

WAF FAILURE TIMELINE. BEFORE / AFTER GUARDRAIL

  BEFORE
  Crawler sends 718 requests → 651 return 403 → errors = 0 → keep going
  → audit completes on 9% of site → ship Grade C+ report → reviewer notices
  in prose three days later

  AFTER
  Crawler sends 30 requests → 27 return 403 → ratio 0.90 ≥ threshold 0.30
  → halt → write crawl_aborted.json → exit code 2
  → Lead reads marker → retry with playwright_crawler → real browser
  fingerprint → bypass → full crawl → ship the right report

If the original article taught you to "build guardrails," the update is: build them for the failure modes you can't reproduce in the sandbox. Production WAFs, rate limits, geofencing, captcha walls, 403s that look like 200s to a counter that doesn't know better. Every one is invisible to a unit test.

The deeper lesson: a counter that says "errors: 0" is not the same as "no errors." It's the same as "no errors of the type the counter knows about." Audit your error definitions before you trust the dashboard.

5. Don't Drop Bad Tickets. Tag Them.

The original article is silent on what happens when validation rejects a finding. The update: silent deletion is the worst option.

I had the reviewer agent recommend "drop ticket RS-016" once. The proposed fix was technically wrong (recommending nofollow on internal footer links to redistribute equity, which doesn't work post-2009). The reviewer was right about the fix being wrong. But the underlying observation ("this site has 4 footer links, all to legal pages with no path to commercial content") was real. The architectural insight was salvageable. Dropping the ticket meant losing it.

Now we tag, never drop. Three actions on a flagged ticket:

  1. Rewrite in place. When the finding is real but framed wrong. Strip the bad recommendation, keep the observation, replace with a correct fix.
  2. Tag with quality_flags. ["bad_ticket", "scale_mismatch"], ["needs_rewrite", "low_value_padding"], etc. The ticket stays in tickets.json with a "Bad Ticket" badge in the dashboard. Reviewer sees both the ticket AND why it failed. They can manually accept, rewrite, or override. (See the sketch after this list.)
  3. Reject (verdict: rejected). Only when the underlying observation is itself false (the URL doesn't exist, the canonical isn't actually broken, the crawler misread). This is rare.
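
Mechanically, tagging is small. A sketch (quality_flags and tickets.json are from the real system; the helper name and the review_note field are illustrative):

def flag_ticket(ticket: dict, flags: list[str], note: str) -> dict:
    # Tag the ticket instead of deleting it; it stays in tickets.json with its flags.
    ticket.setdefault("quality_flags", []).extend(flags)
    ticket["review_note"] = note  # why it failed, for the human reviewer
    return ticket

# The RS-016 case: wrong fix, real observation.
# flag_ticket(rs_016, ["bad_ticket", "needs_rewrite"],
#             "nofollow sculpting hasn't worked since 2009; keep the footer-architecture observation")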

Three benefits over silent deletion:

  • Audit trail. Future audits debug against past failures. "Why did the agent miss this?" is answerable. With silent deletion, you have no signal.
  • Reviewer learning. "This is what bad looks like" is a teaching moment. Stripped tickets are invisible teaching moments.
  • Salvage value. Half-correct findings rewrite into right ones. Stripped findings rewrite into nothing.

If your reviewer's only output is approve / drop, it's not a reviewer. It's a binary classifier. Real review is approve / rewrite / tag-as-bad / reject, and you need all four.

6. Site-Specific Authorship Is Not Optional. Templates Are for Structure Only.

The original section on templates is right but incomplete. We had a regression where Python pre-filled the three human-readable fields (plain_english, business_impact, who_fixes_it) from catalog templates. Output structure was perfect. Every ticket had three paragraphs, 60-100 words each, exactly as templated.

Output content was generic.

"Add H1 tags to your pages. H1 is the primary topic signal for the page. Missing H1 weakens topical relevance, especially for long-tail queries. A web developer can fix this." Reads identically on a Next.js commercial site and a custom-React SaaS marketing site. Same prose. Different sites. Wrong.

The rule: templates inform structure. Specialists author content per finding, reading the actual HTML, headers, and site profile.

A ticket on the Next.js commercial site might say: "Your case-study template at /case-studies/<slug>/ is rendering H1 from a visual-CMS component that's not populated. The editor placed the title in a 'subhead' slot instead. 13 case studies are affected (full list in evidence). The fix is in the CMS component config, not the codebase."

A ticket on the custom-React SaaS site might say: "Your homepage hero block uses an h1-styled <div> instead of an <h1> element. See components/Hero.tsx:42. Screen readers and Google's parser both miss it. One-line change: <div className="text-5xl"> becomes <h1 className="text-5xl">."

Same check (CHK-041, "H1 present on indexable pages"). Same severity (HIGH). Completely different prose, because the underlying CMS, template, and failure point are different. A specialist who has read this site's actual rendered HTML cannot write the same ticket twice across different sites. That's the test.

If your tickets read the same across two sites, you have a templating problem disguised as a quality problem. The fix is to strip the pre-fill and force per-finding authorship.
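
For comparison, here's what "structure only" can look like in code. A sketch, reusing the three prose field names from above; the skeleton format and the enforcement helper are illustrative, not our exact template files:

PROSE_FIELDS = ("plain_english", "business_impact", "who_fixes_it")

# Structural skeleton: shape and length limits only. No pre-written sentences.
TICKET_SKELETON = {field: {"max_words": 100, "value": None} for field in PROSE_FIELDS}

def assert_authored(ticket: dict) -> None:
    # Refuse to ship a ticket whose prose fields were never written for this finding.
    for field in PROSE_FIELDS:
        if not ticket.get(field):
            raise ValueError(f"{field} was never authored for this finding")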

Pro tip

Run the same audit twice on two completely different sites. Diff the prose fields side-by-side. If plain_english reads identically in both, your templates are leaking. Either narrow the templates to structural skeletons (placeholders only) or remove them entirely and force the specialist to author from scratch each time. The former is faster; the latter produces better output.

What's the Same, Doubled Down

Three claims from the original got stronger with more data:

  • The reviewer is still the biggest quality improvement. Approval rate is now 99.4% across 14 audits and ~340 tickets. Every percent of that is review, not generation.
  • Stable templates still beat smart prompts. When output drifts, it's almost always a template gap, not a prompt gap. (See Edit #6 for the corollary: stable structural templates that still let prose breathe.)
  • The first version still never works. Of the last 8 features I shipped, 6 took 3+ rewrites. Two took 5+. The crawler is now on v9. The page-value classifier is on v4. The validation framework is on v3. Nothing arrives finished.

The Six Edits in One Diagram

ORIGINAL ARCHITECTURE              DAY 60 ARCHITECTURE

  1. agent crawls + analyzes  →   1. Python crawls, saves raw artifacts;
                                     specialists read from disk only

  2. agents have tools         →   2. tools produce 100% of detection;
                                     LLM does only judgment + prose

  3. 4 validation tests        →   3. 6 tests; +Page Value +Harm;
                                     +scale-match on Test 1

  4. sandbox training          →   4. + production guardrails (WAF,
                                     rate-limit, captcha) the sandbox
                                     can't simulate

  5. reviewer approves/drops   →   5. approve / rewrite / tag-bad / reject;
                                     never silently strip

  6. templates everywhere      →   6. templates for STRUCTURE only;
                                     prose authored per finding

Where to Go From Here

If you read the original article and started building, the day-60 version of the advice is:

  1. Save raw artifacts. Every byte. Forever. (Storage is cheap. Re-fetching is expensive.)
  2. Push detection into Python. The LLM is for judgment, not detection.
  3. Add the Page Value and Harm tests to your validation framework. Add scale-matching to whatever you call your "is this real?" test.
  4. Build guardrails for production failure modes the sandbox can't reproduce.
  5. Tag bad tickets with structured flags. Never silently delete.
  6. Templates inform structure. Prose is per-finding.

The architecture from the original is sound. These six are the load-bearing edits.

Day 90 will probably surface six more.


This is a follow-up to How to Build SEO Agent Skills That Actually Work, published one month ago. The methodology in that article is the foundation; this one is the diff. If you're new to this series, start with What is Agentic SEO? for the why, then the original guide for the how, then this for the corrections.


Want to see this running on your brand?

Book a demo and see how our systems turn into compounding organic growth.
