How to Build SEO Agent Skills That Actually Work
Skills are folders, not files. Here's how we built a crawler in 5 versions, encoded 10 failure modes, and hit 99.6% approval on 270 recommendations.
I've built 10 SEO agent skills in 34 days. Six of them worked on the first try. The other four taught me everything I'm about to tell you.
Every LinkedIn post about AI SEO skills is missing the same thing: the folder structure. They show you a prompt. Maybe a screenshot of output. Never the architecture that makes it reliable.
This is the practical guide. You'll finish this and know how to build an agent skill from scratch, test it, fix it, and ship it with confidence.
- Every "AI SEO skill" you've seen on LinkedIn is a single markdown file. And that's a recipe for hallucinated garbage.
- Skills are folders, not files. Scripts, references, memory, templates. Not a prompt.
- The crawler took 5 complete rewrites in a single day. Version 5 finally worked.
- 10 failure modes that will burn you. Every one cost me hours. Now they're encoded so they can't happen again.
- The reviewer agent was the single biggest quality improvement. It just checks everyone else's work.
- We built test websites with planted bugs to train agents before they touch real sites.
Why Most AI SEO Skills Fail
Here's what a typical "AI SEO skill" looks like on LinkedIn:
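Something like this (a representative example, not a quote from any particular post):

```
You are an expert SEO consultant. Analyze {website} and produce a
complete SEO audit: technical issues, on-page problems, content gaps.
Format it as a professional report with severity ratings.
```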
That's it. One prompt. Maybe some formatting instructions. The person posts a screenshot of the output, gets 500 likes, and moves on.
The output looks professional. It reads well. It's also 40% wrong.
I know because I tried this exact approach. Early in the build, I pointed an agent at a website and said "find SEO issues." It came back with 20 findings. 8 didn't exist.
The agent had never visited some of the URLs it was reporting on.
Three problems kill single-prompt skills:
No tools. The agent has no way to actually check the website. It's working from training data and guessing.
When you ask "does this site have canonical tags?" the agent imagines what the site probably looks like instead of fetching the HTML and reading it.
No verification. Nobody checks if the output is true. The agent says "missing meta descriptions on 15 pages."
Which 15? Are those pages even indexed? Are they noindexed on purpose? No one asks. No one verifies.
No memory. Run the same skill twice, you get different output. Different structure. Different severity labels. Sometimes different findings entirely.
There's no consistency because there's no template, no schema, no record of past runs.
If your "skill" is a prompt in a single file, you don't have a skill. You have a coin flip.
Skills Are Folders, Not Files
Every agent in our system has a workspace. Think of it like a new hire's desk, stocked with everything they need.
Here's what the workspace looks like for the agent that crawls websites and maps their architecture:
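A representative layout, assembled from the components described below (the tree and the final folder are illustrative, not the exact production names):

```
crawler/
├── AGENTS.md          # the instruction manual
├── scripts/           # crawl_site.js, parse_sitemap.sh, check_status.sh, extract_links.sh
├── references/        # criteria.md, gotchas.md
├── memory/            # runs.log
├── templates/         # output.md
└── output/            # where finished reports land (assumed)
```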
Six components. One prompt file would cover maybe 20% of this.
AGENTS.md is the instruction manual. Thousands of words of methodology.
Not "crawl the site." More like: "Start with the sitemap. If no sitemap exists, check /sitemap.xml, /sitemap_index.xml, and robots.txt for sitemap references. Respect crawl-delay. Use a browser user-agent string, never a bare request. If you get 403s, note the pattern and try with different headers before reporting it as a block."
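The sitemap-discovery rule above reduces to a few lines. A minimal sketch, assuming this shape for the inputs (the function name is illustrative, not the actual script's API):

```javascript
// Given a site origin and its robots.txt body, return the sitemap
// URLs to try, in the order the methodology describes.
function discoverSitemaps(origin, robotsTxt) {
  // Prefer explicit "Sitemap:" declarations in robots.txt.
  const declared = robotsTxt
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => /^sitemap:/i.test(line))
    .map((line) => line.slice('sitemap:'.length).trim());
  if (declared.length > 0) return declared;

  // Otherwise fall back to the conventional locations.
  return [`${origin}/sitemap.xml`, `${origin}/sitemap_index.xml`];
}
```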
scripts/ are the agent's tools. The agent calls node crawl_site.js --url example.com. It doesn't write curl commands from scratch every time.
That's the difference between giving someone a toolbox and telling them to forge their own wrench.
references/ are the judgment calls. Criteria for what counts as an issue. Known false positives to watch for. Edge cases that took me 20 years to learn.
The agent reads these when it encounters something ambiguous.
memory/ is institutional knowledge. A log of past runs. What it found last time. How long the crawl took. What broke.
The next execution benefits from the last.
templates/ enforce consistency. "Use this exact structure. These exact fields. This severity scale."
Not "write a report." Output templates are the difference between getting the same quality on run 14 as you got on run 1.
Walk-Through: Building the Crawler From Scratch
Let me show you exactly how I built one skill. The crawler. It maps a site's architecture, discovers every page, and reports what it finds.
Version 1: The Naive Approach
Instructions: "Crawl this website and list all pages."
The agent wrote its own HTTP requests. Used bare curl. Got blocked by the first site it touched.
Every modern CDN blocks requests without a browser user-agent string. Dead on arrival.
Version 2: Added a Script
Built crawl_site.js using Playwright. Headless browser. Real user-agent. The agent calls the script instead of writing its own requests.
Worked on small sites. Crashed on anything over 200 pages. No rate limiting. No resume capability.
Hammered servers until they blocked us.
Version 3: Rate Limiting and Resume
Added throttling. Two requests per second default. One every two seconds for CDN-protected sites. The agent reads robots.txt and adjusts speed without asking permission.
Added checkpoint files so a crashed crawl can resume from where it stopped.
Worked on most sites. Failed on sites that require JavaScript rendering.
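The throttle-and-resume logic from version 3 can be sketched like this. The numbers mirror the defaults described above; the 500 ms floor, the function names, and the checkpoint format are illustrative assumptions:

```javascript
// Throttling rule: default two requests per second, one every two
// seconds behind a CDN, and an explicit robots.txt crawl-delay wins.
function crawlDelayMs(robotsTxt, cdnProtected = false) {
  const match = robotsTxt.match(/^crawl-delay:\s*(\d+(?:\.\d+)?)/im);
  if (match) return Math.max(500, parseFloat(match[1]) * 1000);
  return cdnProtected ? 2000 : 500; // 1 req / 2 s vs. 2 req / s
}

// Checkpointing: serialize visited URLs and the pending queue so a
// crashed crawl resumes where it stopped instead of starting over.
function checkpoint(state) {
  return JSON.stringify({ visited: [...state.visited], queue: state.queue });
}
function resume(json) {
  const { visited, queue } = JSON.parse(json);
  return { visited: new Set(visited), queue };
}
```

Writing the checkpoint to disk every N pages is what makes resume cheap: the worst case loses one batch, not the whole crawl.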
Version 4: JS Rendering
Added a browser rendering mode. The agent detects if a site is a single-page app (React, Next.js, Angular) and switches to full browser rendering automatically.
Compares rendered HTML against source HTML. Found real issues this way. Sites where the source HTML was an empty shell but the rendered page was full of content.
Google might or might not render it properly. Now we check both.
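The empty-shell check can be sketched as a text-length comparison between source and rendered HTML. The 20% threshold here is an illustrative assumption, not the production value:

```javascript
// Strip scripts and markup from the source HTML, then compare the
// leftover visible text against the rendered page's text. If the
// source text is a small fraction of the rendered text, the raw
// HTML is an empty shell that depends on JS rendering.
function looksLikeEmptyShell(sourceHtml, renderedText) {
  const sourceText = sourceHtml
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  return renderedText.length > 0 && sourceText.length < renderedText.length * 0.2;
}
```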
Worked on everything. But the output was inconsistent between runs.
Version 5: Templates and Memory
Added templates/output.md with exact fields: URL count, sitemap coverage, blocked paths, response code distribution, render mode used, issues found. Every run produces the same structure.
Added memory/runs.log. The agent appends a summary after every execution. Next time it runs, it reads the log and can compare results.
"Last crawl found 485 pages. This crawl found 487. Two new pages added."
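The comparison step is a few lines once the log has structure. A sketch, assuming one JSON object per log line (the line format and function name are illustrative):

```javascript
// Compare the current crawl's page count against the last logged run
// and produce the kind of delta summary quoted above.
function compareRuns(previousLine, currentPages) {
  if (!previousLine) return `First recorded crawl: ${currentPages} pages.`;
  const prev = JSON.parse(previousLine);
  const delta = currentPages - prev.pages;
  if (delta === 0) return `Page count unchanged at ${currentPages}.`;
  const direction = delta > 0 ? 'added' : 'removed';
  const n = Math.abs(delta);
  return `Last crawl found ${prev.pages} pages. This crawl found ${currentPages}. ` +
    `${n} ${n === 1 ? 'page' : 'pages'} ${direction}.`;
}
```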
Version 5 is what we run today. 5 iterations. One day of building.
The pattern is always the same: start small, hit a wall, fix the wall, hit the next wall.
Five versions in one day doesn't mean five failures. It means five lessons that are now permanently encoded.
I've rebuilt delivery systems 4 times over 20 years. The process doesn't change. You start with what's elegant, then reality hits, and you end up with what works.
Pro tip: Don't try to build the perfect skill on the first attempt. Build the simplest thing that could possibly work. Run it on real data. Watch it fail. The failures tell you exactly what to add next. Every version of our crawler was a direct response to a specific failure. Not a feature we imagined. A problem we hit.
Give Agents Tools, Not Instructions
This is the most important architectural decision I made.
When you write "use curl to fetch the sitemap" in your instructions, the agent generates a curl command from scratch every time. Sometimes it adds the right headers. Sometimes it doesn't.
Sometimes it follows redirects. Sometimes it forgets.
When you give the agent a script called parse_sitemap.sh, it calls the script. The script always has the right headers, always follows redirects, always handles edge cases.
The agent's judgment goes into WHEN to call the tool and WHAT to do with the results. The tool handles HOW.
Our agents have tools for everything:
- crawl_site.js: Playwright-based crawler with rate limiting, resume, and rendering
- parse_sitemap.sh: Fetches and parses XML sitemaps, counts URLs, detects nested indexes
- check_status.sh: Tests HTTP response codes with proper user-agent strings
- extract_links.sh: Pulls internal and external links from page HTML
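What a tool like check_status.sh encapsulates, sketched in Node (requires Node 18+ for the global fetch; header values and names are illustrative, not the actual script):

```javascript
// The headers the tool always sends, so no request ever goes out
// bare. A CDN that blocks default HTTP-client user-agents will
// usually accept these.
function browserHeaders() {
  return {
    'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    Accept: 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
  };
}

// Check a URL's status, following redirects and reporting where the
// chain ended up.
async function checkStatus(url) {
  const res = await fetch(url, { headers: browserHeaders(), redirect: 'follow' });
  return { url, status: res.status, finalUrl: res.url };
}
```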
The agent decides which tools to use and what parameters to set. The crawler chooses its own crawl speed based on what it encounters.
Two requests per second for small sites. Throttled for CDN-protected sites. It reads robots.txt and adjusts. It has judgment within guardrails.
Think of it this way: you give a new hire a CRM, not instructions on how to build a database. The tools are the CRM. The instructions are the process for using them.
Progressive Disclosure: Don't Dump Everything
Here's a mistake I made early: I put everything in AGENTS.md. Every rule. Every edge case. Every gotcha. Thousands of words.
The agent got confused. Too much context. It started prioritizing obscure edge cases over common tasks.
It would spend time checking for hash routing issues on a WordPress blog.
The fix: progressive disclosure.
Core rules go in AGENTS.md. The 80% case. What the agent needs to know for every single run.
Edge cases go in references/gotchas.md. The agent reads this file when it encounters something ambiguous. Not before every task. Only when it needs it.
Criteria for severity scoring go in references/criteria.md. The agent checks this when it finds an issue and needs to decide how bad it is. Not upfront.
Same way a skilled employee operates. They know the core process by heart. They check the handbook when something weird comes up. They don't re-read the entire handbook before answering every email.
Pro tip: If your agent output is inconsistent but your instructions are detailed, the problem is usually too MUCH context, not too little. Agents, like new hires, perform better with clear priorities and a reference shelf than with a 50-page manual they have to digest before every task.
The 10 Gotchas: Failure Modes That Will Burn You
Every one of these cost me hours. They're now encoded in our agents' references/gotchas.md files so they can't happen again.
- Agents hallucinate data they can't verify. I asked the research agent to find law firms and count their attorneys. It made every number up. It had never visited any of their websites. Only ask agents to produce data they can actually fetch and verify. Separate what they know (training data) from what they can prove (fetched data).
- Knowledge doesn't transfer between agents. A fix I figured out on day 1 (use a browser user-agent string to avoid CDN blocks) had to be re-taught to every new agent. Day 34, a brand new agent hit the exact same problem. Agents don't share memories. Encode shared lessons in a common gotchas file that multiple agents can reference.
- Output format drifts between runs. Same prompt, different field names. "note" vs "assessment." "lead_score" vs "qualification_rating." Run it twice, get two different schemas. The fix: strict output templates with exact field names. Not "write a report." "Use this exact template with these exact fields."
- Agents confidently report issues that don't exist. The first 3 audits had false positives delivered with total confidence. The fix wasn't a better prompt. It was a better boss. A dedicated reviewer agent whose only job is to verify everyone else's work. Same reason code review exists for human developers.
- Bare HTTP requests get blocked everywhere. Every modern CDN blocks requests without a browser user-agent string. The crawler learned this on audit #2 when an entire site returned 403s. One-line fix. Now it's in the gotchas file. Every new agent reads it on day one.
- Don't guess URL paths. Agents love to construct URLs they think should exist. /about-us, /blog, /contact. Half the time those URLs 404. Rule: fetch the homepage first, read the navigation, follow real links. Never guess.
- "Done" vs "In Review" matters. Agents marked tasks as "done" when posting their findings. Wrong. "Done" means approved. "In review" means waiting for human verification. Small distinction, huge impact on workflow clarity when you have 10 agents posting work simultaneously.
- Categories must be hyper-specific. "Fintech" is useless for prospecting. "PI law firms in Houston" works. Every company in a category should directly compete with every other company. First attempt at sales categories was "Personal Finance & Fintech." A crypto exchange doesn't compete with a budgeting app. Lesson learned in 20 minutes.
- Never ask an LLM to compile data. It fabricates. I asked an agent to summarize findings from 5 separate reports into one document. It invented findings that weren't in any of the source reports. Always build data compilations programmatically. Script it. Never prompt it.
- Agents will try things you never planned. The research agent tried to call an API we never set up. It assumed we had access because it knew the API existed. The fix: be explicit about what tools are available. If a script doesn't exist in the scripts/ folder, the agent can't use it. Boundaries prevent creative failures.
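Compiling reports programmatically, as the gotchas demand, is just a set-merge. A minimal sketch, assuming each report carries an agent name and a findings array (field and function names are illustrative):

```javascript
// Merge findings from several agent reports into one deduplicated
// list, keyed by URL + issue, tracking which agents reported each.
// No LLM involved, so nothing can be fabricated in the merge.
function compileFindings(reports) {
  const seen = new Map();
  for (const report of reports) {
    for (const f of report.findings) {
      const key = `${f.url}|${f.issue}`;
      if (!seen.has(key)) seen.set(key, { ...f, sources: [] });
      seen.get(key).sources.push(report.agent);
    }
  }
  return [...seen.values()];
}
```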
Build the Reviewer First
This is counterintuitive. When you're excited about building, you want to build the workers. The crawler. The analyzers. The fun parts.
Build the reviewer first.
Here's why: without a review layer, you have no way to measure quality. You ship the first audit and it looks great. But 40% of the findings are wrong. You don't know that until a client or a colleague spots it.
Our review agent reads every finding from every specialist agent. It checks:
- Does the evidence support the claim?
- Is the severity appropriate for the actual impact?
- Are there duplicates across different specialists?
- Did the agent check what it says it checked?
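The cheapest of those checks, "did the agent actually fetch what it cites," can be sketched like this (names and the verdict strings are illustrative):

```javascript
// Cross-reference each finding's URL against the crawl log. A finding
// about a URL the crawler never fetched cannot have real evidence.
function reviewFindings(findings, crawledUrls) {
  const crawled = new Set(crawledUrls);
  return findings.map((f) => ({
    ...f,
    verdict: crawled.has(f.url)
      ? 'approved'
      : 'rejected: agent never fetched this URL',
  }));
}
```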
That single agent was the biggest quality improvement I made. Bigger than any prompt tweak. Bigger than any new tool.
The approval rate across 270 internal linking recommendations: 99.6%. That number exists because a reviewer verifies every single one.
I've seen the same pattern with human SEO teams for 20 years. The teams that produce great work aren't the ones with the best analysts. They're the ones with the best review process.
The analysis is table stakes. The review is the product.
Pro tip: If you're building multiple agents, the reviewer should be the FIRST agent you build, not the last. Define what "good output" looks like before you build the thing that produces output. Otherwise you're shipping hallucinations with formatting. I learned this across 3 audits that were embarrassing in hindsight.
The Validation Standard (Our Unfair Advantage)
The reviewer catches technical errors. But there's a higher bar than "technically correct."
We have a real SEO agency. Real clients. A team with 50 years of combined experience. Every agent finding gets validated against one question: "Would we stake our reputation on this?"
Not "does it look right." Would we actually send this to a client, put our name on the report, and tell the developer to build it?
Four tests. Every finding. No exceptions.
The Google engineer test. If this client's cousin works at Google, would they read this finding and nod? Would they say "yes, this is a real issue, this makes sense"? If the answer is no, it doesn't ship.
The developer test. Can a developer reproduce this without asking a single follow-up question? "Fix your canonicals" fails. "Change CANONICAL_BASE_URL from http to https in your production .env" passes.
The agency reputation test. Would we defend this finding in a client meeting? If I'd be embarrassed explaining it to a technical CMO, it gets cut.
The implementation test. Is this specific enough to actually fix? Not "improve your page speed" but "your hero video is 3.4MB, which is 72% of total page weight. Serve a compressed version to mobile. Here's the file."
This is our unfair advantage. We're not building agents in a vacuum. Most people building AI SEO tools have never run a real audit. They don't know what "good" looks like.
We do. We've been delivering it for 20 years with real clients. That's why our approval rate is 99.6%.
Sandbox Testing: Train on Planted Bugs
You don't train an agent on real client sites. You build a test environment where you KNOW the answers.
We built two sandbox websites with SEO issues we planted on purpose:
- A WordPress-style site with 27+ planted issues: missing canonicals, redirect chains, orphan pages, duplicate content, broken schema markup.
- A Node.js site simulating React/Next.js/Angular patterns with ~90 planted issues: empty SPA shells, hash routing, stale cached pages, hydration mismatches, cloaking.
The training loop:
- Run agent against sandbox
- Compare agent's findings to known issues
- Agent missed something? Fix the instructions
- Agent reported a false positive? Add it to gotchas.md
- Re-run. Compare again.
- Only when it passes the sandbox consistently does it touch real data
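The compare step of the loop is a set difference. A sketch, assuming planted issues and reported findings carry stable IDs (names are illustrative):

```javascript
// Score one sandbox run: what the agent missed feeds back into
// AGENTS.md; what it invented feeds back into references/gotchas.md.
function scoreSandboxRun(planted, reported) {
  const plantedSet = new Set(planted);
  const reportedSet = new Set(reported);
  const missed = planted.filter((id) => !reportedSet.has(id));
  const falsePositives = reported.filter((id) => !plantedSet.has(id));
  return {
    missed,          // fix the instructions
    falsePositives,  // add to gotchas.md
    recall: (planted.length - missed.length) / planted.length,
  };
}
```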
Think of it like a driving test course. Every accident that happens on real roads gets turned into a new obstacle on the course. New drivers face every known challenge before they hit the highway.
The sandbox is a living test suite. Every verified issue from a real audit gets baked back in. It only gets harder. The agents only get better.
Consistency: The Unsexy Secret
Nobody writes about this because it's boring. But consistency is what separates a demo from a product.
Three things that make output consistent:
Templates. Every agent has an output template in templates/output.md. Exact fields. Exact structure. Exact severity scale.
If the output looks different every run, you don't need a better prompt. You need a template file.
Run logs. After every execution, the agent appends a summary to memory/runs.log. Timestamp, site, pages crawled, issues found, duration.
The next run reads this log. It knows what happened last time. It can compare. "Found 14 issues last run. Found 16 this run. 2 new issues identified."
Schema enforcement. Field names are locked. "severity" not "priority." "url" not "page_url." "description" not "summary."
When you let field names drift, downstream tooling breaks. Templates solve this permanently.
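Enforcement can be mechanical. A sketch of a validator that rejects the drifted names above (the required fields and the drift mapping are illustrative, pulled from the examples in this section):

```javascript
// Locked field names for every finding. Anything else is an error,
// not a stylistic choice.
const REQUIRED_FIELDS = ['url', 'severity', 'description'];

function validateFinding(finding) {
  const errors = [];
  for (const field of REQUIRED_FIELDS) {
    if (!(field in finding)) errors.push(`missing required field: ${field}`);
  }
  // Known drifted names, mapped to the locked name they should be.
  const drifted = { priority: 'severity', page_url: 'url', summary: 'description' };
  for (const [bad, good] of Object.entries(drifted)) {
    if (bad in finding) errors.push(`drifted field "${bad}": use "${good}"`);
  }
  return errors;
}
```

Run this on every agent's output before it reaches downstream tooling, and drift becomes a hard failure instead of a silent schema change.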
If your agent output looks different every run, you need a template file, not a better prompt. I cannot stress this enough. The single fastest quality improvement for any agent is a strict output template.
The Stack That Makes It Work
A quick note on infrastructure, because the tools matter.
Our agents run on OpenClaw. It's the runtime that handles wake-ups, sessions, memory, and tool routing.
Think of it as the operating system the agents run on. When an agent finishes one task and needs to pick up the next, OpenClaw handles that transition. When an agent needs to remember what it did last session, OpenClaw provides that memory.
Paperclip is the company OS. Org charts, goals, issue tracking, task assignments.
It's where agents coordinate. When the crawler finishes mapping a site and needs to hand off to the specialist agents, Paperclip manages that handoff through its issue system. Agents create tasks for each other. Auto-wake on assignment.
Claude Code is the builder. Every script, every agent instruction file, every tool was built with Claude Code running Opus 4.6.
I'm a vibe coder. 20 years of SEO expertise, zero traditional programming training. Claude Code turns domain knowledge into working software.
The combination: OpenClaw runs the agents. Paperclip coordinates them. Claude Code builds everything.
The Result
14+ audits completed. 12 to 20 developer-ready tickets per audit, with exact URLs and fix instructions. Hours, not weeks.
99.6% approval rate on internal linking recommendations. 270 links across 2 sites, verified by a dedicated review process.
More than 80 SEO checks mapped across 7 specialist agents. Each check has expected outcomes, evidence requirements, and false positive rules.
Every finding is specific. Not "your site is slow." Instead: "Your hero video is 3.4MB, which is 72% of total page weight. The main app JS bundle is 78% unused. Here are the exact files to fix."
That level of specificity comes from the skill architecture. The folder structure. The tools. The references. The templates. The review layer.
Not the prompt.
If you want to build SEO agent skills that actually work, stop writing prompts and start building workspaces. Give your agents tools, not instructions. Test on sandboxes, not clients.
Build the reviewer first. Enforce templates. Log everything.
The first version will fail. The fifth version will surprise you.
I know this because I laid out the foundation in my first article on what Agentic SEO actually is. That piece is the methodology. This article is the how-to manual for building on top of it.
The agents do it the way I taught them. Every time. While I sleep. The 14th audit gets the same precision as the first.
Not because the AI is smart. Because the skill architecture is sound.

Want to see this running on your brand?
Book a demo and see how our systems turn into compounding organic growth.




