How do I filter out bad URLs before my crawler follows them?
The problem
An AI agent crawling the web wastes compute following dead links, redirect traps, and parked domains. Every bad URL your crawler follows is time, bandwidth, and money spent on content that will never make it into your index. Worse, malicious content can end up in your pipeline if the crawler follows links indiscriminately.
Redirect loops burn resources. A single redirect chain with five or more hops can tie up a crawler thread for seconds, and at scale those seconds compound into hours of wasted processing. Without pre-flight URL validation, your crawler's behaviour is essentially "follow everything and hope for the best."
How Unphurl solves it
Run URLs through Unphurl before your crawler follows them. The crawler scoring profile weights redirect behaviour, domain legitimacy, and resolution signals so your AI agents only follow URLs worth crawling. Known-good domains resolve instantly from cache. Suspicious URLs get analysed and scored so your crawler can skip, flag, or deprioritise them.
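A minimal pre-flight sketch of that flow in Python. The endpoint, auth header, and request body follow the curl example under "Get started"; the response schema is an assumption — this sketch assumes the API returns a numeric "score" where higher means riskier:

```python
import json
import urllib.request

API_URL = "https://api.unphurl.com/v1/analyse"  # endpoint from the Get started example

def analyse_url(url: str, api_key: str, profile: str = "crawler") -> dict:
    """Send one URL through the analyse endpoint before the crawler follows it."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps({"url": url, "profile": profile}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def should_crawl(analysis: dict, threshold: int = 50) -> bool:
    """ASSUMPTION: response carries a numeric 'score'; higher = riskier."""
    return analysis.get("score", 0) < threshold
```

Your crawler calls `should_crawl` on each candidate URL and skips, flags, or deprioritises anything at or above the threshold.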
Signals that matter for this use case
- Redirects 5+ – flags redirect traps and chains that waste crawler resources.
- Chain incomplete – the URL doesn't resolve. Dead end, nothing to crawl.
- Parked – the domain is a placeholder with no real content to index.
- URL contains IP – flags raw IP addresses, which often point to suspicious or ephemeral infrastructure.
- TLD redirect change – detects when the final destination sits on a different TLD than expected.
- Subdomain excessive – flags deeply nested subdomains, which often indicate tracking URLs or spam infrastructure.
- HTTP only – signals a poorly maintained or abandoned site.
- Domain age under 7 days – flags brand-new domains that may be ephemeral or malicious.
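A few of these signals can be approximated client-side before spending a pipeline check. This sketch covers only the URL-shape signals (IP in host, HTTP-only, deep subdomain nesting); the function name and thresholds are illustrative, not part of the Unphurl API:

```python
import ipaddress
from urllib.parse import urlsplit

def local_red_flags(url: str, max_subdomain_depth: int = 3) -> list:
    """Cheap local approximations of a few signals; thresholds are illustrative."""
    flags = []
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)      # raises ValueError for ordinary hostnames
        flags.append("url_contains_ip")
    except ValueError:
        pass
    if parts.scheme == "http":
        flags.append("http_only")
    # Rough proxy for "excessive subdomains": count dot-separated labels.
    if host.count(".") >= max_subdomain_depth + 1:
        flags.append("subdomain_excessive")
    return flags
```

Anything flagged locally can be deprioritised outright; everything else goes to Unphurl for the signals that need actual resolution (redirect chains, parked detection, domain age).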
Suggested scoring profile
{
  "name": "crawler",
  "weights": {
    "redirects_5": 30,
    "chain_incomplete": 30,
    "parked": 25,
    "url_contains_ip": 15,
    "tld_redirect_change": 10,
    "subdomain_excessive": 10,
    "http_only": 10,
    "domain_age_7": 5
  }
}
What a result looks like
Your AI agent's crawler encounters 10,000 URLs during a crawl session. Unphurl analyses each one before the crawler follows it:
Total URLs checked: 10,000
Resolved free (Tranco/cache): 9,200
Pipeline checks: 800
Flagged: 340
Breakdown of flagged URLs:
120 redirect traps (5+ hops)
95 parked domains
80 dead links (chain incomplete)
45 suspicious (IP addresses, TLD changes)
Your crawler skips the 340 URLs that scored above your threshold, saving compute and keeping low-quality content out of your index. The remaining 9,660 URLs scored below your threshold and proceed to crawling.
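To make the threshold decision concrete, here is a sketch that combines the profile's weights additively — an assumption about how scoring works, for illustration only (Unphurl's actual formula may differ):

```python
# Weights copied from the suggested "crawler" profile above.
CRAWLER_WEIGHTS = {
    "redirects_5": 30, "chain_incomplete": 30, "parked": 25,
    "url_contains_ip": 15, "tld_redirect_change": 10,
    "subdomain_excessive": 10, "http_only": 10, "domain_age_7": 5,
}

def risk_score(fired_signals: set, weights: dict = CRAWLER_WEIGHTS) -> int:
    """ASSUMPTION: signals combine additively into one risk score."""
    return sum(weights.get(s, 0) for s in fired_signals)

# A URL that hit a redirect trap on a week-old domain:
# risk_score({"redirects_5", "domain_age_7"}) -> 35
```

Under this additive assumption, a single strong signal like a parked domain (25) or a redirect trap (30) can clear a low threshold on its own, while weak signals like HTTP-only (10) only matter in combination.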
Cost
Depends on crawl scope. Most well-linked sites resolve from the Tranco list or cache at no cost. Only unknown or suspicious URLs require pipeline checks. The Pro package ($99/month for 2,000 pipeline checks) handles large crawls comfortably. If your AI agents crawl the same domains repeatedly, caching means your effective cost drops with each run.
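A back-of-envelope check against the Pro plan, using the free-resolution rate from the example crawl above (the helper function is illustrative):

```python
def pipeline_checks_needed(urls_checked: int, cache_hit_rate: float) -> int:
    """Only cache/Tranco misses consume paid pipeline checks."""
    return round(urls_checked * (1 - cache_hit_rate))

# The example crawl: 10,000 URLs, 9,200 resolved free (92%).
checks = pipeline_checks_needed(10_000, 0.92)  # 800 checks
within_pro_plan = checks <= 2_000              # fits the $99/month Pro quota
```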
Get started
# Check a URL before crawling it
curl -X POST https://api.unphurl.com/v1/analyse \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/page",
    "profile": "crawler"
  }'