How do I filter out bad URLs before my crawler follows them?
The problem
An AI agent crawling the web wastes compute following dead links, redirect traps, and parked domains. Every bad URL your crawler follows is time, bandwidth, and money spent on content that will never make it into your index. Worse, malicious content can end up in your pipeline if the crawler follows links indiscriminately.
Redirect loops burn resources. A single redirect chain with five or more hops can tie up a crawler thread for seconds, and at scale those seconds compound into hours of wasted processing. Without pre-flight URL validation, your crawler's behaviour is essentially "follow everything and hope for the best."
How Unphurl solves it
Run URLs through Unphurl before your crawler follows them. The crawler scoring profile weights redirect behaviour, domain legitimacy, and resolution signals so your AI agents only follow URLs worth crawling. Known-good domains resolve instantly from cache. Suspicious URLs get analysed and scored so your crawler can skip, flag, or deprioritise them.
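A minimal pre-flight sketch of that flow in Python. The endpoint, auth header, and request body follow the curl example under "Get started"; the response schema is an assumption — this sketch assumes the API returns a numeric "score" where higher means riskier:

```python
import json
import urllib.request

API_URL = "https://api.unphurl.com/v1/analyse"  # endpoint from the Get started example

def analyse_url(url: str, api_key: str, profile: str = "crawler") -> dict:
    """Send one URL through the analyse endpoint before the crawler follows it."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps({"url": url, "profile": profile}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def should_crawl(analysis: dict, threshold: int = 50) -> bool:
    """ASSUMPTION: response carries a numeric 'score'; higher = riskier."""
    return analysis.get("score", 0) < threshold
```

Your crawler calls `should_crawl` on each candidate URL and skips, flags, or deprioritises anything at or above the threshold.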
Signals that matter for this use case
- Redirects 5+ – flags redirect traps and chains that waste crawler resources.
- Chain incomplete – the URL doesn't resolve. Dead end, nothing to crawl.
- Parked – the domain is a placeholder with no real content to index.
- URL contains IP – flags raw IP addresses, which often point to suspicious or ephemeral infrastructure.
- TLD redirect change – detects when the final destination sits on a different TLD than expected.
- Subdomain excessive – flags deeply nested subdomains, which often indicate tracking URLs or spam infrastructure.
- HTTP only – signals a poorly maintained or abandoned site.
- Domain age under 7 days – flags brand-new domains that may be ephemeral or malicious.
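A few of these signals can be approximated client-side before spending a pipeline check. This sketch covers only the URL-shape signals (IP in host, HTTP-only, deep subdomain nesting); the function name and thresholds are illustrative, not part of the Unphurl API:

```python
import ipaddress
from urllib.parse import urlsplit

def local_red_flags(url: str, max_subdomain_depth: int = 3) -> list:
    """Cheap local approximations of a few signals; thresholds are illustrative."""
    flags = []
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)      # raises ValueError for ordinary hostnames
        flags.append("url_contains_ip")
    except ValueError:
        pass
    if parts.scheme == "http":
        flags.append("http_only")
    # Rough proxy for "excessive subdomains": count dot-separated labels.
    if host.count(".") >= max_subdomain_depth + 1:
        flags.append("subdomain_excessive")
    return flags
```

Anything flagged locally can be deprioritised outright; everything else goes to Unphurl for the signals that need actual resolution (redirect chains, parked detection, domain age).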
Suggested scoring profile
{
  "name": "crawler",
  "weights": {
    "redirects_5": 30,
    "chain_incomplete": 30,
    "parked": 25,
    "url_contains_ip": 15,
    "tld_redirect_change": 10,
    "subdomain_excessive": 10,
    "http_only": 10,
    "domain_age_7": 5
  }
}
What a result looks like
Your AI agent's crawler encounters 10,000 URLs during a crawl session. Unphurl analyses each one before the crawler follows it:
Total URLs checked: 10,000
Resolved free (Tranco/cache): 9,200
Pipeline checks: 800
Flagged: 340
Breakdown of flagged URLs:
120 redirect traps (5+ hops)
95 parked domains
80 dead links (chain incomplete)
45 suspicious (IP addresses, TLD changes)
Your crawler skips the 340 URLs that scored above your threshold, saving compute and keeping low-quality content out of your index. The remaining 9,660 URLs scored below your threshold and proceed to crawling.
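To make the threshold decision concrete, here is a sketch that combines the profile's weights additively — an assumption about how scoring works, for illustration only (Unphurl's actual formula may differ):

```python
# Weights copied from the suggested "crawler" profile above.
CRAWLER_WEIGHTS = {
    "redirects_5": 30, "chain_incomplete": 30, "parked": 25,
    "url_contains_ip": 15, "tld_redirect_change": 10,
    "subdomain_excessive": 10, "http_only": 10, "domain_age_7": 5,
}

def risk_score(fired_signals: set, weights: dict = CRAWLER_WEIGHTS) -> int:
    """ASSUMPTION: signals combine additively into one risk score."""
    return sum(weights.get(s, 0) for s in fired_signals)

# A URL that hit a redirect trap on a week-old domain:
# risk_score({"redirects_5", "domain_age_7"}) -> 35
```

Under this additive assumption, a single strong signal like a parked domain (25) or a redirect trap (30) can clear a low threshold on its own, while weak signals like HTTP-only (10) only matter in combination.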
Cost
Depends on crawl scope. Most well-linked sites resolve from the Tranco list or cache at no cost. Only unknown or suspicious URLs require pipeline checks. The Pro package ($99/month for 2,000 pipeline checks) handles large crawls comfortably. If your AI agents crawl the same domains repeatedly, caching means your effective cost drops with each run.
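A back-of-envelope check against the Pro plan, using the free-resolution rate from the example crawl above (the helper function is illustrative):

```python
def pipeline_checks_needed(urls_checked: int, cache_hit_rate: float) -> int:
    """Only cache/Tranco misses consume paid pipeline checks."""
    return round(urls_checked * (1 - cache_hit_rate))

# The example crawl: 10,000 URLs, 9,200 resolved free (92%).
checks = pipeline_checks_needed(10_000, 0.92)  # 800 checks
within_pro_plan = checks <= 2_000              # fits the $99/month Pro quota
```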
Get started
# Check a URL before crawling it
curl -X POST https://api.unphurl.com/v1/analyse \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/page",
    "profile": "crawler"
  }'