
How do I find dead citations and broken links in research archives?

The problem

Studies show 25%+ of web links die within a few years. University libraries, journal publishers, and research archives maintain collections of cited URLs that gradually rot. A citation that pointed to a live dataset or source paper five years ago may now resolve to a parked page, a dead domain, or nothing at all.

Dead citations undermine the integrity of academic references. Researchers who follow a citation and land on a dead page lose access to the evidence chain. Manual link checking at scale is prohibitively expensive, and most institutions simply don't do it.

How Unphurl solves it

Export your citation URLs and batch check them through Unphurl with a link rot profile. The system analyses each URL for signs of decay: broken resolution chains, parked domains, expiring registrations, excessive redirects. You get a scored list that separates live citations from dead or dying ones, so you can prioritise which references need archival links or replacement.

The cache grows with each scan, so repeat sweeps of the same archive cost significantly less. Run an initial baseline, then incremental checks on new additions.

Signals that matter for this use case

  • Chain incomplete means the URL can't resolve. The cited resource is gone.
  • Parked domain means the domain behind the citation has been abandoned.
  • Expiring soon means the domain registration is about to lapse. The link is likely to die.
  • Bad domain status (pendingDelete, serverHold) means the domain is effectively dead.
  • Excessive redirects (5+) suggests the original resource has moved multiple times and may be unreachable.
  • Moderate redirects (3+) are worth flagging for review, as redirect chains degrade over time.

Suggested scoring profile

{
  "name": "link-rot",
  "weights": {
    "chain_incomplete": 35,
    "parked": 30,
    "expiring_soon": 20,
    "domain_status_bad": 25,
    "redirects_5": 20,
    "redirects_3": 10,
    "ssl_invalid": 0,
    "brand_impersonation": 0,
    "domain_age_7": 0
  }
}
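The exact aggregation Unphurl uses isn't documented here, but a reasonable reading is that a URL's score is the sum of the weights for whichever signals fire. Treat the sketch below as an illustration under that assumption, not a description of the real engine:

```python
# Illustrative only: assumes score = sum of weights for triggered signals.
# Weights copied from the link-rot profile above.
LINK_ROT_WEIGHTS = {
    "chain_incomplete": 35,
    "parked": 30,
    "expiring_soon": 20,
    "domain_status_bad": 25,
    "redirects_5": 20,
    "redirects_3": 10,
}

def score(signals: set[str], weights: dict[str, int] = LINK_ROT_WEIGHTS) -> int:
    """Sum the profile weights for every signal detected on a URL."""
    return sum(weights.get(s, 0) for s in signals)

# A parked domain with a moderate redirect chain:
print(score({"parked", "redirects_3"}))  # 40
```

Under this reading, a parked domain alone (30) already clears a cut-off of 30, while moderate redirects alone (10) do not.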

What a result looks like

Scanning 50,000 citation URLs from a research archive:

50,000 citation URLs checked:
  Tranco / cache (free):      35,000
  Pipeline checks (charged):  15,000
  Still live:                 12,500
  Flagged as dead or dying:    2,500
    Chain incomplete (dead):   1,100
    Parked (abandoned):          800
    Expiring soon:               350
    Excessive redirects:         250

2,500 dead or dying citations identified. Each one can be flagged for archival lookup (Wayback Machine, DOI resolution) or marked as unavailable.

Cost

Expect one large initial scan, followed by incremental checks on new additions. The cache grows with each scan, so repeat sweeps cost less. The Scale package ($399 for 10,000 pipeline checks) handles a full archive scan. Subsequent quarterly sweeps of the same archive will resolve most domains from cache for free.

Get started

For developers: Export citation URLs and batch check via CLI.

# Export citation URLs as CSV, then:
unphurl --batch citations.csv --profile link-rot

# Get dead citations for archival review
unphurl --batch citations.csv --profile link-rot --json | \
  jq -r '.results[] | select(.result.score >= 30) | .url' > dead-citations.txt
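To keep quarterly sweeps incremental, you only need to check URLs that weren't in the previous export. A minimal sketch (file names are illustrative, and it assumes one URL per line in each export):

```python
def new_urls(previous_export: str, current_export: str) -> list[str]:
    """Return URLs present in the current export but absent from the previous one."""
    with open(previous_export) as f:
        seen = {line.strip() for line in f if line.strip()}
    with open(current_export) as f:
        return [u for u in (line.strip() for line in f) if u and u not in seen]

# Write the additions to a file, then batch check only those:
#   unphurl --batch new-citations.csv --profile link-rot
```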

For non-coders (Claude Cowork, OpenClaw, or any AI agent): Give your AI your exported CSV and ask: "Check all these citation URLs. Give me a list of the dead and dying ones so I can find archival replacements."