[HN Gopher] Finding Dead Websites
___________________________________________________________________
Finding Dead Websites
Author : ingve
Score : 98 points
Date : 2025-06-17 12:03 UTC (2 days ago)
(HTM) web link (www.marginalia.nu)
(TXT) w3m dump (www.marginalia.nu)
| 55555 wrote:
| It's a real edge case, but someone could conceivably let their
| own domain expire and then register it anew and restore their
| website. It will be impossible to tell this apart from an SEO
| buying and restoring a website to use for link juice.
| AznHisoka wrote:
| The DNS records would be completely revamped, or removed in
| that case.
| marginalia_nu wrote:
| Yeah there's no shortage of caveats in this space. One could
| conceivably compare the outgoing links (being a search engine
| and all and having historical crawl data to compare against),
| but my hunch is that the cost of distinguishing between these
| two cases is going to be way out of proportion when compared to
| the benefit.
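|
| A minimal sketch of the outgoing-link comparison, assuming the
| crawler can supply the set of outbound links from a historical
| snapshot and from the current site. The function names and the
| 0.5 threshold are hypothetical, not anything Marginalia is known
| to use.

```python
def jaccard(old_links, new_links):
    """Jaccard similarity between two collections of outbound URLs."""
    a, b = set(old_links), set(new_links)
    if not a and not b:
        # Two link-less pages are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


def looks_like_same_site(old_links, new_links, threshold=0.5):
    """Guess whether a re-registered domain still hosts the old site.

    The threshold is an arbitrary illustrative value; a real system
    would need to tune it against labelled crawl data.
    """
    return jaccard(old_links, new_links) >= threshold
```

| A restored personal site would tend to keep most of its old
| outbound links, while a domain repurposed for link juice would
| likely point somewhere new.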
| atribecalledqst wrote:
| Before I RTFA, I was wondering if this would be about trying to
| find a way to include Wayback Machine results in search.
| Searching the Wayback Machine is always such a nightmare, and
| wouldn't it be nice if your search turned up that long-dead 1997
| web page that has the exact answer for what you're looking for...
|
| (minor use case I had recently was I was trying to find old
| Japanese blogs for Tamagotchis, which I gather there were a ton
| of in the 90s but almost none survive today - imagine if I could
| get those instead of the 1,000,000 sites just trying to sell them
| to me)
| Lammy wrote:
| Kagi has this feature, "Blast From The Past"
| https://blog.kagi.com/kagi-features#:~:text=Interesting%20fi...
| marginalia_nu wrote:
| They're likely only serving previously accessible domains
| already in their index as wayback machine links, which is
| neat, but doesn't really solve the problem of indexing the
| wayback machine in a broader sense.
|
| Would be a very nice feature to have indeed, though the data
| is a bit too inaccessible to index as far as I can tell (even
| though I've not given it any serious effort, so maybe it is?)
| Lammy wrote:
| I kinda consider that a feature and not a bug. If it were
| easier to find all the really deep stuff in the Wayback
| Machine, people would be trying to censor it all the time.
| I like being able to spear-fish my way into the deep shit
| by finding layers of URI references in other archived
| pages.
| cosmicgadget wrote:
| Agreed, it'd be neat to test links on the fly and substitute
| wayback links if they are dead and cached information if there
| is no snapshot.
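|
| A rough sketch of that fallback order (live link, then Wayback
| snapshot, then cached text). It assumes the caller has already
| checked liveness and has fetched a JSON response from the
| Wayback Machine availability endpoint
| (https://archive.org/wayback/available?url=...); resolve_link
| and its arguments are hypothetical names.

```python
from typing import Optional, Tuple


def snapshot_url(availability: dict) -> Optional[str]:
    """Pull the closest snapshot URL out of an availability response."""
    closest = availability.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def resolve_link(url: str, alive: bool, availability: dict,
                 cached_text: Optional[str] = None) -> Tuple[str, str]:
    """Return (kind, value) describing what to show for a link."""
    if alive:
        return ("live", url)
    snap = snapshot_url(availability)
    if snap:
        return ("wayback", snap)
    if cached_text is not None:
        # No snapshot exists: fall back to whatever the index cached.
        return ("cached", cached_text)
    return ("dead", url)
```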
| l5870uoo9y wrote:
| What a pleasant website theme for reading.
| mlhpdx wrote:
| I'm not sure what the author's point was with respect to ASN
| 16509. Are they saying parked domains don't like being viewed by
| Amazon IPs or that moving to Amazon is a strong signal for being
| parked? The latter seems absurd. But is it?
| marginalia_nu wrote:
| It seems an especially strong signal along with the other
| signals, i.e. ok status + losing encryption.
|
| The entire game is combining a bunch of weak indicators into a
| strong one.
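|
| One common way to combine weak indicators is to sum per-signal
| log-odds weights and squash the total into a probability. The
| sketch below assumes that approach; the signal names, weights
| and prior are made up for illustration and are not taken from
| the article.

```python
import math

# Hypothetical log-odds contributions for each observed change.
WEIGHTS = {
    "moved_to_asn_16509": 1.5,
    "lost_https": 2.0,
    "http_ok": 1.0,
}
PRIOR = -3.0  # hypothetical: most domains are assumed not parked


def parked_probability(signals):
    """Combine observed signals into a single parked-domain estimate."""
    score = PRIOR + sum(WEIGHTS[s] for s in signals if s in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-score))
```

| With these made-up numbers, a move to ASN 16509 alone stays well
| below even odds, while all three signals together push the
| estimate above 0.8.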
| koprocezar wrote:
| That was interesting.
| renegat0x0 wrote:
| Whoa, this is what I have been wondering for some time, for my
| crawler.
|
| Crawler results depend on domain authority. If the page owner or
| the page contents change, the ranking may, or should, change.
|
| However, the original author could also change the contents, in
| which case the page ranking should not change. So it is not easy
| to determine what to do with a domain if it becomes inactive or
| changes its contents dramatically.
|
| Currently I use only a 30-day window to keep track of domains.
| After that period, an inactive domain is thrown out of the window.
|
| However, valuable domains, even if dead, stay longer. My UI
| provides an easy link to the Wayback Machine, so even for dead
| links I can browse them.
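|
| The retention rule described above could be sketched roughly as
| follows. The record layout, the "valuable" flag and the 365-day
| extended window are assumptions for illustration; only the
| 30-day figure comes from the comment.

```python
from datetime import datetime, timedelta


def prune(domains, now,
          window=timedelta(days=30),
          valuable_window=timedelta(days=365)):
    """Drop domains that have been inactive longer than their window.

    `domains` maps a domain name to a dict with `last_active`
    (datetime) and `valuable` (bool). Valuable domains get the
    longer, hypothetical retention period.
    """
    kept = {}
    for name, info in domains.items():
        limit = valuable_window if info["valuable"] else window
        if now - info["last_active"] <= limit:
            kept[name] = info
    return kept
```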
|
| I also noticed that some domains, even if expired, still serve
| contents, even if the author left them alone. The page contents
| are served, but with a notice that the domain has expired.
| JdeBP wrote:
| As someone with a WWW site hit by Brexit where half the country
| voted to stop me having my domain name (and some other things) I
| read this with interest to consider how badly it would be caught
| out on the sort of false positive where a WWW site owner has to
| change ASes, change HTTP servers, set up redirects and meta
| information for the time left before eu. becomes unavailable, and
| even change DNS servers let alone a number of resource records. A
| lot of those seem to be things that will add up in this model. As
| would the fact that my prior domain name is today parked. In
| Canada!
|
| Not the first sudden and unwelcome discontinuity, either.
|
| _Google_ came close to thinking that I was dead, and turned out
| when I recently checked to be still looking for me under eu.,
| years after the fact.
|
| And with a broader view, this sort of stuff happens to the world,
| and there are enough people in the same boat that it is worth
| thinking of false positives when major upheavals occur. They can
| range from ISPs just up and deciding to close up shop with zero
| notice (which also happened to me) to international geopolitical
| upheavals. Who knows! If Brexit happened, it is conceivable that
| one day, the island of Niue might eventually prevail and then
| decide overnight that non-Niue citizens may not own a nu. domain.
| (-:
|
| I wonder how many times Marginalia would have declared me dead,
| by now. (-:
| marginalia_nu wrote:
| I think some degree of false positives is inevitable with this
| type of feature, but it can still provide use even if it's not
| perfect. Websites with flakey profiles that keep changing emit
| a signal of their own.
___________________________________________________________________
(page generated 2025-06-19 23:01 UTC)