hngopher.com

       [HN Gopher] Finding Dead Websites
       ___________________________________________________________________
        
       Author : ingve
       Score  : 98 points
       Date   : 2025-06-17 12:03 UTC (2 days ago)
        
 (HTM) web link (www.marginalia.nu)
 (TXT) w3m dump (www.marginalia.nu)
        
       | 55555 wrote:
       | It's a real edge case, but someone could conceivably let their
       | own domain expire and then register it anew and restore their
       | website. It will be impossible to tell this apart from an SEO
       | buying and restoring a website to use for link juice.
        
         | AznHisoka wrote:
         | The DNS records would be completely revamped, or removed in
         | that case.
        
         | marginalia_nu wrote:
         | Yeah there's no shortage of caveats in this space. One could
         | conceivably compare the outgoing links (being a search engine
         | and all and having historical crawl data to compare against),
          | but my hunch is that the cost of distinguishing between these
          | two cases is going to be way out of proportion when compared
          | to the benefit.
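
A minimal sketch of the outgoing-link comparison mentioned above, assuming the crawler can produce the set of outgoing link targets for both the historical and the current version of a site. The function names and the 0.5 threshold are illustrative assumptions, not anything taken from Marginalia's code.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_like_same_site(old_links: set, new_links: set,
                         threshold: float = 0.5) -> bool:
    """Guess whether a re-registered domain still hosts the original site.

    The threshold is an untuned, hypothetical value.
    """
    return jaccard(old_links, new_links) >= threshold


# Historical crawl vs. current crawl (made-up link targets).
old = {"a.example", "b.example", "c.example"}
restored = {"a.example", "b.example", "d.example"}
repurposed = {"casino.example", "pills.example"}

print(looks_like_same_site(old, restored))    # overlap 2 of 4 links
print(looks_like_same_site(old, repurposed))  # no overlap
```

A restored site that keeps most of its old links scores high, while a repurposed domain usually shares few or none. As the comment notes, whether this is worth the cost is a separate question.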
        
       | atribecalledqst wrote:
       | Before I RTFA, I was wondering if this would be about trying to
       | find a way to include Wayback Machine results in search.
       | Searching the Wayback Machine is always such a nightmare, and
       | wouldn't it be nice if your search turned up that long-dead 1997
       | web page that has the exact answer for what you're looking for...
       | 
       | (a minor use case I had recently: I was trying to find old
       | Japanese blogs about Tamagotchis, which I gather there were a ton
       | of in the 90s but almost none survive today - imagine if I could
       | get those instead of the 1,000,000 sites just trying to sell them
       | to me)
        
         | Lammy wrote:
         | Kagi has this feature, "Blast From The Past"
         | https://blog.kagi.com/kagi-features#:~:text=Interesting%20fi...
        
           | marginalia_nu wrote:
           | They're likely only serving previously accessible domains
           | already in their index as Wayback Machine links, which is
           | neat, but doesn't really solve the problem of indexing the
           | Wayback Machine in a broader sense.
           | 
           | Would be a very nice feature to have indeed, though the data
           | is a bit too inaccessible to index as far as I can tell (even
           | though I've not given it any serious effort, so maybe it is?)
        
             | Lammy wrote:
             | I kinda consider that a feature and not a bug. If it were
             | easier to find all the really deep stuff in the Wayback
             | Machine, people would be trying to censor it all the time.
             | I like being able to spear-fish my way into the deep shit
             | by finding layers of URI references in other archived
             | pages.
        
         | cosmicgadget wrote:
         | Agreed, it'd be neat to test links on the fly and substitute
         | wayback links if they are dead and cached information if there
         | is no snapshot.
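
A rough sketch of that fallback order (live link, then Wayback snapshot, then cached copy), using the Internet Archive's public availability endpoint at https://archive.org/wayback/available. The helper names `closest_snapshot`, `fetch_snapshot` and `resolve_link` are hypothetical, and how liveness is tested or where cached text comes from is left to the caller.

```python
import json
from typing import Optional, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"


def closest_snapshot(response: dict) -> Optional[str]:
    """Extract the closest snapshot URL from an availability response."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def fetch_snapshot(url: str, timeout: float = 10.0) -> Optional[str]:
    """Ask the Wayback Machine whether it holds a snapshot of `url`."""
    query = urlencode({"url": url})
    with urlopen(f"{WAYBACK_API}?{query}", timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))


def resolve_link(url: str, is_alive: bool, snapshot_url: Optional[str],
                 cached_text: Optional[str]) -> Tuple[str, str]:
    """Pick what to show for a link: live page, snapshot, or cached text."""
    if is_alive:
        return ("live", url)
    if snapshot_url:
        return ("wayback", snapshot_url)
    if cached_text:
        return ("cache", cached_text)
    return ("dead", url)
```

In use, a caller would check the link, call `fetch_snapshot(url)` only when it is dead, and pass the result to `resolve_link`. Keeping the decision in a pure function makes it testable without network access.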
        
       | l5870uoo9y wrote:
       | What a pleasant website theme for reading.
        
       | mlhpdx wrote:
       | I'm not sure what the author's point was with respect to ASN
       | 16509. Are they saying parked domains don't like being viewed by
       | Amazon IPs or that moving to Amazon is a strong signal for being
       | parked? The latter seems absurd. But is it?
        
         | marginalia_nu wrote:
         | It seems an especially strong signal along with the other
         | signals, i.e. ok status + losing encryption.
         | 
         | The entire game is combining a bunch of weak indicators into a
         | strong one.
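
One common way to combine weak indicators is to add up log-odds, naive Bayes style. The thread does not say how Marginalia actually does it, so the sketch below is only an illustration: the signal names mirror the ones discussed (move to ASN 16509, lost encryption, OK status), but the likelihood ratios and the prior are invented numbers, and the method assumes the signals are independent.

```python
import math

# Hypothetical likelihood ratios: P(signal | parked) / P(signal | alive).
LIKELIHOOD_RATIOS = {
    "moved_to_asn_16509": 4.0,
    "lost_https": 8.0,
    "http_status_ok": 2.0,
}

# Hypothetical prior odds that a changed domain is parked (1 in 20).
PRIOR_ODDS = 0.05


def parked_probability(observed: set) -> float:
    """Combine the observed signals into a single probability estimate."""
    log_odds = math.log(PRIOR_ODDS)
    for name in observed:
        log_odds += math.log(LIKELIHOOD_RATIOS[name])
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


one = parked_probability({"moved_to_asn_16509"})
all_three = parked_probability(set(LIKELIHOOD_RATIOS))
print(f"ASN change alone: {one:.2f}, all three signals: {all_three:.2f}")
```

With these made-up numbers the ASN change alone barely moves the estimate, while all three signals together push it well past even odds, which is the effect the comment describes.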
        
       | koprocezar wrote:
       | That was interesting.
        
       | renegat0x0 wrote:
       | Whoa, this is what I have been wondering for some time, for my
       | crawler.
       | 
       | Crawler results depend on domain authority. If the page owner or
       | the page contents change, the ranking may, or should, change.
       | 
       | However, the original author could also change the contents, and
       | then the page ranking should not change. So it is not easy to
       | determine what to do with a domain if it becomes inactive or
       | changes its contents dramatically.
       | 
       | Currently I use only a 30-day window to keep track of domains.
       | After that period an inactive domain is thrown out of the window.
       | 
       | However, valuable domains, even if dead, reside longer. My UI
       | provides an easy link to the Wayback Machine, so even for dead
       | links I can browse them.
       | 
       | I also noticed that some domains, even if expired, still serve
       | content, even if the author has abandoned them. The page content
       | is served, but with a notice that the domain has expired.
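
The 30-day window with a longer grace period for valuable domains could look roughly like the sketch below. The 30 days come from the comment; the 365-day retention for valuable domains, the `prune` function and the data layout are assumptions made for illustration, since the comment does not say how much longer such domains are kept.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)            # stated in the comment
VALUABLE_WINDOW = timedelta(days=365)  # hypothetical longer retention


def prune(last_seen_alive: dict, valuable: set, now: datetime) -> dict:
    """Keep only domains seen alive within their retention window."""
    kept = {}
    for domain, seen in last_seen_alive.items():
        limit = VALUABLE_WINDOW if domain in valuable else WINDOW
        if now - seen <= limit:
            kept[domain] = seen
    return kept


now = datetime(2025, 6, 19)
domains = {
    "fresh.example": datetime(2025, 6, 10),    # seen 9 days ago
    "stale.example": datetime(2025, 4, 1),     # seen 79 days ago
    "classic.example": datetime(2025, 4, 1),   # same age, but valuable
}
remaining = prune(domains, valuable={"classic.example"}, now=now)
print(sorted(remaining))
```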
        
       | JdeBP wrote:
       | As someone with a WWW site hit by Brexit where half the country
       | voted to stop me having my domain name (and some other things) I
       | read this with interest to consider how badly it would be caught
       | out on the sort of false positive where a WWW site owner has to
       | change ASes, change HTTP servers, set up redirects and meta
       | information for the time left before eu. becomes unavailable, and
       | even change DNS servers let alone a number of resource records. A
       | lot of those seem to be things that will add up in this model. As
       | would the fact that my prior domain name is today parked. In
       | Canada!
       | 
       | Not the first sudden and unwelcome discontinuity, either.
       | 
       |  _Google_ came close to thinking that I was dead, and turned out
       | when I recently checked to be still looking for me under eu.,
       | years after the fact.
       | 
       | And with a broader view, this sort of stuff happens to the world,
       | and there are enough people in the same boat that it is worth
       | thinking of false positives when major upheavals occur. They can
       | range from ISPs just up and deciding to close up shop with zero
       | notice (which also happened to me) to international geopolitical
       | upheavals. Who knows! If Brexit happened, it is conceivable that
       | one day, the island of Niue might eventually prevail and then
       | decide overnight that non-Niue citizens may not own a nu. domain.
       | (-:
       | 
       | I wonder how many times Marginalia would have declared me dead,
       | by now. (-:
        
         | marginalia_nu wrote:
         | I think some degree of false positives is inevitable with this
         | type of feature, but it can still provide use even if it's not
         | perfect. Websites with flakey profiles that keep changing emit
         | a signal of their own.
        
       ___________________________________________________________________
       (page generated 2025-06-19 23:01 UTC)