[HN Gopher] 10% of the top million sites are dead
___________________________________________________________________
10% of the top million sites are dead
Author : Soupy
Score : 235 points
Date : 2022-07-15 17:22 UTC (5 hours ago)
(HTM) web link (ccampbell.io)
(TXT) w3m dump (ccampbell.io)
| gumby wrote:
| His 'www' logic is flawed: https://www.example.com and
| https://example.com need not return the same results, but his
| checking code sends the output straight to /dev/null so he has no
| way of knowing.
| cbarrick wrote:
| In theory, sure.
|
| In practice, how many orgs serve on both example.com and
| www.example.com yet operate each as entirely separate sites?
|
| I cannot think of any example.
| gumby wrote:
| MIT did, for decades, though they seem to have changed.
| phkahler wrote:
| Read that again folks:
|
| "a very reasonable but basic check would be to check each domain
| and verify that it was online and responsive to http requests.
| With only a million domains, this could be run from my own
| computer relatively simply and it would give us a very quick
| temperature check on whether the list truly was representative of
| the "top sites on the internet". "
|
| This took him 50 minutes to run. Think about that when you want
| to host something smaller than a large commercial site. We live
| in the future now, where bandwidth is relatively high and
| computers are fast. Point being that you don't need to rent or
| provision "big infrastructure" unless you're actually quite big.
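|
| For reference, a check of that shape is a one-line shell
| pipeline (a sketch, not the author's exact command; domains.txt
| and the timeout are my assumptions):
|
|   xargs -P 512 -I{} curl -s -o /dev/null -m 10 \
|       -w '{} %{http_code}\n' {} < domains.txt > results.txt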
| cratermoon wrote:
| > you don't need to rent or provision "big infrastructure"
| unless you're actually quite big.
|
| Or if you have hard response-time requirements. I really don't
| think it would be good to, for example, wait an hour to process
| the data from 800K earthquake sensors and send out an alert to
| nearby affected areas.
| stevemk14ebr wrote:
| Your point has truth behind it for sure, but there's a large
| difference between serving requests and making requests. Many
| sites are simple HTML and CSS pages, but many others have
| complex backends. It's those that are often hard to scale, and
| why the cloud is hugely popular: maintaining and scaling a
| backend is hard.
| phkahler wrote:
| Oh absolutely, but he also said this:
|
| I found that my local system could easily handle 512 parallel
| processes, with my CPU @ ~35% utilization, 2GB of RAM usage,
| and a constant 1.5MB down on the network.
|
| Another thing that happened in the early web days was Apache.
| People needed a web server and it did the job correctly.
| Nobody ever really noticed that it had terrible performance,
| so early on infrastructure went to multiple servers and load
| balancers and all that jazz. Now with nginx, fast multi-core,
| and speedy networks even at home, it's possible to run sites
| with a hundred thousand users a day at home on a laptop. Not
| that you'd really want to do exactly that but it could be
| done.
|
| Because of this I think an alternative to github would be
| open source projects hosted on people's home machines. CI/CD
| might require distributing work to those with the right
| hardware variants though.
| [deleted]
| jayd16 wrote:
| The flip side is anyone can run these kinds of tools against
| your site easily and cheaply.
| kozziollek wrote:
| Most cities in Poland have their own $city.pl domain and allow
| websites to buy $website.$city.pl. That might not be well known.
| And the cities have their own websites, so I guess it's OK.
|
| But info.pl and biz.pl? Has nobody heard of country variants of
| gTLDs?!
| drdaeman wrote:
| Those are called Public Suffixes or effective TLDs (eTLDs):
| https://en.wikipedia.org/wiki/Public_Suffix_List
|
| And you're entirely correct that the author should've referred
| to that list.
| macintux wrote:
| Title is misleading: that's the outcome, but the bulk of the
| story is the data processing to reach that conclusion.
| hinkley wrote:
| It happens. Most of the stuff we do these days invokes a number
| of disciplines. I forget sometimes that maybe ten percent of us
| just play with random CS domains for "fun" and that most people
| are coming into big problems blind, even sometimes the
| explorers (though having comfort with exploring random fields
| is a skill set unto itself).
|
| Before the Cloud, when people would ask for a book on
| distributed computing, which wasn't that often, I would tell
| them seriously "Practical Parallel Rendering". That book was
| almost ten years old by then. 20 now. It's ostensibly a book
| about CGI, but CGI is about distributed work pools, so half the
| book is a whirlwind tour of distributed computing and queuing
| theory. Once they start talking at length about raytracing, you
| can stop reading if CGI isn't your thing, but that's more than
| halfway through the book.
|
| I still have to explain some of that stuff to people, and it
| catches them off guard because they think surely this little
| task is not so sophisticated as that...
|
| I think this is where the art comes in. You can make something
| fiddly that takes constant supervision, so much so that you get
| frustrated trying to explain it to others, or you can make
| something where you push a button and magic comes out.
| crikeyjoe wrote:
| allknowingfrog wrote:
| I don't have any particular opinions on the author's conclusions,
| but I learned a thing or two about the power of terminal commands
| by reading through the article. I had no idea that xargs had a
| parallel mode.
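|
| For anyone else who didn't: -P sets the maximum number of
| parallel jobs. A toy example (a sketch; finishes in about 2
| seconds rather than 8):
|
|   seq 8 | xargs -P 4 -I{} sh -c 'sleep 1; echo job {}'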
| thelamest wrote:
| Probably not news to anyone who works with big data(tm), but I
| learned, after additional searches, that using (something like)
| duckdb as a CSV parser makes sense, especially if the
| alternative is loading the entire thing to memory with
| (something like) base R. This was informative for me:
| https://hbs-rcs.github.io/large_data_in_R/.
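|
| For example, the duckdb CLI can aggregate a CSV without loading
| it all into memory (a sketch; the filename is an assumption):
|
|   duckdb -c "SELECT count(*)
|       FROM read_csv_auto('majestic_million.csv')"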
| zinekeller wrote:
| TLDR: Campbell's methodology is flawed, doesn't consider edge
| cases (one of which, equating apex-only and www-prefixed
| domains, I consider reckless), and he didn't understand how
| Majestic collects and processes its data.
|
| Longer version: This isn't comprehensive, but I can think of two
| main reasons why:
|
| - The Majestic Million lists only the registrable part (with some
| exceptions), and this sometimes leads to central CDNs being
| listed. For example, the Majestic Million lists wixsite.com (for
| those who are unaware, a CDN domain used by Wix.com with separate
| subdomains), but if you visit wixsite.com you won't get anything.
| Same with Azure: subdomains of azureedge.net and
| azurewebsites.net do exist (for example
| https://peering.azurewebsites.net/) but azureedge.net and
| azurewebsites.net themselves don't. Without similar filtering,
| using the Cisco list
| (https://s3-us-west-1.amazonaws.com/umbrella-static/index.htm...)
| would quickly lead you to this precise problem (mainly because
| the number one entry is "com", but phew, at least http://ai./
| does exist!)
|
| - Also, shame on the author for considering www-prefixed and
| apex-only as one and the same. For some websites, they aren't.
| Take this example: jma.go.jp (Japan Meteorological Agency), which
| doesn't respond (actually NODATA) on http://jma.go.jp/ but is
| fine on https://www.jma.go.jp/. Similarly, beian.gov.cn (Chinese
| ICP Licence Administrator) wouldn't respond at all but
| _www_.beian.gov.cn will. And ncbi.nlm.nih.gov (National Center
| for Biotechnology Information)? I can't blame Majestic:
| https://www.ncbi.nlm.nih.gov/ and https://ncbi.nlm.nih.gov/ don't
| redirect to a canonical domain, and unless you've compared the
| HTTP pages there's no way you would know that they are the same
| website!
|
| Edit: I've downloaded the CSV to check my claims, and it shows:
| wixsite.com 0, beian.gov.cn 0.
|
| Please, for the love of sanity, consider what the Majestic
| Million's (and similar lists') criteria for inclusion actually
| are. I can't believe I'm saying it, but can we crowd-source
| "Falsehoods programmers believe about domains"?
|
| Also, an addendum on crawling, though I consider this one
| "probably forgivable":
|
| - Some websites are only available in certain countries (internal
| Russian websites don't respond at all outside Russia, for
| example). This can skew the numbers a little bit.
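|
| A fairer liveness check would try both the apex and the www
| form, over both schemes, before declaring a domain dead -- a
| sketch:
|
|   check() {
|     for u in "https://$1" "https://www.$1" \
|              "http://$1" "http://www.$1"; do
|       curl -s -o /dev/null -m 10 "$u" && return 0
|     done
|     return 1
|   }
|   check jma.go.jp && echo alive || echo dead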
| [deleted]
| zepearl wrote:
| > _Take this example: jma.go.jp (Japan Meteorological Agency),
| which doesn 't respond (actually NODATA) on http://jma.go.jp/
| but is fine on https://www.jma.go.jp/. Similarly, beian.gov.cn
| (Chinese ICP Licence Administrator) wouldn't respond at all but
| www.beian.gov.cn will._
|
| I can confirm stuff like that - I'm writing a crawler & indexer
| program (prototype in Python, now writing the final version in
| Rust) and assuming anything while crawling is not OK. I ended up
| adding URLs to my "to-index" list by considering only links
| explicitly mentioned by other websites (or by pages within the
| same site).
| cratermoon wrote:
| It even says right at the top of the Majestic Million site "The
| million domains we find with the most referring subnets", not
| implying anything about reachability for http(s) requests.
| nr2x wrote:
| Majestic is a shit list. Mystery solved.
| softwaredoug wrote:
| My current beliefs about how people use and trust information on
| the Web.
|
| First, trust is _everything_ on the Web; it is the first thing
| people think about when arriving at some information. But how
| people evaluate trust has changed dramatically over the last 10
| years.
|
| - Trust now comes almost exclusively from social proof: searching
| reddit, youtube, and other heavily _moderated_ sources of
| information, where the most work is done to ensure content comes
| from actual human beings. How many of us now google `<topic>
| reddit` instead of just `<topic>`?
|
| - Of course a lot of this trust is misplaced. There's a very thin
| line between influencers and cult leaders / snake oil salesmen.
| Our last President used this hack really effectively.
|
| - Few trust Google's definition of trust anymore -- essentially
| PageRank. This made more sense when the Web was essentially
| social, where inbound links were very organic. Now, with trust
| in general Web sites evaporated, the main 'inbound links' anyone
| cares about come from individuals or communities they trust or
| identify with. They don't trust Google's algorithm (it's too
| opaque, and too easily gamed).
|
| This of course means the fracturing of truth away from elites.
| Sometimes this could be a good thing, but in many cases _cough_
| Covid _cough_ it might be pretty disastrous for misinformation.
| wolverine876 wrote:
| > How many of us now google `<topic> reddit` instead of just
| `<topic>`?
|
| One of us lives in a bubble. I don't trust Reddit for anything,
| or YouTube or any social media. IME, it's mis/disinformation -
| not only a lack of information, but a negative; it leaves me
| believing something false. My experience is, and plenty of
| research shows, that we have no way to sort truth from fiction
| without prior expertise in the domain. The misinformation and
| disinformation on social media, and their persuasiveness, are
| very well known. The results are evident before us, in the madness
| and disasters, in dead people, in threats to freedom,
| prosperity, and stability.
|
| Why would people in this community, who are aware of these
| issues, trust social media? How is that working out?
|
| > This of course means the fracturing of truth away from
| elites. Sometimes this could be a good thing
|
| I think that's mis/disinformation. 'Elite' is a loaded,
| negative (in this context) word. It makes the question about
| power and the conclusion inevitable.
|
| Making it about power distracts from the core issue of
| knowledge, which is truth. I want to hear from the one person,
| or one of the few people, with real knowledge about a topic; I
| don't want to hear from others.
|
| _In matters of science the authority of thousands is not worth
| the humble reasoning of one single person._
| Brian_K_White wrote:
| They already acknowledge the problem of trusting the crowd,
| but you seem to not acknowledge the problem of trusting a
| central dispensary. In fact it's unwise to trust either one.
| Everything has to be evaluated case by case. The same source
| should be trusted for one thing today, and not for some other
| thing tomorrow.
| mountainriver wrote:
| > How many of us now google `<topic> reddit` instead of just
| `<topic>`
|
| I sure hope not; Reddit is a horrible place for information.
| failTide wrote:
| I use that strategy for a few things - including when I want
| to get reviews of a product or service. There's still
| potential for manipulation there, but you can judge the
| replies based on the user's history - and you know that
| businesses aren't able to delete or hide bad reviews there.
|
| But in general I agree with you - reddit is full of
| misinformation, propaganda, and astroturfing.
| romanhn wrote:
| When I have a specific technical question, I append
| "stackoverflow" to my search queries. When I want to read a
| discussion, I add "reddit" (or "hacker news").
| zX41ZdbW wrote:
| This looks surprisingly similar to the unfinished research that I
| did: https://github.com/ClickHouse/ClickHouse/issues/18842
| mouzogu wrote:
| whenever i go through my bookmarks, i tend to find maybe 5-10%
| are now 404s.
|
| this is why i like the archive.ph project so much and am using
| it more as a kind of bookmarking service.
| system2 wrote:
| archive.ph = Russian federation website. Blocked by most
| firewalls by default.
| syedkarim wrote:
| What's the benefit to using archive.ph instead of archive.org
| (Internet Archive)? Seems like the latter is much more likely
| to be around for a while.
| mouzogu wrote:
| i find archive.ph does a better job of preserving the page as
| is (it also takes a screenshot) compared to the internet
| archive, which can be flaky at best.
|
| i also find archive.ph much faster at searching, and the
| browser extension is really useful too.
|
| the faq does a great job of explaining too
| https://archive.ph/faq
| yellow_lead wrote:
| Isn't archive.ph/today the one with questionable funding
| sources and backing? Who is behind it and can it be trusted
| for longevity?
| mouzogu wrote:
| yeah, funding is a grey area...
|
| fwiw the website is only accessible by VPN in a lot of
| countries, which says a lot for me.. and i don't think
| they've taken down any content, although i can't say for
| sure.
| NavinF wrote:
| In this case the less we know, the longer it will last.
| Notice how this site ignores robots.txt and copyright
| claims by litigious companies that would like to see
| their past erased.
|
| The data saved on your NAS will outlast this site
| regardless of who owns/funds it.
| fragmede wrote:
| How do you figure?
| NavinF wrote:
| What do you mean? There's a line of companies waiting to
| sue anyone involved with that site. That's been the case
| for many years.
| mgdlbp wrote:
| archive.today does that by rewriting the page to mostly
| static HTML at the time of capture.
|
| archive.org indexes all URLs first-class and presents pages as
| close to what was originally served as possible. It also
| stores arbitrary binary files and captures JS and Flash
| interactivity with remarkable fidelity.
|
| When logged in, the archive.org Save Page Now interface
| gains the options of taking a screenshot and non-
| recursively saving all linked pages. I cannot reason why--
| the more saved, the better, right?
|
| archive.org has a browser extension too
| yajjackson wrote:
| Tangential, but I love the format for your site. Any plans to do
| a "How I built this blog" post?
| kerbersos wrote:
| Likely using Hugo with the congo theme
| Soupy wrote:
| Yup, nailed it. Hugo with the Congo theme (and a few minor
| layout tweaks). Hosted on Cloudflare Pages for free.
| the_biot wrote:
| By what possible criteria are these the "top" million sites, if
| 10% are dead? I'd start with questioning _that_ data.
| kjeetgill wrote:
| Dude, it's the second sentence of the first paragraph:
|
| > For my purposes, the Majestic Million dataset felt like the
| perfect fit as it is ranked by the number of links that point
| to that domain (as well as taking into account diversity of the
| origin domains as well).
| MatthiasPortzel wrote:
| And moreover, the author's conclusion is that the dataset is
| bad.
|
| > While I had expected some cleanliness issues, I wasn't
| expecting to see this level of quality problems from a
| dataset that I've seen referenced pretty extensively across
| the web
| winddude wrote:
| Part of the problem is that it's not the number of links, it's
| referring subnets. Fairly certain this includes script tags.
| the_biot wrote:
| Yeah, but they're still providing a dataset that's just plain
| bad. It's hardly relevant how many sites link to some other
| site, if it's dead.
| Brian_K_White wrote:
| It's only bad data if it does not include what it claims to
| include.
|
| If the dataset is defined as inlinks, and it is inlinks,
| then the data is good.
| deltree7 wrote:
| Exactly!
|
| Garbage In == Garbage Out
| winddude wrote:
| No they're not.
| gojomo wrote:
| Many issues with this analysis, some others have already
| mentioned, including:
|
| * The 'domains' collected by the source, as those "with the most
| referring subnets", aren't necessarily 'websites' that now, or
| ever, responded to HTTP
|
| * In many cases any responding web server will be on the `www.`
| subdomain, rather than the domain that was listed/probed - & not
| everyone sets up `www.` to respond/redirect. (Author
| misinterprets appearances of `www.domain` and `domain` in his
| source list as errant duplicates, when in fact that may be an
| indicator that those `www.domain` entries also have significant
| `subdomain.www.domain` extensions - depending on what Majestic
| means by 'subnets'.)
|
| * Many sites may block `curl` requests because they _only_ want
| attended human browser traffic, and such blocking (while usually
| accompanied by some error response) _can_ be a more aggressive
| dropped connection.
|
| * `curl` given a naked hostname likely attempts a plain HTTP
| connection, and given that even browsers now auto-prefix `https:`
| for a naked hostname, some active sites likely have _nothing_
| listening on the plain-HTTP port anymore. (See the sketch after
| this list.)
|
| * Author's burst of activity could've triggered other rate-
| limits/failures - either at shared hosts/inbound proxies
| servicing many of the target domains, or at local ISP egresses or
| DNS services. He'd need to drill down into individual failures to
| get a better idea of the extent to which this might be happening.
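|
| For the two `curl` caveats above: a more charitable probe would
| force https and send a browser-like User-Agent. A sketch, not
| what the author ran:
|
|   curl -s -o /dev/null -m 10 -A 'Mozilla/5.0' \
|       -w '%{http_code}\n' https://example.com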
|
| If you want to probe if _domains_ are still active (the DNS-level
| checks are sketched as commands after this list):
|
| * confirm they're still registered via a `whois`-like lookup
|
| * examine their DNS records for evidence of current services
|
| * ping them, or any DNS-evident subdomains
|
| * if there are any MX records, check if the related SMTP server
| will confirm any likely email addresses (like postmaster@) as
| deliverable. (But: don't send an actual email message.)
|
| * (more at risk of being perceived as aggressive) scan any extant
| domains (from DNS) for open ports running any popular (not just
| HTTP) services
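|
| The DNS-level checks above, as commands (hypothetical domain):
|
|   whois example.com | grep -iE 'registrar|expir'
|   dig +short NS example.com    # still delegated?
|   dig +short A example.com     # any address record?
|   dig +short MX example.com    # mail service configured?
|   ping -c 1 example.com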
|
| If you want to probe if _web sites_ are still active, start with
| an actual list of web site URLs that were known to have been
| active at some point.
| spc476 wrote:
| It dawned on me when I hit the Majestic query page [1] and saw
| the link to "Commission a bespoke Majestic Analytics report."
| They run a bot that scans the web, and (my opinion, no real
| evidence) they probably don't include sites that block the
| MJ12bot. This could explain why my site isn't in the list: I
| had some issues with their bot [2] and _they_ blocked
| themselves from crawling my site.
|
| So, is this a list of the actual top 1,000,000 sites? Or just
| the top 1,000,000 sites they crawl?
|
| [1] https://majestic.com/reports/majestic-million
|
| [2] http://boston.conman.org/2019/07/09-12
| useruserabc wrote:
| As near as I can tell, these are the top 1,000,000 domains
| referred to by other websites they crawled.
|
| The report is described as "The million domains we find with
| the most referring subnets"[1] and a referring subnet is a
| host with a webpage which points at the domain.
|
| So to the grandparent, presumably if something is "linking"
| to these domains, they probably were meant to be websites.
|
| [1] https://majestic.com/reports/majestic-million [2]
| https://majestic.com/help/glossary#RefSubnets,
| https://majestic.com/help/glossary#RefIPs and also
| https://majestic.com/help/glossary#Csubnet
| bioemerl wrote:
| I'm honestly amazed that out of the top million sites, which
| probably include a ton of tiny, tiny sites that are idle or
| abandoned, only ten percent are offline.
| MonkeyMalarky wrote:
| How many are placeholder pages thrown up by registrars like
| Network Solutions?
| denton-scratch wrote:
| If they're placeholder _pages_ , they're not dead. Those 10%
| are not responding at all; the requests aren't reaching any
| HTTP server.
| winddude wrote:
| At least from his computer/script. A number could have been
| blocked after simply detecting him as a bot.
| zamadatix wrote:
| Not all placeholder pages will forever stay placeholder
| pages though. Some may get sold, become a site, then stop
| being a site again. Some may not get sold, come up for
| renewal and be deemed unlikely to be worth trying to sell
| anymore (renewal is cheap for a registrar but the registry
| will still charge a small fee).
|
| Of course, the vast majority with enough interest to make
| this list will either be sold and be an active page or
| still be an active placeholder, but I wouldn't rule out a
| good count of pages towards the lower end of the top
| million being placeholders that were eventually deemed not
| worth trying for anymore.
| mike_hock wrote:
| Yeah, I'd expect a list of 1,000,000 "top" "sites" to contain
| much more than what can be called a "site," especially in 2022
| when the internet has been all but destroyed and all that's
| left is a corporate oligopoly.
| ehsankia wrote:
| How is "top" defined here? If they were dead, wouldn't they
| fairly quickly stop being "top"?
|
| EDIT: the article uses a list sorted by inlinks, and I guess
| other websites don't necessarily update broken links, but that
| may be less true in the modern age where we have tools and
| automated services to automatically warn us about dead links on
| our websites.
| nine_k wrote:
| I can expect large SEO spam clusters of "sites" with many
| links inside a cluster to make them look legit. For some time
| such bits of SEO spam were on top of certain google searches
| and enjoyed significant traffic, putting them firmly into
| "top 1M".
|
| Once a particular SEO trick is understood and "deoptimized"
| by Google, these "sites" no longer make money, and get
| abandoned.
| Swizec wrote:
| Blows my mind that my blog is 210863rd on that list. That makes
| the web feel somehow smaller than I thought it was.
| wincent wrote:
| Eyeing you jealously from my position at 237,014 on the
| list... We're almost neighbors, I guess.
| gravitate wrote:
| > Domain normalization is a bitch
|
| I'm a no-www advocate. All my sites can be accessed from the apex
| domain. But some people for whatever reason like to prepend www
| to my domains, so I wrote a rule in Apache's .htaccess to rewrite
| the www to the apex.
|
| Here's a tutorial for doing that: https://techstream.org/Web-
| Development/HTACCESS/WWW-to-Non-W...
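|
| The rule itself is only a few lines -- a sketch along the
| lines of that tutorial, assuming mod_rewrite is enabled:
|
|   RewriteEngine On
|   RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
|   RewriteRule ^ https://%1%{REQUEST_URI} [R=301,L]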
| noizejoy wrote:
| > I'm a no-www advocate.
|
| I used to feel the same way. -- Until the arrival of so many
| new TLDs.
|
| Since then I always use www, because mentioning www.alice.band
| in a sentence is much more of a hint to a general audience as
| to what I'm referring to than just alice.band.
| gravitate wrote:
| I hear you. But a redirect is a good solution in that case.
| noizejoy wrote:
| Yes it is.
|
| I just redirect the other way round, so those ever rarer
| individuals typing in domains are also served fine on my
| websites. And also to automatically grab the https.
|
| I just find it ever so slightly more "honest" to have the
| server name I mention also be the one that's actually being
| served. -- And that's also because I'm quite annoyed at URL
| shorteners and all kinds of redirect trickery having been
| weaponized over the years.
|
| So I optimize for honesty and facilitate convenience.
|
| But this is pretty subtle stuff and I'm not advocating
| anymore. -- I don't think it's that big of a deal either
| way and I'm just expressing my little personal vote and
| priorities on the big Internet. :-)
|
| So my post wasn't intended to change your mind, but more as
| a bit of an alternative view and what made me get there.
| macintux wrote:
| 25 years ago I added a rule to my employer's firewall to allow
| the bare domain to work on our web server.
|
| Inbound email immediately broke. I was still very new, and
| didn't want to prolong the downtime, so I reverted instead of
| troubleshooting.
|
| A few months after I left, I sent an email to a former co-
| worker, my replacement, and got the same bounce message. I rang
| him up and verified that he had just set up the same firewall
| rule.
|
| Been much too long to have any clue now what we did wrong.
| JackMcMack wrote:
| You probably created a CNAME from the apex to www? This
| problem still exists today.
|
| From https://en.wikipedia.org/wiki/CNAME_record: "If a CNAME
| record is present at a node, no other data should be present;
| this ensures that the data for a canonical name and its
| aliases cannot be different."
|
| So if you're looking up the MX record for domain, but happen
| to find a CNAME from domain to www.domain, the lookup will
| follow that and won't find any MX records on www.domain.
|
| The correct approach is to create a CNAME record from
| www.domain to domain, and have the A record (and MX and other
| records) on the apex.
|
| Most DNS providers have a proprietary workaround to create
| DNS redirects on the apex (such as AWS Route 53 alias records)
| and serve them as A records, but those rarely play nice with
| external resources.
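|
| You can check a zone for this with dig (hypothetical domain;
| the safe direction is CNAME on www, records on the apex):
|
|   dig +short CNAME example.com       # should print nothing
|   dig +short MX example.com          # should print the mail host
|   dig +short CNAME www.example.com   # the alias lives here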
| tux2bsd wrote:
| > You probably created a cname from the apex to
|
| You can't do that, period.
|
| A lot of "cloud" and other GUI interfaces deceive people
| into thinking it's possible; they just do A-record fuckery
| behind the scenes (clever in its own right, but it causes
| misunderstanding).
| MonkeyMalarky wrote:
| Last time I tried to crawl that many domains, I ran into problems
| with my ISP's DNS server. I ended up using a pool of public DNS
| servers to spread out all the requests. I'm surprised that wasn't
| an issue for the author?
| wumpus wrote:
| You have to run your own resolver. Crawling 101.
| MonkeyMalarky wrote:
| This is of course the correct answer. It just felt like
| shaving a big yak at the time.
| mh- wrote:
| A properly configured unbound running locally can be a
| decent compromise.
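|
| e.g. a minimal local unbound.conf -- a sketch; distro
| defaults vary:
|
|   server:
|       interface: 127.0.0.1
|       access-control: 127.0.0.0/8 allow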
| denton-scratch wrote:
| That _is_ running your own resolver. Unbound is a
| resolver.
| mh- wrote:
| well, yes, but I guess I think of unbound in a different
| category from setting up (e.g.) bind. but, my experience
| configuring bind is probably more than 20 years out of
| date.
|
| you're right to make that correction though, so thank
| you. :)
| fullstop wrote:
| BIND is odd in that it combines a recursive resolver with
| an authoritative name server, and this has actually led
| to a number of security vulnerabilities over the years.
| Other alternatives, such as djb's dnscache/tinydns and
| NLnet Labs' Unbound/NSD, separate the two to avoid this
| entirely.
| altdataseller wrote:
| All these top-million lists are very good at telling you the
| top 10K-50K sites on the web. After that, you're going into
| 'crapshoot' land, where the 500,000th most popular site is very
| likely to be a site that got some traffic a long time ago, but
| now isn't even up.
|
| So I would take this data with a grain of salt. You're better off
| just analyzing the top 100K sites on these lists.
| giantrobot wrote:
| > where the 500,000th most popular site is very likely to be a
| site that got some traffic a long time ago, but now isn't even
| up.
|
| That's literally the phenomenon the article is describing.
| altdataseller wrote:
| Ok, let me reword it differently: the 500,000th most popular
| site on these lists most likely isn't the 500,000th most
| visited, and it might not even be in the top 5 million. These
| data sources are simply bad at capturing popularity after 50K
| sites or so because they don't have enough data.
| iruoy wrote:
| I haven't tested this, but the "Cisco Umbrella 1 Million"
| is generated daily from DNS requests made to the Cisco
| Umbrella DNS service. That seems to be a very good and
| recent dataset.
|
| It does count more than just website visits, though. If all
| Windows computers query the IP of microsoft.com once a day,
| that'll move it up quite a bit. And things in their
| top 10 like googleapis.com and doubleclick.net are
| obviously not visited directly.
|
| So while it is quite a reliable and recent dataset, it is
| not a good test of popularity.
| spaceman_2020 wrote:
| Not surprising. We're far away from the glory days of the
| vibrant, chaotic web.
|
| In countries like India that onboarded most users through
| smartphones instead of computers, websites are not even
| necessary. There's a huge dearth of local-focused web content as
| well since there just isn't enough demand.
| [deleted]
| superb-owl wrote:
| One of the few things I like about blockchain is the promise of a
| less ephemeral web.
| bergenty wrote:
| Is that actually true? Don't most nodes hold only heavily
| compressed pointers, while just a percentage of nodes host
| the entire blockchain? I mean, if what you're saying is true,
| then each node needs to have a copy of the entire internet,
| which isn't reasonable.
| matkoniecz wrote:
| One of many things I dislike about cryptoscams is making
| promises which are lies.
| deltree7 wrote:
| spoken like someone who is clueless about Blockchain
| pahool wrote:
| zombo.com still kicking!
| system2 wrote:
| The PNG rotates with this:
|
| .rotate {animation: rotation .5s infinite linear;}
|
| I think it wasn't like this before. They must've updated it at
| one point.
| smugma wrote:
| I downloaded the file and looked at the second 000 in his file,
| which refers to wixsite.com.
|
| It appears that wixsite.com isn't valid but www.wixsite.com is,
| and redirects to wix.com.
|
| It's misleading to say that the sites are dead. As noted
| elsewhere, his source data is crap (other sites I checked, such
| as wixstatic.com, don't appear to be valid), but his methodology
| is also bad; at the least, describing the sites as dead is
| misleading.
| code123456789 wrote:
| wixsite.com is a domain for free sites built on Wix, so if your
| username on Wix is smugma, and your site name is mysite, then
| you'll have a URL like smugma.wixsite.com/mysite for your Home
| page.
|
| That's why this domain is in the top million.
| smugma wrote:
| Correct, that's why it's in the top. Your example further
| confirms why the author's methodology is broken.
| winddude wrote:
| 100% agree his methodology is broken. Another example like this
| is googleapis.com. If I remember correctly, there are quite a
| number of domains like this in the Majestic Million.
|
| Not to mention a number of his requests may have been blocked.
| zinekeller wrote:
| > other sites I checked such as wixstatic.com don't appear to
| be valid
|
| But docs.wixstatic.com _is_ valid.
| quickthrower2 wrote:
| He takes this into account by generously considering _any_
| returned response code as "not dead".
|
| > there's a longtail of sites that had a variety of non-200
| reponse codes but just to be conservative we'll assume that
| they are all valid
| mort96 wrote:
| That doesn't take this into account, no. `curl wixsite.com`
| returns a "Could not resolve host" error; it doesn't return a
| response code, so the author would consider it invalid, even
| though `curl www.wixsite.com` does return a response (a 301
| redirect to www.wix.com).
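|
| Easy to reproduce (behavior as described at the time of this
| thread):
|
|   curl -I http://wixsite.com       # error: could not resolve host
|   curl -I http://www.wixsite.com   # 301 redirect to www.wix.com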
| quickthrower2 wrote:
| Oh how does that work then? How does the browser get to the
| redirect when curl doesn't get any response at all? Is this
| a DNS thing?
| chrisweekly wrote:
| apex domain is different from www cname
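|
| i.e. it's purely DNS -- per the behavior described above:
|
|   dig +short wixsite.com       # no answer, nothing to connect to
|   dig +short www.wixsite.com   # resolves, so this host answers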
| zzzeek wrote:
| irony that the site is not responding?
| ghostly_s wrote:
| Wow, I would not have suspected `tee` is able to handle multiple
| processes writing to the same file. It doesn't seem to be
| mentioned on the man page, either.
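|
| Presumably the pattern was many curl jobs writing into a
| single pipe read by one tee process (a sketch, not the
| author's exact command); writes smaller than PIPE_BUF are
| atomic, so the lines don't interleave:
|
|   xargs -P 64 -I{} curl -s -o /dev/null \
|       -w '{} %{http_code}\n' {} < domains.txt | tee results.log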
| ocdtrekkie wrote:
| I've been working lately on migrating sites I ran in 2008 or so
| to my new preferred hosting strategy: I know zero people look at
| them, since many are functionally broken at present, but I don't
| like the idea of actually removing them from the web. So I'm
| patching them up, migrating them to a more maintainable setting,
| and keeping them going. Maybe someday some historian will get
| something out of it.
| tete wrote:
| The biggest problem I find is that keeping redirects in place
| when you move stuff seems to be considered pretty "outdated". So
| many links to news websites, etc. will cause a redirect to either
| / or a 404 (which is a very odd thing to redirect to, in my
| opinion).
|
| If you are unlucky, an article you wanted to find has also
| completely disappeared. This is scary, because it's basically
| history disappearing.
|
| I also wonder what will happen to text on websites that use
| AJAX, once the JavaScript breaks because a third party goes
| down. While the internet archive seems to be building tools for
| people to use to mitigate this, I found that they barely worked
| on websites that do something like this.
|
| Another worry is the ever-increasing size of these scripts
| making archiving more expensive.
| Kye wrote:
| You can often pop the URL into the Wayback Machine to bring up
| the last live copy. It's better at handling dynamic stuff the
| more recent it is. Older stuff, especially early AJAX pages,
| is just gone because the crawler couldn't handle it at the
| time. It's far from a perfect solution, especially in light of
| the big publishers finally getting their excuse to go after the
| Internet Archive legally. It's a good silo, but just as
| vulnerable as any other.
___________________________________________________________________
(page generated 2022-07-15 23:00 UTC)