[HN Gopher] At what point is the internet too big for Google to ...
___________________________________________________________________
At what point is the internet too big for Google to index it?
Recently there was a lot of talk about declining Google search
results, missing forum links, etc. That got me thinking. At what
point in time, or at what volume, is the internet too big for Google
to index it?
Author : blablablub
Score : 6 points
Date : 2022-07-02 20:05 UTC (2 hours ago)
| sytelus wrote:
| Infrastructure-wise, scaling will continue to be possible with CS
| innovations. Algorithm-wise, I am not sure we can handle all the
| additional adversarial content and noise. A lot of index pruning
| happens just to reduce adversarial content and noise. Ultimately,
| however, it all comes down to cost in the long run: the cost of
| crawling and serving an extra Y% needs to be equal to or lower
| than the potential drop in revenue in the long run. At the current
| stage, it is likely that the vast majority of the crawlable
| internet is not actually in the index. By some measure, just 50B
| pages were sufficient to keep most users fairly happy. Going to
| 150B pages brings a marginal gain that small players cannot
| afford. The reachable size of the internet is well over 1T pages.
| betaby wrote:
| It feels like the Internet is trivially small, at least the
| text-based part.
| bediger4000 wrote:
| I thought google was deliberately "forgetting" older material, to
| make room for new. The "long tail" turned out to be hogwash, and
| advertising corrupted everything.
| Havoc wrote:
| I don't think google's recent troubles are a result of index
| size.
|
| Pretty sure G could throw more money and resources at it if they
| thought that would make a dent.
|
| It feels more like a lot of real content is collateral damage to
| the SEO vs google wars. e.g. A blogger was complaining the other
| day that someone had set up an automation to automatically scrape
| their content the second it gets published, run it through
| translate twice and publish the resulting semi-gibberish.
|
| Those sorts of shenanigans are, I suspect, quite hard to deal
| with even if you're Google.
| qeternity wrote:
| It's only going to get worse with the proliferation of NLP nets
| a la GPT.
|
| Will be interesting to see a decade from now how researchers
| collect a corpus that isn't chock full of a model's own output
| jstx1 wrote:
| I really don't think the reduced quality in results is because
| the Internet is too big all of a sudden.
| dekhn wrote:
| Google blackholes many sites and they don't get indexed.
| marginalia_nu wrote:
| Google already doesn't index the entire Internet. The internet
| even having a size becomes more questionable the more you think
| about it.
|
| Let's say we set up a wildcard domain *.example.com all pointing
| to a server set up so that
|
|   0.example.com/ has a link to 0.example.com/0 and 1.example.com/
|   0.example.com/0 has a link to 0.example.com/1 and 0.example.com/0/0
|   0.example.com/0/0 has a link to 0.example.com/0/1 and 0.example.com/0/0/0
|   1.example.com/ has a link to 1.example.com/1 and 2.example.com/
|   1.example.com/0 has a link to 1.example.com/1 and 1.example.com/0/0
|
| and so forth.
|
| This way even a raspberry pi is able to trivially host an
| infinite number of infinite websites.
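|
| A rough sketch of the link rule described above, in Python. This
| is not from the comment itself: the function names outgoing_links
| and crawl are made up for illustration, and the rule is an
| assumed generalization of the listed examples (a root page links
| to its /0 page and to the next subdomain's root; any other page
| links to its next sibling and to its own /0 child). The listing's
| entry for 1.example.com/ names 1.example.com/1 instead; the
| sketch follows the /0 pattern of the first entry. No real server
| or DNS setup is involved, only URL strings.

```python
from collections import deque


def outgoing_links(host: str, path: str, base: str = "example.com") -> list[str]:
    """Return the two links a page would carry under the assumed rule.

    host is e.g. "0.example.com"; path is e.g. "/", "/0" or "/0/0".
    """
    # Numeric subdomain label, e.g. 0 from "0.example.com".
    n = int(host[: -len(base) - 1])
    segments = [s for s in path.split("/") if s]

    if not segments:
        # Root page: its first child, plus the next subdomain's root.
        return [f"{host}/0", f"{n + 1}.{base}/"]

    # Non-root page: next sibling (last segment + 1) and first child.
    *parent, last = segments
    sibling = "/".join([host, *parent, str(int(last) + 1)])
    child = "/".join([host, *segments, "0"])
    return [sibling, child]


def crawl(start: str, max_pages: int) -> tuple[int, int]:
    """Breadth-first walk over the generated links, with no network access.

    Stops once at least max_pages URLs have been discovered and returns
    (pages discovered, URLs still waiting in the frontier).
    """
    seen = {start}
    frontier = deque([start])
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        host, _, path = url.partition("/")
        for link in outgoing_links(host, "/" + path):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return len(seen), len(frontier)
```

| Every page yields two links that have not been seen before, so
| the frontier in crawl() grows with each page visited and only the
| max_pages cap ever stops the walk. The pages are computed from
| the URL alone, which is why the server needs no storage to
| "host" them.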
| sacrosanct wrote:
| Google can cope with a Zettabyte Era
| (https://en.m.wikipedia.org/wiki/Zettabyte_Era). It's separating
| wheat from chaff which is the hard problem. Also, most data is
| being siloed behind walled gardens and can't be indexed.
| wilde wrote:
| The problem isn't that the internet is too big. It's that Google
| and the internet grew apart from each other.
|
| Some old sites never upgraded to https or other technical demands
| Google made of them. Google chose to stop indexing these sites to
| force them to change their behavior.
|
| Most new content is trapped in walled gardens of some format. The
| one I see all the time is Discord, but the communities you care
| about are probably talking in a non-indexable group chat rather
| than an indexable Internet forum like they might have 20 years
| ago.
| cpach wrote:
| I believe we have already reached that point.
___________________________________________________________________
(page generated 2022-07-02 23:02 UTC)