[HN Gopher] Only Google is really allowed to crawl the web
___________________________________________________________________
Only Google is really allowed to crawl the web
Author : skinkestek
Score : 711 points
Date : 2021-03-26 14:34 UTC (8 hours ago)
(HTM) web link (knuckleheads.club)
(TXT) w3m dump (knuckleheads.club)
| graiz wrote:
| Not sure why http://commoncrawl.org/ wasn't mentioned.
| dclaw wrote:
| I can't really trust a website that spells its own name wrong on
| its homepage. "Knucklesheads' Club"
|
| Edit: https://imgur.com/a/inqYrjV
| slenk wrote:
| Everyone makes mistakes
| sgsvnk wrote:
| Money earns more money. Privilege begets more privilege.
|
| This is not just true in the case of Google but in other
| domains as well, like the financial markets.
|
| Would you blame capitalism?
| wunderflix wrote:
| Even that won't change much. There is no way Google can be out-
| googled by other search engines because of its market dominance:
| more traffic means more clicks, more clicks mean better search
| results, better search results will drive more traffic.
|
| I try bing and DDG for a week or so every 6 months. I always
| switch back to google eventually because the results are so much
| better.
|
| Google can only be disrupted if something new is invented,
| something different than search but delivering way better
| results. I have no clue what that might be. But I hope someone is
| working on it.
| internetslave wrote:
| Yup. My opinion has long been that the only thing that will
| take down google is a massive increase in NLP, such that the
| historical click data can be outperformed by a straight up
| really good NLP model
| wunderflix wrote:
| That's interesting. Is anyone working on this already? SV
| startup? And: don't you think Google is in the best position
| to build such a thing?
| Zelphyr wrote:
| I've had the exact opposite reaction to the comparison between
| Google and DuckDuckGo. I use the latter daily and only rarely
| revert to Google. Even then I usually don't find the results to
| be any better and often find them to be worse.
|
| In my estimation, Google's search results have significantly
| declined in recent years.
| rstupek wrote:
| Agreed. I've fully changed over to DDG on my phone and rarely
| add the !g to get a google search.
| wunderflix wrote:
| Ha, maybe I should give it a try again :) My 6 months period
| is almost over again.
| jrockway wrote:
| I think there are plenty of other people crawling the web.
| There's Common Crawl, there's the Wayback machine... it's not
| just Google. Then there is a very long tail of crawlers that show
| up in the logs for my small-potatoes personal website. Whatever
| they're doing, they seem to be existing in peace, at the very
| least.
|
| To some extent, I agree with this site that people are nicer to
| Google than to other crawlers. That's because the crawl consumes
| their resources but provides benefits -- you show up on Google,
| the only search engine people actually use. But at the same time,
| they are happy to drag Google in front of Congress for some
| general abuse, so... maybe there is actually a little bit of
| balance there.
| anonu wrote:
| > There Should Be A Public Cache Of The Web
|
| This might be closest to it: https://commoncrawl.org/
| lawwantsin17 wrote:
| I'm all for killing Google's monopoly but spiders can ignore
| robots.txt you know. This just seems like a failure of other
| companies to effectively ignore those.
| jeelecali wrote:
| I'm looking for $ 576
| villgax wrote:
| The irony is that they bitch about you scraping search or
| other platforms without paid plans & want to do the same to you
| ajcp wrote:
| They really missed an opportunity to get creative with their own
| `robots.txt` implementation.
| nova22033 wrote:
| _This isn't illegal and it isn't Google's fault_
|
| Right there in the article..
| WarOnPrivacy wrote:
| Again, with critical context.
|
| _This isn't illegal and it isn't Google's fault, but this
| monopoly on web crawling that has naturally emerged prevents
| any other company from being able to effectively compete with
| Google in the search engine market._
| tyingq wrote:
| The bigger problem, to me, is not around crawling. It's the
| asymmetrical power Google has after crawling.
|
| Google is obviously on a mission to keep people on Google owned
| properties. So, they take what they crawl and find a way to
| present that to the end user without anyone needing to visit the
| place that data came from.
|
| Airlines are a good example. If you search for flight status for
| a particular flight, Google presents that flight status in a box.
| As an end user, that's great. However, that sort of search used
| to (most times) lead to a visit to the airline web site.
|
| The airline web site could then present things Google can't do.
| Like "hey, we see you haven't checked in yet" or "TSA wait times
| are longer than usual" or "We have a more-legroom seat upgrade if
| you want it".
|
| Google took those eyeballs away. Okay, fine, that's their choice.
| But they don't give anything back, which removes incentives from
| the actual source to do things better.
|
| You see this recently with Wikipedia. Google's widgets have been
| reducing traffic to Wikipedia pretty dramatically. Enough so that
| Wikipedia is now pushing back with a product that the Googles of
| the world will have to pay for.
|
| In short, I don't think the crawler is the problem. And I don't
| think Google will realize what the problem is until they start
| accidentally killing off large swaths of the actual sources of
| this content by taking the audience away.
| bouncycastle wrote:
| In regards to airlines, Google and Amadeus have a partnership I
| believe. Amadeus is the main source of data for many of these
| airline websites. If Google gets the data from Amadeus directly
| and not these websites, they are just cutting out the
| middleman. I don't shed a tear for any of these middlemen
| (together with their Dark Pattern UX design).
| tyingq wrote:
| Amadeus isn't a source of flight status. It is a source for
| (some) planned schedules and fares. Global distribution
| systems are a complex topic that's hard to sum up on HN. For
| flight status, Google is pulling from OAG and Flight Aware,
| and also from airline websites. Though they don't show
| airline sites as a source.
| dan-robertson wrote:
| The way to look at this from Google's point of view is to
| realise that most websites are slow and bad[1], so if Google
| sent you there you would have a bad experience with a bad slow
| website trying to find the information you want. Google want to
| make it better for you.
|
| [1] it feels like Google have contributed a lot to websites
| being slow and bad with eg ads, amp, angular, and probably more
| things for the other 25 letters of the alphabet.
| [deleted]
| zentiggr wrote:
| > Google want to make it better for you.
|
| Hehe, sure, nothing nefarious or greedy here... move along,
| move along, nothing to see...
| supert56 wrote:
| Perhaps I am misunderstanding or oversimplifying things, but it
| always surprises me that there are legal cases brought against
| companies who scrape data when so many of Google's products are
| doing exactly this.
|
| It definitely feels like one set of rules for them and a
| different set for everyone else.
| lupire wrote:
| Google doesn't scrape anything that the site owner objects
| to.
| Spivak wrote:
| I mean it's not that weird that a company would authorize
| major search engines scraping them but no one else.
|
| I don't really see this as Google playing by different rules
| so much as economic incentives being aligned in Google's
| favor.
| 838812052807016 wrote:
| Standardized interoperability enables overall progress.
|
| Every airline doesn't need their own webpage. They could all
| provide a standard API.
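|
| For illustration, a record in such a shared flight-status API
| might look like the sketch below (field names are entirely
| hypothetical, not any existing standard):
|
|     from dataclasses import dataclass
|
|     @dataclass
|     class FlightStatus:
|         carrier: str               # e.g. "AA"
|         number: int                # e.g. 100
|         scheduled_departure: str   # ISO 8601 timestamp
|         estimated_departure: str
|         gate: str
|         status: str                # "on-time"/"delayed"/"cancelled"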
| lelanthran wrote:
| > And I don't think Google will realize what the problem is
| until they start accidentally killing off large swaths of the
| actual sources of this content by taking the audience away.
|
| What makes you think they care? Killing off the sources of
| content might even be their goal. If they kill off sources of
| content, they'd be more than happy to create an easier-to-
| datamine replacement.
|
| Hypothetically, if they killed off wikipedia, they are best
| placed to use the actual wikipedia content[1] in a replacement,
| which they can use for more intrusive data-mining.
|
| Google sells eyeballs to advertisers; being the source of all
| content makes them more money from advertisers while making it
| cheaper to acquire each eyeball.
|
| [1] AFAIK, wikipedia content is free to reuse.
| ilaksh wrote:
| The way that the web has been fundamentally broken by Google
| and other companies is one of the reasons I am excited about an
| alternative protocol called Gemini. It doesn't replace the web
| entirely, but for basic things like exchanging information,
| it's great. https://gemini.circumlunar.space/
| treis wrote:
| >However, that sort of search used to (most times) lead to a
| visit to the airline web site.
|
| I don't think that's correct. In the old days you'd either call
| a travel agent or use an aggregator like expedia.
|
| Google muscles out intermediaries like Expedia, Yelp, and so
| on. It's not likely much better or worse for the end user or
| supplier. Just swapping one middleman for another.
| darkwater wrote:
| It's actually pretty different, because another middleman can
| basically arise only by being a big success in the iOS App
| Store; coming up in Google searches would be impossible, and
| it's more or less the same in the Play Store. So, Google is
| not just yet another intermediary.
| tyingq wrote:
| I can't prove it was that way, but I spent a lot of time in
| the space. For a long time, the airline's site used to be the
| top organic result, and there was no widget. Similar for
| other travel related searches (not just airlines) over time.
| Google has been pushing down organic results in favor of ads
| and widgets for a long time...and slowly, one little thing at
| a time. Like no widgets -> small widget below first organic
| result -> move the widget up -> make it bigger -> etc.
| supernovae wrote:
| I don't think google muscling out intermediaries like Expedia
| is a good thing.
|
| Just for example, Expedia is probably 5% of Google's total
| revenue and Google doesn't like slim margin services by and
| large that can't be automated.
|
| Travel is fairly high-touch - people centric. It doesn't fit
| Google's "MO".
|
| But... it's shitty that google can play all sides of the
| markets while holding people ransom for massive sums of money
| to pay to play on PPC where google doesn't... i think that's
| where the problem shines.
|
| In essence, you're advocating that eBay goes away because
| google could do it... they could.. and eBay is technically
| just an intermediary, but do we want everything to be
| googlefied?
|
| Google bought up/destroyed other aggregators - remember the
| days of fatwallet, priceline, pricewatch, shopzilla and such
| when they used to focus on discounts/coupons/deals and now
| they're moving more towards rewards/shopping/experience - it
| used to be i could do PPC on pricewatch and reach millions of
| shoppers at a reasonable rate, but now that google destroyed
| them all, the PPC rate on "goods" is absurdly high and not
| having an affordable market means only the amazons and
| walmarts can really afford to play...
|
| it used to be you could niche out, but even then, that's
| getting harder
| treis wrote:
| >In essence, you're advocating that eBay goes away because
| google could do it... they could.. and eBay is technically
| just an intermediary, but do we want everything to be
| googlefied?
|
| I don't think I'm really advocating for it as much as I see
| it as a more or less neutral change.
|
| That said, I'm pretty ambivalent about Google. Their size
| is a concern, but they also tend to be pretty low on the
| dark pattern nonsense. eBay, to use an example you gave,
| screwed me out of some buyer protection because of poor UX
| and/or bug (I never saw the option to claim my money after
| the seller didn't respond). In this specific instance
| Google ends the process by sending you to the airline to
| complete the booking. That, imho, is likely better than
| dealing with Expedia.
| supernovae wrote:
| Companies opt in to sites like Expedia and list their
| properties/flights/vacations on their marketplace and
| they pay a commission for those being booked. Expedia
| doesn't just crawl them and demand a royalty for sending
| them traffic...
|
| Google has a huge pay 2 play problem with PPC... i've
| worked for Expedia so that's the only reason i know this
| :)
|
| It's the reason companies work with Expedia many times
| because they don't have the leverage expedia group
| does...
|
| i see it as an unnatural change btw... "borg" if you will.
| josefx wrote:
| Only if Google stays around long term. I wouldn't be
| surprised if each free product in its graveyard took down a
| dozen competing products before it was killed off.
| pc86 wrote:
| Then someone can start a competitor up again, right?
| Assuming there's actually a market for it.
| josefx wrote:
| Not every market is lucrative in the extreme and it can
| take a long time to recover from being "disrupted". I
| think it is also a common practice for larger shopping
| chains to dump prices when they open a new location in
| order to clear out the local competition, so the damage
| it causes is well understood to be long lasting.
| devoutsalsa wrote:
| I've noticed that sometimes Google had updated flight
| information before the displays at the airport.
| tyingq wrote:
| For the most part individual airports own that
| infrastructure. So it's hard to generalize. For most types of
| notable flight status/time changes, however, airlines usually
| know first.
|
| There are exceptions, like an airport-called ground stop.
| magicalist wrote:
| > _You see this recently with Wikipedia. Google's widgets have
| been reducing traffic to Wikipedia pretty dramatically._
|
| Wikipedia visitors, edits, and revenue are all increasing, and
| the rate that they're increasing is increasing, at least in the
| last few years. Is this a claim about the third derivative?
|
| > _Enough so that Wikipedia is now pushing back with a product
| that the Googles of the world will have to pay for._
|
| The Wikimedia Enterprise thing seems like it has nothing to do
| with missing visitors; rather, companies ingesting raw
| Wikipedia edits are an opportunity for diversifying revenue by
| offering paid structured APIs and service contracts. Kind of
| the traditional RedHat approach to revenue in open source:
| https://meta.m.wikimedia.org/wiki/Wikimedia_Enterprise
| tyingq wrote:
| See https://searchengineland.com/wikipedia-confirms-they-are-
| ste... from 2015. Google's widgets that present Wikipedia
| data do reduce visitors to Wikipedia.
|
| Or see page views on English Wikipedia from 2016-current:
| https://stats.wikimedia.org/#/en.wikipedia.org/reading/total...
| Looks pretty flat, right? Does that seem normal?
|
| As for Wikimedia Enterprise, you do have to read between the
| lines a bit. _" The focus is on organizations that want to
| repurpose Wikimedia content in other contexts, providing data
| services at a large scale"_.
| SamBam wrote:
| The first link doesn't seem quite conclusive (see the part
| at the bottom), and also doesn't give evidence that
| Google's widgets are to blame.
|
| The flattening of users could also be due to a general
| internet-wide reduction in long-form (or even medium-form)
| non-fiction reading. How are page views for The New York
| Times?
|
| Seems like it should be simple to A/B test, though.
| Obviously Google could do it themselves by randomly taking
| away the widget, but we could also see whether referrals
| from non-Google search engines (though they are themselves
| a tiny percentage) continue to increase while Google
| remains flat.
| tyingq wrote:
| Edit: Removed bad "simple english graph", thanks. Though
| the regular english wikipedia traffic is flat from
| 2016-present.
|
| As for NYT, is there a better proxy to compare to?
| There's no public pageview stats and they have a paywall.
| magicalist wrote:
| That first graph is Simple English, not English, and is
| in millions, not billions. They also explicitly call out
| the methodology change in 2015...
| JKCalhoun wrote:
| > In short, I don't think the crawler is the problem.
|
| Except that if you allow other companies to crawl/compete, you
| can take eyeballs away from Google (which may well then return
| eyeballs to Wikipedia so long as the Google competitors don't
| also present scraped data).
| [deleted]
| benatkin wrote:
| That's the result of the crawling, and of it preventing
| competition. Google would much prefer that people complain
| about the details while ignoring the root cause.
| tyingq wrote:
| I don't understand that. The crawling access is mostly the
| same as it ever was. Google's SERP pages are not. A mutually
| beneficial search engine that respects its sources would
| still crawl the same. Google used to be that.
|
| The core problem is incentives:
| http://infolab.stanford.edu/~backrub/google.html _" we
| believe the issue of advertising causes enough mixed
| incentives that it is crucial to have a competitive search
| engine that is transparent and in the academic realm."_
| [deleted]
| benatkin wrote:
| That's incorrect. Before the search oligopolies formed, new
| search engines could start up. There were Excite, HotBot,
| AltaVista, and more. Now they don't have access. Search
| these comments for census.gov.
| tyingq wrote:
| There are companies that do pretty well in this space,
| like ahrefs, for example. They do resort to trickery,
| like proxy clients that look like home computers or cell
| phones. But, if a small entity like ahrefs can do it,
| anyone can do it.
|
| In a nutshell, though, I don't see equal access for all
| crawlers changing anything. Maybe that's the first
| barrier they hit, but it isn't the biggest or hardest one
| by far. Bing has good crawler access, but shit market
| share.
| [deleted]
| dr-detroit wrote:
| So nobody is going to book air travel? I can hardly follow
| what you're even saying besides google=bad.
| veltas wrote:
| I swear something like 50% of those digests are totally
| incorrect as well. It's amazing they have kept the feature
| because it has never had a very high signal-to-noise ratio. I
| never trust what's presented in these digests without double-
| checking the source page.
| bombcar wrote:
| Have you heard the story of Thomas Running? It's a story
| Google will tell you.
|
| (Search who invented running)
| tyingq wrote:
| I remember when rich snippets (one type of those widgets)
| came out there were a lot of funny examples. One for a common
| query about cancer treatments that pulled data from a dodgy
| holistic site saying that "carrots cured most types of
| cancer" (or something like that).
|
| There was a similar one where Google emphatically claimed a
| US quarter was worth five cents in a pretty and large snippet
| graphic.
| Mauricebranagh wrote:
| I recall in the last UK election Google got the infographic
| of party leaders about 60-70% wrong.
|
| And quite often a "People also ask" refinement is just some
| random guy's comment from Reddit.
| BeFlatXIII wrote:
| The most memorable rich snippet humor I've seen is a horse
| breeder sharing a story of how her searches gave snippets
| with My Little Ponies as the preview image.
| gtm1260 wrote:
| I'm not sure I agree with this. I think airline websites are so
| garbage filled that they've driven people to use the simple
| alternative of the google flights checkout.
|
| It's a bit of a vicious cycle, but in general most websites are
| so chock filled with crap that not having to click into them
| for real is a relief!
| gxs wrote:
| It's not Google's prerogative to scrape a website and display
| its content, no matter how awful the website.
| michaelmrose wrote:
| If 1 airline let me view information in a friendly fashion
| and the other didn't I would do business with the first.
|
| Lest we forget the money in that scenario is from butts in
| seats not clicks on a website. The particular example is
| ill chosen as google is actually taking on a cost, taking
| nothing, and gifting the airline a better ui.
| BEEdwards wrote:
| If you make an awful website that can be scraped, it's a
| matter of when, not if, someone will take your data and give
| it to your consumers, whether you're trying to upsell them or
| not...
| cyberpunk wrote:
| BA had some tracking request inline on the "payment
| processing" page which, when blocked by my pihole, prevented me
| from ever getting to the confirmation page; I just had to
| refresh my email and wait for the best.
|
| I have no idea how these companies, which make quite a decent
| amount of money at least up until 2020, can have such utterly
| poor sites.
|
| I once counted some 20+ redirects on a single request during
| this process heh..
| bombcar wrote:
| I don't know what they're doing but most every single sign
| on tool I've seen redirects 10-20 times during the sign on
| process (and then dumps you to the homepage to navigate
| your way back).
| merlinscholz wrote:
| Probably to get first party cookies on a handful of
| domains
| tyingq wrote:
| I'm talking about flight status. Not Google Flights,
| shopping, or booking.
|
| There are events associated with flight status that Google
| doesn't know. Like change fee waivers, cash comp awards to
| take a later or earlier flight, seat upgrades, etc.
| creato wrote:
| Yeah, the Google flights issue is difficult. On one hand, the
| business practice is problematic. On the other hand, Google
| flights is _so_ much better than its competitors it's
| ridiculous.
|
| If there was a way to split Google flights into a separate
| company and somehow ensure it wouldn't devolve into absolute
| trash like its competitors, that would be a good thing.
| tyingq wrote:
| It was ITA, and prior to Google buying them, they did a pretty
| good business selling backend flight shopping services to
| aggregators and airlines.
|
| Shopping for flights is a surprisingly technically
| difficult thing to do well.
| ChrisArchitect wrote:
| They're making it easier to search for flights and arrange a
| trip. It's UX and makes me not hate the airlines/travel process
| as much. And I end up buying the flight from the airline
| anyways, and in many cases doing the arranging on the airline
| site in the end once it's determined, so Google is giving that
| back. They're not taking stuff from the airlines, I mean what
| ads and stuff are on the airline sites anyways specifically
| during the search process. Where they are taking away is from
| the Expedias and other aggregation sites that offer a
| garbage/hodgepodge experience that drives people crazy.
| tyingq wrote:
| You're talking about Google Flights, which is completely
| unrelated to flight status.
| throwaway_kufu wrote:
| They are not just taking away internet traffic, but in the
| flights example, they actually acquired an aggregator
| flight/travel company and so they are actually entering markets
| and competing with their own ad customers.
|
| Then it comes full circle to Google unfairly using their
| market position vis-a-vis data, search and advertising. It's a
| win-win for Google: they let the data dictate which markets to
| enter, and they can both jack up advertising fees on
| customers/competitors and unfairly build their own service into
| search above both ads and organic results.
| danielscrubs wrote:
| Be careful when using Google Flights; last time I checked they
| allow significantly smaller margins between connecting flights,
| so trips are shorter but much riskier.
| aetherane wrote:
| You can get screwed any time you book a connecting flight on
| two different airlines, even if the times aren't tight. For
| instance, if one is cancelled.
|
| If you use the same airline they will make sure you get to
| the destination.
| HenryBemis wrote:
| > even if the times aren't tight
|
| Depending on the definition of "tight" each of us has. I
| remember having 40mins in Munich, and that is a BIG
| airport. Especially if you disembark on one side of the
| terminal and your flight is on the far/opposite end.
| That's 25-30mins of brisk walking. With 5000 people in
| between, you could easily miss your flight. No discussion
| about stopping to get a coffee or a snack.. you'll miss
| your flight.
| matwood wrote:
| That's true, but it can save you a ton of money. You just
| have to be aware of the risks and plan accordingly.
|
| I have typically used this strategy when flying back to
| the US from the EU. Take an EZJet or similar low cost
| airline from a random small EU city to a larger EU city
| like Paris, London, Frankfurt, etc... and book the return
| trip to the US from the larger city. I've also been
| forced to do this from some EU cities since there was no
| connecting partner with a US airline.
| hodgesrm wrote:
| The difference is mind-boggling in some cases. On one
| trip in 2019 I had the following coach fare choices for
| SFO - Moscow return trip tickets booked 3 weeks prior to
| departure.
|
| * UA or Lufthansa round trip (single carrier) $3K
|
| * UA round trip SFO - Paris + Aeroflot round trip Paris -
| Moscow: $1K
|
| No amount of search could reduce the gap. I went with the
| second option. The gap is even bigger if you have a route
| with multiple segments.
| throwaway1777 wrote:
| Yeah this strategy is good, but you need to allow a long
| layover like 6 hours if you have to go through
| immigration and change airports for the connection which
| happens pretty often with ryanair and ezjet. It's a big
| pain, but it does save money.
| cbenneh wrote:
| If you're booking each leg with different carrier, I find
| it best to pay the little extra with kiwi.com and they
| give you a guarantee for the connection. I missed a
| connection twice and they always got me on the next
| flight to the destination for free.
| slymon99 wrote:
| Can you elaborate on this? Do you mean shorter layovers?
| bombcar wrote:
| It sounds like it - and third-party companies will often
| show you flights that involve different companies on the
| different legs - which can leave you in a pickle because
| technically each airline's job is to get you to the end
| of THEIR flight, not the entire journey.
| Scoundreller wrote:
| And sometimes with a change of airport!
| foepys wrote:
| I remember when in Germany some budget airlines used to
| say they'd fly to "Frankfurt" (FRA) but actually flew to
| "Frankfurt-Hahn" (HHN) - 115km away. After arrival in HHN
| they put you on a bus to FRA that took about 2 hours.
| SV_BubbleTime wrote:
| Oh don't worry, you have 15 on-paper minutes to go from A1
| to A70 in Detroit... in January... and the shuttle is down.
| marshmallow_12 wrote:
| Aren't there anti-trust laws to prevent this kind of thing?
| sangnoir wrote:
| The current anti-trust doctrine in the US has a goal of
| protecting _consumers_ - not competition. What Google is
| doing is arguably great for consumers but awful to their
| competitors/other organizations. Technically, companies
| can simply block Google using robots.txt - but in reality
| that will lose them more money than the current partial
| disintermediation by Google is costing them - and Google
| knows this.
|
| It's a tall order to convince the courts that Google's
| actions harm consumers, or are illegal: after all, being
| innovative in ways that may end up hurting the competition
| is a key feature of a capitalist society - _proving_ that a
| line has been crossed is really hard, by design.
| speeder wrote:
| consumers are in this case the advertisers.
|
| google has a monopoly on search ads and does enforce it,
| being a drain on the economy since in many fields you
| only succeed if you spend on search ads
| sangnoir wrote:
| > consumers are in this case the advertisers.
|
| If someone could convince the courts that this is
| correct, then I'm sure Google would lose. However, I bet
| dollars to donuts Google's counter-arguement would be
| that the people doing the searching and quickly finding
| information are also consumers, and they outnumber
| advertisers and may be harmed by any proposed remediation
| in favor of advertisers.
| basch wrote:
| Google's answer to this at yesterday's hearing:
|
| Search isn't a single category. If you break it down, they
| aren't a monopoly. For example, 1/2 of PRODUCT SEARCHES
| begin on Amazon. It's probably hard to argue Google as a
| monopoly if who they see as their main competitor has
| half the market share.
| supernovae wrote:
| Just tell people to stop using google. Go direct.
| zentiggr wrote:
| Upvoted - regardless how pointless some people might
| think this comment is, it really is the ONLY way that
| Google is going to drop out of its aggregate lead
| position.
|
| Enough people realizing Google is trapping and
| cannibalizing traffic to the other sites it feeds off of,
| and choosing to do other things EXCEPT touching Google
| properties, is THE ONLY way they'll be unseated.
|
| No clear legal path to stop a bully means it's an ethical
| / habit path.
|
| Not saying there's any easy way, just that this is it.
| midoBB wrote:
| Anti-trust enforcement in the US tends to not hit the big
| tech players as much as it does other sectors. Also, there is
| actually a debate in the judicial system about the extent of
| anti-trust laws themselves.
| Majromax wrote:
| Antitrust laws are hard to enforce in the United States.
|
| Monopolies themselves aren't illegal. To be convicted of an
| antitrust violation, a firm needs to both have a monopoly
| and needs to be using anticompetitive means to maintain
| that monopoly. The recent "textbook" example was of
| Microsoft, which in the 90s used its dominant position to
| charge computer manufacturers for a Windows license for
| each computer sold, regardless of whether it had Windows
| installed or was a "bare" PC.
|
| Depending on how you define the market, Google may not even
| have a monopoly. It's probably dominant enough in web
| search to count, but if you look at its advertising network
| it competes with Facebook and other ad networks. In the
| realm of travel planning (to pick an example from these
| comments), it's barely a blip.
|
| Furthermore, Google can potentially argue it's not being
| anticompetitive: all businesses use their existing data to
| optimize new products, so Google could claim that it _not_
| doing so would be an artificial straightjacket.
| twiddlebits wrote:
| It's got a monopoly on "search ads" by far.
| arrosenberg wrote:
| It's not that hard, we're just out of practice due to the
| absurd Borkist economic theories we've been operating
| under for 40+ years. The laws are all there if the head
| of the DOJ antitrust division has the gumption to go
| reverse some bad precedents.
|
| > In the realm of travel planning (to pick an example
| from these comments), it's barely a blip.
|
| They used their monopoly in web search to gain non-
| negligible market share in an entirely unrelated industry.
| That's textbook anti-competitive behavior.
|
| Google can argue whatever they want, but the argument
| that they're enabling other businesses is a bad one. It
| casts Google as a private regulator of the economy, which
| is exactly what antitrust laws are intended to deal with.
| pmiller2 wrote:
| Is web search even a "market" independent of ads?
| rijoja wrote:
| yes
| pmiller2 wrote:
| Where's the money?
| samuelizdat wrote:
| That depends, would Google let us know?
| rijoja wrote:
| not if they could avoid it
| adamcstephens wrote:
| Yes, but they lack enforcement.
| kingo55 wrote:
| Even before it gets to that point, they routinely display
| snippets from regular websites and show ads next to them.
|
| Keeping users from clicking through to organic results helps
| them generate more revenue.
| jeffbee wrote:
| You're wrong on a lot of facts here. Google Flights doesn't get
| its data just by crawling, they get it from Sabre, the FAA,
| Eurocontrol, etc. Airlines are, obviously, extremely pleased to
| disseminate this information. Google Flights "gives back" in
| the exact same way as any other travel outlet: they book
| passengers.
|
| As for Wikipedia, the WMF is quite happy that most of their
| traffic is now served by Google. WMF is in the business of
| distributing knowledge, not in the eyeballs business. Serving
| traffic is just a cost for them. The main problem has been that
| the average cost for Wikipedia to serve a page has gone up,
| because many readers read it via Google, and more people who
| visit Wikipedia are logged-in authors, which costs them more to
| serve. I'm sure there's an easy solution to this problem (for
| example, beneficiaries of Wikipedia can donate compute
| facilities and services, or something along those lines).
| tyingq wrote:
| They don't get individual flight status (what I was talking
| about) from Sabre or the FAA or Eurocontrol. I didn't get
| into fares and planned schedules and Google Flights, that's a
| different topic. I was talking about the big widget you get
| for queries on status for a particular flight, which is not
| Google Flights.
|
| They have relented in some ways, rolling out stuff in the
| widget like: _" The airline has issued a change fee waiver
| for this flight. See what options are available on American's
| website"_
|
| But obviously, that kind of stuff isn't shown on Google for
| quite some time after it exists on the source site. And the
| widget pushes the organics off the fold unless you have a
| huge monitor.
|
| As for Wikipedia, I was referring to this:
| https://news.ycombinator.com/item?id=26487993
|
| _" Airlines are, obviously, extremely pleased to disseminate
| this information"_
|
| In the same way that publishers love AMP, yes. They don't
| actually like it, but they are forced to make the best of it.
| jeffbee wrote:
| Oh, status. I was thinking of schedules. Still, what is the
| point for the consumer of being directed to an airline's
| terrible status page? And are they even capable of being
| crawled? Looking at American's site (it was the most
| ghastly airline that sprang to mind) I don't see how a
| crawler would be able to deal with it, and indeed the
| Google snippet for AA flight status, on the aa.com result
| which is far down in the results page, just says "aa.com
| uses cookies" which is about what you'd expect.
|
| In this case, I want to be sent literally anywhere but
| aa.com.
| tyingq wrote:
| _" what is the point for the consumer of being directed
| to an airline's terrible status page?"_
|
| One example...
|
| If you back up a bit, the widget didn't use to tell you
| there was a change fee waiver when the flight was full,
| while aa.com did.
|
| That's an actual, tangible benefit that a consumer might
| want, worth real money. You can also even often "bid" on
| a dollar amount to receive if you're willing to change
| flights. Google doesn't present that info today.
|
| There are more examples. My perspective isn't that Google
| should lead you to aa.com, but I do feel it's a bit
| dishonest that the widget is so large it pushes aa.com
| below the fold. It doesn't need to be that large.
| wbl wrote:
| Does the concierge of a hotel take anything away when he
| informs you that your flight has been delayed?
| onlyrealcuzzo wrote:
| Wikipedia isn't monetized. Doesn't it benefit them if Google is
| serving their content for free and people are finding the
| information they want without having to hit Wikipedia??
|
| And also, isn't Google the largest sponsor for Wikipedia
| already? In 2019 - Google donated $2M [1]. In 2010, Google also
| donated $2m [2].
|
| [1] https://techcrunch.com/2019/01/22/google-org-
| donates-2-milli...
|
| [2] https://en.wikipedia.org/wiki/Wikimedia_Foundation
| minikites wrote:
| Couldn't you make a similar argument about for-profit uses of
| free/libre software? The software serves a useful purpose,
| who cares where it came from?
| dmitriid wrote:
| Google was/is also the largest sponsor of Mozilla. This
| doesn't stop Google from sabotaging Mozilla.
|
| 2 mln is probably Google's hourly profit. For that they get
| one of the biggest knowledge bases in the world. It's
| basically free as far as Google is concerned.
|
| The instant Google becomes confident they can supplant
| Wikipedia, they will.
| billiam wrote:
| NOT a sponsor of Mozilla. Google buys web traffic (as
| default search engine) for ~$300M and turns it into several
| times that $ in ad revenue.
| jedberg wrote:
| > 2 mln is probably Google's hourly profit.
|
| You don't have to guess, their numbers are public. In 2020
| they made $40B in profit, so it takes them about 27 minutes
| to make $2M in profit.
| magicalist wrote:
| > _Google was/is also the largest sponsor of Mozilla. This
| doesn't stop Google from sabotaging Mozilla._
|
| Google isn't a sponsor of Mozilla, they're a customer. Do
| people think Google is "sponsoring" Apple with $1.5 billion
| a year too?
| dmitriid wrote:
| Google being Apple's customer doesn't mean Google isn't
| sponsoring Mozilla.
|
| These are two very different companies with a very
| different relationship with Google. And very different
| influences on Google.
|
| Google _wants_ to be on iOS. It brings customers to
| Google. A lot of them. iOS is possibly more profitable to
| Google than Android even with all the payments Apple
| extracts from them.
|
| Google needs Mozilla so that Google may pretend that
| there's competition in browser space and that they don't
| own standards committees. The latter already isn't really
| true, and Google increasingly doesn't care about the
| former.
| foobarian wrote:
| > they're a customer.
|
| The cynic in me thinks the product is anti-trust
| insurance.
| kelnos wrote:
| Not sure why you're being downvoted; I completely agree
| with what you're saying (modulo questionable usage of
| "sponsor"). If Wikipedia were to try to charge for this use
| of their data, Google would likely make it a priority to
| drop the Wikipedia blurbs, either without replacement, or
| with data sourced elsewhere.
| will4274 wrote:
| > Google would likely make it a priority to drop the
| Wikipedia blurbs, either without replacement, or with
| data sourced elsewhere.
|
| That's an odd way of phrasing things. If Wikipedia were
| to take away free access to their data, Google wouldn't
| be dropping Wikipedia, Wikipedia would be dropping
| Google. This line of thinking "you took this when I was
| giving it away for free, but now I want to charge for it,
| so you are expected to keep paying for it" is incorrect.
| zdragnar wrote:
| Given the scale that google already operates at, I don't
| doubt that they would just take a copy of the content and
| rebrand it as a google service, complete with user
| contributions.
|
| Then, after two or five years, let it fester then abandon
| it. Nobody gets promoted for keeping well oiled machines
| running.
| dmitriid wrote:
| Remember Knol?
| https://en.wikipedia.org/wiki/Knol?wprov=sfti1
|
| It was actually good for writing stuff when I tried it.
| Never brought in enough traffic. Killed.
| rincebrain wrote:
| Wikimedia recently announced Wikimedia Enterprise for
| "organizations that want to repurpose Wikimedia content in
| other contexts, providing data services at a large scale".
|
| So they're pretty clearly looking to monetize organizations
| which consume their data in a for-profit context.
| dathinab wrote:
| monetizing != for-profit
|
| You could e.g. just cover operational cost and/or improve
| the service quality from it.
| pmiller2 wrote:
| I think they may have meant "(organizations) (which
| consume their data in a for-profit context)."
| tomp wrote:
| Well then they can't nag users to donate to Jimmy Wales'
| trust fund.
| onetimemanytime wrote:
| >> _Google donated $2M [1]. In 2010, Google also donated $2m
| [2]._
|
| $2 Million a year? Now I know why Googlers complained about
| having one less olive in their lunch salad.
|
| How much does Google PROFIT from Wikipedia and how much does
| Wikipedia lose in fundraising when Google fails to send
| users to the info provider?
| lupire wrote:
| Wikipedia is drowning in money so this whole line of
| discussion is weird.
|
| And most of the value of wikipedia is created by its unpaid
| users, not Wikimedia foundation.
| kelnos wrote:
| > _Wikipedia isn't monetized._
|
| No, but they often ask for donations when you visit the site,
| which people won't see if they just see the in-line blurb
| from Wikipedia on the Google results page.
|
| > _In 2019 - Google donated $2M [1]. In 2010, Google also
| donated $2m [2]._
|
| $2M is a pittance compared to what I expect Google believes
| is the value of their Wikipedia blurbs. If Wikipedia could
| charge for use of this data (which another commenter claims
| they are working on doing), they could easily make orders of
| magnitude more money from Google.
|
| Of course, my expectation is that Google would rather drop
| the Wikipedia blurbs entirely, or source the data elsewhere,
| than pay significantly more.
| tylerhou wrote:
| Unlikely that Wikipedia will be able to charge for content,
| seeing as all of their content is CC-BY-SA licensed.
| https://en.wikipedia.org/wiki/Wikipedia:Licensing_update
|
| They may be able to charge for _bandwidth_ (if you want to
| use a Wikipedia image, you can use Wikipedia's enterprise
| CDN instead of their own), but their licensing allows me to
| rehost content as long as I follow the attribution &
| sublicensing terms.
|
| Google has no problem operating their own CDNs, so I find
| it unlikely that Wikipedia will be able to monetize Google
| search results in such a manner as you described.
|
| Disclaimer: I work for Google; opinions are my own.
| Siira wrote:
| Large swaths of the web are garbage. Wasting people's time and
| attention on visiting pointless sites for something presentable
| in a small box is obviously not economical.
|
| And if some of the sources somehow die? New sources will spring
| up. It doesn't matter.
| dheera wrote:
| > Only a select few crawlers are allowed access to the entire
| web, and Google is given extra special privileges on top of that.
|
| Hmm, so set up a VPN on the Google Cloud so you have a Google IP
| address, use a Google User-Agent, and go!
| jesboat wrote:
| https://developers.google.com/search/docs/advanced/crawling/...
|
| describes the procedure for checking "is this source really
| Googlebot". You couldn't fake it just by running on GCP.
| cookiengineer wrote:
| Can we take a moment to talk about this club's business model?
|
| There's not even any information to see what the "private forum
| access" that you have to pay for is about, what kind of people
| are in it...or even to know about what happens with the money.
|
| For me, this sounds like a scam.
|
| I mean, no information about any company. No imprint. No privacy
| policy. No non-profit organization. And just a copy/paste
| wordpress instance.
|
| I mean, srsly. I am building a peer-to-peer network that tries to
| liberate the power of google, specifically, and I would not even
| consider joining this club. And I am the best case scenario of
| the proposed market fit.
| adamdusty wrote:
| They want you to pay them to "research" google's web crawling
| monopoly. It's really just a donation, but they don't frame it
| like that. Probably more credible than using a crowdfunding
| website, because it sounds like they're pushing for actual
| legislation.
|
| > Meet with legislators and regulators to present our findings
| as well as the mock legislation and regulations. We can't
| expect that we can publish this website or a PDF and then sit
| back while governments just all start moving ahead on their
| own. Part of the process is meeting with legislators and
| regulators and taking the time helping them understand why
| regulating Google in this way is so important. Showing up and
| answering legislators' questions is how we got cited in the
| Congressional Antitrust report and we intend to keep doing
| what's worked so far.
| judge2020 wrote:
| Not being set up as a 527 nonprofit[0] is the biggest red flag
| - no donation or membership money has to be spent for political
| purposes. They also use memberful for their membership/payment
| system, which doesn't require owning a business, so you might
| be paying out to the owner directly instead of to a business
| with its own bank account. Maybe the owner is looking at HN and
| can clarify.
|
| To add, there are a lot of businesses that use the term
| 'Knucklehead', so finding their business on secretary of state
| business searches might be impossible.
|
| 0: https://www.irs.gov/charities-non-profits/political-
| organiza...
| drivingmenuts wrote:
| How about a system whereby we tell others whether or not we want
| to be crawled/not crawled by them? /s
| [deleted]
| tomc1985 wrote:
| I think the solution here is everybody masquerades as Googlebot
| so we can render the whole thing moot
| quantumofalpha wrote:
| Ignoring robots.txt is trivial; that's why some (many?) sites
| enforce it by verifying the source IP and recognizing Googlebot
| from its IP addresses - how will you get access to one of those?
| p-sharma wrote:
| What does "recognize Googlebot from its IP addresses" mean?
| If I'm a human and I access a site, I have some other IP than
| Googlebot, how should this site know if I'm a human or
| knuckleheadsbot?
| quantumofalpha wrote:
| if you're claiming to be User-Agent: Googlebot, but your IP
| doesn't seem like it belongs to Google, don't you think
| it's a clear sign that you're FAKING IT?
|
| The check itself could be implemented for example with ASN
| or reverse DNS lookup or hard-coding known Google IP
| ranges (though that's prone to become stale)
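|
| A minimal sketch of the reverse DNS variant in Python (the
| double lookup - reverse, then forward to confirm - is the
| check Google's docs describe; the helper name is mine):
|
|     import socket
|
|     def is_real_googlebot(ip):
|         try:
|             host, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
|         except socket.herror:
|             return False
|         if not host.endswith((".googlebot.com", ".google.com")):
|             return False
|         try:
|             # forward-confirm: name must resolve back to the IP
|             return ip in socket.gethostbyname_ex(host)[2]
|         except socket.gaierror:
|             return False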
| smarx007 wrote:
| https://developers.google.com/search/docs/advanced/crawling/...
| p-sharma wrote:
| Maybe a naive question but what prevents Knuckleheads' from
| ignoring the robots.txt and crawling the site anyway? And if it's so
| easy to do, how does Google have a monopoly on crawling then?
| foobar33333 wrote:
| On smaller sites, nothing usually. But on bigger sites you will
| be blocked. You will probably be blocked even if you do follow
| robots.txt
| judge2020 wrote:
| It's just rude to do so, and there are some technical issues
| with doing that as well (such as crawling the admin panel, which
| might trigger backend alarms/security alerts). Google also
| doesn't have a legal monopoly on crawling, only a natural
| monopoly thanks to a lot of websites independently choose to
| only allow Google and Bing because of the many issues with
| third-party crawlers (eg. crawling all pages at once, costing
| money/slowing down the site[0]).
|
| 0: https://news.ycombinator.com/item?id=26593722
| jinseokim wrote:
| This has been submitted to HN quite a few times.
|
| https://news.ycombinator.com/item?id=25426662 (Most comments; 11
| comments)
|
| https://news.ycombinator.com/item?id=25417067 (3 comments)
|
| https://news.ycombinator.com/item?id=25546867 (Most recent; 89
| days ago)
|
| https://news.ycombinator.com/item?id=25543859
|
| https://news.ycombinator.com/item?id=25424852
| Darkphibre wrote:
| Hooray! Looks like I'm one of today's lucky 10,000. :)
|
| https://xkcd.com/1053/
| [deleted]
| skinkestek wrote:
| Wasn't aware of that.
|
| Resubmitting interesting content that hasn't gotten traction
| earlier is, however, explicitly allowed in the guidelines
| IIRC.
| pessimizer wrote:
| And linking past threads on the same subject is helpful.
| monkeybutton wrote:
| Interesting that the most comments it got before was 11, and
| today it succeeds and makes it to the front page! This is a
| good illustration of how whether or not submissions get any
| traction can be fairly stochastic.
|
| On topic, stack overflow does exactly what the article is
| talking about: they lock down their sitemap and make special
| exceptions for the Google bot:
|
| https://meta.stackexchange.com/a/98087
|
| https://meta.stackexchange.com/questions/33965/how-does-stac...
|
| I can understand SO's reasoning but it only perpetuates the
| incumbents' stranglehold on the internet.
| jszymborski wrote:
| I think it's partly because they created a website which
| reported on the status of the Ever Given, which rose to #1 on
| the front page.
|
| I feel like I often see submissions which are, even
| tangentially, related to front page material rise very
| quickly.
|
| Regardless, congrats to Knuckleheads Club for fighting the
| good fight.
| skinkestek wrote:
| You are right, that was how I found it.
| judge2020 wrote:
| > They lock down their sitemap and make special exceptions
| for the Google bot:
|
| Their robots.txt, on the other hand, is more restrictive of
| Googlebot:
|
| https://stackoverflow.com/robots.txt
|
|     User-agent: Googlebot-Image
|     Disallow: /*/ivc/*
|     Disallow: /users/flair/
|     Disallow: /jobs/n/*
|     ..
| tmcw wrote:
| I've definitely scraped by this problem on several occasions.
| Recently I was writing a tool to check outgoing links from my
| site, to see which sites are offline (it's called notfoundbot).
| What I found was that many sites have "DDoS Protection" that
| makes such an effort impossible, other sites whitelist the cURL
| headers, others like it when you pretend to be a search engine.
|
| Basically writing some code that tests whether "a website is
| currently online or offline" is much, much harder than you think,
| because, yep, the only company that can do that is Google.
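|
| A toy version of that kind of check (the UA string and timeout
| are arbitrary; notfoundbot's real logic is more involved):
|
|     import urllib.error
|     import urllib.request
|
|     def check(url):
|         req = urllib.request.Request(
|             url, headers={"User-Agent": "Mozilla/5.0 (linkcheck)"})
|         try:
|             with urllib.request.urlopen(req, timeout=10) as resp:
|                 return f"{url}: HTTP {resp.status}"
|         except urllib.error.HTTPError as e:
|             # a 403 here may mean "DDoS protection", not "offline"
|             return f"{url}: HTTP {e.code}"
|         except urllib.error.URLError as e:
|             return f"{url}: unreachable ({e.reason})"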
| varispeed wrote:
| I disallow scanning on all my projects. After GDPR I also removed
| all analytics - I realised it is just a time sink - instead of
| focusing on content I would often focus on getting the bigger
| numbers. I am not a marketer, so it didn't have much value to me
| and it would just enlarge Google's dataset without any payment. I
| get that you cannot find my projects in the search engine. I am
| okay with that :-)
| topspin wrote:
| If the shared cache ever became significant enough to matter it
| would be devastated by marketers, scammers and other abusers.
| Google employs the groomers that make their index at least
| tolerable, if still clearly imperfect. Without that cadre of well
| compensated expertise to win the arms race against such abusers
| the scheme is not feasible.
|
| I suppose this could be crowdsourced if I didn't know about
| politics and how any attempt at delegating the responsibility for
| blessing sites and their indexes would become a controversy.
| Google takes lots of heat about its behavior already, but Google
| is a private entity and can indulge its private prerogatives for
| the most part. Without that independence this couldn't function.
| finnthehuman wrote:
| I don't really understand your comment. Marketers, scammers and
| other abusers already publish to the web with the intention to
| be included in a crawl. Postprocessing crawl data is already a
| thing.
|
| Assuming this hypothetical shared crawl cache were to exist, it
| does not preclude google (and all consumers of that cache)
| doing their own processing downstream of that cache. Does it?
|
| What's the new attack vector?
| topspin wrote:
| > I don't really understand your comment.
|
| If you don't then you fail to appreciate the amount of labor
| it takes to thwart bad actors from ruining indexes. Abusers
| do publish to the web, and we enjoy not wallowing in their
| crap because small army of experienced and expensive people
| at a select few Big Tech companies are actively shielding us
| from it.
|
| It's easy to anticipate the malcontent view; 'Google spends
| all its resources on ads and ranking and we don't need all
| that.' That is naive; if Google completely neglected grooming
| out the bad actors people wouldn't use Google and Google's
| business model wouldn't be viable.
|
| So the obvious question is: where is this mechanism without
| Google et al.? Will the published caches be 99% crap (and
| without an active defense against crap you can bet your life
| it will) and anything derived from it hopelessly polluted? If
| so then it isn't viable.
|
| Now the instinct will be to find a groomer. Guess what;
| that's probably doomed too. No selection will be impartial to
| all, so you get to fight that battle. Good luck.
| finnthehuman wrote:
| >Will the published caches be 99% crap
|
| Yes. It will be exactly as crap as whatever's published on
| the web.
|
| And the utility of google's search engine would be to
| perform their proprietary processing on top of the
| publicly-available crawl results. Analogous to how their
| search is already preforming proprietary processing on top
| of a crawl cache.
|
| >If you don't then you fail to appreciate the amount of
| labor it takes to thwart bad actors from ruining indexes.
|
| Did you miss the part where I said "Assuming this
| hypothetical shared crawl cache were to exist, it does not
| preclude google (and all consumers of that cache) doing
| their own processing downstream of that cache. Does it?"
| herewhere wrote:
| Around a decade ago, I was part of the team responsible for
| msnbot (a web crawler for Bing). There used to be a robots.txt
| extension (forgot the name now). Most of the websites were
| giving 10-20x higher limits to googlebot than to the rest of
| the crawlers.
|
| Google definitely has an unfair advantage there.
|
| Bing and DuckDuckGo still provide very reasonable results with
| 10-20x fewer resources, but not on par with Google.
| andrewclunn wrote:
| How about an opt-in search engine cache? One where a domain needs
| to agree allow their site to be crawled, but as a result also
| gives said crawler full access? And then that repository would be
| made publicly available to all search engines to use. Sort of an
| AP for searches, that would give a base line that wouldn't
| preclude search engines from going further, but which would
| certainly lower the cost and network traffic for the search
| engines and sites that take advantage of it?
| l72 wrote:
| I tried to set up YaCy [1] at home to index a few of my favorite
| smaller websites, so I could quickly search just them. That
| turned out to be a bad idea. Some ended up blocking my home IP
| address and others reported me to my ISP. None of these sites
| were that large, and I wasn't continuously crawling them...
|
| [1] https://yacy.net/
| slenk wrote:
| I have been running my own Searx instance in AWS for a while
| and have not gotten blocked yet anywhere
| jedimastert wrote:
| How often were you searching?
| l72 wrote:
| I was regularly searching, but I was rarely indexing any of
| the sites. I struggled to even get an initial index of many
| of the sites, due to being blocked or being reported.
| samizdis wrote:
| Coincidentally, this item [1] has just turned up in HN - Common
| Crawl
|
| [1] https://news.ycombinator.com/item?id=26594172
| mrweasel wrote:
| While I don't disagree with the idea that all crawlers should
| have equal access, we also need to address the quality of many
| crawlers.
|
| Google and Microsoft have never hammered any website I've run
| into the ground. Crawlers from other, smaller search
| engines have, to the point where it was easier to just block them
| entirely.
|
| Part of the problem is that sites want search engine to index
| their site, but not allow random people just scraping the entire
| site. So they do the best they can, and forget that Google isn't
| the web. I doubt it's shady deals with Google, it's just small
| teams doing the best they can and sometimes they forget to think
| ideas through, because it's good enough.
| rstupek wrote:
| We've had the Bing crawler make an obscene number of requests
| quite often but fortunately it doesn't bring us down.
| kmeisthax wrote:
| I think this is a problem which should be solved by automatic
| rate-limiting and throttling at the application/caching layer
| (or just individual web server for smaller sites). Requests
| with a non-browser UA get put into a separate bots-only queue
| that drains at a rate of ~1/sec or so. If the queue fills up
| you start sending 429s with random early failures for bots
| (UA/IP/subnet pairs) that are overrepresented in the traffic
| flow.
|
| I don't know if such software exists, but it should. It would
| be a hell of a lot healthier for the web than "everyone but
| Google f*ck off", and it creates an incentive for bots to
| throttle themselves (as they're more likely to get a faster
| response than trying to request as fast as possible).
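|
| A minimal sketch of that policy, using a token bucket in place
| of an explicit queue (all names and thresholds here are
| illustrative):
|
|     import random
|     import time
|
|     class BotThrottle:
|         def __init__(self, rate=1.0, burst=10):
|             self.rate, self.burst = rate, burst
|             self.tokens = float(burst)
|             self.last = time.monotonic()
|             # (ua, subnet) -> request count; a real version
|             # would use a sliding time window
|             self.hits = {}
|
|         def allow(self, ua, subnet):
|             now = time.monotonic()
|             self.tokens = min(self.burst, self.tokens +
|                               (now - self.last) * self.rate)
|             self.last = now
|             key = (ua, subnet)
|             self.hits[key] = self.hits.get(key, 0) + 1
|             if self.tokens >= 1:
|                 self.tokens -= 1   # budget available: serve it
|                 return True
|             # budget exhausted: shed load randomly, weighted
|             # toward whoever dominates the recent bot traffic
|             share = self.hits[key] / sum(self.hits.values())
|             return random.random() > share
|
| The caller would route requests with non-browser UAs through
| allow() and answer 429 whenever it returns False.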
| NathanKP wrote:
| I suspect that at least some of the bots use web server
| response times and response codes as part of the signal for
| ranking. If your website does not appear capable of handling
| load then it won't rank as highly, because it is not in their
| best interests to have search results that don't load.
| henriquez wrote:
| I'd like to see some data on their claim that website operators
| are giving googlebot special privileges. As far as I can tell it
| would be a huge pain in the ass to block crawler bots from my
| servers, not that I've tried. I have some weird pages that tend
| to get crawlers caught in infinite loops, and I try to give them
| hints with robots.txt but most of the bots don't even respect
| robots.txt.
|
| If I actually wanted to restrict bots, it would be much easier to
| restrict googlebot because they actually follow the rules.
|
| I don't disagree in principle that there should be an open index
| of the web, but for once I don't see Google as a bad actor here.
| throwaway_uat wrote:
| LinkedIn profile/Quora answer are accessible by Google bot
| without signin
| burkaman wrote:
| See figure I.4 on page 24 of this UK government report:
| https://assets.publishing.service.gov.uk/media/5efb1db6e90e0...
|
| Additional evidence here: https://knuckleheads.club/the-
| evidence-we-found-so-far/
| malf wrote:
| What do you think this is used for?
|
| https://developers.google.com/search/docs/advanced/crawling/...
| calimac wrote:
| The studies and data to support their claim is in the first
| paragraph of the article you "read" before posting the
| question.
| Lammy wrote:
| Spoofing your user-agent as googlebot is a common way to bypass
| paywalls, is (was?) a way to read Quora without creating an
| account, etc. Publishers obviously need to send their
| page/article to Google if they want it to be indexed but may
| not want to send the same page content to a normal user:
| https://www.256kilobytes.com/content/show/1934/spoofing-your...
|
| This was common even back in the mid-2000s:
|
| https://www.avivadirectory.com/bethebot/
|
| https://developers.google.com/search/blog/2006/09/how-to-ver...
| soheil wrote:
| It's hilarious to think there exists people who think googlebot
| does not get special treatment from website operators. Here is
| an experiment you can do in a jiffy, write a script that crawls
| any major website and see how many URL fetches it takes before
| your IP gets blocked.
|
| Googlebot has a range of IP addresses that it publicly
| announces so websites can whitelist them.
| quitethelogic wrote:
| > Googlebot has a range of IP addresses that it publicly
| announces so websites can whitelist them.
|
| Google says[1] they do not do this:
|
| "Google doesn't post a public list of IP addresses for
| website owners to allowlist."
|
| [1]https://developers.google.com/search/docs/advanced/crawlin
| g/...
| johncolanduoni wrote:
| From that same page they recommend using a reverse DNS
| lookup (and then a forward DNS lookup on the returned
| domain) to validate that it is google bot. So the effect is
| the same for anyone trying to impersonate googlebot (unless
| they can attack the DNS resolution of the site they're
| scraping I guess).
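|
| That double lookup is simple to implement; a minimal Python
| sketch of the check Google describes (domain suffixes per
| their docs, error handling kept crude):
|
|   import socket
|
|   def is_real_googlebot(ip):
|       """Reverse-resolve, check the domain, then forward-check."""
|       try:
|           host, _, _ = socket.gethostbyaddr(ip)
|           if not host.endswith((".googlebot.com", ".google.com")):
|               return False
|           # forward lookup must map back to the same IP
|           return ip in socket.gethostbyname_ex(host)[2]
|       except OSError:
|           return False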
| Mauricebranagh wrote:
| I have never had that problem running screaming frog on big
| brand sites apart from one or two times.
| dheera wrote:
| Do any of them intersect with Google Cloud IP addresses? If
| so set up a VPN server on Google Cloud.
| WesolyKubeczek wrote:
| I don't scrape a website often, but when I do, I'm using a
| user agent of a major browser.
| tedunangst wrote:
| I don't whitelist googlebot, but I don't block them either
| because their crawler is fairly slow and unobtrusive. Other
| crawlers seem determined to download the entire site in 60
| seconds, and then download it again, and again, until they
| get banned.
| [deleted]
| suicas wrote:
| A company I worked for ~7 years ago ran its own focused web
| crawler (fetching ~10-100m pages per month, targeting certain
| sections of the web).
|
| There were a surprising number of sites out there that
| explicitly blocked access to anyone but Google/Bing at the
| time.
|
| We'd also get a dozen complaints or so a month from sites we'd
| crawled. Mostly upset about us using up their bandwidth, and
| telling us that only Google was allowed to crawl them (though
| having no robots.txt configured to say so).
| luckylion wrote:
| I usually recommend setting only Google/Bing/Yandex/Baidu etc
| to Allow and everything else to Disallow.
|
| Yes, the bad bots don't give a fuck, but even the non-
| malicious bots (ahrefs, moz, some university's search engine
| etc) don't bring any value to me as a site owner, take up
| band width and resources and fill up logs. If you can remove
| them with three lines in your robots.txt, that's less noise.
| Especially universities do, in my opinion, often behave badly
| and are uncooperative when you point out their throttling
| does not work and they're hammering your server. Giving them
| a "Go Away, You Are Not Wanted Here" in a robots.txt works
| for most, and the rest just gets blocked.
| dillondoyle wrote:
| Isn't that the website owner's right, though? I'm not sure I
| understand the problem here.
|
| If Google is taking traffic and reducing revenue, a company
| can deny in robots.txt. Google will actually follow those
| rules - unlike most others that are supposedly in this 2nd
| class.
| suicas wrote:
| Yup, no problem here, was just making an observation about
| how common such blocking was (and about the fact that some
| people were upset at being crawled by someone other than
| Google, despite not blocking them).
|
| The company did respect robots.txt, though it was initially
| a bit of a struggle to convince certain project managers to
| do so.
| jameshart wrote:
| When you operate commercial sites at scale, bots are a real
| thing you spend real engineering hours thinking about and
| troubleshooting and coding to solve for.
|
| And yes, that means google gets special treatment.
|
| Think about the model for a site like stackoverflow. The
| longest of long tail questions on that site: what's the actual
| lifecycle of that question?
|
| - posted by a random user
| - scraped by google, bing, et al
| - visited by someone who clicked on a search result on google
| - eventually, answered
| - hopefully, reindexed by google, bing et al
| - maybe never visited again because the answer now shows up
|   on the google SERP
|
| In the lifetime of that question how many times is it accessed
| by a human, compared to the number of times it's requested and
| rerequested by an indexing bot?
|
| What would be the impact on your site of three more bots as
| persistent as google bot? Why should you bother with their
| requests?
|
| So yes, sites care about bot traffic and they care about google
| in particular.
| noxvilleza wrote:
| Google aren't the bad actor in the sense that they are actively
| doing something wrong, but they are definitely benefiting from
| the monopoly that they created and work on maintaining. If this
| continues then nobody will really ever be able to challenge
| them, which means possibly "better" products will fail to
| penetrate the market.
| ehsankia wrote:
| > but for once I don't see Google as a bad actor here.
|
| As inflammatory as the headline of the page looks, they
| literally admit it's not google's fault in the smaller text
| lower down:
|
| "This isn't illegal and it isn't Google's fault, but"
| zmarty wrote:
| A lot of news websites restrict any crawler other than Google.
| And this does not happen only via robots.txt.
| simias wrote:
| Indeed, years ago I had scripts to automatically fetch URLs
| from IRC and I quickly realized that if I didn't spoof the
| user agent of a proper web browser many websites would reject
| the query. Googlebot's UA worked just fine however.
| judge2020 wrote:
| > Googlebot's UA worked just fine however
|
| They obviously don't care enough then - Google says you
| should use rdns to verify that googlebot crawls are
| real[0]. Cloudflare does this automatically now as well for
| customers with WAF (pro plan).
|
| 0: https://developers.google.com/search/docs/advanced/crawl
| ing/...
| staunch wrote:
| Google makes $150+ billion from Google Search per year. Google
| Search could likely be operated for (much less than) $10
| billion per year.
|
| So, Google is in effect taxing us all $140 billion per year.
|
| It's not dissimilar from how Wall Street effectively taxes us all
| for an even larger amount.
|
| In both cases, we could use some kind of non-profit open system
| to facilitate web search and stock trading.
|
| It's a Great Lie that Google is doing a good thing by charging
| money to insert "relevant ads" above the search results. If
| those ads really were the most relevant, they would simply be
| the top organic results, obviously.
|
| Google mostly solved search 20 years ago. There's really nothing
| that impressive about Google Search in 2021. It should be
| relatively easy to replace it with something open, leveraging the
| massive improvements in hardware and software. It could operate
| like Wikipedia or Archive.org. The hard part is probably getting
| the right team and funding assembled.
| systemBuilder wrote:
| This is not really about Google.
|
| Websites block crawlers because they get abused / crashed by
| crawlers. In the early days (2000-2010) Google not only got
| banned by some websites, it even got DNS-banned for abusing some
| DNS domains. You see, Google has already built the
| "megacrawlers" described in this article; it can melt any
| website on the Internet, even Facebook - the largest - and they
| paid a high price for letting the early Google crawlers run
| free.
|
| Google today has a rate-limit for every single website and DNS
| sub-domain on the internet. For small websites the default is a
| handful of web pages every few seconds. Google has a very slow
| (days) algorithm to increase its crawl rate, and a very fast (1d)
| algorithm to cut the rate limit if it's getting any of the errors
| likely due to website overload.
|
| To summarize, Google has several layers of congestion control
| custom-designed into the crawl application. Most small web
| crawlers have zero.
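|
| A toy version of that client-side congestion control, assuming
| per-host state in a small crawler; the constants are guesses,
| not Google's actual values:
|
|   import time
|   from collections import defaultdict
|
|   class HostLimiter:
|       """Ramp crawl rate up slowly, back off fast on errors."""
|       def __init__(self):
|           self.delay = defaultdict(lambda: 5.0)  # sec/request
|           self.next_ok = defaultdict(float)
|
|       def wait(self, host):
|           now = time.time()
|           if now < self.next_ok[host]:
|               time.sleep(self.next_ok[host] - now)
|           self.next_ok[host] = time.time() + self.delay[host]
|
|       def report(self, host, status):
|           if status in (429, 500, 502, 503, 504):
|               self.delay[host] *= 2   # overload: back off fast
|           else:
|               # healthy response: speed up very slowly
|               self.delay[host] = max(0.5, self.delay[host] * 0.99)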
|
| None of these other crawlers have figured this out, so they abuse
| websites, causing all small-scale crawlers to get banned.
|
| - ex-Google Crawl SRE
| ricardo81 wrote:
| Thank you for those insights, it's a topic I'm interested in.
| Agree with what you're saying about naive bots hitting
| websites/hosts/subnets too hard, in the context of site owners
| being hit by multiple bots for multiple reasons and them
| questioning the return they'll get.
|
| I'd be interested in more info about the DNS lookups. Did you
| apply a blanket rate limit on the number of DNS requests you'd
| make to any particular server?
|
| From past experience I know the .uk Nominet servers would temp-
| ban if you were doing more than a few hundred lookups per
| second. At the next host level down, was there a blanket limit
| or was it dependent on the number of domains that nameserver
| was responsible for?
| dbsmith83 wrote:
| I just don't see this working out legally. How would it even
| work?
|
| From the "learn more"
|
| > Sometime soon we will be publishing what we think should happen
| and what we think will happen. These two futures diverge and we
| believe that, while the gap between them exists, it will entrench
| Google's control over the internet further. We believe that
| nothing short of socialization of these resources will work to
| remove Google's control over the internet. Our hope is that in
| publishing this work right now we will let the genie out of the
| bottle and start a process towards socialization that cannot be
| undone.
|
| Sorry, but I'm deeply skeptical of this. It sounds like the
| first step towards a non-free internet. At the end of the day,
| it is your box on the web, and whether you want
| someone/something to crawl it is your call to make.
| marshmallow_12 wrote:
| I have an idea: remove the art of web crawling from the domain of
| a single company and instead create an international group of
| interested parties to run it. I'm thinking broadly along the
| lines of the Bluetooth SIG. Maybe it will be a bit more
| complicated and require international political effort, but it
| would make the search engine market far more democratic.
| sxp wrote:
| https://knuckleheads.club/the-googlebot-monopoly/ has actual
| details.
|
| > Let's take a look at the robots.txt for census.gov from October
| of 2018 as a specific example to see how robots.txt files
| typically work. This document is a good example of a common
| pattern. The first two lines of the file specify that you cannot
| crawl census.gov unless given explicit permission. The rest of
| the file specifies that Google, Microsoft, Yahoo and two other
| non-search engines are not allowed to crawl certain pages on
| census.gov, but are otherwise allowed to crawl whatever else they
| can find on the website. This tells us that there are two
| different classes of crawlers in the eyes of the operators of
| census.gov: those given wide access, and those that are totally
| denied.
|
| > And, broadly speaking, when we examine the robots.txt files for
| many websites, we find two classes of crawlers. There is Google,
| Microsoft, and other major search engine providers who have a
| good level of access and then there is anyone besides the major
| crawlers or crawlers that have behaved badly in the past that are
| given much less access. Among the privileged, Google clearly
| stands out as the preferred crawler of choice. Google is
| typically given at least as much access as every other crawler,
| and sometimes significantly more access than any other crawler.
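|
| The pattern described reduces, roughly, to a robots.txt like
| this (an illustrative file, not census.gov's actual one):
|
|   User-agent: *
|   Disallow: /
|
|   User-agent: Googlebot
|   Disallow: /cgi-bin/
|
|   User-agent: bingbot
|   Disallow: /cgi-bin/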
| indymike wrote:
| Broadly speaking, robots.txt files are often ignored. I used to
| run a fairly large job ad scraping organization, and we would
| be hired by companies (700 of the fortune 1000 used us) to
| scrape the job ads from their career pages, and then post those
| jobs on job boards. 99 of 100 times, the robots file would
| disallow us to scrape. Since we were being paid by that
| company's HR team to scrape, we just ignored it because getting
| it fixed would take six months and 22 meetings.
| chmod775 wrote:
| > Broadly speaking, robots.txt files are often ignored.
|
| If you wanna go nuclear on people who do that, include an
| invisible link in your html and forbid access to that URL in
| your robots.txt, then block every IP who accesses that URL
| for X amount of time.
|
| Don't do this if you actually rely on search engine traffic
| though. Google may get pissed and send you lots of angry mail
| like "There's a problem with your site".
| jedberg wrote:
| > Don't do this if you actually rely on search engine
| traffic though. Google may get pissed and send you lots of
| angry mail like "There's a problem with your site".
|
| Ah, but of course you would exclude Google's published
| crawler IPs from this restriction, because that is exactly
| what they want you to do.
| TheAdamAndChe wrote:
| Are there any actual repercussions for just ignoring
| robots.txt?
| 5560675260 wrote:
| Your crawler's IP might get banned, eventually.
| asciident wrote:
| There is if you are doing it for work. For example, your
| company could get sued if you are found using that data and
| ignoring the ToS. If you are a public figure, you could get
| your name tarnished as doing something unethical or the media
| may call it "hacking". If you are rereleasing the data then
| you risk getting a takedown notice.
| the_dege wrote:
| Sometimes website admins will also try to report your ips to
| the service provider as a source of attacks (even if not
| true).
| DocTomoe wrote:
| Given how often I've had misbehaving crawlers slow own
| servers in the early 2000s, I do not see how a crawler that
| disobeys robots.txt is not an attempted attack.
| JackFr wrote:
| So from the website's point of view there is no difference
| between 'crawling' and 'scraping'. Census.gov I assume has a
| ton of very useful information which is in the public domain
| which a host of potential companies could monetize by regularly
| scraping census.gov. Census.gov's purpose to make this
| information available to people is served by google, yahoo and
| bing. On the other hand if I have a business which is based on
| that data, in fact I'm at cross purposes to them.
| njharman wrote:
| I'm generally anti business. But I have to disagree. "The
| Public" that the government serves includes businesses.
| Businesses (ignoring corporate personhood bullshit) are owned
| and operated by people.
|
| I do not want the government deciding "what purposes" e.g.
| non-commercial, serve the public good. The public gets to
| decide that. (charging a license for commercial use is maybe
| ok (assuming supporting that use costs government "too
| much").
|
| And I very do not want current situation with the government
| anointing a handful of corporations (the farthest thing from
| the public possible) access and denying everyone else
| including all of the actual public.
| hnbroseph wrote:
| > I do not want the government deciding "what purposes"
| e.g. non-commercial, serve the public good. The public gets
| to decide that.
|
| the public's "decision" on things like this is made
| manifest by government policy, no?
| danShumway wrote:
| In theory. In practice, is every single policy that our
| government upholds currently popular with the majority of
| people?
|
| It's possible to have government policies that the
| majority of people disagree with, that remain for
| complicated reasons related to apathy, lobbying, party
| ideology, or just because those issues get drowned out by
| more important debates.
|
| Government is an extension of the will of the people, but
| the farther out that extension gets, the more divorced
| from the will of the people it's possible to be. That's
| not to say that businesses are immune from that effect
| either -- there are markets where the majority of people
| participating in them aren't happy with what the market
| is offering. All of these systems are abstractions,
| they're ways of trying to get closer to public will, and
| they're all imperfect. But government is particularly
| abstracted, especially because the US is not a direct
| democracy.
|
| I'm personally of the opinion that this discussion is
| moot, because I think that people have a fundamental
| Right to Delegate[0], and I include web scraping public
| content under that right. But ignoring that, because not
| everyone agrees with me that delegation is right,
| allowing the government to unilaterally rule on who isn't
| allowed to access public information is still
| particularly susceptible to abuse above and beyond what
| the market is capable of.
|
| [0]: https://anewdigitalmanifesto.com/#right-to-delegate
| pessimizer wrote:
| A specific case where this favorite-picking by government
| enables corruption: https://en.wikipedia.org/wiki/Nationall
| y_recognized_statisti...
|
| And an example from the quickly-approaching future, when
| there will be Nationally Recognized Media Organizations who
| license "Fact-Checkers", through which posts to public-facing
| sites will have to be submitted for certification and
| correction.
| marcosdumay wrote:
| Favorite-picking by the government is corruption by
| itself already.
| indymike wrote:
| I used to run a fairly large job ad scraping operation. Our
| scraped data was used by many US state and federal job sites.
| "Scraping" is just using software to load a page and
| extracting content. "Crawling" is just load a page, find
| hyperlinks (hmm... a kind of content), and then crawling
| those links. Crawling is just a kind of scraping.
| vharuck wrote:
| In the case of Census.gov, they offer an API to get the
| data[0]. It's actually pretty nice. Stable, ton of data,
| fairly uniform data structure across the different products.
| Very high rate limits, considering most data only needs
| retrieved once a year. I think they understand the difference
| between crawling and scraping.
|
| [1] https://www.census.gov/data/developers.html
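|
| For example, pulling total population per state should be a
| single request; a sketch against the documented
| api.census.gov/data endpoint (the year, dataset and variable
| name here are illustrative, untested):
|
|   import requests
|
|   # ACS 5-year estimates, total population (B01003_001E)
|   resp = requests.get(
|       "https://api.census.gov/data/2019/acs/acs5",
|       params={"get": "NAME,B01003_001E", "for": "state:*"},
|   )
|   header, *rows = resp.json()   # first row is the header
|   for name, pop, fips in rows:
|       print(name, pop)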
| ricardo81 wrote:
| Having data in the right format as a download or via an API
| would be the best way to go for public data.
|
| If people have to 'scrape' that data from a public resource,
| I'd say they're presenting the data in the wrong way.
| mulmen wrote:
| But Google, Yahoo and Bing are also monetizing the data. Why
| are they allowed to provide "benefits" but "scrapers" are
| not? Why is it wrong to monetize public data?
| jonas21 wrote:
| The census data is available for bulk download, mostly as CSV
| (for example [1]). Scraping census.gov is worse for both the
| Census Bureau (which might have to do an expensive database
| query for each page) and for the scraper (who has to parse
| the page).
|
| Blocking scrapers in robots.txt is more of a way of saying,
| "hey, you're doing it wrong."
|
| It's also worth noting that the original article is out of
| date. The current robots.txt at census.gov is basically wide-
| open [2].
|
| [1] https://www.census.gov/programs-surveys/acs/data/data-
| via-ft...
|
| [2] https://www.census.gov/robots.txt
| foobar33333 wrote:
| Scrapers don't care about robots.txt. I have scraped
| multiple websites in a previous job and the robots.txt
| means nothing. Bigger sites might detect and block you but
| most don't.
| gnramires wrote:
| Perhaps there could be some kind of 'Crawler consortium'?
|
| Under this consortium, website owners would be allowed to
| either allow all crawlers (approved by the consortium) or none
| at all (that is, none that is in the consortium, i.e. you could
| allow a specific researcher or something to crawl your website
| on a case-by-case basis).
|
| This consortium would be composed of the search engines
| (Google, MS, other industry members), as well as government
| appointed individuals and relevant NGOs (electronic frontier
| foundation, etc?). There would be an approval process that
| simply requires your crawl to be ethical and respect bandwidth
| usage. Violations of ethics or bandwidth limits could imply
| temporary or permanent suspension. The consortium could have
| some bargaining or regulatory measures to prevent website
| owners from ignoring those competitive and fairness provisions.
| dragonwriter wrote:
| > Perhaps there could be some kind of 'Crawler consortium'?
|
| An industry-wide agreement not to compete for commercially
| valuable access to suppliers of data?
|
| Comprised of companies that are current (and in some cases
| perennial) focusses of antitrust attention?
|
| I think there might be a problem with that plan.
| gnramires wrote:
| Well, yes, and a common solution to antitrust cases, as far as
| I know, is some kind of industry self-regulation. In this case
| I wouldn't trust the industry alone to self-regulate; hence
| they should at least invite governments and civil society (NGOs
| and other organizations) to participate, while keeping a
| minority but not insignificant position for themselves.
|
| Could you better describe your objections?
| neolog wrote:
| I don't see the problem. If a bunch of non-google companies
| pooled resources to make a crawl, that would reduce market
| concentration, not increase it.
| adolph wrote:
| Is it legal for a government entity to issue a robots.txt like
| that? Maybe the line between use and abuse hasn't been
| delineated as well as it needs to be.
| bigwavedave wrote:
| > Is it legal for a government entity to issue a robots.txt
| like that?
|
| I may be wrong (this isn't my area), but I was under the
| impression that robots.txt was just an unofficial convention?
| I'm not saying people should ignore robots.txt, but are there
| legal ramifications if ignored? I'm not asking about
| techniques sites use to discourage crawlers/scrapers, I'm
| specifically wondering if robots.txt has any legal weight.
| vageli wrote:
| Is failure to honor a robots.txt a crime? Or rather, would it
| be unlawful to spoof a user agent to access this publicly
| available data? After the linkedin [0] case it seems
| reasonable to think not.
|
| [0]: https://www.eff.org/deeplinks/2019/09/victory-ruling-
| hiq-v-l...
| Spivak wrote:
| Spoofing user-agents hasn't worked in a long time for
| anything but small operations because search engines
| publish specific IP ranges their scrapers use.
| zepearl wrote:
| Maybe it would be nice if some sort of simple central index of
| "URLs + their last-updated timestamp/version/eTag/whatever"
| existed, updated by the site owners themselves
| ("push"-notification)?
|
| Meaning that whenever a page of a website was created or
| updated, that website itself would actively update the central
| index, basically saying "I just created/deleted page X" or "I
| just updated the contents of page X".
|
| The consequence would be that...
|
| 1) ...crawlers would no longer have to actively (re)scan the
| whole Internet to find out if anything has changed; they would
| only have to query that central index against their own list of
| URLs & timestamps to find out what needs to be (re)scanned.
|
| 2) ...websites would not have to just wait & hope that some bot
| decides to come by and look at their sites, nor would they have
| to answer, over and over again, requests that are just meant to
| check whether some content has changed.
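|
| As a sketch, the "push" could be one small authenticated POST
| per change; the endpoint and payload below are hypothetical,
| not a real service:
|
|   import requests
|
|   # hypothetical central index -- not a real endpoint
|   INDEX_URL = "https://central-index.example/v1/updates"
|
|   def notify(url, etag):
|       """Tell the index that `url` changed to version `etag`."""
|       requests.post(INDEX_URL, json={
|           "url": url,
|           "etag": etag,
|           "updated": "2021-03-26T14:34:00Z",
|       }, timeout=5)
|
|   notify("https://example.com/blog/post-x", '"abc123"')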
| soheil wrote:
| I'm not sure it is a good thing for there to be a public cache
| of everything that Google has. The issue is that websites will
| simply stop serving content to Google to protect it from being
| accessed by their competitors; this in turn will make search
| much worse and force us back to the pre-search dark ages of the
| internet. Sites may even serve a more crippled version of their
| content just to get hits, but there is no doubt search quality
| will suffer.
|
| We're left with a monopoly in Google; destroying it now could
| be foolish.
| sesuximo wrote:
| Seems like a private cache of the web would solve the problem?
| Why does it need to be public?
| lisper wrote:
| Seriously? Google _is_ a private cache of the web. That _is_
| the problem.
| sesuximo wrote:
| Google doesn't give anyone access to said cache. I mean one
| crawler with a shared API among competitors - so exactly the
| same as the public cache, but run by a private company and
| accessed for a small fee.
| ajcp wrote:
| > run by a private company and accessed for a small fee
|
| That is exactly the opposite of a public cache.
| sesuximo wrote:
| Not really. It serves the same function. Either you pay
| this hypothetical company or ??? pays to keep up the
| public one.
| ajcp wrote:
| Just because it serves the same function does not mean
| the implementation is the same. Private military
| contractors and a US infantry squad serve the same
| function, but the implementation completely changes their
| context.
|
| That being said what I think you're arguing for would be
| the implementation of a public utility or private-public
| business. If that's the case then yes, what you're saying
| is correct.
| visarga wrote:
| > Google doesn't give anyone access to said cache.
|
| It would also be useful for deep searches, exceeding the
| 1000 result limit, empowering all sorts of NLP
| applications.
| lisper wrote:
| I don't think you're quite clear on what the words "public"
| and "private" mean. "Public" is not a synonym for "run by
| the government" and "private" is not a synonym for "closed
| to everyone but the owner". Restaurants, for example, are
| generally open to the public, but they are not public. A
| restaurant owner is, with a few exceptions, free to refuse
| service to anyone at any time.
|
| If it's "exactly the same as a public cache" then it's
| public, even if it is managed by a private company. The
| difference is not in who _has_ access, the difference is in
| _who decides_ who has access.
| sesuximo wrote:
| Ok I am not clear then, but I'm less clear after your
| comment! In a public cache, who would you want to decide
| who has access? Is simply saying "anyone who pays has
| access" enough to qualify as public? if so, then I agree
| and this was my (possibly poorly phrased) intention in
| the original comment.
|
| But imo the restaurant model is also fine; in most cases
| people have access and it works.
| lisper wrote:
| > Is simply saying "anyone who pays has access" enough to
| qualify as public?
|
| No because someone has to set the price, which is just an
| indirect method of controlling who has access.
|
| > the restaurant model is also fine
|
| It works for restaurants because there is competition.
| The whole point here is that web crawling/caching is a
| natural monopoly.
|
| A better analogy here would be Apple's app store, or
| copyrighted standards with the force of law [1]. These
| nominally follow the "anyone who pays has access" model
| but they are not public, and the result is the same set
| of problems.
|
| [1] https://www.thebrandprotectionblog.com/public-laws-
| private-s...
| sct202 wrote:
| You can API google search results to make a meta-search
| engine if you want to but it's like $5 / 1k requests.
| twiddlebits wrote:
| Google's TOS prevents blending (alterations, etc.)
| though.
| [deleted]
| ThePhysicist wrote:
| On a related note, Cloudflare just introduced "Super Bot Fight
| Mode" (https://blog.cloudflare.com/super-bot-fight-mode/) which
| is basically a whitelisting approach that will block any
| automated website crawling that doesn't originate from "good
| bots" (they cite Google & Paypal as examples of such bots). So
| basically everyone else is out of luck and will be tarpitted
| (i.e. connections will get slower and slower until pages won't
| load at all), presented with CAPTCHAs or outright blocked. In my
| opinion this will turn the part of the web that Cloudflare
| controls into a walled garden not unlike Twitter or Facebook: In
| theory the content is "public", but if you want to interact with
| it you have to do it on Cloudflare's terms. Quite sad really to
| see this happen to the web.
| judge2020 wrote:
| On the other hand, I do not want my site to go down thanks to a
| few bad 'crawlers' that fork() a thousand http requests every
| second and take down my site, forcing me to do manual blocking
| or pay for a bigger server/scale-out my infrastructure. Why
| should I have to serve them?
| progval wrote:
| You can use the same rate-limiting for all crawlers, Google
| or not.
| dodobirdlord wrote:
| Googlebot is pretty careful and generally doesn't cause
| these problems.
| spijdar wrote:
| Right, then they shouldn't be affected by the rate-limiting,
| as long as it's reasonable. If it were applied evenly to all
| clients/crawlers, it would at least allow the possibility for
| a respectful, well-designed crawler to compete.
| jedberg wrote:
| The problem is, if you own a website, it takes the same
| amount of resources to handle the crawl from Google and
| FooCrawler even if both are behaving, but I'm going to
| get a lot more ROI out of letting Google crawl, so I'm
| incentivized to block FooCrawler but not Google. In fact,
| the ROI from Google is so high I'm incentivized to devote
| _extra_ resources just for them to crawl faster.
| TameAntelope wrote:
| How hard is it to ask Cloudflare to let you crawl?
| smarx007 wrote:
| It's not Cloudflare who is deciding it. It's the website
| owners who request things like "Super Bot Fight Mode". I
| never enable such things on my CF properties. Mostly it's
| people who manage websites with "valuable" content, e.g.
| shops with prices who desperately want to stop scraping by
| competitors.
| f430 wrote:
| I can say this will give a lot of businesses a false sense of
| security. It is already bypassable.
|
| The web scraping technology that I am aware of has reached its
| end game already: unless you are prepared to authenticate every
| user/visitor to your website with a dollar sign, or lobby
| Congress to pass a bill outlawing web scraping, you will not be
| able to stop web scraping in 2021 and beyond.
| kristopolous wrote:
| In the early 90s there were various nascent systems for
| essentially public database interfaces for searching
|
| The idea was that instead of a centralized search, people could
| have fat clients that individually query these apis and then
| aggregate the results on the client machine.
|
| Essentially every query would be a what/where or what/who pair.
| This would focus the results.
|
| I really think we need to reboot those core ideas.
|
| We have a manual version today. There's quite a few large
| databases that the crawlers don't get.
|
| The one place for everything approach has the same fundamental
| problems that were pointed out 30 years ago, they've just
| become obvious to everybody now.
| grishka wrote:
| So, one more reason to hate Cloudflare and every single website
| that uses it.
| jakear wrote:
| Or maybe don't "hate" folks who are just trying to put some
| content online and don't want to deal with botnets taking
| down their work? You know, like what the internet was
| intended for.
| grishka wrote:
| The internet was certainly _not_ intended for centralization. I
| hit Cloudflare captchas and error pages so often it's almost
| sickening. So many things are behind Cloudflare, things you'd
| least expect to be behind Cloudflare.
| petercooper wrote:
| I wonder what happens to RSS feeds in this situation. Programs
| I run that process RSS feeds will just fetch them over HTTP
| completely headlessly, so if there are any CAPTCHAs, I'm not
| going to see them.
| luckylion wrote:
| That will be interesting to see with regard to legal
| implications. If they (in the website operator's name) block
| access to e.g. privacy info pages for a normal user "by
| accident", that could be a compliance issue.
|
| I don't think mass blocking is the right approach in general.
| IPs, even residential ones, are relatively easy and cheap to
| get, so at some point you're blocking too many normal users.
| Captchas are a strong weapon, but they too have a significant
| cost in annoying users. Cloudflare could theoretically do
| invisible-invisible captchas by never even running any code on
| the client, but that would be wholesale tracking and would
| probably not fly in the EU.
| dleslie wrote:
| The idea of a public cache available to anyone who wishes to
| index it is ... kind of compelling.
|
| If it were the only indexer allowed, and it were publicly
| governed, then enforcing changes to regulation would be a lot
| more straightforward. Imagine if indexing public social media
| profiles was deemed unacceptable, and within days that content
| disappeared from all search engines.
|
| I don't think it'll ever happen, but it's interesting to think
| about.
| tlibert wrote:
| So outlaw web scraping entirely?
| simantel wrote:
| Common Crawl is attempting to offer this as a non-profit:
| https://commoncrawl.org
| jackson1442 wrote:
| o/t but what the hell are they doing to scroll on that page?
| I move my fingers a centimeter on my trackpad and the page is
| already scrolled all the way to the bottom.
|
| Hijacking scroll like this is one of the biggest turnoffs a
| website can have for me, up there with being plastered with
| ads and crap. It's ok imo in the context of doing some flashy
| branding stuff (think Google Pixel, Tesla splashes) but
| contentful pages shouldn't ever do this.
| aembleton wrote:
| Add *##+js(aeld, scroll) to your uBO filters. That will
| stop scroll JS for all websites.
| xtracto wrote:
| That would be a very cool use case for something like STORJ or
| IPFS.
| ricardo81 wrote:
| An alternative but similar idea: apply your own algorithms to a
| crawler/index. That's half the problem with these large
| platforms commanding the majority of eyeballs - you search the
| entire web for something and get results back via a black box.
| Alternatives in general are most definitely a good thing.
|
| Knuckleheads' Club are at the very least doing a great job of
| raising awareness of the potential barriers to entry for
| alternatives.
| ISL wrote:
| Imagine if Donald Trump decided that indexing Joe Biden's
| campaign site was unacceptable.
|
| A mandated singular public cache has potential slippery slopes.
| whimsicalism wrote:
| Imagine if Donald Trump decided to tax campaign donations to
| Joe Biden's campaign at 100%.
|
| I am unconvinced by the "slippery slope" argument being
| deployed by default to any governmental attempt to combat
| tech monopolies.
| ISL wrote:
| This is an argument against centralization more than it is
| against government.
|
| "One index to rule them all" seems more fraught with
| difficulty than, "large cloud providers are unhappy that
| crawlers on the open web are crawling the open web".
| whimsicalism wrote:
| If the impact stopped at "large cloud providers" being
| unhappy, I think that you're correct. But I think we've
| seen considerably downstream "difficulty" for the rest of
| society from search essentially being consolidated into
| one private actor.
| passivate wrote:
| >A mandated singular public cache has potential slippery
| slopes.
|
| That may be, but it seems like everything has a slippery
| slope - if the wrong person gets into power, or if the public
| looks the other way through complacency, ignorance,
| indifference, etc. It shouldn't stop us evaluating choices on
| their merits, and there is a lot of merit to entrusting 'core
| infrastructure' type entities to the government - or at least
| having the option.
| drivingmenuts wrote:
| > If it were the only indexer allowed, and it were publicly
| governed
|
| Which would put it under government regulation and be forever
| mired in politics over what was moral, immoral, ethical or
| unethical and all other kerfuffle. To an extent, it's already
| that way, but that would make it worse than it is currently.
| hackeraccount wrote:
| I'd have to look into it more, but maybe running a cache isn't
| dead simple. I can imagine that the benefits of manipulating
| what's in the cache, either by adding or removing, would be
| very high. Google and the others are private companies, so
| they're not required to do everything in the public view.
|
| A public cache wouldn't be able to - indeed shouldn't - play
| cat and mouse games with potential opponents. I suspect most of
| the games played require not explaining exactly what you're
| doing.
| sixdimensional wrote:
| Here's an idea... what if search became a peer-to-peer
| standardized protocol that is part of the stack to complement
| DNS? E.g. instead of using DNS as the primary entry point, you
| use a different protocol at that level to do "distributed
| search". DNS would still play a role too, but if "search" was a
| core protocol, the entry point for most people would be
| different.
|
| Similar to some of the concepts of "Linked Data", maybe -
| https://en.wikipedia.org/wiki/Linked_data.
|
| The problem is getting to a standard, it would essentially need
| to be federated search so a standard would have to be
| established (de facto most likely).
|
| Also, indexes and storage, distribution of processing load..
| peer-to-peer search is already a thing, but it doesn't seem to
| be a core function of the Internet.
|
| This is basically the same concept as making an "open" version
| of something that is "closed" in order to compete, I guess.
| rezonant wrote:
| > Let's take a look at the robots.txt for census.gov from October
| of 2018 as a specific example to see how robots.txt files
| typically work. This document is a good example of a common
| pattern. The first two lines of the file specify that you cannot
| crawl census.gov unless given explicit permission.
|
| This was eyebrow-raising. Actually, it seems this is not (any
| longer?) true:
|
| https://census.gov/robots.txt:
|
| User-agent: *
|
| User-agent: W3C-checklink
|
| Disallow: /cgi-bin/
|
| Disallow: /libs/
|
| ...
|
| That first line wildcards for any user agent but does nothing
| with it. It would have to be followed by "Disallow: /" on the
| next line to block all unnamed robots. It looks like someone
| found out about it and told the operators, rightfully so, that
| government webpages with public information (especially the
| census) shouldn't have such restrictions. They then removed
| only the second line and left the first. Leaving the first
| line has no impact on the meaning of the file.
| EGreg wrote:
| Or use MaidSAFE where you get paid to serve your website as
| opposed to the other way around.
| hannob wrote:
| I have seen sites behave differently if you use a Googlebot UA,
| but am I missing something or does this merely mean that anyone
| doing something like this
|
| curl -A 'Mozilla/5.0 (compatible; Googlebot/2.1;
| +http://www.google.com/bot.html)' https://example.com/
|
| will get Google-level crawler access?
| kirubakaran wrote:
| That would work on websites that have a naive check for just
| the user agent. Google also publishes the IP address ranges
| their crawlers run on. Lots of websites check against those,
| and there's no way around that.
|
| https://developers.google.com/search/docs/advanced/crawling/...
| AlphaWeaver wrote:
| This "club" charges a membership fee of $10 a month (or $100 a
| year) to comment.
|
| Does this go to some sort of nonprofit or holding entity that's
| governed by its members? Or do people have to trust the owner?
| mancerayder wrote:
| Any word on or opinions about Brave's initiative to challenge
| search?
| ChrisArchitect wrote:
| dupe/posted earlier etc
|
| I also got confused by this page, as there's another project of
| theirs around right now, about RIP Google Reader, that's on a
| separate domain...
|
| Funny that a site that's all about Google this and that doesn't
| have clear URLs/pages for their articles that can be linked to
| easily, geez.
|
| Original post/discussion from the source, 3 months ago:
| https://news.ycombinator.com/item?id=25417067
| slenk wrote:
| https://knuckleheads.club/introduction/
|
| That seems like an easy link?
___________________________________________________________________
(page generated 2021-03-26 23:00 UTC)