[HN Gopher] With the rise of AI, web crawlers are suddenly contr...
___________________________________________________________________
With the rise of AI, web crawlers are suddenly controversial
Author : leephillips
Score : 73 points
Date : 2024-02-18 17:14 UTC (5 hours ago)
(HTM) web link (www.theverge.com)
(TXT) w3m dump (www.theverge.com)
| throwup238 wrote:
| _> For decades, robots.txt governed the behavior of web crawlers.
| But as unscrupulous AI companies seek out more and more data, the
| basic social contract of the web is falling apart._
|
| The basic social contract of the web fell apart long ago when
| almost everyone decided that Google was the only search engine
| worth serving and started aggressively blocking other crawlers.
| zarzavat wrote:
| Add to that the removal of public APIs and things like RSS
| feeds. I would much rather use an API than scrape you, I will
| even pay a small fee, but if you don't provide anything, then
| you're getting scraped.
| Plasmoid wrote:
| I'm out of the loop on this. What has been happening? Do you
| get IP blocked? Rate limited?
| realusername wrote:
| You do and since the rise of Cloudflare, you need their
| approval to create your search engine
| oceanplexian wrote:
| Unfortunately for them, AI is a double edged sword. AI can
| trivially scrape content from a screencap. And there are
| thousands of residential proxies that can't be defeated
| even by companies like CF.
| spencerflem wrote:
| If you have residential proxies, you don't need AI to
| scrape via video, you have the HTML already.
| oceanplexian wrote:
| Companies spend thousands of engineering hours building
| elaborate mechanisms to obfuscate JS, protect APIs, apps,
| trigger captchas, and hide content, all of it completely
| powerless against the current gen of AI tools. If a human
| can read it, a bot can.
| Mr_P wrote:
| Those same companies often invest in accessibility for
| vision-impaired users. I'm not sure you need a screen
| capture to scrape content when the site is designed to be
| navigable with a screen reader.
| aaronrobinson wrote:
| Drama. Crawlers have always been controversial.
| elpocko wrote:
| robots.txt is relevant and effective, as is my DNT header.
| naiv wrote:
| Proxy companies are a big winner now
| linkjuice4all wrote:
| "With the rise of AI, photos of the exterior of your business are
| suddenly controversial"
|
| Many revenue-based websites tried to have it both ways with web
| crawlers wherein they wanted to block automated access or repeat
| viewers while letting first time viewers get a free taste. Others
| have noted that basically Google gets a free pass for all the
| traffic it brings in but everyone else has to respect robots
| declarations.
|
| It seems like a no brainer - if your web server is configured to
| reply to GET requests with a 200 status and some content then
| they get to do pretty much whatever they want with it.
|
| Don't want to give access to everyone? Stop sending your content
| for free and get them to agree to some contract and
| authorize/license their access to your stuff.
| spencerflem wrote:
| I don't think this is that cut and dry. It's perfectly
| reasonable to be fine with your content being linked to in an
| index like google search does, but not fine with the content
| being read by humans for free or used to train AI.
|
| I'm sure the sites would rather that google paid them in
| licensing too, but without changes to our laws that's not going
| to happen
| matthewrobertso wrote:
| I want to research a topic, so I do a google search about it.
| If 100% of the search results are readable to google, so that
| they are indexed, but unreadable to me, due to paywalls, the
| google search is useless.
|
| I'm not sure what the solution is, but paywalled articles in
| search results are bad. If they want to be indexed they
| should have to offer that same indexed content to anyone
| browsing the index.
| carlosjobim wrote:
| If you're doing serious research you pay for the paywall.
| It's not unreadable to you, just like a coke isn't
| undrinkable to you because you have to pay for it.
| hobs wrote:
| It's a bait and switch - you just offered me free cokes
| and then let me know that it's after I sign up for a
| subscription service, no thanks!
| spencerflem wrote:
| It's your assumption that everything behind a google link
| ought to be 100% free (ad supported). Other people
| disagree, and Google does not advertise anywhere that
| their list is free content only.
| lmm wrote:
| > Google does not advertise anywhere that their list is
| free content only.
|
| Google does advertise that they index based on the same
| content that's available to anyone viewing the page, and
| has policies against presenting a different version of
| the page to their crawler versus what you're showing to
| visitors.
| carlosjobim wrote:
| No, it's not bait and switch. A book store has an index
| of books they sell, that doesn't mean they're free. I
| expect a high quality search engine to deliver paid
| results if they are the best results.
|
| Should Google Maps remove businesses that charge for
| their products and services from their search results as
| well?
| hobs wrote:
| I wouldn't expect that at all, search engines search the
| content they have available to proffer it to you, that's
| the job.
|
| If by clicking on the thing it does not have the content
| I searched for (how am I even certain I get it when I pay
| you?) I would call that result bad.
|
| If you want to charge for stuff that's great, I recommend
| it, and if you want to give out a free sample or an index
| that's great, but it should be the same to all comers.
| matthewrobertso wrote:
| No I don't, I disable JavaScript and read what they
| served google in the first place.
|
| If I went to a public water fountain and found that
| someone had turned it into a coca cola dispensing
| machine, I wouldn't be happy and wouldn't pay to use it.
|
| "Journalists" creating pay walls, using SEO tactics to
| push their articles into my search results, and then
| trying to extract rent don't deserve money.
| carlosjobim wrote:
| You despise the people writing the content you want to
| read, at the same time that you are demanding to access
| their works for free. Do you also work for free for any
| stranger?
| matthewrobertso wrote:
| No, I don't want to read their content. I want to find an
| answer to my search query.
|
| If the search results are full of paywalled articles that
| claim to have text relevant to my query, but won't show
| me the article because publishers are trying to extract
| money from me, the publishers of those articles have made
| my task harder and shouldn't be rewarded. This is a form
| of spam.
| carlosjobim wrote:
| > I want to find an answer to my search query.
|
| And sometimes the answer is behind a paywall. It's not
| spam at all. On the contrary, spam is always free.
| matthewrobertso wrote:
| There is always a free source with the answer somewhere.
| The trouble comes when the free sources are pushed far
| down in the results by legacy brands.
|
| When this happens, I will continue to pretend to be
| google to access the content they are pushing. If
| publishers want to change this behavior they could try
| not letting Google index it, so I don't need to see it in
| my search results.
| carlosjobim wrote:
| Your argument boils down to "I want free stuff", as I see
| it. Okay, but why in the world should Google care about
| what you want in the search results then? You do not
| bring any value and will not bring any future value.
|
| For other users, they see value in having paywalled
| results if they are the best results, because they do not
| have a block against paying for content.
|
| If you for example search for a movie on Google, they'll
| show you paid options to watch it on streaming services
| or rent it from streaming services. That's good and what
| should be expected from a search engine.
|
| Paying for stuff is how the world works. If a restaurant
| boasts about having the nicest steaks, you're not going
| to get a free steak just to be sure that it's good.
|
| But I really think it is time for a better way to pay for
| content and articles instead of having to subscribe to
| each source.
| phone8675309 wrote:
| Where in the parent post did the poster say they
| "despised" the people writing the content?
| spencerflem wrote:
| i agree, it would be nice for google to have an option to
| avoid paywalled articles, or to specify to it what accounts
| you pay for and allow those only
|
| getting all journalism for free isn't sustainable though
| layer8 wrote:
| I don't like paywalls either, but in principle paid
| content is justified, and if one is willing to pay for
| relevant content, isn't it better that Google allows one to
| find it? Maybe Google Search should have a switch "show
| only non-paywalled results" (paraphrased) (I'm sure they
| could figure out which content is paywalled if they wanted
| to), but personally I would probably still prefer seeing
| which sources exist even if they are paywalled.
| postalrat wrote:
| Is it also reasonable to serve the same content to regular
| visitors as you do to Google?
|
| Is it OK to serve a full article to Google and put visitors
| behind a wall?
| spencerflem wrote:
| yes, because google is only using that article to decide
| what searches should link to it
| AnthonyMouse wrote:
| Given two pages that both contain the information the
| user wants, they prefer the one that isn't behind a
| paywall, and so they prefer a search engine that puts
| those results first.
|
| Giving the search engine the full content so it will rank
| higher is lying, because that information isn't actually
| there unless you pay, which most users aren't going to
| do.
|
| In theory you could solve this with a requirement for the
| site to disclose that it's doing this, so the search
| engine can have a box that says "exclude paywalled
| content", but then everybody would check the box.
| hobs wrote:
| I deeply 100% disagree with you - if you want to show google
| something else that is a spam tactic that only ends in bad
| outcomes. If your content is good enough to pay for, it's good
| enough to hide from everyone.
|
| A search based on the results being something other than
| the contents of the search is misleading, a spam tactic, and
| bad.
| CharlesW wrote:
| > _Stop sending your content for free and get them to agree to
| some contract and authorize/license their access to your
| stuff._
|
| Agreed, we need to wrap the public web with DRM immediately. We
| can't expect companies like OpenAI to waste time worrying about
| pesky legalities like copyright and content licensing.
| spencerflem wrote:
| I don't think there's a technical solution that would work
| here. This is a law & enforcement problem
| ducttapecrown wrote:
| That's what the above comment implies if you read it
| sarcastically, they might be in agreement with you!
| spencerflem wrote:
| whoops, duh you're right
| AnthonyMouse wrote:
| It depends what your goal is.
|
| If you're trying to prevent anyone from getting a copy of
| what's on your site, you're probably screwed, because
| technical solutions are hard and legal solutions are only
| going to be violated by The Internet since it isn't all in
| your jurisdiction.
|
| If you're just trying to keep them from putting an
| excessive load on your servers, technical solutions are
| easy. Just give them an API or some other efficient way to
| receive the data.
| BeFlatXIII wrote:
| Perhaps this can accelerate DRM hacking if multiple billions
| of startup money are put behind breaking it to stealthily add
| to the data sets. All the more fun if it's done somewhere
| extradition-proof from the US.
| flir wrote:
| You're not even sure this is a copyright issue in the US yet,
| let alone every other country on the planet. So yes, if you
| don't want it freely consumed, don't freely publish it.
| spencerflem wrote:
| Wouldn't that be worse? Then instead of there being some
| way to get the information (pay them) there's now no way to
| get it.
| flir wrote:
| I'm in "I just want to watch the world burn" mode over
| this right now, to be honest. If all that's left on the
| open internet is SEO spam, and all the chatbots can train
| on is SEO spam, then so be it. We didn't deserve to have
| nice things.
| spencerflem wrote:
| Yea the world sucks, tech companies are making it worse,
| the economy is built to funnel money to whoever can suck
| the most value out of the commons for themselves, climate
| change will ruin everything, hug your favorite endangered
| species while you still can, and the internet is best at
| spreading lies
|
| I don't see how arguing that it should be harder for
| journalists to get paid helps anything though
| flir wrote:
| Accelerationism, obvs.
|
| (Except I'm pretty sure that doesn't work either).
| CharlesW wrote:
| > _You're not even sure this is a copyright issue in the
| US yet..._
|
| We are sure that OpenAI is building its business on
| copyrighted content, under the defense that to do otherwise
| is "impossible."
|
| The courts have not decided whether this is "fair use".
| However, the four factors make it pretty clear.
| flir wrote:
| Before that you'd have to establish that whatever the
| chatbots are doing is "use" in the sense that "fair use"
| means it. The act of reading a book or looking at a
| painting isn't protected by fair use.
|
| If it is a use that falls under fair use, I'm thinking
| about the Google Books case, where an act that is _far_
| less transformative (digital archiving and search) was
| found to be fair use
| (https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....)
|
| (It's going to be "who's got more money?", isn't it? Fun
| fun fun.)
| AnthonyMouse wrote:
| > The courts have not decided whether this is "fair use".
| However, the four factors make it pretty clear.
|
| It's funny because you say it's "pretty clear" without
| saying which way you think it goes, which then makes it
| unclear what you think the result is.
| o11c wrote:
| There's a difference between "I allow people to access my data"
| and "I allow people to create products based on my data"
| amelius wrote:
| When did robots.txt get a legal status?
|
| Or did it ever?
| franze wrote:
| eBay v. Bidder's Edge (2000)
| nadermx wrote:
| That was overturned,
| https://en.wikipedia.org/wiki/Intel_Corp._v._Hamidi and White
| Buffalo Ventures LLC v. University of Texas at Austin
| calibas wrote:
| > _For decades, robots.txt governed the behavior of web
| crawlers._
|
| It never governed anything, web crawlers were never under any
| obligation to follow robots.txt.
|
| This article seems like they took an existing controversy,
| rebranded it as something new, then blamed it on AI.
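
To illustrate the point that robots.txt is advisory, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the bot names (ExampleAIBot, OtherBot), and the helper is_allowed are hypothetical and not taken from any real site. The check runs entirely on the crawler's side: nothing on the server enforces the answer, so a crawler that never calls it is not stopped.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    This is a voluntary, client-side check. A crawler that skips it
    can still send the request; the server is not consulted here.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "OtherBot"):
        for url in ("https://example.com/article",
                    "https://example.com/private/x"):
            print(agent, url, is_allowed(agent, url))
```

A well-behaved crawler would call something like is_allowed before each request and skip disallowed URLs; the sites in this thread that block by IP or user agent are doing so precisely because this step is optional.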
| fallingknife wrote:
| > they took an existing controversy, rebranded it as something
| new, then blamed it on <new thing>
|
| Basically describes most of what passes as journalism these
| days.
| franze wrote:
| > web crawlers were never under any obligation to follow
| robots.txt.
|
| other than the fact that you could get successfully sued if you
| do not follow it, e.g. eBay v. Bidder's Edge in 2000
| nadermx wrote:
| That was overturned,
| https://en.wikipedia.org/wiki/Intel_Corp._v._Hamidi and White
| Buffalo Ventures LLC v. University of Texas at Austin
| geor9e wrote:
| The Internet Archive never follows robots.txt and they're
| still around
| andybak wrote:
| > But as unscrupulous AI companies seek out more and more data
|
| I'm not sure I'm ready to concede the fundamental value judgement
| being made here. At least I refuse to accept it as a given rather
| than the core issue to be decided.
| micromacrofoot wrote:
| many crawlers have always ignored robots.txt, if you're
| monitoring any moderately visited site you're bound to see random
| spikes of bots hammering your server no matter what text file or
| headers you set
| datavirtue wrote:
| This. I stopped reading after the first few sentences. Whoever
| wrote that verge article is clueless.
|
| Data accessible. Data free. Period.
| lewhoo wrote:
| I don't get it. The crux of it all seems to be that Google isn't
| competing with owners of data it crawls using the very same data.
| The crawl part isn't as much of a controversy as usage, is it?
| The mentioned eBay v. Bidder's Edge (2000) seems to be a
| dispute over usage.
___________________________________________________________________
(page generated 2024-02-18 23:02 UTC)