[HN Gopher] With the rise of AI, web crawlers are suddenly contr...
       ___________________________________________________________________
        
       With the rise of AI, web crawlers are suddenly controversial
        
       Author : leephillips
       Score  : 73 points
       Date   : 2024-02-18 17:14 UTC (5 hours ago)
        
 (HTM) web link (www.theverge.com)
 (TXT) w3m dump (www.theverge.com)
        
       | throwup238 wrote:
       | _> For decades, robots.txt governed the behavior of web crawlers.
       | But as unscrupulous AI companies seek out more and more data, the
       | basic social contract of the web is falling apart._
       | 
       | The basic social contract of the web fell apart long ago when
       | almost everyone decided that Google was the only search engine
       | worth serving and started aggressively blocking other crawlers.
        
         | zarzavat wrote:
         | Add to that the removal of public APIs and things like RSS
         | feeds. I would much rather use an API than scrape you, I will
         | even pay a small fee, but if you don't provide anything, then
         | you're getting scraped.
        
         | Plasmoid wrote:
         | I'm out of the loop on this. What has been happening? Do you
         | get IP blocked? Rate limited?
        
           | realusername wrote:
           | You do and since the rise of Cloudflare, you need their
           | approval to create your search engine
        
             | oceanplexian wrote:
             | Unfortunately for them, AI is a double edged sword. AI can
             | trivially scrape content from a screencap. And there are
             | thousands of residential proxies that can't be defeated
             | even by companies like CF.
        
               | spencerflem wrote:
               | If you have residential proxies, you don't need AI to
               | scrape via video, you have the HTML already.
        
               | oceanplexian wrote:
               | Companies spend thousands of engineering hours building
               | elaborate mechanisms to obfuscate JS, protect APIs, apps,
               | trigger captchas, and hide content, all of it completely
               | powerless against the current gen of AI tools. If a human
               | can read it, a bot can.
        
               | Mr_P wrote:
               | Those same companies often invest in accessibility for
               | vision-impaired users. I'm not sure you need a screen
               | capture to scrape content when the site is designed to be
               | navigable with a screen reader.
        
       | aaronrobinson wrote:
       | Drama. Crawlers have always been controversial.
        
       | elpocko wrote:
       | robots.txt is relevant and effective, as is my DNT header.
        
       | naiv wrote:
       | Proxy companies are a big winner now
        
       | linkjuice4all wrote:
       | "With the rise of AI, photos of the exterior of your business are
       | suddenly controversial"
       | 
       | Many revenue-based websites tried to have it both ways with web
       | crawlers wherein they wanted to block automated access or repeat
       | viewers while letting first time viewers get a free taste. Others
       | have noted that basically Google gets a free pass for all the
       | traffic it brings in but everyone else has to respect robots
       | declarations.
       | 
       | It seems like a no brainer - if your web server is configured to
       | reply to GET requests with a 200 status and some content then
       | they get to do pretty much whatever they want with it.
       | 
       | Don't want to give access to everyone? Stop sending your content
       | for free and get them to agree to some contract and
       | authorize/license their access to your stuff.
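A minimal sketch of the alternative described above, in which the server stops answering every GET with a 200 and instead requires a token tied to a license. The names `LICENSED_TOKENS` and `respond`, and the use of a bearer token, are assumptions made for illustration and are not taken from the comment.

```python
# Hypothetical set of tokens issued to clients who accepted a license.
LICENSED_TOKENS = {"token-for-licensed-crawler"}


def respond(headers: dict[str, str],
            licensed_tokens: set[str] = LICENSED_TOKENS) -> tuple[int, str]:
    """Return (status, body) for a GET request, gating content on a bearer token."""
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() == "bearer" and token in licensed_tokens:
        return 200, "full article text"
    # No valid token: refuse instead of sending the content for free.
    return 401, "license required"
```

An anonymous request gets a 401 here, while a request carrying a licensed token gets the content. Whether that trade-off is desirable is exactly what the replies argue about.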
        
         | spencerflem wrote:
         | I don't think this is that cut and dried. It's perfectly
         | reasonable to be fine with your content being linked to in an
         | index like google search does, but not fine with the content
         | being read by humans for free or used to train AI.
         | 
         | I'm sure the sites would rather that google paid them in
         | licencing too, but without changes to our laws that's not going
         | to happen
        
           | matthewrobertso wrote:
           | I want to research a topic, so I do a google search about it.
           | If 100% of the search results are readable to google, so that
           | they are indexed, but unreadable to me, due to paywalls, the
           | google search is useless.
           | 
           | I'm not sure what the solution is, but paywalled articles in
           | search results are bad. If they want to be indexed they
           | should have to offer that same indexed content to anyone
           | browsing the index.
        
             | carlosjobim wrote:
             | If you're doing serious research you pay for the paywall.
             | It's not unreadable to you, just like a coke isn't
             | undrinkable to you because you have to pay for it.
        
               | hobs wrote:
               | It's a bait and switch - you just offered me free cokes
               | and then let me know that it's after I sign up for a
               | subscription service, no thanks!
        
               | spencerflem wrote:
               | It's your assumption that everything behind a google link
               | ought to be 100% free (ad supported). Other people
               | disagree, and Google does not advertise anywhere that
               | their list is free content only.
        
               | lmm wrote:
               | > Google does not advertise anywhere that their list is
               | free content only.
               | 
               | Google does advertise that they index based on the same
               | content that's available to anyone viewing the page, and
               | has policies against presenting a different version of
               | the page to their crawler versus what you're showing to
               | visitors.
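A rough sketch of how one might check for the crawler-versus-visitor mismatch described above: request the same URL under two user agents and compare the responses. The `fetch` callable, the user-agent strings, and the 0.8 threshold are all assumptions for illustration. This is not how Google detects the practice, and a site that verifies crawler IP addresses would not be fooled by a user-agent string alone.

```python
from difflib import SequenceMatcher
from typing import Callable

# Illustrative user-agent strings; real crawlers and browsers send longer ones.
CRAWLER_UA = "Googlebot"
VISITOR_UA = "Mozilla/5.0"


def looks_cloaked(fetch: Callable[[str, str], str], url: str,
                  threshold: float = 0.8) -> bool:
    """Return True if the crawler and visitor responses differ substantially.

    fetch(url, user_agent) is a caller-supplied function returning the page text.
    """
    crawler_view = fetch(url, CRAWLER_UA)
    visitor_view = fetch(url, VISITOR_UA)
    similarity = SequenceMatcher(None, crawler_view, visitor_view).ratio()
    return similarity < threshold
```

Passing `fetch` in as a parameter keeps the comparison logic separate from the HTTP client, so it can be exercised with canned responses.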
        
               | carlosjobim wrote:
               | No, it's not bait and switch. A book store has an index
               | of books they sell, that doesn't mean they're free. I
               | expect a high quality search engine to deliver paid
               | results if they are the best results.
               | 
               | Should Google Maps remove businesses that charge for
               | their products and services from their search results as
               | well?
        
               | hobs wrote:
               | I wouldn't expect that at all, search engines search the
               | content they have available to proffer it to you, that's
               | the job.
               | 
               | If by clicking on the thing it does not have the content
               | I searched for (how am I even certain I get it when I pay
               | you?) I would call that result bad.
               | 
               | If you want to charge for stuff that's great, I recommend
               | it, and if you want to give out a free sample or an index
               | that's great, but it should be the same to all comers.
        
               | matthewrobertso wrote:
               | No I don't, I disable JavaScript and read what they
               | served google in the first place.
               | 
               | If I went to a public water fountain and found that
               | someone had turned it into a coca cola dispensing
               | machine, I wouldn't be happy and wouldn't pay to use it.
               | 
               | "Journalists" creating pay walls, using SEO tactics to
               | push their articles into my search results, and then
               | trying to extract rent don't deserve money.
        
               | carlosjobim wrote:
               | You despise the people writing the content you want to
               | read, at the same time that you are demanding to access
               | their works for free. Do you also work for free for any
               | stranger?
        
               | matthewrobertso wrote:
               | No, I don't want to read their content. I want to find an
               | answer to my search query.
               | 
               | If the search results are full of paywalled articles that
               | claim to have text relevant to my query, but won't show
               | me the article because publishers are trying to extract
               | money from me, the publishers of those articles have made
               | my task harder and shouldn't be rewarded. This is a form
               | of spam.
        
               | carlosjobim wrote:
               | > I want to find an answer to my search query.
               | 
               | And sometimes the answer is behind a paywall. It's not
               | spam at all. On the contrary, spam is always free.
        
               | matthewrobertso wrote:
               | There is always a free source with the answer somewhere.
               | The trouble comes when the free sources are pushed far
               | down in the results by legacy brands.
               | 
               | When this happens, I will continue to pretend to be
               | google to access the content they are pushing. If
               | publishers want to change this behavior they could try
               | not letting Google index it, so I don't need to see it in
               | my search results.
        
               | carlosjobim wrote:
               | Your argument boils down to "I want free stuff", as I see
               | it. Okay, but why in the world should Google care about
               | what you want in the search results then? You do not
               | bring any value and will not bring any future value.
               | 
               | For other users, they see value in having paywalled
               | results if they are the best results, because they do not
               | have a block against paying for content.
               | 
               | If you for example search for a movie on Google, they'll
               | show you paid options to watch it on streaming services
               | or rent it from streaming services. That's good and what
               | should be expected from a search engine.
               | 
               | Paying for stuff is how the world works. If a restaurant
               | boasts about having the nicest steaks, you're not going
               | to get a free steak just to be sure that it's good.
               | 
               | But I really think it is time for a better way to pay for
               | content and articles instead of having to subscribe to
               | each source.
        
               | phone8675309 wrote:
               | Where in the parent post did the poster say they
               | "despised" the people writing the content?
        
             | spencerflem wrote:
             | i agree, it would be nice for google to have an option to
             | avoid paywalled articles, or to specify to it what accounts
             | you pay for and allow those only
             | 
             | getting all journalism for free isn't sustainable though
        
             | layer8 wrote:
             | I don't like paywalls either, but in principle paid
             | content is justified, and if one is willing to pay for
             | relevant content, isn't it better that Google allows one to
             | find it? Maybe Google Search should have a switch "show
             | only non-paywalled results" (paraphrased) (I'm sure they
             | could figure out which content is paywalled if they wanted
             | to), but personally I would probably still prefer seeing
             | which sources exist even if they are paywalled.
        
           | postalrat wrote:
           | Is it also reasonable to serve the same content to regular
           | visitors as you do to Google?
           | 
           | Is it OK to serve a full article to Google and put visitors
           | behind a wall?
        
             | spencerflem wrote:
             | yes, because google is only using that article to decide
             | what searches should link to it
        
               | AnthonyMouse wrote:
               | Given two pages that both contain the information the
               | user wants, they prefer the one that isn't behind a
               | paywall, and so they prefer a search engine that puts
               | those results first.
               | 
               | Giving the search engine the full content so it will rank
               | higher is lying, because that information isn't actually
               | there unless you pay, which most users aren't going to
               | do.
               | 
               | In theory you could solve this with a requirement for the
               | site to disclose that it's doing this, so the search
               | engine can have a box that says "exclude paywalled
               | content", but then everybody would check the box.
        
           | hobs wrote:
           | I deeply 100% disagree with you - if you want to show google
           | something else that is a spam tactic that only ends in bad
           | outcomes. If your content is good enough to pay for, it's good
           | enough to hide from everyone.
           | 
           | A search result based on something other than the content the
           | visitor actually gets is misleading, a spam tactic, and
           | bad.
        
         | CharlesW wrote:
         | > _Stop sending your content for free and get them to agree to
         | some contract and authorize/license their access to your
         | stuff._
         | 
         | Agreed, we need to wrap the public web with DRM immediately. We
         | can't expect companies like OpenAI to waste time worrying about
         | pesky legalities like copyright and content licensing.
        
           | spencerflem wrote:
           | I don't think there's a technical solution that would work
           | here. This is a law & enforcement problem
        
             | ducttapecrown wrote:
             | That's what the above comment implies if you read it
             | sarcastically, they might be in agreement with you!
        
               | spencerflem wrote:
               | whoops, duh you're right
        
             | AnthonyMouse wrote:
             | It depends what your goal is.
             | 
             | If you're trying to prevent anyone from getting a copy of
             | what's on your site, you're probably screwed, because
             | technical solutions are hard and legal solutions are only
             | going to be violated by The Internet since it isn't all in
             | your jurisdiction.
             | 
             | If you're just trying to keep them from putting an
             | excessive load on your servers, technical solutions are
             | easy. Just give them an API or some other efficient way to
             | receive the data.
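The comment above suggests an API as the way to keep load down. A different but common load-limiting technique, shown here only as a hedged sketch, is a per-client token bucket. The class name `TokenBucket` and its parameters are hypothetical, and the caller supplies the clock so the example stays deterministic.

```python
class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0  # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` should be served."""
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In practice a server would keep one bucket per client key (for example an IP address or API token) and pass in a monotonic timestamp such as `time.monotonic()`.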
        
           | BeFlatXIII wrote:
           | Perhaps this can accelerate DRM hacking if multiple billions
           | of startup money are put behind breaking it to stealthily add
           | to the data sets. All the more fun if it's done somewhere
           | extradition-proof from the US.
        
           | flir wrote:
           | You're not even sure this is a copyright issue in the US yet,
           | let alone every other country on the planet. So yes, if you
           | don't want it freely consumed, don't freely publish it.
        
             | spencerflem wrote:
             | Wouldn't that be worse? Then instead of there being some
             | way to get the information (pay them) there's now no way to
             | get it.
        
               | flir wrote:
               | I'm in "I just want to watch the world burn" mode over
               | this right now, to be honest. If all that's left on the
               | open internet is SEO spam, and all the chatbots can train
               | on is SEO spam, then so be it. We didn't deserve to have
               | nice things.
        
               | spencerflem wrote:
               | Yea the world sucks, tech companies are making it worse,
               | the economy is built to funnel money to whoever can suck
               | the most value out of the commons for themselves, climate
               | change will ruin everything, hug your favorite endangered
               | species while you still can, and the internet is best at
               | spreading lies
               | 
               | I don't see how arguing that it should be harder for
               | journalists to get paid helps anything though
        
               | flir wrote:
               | Accelerationism, obvs.
               | 
               | (Except I'm pretty sure that doesn't work either).
        
             | CharlesW wrote:
             | > _You're not even sure this is a copyright issue in the
             | US yet..._
             | 
             | We are sure that OpenAI is building its business on
             | copyrighted content, under the defense that to do otherwise
             | is "impossible."
             | 
             | The courts have not decided whether this is "fair use".
             | However, the four factors make it pretty clear.
        
               | flir wrote:
               | Before that you'd have to establish that whatever the
               | chatbots are doing is "use" in the sense that "fair use"
               | means it. The act of reading a book or looking at a
               | painting isn't protected by fair use.
               | 
               | If it is a use that falls under fair use, I'm thinking
               | about the Google Books case, where an act that is _far_
               | less transformative (digital archiving and search) was
               | found to be fair use
               | (https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....)
               | 
               | (It's going to be "who's got more money?", isn't it? Fun
               | fun fun.)
        
               | AnthonyMouse wrote:
               | > The courts have not decided whether this is "fair use".
               | However, the four factors make it pretty clear.
               | 
               | It's funny because you say it's "pretty clear" without
               | saying which way you think it goes, which then makes it
               | unclear what you think the result is.
        
         | o11c wrote:
         | There's a difference between "I allow people to access my data"
         | and "I allow people to create products based on my data"
        
       | amelius wrote:
       | When did robots.txt get a legal status?
       | 
       | Or did it ever?
        
         | franze wrote:
         | eBay v. Bidder's Edge (2000)
        
           | nadermx wrote:
           | That was overturned,
           | https://en.wikipedia.org/wiki/Intel_Corp._v._Hamidi and White
           | Buffalo Ventures LLC v. University of Texas at Austin
        
       | calibas wrote:
       | > _For decades, robots.txt governed the behavior of web
       | crawlers._
       | 
       | It never governed anything, web crawlers were never under any
       | obligation to follow robots.txt.
       | 
       | This article seems like they took an existing controversy,
       | rebranded it as something new, then blamed it on AI.
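To illustrate the point that compliance is voluntary: robots.txt only has an effect when the crawler itself chooses to consult it. A minimal sketch using Python's standard `urllib.robotparser` follows. The robots.txt content, the bot names, and the helper `is_allowed` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Nothing on the server side enforces the answer. A crawler that never calls a check like this simply proceeds with the request.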
        
         | fallingknife wrote:
         | > they took an existing controversy, rebranded it as something
         | new, then blamed it on <new thing>
         | 
         | Basically describes most of what passes as journalism these
         | days.
        
         | franze wrote:
         | > web crawlers were never under any obligation to follow
         | robots.txt.
         | 
         | other than the fact that you could get successfully sued if
         | you do not follow it, e.g. eBay v. Bidder's Edge in 2000
        
           | nadermx wrote:
           | That was overturned,
           | https://en.wikipedia.org/wiki/Intel_Corp._v._Hamidi and White
           | Buffalo Ventures LLC v. University of Texas at Austin
        
           | geor9e wrote:
           | The Internet Archive never follows robots.txt and they're
           | still around
        
       | andybak wrote:
       | > But as unscrupulous AI companies seek out more and more data
       | 
       | I'm not sure I'm ready to concede the fundamental value judgement
       | being made here. At least I refuse to accept it as a given rather
       | than the core issue to be decided.
        
       | micromacrofoot wrote:
       | many crawlers have always ignored robots.txt, if you're
       | monitoring any moderately visited site you're bound to see random
       | spikes of bots hammering your server no matter what text file or
       | headers you set
        
         | datavirtue wrote:
         | This. I stopped reading after the first few sentences. Whoever
         | wrote that verge article is clueless.
         | 
         | Data accessible. Data free. Period.
        
       | lewhoo wrote:
       | I don't get it. The crux of it all seems to be that Google isn't
       | competing with owners of data it crawls using the very same data.
       | The crawl part isn't as much of a controversy as usage, is it?
       | The mentioned eBay v. Bidder's Edge (2000) seems to be a
       | dispute over usage.
        
       ___________________________________________________________________
       (page generated 2024-02-18 23:02 UTC)