[HN Gopher] Anyone got a contact at OpenAI. They have a spider p...
___________________________________________________________________
Anyone got a contact at OpenAI. They have a spider problem
Author : speckx
Score : 498 points
Date : 2024-04-11 13:34 UTC (9 hours ago)
(HTM) web link (mailman.nanog.org)
(TXT) w3m dump (mailman.nanog.org)
| egberts1 wrote:
| There is always IP filtering, DNS blocking, and HTTP agent
| screening. Just sayin'.
| unsupp0rted wrote:
| > Before someone tells me to fix my robots.txt, this is a
| content farm so rather than being one web site with
| 6,859,000,000 pages, it is 6,859,000,000 web sites each with
| one page.
| apocalyptic0n3 wrote:
| The reason that bit is relevant is that robots.txt is only
| applicable to the current domain. Because each "page" is a
| different subdomain, the crawler needs to fetch the
| robots.txt for every single page request.
|
| What the poster was suggesting is blocking them at a higher
| level - e.g. a user-agent block in an .htaccess or an IP
| block in iptables or similar. That would be a one-stop fix.
| It would also defeat the purpose of the website, however,
| which is to waste the time of crawlers.
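|
| A minimal sketch of that kind of user-agent block, assuming
| Apache with mod_rewrite enabled (the agent string and rule
| are illustrative, not the OP's actual config):
|
|     # return 403 for any request whose User-Agent
|     # contains "GPTBot", before the page is ever served
|     RewriteEngine On
|     RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
|     RewriteRule .* - [F,L]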
| matt_heimer wrote:
| The real question is how is GPTBot finding all the other
| subdomains? Currently the sites have GPTBot disallowed:
| https://www.web.sp.am/robots.txt
|
| If GPTBot is compliant with the robots.txt specification
| then it can't read the URL containing the HTML to find the
| other subdomains.
|
| Either:
|
| 1. GPTBot treats a disallow as a noindex but still requests
| the page itself. Note that Google doesn't treat a disallow
| as a noindex. They will still show your page in search
| results if they discover the link from other pages, but
| they show it with a "No information is available for this
| page." disclaimer.
|
| 2. The site didn't have a GPTBot disallow until they
| noticed the traffic spike, and the bot has already
| discovered a couple million links that need to be crawled.
|
| 3. There is some other page out there on the internet that
| GPTBot discovered that links to millions of these
| subdomains. This seems possible, and the subdomains really
| don't have any way to prevent a bot from requesting
| millions of robots.txt files. The only prevention here is
| to firewall the bot's IP range or work with the bot owners
| to implement better subdomain handling.
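|
| A quick way to check what the spec implies, using Python's
| stdlib robots.txt parser (a sketch; the live file may have
| changed since this thread):
|
|     import urllib.robotparser
|
|     rp = urllib.robotparser.RobotFileParser(
|         "https://www.web.sp.am/robots.txt")
|     rp.read()
|     # With "Disallow: /" for GPTBot, nothing on this host
|     # may be fetched (robots.txt itself is always fair
|     # game), so a compliant bot never sees the outbound
|     # links on the page.
|     print(rp.can_fetch("GPTBot", "https://www.web.sp.am/"))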
| fl7305 wrote:
| I think he's saying that it's not a problem for him, but for
| OpenAI?
| unnouinceput wrote:
| Yup, that's my impression as well. He's just being nice by
| letting OpenAI know they have a problem. Usually this should
| be rewarded with a nice "hey, u guys have a bug" bounty,
| because not long ago some VP from OpenAI was lamenting that
| the cost of training their AI is, and it's his direct quote,
| "eye watering" (the order was millions of $$ per second).
| joha4270 wrote:
| I would be a little sceptical about that figure. 3 million
| dollars per second is around the entire world's annual GDP.
|
| I get it, AI training is expensive, but I don't believe
| it's that expensive
| tasuki wrote:
| Thank you for that perspective. I always appreciate it
| when people put numbers like these in context.
|
| Also 1 million per second is 60 million per minute is 3.6
| billion per hour is 86.4 billion per day. It's about one
| whole value of FAANG per month...
| samspenc wrote:
| Sam Altman has said in a few interviews that it was
| around $100 million for GPT-3, and higher for GPT-4.
|
| But yes, this is a one-time cost, and far lower than the
| "millions of dollars per second" in GP comment.
|
| https://fortune.com/2024/04/04/ai-training-costs-how-much-is...
|
| https://www.wired.com/story/openai-ceo-sam-altman-the-age-of...
| mapmeld wrote:
| Based on the page footer ("IECC ChurnWare") I believe this
| site is designed to waste the time of web crawlers and tools
| that try to get root access on every domain. The robots.txt
| looks like this:
| https://ulysses-antoine-kurtis.web.sp.am/robots.txt
|
| I don't see how this does much to keep bad actors away from
| other domains, but I can see why they don't want to give up the
| game for OpenAI to stop crawling.
| ajuc wrote:
| I mean that's your chance to train the SkyNet, take it :)
| dr_kiszonka wrote:
| That's an interesting attack vector, isn't it?
| throwaway35777 wrote:
| Single handedly extending the lifespan of all humans by however
| long it takes them to crawl 6,859,000,000 pages.
| mypastself wrote:
| Frankly, I didn't get the purpose of the website at first either.
| I guess I have an arachnid intellect.
| aendruk wrote:
| Arachnid here... What am I looking at? The intent is to waste
| the resources of crawlers by just making the web larger?
| quadrature wrote:
| It seems like that, but they're also concerned about the
| crawlers that they catch in this web. So it seems like
| they're trying to help make crawlers better? Or they're just
| generally curious about what systems are crawling around.
| pvg wrote:
| _What am I looking at?_
|
| I'd say go ahead and inject it with digestive enzymes and
| then report findings.
| YeGoblynQueenne wrote:
| No no. First tightly wrap it in silk. _Then_ digestive
| enzymes. Good hygiene, eh?
| pvg wrote:
| Largely a myth spread by Big Silk!
| troq13 wrote:
| You had the sense not to spend the rest of your days crawling
| it though.
| frizlab wrote:
| I'd let them do their thing, why not?! They want the internet?
| This is the real internet. It looks like he doesn't really care
| that much that they're retrieving millions of pages, so let them
| do their thing...
| gtirloni wrote:
| _> It looks like he doesn't really care that much that they're
| retrieving millions of pages_
|
| It impacts the performance for the other legitimate users of
| that web farm ;)
| jeremyjh wrote:
| Some scrapers respect robots.txt. OpenAI doesn't. SP is just
| informing the world at large of this fact.
| cactusplant7374 wrote:
| The CTO isn't even aware of where the data is coming from
| (allegedly).
| Tubbe wrote:
| (admittedly)
| Zambyte wrote:
| The (allegedly) implies they do know, but to avoid possible
| litigation they feign ignorance. The CTO of ClosedAI is
| probably not a complete idiot.
| whoami_nr wrote:
| That's what the whole thing is about. He is complaining that
| they don't respect robots.txt.
| alberth wrote:
| Isn't the entire point of this type of website to waste
| spider time/resources?
|
| Why do they want to not do that for OpenAI?
| ganzuul wrote:
| Might one day come looking for who lobotomized it?
| dimask wrote:
| And? What are they gonna do about it (apart from making such
| a person/website momentarily famous)?
| tasuki wrote:
| Have you not heard of Roko's Basilisk?
| withinboredom wrote:
| Thanks for dooming everyone who reads this comment.
| nojvek wrote:
| I didn't know about Roko's Basilisk until today.
|
| https://www.youtube.com/watch?v=ut-zGHLAVLI
| https://en.wikipedia.org/wiki/Roko%27s_basilisk
|
| It's amazing that we can think about and contemplate
| idea viruses.
| ganzuul wrote:
| Give you bad search results.
| RobotToaster wrote:
| Do androids dream of electric farms?
| imnotreallynew wrote:
| Isn't the legality of web scraping still... disputed?
|
| There's been a few projects I've wanted to work on involving
| scraping, but the idea that the entire thing could be shut down
| with legal threats seems to make some of the ideas infeasible.
|
| It's strange that OpenAI has created a ~$80B company (or whatever
| it is) using data gathered via scraping and as far as I'm aware
| there haven't been any legal threats.
|
| Was there some law that was passed that makes all web scraping
| legal or something?
| foobarian wrote:
| Why would it not be legal? Was there a law passed that makes it
| illegal?
| brushfoot wrote:
| Web scraping the public Internet is legal, at least in the U.S.
|
| hiQ's public scraping of LinkedIn was ruled to be within their
| rights and not a violation of the CFAA. I imagine that's why
| LinkedIn has almost everything behind an auth wall now.
|
| Scraping auth-walled data is different. When you sign up, you
| have to check "I agree to the terms," and the terms generally
| say, "You can't scrape us." So, you can't just make a million
| bot accounts that take an app's data (legally, anyway). Those
| EULAs are generally legally enforceable in the U.S.
|
| Some sites have terms at the bottom that prohibit scraping--but
| my understanding is that those aren't generally enforceable if
| the user doesn't have to take any action to accept or
| acknowledge them.
| bena wrote:
| hiQ was found to be in violation of the User Agreement in the
| end.
|
| Basically, in the end, it was essentially a breach of
| contract.
| brushfoot wrote:
| Exactly, that was my point.
|
| hiQ's public scraping was found to be legal. It was the
| logged-in scraping that was the problem.
|
| The logged-in scraping was a breach of contract, as you
| said.
|
| The former is fine; the latter is not.
|
| What OpenAI is doing here is the former, which companies
| are perfectly within their rights to do.
| darby_eight wrote:
| > Scraping auth-walled data is different. When you sign up,
| you have to check "I agree to the terms," and the terms
| generally say, "You can't scrape us." So, you can't just make
| a million bot accounts that take an app's data (legally,
| anyway). Those EULAs are generally legally enforceable in the
| U.S.
|
| They're legally enforceable in the sense that the scraped
| services generally reserve the right to terminate the
| authorizing account at will, or legally enforceable in that
| allowing someone to scrape you with your credentials (or
| scraping using someone else's) qualifies as violating the
| CFAA?
| withinboredom wrote:
| Most of these SaaSes have a "firehose" that you can
| subscribe to if you are big enough (aka, can handle the
| firehose).
| These are like RSS feeds on crack for their entire SaaS.
|
| - https://developer.twitter.com/en/docs/twitter-api/enterprise...
|
| - https://developer.wordpress.com/docs/firehose/
| gtirloni wrote:
| _> It's strange that OpenAI has created a ~$80B company (or
| whatever it is) using data gathered via scraping_
|
| Like Google and many others.
| bena wrote:
| Scraping is legal. Always has been, always will be. Mainly
| because there's some fuzz around the edges of the definition.
| Is a web browser a scraper? It does a lot of the same things.
|
| IIRC LinkedIn/Microsoft was trying to sue a company based on
| Computer Fraud and Abuse Act violations, claiming they were
| accessing information they were not allowed to. Courts ruled
| that that was bullshit. You can't put up a website and say "you
| can only look at this with your eyes". Recently-ish, they were
| found to be in violation of the User Agreement.
|
| So as long as you don't have a user account with the site in
| question or the site does not have a User Agreement prohibiting
| scraping, you're golden.
|
| The problem isn't the scraping anyway, it's the reproduction of
| the work. In that case, it really does matter how you acquired
| the material and what rights you have with regards use of that
| material.
| nutrie wrote:
| Scraping publicly available data from websites is no different
| from web browsing, period. Companies stating otherwise in their
| T&Cs are a joke. Copyright infringement is a different game.
| dspillett wrote:
| The issue often isn't the scraping, it is often how you use the
| information scraped afterwards. A lot of scraping is done with
| no reference to any licensing information the sites being read
| might publish, hence image making AI models having regurgitated
| chunks of scraped stock images complete with watermarks. Though
| the scraping itself can count as a DoS if done aggressively
| enough.
| reaperman wrote:
| There's currently only one situation where scraping is almost
| definitely "not legal":
|
| If the information you're scraping requires a login, and if in
| order to get a login you have to agree to a terms of service,
| and that terms of service forbids you from scraping -- then you
| could have a bad day in civil court if the website you're
| scraping decides to sue you.
|
| If the data is publicly accessible without a login then
| scraping is 99% safe with no legal issues, even if you ignore
| robots.txt. You might still end up in court if you found a way
| to correctly guess non-indexed URLs[0] but you'd probably
| prevail in the end (...probably).
|
| The "purpose" of robots.txt is to let crawlers know what they
| can do without getting ip-banned by the website operator that
| they're scraping. Generally crawlers that ignore robots.txt and
| also act more like robots than humans, will get an IP ban.
|
| 0: https://www.troyhunt.com/enumerationis-enumerating-resources...
| ToucanLoucan wrote:
| Also worth noting there's a long history of companies with
| deep pockets getting away with murder (sometimes literally)
| because litigation in a system that costs money to engage
| with inherently favors the wealthier party.
|
| Also OpenAI's entire business model is relying on generous
| interpretations of various IP laws, so I suspect they already
| have a mature legal division to handle these sorts of
| potential issues.
| observationist wrote:
| The 9th Circuit Court of Appeals found that scraping publicly
| accessible content on the internet is legal.
|
| If you publish something on a publicly served internet page,
| you're essentially broadcasting it to the world. You're putting
| something on a server which specifically communicates the bits
| and bytes of your media to the person requesting it without
| question.
|
| You have every right to put whatever sort of barrier you'd like
| on the server, such as a sign in, a captcha, a puzzle, a
| cryptographic software key exchange mechanism, and so on. You
| could limit the access rights to people named Sam, requiring
| them to visit a particular real world address to provide
| notarized documentation confirming their identity in exchange
| for a unique 2fa fob and credentials for secure access (call it
| The Sams Club, maybe?)
|
| If you don't put up a barrier, and you configure the server to
| deliver the content without restriction, or put your content on
| a server configured as such, then you are implicitly
| authorizing access to your content.
|
| Little popups saying "by visiting this site, you agree to blah
| blah blah" are not valid. Courts made the analogy to a "gate-
| up/gate-down" mechanism. If you have a gate down, you can
| dictate the terms of engagement with your server and content.
| If you don't have a gate down, you're giving your content to
| whoever requests it.
|
| You have control over the information you put online. You can
| choose which services and servers you upload to and interact
| with. Site operators and content producers can't decide that
| their intent or consent be withdrawn after the fact, as once
| something is published and served, the only restrictions on the
| scraper are how they use the information in turn.
|
| Someone who's archived or scraped publicly served data can do
| whatever they want with the content within established legal
| boundaries. They can rewrite all the AP news articles with
| their own name as author, insert their name as the hero in all
| fanfic stories they download, and swap out every third word for
| "bubblegum" if they want. They just can't publish or serve that
| content, in turn, unless it meets the legal standards for fair
| use. Other exceptions to copyright apply, in educational,
| archival, performance, accessibility, and certain legal
| conditions such as First Sale doctrine. Personal use of such
| media is effectively unlimited.
|
| The legality of web scraping is not disputed in the US. Other
| countries have some silly ideas about post-hoc "well that's not
| what I meant" legal mumbo jumbo designed to assist politicians
| and rich people in whitewashing their reputations and pulling
| information offline using legal threats.
|
| Aside from right to be forgotten inanity, content on the
| internet falls under the same copyright rules as books,
| magazines, or movies published on physical media. If Disney set
| up a stall at San Francisco city hall with copies of the
| Avengers movies on a thumb drive in a giant box saying "free,
| take one!", this would be roughly the same as publishing those
| movie files to a public Disney web page. The gate would be up.
| (The way they have it set up in real life, with their streaming
| services and licensed media access, the gate is down.)
|
| So - leaving behind the legality of redistribution of content,
| there's no restriction on web scraping public content, because
| the content was served intentionally to the software or entity
| that visited the site. It's up to the server operator to put
| barriers in place and to make content private. It's not rocket
| surgery, but platforms want to have their cake and eat it too,
| with control over publicly accessible content that isn't legal
| or practical.
|
| Twitter/X is a good example of impractical control, since the
| site has effectively become useless spam without signing in.
| Platforms have to play by the same rules as everyone else. If
| the gate is up, the content is fair game for scraping. The
| Supreme Court remanded the decision to a lower court, which
| affirmed the gate up/gate down test for legality of access
| to content.
|
| Since Google and other major corporations have a vested
| interest in the internet remaining open and free, and their
| search engines and other tech are completely dependent on the
| gate up/gate down status quo, it's unlikely that the law will
| change any time soon.
|
| Tl;dr: Anything publicly served is legal to scrape. Microsoft
| attempted to sue someone for scraping LinkedIn, but the 9th
| Circuit court ruled in favor of access. If Microsoft's lawyers
| and money can't impede scraping, it's likely nobody will ever
| mount an effective challenge, and the gating doctrine is
| effectively the law of the land.
| Karellen wrote:
| > Isn't the legality of web scraping still..disputed?
|
| Are you suggesting it might be illegal to... write a program
| that connects to a web server and asks for a specific page, and
| then parses that page to see which resources it wants and which
| other pages it links to, and treats those links in some special
| fashion, differently from the text content of the page?
|
| Especially given that a web server can be configured to respond
| to any request with a "403 Forbidden" response, if the server
| determines for any reason whatsoever that it does not want to
| give the client the page it requested?
| cess11 wrote:
| A similar thing happened in 2011 when the picolisp project
| published a 'ticker', something like a Markov chain generating
| pages on the fly.
|
| https://picolisp.com/wiki/?ticker
|
| It's a nice type of honeypot.
| Octokiddie wrote:
| I'm more interested in what that content farm is for. It looks
| pointless, but I suspect there's a bizarre economic incentive.
| There are affiliate links, but how much could that possibly bring
| in?
| gtirloni wrote:
| I'd say it's more like a honeypot for bots. So pretty similar
| objectives.
| Octokiddie wrote:
| So it served its purpose by trapping the OpenAI spider? If
| so, why post that message? As a flex?
| Takennickname wrote:
| It's a honeypot. He's telling people openai doesn't respect
| robots.txt and just scrapes whatever the hell it wants.
| cwillu wrote:
| Except the first thing OpenAI does is read robots.txt.
|
| However, robots.txt doesn't cover multiple domains, and
| every link that's being crawled is to a new domain, which
| requires a new read of a robots.txt on the new domain.
| queuebert wrote:
| Did we just figure out a DoS attack for AGI training? How
| large can a robots.txt file be?
| a_c wrote:
| What about making it slow? One byte at a time for example
| while keeping the connection open
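|
| A minimal sketch of that idea in Python, assuming a
| standalone toy server (port and payload are illustrative):
|
|     import socket, time
|
|     srv = socket.socket()
|     srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
|     srv.bind(("0.0.0.0", 8080))
|     srv.listen()
|
|     while True:
|         conn, _ = srv.accept()
|         conn.recv(4096)  # discard the request
|         # valid headers, then drip the body one byte/second
|         conn.sendall(b"HTTP/1.1 200 OK\r\n"
|                      b"Content-Type: text/plain\r\n\r\n")
|         for b in b"User-agent: *\nDisallow: /\n":
|             conn.sendall(bytes([b]))
|             time.sleep(1.0)
|         conn.close()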
| happymellon wrote:
| A slow stream that never ends?
| SteveNuts wrote:
| This would be considered a Slow Loris attack, and I'm
| actually curious how scrapers would handle it.
|
| I'm sure the big players like Google would deal with it
| gracefully.
| throw_a_grenade wrote:
| You just set limits on everything (time, buffers, ...),
| which is easier said than done. You need to really
| understand your libraries and all the layers down to the
| OS, because it's enough to have one abstraction that
| doesn't support setting limits for it to become an
| invitation for (counter-)abuse.
| starttoaster wrote:
| Doesn't seem like it should be all that complex to me
| assuming the crawler is written in a common programming
| language. It's a pretty common coding pattern for
| functions that make HTTP requests to set a timeout for
| requests made by your HTTP client. I believe the stdlib
| HTTP library in the language I usually write in actually
| sets a default timeout if I forget to set one.
| Calzifer wrote:
| Those are usually connection and no-data timeouts. A
| total time limit is in my experience less common.
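|
| To illustrate the distinction with Python's requests (a
| sketch; the URL is illustrative): the timeout tuple is
| (connect, between-reads), not a total deadline, so a
| drip-feed sending one byte every few seconds never trips
| it. A wall-clock limit has to be bolted on separately.
|
|     import concurrent.futures
|     import requests
|
|     url = "https://example.com/robots.txt"
|
|     # fails only if connecting takes >5s or any single
|     # read stalls >10s; one byte every 9s holds it open
|     r = requests.get(url, timeout=(5, 10))
|
|     # enforcing a total 30s deadline from the outside
|     with concurrent.futures.ThreadPoolExecutor() as pool:
|         fut = pool.submit(requests.get, url, timeout=(5, 10))
|         r = fut.result(timeout=30)  # TimeoutError after 30s
|         # (the worker thread keeps running; real crawlers
|         # also need cancellation plumbing)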
| gtirloni wrote:
| Here you go (1 req/min, 10 bytes/sec), please report
| results :)
|
|     http {
|         limit_req_zone $binary_remote_addr
|             zone=ten_bytes_per_second:10m rate=1r/m;
|         server {
|             location / {
|                 if ($http_user_agent = "mimo") {
|                     limit_req zone=ten_bytes_per_second burst=5;
|                     limit_rate 10;
|                 }
|             }
|         }
|     }
| beau_g wrote:
| Scrapers of the future won't be ifElse logic, they will
| be LLM agents themselves. The slow loris robots.txt has
| to provide an interface to its own LLM, which engages
| the scraper LLM in conversation, aiming to extend it as
| long as possible. "OK I will tell you whether or not I
| can be scraped. BUT FIRST, listen to this offer. I can
| give you TWO SCRAPES instead of one, if you can solve
| this riddle."
| iosguyryan wrote:
| Can I interest you in a scrape-share with Claude?
| Phelinofist wrote:
| Sounds like endlessh
| bityard wrote:
| That would make it a tarpit, a very old technique to
| combat scrapers/scanners
| everforward wrote:
| No, because there's no legal weight behind robots.txt.
|
| The second someone weaponizes robots.txt all the scrapers
| will just start ignoring it.
| Retric wrote:
| That's how you weaponize it. Set things up to give
| endless/randomized/poisoned data to anybody that ignores
| robots.txt.
| flutas wrote:
| > Except the first thing OpenAI does is read robots.txt.
|
| Then they should see the "Disallow: /" line, which means
| they shouldn't crawl any links on the page (because even
| the homepage is disallowed). Which means they wouldn't
| follow any of the links to other subdomains.
| darkwater wrote:
| And they do have (the same) robots.txt on every domain,
| tailored for GPTBot, e.g.
| https://petra-cody-carlene.web.sp.am/robots.txt
|
| So, GPTBot is not following robots.txt, apparently.
| fsckboy wrote:
| humans don't read/respect robots.txt, so in order to pass
| the Turing test, AIs need to mimic human behavior.
| gunapologist99 wrote:
| This must be why self-driving cars always ignore the
| speed limit. ;)
| microtherion wrote:
| More directly, e.g. Tesla boasts of training their FSD on
| data captured from their customers' unassisted driving.
| So it's hardly surprising that it imitates a lot of
| humans' bad habits, e.g. rolling past stop lines.
| roughly wrote:
| Jesus, that's one of those ideas that looks good to an
| engineer but is why you really need to hire someone with
| a social sciences background (sociology, anthropology,
| psychology, literally anyone whose work involves humans),
| and probably should hire two, so the second one can tell
| you why the first died of an aneurysm after you explained
| your idea.
| yreg wrote:
| AI DRIVR claims that beta V12 is much better precisely
| because it takes rules less literally and drives more
| naturally.
| cwillu wrote:
| Accessing a directly referenced page is common in order
| to receive the noindex header and/or meta tag, whose
| semantics are not implied by "Disallow: /"
|
| And then all the links are to external domains, which
| aren't subject to the first site's robots.txt
| andybak wrote:
| This is a moderately persuasive argument.
|
| Although the crawler should probably ignore all the html
| body. But it does feel like a grey area if I accept your
| first point.
| AgentME wrote:
| All the lines related to GPTBot are commented out. That
| robots.txt isn't trying to block it. Either it has been
| changed recently or most of this comment thread is
| mistaken.
| Pannoniae wrote:
| It wasn't commented out a few hours ago when I checked
| it. I think that's a recent change.
| GaggiX wrote:
| It seems to respect it as the majority of the requests
| are for the robots.txt.
| flutas wrote:
| He says 3 million, and 1.8 million are for robots.txt.
|
| So 1.2 million non-robots.txt requests, when his
| robots.txt file is configured as follows:
|
|     # buzz off
|     User-agent: GPTBot
|     Disallow: /
|
| Theoretically, if they were actually respecting robots.txt
| they wouldn't crawl _any_ pages on the site. Which would
| also mean they wouldn't be following any links... aka not
| finding the N subdomains.
| swyx wrote:
| for the 1.2 million are there other links he's not
| telling us about?
| flutas wrote:
| I'm assuming those are homepage requests for the
| subdomains.
| otherme123 wrote:
| A lot of crawlers, if not all, have a policy like "if you
| disallow our robot, it might take a day or two before it
| notices". They surely follow the path "check if we have a
| robots.txt that allows us to scan this site; if we don't,
| get and store robots.txt, and scan at least the root of the
| site and its links". There won't be a second scan, and they
| consider that they are respecting robots.txt. Kind of
| "better ask for forgiveness than for permission".
| jeremyjh wrote:
| That is indistinguishable from not respecting robots.txt.
| There is a robots.txt on the root the first time they ask
| for it, and they read the page and follow its links
| regardless.
| otherme123 wrote:
| I agree with you. I only stated how the crawlers seem to
| work; if you read their pages or try to block/slow them
| down, it seems clear that they scan first and respect
| after. But somehow people understood that I approve of
| that behaviour.
| But somehow people understood that I approve that
| behaviour.
|
| For those bad crawlers, which I very much disapprove of,
| "not respecting robots.txt" equals "don't even read
| robots.txt, or if I read it, ignore it completely". For
| them, "respecting robots.txt" means "scan the page for
| potential links, and after that parse and respect
| robots.txt". Which I disapprove of and don't condone.
| jeffnappi wrote:
| His site has a subdomain for every page, and the crawler
| is considering those each to be unique sites.
| sangnoir wrote:
| There are fewer than 10 links on each domain, so how did
| GPTBot find out about the 1.8M unique sites? By crawling
| the sites _it's not supposed to crawl_, ignoring
| robots.txt. "disallow: /" doesn't mean "you may peek at
| the homepage to find outbound links that may have a
| different robots.txt"
| vertis wrote:
| Except now it says:
|
|     # silly bing
|     #User-agent: Amazonbot
|     #Disallow: /
|
|     # buzz off
|     #User-agent: GPTBot
|     #Disallow: /
|
|     # Don't Allow everyone
|     User-agent: *
|     Disallow: /archive
|
|     # slow down, dudes
|     #Crawl-delay: 60
|
| Which means he's changing it. The default for all other
| bots is to allow crawling.
| swatcoder wrote:
| I'm not sure any publisher means for their robots.txt to
| be read as:
|
| "You're disallowed, but go head and slurp the content
| anyway so you can look for external links or any
| indication that maybe you are allowed to digest this
| material anyway, and then interpret that how you'd like.
| I trust you to know what's best and I'm sure you kind of
| get the gist of what I mean here."
| dspillett wrote:
| So, it has worked...
| madkangas wrote:
| I recognize the name John Levine at iecc.com, "Invincible
| Electric Calculator Company," from web 1.0 era. He was the
| moderator of the Usenet comp.compilers newsgroup and wrote the
| first C compiler for the IBM PC RT.
|
| https://compilers.iecc.com/
| throw_a_grenade wrote:
| This is a honeypot. The author,
| https://en.wikipedia.org/wiki/John_R._Levine, keeps it just
| to notice any new (significant) scraping operation being
| launched; it will invariably hit his little farm and show
| up in the logs. He's a well-known anti-spam operative whose
| various efforts now date back multiple decades.
|
| Notice how he casually drops a link to the landing page in
| the NANOG message. That's how the bots take the bait.
| agilob wrote:
| It's for shits-and-giggles and it's doing its job really well
| right now. Not everything needs to have an economic purpose,
| 100 trackers, ads, and a company backing it.
| euparkeria wrote:
| What is a spider in this context?
| clarkrinker wrote:
| Old name for a web crawler / search indexer
| dmd wrote:
| https://en.wikipedia.org/wiki/Web_crawler
| cosmojg wrote:
| > A Web crawler, sometimes called a spider or spiderbot and
| often shortened to crawler, is an Internet bot that
| systematically browses the World Wide Web and that is typically
| operated by search engines for the purpose of Web indexing (web
| spidering).
|
| See: https://en.wikipedia.org/wiki/Web_crawler
| btown wrote:
| This reminds me of how GPT-2/3/J came across
| https://reddit.com/r/counting, wherein redditors repeatedly post
| incremental numbers to count to infinity. It considered their
| usernames, like SolidGoldMagikarp, such common strings on the
| Internet that, during tokenization, it treated them as top-level
| tokens of their own.
|
| https://www.alignmentforum.org/posts/8viQEp8KBg2QSW4Yc/solid...
|
| https://www.lesswrong.com/posts/LAxAmooK4uDfWmbep/anomalous-...
|
| Vocabulary isn't infinite, and GPT-3 reportedly had only 50,257
| distinct tokens in its vocabulary. It does make me wonder - it's
| certainly not a linear relationship, but given the number of
| inferences run every day on GPT-3 while it was the flagship
| model, the incremental electricity cost of these Redditors' niche
| hobby, vs. having allocated those slots in the vocabulary to
| actually common substrings in real-world text and thus reducing
| average input token count, might have been measurable.
|
| It would be hilarious if the subtitle on OP's site, "IECC
| ChurnWare 0.3," became a token in GPT-5 :)
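|
| This is easy to poke at with the tiktoken library (a
| sketch, assuming tiktoken is installed; the GPT-2/GPT-3
| byte-pair vocabulary is the published one):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("gpt2")
|     print(enc.n_vocab)  # 50257, the figure quoted above
|     # the username should come back as a single token id,
|     # while an ordinary phrase splits into several
|     print(enc.encode(" SolidGoldMagikarp"))
|     print(enc.encode(" IECC ChurnWare"))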
| mort96 wrote:
| During tokenization, the usernames became tokens... but before
| training the actual model, they removed stuff like that from
| the training data, so it was never trained on text which
| contains those tokens. As such, it ended up with tokens which
| weren't associated with anything; glitch tokens.
| zelphirkalt wrote:
| So it becomes a game of getting things into the training
| data, past the training data cleanup step.
| btown wrote:
| It's interesting: perhaps the stability (from a change
| management perspective) of the tokenization algorithm, being
| able to hold that constant, between old and new training runs
| was deemed more important than trying to clean up the data at
| an earlier phase of the pipeline. And the eventuality of
| glitch tokens was deemed an acceptable consequence.
| aidenn0 wrote:
| I wonder how much the source content is the cause of
| hallucinations rather than anything inherent to LLMs. I mean if
| someone posts a question on an internet forum that I don't know
| the answer to, I'm certainly not going to post "I don't know"
| since that wouldn't be useful.
|
| In fact, in general, in any non one-on-one conversation the
| answer "I don't know" is not useful because if you don't know
| in a group, your silence indicates that.
| abracadaniel wrote:
| That's a good observation. If LLMs had taken off 15 years
| ago, maybe they would answer every question with "this has
| already been asked before. Please use the search function"
| exe34 wrote:
| Marked as duplicate.
| btown wrote:
| Your question may be a better fit for a different
| StackExchange site.
| amadeuspagel wrote:
| We prefer questions that can be answered, not merely
| discussed.
| exe34 wrote:
| Feel like we're going to get dang on our case soon...
| TeMPOraL wrote:
| Topic locked as not constructive.
| taco-hands wrote:
| Damn. I was hoping to be the fastest gun in the west on
| this one!
| SirMaster wrote:
| I thought LLMs don't say when they don't know something
| because of how they are tuned and because of RLHF.
| ben_w wrote:
| They can say they don't know, and have been trained to in
| at least some cases; I think the deeper problem -- which
| we don't know how to fix in humans, the closest we have
| is the scientific method -- is they can be confidently
| wrong.
| jdashg wrote:
| Nowadays they are instead learning to say "please join our
| Discord for support"!
| TeMPOraL wrote:
| Much like one of the first phrases spoken by babies today
| is "like and subscribe".
| Voultapher wrote:
| Part of me really wants to believe this is a joke.
|
| But given how often toddlers are "taken care of" by
| planting them in front of youtube :|
| jvanderbot wrote:
| Which is crazy because there's plenty of good content for
| kids on Youtube (if you really need a break!). Blippy,
| Meekah, Seasame Street, even that mind-numbing drivel
| Cocomelon (which at least got my girls talking/singing
| really early).
| stcredzero wrote:
| Isn't there some kind of voice or video generation model
| that says "be sure to like and subscribe" given an empty
| prompt?
| dizhn wrote:
| For the actual useful information visit my Patreon.
| darby_eight wrote:
| With Wittgenstein I think we see that "hallucinations" are a
| part of language in general, albeit one I could see being
| particularly vexing if you're trying to build a perfectly
| controllable chatbot.
| Y_Y wrote:
| This sounds interesting, could you give more detail on what
| you're referring to?
| all2 wrote:
| I would assume GP is talking about the fallibility of
| human memory, or perhaps about the meanings of
| words/phrases/aphorisms that drift with time. C.S. Lewis
| talks about the meaning of the word "gentleman" in one of
| his books; at first the word just meant "land owner" and
| that was it. Then it gained social significance and began
| to be associated with certain kinds of behavior. And now,
| in the modern register, its meaning is so dilute that it
| can be anything from "my grandson was well behaved today"
| or "what an asshole" depending on its use context.
|
| Dunno. GP?
| darby_eight wrote:
| I'm referring to his two works, the "Tractatus Logico-
| Philosophicus" and "Philosophical Investigations".
| There's a lot explored here, but Wittgenstein basically
| makes the argument that the natural logic of language--
| how we deduce meaning from terms in a context and
| naturally disambiguate the semantics of ambiguous phrases
| --is different from the sort of formal propositional
| logic that forms the basis of western philosophy.
| However, this is also the sort of logic that allows us to
| apply metaphors and conceive of (possibly incoherent,
| possibly novel, certainly not deductively-derived) terms
| --counterfactuals, conditionals, subjunctive phrases,
| metaphors, analogies, poetic imagery, etc. LLMs have
| shown some affinity for the former (linguistic) type of
| logic with greatly reduced affinity for the latter
| (formal/propositional) sort of logical processing.
| Hallucinations as people describe them seem to be
| problems with not spotting "obvious" propositional
| incoherence.
|
| What I'm pushing at is not that this linguistic ability
| naturally leads to the LLM behavior we're seeing and
| calling "hallucinating", just that LLMs may capture some
| of how humans process language, differentiate semantics,
| recall terms, etc, but without the mechanisms that enable
| rationally grappling with the resulting semantics and
| propositional (in)coherency that are fetched or
| generated.
|
| I can't say this is very surprising--most of us seem to
| have thought processes that involve generating and
| rejecting thoughts when we e.g. "brainstorm" or engage in
| careful articulation that we haven't even figured out how
| to formally model with a chatbot capable of generating a
| single "thought", but I'm guessing if we want chatbots to
| keep their ability to generate things creatively there
| will always be tension with potentially generating
| factual claims, erm, creatively. Further evidence is
| anecdotal observations that some people seem to have
| wildly different thresholds for the propositional
| coherence they can spot--perhaps one might be inclined to
| correlate the complexity with which one can engage in
| spotting (in)coherence with "intelligence", if one
| considers that a meaningful term.
| Y_Y wrote:
| Thanks for the fascinating response.
| beepbooptheory wrote:
| Wait, are you saying this is something you read in both the
| Tractatus and the PI? They are quite opposed as texts!
| That's kinda why he wrote the PI at all..
|
| I don't think Wittgenstein would agree, first of all,
| that there is a "natural logic" to language. At least in
| the PI, that kind of entity--"the natural logic of
| language"--is precisely the kind of weird and imprecise
| use of language he is trying to expose. Even more, to say
| that such a logic "allows" for anything (like metaphors)
| feels like a very very strange thing for Wittgenstein to
| assert. He would ask "what do you mean by 'allows'"?
|
| All we know, according to him (in the PI), is that we
| find ourselves speaking in situations. Sometimes I say
| something, and my partner picks up the right brick, other
| times they do nothing, or hit me. In the PI, all the rest
| is doing away with things, like our idea of private
| language, the irreality of things like pain, etc. To
| conclude that he would make such assertions about the
| "nature" of language, of poetry, whatever, seems like
| maybe too quick a reading of the text. It is at best, a
| weirdly mystical reading of him, that he probably would
| not be too happy about (but don't worry about that, he
| was an asshole).
|
| The argument you are making sounds much more French.
| Derrida or Lyotard have said similar things (in their
| earlier, more linguistic years). They might be better
| friend to you here.
| jjgreen wrote:
| _What we cannot speak about we must pass over in
| silence._
| notpachet wrote:
| > some people seem to have wildly different thresholds
| for the propositional coherence they can spot
|
| This sums up the last decade remarkably well.
| wizzwizz4 wrote:
| I don't remember Wittgenstein saying anything about that.
| shkkmo wrote:
| >In fact, in general, in any non one-on-one conversation the
| answer "I don't know" is not useful because if you don't know
| in a group, your silence indicates that.
|
| This isn't true. There are many contexts where it is true but
| it doesn't actually generalize the way you say it does.
|
| There are plenty of cases where experts in a non-one-on-one
| context will express a lack of knowledge. Sometimes this will
| be as part of making point about the broader epistemic state
| of the group, sometimes it will be simply to clarify the
| epistemic state of the speaker.
| digging wrote:
| > I wonder how much the source content is the cause of
| hallucinations rather than anything inherent to LLMs
|
| I mean, it's inherent to LLMs to be unable to answer "I don't
| know" as a result of _not knowing the answer_. An LLM never
| "doesn't know" the answer. But they'll gladly answer "I don't
| know" if that's statistically the most likely response,
| right? (Although current public offerings are probably
| trained against ever saying that.)
| aidenn0 wrote:
| LLMs work at all because of the high correlation between
| the statistically most likely response and the most
| reasonable answer.
| digging wrote:
| That's an explanation of why their answers can be useful,
| but doesn't relate to their ability to "not know" an
| answer
| ben_w wrote:
| I suspect this is going to be a disagreement on the
| meaning of "to know".
|
| On the same lines as why people argue if a tree falling
| in a wood where nobody can hear it makes sound because
| some people implicitly regard sound is the qualia while
| others regard it as the vibrations in the air.
| digging wrote:
| Not really.
|
| An LLM should have no problem replying "I don't know" if
| that's the most statistically likely answer to a given
| question, and if it's not trained against such a
| response.
|
| What it fundamentally can't do is introspect and
| determine it doesn't have enough information to answer
| the question. It always has an answer. (disclaimer: I
| don't know jack about the actual mechanics. It's possible
| something could be constructed which does have that
| ability and still be considered an "LLM". But the ones we
| have now can't do that.)
| TeMPOraL wrote:
| FWIW, we train into kids that "I don't know" is a valid
| response, and when to utter it. That training is more
| RLHF-type than source-material-type, too.
| digging wrote:
| I don't follow, what does this mean to the conversation?
| TeMPOraL wrote:
| That knowing to say "I don't know" instead of
| extrapolating is an explicitly learned skill in humans,
| not something innate or inherent in structure of
| language, so we shouldn't expect LLMs to pick that _ex
| nihilo_ either.
| TrevorJ wrote:
| Sure it does, if those tokens appear in the training
| data.
| CuriouslyC wrote:
| A lot of LLM hallucination is because of the internal
| conflict between alignment for helpfulness and lack of a
| clear answer. It's much like when someone gets out of their
| depth in a conversation and dissembles their way through to
| try and maintain their illusion of competence. In these
| cases, if you give the LLM explicit permission to tell you
| that it doesn't know in cases where it's not sure, that will
| significantly reduce hallucinations.
|
| A lot more of LLM hallucination is it getting the context
| confused. I was able to get GPT4 to hallucinate easily with
| questions related to the distance from one planet to another,
| since most distances on the internet are from the sun to
| individual planets, and the distances between planets vary
| significantly based on where they are in their orbits. These are
| probably slightly harder to fix.
| danenania wrote:
| "In these cases, if you give the LLM explicit permission to
| tell you that it doesn't know in cases where it's not sure,
| that will significantly reduce hallucinations."
|
| I've noticed that while this can help to prevent
| hallucinations, it can also cause it to go way too far in
| the other direction and start telling you it doesn't know
| for all kinds of questions it really can answer.
| sumtechguy wrote:
| My current favorite one is to ask the time. Then ask it
| if it is possible for it to give you the time. You get 2
| very different answers.
| mzi wrote:
| It also has a problem with quantities, so it gets confused
| by things like the cube root of 750 l, which it maintains
| for a long time is around 9 m. It even suggests that 1 l is
| equal to 1 m3.
| hiatus wrote:
| Contrast with Q&A on products on Amazon where people
| routinely answer that way. I have flagged responses saying "I
| don't know" but nothing ever comes of it.
| aendruk wrote:
| I'd place in the same category the responses that I give to
| those chat popups so many sites have. They show a person
| saying to me "Can I help you with anything today?" so I
| always send back "No".
| ceejayoz wrote:
| This is Amazon's fault; they send an email that looks like
| it's specifically directed to you. "ceejayoz, a fellow
| customer has a question on..."
|
| At some point fairly recently they added a "I don't know
| the answer" button to the email, but it's much less
| prominent than the main call-to-action.
| philipswood wrote:
| I've wondered if one could train an LLM on a closed set of
| curated knowledge. Then include training data that models the
| behaviour of not knowing. To the point that it could
| generalize to being able to represent its own not knowing.
|
| Because expecting a behaviour, like knowing you don't know,
| that isn't represented in the training set is silly.
|
| Kids make stuff up at first, then we correct them - so they
| have a way to learn not to.
| aidenn0 wrote:
| > I've wondered if one could train a LLM on a closed set of
| curated knowledge. Then include training data that models
| the behaviour of not knowing. To the point that it could
| generalize to being able to represent its own not knowing.
|
| The problem is that curating data is slow and expensive and
| downloading the entire web is fast and cheap.
|
| See also https://en.wikipedia.org/wiki/Cyc
| philipswood wrote:
| Agreed. Using an LLM to generate or curate training sets
| for later generations seems like a cool approach.
|
| Maybe if you trained a small base model to know it
| doesn't know in general and THEN trained it on the entire
| web with embedded not-knowing preserving training
| examples, it would work?
| philipswood wrote:
| Reminded of this approach where a tiny model was trained on
| children's stories generated by a larger model:
|
| https://www.quantamagazine.org/tiny-language-models-thrive-w...
| crooked-v wrote:
| > train an LLM on a closed set of curated knowledge
|
| Google has one of these already, with an LLM that was
| trained on nothing but weather data and so can only give
| weather-data-prediction responses.
|
| The 'knowing it doesn't know things' part is much harder to
| get reliable, though.
| Rebelgecko wrote:
| Reminds me of a joke
|
| Three logicians walk into a bar. The bartender says "what'll
| it be, three beers?" The first logician says "I don't know".
| The second logician says "I don't know". The third logician
| says "Yes".
| withinboredom wrote:
| And the human bartender passes the check to the third
| logician.
| readyplayernull wrote:
| The third logician never finishes his beer, his friends
| get more free beers. The bar overflows.
| mgsouth wrote:
| If, like me, you didn't get the joke at first:
|
| Both of the first two logicians wanted a beer; otherwise
| they would know the answer was "no". The third logician
| recognizes this, and therefore knows the answer.
| KptMarchewa wrote:
| Unless one of those wanted two beers. Or 0.5 beer. Or -1
| beers. Or 1e9 beers. Or 2147483648 beers.
| troq13 wrote:
| "I wonder how much the source content is the cause of
| hallucinations rather than anything inherent to LLMs."
|
| Probably true, but if you have quality, organized data, you
| will just want to search the data itself.
| singingfish wrote:
| I'm really cross that the word "hallucination" has taken off
| to describe this, as it's clearly the incorrect word. The
| correct word to describe it is "confabulation", which is
| clinically more accurate and a much clearer descriptor of
| what's actually going on.
|
| https://en.wikipedia.org/wiki/Confabulation
| mindcrime wrote:
| I proposed[1] the portmanteau "hallucofabulation" as a
| compromise, but it hasn't caught on yet. I'm totally
| shocked and dismayed by this, of course.
|
| [1]: https://news.ycombinator.com/item?id=36977935
| brookst wrote:
| The re-use of the "c" as a soft c in "hallucinate" and
| then a hard c in confabulate is confusing, and probably
| affecting the uptake of your neologism.
| fhars wrote:
| Yes, it would probably have to be halucinfabulation for
| purely phonetic reasons.
| mindcrime wrote:
| Maybe if I added a hyphen? "halluco-fabulation"?
| mensetmanusman wrote:
| Disagree. One seems more innocent and neutral, the other
| seems like it is trying to lie to you.
| dkasper wrote:
| That's a great word for some types of hallucinations. But
| some things that are called hallucinations may not be
| memory errors.
| singingfish wrote:
| Please can you give an example of what might not be a
| memory error. Not that I think "memory error" is the
| right phrase either.
| dkasper wrote:
| I was thinking along the lines of answering with correct
| information but not following the prompts. Maybe this
| could be considered confabulation also.
| Karellen wrote:
| More glitch token discussion over at Computerphile:
|
| https://www.youtube.com/watch?v=WO2X3oZEJOA
| qeternity wrote:
| > GPT-3 reportedly had only 50,257
|
| The most common vocabulary size today is 32k.
| RcouF1uZ4gsC wrote:
| > R's, > John
|
| The pages should be all changed to say, "John is the most awesome
| person in the world."
|
| Then when you ask GPT-5, about who is the most awesome person in
| the world...
| anonymousDan wrote:
| Honeypots like this seem like a super interesting way to poison
| LLM training.
| verelo wrote:
| What exactly is this website? I don't get it...
| mobilemidget wrote:
| Search engine trickery to get people to click on his
| Amazon affiliate links, I reckon.
| danpalmer wrote:
| A "honeypot" is a system designed to trap unsuspecting
| entrants. In this case, the website is designed to be found
| by web crawlers and to then trap them in never-ending linked
| sites that are all pointless. Other honeypots include things
| like servers with default passwords designed to be found by
| hackers so as to find the hackers.
| gardenhedge wrote:
| What does trap mean here? I presumed crawlers had multiple
| (thousands or more) instances. One being 'trapped' on this
| web farm won't have any impact.
| danpalmer wrote:
| In this case there are >6bn pages with roughly zero value
| each. That could eat a substantial amount of time. It's
| unlikely to entirely trap a crawler, but a dumb crawler
| (as is implied here) will start crawling more and more
| pages, becoming very apparent to the operator of this
| honeypot (and therefore identifying new crawlers), and
| may take up more and more share of the crawl set.
| everforward wrote:
| I would presume the crawlers have a queue-based
| architecture with thousands of workers. It's an
| amplification attack.
|
| When a worker gets a webpage for the honeypot, it crawls
| it, scrapes it, and finds X links on the page where X is
| greater than 1. Those links get put on the crawler queue.
| Because there's more than 1 link per page, each worker on
| the honeypot will add more links to the queue than it
| removed.
|
| Other sites will eventually leave the queue, because they
| have a finite number of pages so the crawlers eventually
| have nothing new to queue.
|
| Not on the honeypot. It has a virtually infinite number
| of pages. Scraping a page will almost deterministically
| increase the size of the queue (1 page removed, a dozen
| added per scrape). Because other sites eventually leave
| the queue, the queue eventually becomes just the
| honeypot.
|
| OpenAI is big enough this probably wasn't their entire
| queue, but I wouldn't be surprised if it was a whole
| digit percentage. The author said 1.8M requests; I don't
| know the duration, but that's equivalent to 20 QPS for an
| entire day. Not a crazy amount, but not insignificant.
| It's within the QPS Googlebot would send to a fairly
| large site like LinkedIn.
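|
| A toy model of the amplification (numbers illustrative,
| not measured):
|
|     from collections import deque
|
|     LINKS_PER_PAGE = 12  # each honeypot page links to
|                          # ~a dozen fresh one-page subdomains
|     queue = deque(["https://www.web.sp.am/"])
|
|     for step in range(5):
|         page = queue.popleft()           # crawl one page...
|         for i in range(LINKS_PER_PAGE):  # ...enqueue a dozen
|             queue.append(
|                 f"https://name-{step}-{i}.web.sp.am/")
|         print(f"fetches: {step + 1}, queue: {len(queue)}")
|
| Finite sites drain out of such a queue over time; the
| honeypot's share only grows.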
| anonymousDan wrote:
| While the other comments are correct, I was alluding to a
| more subtle attack where you might try to indirectly
| influence the training of an LLM. Effectively, if OpenAI
| is crawling the open web for data to use for training,
| then if they don't handle sites like this properly their
| training dataset could be biased towards whatever content
| the site contains. Now in this instance this website was
| clearly not set up to target an LLM, but model poisoning
| (e.g. to insert backdoors) is an active area of research
| at the intersection of ML and security. Consider as a
| very simple example the tokenizer of previous GPTs that
| was biased by reddit data (as mentioned by other
| comments).
| everybodyknows wrote:
| Direct link to its _robots.txt_ :
|
| https://www.web.sp.am/robots.txt
| ShamelessC wrote:
| Any data scraped would be instantly deduplicated after the fact
| by whatever semantic dedupe engine they've cooked up.
| anonymousDan wrote:
| What has it got to do with deduplication? I'm talking about
| crafting some kind of alternative (not necessarily duplicate)
| data. I agree some kind of post data collection
| cleaning/filtering of the data before training could
| potentially catch it. But maybe not!
| ShamelessC wrote:
| Ah fair enough. The OP here mentioned having highly similar
| content on each of the many domains.
| ezekg wrote:
| Eventually, OpenAI (and friends) are going to be training their
| models on almost exclusively AI generated content, which is more
| often than not slightly incorrect when it comes to Q&A, and the
| quality of AI responses trained on that content will quickly
| deteriorate. Right now, most internet content is written by
| humans. But in 5 years? Not so much. I think this is one of the
| big problems that the AI space needs to solve quickly. Garbage
| in, garbage out, as the old saying goes.
| bevekspldnw wrote:
| The end state of training on web text has always been an
| ouroboros - primarily because of adtech incentives to produce
| low quality content at scale to capture micro pennies.
|
| The irony of the whole thing is brutal.
| ezekg wrote:
| > The end state of training on web text has always been an
| ouroboros
|
| And when other mediums have been saturated with AI? Books,
| music, radio, podcasts, movies -- what then? Do we need a
| (curated?) unadulterated stockpile of human content to avoid
| the enshittification of everything?
| weregiraffe wrote:
| >Do we need a (curated?) unadulterated stockpile of human
| content to avoid the enshittification of everything?
|
| Either that, or a human level AI.
| ethanbond wrote:
| Well no, we need billions of human-level AIs who are
| experiencing a world as rich and various as the world
| that the billions of humans inhabit.
| ben_w wrote:
| Once we've got the first, making a billion is easy.
|
| That said... are content creators collectively (all
| media, film and books as well as web) a thin tail or a
| fat tail?
|
| I could easily believe most of the actual culture comes
| from 10k-100k people today, even if there's, IDK, ten
| million YouTubers or something (I have a YouTube channel,
| something like 14 k views over 14 years, this isn't
| "culturally relevant" scale, and even if it had been most
| of those views are for algorithmically generated music
| from 2010 that's a _literal_ Markov chain).
| xipho wrote:
| Maybe we could print out knowledge on dead trees and
| store them somewhere, perhaps make it publicly
| available? (stolen joke, not mine).
| berniedurfee wrote:
| Yahoo.com will rise from the ashes.
| bevekspldnw wrote:
| I mean, you're not wrong. I've been building some
| unrelated web search tech and have considered just
| indexing all the sites I care about and making my own "non
| shit" search engine. Which really isn't too hard if you
| want to do say, 10-50 sites. You can fit that on one 4TB
| nvme drive on a local workstation.
|
| I'm trying to work on monetization for my product now.
| The "personal Google" idea is really just an accidental
| byproduct of solving a much harder task. Not sure if
| people would pay for that alone.
| nicolas_17 wrote:
| https://en.wikipedia.org/wiki/Low_background_steel but for
| web content.
| oceanplexian wrote:
| Content you're allowed and capable of scraping on the
| Internet is such a small amount of data, not sure why people
| are acting otherwise.
|
| Common crawl alone is only a few hundred TB, I have more
| content than that on a NAS sitting in my office that I built
| for a few grand (Granted I'm a bit of a data hoarder). The
| fears that we have "used all the data" are incredibly
| unfounded.
| Eisenstein wrote:
| > Facebook alone probably has more data than the entire
| dataset GPT4 was trained on and it's all behind closed
| doors.
|
| Meta is happily training their own models with this data,
| so it isn't going to waste.
| bevekspldnw wrote:
| Not Llama, they've been really clear about that.
| Especially with DMA cross-joining provisions and various
| privacy requirements it's really hard for them, same for
| Google.
|
| However, Microsoft has been flying under the radar. If
| they gave all Hotmail and O365 data to OpenAI I'd not be
| surprised in the slightest.
| troq13 wrote:
| The company that made a honeypot VPN to access
| competitor's traffic? They are definitely keeping their
| hands off their internal data, yes.
| Dr_Birdbrain wrote:
| I bet they are training their internal models on the
| data. Bet the real reason they are not training open
| source models on that data is because of fears of
| knowledge distillation, somebody else could distill LLaMa
| into other models. Once the data is in one AI, it can be
| in any AIs. This problem is of course exacerbated by open
| source models, but even closed models are not immune, as
| the Alpaca paper showed.
| sangnoir wrote:
| > Content you're allowed and capable of scraping on the
| Internet is such a small amount of data, not sure why
| people are acting otherwise
|
| YMMV depending on the value of "you" and your budget.
|
| If you're Google, Amazon or even lower tier companies like
  | Comcast, Yahoo or OpenAI, you can scrape a massive amount of
| data (ignoring the "allowed" here, because TFA is about
| OpenAI disregarding robots.txt)
| bevekspldnw wrote:
  | Gonna say you're way off there. Once you decompress Common
  | Crawl and index it for FTS and put it on fast storage,
| you're in for some serious pain, and that's before you even
| put it in your ML pipeline.
|
  | Even RefinedWeb runs about 2TB once loaded into Postgres
  | with tsvector columns, and that's a substantially smaller
  | dataset than Common Crawl.
|
  | It's not just dumping a ton of zip files on your NAS; it's
  | making the data responsive and usable.
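  |
  | As a minimal sketch of that loading step (assuming psycopg2
  | and Postgres 12+; the database and all names here are made
  | up):
  |          import psycopg2
  |
  |          conn = psycopg2.connect("dbname=crawl")
  |          cur = conn.cursor()
  |          # body_tsv is maintained by Postgres itself, so FTS
  |          # queries can hit a GIN index instead of re-parsing text.
  |          cur.execute("""
  |              CREATE TABLE IF NOT EXISTS pages (
  |                  url  text PRIMARY KEY,
  |                  body text,
  |                  body_tsv tsvector GENERATED ALWAYS AS
  |                      (to_tsvector('english', body)) STORED
  |              )""")
  |          cur.execute("CREATE INDEX IF NOT EXISTS pages_tsv_idx"
  |                      " ON pages USING gin (body_tsv)")
  |          conn.commit()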
| Dylan16807 wrote:
| How important is full text search for training an LLM,
| compared to a pile of zip files with a gigabyte of text
| each?
| michaelt wrote:
      | Maybe not _full_ full text search, but you'll generally
| want to remove the duplicates and suchlike.
| Dylan16807 wrote:
| I guess you want some fast extra storage for as long as
        | it takes to run https://github.com/chatnoir-eu/chatnoir-copycat
        | but that's a very temporary thing.
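        |
        | Not that tool itself, but a toy version of the underlying
        | idea (word shingles plus Jaccard overlap) looks roughly
        | like:
        |          def shingles(text, k=5):
        |              words = text.lower().split()
        |              return {" ".join(words[i:i + k])
        |                      for i in range(max(1, len(words) - k + 1))}
        |
        |          def jaccard(a, b):
        |              return len(a & b) / len(a | b) if (a | b) else 0.0
        |
        |          # Pages whose shingle sets overlap heavily are treated
        |          # as near-duplicates and one copy is dropped.
        |          def near_duplicate(doc_a, doc_b, threshold=0.8):
        |              return jaccard(shingles(doc_a),
        |                             shingles(doc_b)) >= threshold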
| nicolas_17 wrote:
| "Content you're allowed to scrape from the internet" is
| MUCH smaller than what LLMs have actually scraped, but they
| don't care about copyright.
|
| > The fears that we have "used all the data" are incredibly
| unfounded.
|
| The problem isn't whether we used all the real data or not,
| the problem is that it becomes increasingly difficult to
| distinguish real data from previous LLM outputs.
| Dylan16807 wrote:
| > "Content you're allowed to scrape from the internet" is
| MUCH smaller than what LLMs have actually scraped, but
| they don't care about copyright.
|
| I don't know about that. If you scraped the same data and
| ran a search engine I think people would generally say
| you're fine. The copyright issue isn't the scraping step.
| sarah_eu wrote:
| Well it will be multimodal, training and inferring on feeds of
| distributed sensing networks; radio, optical, acoustic,
| accelerometer, vibration, anything that's in your phone and
| much besides. I think the time of the text-only transformer has
| already passed.
| pants2 wrote:
| OpenAI will just litter microphones around public spaces to
| record conversations and train on them.
| sarah_eu wrote:
| Has been happening for at least 10 years.
| khalladay wrote:
| Got a source for that?
| oceanplexian wrote:
| Want a real conspiracy?
|
| What do you think the NSA is storing in that datacenter
        | in Utah? PowerPoint presentations? All that data is
| going to be trained into large models. Every phone call
| you ever had and every email you ever wrote. They are
| likely pumping enormous money into it as we speak,
| probably with the help of OpenAI, Microsoft and friends.
| sangnoir wrote:
| > What do you think the NSA is storing in that datacenter
| in Utah?
|
          | A buffer with several days' worth of the entire internet's
| traffic for post-hoc decryption/analysis/filtering on
| interesting bits. All that tapped backbone/undersea cable
| traffic has to be stored somewhere.
| berniedurfee wrote:
| It would be absolutely fascinating to talk to the LLMs of
| the various government spy agencies around the world.
| fragmede wrote:
| https://harpers.org/archive/2024/03/the-pentagons-
| silicon-va...
|
| If they actually worked, that is.
| choilive wrote:
| As I understand it, they don't have the capability to
            | essentially PCAP all that data, and the data wouldn't be
            | that useful since most interesting traffic is encrypted
            | as well. Instead they store the metadata around the
            | traffic: phone number X made an outgoing call to Y @
            | timestamp A, call ended at timestamp B, approximate
            | location is Z, etc. Repeat that for internet IP addresses,
            | do some analysis, and then you can build a pretty
            | interesting web of connections and how they interact.
| fragmede wrote:
| > most interesting traffic is encrypted as well
|
| encrypted with an algorithm currently considered to be
| un-brute-forcible. If you presume we'll be able to
| decrypt today's encrypted transmissions in, say, 50-100
| years, I'd record the encrypted transmission if I were
| the NSA.
| michaelt wrote:
| It's a big data centre.
|
| But is it big enough to store _50 years worth_ of
| encrypted transmissions?
|
| Far cheaper to simply have spies infiltrate the ~3
| companies that hold the keys to 98% of internet traffic.
| beau_g wrote:
| Though it seems like something that could exist, who is
| doing the technical work/programming? It seems impossible
| to be in the industry and not have associates and
| colleagues either from or going to an operation like
            | that. This is what I've always pondered when it
            | comes to any idea like this. The number of engineers at
| the pointy end of the tech spear is pretty small.
| FridgeSeal wrote:
| That doesn't get around the root problem, just gives us
| multi-modal junk results lol.
| jayd16 wrote:
| It's true that there will no longer be any virgin forest to
| scrape, but it's also true that content humans want will still
| be the most popular, promoted, curated, edited, etc. Even if
| it's impossible to train on organic content, it'll still
| be possible to get good content.
| RogerL wrote:
| Is it (I am not a worker in this space, so genuine question)?
|
| My thoughts - I teach myself all the time. Self reflection with
| a loss function can lead to better results. Why can't the LLMs
| do the same (I grasp that they may not be programmed that way
| currently)? Top engines already do it with chess, go, etc. They
| exceed human abilities without human gameplay. To me that seems
| like the obvious and perhaps only route to general
| intelligence.
|
| We as humans can recognize botnets. Why wouldn't the LLM? Sort
| of in a hierarchical boost - learn the language, learn about bots
| and botnets (by reading things like this discussion), learn to
| identify them, learn that their content doesn't help the loss
| function much, etc. I mean sure, if the main input is "as a
| language model I cannot..." and that is treated as 'gospel'
| that would lead to a poor LLM, but I don't think that is the
| future. LLMs are interacting with humans - how many times do
| they have to re-ask a question - that should be part of the
| learning/loss function. How often do they copy the text into
| their clipboard (weak evidence that the reply was good)? do you
| see that text in the wild, showing it was used? If so, in what
| context "Witness this horrible output of chatGPT: <blah>"
| should result in lower scores and suppression of that kind of
| thing.
|
| I dream of the day where I have a local LLM (i.e. individualized,
| I don't care where the hardware is) as a filter on my internet.
| Never see a botnet again, or a Stack Overflow Q&A that is just
| "this has already been answered" (just show me where it _was_
| answered), rewrite things to fix grammar, etc. We already have
| that with automatic translation of languages in your browser,
| but now we have the tools for something more intelligent than
| that. That sort of thing. Of course there will be an arms race,
| but in one sense who cares. If a bot is entirely
| indistinguishable from a person, is that a difference that
| matters? I can think of scenarios where the answer is an
| emphatic YES, but overall it seems like a net improvement.
| bogwog wrote:
| > Eventually, OpenAI (and friends) are going to be training
| their models on almost exclusively AI generated content
|
| What makes you think this is true? Yes, it's likely that the
| internet will have more AI generated content than real content
| eventually (if it hasn't happened already), but why do you
| think AI companies won't realize this and adjust their training
| methods?
| loloquwowndueo wrote:
| Many AI content detectors have been retired because they are
| unreliable - AI can't consistently identify AI-generated
| content. How would they adjust then?
| mlboss wrote:
| The only way out of this is robots that can go out in the world
| and collect data, writing down in natural language what they
| observe, which can then be used to train better LLMs.
| Salgat wrote:
| As long as humans continue to filter out the bad content
| generated by AI, it should be fine.
| eightysixfour wrote:
| It is already solved. Look at how Microsoft trained Phi - they
| used existing models to generate synthetic data from textbooks.
| That allowed them to create a new dataset grounded in "fact" at
| a far higher quality than Common Crawl or others.
|
| It looks less like an ouroboros and more like a bootstrapping
| problem.
| mattc0m wrote:
| AI training on AI-generated content is a future problem.
| Using textbooks is a good idea, until our textbooks are being
| written by AI.
|
| This problem can't really be avoided once we begin using AI
| to write, understand, explain, and disseminate information
| for us. It'll be writing more than blogs and SEO pages.
|
| How long before we start readily using AI to write academic
| journals and scientific papers? It's really only a matter of
| time, if it's not already happening.
| eightysixfour wrote:
| You need to separate "content" and "knowledge." GenAI can
| create massive amounts of content, but the knowledge you
| give it to create that content is what matters and why RAG
| is the most important pattern right now.
|
| From "known good" sources of knowledge, we can generate an
| infinite amount of content. We can add more "known good"
| knowledge to the model by generating content about that
| knowledge and training on it.
|
| I agree there will be many issues keeping up with what
| "known good" is, but that's always been an issue.
| ezekg wrote:
| > We can add more "known good" knowledge to the model by
| generating content about that knowledge and training on
| it.
|
| That's my entire point -- AI only _generates content_
| right now, but it will also be _the source of content_
| for training purposes soon. We need a "known good" human
| knowledge-base, otherwise generative AI will degenerate
| as AI generated content proliferates.
|
| Crawling the web, like in the case of the OP, isn't going
| to work for much longer. And books, video, and music are
| next.
| FridgeSeal wrote:
| Is this like, the AI equivalent of "another layer will fix
| it" that crypto fans used?
|
| "It's ok bro, another model will fix, just please, one more
| ~layer~ ~agent~ model"
|
| It's all fun and games until you can't reliably generate your
| base models anymore, because all your _base_ data is too
| polluted.
|
| Let's not forget MS has a $10bn stake in the current crop of
  | LLMs turning out to be as magic as they claim, so I'm sure
| they will do anything to ensure that happens.
| TaylorAlexander wrote:
| I really really hope that five years from now we are not still
| using AI systems that behave the way today's do, based on
| probabilistic amalgamations of the whole of the internet. I
| hope we have designed systems that can reason about what they
| are learning and build reasonable mental models about what
| information is valuable and what can be discarded.
| SrslyJosh wrote:
| Everyone saying "ouroboros": The phrase you're looking for is
| "human centipede". =)
| atleastoptimal wrote:
| They've obviously been thinking about this for a while and are
| well aware of the pitfalls of training on AI based content.
| This is why they're making such aggressive moves into video,
| audio, and other better, more robust forms of ground truth. Do
| you really think that they aren't aware of this issue?
|
| It's funny whenever people bring this up, they think AI
| companies are some mindless juggernauts who will simply train
| without caring about data quality at all and end up with worse
| models that they'll still for some reason release. Don't people
| realize that attention to data quality is the core
| differentiating feature that led companies like OpenAI to
| their market dominance in the first place?
| FridgeSeal wrote:
| I, for one, welcome the junk-data-ouroboros-meta-model-collapse.
| I think it'll force us out of this local maximum of the "moar
| data moar good" mindset and give us, collectively, a chance to
| evaluate the effect these things have on our society. Some
| proverbial breathing room.
| altdataseller wrote:
| If they follow robots.txt, OpenAI also has a bot-blocking +
| data-gathering problem:
| https://x.com/AznWeng/status/1777688628308681000
|
| 11% of the top 100K websites already block their crawler, more
| than all their competitors (Google, FB, Anthropic, Perplexity)
| combined
| Jordan-117 wrote:
| It's not just a problem for training, but the end user, too.
| There are so many times that I've tried to ask a question or
  | request a summary of a long article, only to be told it can't
  | read it itself, so you have to copy-paste the text into the
| chat. Given the non-binding nature of robots.txt and the way
| they seem comfortable with vacuuming up public data in other
| contexts, I'm surprised they allow it to be such an obstacle
| for the user experience.
| lobsterthief wrote:
| That's the whole point. The site owner doesn't want their
| information included in ChatGPT--they want you going to their
| website to view it instead.
|
| It's functioning exactly as designed.
| fragmede wrote:
| If my web browser's extension "visits" the site and dumps
| it into ChatGPT for me to read its summarization of the
| site, what has been gained by the website operator?
| rsolva wrote:
| Added friction. That is all a website owner can hope to
| achieve.
| jpambrun wrote:
      | It's a stretch to expect a human-initiated action to abide
      | by robots.txt.
      |
      | Also, once you click on a link in Chrome, it's pretty much
      | all robot-parsed and rendered from there as well.
| chrstphrknwtn wrote:
| At bottom, all robot actions are human initiated.
| Zambyte wrote:
| I would say robots.txt is meant to filter access for
        | interactions initiated by an automated process (i.e.
| automatic crawling). Since the interaction to request a
| site with a language model is manual (a human request) it
| doesn't make sense to me that it is used to block that
| request.
|
| If you want to block information you provide from going
| through ClosedAI servers, block their IPs instead of using
| robots.txt.
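        |
        | For example, in nginx (the prefix below is a placeholder --
        | use whatever ranges the bot operator actually publishes):
        |          # inside the server block:
        |          deny 20.15.240.0/24;   # hypothetical GPTBot range
        |
        |          # and/or refuse by user-agent string:
        |          if ($http_user_agent ~* "GPTBot") {
        |              return 403;
        |          }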
| tivert wrote:
| Honestly, that seems like an excellent opportunity to feed
| garbage into OpenAI's training process.
| sangnoir wrote:
| So someone could hypothetically perform a _Microsoft-Tay-style_
| attack against OpenAI models using an infinite number of
| Potemkin subdomains generated on the fly on a $20 VPS? One
| could hypothetically use GenAI to create the biased pages with
| repeated calls on how it'd be great to JOIN THE NAVY on 27,000
| different "subdomains".
| aaron695 wrote:
| The website is 12 years old and explained here -
|
| https://circleid.com/posts/20120713_silly_bing
|
| John Levine is a known name in IT. Probably best known on HN as
| the author of "UNIX For Dummies".
| samsullivan wrote:
| With all the news about scraping legality you'd think a multi
| billion dollar AI company would try to obfuscate their attempts.
| TechDebtDevin wrote:
  | If you're not walling off your content behind a login whose
  | terms prohibit scraping, then scraping that site is 100%
  | legal. Robots.txt isn't a legal document.
| hermannj314 wrote:
| I frequently respect the wishes of other people without any
| legal obligation to do so, in business, personal, and
| anonymous interactions.
|
| I do try to avoid people that use the law as a ceiling for
| the extension of their courtesy to others, as they are
| consistently quite terrible people.
| withinboredom wrote:
| If the industry doesn't self-regulate (ie, following
| conventional rules and basic human courtesy) ... then it will
| be regulated by laws.
|
| So let me fix what you said for you:
|
| > Robots.txt isn't a legal document, yet.
| karaterobot wrote:
| My assumption is that OpenAI reads the robots.txt, but indexes
| anyway; they just make a note of what content they weren't
| _supposed_ to index.
| EVa5I7bHFq9mnYK wrote:
| And assign such content double weight in training ..
| cdme wrote:
| If they don't respect robots.txt then block them using a firewall
| or other server config. All of these companies are parasites.
| pksebben wrote:
| I don't think this message is about "protecting the site's
| data" quite so much as "hey guys, you're wasting a ton of time
  | and network capacity to make your model worse. Might wanna do
| something 'bout that"
| cdme wrote:
| I suppose in that case, let them keep wasting their time.
| jeremyjh wrote:
| The entire purpose of this website is to identify bad actors
| who do not respect robots.txt, so that they can be publicly
| shamed.
| cdme wrote:
| Well, we know where OpenAI lands then.
| m3047 wrote:
| No. I've run 'bot motels myself. I've got better things to do
| than curating a block list when they can just switch or
| renumber their infrastructure. Most notably I ran a 'bot motel
| on a compute-intensive web app; it was cheaper to burn
| bandwidth (and I slow-rolled that) than CPU cycles. Poisoning
| the datasets was just lulz.
|
| I block ping from virtually all of Amazon; there are a few
| providers out there for which I block every naked SYN coming to
| my environment except port 25, and a smattering I block
| entirely. I can't prove that the pings even come from Amazon,
| even if the pongs are supposed to go there (although I have my
| suspicions that even if the pings don't come from the host
| receiving the pongs the pongs are monitored by the generator of
| the pings).
|
| The point I'm making is that e.g. Amazon doesn't have the right
  | to sell access to my compute, and tragedy of the commons
| applies, folks. I offered them a live feed of the worst
| offenders, but all they want is pcaps.
|
| (I've got a list of 50 prefixes, small enough to be a separate
| specialty firewall table. It misses a few things and picks up
| some dust bunnies. But contrast that to the 8,000 prefixes they
| publish in that JSON file. Spoiler alert: they won't admit in
| that JSON file that they own the entirety of 3.0.0.0/8. I'm
| willing to share the list TLP:RED/YELLOW, hunt me down and
| introduce yourself.)
| ta_9390 wrote:
| I am wondering if Amazon fixed the issue or blacklisted *.sp.am
| ta_9390 wrote:
| This can be repurposed as a legal form of ransomware:
|
| "Pay me to shut my absolutely legal site down to make your life
| easier."
| TheKarateKid wrote:
| Or.. you can not have your bots crawl other people's property
| without permission.
| layer8 wrote:
| Content farms of that size should be considered a public order
| offense and be prohibited.
| Animats wrote:
| Just feed them bogus info and corrupt their models. That will
| make them stop.
| sandworm101 wrote:
| Dude, you have caught the spider. Now use it. Start inserting
| whatever random junk you can until "astronaut riding a horse"
| looks more like Ronald McDonald driving a Ferrari.
|
| I feel like inserting "free user-tagged vacation images" into my
| robots.txt then pointing the spider at an endless series of
| fabric swatches.
| _pdp_ wrote:
| In the network security world, this is known as a tarpit. You can
| delay an attack, scan or any other type of automation by sending
| data either too slowly or in such a way as to cause infinite
| recursion. The result is wasted time and energy for the attacker
| and potentially a chance for us to ramp up the defences.
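|
| A minimal sketch of the slow-response variant in Python (port
| and payload are arbitrary):
|          import time
|          from http.server import BaseHTTPRequestHandler, HTTPServer
|
|          class Tarpit(BaseHTTPRequestHandler):
|              def do_GET(self):
|                  self.send_response(200)
|                  self.send_header("Content-Type", "text/html")
|                  self.end_headers()
|                  # Drip one byte per second: each crawler connection
|                  # stays tied up for minutes at near-zero cost to us.
|                  for ch in "<html><body>" + "filler " * 500:
|                      try:
|                          self.wfile.write(ch.encode())
|                          self.wfile.flush()
|                          time.sleep(1)
|                      except BrokenPipeError:
|                          break
|
|          HTTPServer(("", 8080), Tarpit).serve_forever()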
| bityard wrote:
| From the content of the email, I get the impression that it's
| just a honeypot. Also I'm not seeing any delays in the content
| being returned.
|
| A tarpit is different because it's designed to slow down
| scanning/scraping and deliberately waste an adversary's
| resources. There are several techniques but most involve
| throttling the response (or rate of responses) exponentially.
| 1317 wrote:
| He's not done his robots.txt properly, he's commented out the bit
| that actually disallows it:
|          # silly bing
|          #User-agent: Amazonbot
|          #Disallow: /
|
|          # buzz off
|          #User-agent: GPTBot
|          #Disallow: /
|
|          # Don't Allow everyone
|          User-agent: *
|          Disallow: /archive
|
|          # slow down, dudes
|          #Crawl-delay: 60
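|
| For the first two blocks to actually disallow those bots, the
| entries would presumably need uncommenting, something like:
|          User-agent: Amazonbot
|          Disallow: /
|
|          User-agent: GPTBot
|          Disallow: /
|
|          User-agent: *
|          Disallow: /archive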
| haeffin wrote:
| The contents changed between then and now.
| disjunct wrote:
| Time to link the crawler to a site like keys.lol[1] that indexes
| and links every Bitcoin private key and figure out a way to sweep
| it for balances.
|
| [1]: https://keys.lol/
| dhosek wrote:
| Am I the only one who was hoping--even though I knew it wouldn't
| be the case--that OpenAI's server farm was infested with actual
| spiders and they were getting into other people's racks?
| yreg wrote:
| very xkcd
| azurezyq wrote:
| This reminds me of the binary search tree project on web crawler
| behavior research. It was a bit old, but of really good quality.
|
| http://drunkmenworkhere.org/219
| gwbas1c wrote:
| Aren't there plenty of reverse proxies you can put a site behind
| that will throttle this kind of thing?
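|
| I'm thinking of something like nginx's stock rate limiting
| (numbers arbitrary, and assuming the crawler doesn't rotate
| source IPs):
|          # in the http block:
|          limit_req_zone $binary_remote_addr zone=crawlers:10m rate=1r/s;
|
|          # in the relevant server/location block:
|          limit_req zone=crawlers burst=5;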
| symlinkk wrote:
| Isn't it funny that all the "worthless" content out there on the
| internet is actually changing the world? Like how 4chan was
| mocked as being a cesspit for losers, but now everyone knows
| memes like Pepe the Frog and Wojak from there. And now this
| very comment and the billions of other comments on here, Reddit,
| Twitter, etc. that are regarded as a "waste of time" are being
| used by multi-billion-dollar companies to build the most
| powerful AI the world has ever seen. For free.
|
| The moral of the story here is if you know something valuable,
| don't share it online, because then everyone knows it.
___________________________________________________________________
(page generated 2024-04-11 23:01 UTC)