[HN Gopher] Anyone got a contact at OpenAI. They have a spider p...
___________________________________________________________________
Anyone got a contact at OpenAI. They have a spider problem
Author : speckx
Score : 498 points
Date : 2024-04-11 13:34 UTC (9 hours ago)
(HTM) web link (mailman.nanog.org)
(TXT) w3m dump (mailman.nanog.org)
| egberts1 wrote:
| There is always IP filtering, DNS blocking, and HTTP agent
| screening. Just sayin'.
| unsupp0rted wrote:
| > Before someone tells me to fix my robots.txt, this is a
| content farm so rather than being one web site with
| 6,859,000,000 pages, it is 6,859,000,000 web sites each with
| one page.
| apocalyptic0n3 wrote:
| The reason that bit is relevant is that robots.txt is only
| applicable to the current domain. Because each "page" is a
| different subdomain, the crawler needs to fetch the
| robots.txt for every single page request.
|
| What the poster was suggesting is blocking them at a higher
| level - e.g. a user-agent block in an .htaccess or an IP
| block in iptables or similar. That would be a one-stop fix.
| It would also defeat the purpose of the website, however,
| which is to waste the time of crawlers.
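|
| A minimal sketch of that kind of user-agent block, assuming
| Apache with mod_rewrite enabled (the agent string and rule
| are illustrative, not the OP's actual config):
|
|     # return 403 for any request whose User-Agent
|     # contains "GPTBot", before the page is ever served
|     RewriteEngine On
|     RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
|     RewriteRule .* - [F,L]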
| matt_heimer wrote:
| The real question is how is GPTBot finding all the other
| subdomains? Currently the sites have GPTBot disallowed:
| https://www.web.sp.am/robots.txt
|
| If GPTBot is compliant with the robots.txt specification
| then it can't read the URL containing the HTML to find the
| other subdomains.
|
| Either:
|
| 1. GPTBot treats a disallow as a noindex but still requests
| the page itself. Note that Google doesn't treat a disallow
| as a noindex. They will still show your page in search
| results if they discover the link from other pages, but
| they show it with a "No information is available for this
| page." disclaimer.
|
| 2. The site didn't have a GPTBot disallow until they
| noticed the traffic spike, and the bot has already
| discovered a couple million links that need to be crawled.
|
| 3. There is some other page out there on the internet that
| GPTBot discovered that links to millions of these
| subdomains. This seems possible, and the subdomains really
| don't have any way to prevent a bot from requesting
| millions of robots.txt files. The only prevention here is
| to firewall the bot's IP range or work with the bot owners
| to implement better subdomain handling.
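|
| A quick way to check what the spec implies, using Python's
| stdlib robots.txt parser (a sketch; the live file may have
| changed since this thread):
|
|     import urllib.robotparser
|
|     rp = urllib.robotparser.RobotFileParser(
|         "https://www.web.sp.am/robots.txt")
|     rp.read()
|     # With "Disallow: /" for GPTBot, nothing on this host
|     # may be fetched (robots.txt itself is always fair
|     # game), so a compliant bot never sees the outbound
|     # links on the page.
|     print(rp.can_fetch("GPTBot", "https://www.web.sp.am/"))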
| fl7305 wrote:
| I think he's saying that it's not a problem for him, but for
| OpenAI?
| unnouinceput wrote:
| Yup, that's my impression as well. He's just being nice by
| letting OpenAI know they have a problem. Usually this should
| be rewarded with a nice "hey, u guys have a bug" bounty,
| because not long ago some VP from OpenAI was lamenting that
| the cost of training their AI is, and it's his direct quote,
| "eye watering" (the order was millions of $$ per second).
| joha4270 wrote:
| I would be a little sceptical about that figure. 3 million
| dollars per second is around the entire world's annual GDP.
|
| I get it, AI training is expensive, but I don't believe
| it's that expensive
| tasuki wrote:
| Thank you for that perspective. I always appreciate it
| when people put numbers like these in context.
|
| Also 1 million per second is 60 million per minute is 3.6
| billion per hour is 86.4 billion per day. It's about one
| whole value of FAANG per month...
| samspenc wrote:
| Sam Altman has said in a few interviews that it was
| around $100 million for GPT-3, and higher for GPT-4.
|
| But yes, this is a one-time cost, and far lower than the
| "millions of dollars per second" in GP comment.
|
| https://fortune.com/2024/04/04/ai-training-costs-how-much-is...
|
| https://www.wired.com/story/openai-ceo-sam-altman-the-age-of...
| mapmeld wrote:
| Based on the page footer ("IECC ChurnWare") I believe this
| site is designed to waste the time of web crawlers and tools
| that try to get root access on every domain. The robots.txt
| looks like this:
| https://ulysses-antoine-kurtis.web.sp.am/robots.txt
|
| I don't see how this does much to keep bad actors away from
| other domains, but I can see why they don't want to give up the
| game for OpenAI to stop crawling.
| ajuc wrote:
| I mean that's your chance to train the SkyNet, take it :)
| dr_kiszonka wrote:
| That's an interesting attack vector, isn't it?
| throwaway35777 wrote:
| Single handedly extending the lifespan of all humans by however
| long it takes them to crawl 6,859,000,000 pages.
| mypastself wrote:
| Frankly, I didn't get the purpose of the website at first either.
| I guess I have an arachnid intellect.
| aendruk wrote:
| Arachnid here... What am I looking at? The intent is to waste
| the resources of crawlers by just making the web larger?
| quadrature wrote:
| It seems like that, but they're also concerned about the
| crawlers that they catch in this web. So it seems like
| they're trying to help make crawlers better? Or they're just
| generally curious about what systems are crawling around.
| pvg wrote:
| _What am I looking at?_
|
| I'd say go ahead and inject it with digestive enzymes and
| then report findings.
| YeGoblynQueenne wrote:
| No no. First tightly wrap it in silk. _Then_ digestive
| enzymes. Good hygiene, eh?
| pvg wrote:
| Largely a myth spread by Big Silk!
| troq13 wrote:
| You had the sense not to spend the rest of your days crawling
| it though.
| frizlab wrote:
| I'd let them do their thing, why not?! They want the internet?
| This is the real internet. It looks like he doesn't really care
| that much that they're retrieving millions of pages, so let them
| do their thing...
| gtirloni wrote:
| _> It looks like he doesn't really care that much that they're
| retrieving millions of pages_
|
| It impacts the performance for the other legitimate users of
| that web farm ;)
| jeremyjh wrote:
| Some scrapers respect robots.txt. OpenAI doesn't. SP is just
| informing the world at large of this fact.
| cactusplant7374 wrote:
| The CTO isn't even aware of where the data is coming from
| (allegedly).
| Tubbe wrote:
| (admittedly)
| Zambyte wrote:
| The (allegedly) implies they do know, but to avoid possible
| litigation they feign ignorance. The CTO of ClosedAI is
| probably not a complete idiot.
| whoami_nr wrote:
| That's what the whole thing is about. He is complaining that
| they don't respect robots.txt.
| alberth wrote:
| Isn't the entire point of this type of website to waste
| spider time/resources?
|
| Why do they want to not do that for OpenAI?
| ganzuul wrote:
| Might one day come looking for who lobotomized it?
| dimask wrote:
| And? What are they gonna do about it (apart from making such
| a person/website momentarily famous)?
| tasuki wrote:
| Have you not heard of Roko's Basilisk?
| withinboredom wrote:
| Thanks for dooming everyone who reads this comment.
| nojvek wrote:
| I didn't know about Roko's Basilisk until today.
|
| https://www.youtube.com/watch?v=ut-zGHLAVLI
| https://en.wikipedia.org/wiki/Roko%27s_basilisk
|
| It's amazing that we can think about and contemplate
| idea viruses.
| ganzuul wrote:
| Give you bad search results.
| RobotToaster wrote:
| Do androids dream of electric farms?
| imnotreallynew wrote:
| Isn't the legality of web scraping still... disputed?
|
| There's been a few projects I've wanted to work on involving
| scraping, but the idea that the entire thing could be shut down
| with legal threats seems to make some of the ideas infeasible.
|
| It's strange that OpenAI has created a ~$80B company (or whatever
| it is) using data gathered via scraping and as far as I'm aware
| there haven't been any legal threats.
|
| Was there some law that was passed that makes all web scraping
| legal or something?
| foobarian wrote:
| Why would it not be legal? Was there a law passed that makes it
| illegal?
| brushfoot wrote:
| Web scraping the public Internet is legal, at least in the U.S.
|
| hiQ's public scraping of LinkedIn was ruled to be within their
| rights and not a violation of the CFAA. I imagine that's why
| LinkedIn has almost everything behind an auth wall now.
|
| Scraping auth-walled data is different. When you sign up, you
| have to check "I agree to the terms," and the terms generally
| say, "You can't scrape us." So, you can't just make a million
| bot accounts that take an app's data (legally, anyway). Those
| EULAs are generally legally enforceable in the U.S.
|
| Some sites have terms at the bottom that prohibit scraping--but
| my understanding is that those aren't generally enforceable if
| the user doesn't have to take any action to accept or
| acknowledge them.
| bena wrote:
| hiQ was found to be in violation of the User Agreement in the
| end.
|
| Basically, in the end, it was essentially a breach of
| contract.
| brushfoot wrote:
| Exactly, that was my point.
|
| hiQ's public scraping was found to be legal. It was the
| logged-in scraping that was the problem.
|
| The logged-in scraping was a breach of contract, as you
| said.
|
| The former is fine; the latter is not.
|
| What OpenAI is doing here is the former, which companies
| are perfectly within their rights to do.
| darby_eight wrote:
| > Scraping auth-walled data is different. When you sign up,
| you have to check "I agree to the terms," and the terms
| generally say, "You can't scrape us." So, you can't just make
| a million bot accounts that take an app's data (legally,
| anyway). Those EULAs are generally legally enforceable in the
| U.S.
|
| They're legally enforceable in the sense that the scraped
| services generally reserve the right to terminate the
| authorizing account at will, or legally enforceable in that
| allowing someone to scrape you with your credentials (or
| scraping using someone else's) qualifies as violating the
| CFAA?
| withinboredom wrote:
| Most of these SaaSes have a "firehose" that you can
| subscribe to if you are big enough (aka, can handle the
| firehose).
| These are like RSS feeds on crack for their entire SaaS.
|
| - https://developer.twitter.com/en/docs/twitter-api/enterprise...
|
| - https://developer.wordpress.com/docs/firehose/
| gtirloni wrote:
| _> It's strange that OpenAI has created a ~$80B company (or
| whatever it is) using data gathered via scraping_
|
| Like Google and many others.
| bena wrote:
| Scraping is legal. Always has been, always will be. Mainly
| because there's some fuzz around the edges of the definition.
| Is a web browser a scraper? It does a lot of the same things.
|
| IIRC LinkedIn/Microsoft was trying to sue a company based on
| Computer Fraud and Abuse Act violations, claiming they were
| accessing information they were not allowed to. Courts ruled
| that that was bullshit. You can't put up a website and say "you
| can only look at this with your eyes". Recently-ish, they were
| found to be in violation of the User Agreement.
|
| So as long as you don't have a user account with the site in
| question or the site does not have a User Agreement prohibiting
| scraping, you're golden.
|
| The problem isn't the scraping anyway, it's the reproduction of
| the work. In that case, it really does matter how you acquired
| the material and what rights you have with regards use of that
| material.
| nutrie wrote:
| Scraping publicly available data from websites is no different
| from web browsing, period. Companies stating otherwise in their
| T&Cs are a joke. Copyright infringement is a different game.
| dspillett wrote:
| The issue often isn't the scraping, it is often how you use the
| information scraped afterwards. A lot of scraping is done with
| no reference to any licensing information the sites being read
| might publish, hence image making AI models having regurgitated
| chunks of scraped stock images complete with watermarks. Though
| the scraping itself can count as a DoS if done aggressively
| enough.
| reaperman wrote:
| There's currently only one situation where scraping is almost
| definitely "not legal":
|
| If the information you're scraping requires a login, and if in
| order to get a login you have to agree to a terms of service,
| and that terms of service forbids you from scraping -- then you
| could have a bad day in civil court if the website you're
| scraping decides to sue you.
|
| If the data is publicly accessible without a login then
| scraping is 99% safe with no legal issues, even if you ignore
| robots.txt. You might still end up in court if you found a way
| to correctly guess non-indexed URLs[0] but you'd probably
| prevail in the end (...probably).
|
| The "purpose" of robots.txt is to let crawlers know what they
| can do without getting ip-banned by the website operator that
| they're scraping. Generally crawlers that ignore robots.txt and
| also act more like robots than humans, will get an IP ban.
|
| 0: https://www.troyhunt.com/enumerationis-enumerating-resources...
| ToucanLoucan wrote:
| Also worth noting there's a long history of companies with
| deep pockets getting away with murder (sometimes literally)
| because litigation in a system that costs money to engage
| with inherently favors the wealthier party.
|
| Also OpenAI's entire business model is relying on generous
| interpretations of various IP laws, so I suspect they already
| have a mature legal division to handle these sorts of
| potential issues.
| observationist wrote:
| The 9th Circuit Court of Appeals found that scraping publicly
| accessible content on the internet is legal.
|
| If you publish something on a publicly served internet page,
| you're essentially broadcasting it to the world. You're putting
| something on a server which specifically communicates the bits
| and bytes of your media to the person requesting it without
| question.
|
| You have every right to put whatever sort of barrier you'd like
| on the server, such as a sign in, a captcha, a puzzle, a
| cryptographic software key exchange mechanism, and so on. You
| could limit the access rights to people named Sam, requiring
| them to visit a particular real world address to provide
| notarized documentation confirming their identity in exchange
| for a unique 2fa fob and credentials for secure access (call it
| The Sams Club, maybe?)
|
| If you don't put up a barrier, and you configure the server to
| deliver the content without restriction, or put your content on
| a server configured as such, then you are implicitly
| authorizing access to your content.
|
| Little popups saying "by visiting this site, you agree to blah
| blah blah" are not valid. Courts made the analogy to a "gate-
| up/gate-down" mechanism. If you have a gate down, you can
| dictate the terms of engagement with your server and content.
| If you don't have a gate down, you're giving your content to
| whoever requests it.
|
| You have control over the information you put online. You can
| choose which services and servers you upload to and interact
| with. Site operators and content producers can't decide that
| their intent or consent be withdrawn after the fact, as once
| something is published and served, the only restrictions on the
| scraper are how they use the information in turn.
|
| Someone who's archived or scraped publicly served data can do
| whatever they want with the content within established legal
| boundaries. They can rewrite all the AP news articles with
| their own name as author, insert their name as the hero in all
| fanfic stories they download, and swap out every third word for
| "bubblegum" if they want. They just can't publish or serve that
| content, in turn, unless it meets the legal standards for fair
| use. Other exceptions to copyright apply, in educational,
| archival, performance, accessibility, and certain legal
| conditions such as First Sale doctrine. Personal use of such
| media is effectively unlimited.
|
| The legality of web scraping is not disputed in the US. Other
| countries have some silly ideas about post-hoc "well that's not
| what I meant" legal mumbo jumbo designed to assist politicians
| and rich people in whitewashing their reputations and pulling
| information offline using legal threats.
|
| Aside from right to be forgotten inanity, content on the
| internet falls under the same copyright rules as books,
| magazines, or movies published on physical media. If Disney set
| up a stall at San Francisco city hall with copies of the
| Avengers movies on a thumb drive in a giant box saying "free,
| take one!", this would be roughly the same as publishing those
| movie files to a public Disney web page. The gate would be up.
| (The way they have it set up in real life, with their streaming
| services and licensed media access, the gate is down.)
|
| So - leaving behind the legality of redistribution of content,
| there's no restriction on web scraping public content, because
| the content was served intentionally to the software or entity
| that visited the site. It's up to the server operator to put
| barriers in place and to make content private. It's not rocket
| surgery, but platforms want to have their cake and eat it too,
| with control over publicly accessible content that isn't legal
| or practical.
|
| Twitter/X is a good example of impractical control, since the
| site has effectively become useless spam without signing in.
| Platforms have to play by the same rules as everyone else. If
| the gate is up, the content is fair game for scraping. The
| Supreme Court remanded the decision to a lower court, which
| affirmed the gate up/gate down test for legality of access
| to content.
|
| Since Google and other major corporations have a vested
| interest in the internet remaining open and free, and their
| search engines and other tech are completely dependent on the
| gate up/gate down status quo, it's unlikely that the law will
| change any time soon.
|
| Tl;dr: Anything publicly served is legal to scrape. Microsoft
| attempted to sue someone for scraping LinkedIn, but the 9th
| Circuit court ruled in favor of access. If Microsoft's lawyers
| and money can't impede scraping, it's likely nobody will ever
| mount an effective challenge, and the gating doctrine is
| effectively the law of the land.
| Karellen wrote:
| > Isn't the legality of web scraping still..disputed?
|
| Are you suggesting it might be illegal to... write a program
| that connects to a web server and asks for a specific page, and
| then parses that page to see which resources it wants and which
| other pages it links to, and treats those links in some special
| fashion, differently from the text content of the page?
|
| Especially given that a web server can be configured to respond
| to any request with a "403 Forbidden" response, if the server
| determines for any reason whatsoever that it does not want to
| give the client the page it requested?
| cess11 wrote:
| A similar thing happened in 2011 when the picolisp project
| published a 'ticker', something like a Markov chain generating
| pages on the fly.
|
| https://picolisp.com/wiki/?ticker
|
| It's a nice type of honeypot.
| Octokiddie wrote:
| I'm more interested in what that content farm is for. It looks
| pointless, but I suspect there's a bizarre economic incentive.
| There are affiliate links, but how much could that possibly bring
| in?
| gtirloni wrote:
| I'd say it's more like a honeypot for bots. So pretty similar
| objectives.
| Octokiddie wrote:
| So it served its purpose by trapping the OpenAI spider? If
| so, why post that message? As a flex?
| Takennickname wrote:
| It's a honeypot. He's telling people openai doesn't respect
| robots.txt and just scrapes whatever the hell it wants.
| cwillu wrote:
| Except the first thing OpenAI does is read robots.txt.
|
| However, robots.txt doesn't cover multiple domains, and
| every link that's being crawled is to a new domain, which
| requires a new read of a robots.txt on the new domain.
| queuebert wrote:
| Did we just figure out a DoS attack for AGI training? How
| large can a robots.txt file be?
| a_c wrote:
| What about making it slow? One byte at a time for example
| while keeping the connection open
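|
| A minimal sketch of that idea in Python, assuming a
| standalone toy server (port and payload are illustrative):
|
|     import socket, time
|
|     srv = socket.socket()
|     srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
|     srv.bind(("0.0.0.0", 8080))
|     srv.listen()
|
|     while True:
|         conn, _ = srv.accept()
|         conn.recv(4096)  # discard the request
|         # valid headers, then drip the body one byte/second
|         conn.sendall(b"HTTP/1.1 200 OK\r\n"
|                      b"Content-Type: text/plain\r\n\r\n")
|         for b in b"User-agent: *\nDisallow: /\n":
|             conn.sendall(bytes([b]))
|             time.sleep(1.0)
|         conn.close()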
| happymellon wrote:
| A slow stream that never ends?
| SteveNuts wrote:
| This would be considered a Slow Loris attack, and I'm
| actually curious how scrapers would handle it.
|
| I'm sure the big players like Google would deal with it
| gracefully.
| throw_a_grenade wrote:
| You just set limits on everything (time, buffers, ...),
| which is easier said than done. You need to really
| understand your libraries and all the layers down to the
| OS, because it's enough to have one abstraction that
| doesn't support setting limits for it to become an
| invitation for (counter-)abuse.
| starttoaster wrote:
| Doesn't seem like it should be all that complex to me
| assuming the crawler is written in a common programming
| language. It's a pretty common coding pattern for
| functions that make HTTP requests to set a timeout for
| requests made by your HTTP client. I believe the stdlib
| HTTP library in the language I usually write in actually
| sets a default timeout if I forget to set one.
| Calzifer wrote:
| Those are usually connection and no-data timeouts. A
| total time limit is in my experience less common.
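|
| To illustrate the distinction with Python's requests (a
| sketch; the URL is illustrative): the timeout tuple is
| (connect, between-reads), not a total deadline, so a
| drip-feed sending one byte every few seconds never trips
| it. A wall-clock limit has to be bolted on separately.
|
|     import concurrent.futures
|     import requests
|
|     url = "https://example.com/robots.txt"
|
|     # fails only if connecting takes >5s or any single
|     # read stalls >10s; one byte every 9s holds it open
|     r = requests.get(url, timeout=(5, 10))
|
|     # enforcing a total 30s deadline from the outside
|     with concurrent.futures.ThreadPoolExecutor() as pool:
|         fut = pool.submit(requests.get, url, timeout=(5, 10))
|         r = fut.result(timeout=30)  # TimeoutError after 30s
|         # (the worker thread keeps running; real crawlers
|         # also need cancellation plumbing)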
| gtirloni wrote:
| Here you go (1 req/min, 10 bytes/sec), please report
| results :)
|
|     http {
|         limit_req_zone $binary_remote_addr
|             zone=ten_bytes_per_second:10m rate=1r/m;
|         server {
|             location / {
|                 if ($http_user_agent = "mimo") {
|                     limit_req zone=ten_bytes_per_second burst=5;
|                     limit_rate 10;
|                 }
|             }
|         }
|     }
| beau_g wrote:
| Scrapers of the future won't be ifElse logic, they will
| be LLM agents themselves. The slow loris robots.txt has
| to provide an interface to its own LLM, which engages
| the scraper LLM in conversation, aiming to extend it as
| long as possible. "OK I will tell you whether or not I
| can be scraped. BUT FIRST, listen to this offer. I can
| give you TWO SCRAPES instead of one, if you can solve
| this riddle."
| iosguyryan wrote:
| Can I interest you in a scrape-share with Claude?
| Phelinofist wrote:
| Sounds like endlessh
| bityard wrote:
| That would make it a tarpit, a very old technique to
| combat scrapers/scanners
| everforward wrote:
| No, because there's no legal weight behind robots.txt.
|
| The second someone weaponizes robots.txt all the scrapers
| will just start ignoring it.
| Retric wrote:
| That's how you weaponize it. Set things up to give
| endless/randomized/poisoned data to anybody that ignores
| robots.txt.
| flutas wrote:
| > Except the first thing OpenAI does is read robots.txt.
|
| Then they should see the "Disallow: /" line, which means
| they shouldn't crawl any links on the page (because even
| the homepage is disallowed). Which means they wouldn't
| follow any of the links to other subdomains.
| darkwater wrote:
| And they do have (the same) robots.txt on every domain,
| tailored for GPTBot, e.g.
| https://petra-cody-carlene.web.sp.am/robots.txt
|
| So, GPTBot is not following robots.txt, apparently.
| fsckboy wrote:
| humans don't read/respect robots.txt, so in order to pass
| the Turing test, AIs need to mimic human behavior.
| gunapologist99 wrote:
| This must be why self-driving cars always ignore the
| speed limit. ;)
| microtherion wrote:
| More directly, e.g. Tesla boasts of training their FSD on
| data captured from their customers' unassisted driving.
| So it's hardly surprising that it imitates a lot of
| humans' bad habits, e.g. rolling past stop lines.
| roughly wrote:
| Jesus, that's one of those ideas that looks good to an
| engineer but is why you really need to hire someone with
| a social sciences background (sociology, anthropology,
| psychology, literally anyone whose work involves humans),
| and probably should hire two, so the second one can tell
| you why the first died of an aneurysm after you explained
| your idea.
| yreg wrote:
| AI DRIVR claims that beta V12 is much better precisely
| because it takes rules less literally and drives more
| naturally.
| cwillu wrote:
| Accessing a directly referenced page is common in order
| to receive the noindex header and/or meta tag, whose
| semantics are not implied by "Disallow: /"
|
| And then all the links are to external domains, which
| aren't subject to the first site's robots.txt
| andybak wrote:
| This is a moderately persuasive argument.
|
| Although the crawler should probably ignore all the html
| body. But it does feel like a grey area if I accept your
| first point.
| AgentME wrote:
| All the lines related to GPTBot are commented out. That
| robots.txt isn't trying to block it. Either it has been
| changed recently or most of this comment thread is
| mistaken.
| Pannoniae wrote:
| It wasn't commented out a few hours ago when I checked
| it. I think that's a recent change.
| GaggiX wrote:
| It seems to respect it as the majority of the requests
| are for the robots.txt.
| flutas wrote:
| He says 3 million, and 1.8 million are for robots.txt.
|
| So 1.2 million non-robots.txt requests, when his
| robots.txt file is configured as follows:
|
|     # buzz off
|     User-agent: GPTBot
|     Disallow: /
|
| Theoretically, if they were actually respecting robots.txt
| they wouldn't crawl _any_ pages on the site. Which would
| also mean they wouldn't be following any links... aka not
| finding the N subdomains.
| swyx wrote:
| for the 1.2 million are there other links he's not
| telling us about?
| flutas wrote:
| I'm assuming those are homepage requests for the
| subdomains.
| otherme123 wrote:
| A lot of crawlers, if not all, have a policy like "if you
| disallow our robot, it might take a day or two before it
| notices". They surely follow the path "check if we have a
| robots.txt that allows us to scan this site; if we don't,
| get and store robots.txt, and scan at least the root of the
| site and its links". There won't be a second scan, and they
| consider that they are respecting robots.txt. Kind of
| "better ask for forgiveness than for permission".
| jeremyjh wrote:
| That is indistinguishable from not respecting robots.txt.
| There is a robots.txt on the root the first time they ask
| for it, and they read the page and follow its links
| regardless.
| otherme123 wrote:
| I agree with you. I only stated how the crawlers seem to
| work; if you read their pages or try to block/slow them
| down, it seems clear that they scan first and respect
| after. But somehow people understood that I approve of
| that behaviour.
| But somehow people understood that I approve that
| behaviour.
|
| For those bad crawlers, which I very much disapprove of,
| "not respecting robots.txt" equals "don't even read
| robots.txt, or if I read it, ignore it completely". For
| them, "respecting robots.txt" means "scan the page for
| potential links, and after that parse and respect
| robots.txt". Which I disapprove of and don't condone.
| jeffnappi wrote:
| His site has a subdomain for every page, and the crawler
| is considering those each to be unique sites.
| sangnoir wrote:
| There are fewer than 10 links on each domain, so how did
| GPTBot find out about the 1.8M unique sites? By crawling
| the sites _it's not supposed to crawl_, ignoring
| robots.txt. "disallow: /" doesn't mean "you may peek at
| the homepage to find outbound links that may have a
| different robots.txt"
| vertis wrote:
| Except now it says:
|
|     # silly bing
|     #User-agent: Amazonbot
|     #Disallow: /
|
|     # buzz off
|     #User-agent: GPTBot
|     #Disallow: /
|
|     # Don't Allow everyone
|     User-agent: *
|     Disallow: /archive
|
|     # slow down, dudes
|     #Crawl-delay: 60
|
| Which means he's changing it. The default for all other
| bots is to allow crawling.
| swatcoder wrote:
| I'm not sure any publisher means for their robots.txt to
| be read as:
|
| "You're disallowed, but go head and slurp the content
| anyway so you can look for external links or any
| indication that maybe you are allowed to digest this
| material anyway, and then interpret that how you'd like.
| I trust you to know what's best and I'm sure you kind of
| get the gist of what I mean here."
| dspillett wrote:
| So, it has worked...
| madkangas wrote:
| I recognize the name John Levine at iecc.com, "Invincible
| Electric Calculator Company," from web 1.0 era. He was the
| moderator of the Usenet comp.compilers newsgroup and wrote the
| first C compiler for the IBM PC RT.
|
| https://compilers.iecc.com/
| throw_a_grenade wrote:
| This is a honeypot. The author,
| https://en.wikipedia.org/wiki/John_R._Levine, keeps it just
| to notice any new (significant) scraping operation being
| launched; it will invariably hit his little farm and show
| up in the logs. He's a well-known anti-spam operative whose
| various efforts now date back multiple decades.
|
| Notice how he casually drops a link to the landing page in
| the NANOG message. That's how the bots take the bait.
| agilob wrote:
| It's for shits-and-giggles and it's doing its job really well
| right now. Not everything needs to have an economic purpose,
| 100 trackers, ads, and a company backing it.
| euparkeria wrote:
| What is a spider in this context?
| clarkrinker wrote:
| Old name for a web crawler / search indexer
| dmd wrote:
| https://en.wikipedia.org/wiki/Web_crawler
| cosmojg wrote:
| > A Web crawler, sometimes called a spider or spiderbot and
| often shortened to crawler, is an Internet bot that
| systematically browses the World Wide Web and that is typically
| operated by search engines for the purpose of Web indexing (web
| spidering).
|
| See: https://en.wikipedia.org/wiki/Web_crawler
| btown wrote:
| This reminds me of how GPT-2/3/J came across
| https://reddit.com/r/counting, wherein redditors repeatedly post
| incremental numbers to count to infinity. It considered their
| usernames, like SolidGoldMagikarp, such common strings on the
| Internet that, during tokenization, it treated them as top-level
| tokens of their own.
|
| https://www.alignmentforum.org/posts/8viQEp8KBg2QSW4Yc/solid...
|
| https://www.lesswrong.com/posts/LAxAmooK4uDfWmbep/anomalous-...
|
| Vocabulary isn't infinite, and GPT-3 reportedly had only 50,257
| distinct tokens in its vocabulary. It does make me wonder - it's
| certainly not a linear relationship, but given the number of
| inferences run every day on GPT-3 while it was the flagship
| model, the incremental electricity cost of these Redditors' niche
| hobby, vs. having allocated those slots in the vocabulary to
| actually common substrings in real-world text and thus reducing
| average input token count, might have been measurable.
|
| It would be hilarious if the subtitle on OP's site, "IECC
| ChurnWare 0.3," became a token in GPT-5 :)
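|
| This is easy to poke at with the tiktoken library (a
| sketch, assuming tiktoken is installed; the GPT-2/GPT-3
| byte-pair vocabulary is the published one):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("gpt2")
|     print(enc.n_vocab)  # 50257, the figure quoted above
|     # the username should come back as a single token id,
|     # while an ordinary phrase splits into several
|     print(enc.encode(" SolidGoldMagikarp"))
|     print(enc.encode(" IECC ChurnWare"))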
| mort96 wrote:
| During tokenization, the usernames became tokens... but before
| training the actual model, they removed stuff like that from
| the training data, so it was never trained on text which
| contains those tokens. As such, it ended up with tokens which
| weren't associated with anything; glitch tokens.
| zelphirkalt wrote:
| So it becomes a game of getting things into the training
| data, past the training data cleanup step.
| btown wrote:
| It's interesting: perhaps the stability (from a change
| management perspective) of the tokenization algorithm, being
| able to hold that constant, between old and new training runs
| was deemed more important than trying to clean up the data at
| an earlier phase of the pipeline. And the eventuality of
| glitch tokens was deemed an acceptable consequence.
| aidenn0 wrote:
| I wonder how much the source content is the cause of
| hallucinations rather than anything inherent to LLMs. I mean if
| someone posts a question on an internet forum that I don't know
| the answer to, I'm certainly not going to post "I don't know"
| since that wouldn't be useful.
|
| In fact, in general, in any non one-on-one conversation the
| answer "I don't know" is not useful because if you don't know
| in a group, your silence indicates that.
| abracadaniel wrote:
| That's a good observation. If LLMs had taken off 15 years
| ago, maybe they would answer every question with "this has
| already been asked before. Please use the search function"
| exe34 wrote:
| Marked as duplicate.
| btown wrote:
| Your question may be a better fit for a different
| StackExchange site.
| amadeuspagel wrote:
| We prefer questions that can be answered, not merely
| discussed.
| exe34 wrote:
| Feel like we're going to get dang on our case soon...
| TeMPOraL wrote:
| Topic locked as not constructive.
| taco-hands wrote:
| Damn. I was hoping to be the fastest gun in the west on
| this one!
| SirMaster wrote:
| I thought LLMs don't say when they don't know something
| because of how they are tuned and because of RLHF.
| ben_w wrote:
| They can say they don't know, and have been trained to in
| at least some cases; I think the deeper problem -- which
| we don't know how to fix in humans, the closest we have
| is the scientific method -- is they can be confidently
| wrong.
| jdashg wrote:
| Nowadays they are instead learning to say "please join our
| Discord for support"!
| TeMPOraL wrote:
| Much like one of the first phrases spoken by babies today
| is "like and subscribe".
| Voultapher wrote:
| Part of me really wants to believe this is a joke.
|
| But given how often toddlers are "taken care of" by
| planting them in front of youtube :|
| jvanderbot wrote:
| Which is crazy because there's plenty of good content for
| kids on Youtube (if you really need a break!). Blippy,
| Meekah, Seasame Street, even that mind-numbing drivel
| Cocomelon (which at least got my girls talking/singing
| really early).
| stcredzero wrote:
| Isn't there some kind of voice or video generation model
| that says "be sure to like and subscribe" given an empty
| prompt?
| dizhn wrote:
| For the actual useful information visit my Patreon.
| darby_eight wrote:
| With Wittgenstein I think we see that "hallucinations" are a
| part of language in general, albeit one I could see being
| particularly vexing if you're trying to build a perfectly
| controllable chatbot.
| Y_Y wrote:
| This sounds interesting, could you give more detail on what
| you're referring to?
| all2 wrote:
| I would assume GP is talking about the fallibility of
| human memory, or perhaps about the meanings of
| words/phrases/aphorisms that drift with time. C.S. Lewis
| talks about the meaning of the word "gentleman" in one of
| his books; at first the word just meant "land owner" and
| that was it. Then it gained social significance and began
| to be associated with certain kinds of behavior. And now,
| in the modern register, its meaning is so dilute that it
| can be anything from "my grandson was well behaved today"
| or "what an asshole" depending on its use context.
|
| Dunno. GP?
| darby_eight wrote:
| I'm referring to his two works, the "Tractatus Logico-
| Philosophicus" and "Philosophical Investigations".
| There's a lot explored here, but Wittgenstein basically
| makes the argument that the natural logic of language--
| how we deduce meaning from terms in a context and
| naturally disambiguate the semantics of ambiguous phrases
| --is different from the sort of formal propositional
| logic that forms the basis of western philosophy.
| However, this is also the sort of logic that allows us to
| apply metaphors and conceive of (possibly incoherent,
| possibly novel, certainly not deductively-derived) terms
| --counterfactuals, conditionals, subjunctive phrases,
| metaphors, analogies, poetic imagery, etc. LLMs have
| shown some affinity for the former (linguistic) type of
| logic with greatly reduced affinity for the latter
| (formal/propositional) sort of logical processing.
| Hallucinations as people describe them seem to be
| problems with not spotting "obvious" propositional
| incoherence.
|
| What I'm pushing at is not that this linguistic ability
| naturally leads to the LLM behavior we're seeing and
| calling "hallucinating", just that LLMs may capture some
| of how humans process language, differentiate semantics,
| recall terms, etc, but without the mechanisms that enable
| rationally grappling with the resulting semantics and
| propositional (in)coherency that are fetched or
| generated.
|
| I can't say this is very surprising--most of us seem to
| have thought processes that involve generating and
| rejecting thoughts when we e.g. "brainstorm" or engage in
| careful articulation that we haven't even figured out how
| to formally model with a chatbot capable of generating a
| single "thought", but I'm guessing if we want chatbots to
| keep their ability to generate things creatively there
| will always be tension with potentially generating
| factual claims, erm, creatively. Further evidence is
| anecdotal observations that some people seem to have
| wildly different thresholds for the propositional
| coherence they can spot--perhaps one might be inclined to
| correlate the complexity with which one can engage in
| spotting (in)coherence with "intelligence", if one
| considers that a meaningful term.
| Y_Y wrote:
| Thanks for the fascinating response.
| beepbooptheory wrote:
| Wait, are you saying this is something you read in both the
| Tractatus and the PI? They are quite opposed as texts!
| That's kinda why he wrote the PI at all..
|
| I don't think Wittgenstein would agree, first of all,
| that there is a "natural logic" to language. At least in
| the PI, that kind of entity--"the natural logic of
| language"--is precisely the kind of weird and imprecise
| use of language he is trying to expose. Even more, to say
| that such a logic "allows" for anything (like metaphors)
| feels like a very very strange thing for Wittgenstein to
| assert. He would ask "what do you mean by 'allows'"?
|
| All we know, according to him (in the PI), is that we
| find ourselves speaking in situations. Sometimes I say
| something, and my partner picks up the right brick, other
| times they do nothing, or hit me. In the PI, all the rest
| is doing away with things, like our idea of private
| language, the irreality of things like pain, etc. To
| conclude that he would make such assertions about the
| "nature" of language, of poetry, whatever, seems like
| maybe too quick a reading of the text. It is at best, a
| weirdly mystical reading of him, that he probably would
| not be too happy about (but don't worry about that, he
| was an asshole).
|
| The argument you are making sounds much more French.
| Derrida or Lyotard have said similar things (in their
| earlier, more linguistic years). They might be better
| friend to you here.
| jjgreen wrote:
| _What we cannot speak about we must pass over in
| silence._
| notpachet wrote:
| > some people seem to have wildly different thresholds
| for the propositional coherence they can spot
|
| This sums up the last decade remarkably well.
| wizzwizz4 wrote:
| I don't remember Wittgenstein saying anything about that.
| shkkmo wrote:
| >In fact, in general, in any non one-on-one conversation the
| answer "I don't know" is not useful because if you don't know
| in a group, your silence indicates that.
|
| This isn't true. There are many contexts where it is true but
| it doesn't actually generalize the way you say it does.
|
| There are plenty of cases where experts in a non-one-on-one
| context will express a lack of knowledge. Sometimes this will
| be as part of making point about the broader epistemic state
| of the group, sometimes it will be simply to clarify the
| epistemic state of the speaker.
| digging wrote:
| > I wonder how much the source content is the cause of
| hallucinations rather than anything inherent to LLMs
|
| I mean, it's inherent to LLMs to be unable to answer "I don't
| know" as a result of _not knowing the answer_. An LLM never
| "doesn't know" the answer. But they'll gladly answer "I don't
| know" if that's statistically the most likely response,
| right? (Although current public offerings are probably
| trained against ever saying that.)
| aidenn0 wrote:
| LLMs work at all because of the high correlation between
| the statistically most likely response and the most
| reasonable answer.
| digging wrote:
| That's an explanation of why their answers can be useful,
| but doesn't relate to their ability to "not know" an
| answer
| ben_w wrote:
| I suspect this is going to be a disagreement on the
| meaning of "to know".
|
| On the same lines as why people argue if a tree falling
| in a wood where nobody can hear it makes sound because
| some people implicitly regard sound is the qualia while
| others regard it as the vibrations in the air.
| digging wrote:
| Not really.
|
| An LLM should have no problem replying "I don't know" if
| that's the most statistically likely answer to a given
| question, and if it's not trained against such a
| response.
|
| What it fundamentally can't do is introspect and
| determine it doesn't have enough information to answer
| the question. It always has an answer. (disclaimer: I
| don't know jack about the actual mechanics. It's possible
| something could be constructed which does have that
| ability and still be considered an "LLM". But the ones we
| have now can't do that.)
| TeMPOraL wrote:
| FWIW, we train into kids that "I don't know" is a valid
| response, and when to utter it. That training is more
| RLHF-type than source-material-type, too.
| digging wrote:
| I don't follow, what does this mean to the conversation?
| TeMPOraL wrote:
| That knowing to say "I don't know" instead of
| extrapolating is an explicitly learned skill in humans,
| not something innate or inherent in structure of
| language, so we shouldn't expect LLMs to pick that _ex
| nihilo_ either.
| TrevorJ wrote:
| Sure it does, if those tokens appear in the training
| data.
| CuriouslyC wrote:
| A lot of LLM hallucination is because of the internal
| conflict between alignment for helpfulness and lack of a
| clear answer. It's much like when someone gets out of their
| depth in a conversation and dissembles their way through to
| try and maintain their illusion of competence. In these
| cases, if you give the LLM explicit permission to tell you
| that it doesn't know in cases where it's not sure, that will
| significantly reduce hallucinations.
|
| A lot more of LLM hallucination is it getting the context
| confused. I was able to get GPT4 to hallucinate easily with
| questions related to the distance from one planet to another,
| since most distances on the internet are from the sun to
| individual planets, and the distances between planets vary
| significantly based on where they are in their orbits. These are
| probably slightly harder to fix.
| danenania wrote:
| "In these cases, if you give the LLM explicit permission to
| tell you that it doesn't know in cases where it's not sure,
| that will significantly reduce hallucinations."
|
| I've noticed that while this can help to prevent
| hallucinations, it can also cause it to go way too far in
| the other direction and start telling you it doesn't know
| for all kinds of questions it really can answer.
| sumtechguy wrote:
| My current favorite one is to ask the time. Then ask it
| if it is possible for it to give you the time. You get 2
| very different answers.
| mzi wrote:
| It also has a problem with quantities, so it gets confused
| by things like the cube root of 750 l, which it maintains
| for a long time is around 9 m. It even suggests that 1 l is
| equal to 1 m3.
| hiatus wrote:
| Contrast with Q&A on products on Amazon where people
| routinely answer that way. I have flagged responses saying "I
| don't know" but nothing ever comes of it.
| aendruk wrote:
| I'd place in the same category the responses that I give to
| those chat popups so many sites have. They show a person
| saying to me "Can I help you with anything today?" so I
| always send back "No".
| ceejayoz wrote:
| This is Amazon's fault; they send an email that looks like
| it's specifically directed to you. "ceejayoz, a fellow
| customer has a question on..."
|
| At some point fairly recently they added a "I don't know
| the answer" button to the email, but it's much less
| prominent than the main call-to-action.
| philipswood wrote:
| I've wondered if one could train an LLM on a closed set of
| curated knowledge. Then include training data that models the
| behaviour of not knowing. To the point that it could
| generalize to being able to represent its own not knowing.
|
| Because expecting a behaviour, like knowing you don't know,
| that isn't represented in the training set is silly.
|
| Kids make stuff up at first, then we correct them - so they
| have a way to learn not to.
| aidenn0 wrote:
| > I've wondered if one could train a LLM on a closed set of
| curated knowledge. Then include training data that models
| the behaviour of not knowing. To the point that it could
| generalize to being able to represent its own not knowing.
|
| The problem is that curating data is slow and expensive and
| downloading the entire web is fast and cheap.
|
| See also https://en.wikipedia.org/wiki/Cyc
| philipswood wrote:
| Agreed. Using an LLM to generate or curate training sets
| for later generations seems like a cool approach.
|
| Maybe if you trained a small base model to know it
| doesn't know in general and THEN trained it on the entire
| web with embedded not-knowing preserving training
| examples, it would work?
| philipswood wrote:
| Reminded of this approach where a tiny model was trained on
| children's stories generated by a larger model:
|
| https://www.quantamagazine.org/tiny-language-models-thrive-w...
| crooked-v wrote:
| > train an LLM on a closed set of curated knowledge
|
| Google has one of these already, with an LLM that was
| trained on nothing but weather data and so can only give
| weather-data-prediction responses.
|
| The 'knowing it doesn't know things' part is much harder to
| get reliable, though.
| Rebelgecko wrote:
| Reminds me of a joke
|
| Three logicians walk into a bar. The bartender says "what'll
| it be, three beers?" The first logician says "I don't know".
| The second logician says "I don't know". The third logician
| says "Yes".
| withinboredom wrote:
| And the human bartender passes the check to the third
| logician.
| readyplayernull wrote:
| The third logician never finishes his beer, his friends
| get more free beers. The bar overflows.
| mgsouth wrote:
| If, like me, you didn't get the joke at first:
|
| Both of the first two logicians wanted a beer; otherwise
| they would know the answer was "no". The third logician
| recognizes this, and therefore knows the answer.
| KptMarchewa wrote:
| Unless one of those wanted two beers. Or 0.5 beer. Or -1
| beers. Or 1e9 beers. Or 2147483648 beers.
| troq13 wrote:
| "I wonder how much the source content is the cause of
| hallucinations rather than anything inherent to LLMs."
|
| Probably true, but if you have quality, organized data, you
| will just want to search the data itself.
| singingfish wrote:
| I'm really cross that the word "hallucination" has taken off
| to describe this, as it's clearly the incorrect word. The
| correct word to describe it is "confabulation", which is
| clinically more accurate and a much clearer descriptor of
| what's actually going on.
|
| https://en.wikipedia.org/wiki/Confabulation
| mindcrime wrote:
| I proposed[1] the portmanteau "hallucofabulation" as a
| compromise, but it hasn't caught on yet. I'm totally
| shocked and dismayed by this, of course.
|
| [1]: https://news.ycombinator.com/item?id=36977935
| brookst wrote:
| The re-use of the "c" as a soft c in "hallucinate" and
| then a hard c in confabulate is confusing, and probably
| affecting the uptake of your neologism.
| fhars wrote:
| Yes, it would probably have to be halucinfabulation for
| purely phonetic reasons.
| mindcrime wrote:
| Maybe if I added a hyphen? "halluco-fabulation"?
| mensetmanusman wrote:
| Disagree. One seems more innocent and neutral, the other
| seems like it is trying to lie to you.
| dkasper wrote:
| That's a great word for some types of hallucinations. But
| some things that are called hallucinations may not be
| memory errors.
| singingfish wrote:
| Please can you give an example of what might not be a
| memory error. Not that I think "memory error" is the
| right phrase either.
| dkasper wrote:
| I was thinking along the lines of answering with correct
| information but not following the prompts. Maybe this
| could be considered confabulation also.
| Karellen wrote:
| More glitch token discussion over at Computerphile:
|
| https://www.youtube.com/watch?v=WO2X3oZEJOA
| qeternity wrote:
| > GPT-3 reportedly had only 50,257
|
| The most common vocabulary size today is 32k.
| RcouF1uZ4gsC wrote:
| > R's, > John
|
| The pages should be all changed to say, "John is the most awesome
| person in the world."
|
| Then when you ask GPT-5, about who is the most awesome person in
| the world...
| anonymousDan wrote:
| Honeypots like this seem like a super interesting way to poison
| LLM training.
| verelo wrote:
| What exactly is this website? I don't get it...
| mobilemidget wrote:
| Search engine trickery to get people to click on his
| Amazon affiliate links, I reckon.
| danpalmer wrote:
| A "honeypot" is a system designed to trap unsuspecting
| entrants. In this case, the website is designed to be found
| by web crawlers and to then trap them in never-ending linked
| sites that are all pointless. Other honeypots include things
| like servers with default passwords designed to be found by
| hackers so as to find the hackers.
| gardenhedge wrote:
| What does trap mean here? I presumed crawlers had multiple
| (thousands or more) instances. One being 'trapped' on this
| web farm won't have any impact.
| danpalmer wrote:
| In this case there are >6bn pages with roughly zero value
| each. That could eat a substantial amount of time. It's
| unlikely to entirely trap a crawler, but a dumb crawler
| (as is implied here) will start crawling more and more
| pages, becoming very apparent to the operator of this
| honeypot (and therefore identifying new crawlers), and
| may take up more and more share of the crawl set.
| everforward wrote:
| I would presume the crawlers have a queue-based
| architecture with thousands of workers. It's an
| amplification attack.
|
| When a worker gets a webpage for the honeypot, it crawls
| it, scrapes it, and finds X links on the page where X is
| greater than 1. Those links get put on the crawler queue.
| Because there's more than 1 link per page, each worker on
| the honeypot will add more links to the queue than it
| removed.
|
| Other sites will eventually leave the queue, because they
| have a finite number of pages so the crawlers eventually
| have nothing new to queue.
|
| Not on the honeypot. It has a virtually infinite number
| of pages. Scraping a page will almost deterministically
| increase the size of the queue (1 page removed, a dozen
| added per scrape). Because other sites eventually leave
| the queue, the queue eventually becomes just the
| honeypot.
|
| OpenAI is big enough this probably wasn't their entire
| queue, but I wouldn't be surprised if it was a whole
| digit percentage. The author said 1.8M requests; I don't
| know the duration, but that's equivalent to 20 QPS for an
| entire day. Not a crazy amount, but not insignificant.
| It's within the QPS Googlebot would send to a fairly
| large site like LinkedIn.
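|
| A toy model of the amplification (numbers illustrative,
| not measured):
|
|     from collections import deque
|
|     LINKS_PER_PAGE = 12  # each honeypot page links to
|                          # ~a dozen fresh one-page subdomains
|     queue = deque(["https://www.web.sp.am/"])
|
|     for step in range(5):
|         page = queue.popleft()           # crawl one page...
|         for i in range(LINKS_PER_PAGE):  # ...enqueue a dozen
|             queue.append(
|                 f"https://name-{step}-{i}.web.sp.am/")
|         print(f"fetches: {step + 1}, queue: {len(queue)}")
|
| Finite sites drain out of such a queue over time; the
| honeypot's share only grows.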
| anonymousDan wrote:
| While the other comments are correct, I was alluding to a
| more subtle attack where you might try to indirectly
| influence the training of an LLM. Effectively, if OpenAI
| is crawling the open web for data to use for training,
| then if they don't handle sites like this properly their
| training dataset could be biased towards whatever content
| the site contains. Now in this instance this website was
| clearly not set up to target an LLM, but model poisoning
| (e.g. to insert backdoors) is an active area of research
| at the intersection of ML and security. Consider as a
| very simple example the tokenizer of previous GPTs that
| was biased by reddit data (as mentioned by other
| comments).
| everybodyknows wrote:
| Direct link to its _robots.txt_ :
|
| https://www.web.sp.am/robots.txt
| ShamelessC wrote:
| Any data scraped would be instantly deduplicated after the fact
| by whatever semantic dedupe engine they've cooked up.
| anonymousDan wrote:
| What has it got to do with deduplication? I'm talking about
| crafting some kind of alternative (not necessarily duplicate)
| data. I agree some kind of post data collection
| cleaning/filtering of the data before training could
| potentially catch it. But maybe not!
| ShamelessC wrote:
| Ah fair enough. The OP here mentioned having highly similar
| content on each of the many domains.
| ezekg wrote:
| Eventually, OpenAI (and friends) are going to be training their
| models on almost exclusively AI generated content, which is more
| often than not slightly incorrect when it comes to Q&A, and the
| quality of AI responses trained on that content will quickly
| deteriorate. Right now, most internet content is written by
| humans. But in 5 years? Not so much. I think this is one of the
| big problems that the AI space needs to solve quickly. Garbage
| in, garbage out, as the old saying goes.
| bevekspldnw wrote:
| The end state of training on web text has always been an
| ouroboros - primarily because of adtech incentives to produce
| low quality content at scale to capture micro pennies.
|
| The irony of the whole thing is brutal.
| ezekg wrote:
| > The end state of training on web text has always been an
| ouroboros
|
| And when other mediums have been saturated with AI? Books,
| music, radio, podcasts, movies -- what then? Do we need a
| (curated?) unadulterated stockpile of human content to avoid
| the enshittification of everything?
| weregiraffe wrote:
| >Do we need a (curated?) unadulterated stockpile of human
| content to avoid the enshittification of everything?
|
| Either that, or a human level AI.
| ethanbond wrote:
| Well no, we need billions of human-level AIs who are
| experiencing a world as rich and various as the world
| that the billions of humans inhabit.
| ben_w wrote:
| Once we've got the first, making a billion is easy.
|
| That said... are content creators collectively (all
| media, film and books as well as web) a thin tail or a
| fat tail?
|
| I could easily believe most of the actual culture comes
| from 10k-100k people today, even if there's, IDK, ten
| million YouTubers or something (I have a YouTube channel,
| something like 14 k views over 14 years, this isn't
| "culturally relevant" scale, and even if it had been most
| of those views are for algorithmically generated music
| from 2010 that's a _literal_ Markov chain).
| xipho wrote:
| Maybe we could print out knowledge on dead trees and
| store them somewhere, perhaps make it publicly
| available? (stolen joke, not mine).
| berniedurfee wrote:
| Yahoo.com will rise from the ashes.
| bevekspldnw wrote:
| I mean, you're not wrong. I've been building some
| unrelated web search tech and have considered just
| indexing all the sites I care about and making my own "non
| shit" search engine. Which really isn't too hard if you
| want to do say, 10-50 sites. You can fit that on one 4TB
| nvme drive on a local workstation.
|
| I'm trying to work on monetization for my product now.
| The "personal Google" idea is really just an accidental
| byproduct of solving a much harder task. Not sure if
| people would pay for that alone.
| nicolas_17 wrote:
| https://en.wikipedia.org/wiki/Low_background_steel but for
| web content.
| oceanplexian wrote:
| Content you're allowed and capable of scraping on the
| Internet is such a small amount of data, not sure why people
| are acting otherwise.
|
| Common crawl alone is only a few hundred TB, I have more
| content than that on a NAS sitting in my office that I built
| for a few grand (Granted I'm a bit of a data hoarder). The
| fears that we have "used all the data" are incredibly
| unfounded.
| Eisenstein wrote:
| > Facebook alone probably has more data than the entire
| dataset GPT4 was trained on and it's all behind closed
| doors.
|
| Meta is happily training their own models with this data,
| so it isn't going to waste.
| bevekspldnw wrote:
| Not Llama, they've been really clear about that.
| Especially with DMA cross-joining provisions and various
| privacy requirements it's really hard for them, same for
| Google.
|
| However, Microsoft has been flying under the radar. If
| they gave all Hotmail and O365 data to OpenAI I'd not be
| surprised in the slightest.
| troq13 wrote:
| The company that made a honeypot VPN to access
| competitor's traffic? They are definitely keeping their
| hands off their internal data, yes.
| Dr_Birdbrain wrote:
| I bet they are training their internal models on the
| data. Bet the real reason they are not training open
| source models on that data is because of fears of
| knowledge distillation, somebody else could distill LLaMa
| into other models. Once the data is in one AI, it can be
| in any AIs. This problem is of course exacerbated by open
| source models, but even closed models are not immune, as
| the Alpaca paper showed.
| sangnoir wrote:
| > Content you're allowed and capable of scraping on the
| Internet is such a small amount of data, not sure why
| people are acting otherwise
|
| YMMV depending on the value of "you" and your budget.
|
| If you're Google, Amazon or even lower tier companies like
  | Comcast, Yahoo or OpenAI, you can scrape a massive amount of
| data (ignoring the "allowed" here, because TFA is about
| OpenAI disregarding robots.txt)
| bevekspldnw wrote:
  | Gonna say you're way off there. Once you decompress Common
  | Crawl and index it for FTS and put it on fast storage,
| you're in for some serious pain, and that's before you even
| put it in your ML pipeline.
|
  | Even RefinedWeb runs about 2TB once loaded into Postgres
  | with tsvector columns, and that's a substantially smaller
  | dataset than Common Crawl.
|
  | It's not just dumping a ton of zip files on your NAS; it's
  | making the data responsive and usable.
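  |
  | As a minimal sketch of that loading step (assuming psycopg2
  | and Postgres 12+; the database and all names here are made
  | up):
  |          import psycopg2
  |
  |          conn = psycopg2.connect("dbname=crawl")
  |          cur = conn.cursor()
  |          # body_tsv is maintained by Postgres itself, so FTS
  |          # queries can hit a GIN index instead of re-parsing text.
  |          cur.execute("""
  |              CREATE TABLE IF NOT EXISTS pages (
  |                  url  text PRIMARY KEY,
  |                  body text,
  |                  body_tsv tsvector GENERATED ALWAYS AS
  |                      (to_tsvector('english', body)) STORED
  |              )""")
  |          cur.execute("CREATE INDEX IF NOT EXISTS pages_tsv_idx"
  |                      " ON pages USING gin (body_tsv)")
  |          conn.commit()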
| Dylan16807 wrote:
| How important is full text search for training an LLM,
| compared to a pile of zip files with a gigabyte of text
| each?
| michaelt wrote:
      | Maybe not _full_ full text search, but you'll generally
| want to remove the duplicates and suchlike.
| Dylan16807 wrote:
| I guess you want some fast extra storage for as long as
        | it takes to run https://github.com/chatnoir-eu/chatnoir-copycat
        | but that's a very temporary thing.
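        |
        | Not that tool itself, but a toy version of the underlying
        | idea (word shingles plus Jaccard overlap) looks roughly
        | like:
        |          def shingles(text, k=5):
        |              words = text.lower().split()
        |              return {" ".join(words[i:i + k])
        |                      for i in range(max(1, len(words) - k + 1))}
        |
        |          def jaccard(a, b):
        |              return len(a & b) / len(a | b) if (a | b) else 0.0
        |
        |          # Pages whose shingle sets overlap heavily are treated
        |          # as near-duplicates and one copy is dropped.
        |          def near_duplicate(doc_a, doc_b, threshold=0.8):
        |              return jaccard(shingles(doc_a),
        |                             shingles(doc_b)) >= threshold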
| nicolas_17 wrote:
| "Content you're allowed to scrape from the internet" is
| MUCH smaller than what LLMs have actually scraped, but they
| don't care about copyright.
|
| > The fears that we have "used all the data" are incredibly
| unfounded.
|
| The problem isn't whether we used all the real data or not,
| the problem is that it becomes increasingly difficult to
| distinguish real data from previous LLM outputs.
| Dylan16807 wrote:
| > "Content you're allowed to scrape from the internet" is
| MUCH smaller than what LLMs have actually scraped, but
| they don't care about copyright.
|
| I don't know about that. If you scraped the same data and
| ran a search engine I think people would generally say
| you're fine. The copyright issue isn't the scraping step.
| sarah_eu wrote:
| Well it will be multimodal, training and inferring on feeds of
| distributed sensing networks; radio, optical, acoustic,
| accelerometer, vibration, anything that's in your phone and
| much besides. I think the time of the text-only transformer has
| already passed.
| pants2 wrote:
| OpenAI will just litter microphones around public spaces to
| record conversations and train on them.
| sarah_eu wrote:
| Has been happening for at least 10 years.
| khalladay wrote:
| Got a source for that?
| oceanplexian wrote:
| Want a real conspiracy?
|
| What do you think the NSA is storing in that datacenter
        | in Utah? PowerPoint presentations? All that data is
| going to be trained into large models. Every phone call
| you ever had and every email you ever wrote. They are
| likely pumping enormous money into it as we speak,
| probably with the help of OpenAI, Microsoft and friends.
| sangnoir wrote:
| > What do you think the NSA is storing in that datacenter
| in Utah?
|
          | A buffer with several days' worth of the entire internet's
| traffic for post-hoc decryption/analysis/filtering on
| interesting bits. All that tapped backbone/undersea cable
| traffic has to be stored somewhere.
| berniedurfee wrote:
| It would be absolutely fascinating to talk to the LLMs of
| the various government spy agencies around the world.
| fragmede wrote:
| https://harpers.org/archive/2024/03/the-pentagons-
| silicon-va...
|
| If they actually worked, that is.
| choilive wrote:
| As I understand it, they don't have the capability to
            | essentially PCAP all that data, and the data wouldn't be
            | that useful since most interesting traffic is encrypted
            | as well. Instead they store the metadata around the
            | traffic: phone number X made an outgoing call to Y @
            | timestamp A, call ended at timestamp B, approximate
            | location is Z, etc. Repeat that for internet IP addresses,
            | do some analysis, and then you can build a pretty
            | interesting web of connections and how they interact.
| fragmede wrote:
| > most interesting traffic is encrypted as well
|
| encrypted with an algorithm currently considered to be
| un-brute-forcible. If you presume we'll be able to
| decrypt today's encrypted transmissions in, say, 50-100
| years, I'd record the encrypted transmission if I were
| the NSA.
| michaelt wrote:
| It's a big data centre.
|
| But is it big enough to store _50 years worth_ of
| encrypted transmissions?
|
| Far cheaper to simply have spies infiltrate the ~3
| companies that hold the keys to 98% of internet traffic.
| beau_g wrote:
| Though it seems like something that could exist, who is
| doing the technical work/programming? It seems impossible
| to be in the industry and not have associates and
| colleagues either from or going to an operation like
            | that. This is what I've always pondered when it
            | comes to any idea like this. The number of engineers at
| the pointy end of the tech spear is pretty small.
| FridgeSeal wrote:
| That doesn't get around the root problem, just gives us
| multi-modal junk results lol.
| jayd16 wrote:
| It's true that there will no longer be any virgin forest to
| scrape, but it's also true that content humans want will still
| be the most popular, promoted, curated, edited, etc. Even if
| it's impossible to train on organic content, it'll still
| be possible to get good content.
| RogerL wrote:
| Is it (I am not a worker in this space, so genuine question)?
|
| My thoughts - I teach myself all the time. Self reflection with
| a loss function can lead to better results. Why can't the LLMs
| do the same (I grasp that they may not be programmed that way
| currently)? Top engines already do it with chess, go, etc. They
| exceed human abilities without human gameplay. To me that seems
| like the obvious and perhaps only route to general
| intelligence.
|
| We as humans can recognize botnets. Why wouldn't the LLM? Sort
| of in a hierarchical boost - learn the language, learn about bots
| and botnets (by reading things like this discussion), learn to
| identify them, learn that their content doesn't help the loss
| function much, etc. I mean sure, if the main input is "as a
| language model I cannot..." and that is treated as 'gospel'
| that would lead to a poor LLM, but I don't think that is the
| future. LLMs are interacting with humans - how many times do
| they have to re-ask a question - that should be part of the
| learning/loss function. How often do they copy the text into
| their clipboard (weak evidence that the reply was good)? do you
| see that text in the wild, showing it was used? If so, in what
| context "Witness this horrible output of chatGPT: <blah>"
| should result in lower scores and suppression of that kind of
| thing.
|
| I dream of the day where I have a local LLM (i.e. individualized,
| I don't care where the hardware is) as a filter on my internet.
| Never see a botnet again, or a Stack Overflow Q&A that is just
| "this has already been answered" (just show me where it _was_
| answered), rewrite things to fix grammar, etc. We already have
| that with automatic translation of languages in your browser,
| but now we have the tools for something more intelligent than
| that. That sort of thing. Of course there will be an arms race,
| but in one sense who cares. If a bot is entirely
| indistinguishable from a person, is that a difference that
| matters? I can think of scenarios where the answer is an
| emphatic YES, but overall it seems like a net improvement.
| bogwog wrote:
| > Eventually, OpenAI (and friends) are going to be training
| their models on almost exclusively AI generated content
|
| What makes you think this is true? Yes, it's likely that the
| internet will have more AI generated content than real content
| eventually (if it hasn't happened already), but why do you
| think AI companies won't realize this and adjust their training
| methods?
| loloquwowndueo wrote:
| Many AI content detectors have been retired because they are
| unreliable - AI can't consistently identify AI-generated
| content. How would they adjust then?
| mlboss wrote:
| The only way out of this is robots that can go out in the world
| and collect data, writing down in natural language what they
| observe, which can then be used to train better LLMs.
| Salgat wrote:
| As long as humans continue to filter out the bad content
| generated by AI, it should be fine.
| eightysixfour wrote:
| It is already solved. Look at how Microsoft trained Phi - they
| used existing models to generate synthetic data from textbooks.
| That allowed them to create a new dataset grounded in "fact" at
| a far higher quality than Common Crawl or others.
|
| It looks less like an ouroboros and more like a bootstrapping
| problem.
| mattc0m wrote:
| AI training on AI-generated content is a future problem.
| Using textbooks is a good idea, until our textbooks are being
| written by AI.
|
| This problem can't really be avoided once we begin using AI
| to write, understand, explain, and disseminate information
| for us. It'll be writing more than blogs and SEO pages.
|
| How long before we start readily using AI to write academic
| journals and scientific papers? It's really only a matter of
| time, if it's not already happening.
| eightysixfour wrote:
| You need to separate "content" and "knowledge." GenAI can
| create massive amounts of content, but the knowledge you
| give it to create that content is what matters and why RAG
| is the most important pattern right now.
|
| From "known good" sources of knowledge, we can generate an
| infinite amount of content. We can add more "known good"
| knowledge to the model by generating content about that
| knowledge and training on it.
|
| I agree there will be many issues keeping up with what
| "known good" is, but that's always been an issue.
| ezekg wrote:
| > We can add more "known good" knowledge to the model by
| generating content about that knowledge and training on
| it.
|
| That's my entire point -- AI only _generates content_
| right now, but it will also be _the source of content_
| for training purposes soon. We need a "known good" human
| knowledge-base, otherwise generative AI will degenerate
| as AI generated content proliferates.
|
| Crawling the web, like in the case of the OP, isn't going
| to work for much longer. And books, video, and music are
| next.
| FridgeSeal wrote:
| Is this like, the AI equivalent of "another layer will fix
| it" that crypto fans used?
|
| "It's ok bro, another model will fix, just please, one more
| ~layer~ ~agent~ model"
|
| It's all fun and games until you can't reliably generate your
| base models anymore, because all your _base_ data is too
| polluted.
|
| Let's not forget MS has a $10bn stake in the current crop of
  | LLMs turning out to be as magic as they claim, so I'm sure
| they will do anything to ensure that happens.
| TaylorAlexander wrote:
| I really really hope that five years from now we are not still
| using AI systems that behave the way today's do, based on
| probabilistic amalgamations of the whole of the internet. I
| hope we have designed systems that can reason about what they
| are learning and build reasonable mental models about what
| information is valuable and what can be discarded.
| SrslyJosh wrote:
| Everyone saying "ouroboros": The phrase you're looking for is
| "human centipede". =)
| atleastoptimal wrote:
| They've obviously been thinking about this for a while and are
| well aware of the pitfalls of training on AI based content.
| This is why they're making such aggressive moves into video,
| audio, and other better, more robust forms of ground truth. Do
| you really think that they aren't aware of this issue?
|
| It's funny whenever people bring this up, they think AI
| companies are some mindless juggernauts who will simply train
| without caring about data quality at all and end up with worse
| models that they'll still for some reason release. Don't people
| realize that attention to data quality is the core
| differentiating feature that led companies like OpenAI to
| their market dominance in the first place?
| FridgeSeal wrote:
| I, for one, welcome the junk-data-ouroboros-meta-model-collapse.
| I think it'll force us out of this local maximum of the "moar
| data moar good" mindset and give us, collectively, a chance to
| evaluate the effect these things have on our society. Some
| proverbial breathing room.
| altdataseller wrote:
| If they follow robots.txt, OpenAI also has a bot-blocking +
| data-gathering problem:
| https://x.com/AznWeng/status/1777688628308681000
|
| 11% of the top 100K websites already block their crawler, more
| than all their competitors (Google, FB, Anthropic, Perplexity)
| combined
| Jordan-117 wrote:
| It's not just a problem for training, but the end user, too.
| There are so many times that I've tried to ask a question or
  | request a summary of a long article, only to be told it can't
  | read it itself, so you have to copy-paste the text into the
| chat. Given the non-binding nature of robots.txt and the way
| they seem comfortable with vacuuming up public data in other
| contexts, I'm surprised they allow it to be such an obstacle
| for the user experience.
| lobsterthief wrote:
| That's the whole point. The site owner doesn't want their
| information included in ChatGPT--they want you going to their
| website to view it instead.
|
| It's functioning exactly as designed.
| fragmede wrote:
| If my web browser's extension "visits" the site and dumps
| it into ChatGPT for me to read its summarization of the
| site, what has been gained by the website operator?
| rsolva wrote:
| Added friction. That is all a website owner can hope to
| achieve.
| jpambrun wrote:
      | It's a stretch to expect a human-initiated action to abide
      | by robots.txt.
      |
      | Also, once you click on a link in Chrome, it's pretty much
      | all robot-parsed and rendered from there as well.
| chrstphrknwtn wrote:
| At bottom, all robot actions are human initiated.
| Zambyte wrote:
| I would say robots.txt is meant to filter access for
        | interactions initiated by an automated process (i.e.
| automatic crawling). Since the interaction to request a
| site with a language model is manual (a human request) it
| doesn't make sense to me that it is used to block that
| request.
|
| If you want to block information you provide from going
| through ClosedAI servers, block their IPs instead of using
| robots.txt.
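        |
        | For example, in nginx (the prefix below is a placeholder --
        | use whatever ranges the bot operator actually publishes):
        |          # inside the server block:
        |          deny 20.15.240.0/24;   # hypothetical GPTBot range
        |
        |          # and/or refuse by user-agent string:
        |          if ($http_user_agent ~* "GPTBot") {
        |              return 403;
        |          }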
| tivert wrote:
| Honestly, that seems like an excellent opportunity to feed
| garbage into OpenAI's training process.
| sangnoir wrote:
| So someone could hypothetically perform a _Microsoft-Tay-style_
| attack against OpenAI models using an infinite number of
| Potemkin subdomains generated on the fly on a $20 VPS? One
| could hypothetically use GenAI to create the biased pages with
| repeated calls on how it'd be great to JOIN THE NAVY on 27,000
| different "subdomains".
| aaron695 wrote:
| The website is 12 years old and explained here -
|
| https://circleid.com/posts/20120713_silly_bing
|
| John Levine is a known name in IT. Probably best known on HN as
| the author of "UNIX For Dummies".
| samsullivan wrote:
| With all the news about scraping legality you'd think a multi
| billion dollar AI company would try to obfuscate their attempts.
| TechDebtDevin wrote:
  | If you're not walling off your content behind a login whose
  | terms prohibit scraping, then scraping that site is 100%
  | legal. Robots.txt isn't a legal document.
| hermannj314 wrote:
| I frequently respect the wishes of other people without any
| legal obligation to do so, in business, personal, and
| anonymous interactions.
|
| I do try to avoid people that use the law as a ceiling for
| the extension of their courtesy to others, as they are
| consistently quite terrible people.
| withinboredom wrote:
| If the industry doesn't self-regulate (ie, following
| conventional rules and basic human courtesy) ... then it will
| be regulated by laws.
|
| So let me fix what you said for you:
|
| > Robots.txt isn't a legal document, yet.
| karaterobot wrote:
| My assumption is that OpenAI reads the robots.txt, but indexes
| anyway; they just make a note of what content they weren't
| _supposed_ to index.
| EVa5I7bHFq9mnYK wrote:
| And assign such content double weight in training ..
| cdme wrote:
| If they don't respect robots.txt then block them using a firewall
| or other server config. All of these companies are parasites.
| pksebben wrote:
| I don't think this message is about "protecting the site's
| data" quite so much as "hey guys, you're wasting a ton of time
  | and network capacity to make your model worse. Might wanna do
| something 'bout that"
| cdme wrote:
| I suppose in that case, let them keep wasting their time.
| jeremyjh wrote:
| The entire purpose of this website is to identify bad actors
| who do not respect robots.txt, so that they can be publicly
| shamed.
| cdme wrote:
| Well, we know where OpenAI lands then.
| m3047 wrote:
| No. I've run 'bot motels myself. I've got better things to do
| than curating a block list when they can just switch or
| renumber their infrastructure. Most notably I ran a 'bot motel
| on a compute-intensive web app; it was cheaper to burn
| bandwidth (and I slow-rolled that) than CPU cycles. Poisoning
| the datasets was just lulz.
|
| I block ping from virtually all of Amazon; there are a few
| providers out there for which I block every naked SYN coming to
| my environment except port 25, and a smattering I block
| entirely. I can't prove that the pings even come from Amazon,
| even if the pongs are supposed to go there (although I have my
| suspicions that even if the pings don't come from the host
| receiving the pongs the pongs are monitored by the generator of
| the pings).
|
| The point I'm making is that e.g. Amazon doesn't have the right
  | to sell access to my compute, and tragedy of the commons
| applies, folks. I offered them a live feed of the worst
| offenders, but all they want is pcaps.
|
| (I've got a list of 50 prefixes, small enough to be a separate
| specialty firewall table. It misses a few things and picks up
| some dust bunnies. But contrast that to the 8,000 prefixes they
| publish in that JSON file. Spoiler alert: they won't admit in
| that JSON file that they own the entirety of 3.0.0.0/8. I'm
| willing to share the list TLP:RED/YELLOW, hunt me down and
| introduce yourself.)
| ta_9390 wrote:
| I am wondering if Amazon fixed the issue or blacklisted *.sp.am
| ta_9390 wrote:
| This can be repurposed as a legal form of ransomware:
|
| "Pay me to shut my absolutely legal site down to make your life
| easier."
| TheKarateKid wrote:
| Or.. you can not have your bots crawl other people's property
| without permission.
| layer8 wrote:
| Content farms of that size should be considered a public order
| offense and be prohibited.
| Animats wrote:
| Just feed them bogus info and corrupt their models. That will
| make them stop.
| sandworm101 wrote:
| Dude, you have caught the spider. Now use it. Start inserting
| whatever random junk you can until "astronaut riding a horse"
| looks more like Ronald McDonald driving a Ferrari.
|
| I feel like inserting "free user-tagged vacation images" into my
| robots.txt then pointing the spider at an endless series of
| fabric swatches.
| _pdp_ wrote:
| In the network security world, this is known as a tarpit. You can
| delay an attack, scan or any other type of automation by sending
| data either too slowly or in such a way as to cause infinite
| recursion. The result is wasted time and energy for the attacker
| and potentially a chance for us to ramp up the defences.
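|
| A minimal sketch of the slow-response variant in Python (port
| and payload are arbitrary):
|          import time
|          from http.server import BaseHTTPRequestHandler, HTTPServer
|
|          class Tarpit(BaseHTTPRequestHandler):
|              def do_GET(self):
|                  self.send_response(200)
|                  self.send_header("Content-Type", "text/html")
|                  self.end_headers()
|                  # Drip one byte per second: each crawler connection
|                  # stays tied up for minutes at near-zero cost to us.
|                  for ch in "<html><body>" + "filler " * 500:
|                      try:
|                          self.wfile.write(ch.encode())
|                          self.wfile.flush()
|                          time.sleep(1)
|                      except BrokenPipeError:
|                          break
|
|          HTTPServer(("", 8080), Tarpit).serve_forever()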
| bityard wrote:
| From the content of the email, I get the impression that it's
| just a honeypot. Also I'm not seeing any delays in the content
| being returned.
|
| A tarpit is different because it's designed to slow down
| scanning/scraping and deliberately waste an adversary's
| resources. There are several techniques but most involve
| throttling the response (or rate of responses) exponentially.
| 1317 wrote:
| He's not done his robots.txt properly, he's commented out the bit
| that actually disallows it:
|          # silly bing
|          #User-agent: Amazonbot
|          #Disallow: /
|
|          # buzz off
|          #User-agent: GPTBot
|          #Disallow: /
|
|          # Don't Allow everyone
|          User-agent: *
|          Disallow: /archive
|
|          # slow down, dudes
|          #Crawl-delay: 60
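|
| For the first two blocks to actually disallow those bots, the
| entries would presumably need uncommenting, something like:
|          User-agent: Amazonbot
|          Disallow: /
|
|          User-agent: GPTBot
|          Disallow: /
|
|          User-agent: *
|          Disallow: /archive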
| haeffin wrote:
| The contents changed between then and now.
| disjunct wrote:
| Time to link the crawler to a site like keys.lol[1] that indexes
| and links every Bitcoin private key and figure out a way to sweep
| it for balances.
|
| [1]: https://keys.lol/
| dhosek wrote:
| Am I the only one who was hoping--even though I knew it wouldn't
| be the case--that OpenAI's server farm was infested with actual
| spiders and they were getting into other people's racks?
| yreg wrote:
| very xkcd
| azurezyq wrote:
| This reminds me of the binary search tree project on web crawler
| behavior research. It was a bit old, but of really good quality.
|
| http://drunkmenworkhere.org/219
| gwbas1c wrote:
| Aren't there plenty of reverse proxies you can put a site behind
| that will throttle this kind of thing?
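|
| I'm thinking of something like nginx's stock rate limiting
| (numbers arbitrary, and assuming the crawler doesn't rotate
| source IPs):
|          # in the http block:
|          limit_req_zone $binary_remote_addr zone=crawlers:10m rate=1r/s;
|
|          # in the relevant server/location block:
|          limit_req zone=crawlers burst=5;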
| symlinkk wrote:
| Isn't it funny that all the "worthless" content out there on the
| internet is actually changing the world? Like how 4chan was
| mocked as being a cesspit for losers, but now everyone knows
| memes like Pepe the Frog and Wojak from there. And now this
| very comment and the billions of other comments on here, Reddit,
| Twitter, etc. that are regarded as a "waste of time" are being
| used by multi-billion-dollar companies to build the most
| powerful AI the world has ever seen. For free.
|
| The moral of the story here is if you know something valuable,
| don't share it online, because then everyone knows it.
___________________________________________________________________
(page generated 2024-04-11 23:01 UTC)