[HN Gopher] Web scraping for me, but not for thee
___________________________________________________________________
Web scraping for me, but not for thee
Author : mhb
Score : 219 points
Date : 2023-08-25 17:42 UTC (5 hours ago)
(HTM) web link (blog.ericgoldman.org)
(TXT) w3m dump (blog.ericgoldman.org)
| waydegg wrote:
| It's interesting seeing the reactions from other websites/orgs
| after OpenAI publicly announced GPTBot. Tons of people blocking
| GPTBot outright (made a small page that tracks this:
| https://wayde.gg/websites-blocking-openai)
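| For context, a minimal sketch of what such a block looks like from
| the crawler's side, assuming a site publishes the hypothetical
| robots.txt below. `GPTBot` is the user-agent token OpenAI announced;
| the example URL and the `is_allowed` helper are made up for
| illustration. It uses Python's standard `urllib.robotparser`:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that turns away only OpenAI's crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", url))
    print("OtherBot allowed:", is_allowed(ROBOTS_TXT, "OtherBot", url))
```

| Note that robots.txt is only a request: it has an effect only on
| crawlers that choose to honor it.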
| dontupvoteme wrote:
| Is there any legal issue with a spider trap designed to poison
| LLMs?
| version_five wrote:
| I wonder if blocking gptbot is a good signal that a website has
| non-LLM-generated content on it, and is therefore good training
| data...
| fasterik wrote:
| _> Let's look at what Microsoft is doing right now, as an
| example. In the last couple of weeks, Microsoft updated its
| general terms of use to prohibit scraping, harvesting, or similar
| extraction methods of its AI services. Also in the last couple of
| weeks, Microsoft affiliate OpenAI released a product called
| GPTbot, which is designed to scrape the entire internet. And
| while they don't admit this publicly, OpenAI has almost certainly
| already scraped the entire non-authwalled-Internet and used it as
| training data for GPT-3, ChatGPT, and GPT-4. Nonetheless, without
| any obvious hints of irony, OpenAI's own terms of use prohibits
| scraping._
|
| I don't understand why this demonstrates hypocrisy. There is a
| big difference between crawling the publicly accessible web
| (which legitimate search engines do all the time) and scraping an
| authenticated web application or API.
| Atotalnoob wrote:
| Or that while Microsoft is an investor in OpenAI, they do not
| control OpenAI
| einpoklum wrote:
| Perhaps it's legal to scrape in some world states outside the
| USA?
| dontupvoteme wrote:
| A good question. Japan came out and declared that copyright
| does not apply to training AI.
|
| There must be a good chunk of the world that doesn't have any
| laws forbidding it. This isn't under the jurisdiction of WIPO
| or anything like that, it's just a completely insane evolution
| of anglo common law
| SenAnder wrote:
| > Mark Lemley observed this happening nearly 20 years ago, in his
| prescient, seminal article, "Terms of Use.": _The problem is that
| the shift from property law to contract law takes the job of
| defining the Web site owner's rights out of the hands of the law
| and into the hands of the site owner._
|
| With "contracts" of adhesion proliferating, and how impossible it
| has become to exist in the modern world without acceding to them
| (something as simple as buying a new SSD involves agreeing to
| one), this problem is getting worse by the day.
|
| The law is becoming increasingly irrelevant, and more and more we
| are ruled by one-sided "contracts" from giant companies that are
| in a position to push them on us.
| cvalka wrote:
| Contractual law in the modern era regularly and persistently
| undermines private property rights. Mandatory arbitration
| clauses make it worse.
| Buttons840 wrote:
| Well said. This reminds me of my own thoughts:
|
| There are two ways of thinking about what a webpage is:
|
| 1) A web page is a billboard
|
| 2) A web page is a pamphlet
|
| If a webpage is a billboard, then it is morally wrong for me to
| paint over those sections of the billboard that I do not like
| (i.e., using an ad-blocker).
|
| If a webpage is a pamphlet, then I'm free to cut it up and re-
| arrange it however I want. Naturally, those with knowledge to
| cut and re-arrange are more likely to take this view.
|
| It's fair to say that Amazon.com contains Amazon's webpage, and
| that Amazon owns that web page. And yet, I've never once viewed
| Amazon.com without using an electronic device owned by myself
| or another non-Amazon entity. Amazon.com doesn't exist on a
| billboard, it requires the use of electronic devices owned by
| other people. What rights do the owners of those electronic
| devices have? Any?
| nre wrote:
| > With "contracts" of adhesion proliferating, and how
| impossible it has become to exist in the modern world without
| acceding to them (something as simple as buying a new SSD
| involves agreeing to one), this problem is getting worse by the
| day.
|
| The craziest example of this is how all these contracts are
| appearing in the _physical_ world as well. There are stores
| that actually have a sign indicating that entering the store
| constitutes acceptance of contract terms (with a QR code that
| you presumably can scan with your phone to read the contract).
| I've also seen public parks with the same thing basically
| indicating that entry binds you to a legal agreement to not sue
| the park/follow posted rules/etc.
| golemiprague wrote:
| [dead]
| actuallyalys wrote:
| Public parks saying that is kind of strange because the city
| or town could already set the rules by enacting an ordinance.
| Presumably they could also delegate that authority to the
| parks department. I suppose Parks might be doing it because
| the city council or mayor isn't enacting the ordinances they
| want.
| profile53 wrote:
| There's a dead reply to your post saying that this occurs
| because of the insanely litigious nature of the USA. I think
| it's worth highlighting -- business/property owners are
| essentially trying to use contract law to route around the
| fact that the US legal system is broken with regards to civil
| litigation and throwing out bogus cases. For example, having
| a private pool in your own back yard can make you liable for
| someone else's child breaking in and injuring themselves in
| your pool because you not having enough barriers to stop them
| means you allowed the access.
| Eisenstein wrote:
| > the US legal system is broken with regards to civil
| litigation
|
| And the problem with that has a lot to do with corporations.
| For instance, if you are a pedestrian and get hit by a car
| and end up in the hospital, in a lot of places in the USA
| your health insurance will not cover you at all -- you have
| to sue the driver and get compensated from their auto
| insurance. The logical method would be for your insurance
| to cover you and then the health insurance would recover
| costs through appropriate parties.
|
| It is the same with ridiculous lawsuits like the aunt who
| sued her sister because the nephew jumped on her and threw
| out her back. In order to recoup medical costs she _had_ to
| sue her sister since the sister had homeowner's insurance.
|
| You can't entirely blame the legal system when the
| corporations are using it to perpetuate the problem for
| their own gains at the expense of everyone.
| mindslight wrote:
| [delayed]
| giraffe_lady wrote:
| And the litigiousness is downstream of having freakish
| medical expenses and no universal safety net. An accident
| can incur costs you could work your whole life to pay off so
| of course there's a complex adversarial social system built
| around those consequences.
| pulvinar wrote:
| What's needed to counter these is for customers to have their
| own contract of adhesion that simply says if the company is to
| accept them as a customer, then the company's own contract is
| null-and-void. This would be backed by a legal team in
| something like a customer's union or insurance that people
| would subscribe to for a monthly fee. This contract would be as
| enforceable (or not) as the company's, leveling the playing
| field. It would no longer matter what they put in their fine
| print since you wouldn't need to read it.
|
| If a company doesn't accept the customer's contract or won't
| let you bypass their own, you walk away -- no sale. Other
| companies will get your business.
| deepsun wrote:
| > But the content that they're trying to protect isn't theirs --
| it belongs to their users.
|
| Kinda. Yes, Facebook says that content belongs to users
| (otherwise they'd have harder time explaining they are not liable
| when it's illegal), but users also agree to give Facebook "non-
| exclusive, transferable, sub-licensable, royalty-free, worldwide
| license to use any IP content that you post on or in connection
| with Facebook."
|
| For example, if a user deleted their* content, Facebook can still
| use it and show to their friends. That's why it's "kinda".
| sib wrote:
| That doesn't change who the content belongs to. It just gives
| some rights to FB. And, in fact, without something like
| "perpetual" and/or "irrevocable" in there, it doesn't imply
| that they could keep using it after you deleted it (or that you
| couldn't revoke a grant of rights.)
| jeremyjh wrote:
| A license is not ownership. Anyway that part of the article is
| just context - none of what you describe constitutes the legal
| basis for the suits or rulings discussed in it - it's just
| explaining why property law isn't being used.
| waynesonfire wrote:
| Did you read the posted sign? "no walking on the road outside
| my property"
| antonf wrote:
| > For example, if a user deleted their* content, Facebook can
| still use it and show to their friends. That's why it's
| "kinda".
|
| I don't think it is correct. If you asked Facebook to remove
| your data from the platform, it will be a GDPR (and probably CCPA,
| etc...) violation for Facebook to not delete your data within 1
| month.
| dclowd9901 wrote:
| The primary grounds on which these cases rest is some nebulous
| understanding of contractual agreement.
|
| I have two thoughts:
|
| - EULAs aren't written for companies to sign.
|
| - I think EULAs are garbage anyway. They're completely one sided
| and in most cases probably illegal or wouldn't hold up in court
| if anyone actually had the resources to fight one.
|
| Imo, the burden of ensuring someone has read and understands a
| EULA should be on the company who creates it and they should not
| be enforceable unless they can prove the person understood the
| EULA entirely before accessing the site. EULAs are not a business
| agreement. They're some kind of corporate pseudo-law companies
| try to attach to the usage of a product. But what other product
| in the world has a big list of rules that come with it that say
| how you can use it (or be sued)?
|
| So how does this all come back to this "company vs company
| scraping"? If you put it on the web, and you don't have REAL
| copyright on the content (that is, you didn't make it yourself),
| you have no right to protect it from "theft."
|
| PS yes, I know John Deere doesn't let its customers work on its
| tractors but that's some bullshit too.
| msie wrote:
| The first company that came to mind from reading the title was
| Google.
| version_five wrote:
| Good example from the Allen Institute discussed last week
| https://news.ycombinator.com/item?id=37181415
|
| They "released" a dataset scraped from public domain stuff under
| a license that restricts how people can use it
| sneak wrote:
| If you think about it, if free lending libraries and web search
| indices did not exist, and you tried to create them today, you
| would get sued into oblivion.
| karaterobot wrote:
| The perceived hypocrisy sort of goes away when you stop thinking
| about it as a collaboration or a community of equals, and instead
| think of it as a competition, which is what it is. You would not
| say of a football team "oh, it's okay for you to try to score a
| goal on me, but if I try to score a goal on you, suddenly you're
| blocking the ball?!"
|
| Naturally, they're going to say "web scraping uses resources,
| stop it!" but then keep web scraping in the background.
|
| To be clear, it's bad behavior, it's just not hypocritical
| behavior, as it's completely in keeping with what amoral
| corporations locked in constant battle would be expected to do:
| maximize benefits to themselves while minimizing benefits to
| others.
| philipov wrote:
| Hypocrisy doesn't require one to believe what they say and
| utter it in good faith but fall short of those ideals in
| practice. Equivocating about football teams doesn't change that
| one is trying to impose standards on another without holding
| oneself equally to them. It is still hypocrisy, even if they do
| it amorally in bad faith. _Especially_ then. What matters is
| what policy you espouse; you don't get a pass for not really
| believing what you say. The _implication_ is that hypocrites
| are acting in bad faith.
| runesofdoom wrote:
| The problem with that sort of "that's what amoral corporations do"
| reasoning is that corporations are _permitted to exist_ because
| of the idea that they do contribute to the net public good.
| Once that idea is out the window, then there's no reason for
| society not to treat corporations as the hungry Lovecraftian
| nightmares they are and obliterate them with fire and
| steamship.
| KieranMac wrote:
| I think the difference is that defeating the other team is the
| point of sports, whereas at least ostensibly the law is
| supposed to provide a set of coherent rules for businesses to
| compete against each other. Trademarks are defined according to
| certain legal rules, and if you have one, this is how they
| provide you with a limited monopoly in a certain context.
| Allowing businesses to define property law through contracts
| lets people define the rules however they want. And that leads
| to irrational results.
| gnomewascool wrote:
| That's a very interesting comparison (thanks!), but I'm not
| sure if it's the correct framing. Making scraping technically
| difficult would be equivalent to trying to score a goal (so
| still not great, for the rest of the world, but probably not
| hypocritical).
|
| Trying to prevent certain classes of behaviours via legal means
| is more like trying to prevent certain types of play, by
| appealing to the referee, while still doing them yourself.
| Clearly, this often does happen in sports, but _is_ generally
| seen as hypocritical.
| jjoonathan wrote:
| > football team
|
| In football, the rules have been extensively tuned to promote a
| fair fight.
|
| Perhaps we should do a bit more of that sort of thing in
| corporate law.
| Kareem71 wrote:
| The problem, as this article points out, is that democratically
| elected courts should not be choosing winners in a capitalistic
| competition
| karaterobot wrote:
| Only commenting on how we should expect corporations to act,
| or more accurately why we should not be surprised at their
| behavior.
| hattmall wrote:
| As the article states the issue is with courts not companies.
| We need a state actor to pass a law similar to the weapons laws
| of other states that guarantee a right to scrape. Then all the
| scraping companies set up shop in that state. If a service
| doesn't want their data scraped they need to make sure that it
| doesn't get sent into that state. Ideally a large enough state
| that companies wouldn't want to block.
| nosecreek wrote:
| Agreed that legal clarity is important - especially for
| smaller players. I've built a significant hobby site that
| relies fairly heavily on scraping (grocery price comparison
| site). I believe what I'm doing is morally okay, and also
| that big players wouldn't run into any issues, but when it's
| just me (or even if it was a small company) the legal 'grey
| area' makes it a much bigger risk.
| tomcam wrote:
| I like what you're saying, but how do you provide for the
| existence of evil or simply incompetent scrapers who drag the
| system down due to incompetence?
| backtoyoujim wrote:
| There are clearly more than two teams in these issues. It is
| not a game, and it is not football.
|
| It is a public policy issue that also outpaces "competition",
| which is merely a subject change.
| karaterobot wrote:
| Football is metaphorical in this case.
| autoexec wrote:
| > Naturally, they're going to say "web scraping uses resources,
| stop it!"
|
| that's the expected cost of publishing something to the public
| internet. People are going to access it. No one has a right to
| complain when people access something that was put there for
| the public to see. Scrapers can be dicks about it too, they
| can get lazy and endlessly hammer away at some server or
| repeatedly pull down the same content because they messed up,
| but we don't need litigation for that. If something rises
| to the level of DoS that's already covered under existing laws.
|
| > it's completely in keeping with what amoral corporations
| locked in constant battle would be expected to do: maximize
| benefits to themselves while minimizing benefits to others.
|
| Maybe we need to rethink giving some of these corporations the
| privilege of corporate personhood if they are just going to
| make things worse for everyone else while only enriching
| themselves. We don't need to allow parasites and pillagers to
| take whatever they want at our expense.
| mrkeen wrote:
| > Scrapers can be dicks about it too
|
| It's not always about individual bad actors. You can have
| lots of small players causing problems. I wonder how many
| python developers there are right now trying to make their
| own offline copy of stackoverflow.com.
|
| Wikipedia has a great defence against this. They ask you not
| to scrape, and at the same time, provide torrents of the data
| (https://meta.wikimedia.org/wiki/Data_dump_torrents)
| rzzzt wrote:
| Stack Exchange also provides one:
| https://archive.org/details/stackexchange
|
| There was a hiccup around June but that seems resolved now:
| https://meta.stackexchange.com/questions/389922/june-2023-da...
| Klonoar wrote:
| Hmmm, I'm a bit confused on something. The HiQ vs LinkedIn case,
| to my knowledge, went through the following stages:
|
| - LinkedIn sues HiQ, Ninth Circuit sides with HiQ
|
| - LinkedIn pushes to Supreme Court, Supreme Court vacates citing
| Van Buren
|
| - Ninth Circuit re-reviews and _affirms their decision_
|
| - LinkedIn moves to get the injunction preventing them from
| blocking HiQ dissolved, which is granted
|
| - A mixed judgement is finally issued in Nov 2022 ultimately
| resulting in a private settlement
|
| Where exactly does this leave things at? I feel like everyone
| loves to cite this case but never goes into the finer details.
|
| Reading a summary of the mixed judgement from Nov 2022, it looks
| like maybe the issue came from HiQ using people to log in and
| thus the ToS came into play...? If I'm reading correctly, it
| looks like the court did eventually side with LinkedIn in stating
| that HiQ violated the ToS.
|
| https://www.natlawreview.com/article/court-finds-hiq-breache...
|
| _Edit: Formatting._
| dontupvoteme wrote:
| What is the legal precedent of a mixed ruling? I was unaware
| such a thing was even possible.
| KieranMac wrote:
| Not a mixed judgment in Nov. 22. It was a massive defeat for
| hiQ Labs. Read the permanent injunction issued by the court.
| Klonoar wrote:
| Interesting. You appear to be a lawyer or in that realm, so
| I'm curious your take on it - though I also understand if you
| don't want to publicly make statements or anything.
|
| i.e., is the common take that people have of "scraping is legal
| after HiQ vs LinkedIn" just completely wrong?
| dontupvoteme wrote:
| > Read the permanent injunction issued by the court.
|
| Happen to have a link?
|
| The question that matters is if this establishes any
| precedent.
| KieranMac wrote:
| I don't. I just have a .pdf. Email me at
| Kieran(at)McCarthyLG(dot)com if you want a copy.
| zarazas wrote:
| What is the legal and ethical situation if you use scraped
| data as text embeddings for a commercial product?
| kazinator wrote:
| > _Some of the biggest companies on earth--including Meta and
| Microsoft--take aggressive, litigious approaches to prohibiting
| web scraping on their own properties, while taking liberal
| approaches to scraping data on other companies' properties._
|
| Umm, no; author needs to study the word "hypocrisy" more deeply
| than a cursory glance in the dictionary.
|
| Doing something to others, while defending against the same
| thing, is not hypocrisy.
|
| For instance, a soccer player isn't a hypocrite for defending
| against the ball going into his net, while trying to put it into
| the other team's net.
|
| A soldier on the war front isn't a hypocrite for shooting, while
| also taking cover and dodging bullets.
|
| These subjects are not hypocrites because they are not acting in
| one way, while preaching that they, or others, ought to be acting
| in a different way.
|
| Microsoft would be hypocrites if they published an official
| statement such that nobody who engages in web scraping has the
| right to defend their own site against web scraping, because that
| would not resemble their actual behavior and position which could
| be inferred from their behavior. (Is there such a statement
| somewhere?)
|
| For hypocrisy to take place, you have to actually preach that you
| and others should behave in a certain way, and then not actually
| behave in exactly that way. If you only act, and don't preach,
| you cannot be a hypocrite.
|
| Moreover, your team's net is not the same object as another
| team's net. If a soccer player loudly professes "it is morally
| wrong for anyone to kick the ball into our net", but then kicks
| the ball into the other team's net, that is not hypocrisy. His
| statement references only his own net; he didn't proclaim that
| it's wrong to kick any ball into any net whatsoever.
| KieranMac wrote:
| Scraping other sites while prohibiting it on your own is "do
| what I say, not what I do" behavior, which I think is a fair,
| consensus understanding of what it means to be hypocritical.
| MBCook wrote:
| I see two issues. Web scraping is clearly a business model
| problem, and that's partly due to scale.
|
| If you give away your content for free and expect ads to sustain
| you, that will start failing once others get the value out of
| your content without seeing the ads. Examples are ad blockers,
| answers embedded in Google results, Stack Overflow clones, and
| things like ChatGPT.
|
| If ads weren't your business model you wouldn't be losing revenue
| from it.
|
| The other issue is scale, and I don't know how to address it.
|
| It's easy for someone (say the government) to have a friendly
| policy and say "you can dig in a park" thinking it's useful
| to campers and such.
|
| But when someone shows up with a professional strip mining crew,
| things are different.
|
| If you run a site providing quality information for free, making
| money off book sales or professional services or such can be a
| good living. Even if answers end up in the Google answer box,
| more complicated stuff or analysis still requires a visit to read
| and people can start following you from there.
|
| But if ChatGPT or whatever can "read" your stuff and give out 80%
| of the value without anyone even knowing it came from you, you're
| screwed. Your business model no longer works. Any kind of "give
| away good information" business model fails. Same issue artists
| are now seeing.
|
| And I don't know how you fix that without some kind of ban. But
| unless every country everywhere enforces one... you have to work
| with the lowest common denominator and lock all your content up.
| No web search. No Google answers. No ChatGPT. "Please don't
| scrape me" in robots.txt won't work.
| tshaddox wrote:
| It's interesting, because it's essentially the same exact
| discussion as traditional copyrights (e.g. for books). The only
| difference is that book authors are generally not giving away
| their books for free on their personal website. Copyrights are
| the attempt to protect the business model of authors who want
| to sell copies of something that are otherwise extremely easy
| and cheap to copy. Attempts to legally limit web scraping are
| an attempt to protect the business model of creators who
| want to _give away for free_ copies of things that are easy and
| cheap to copy, but _only_ if we come directly to the creator to
| get our free copy.
| drunkencoder wrote:
| You're right. That's why scraping must be unlimited and legal
| for all. Any information accessible from the internet should be
| legal to refine. That includes us using GPT services to train our
| own models, scraping anything that's publicly accessible. Our
| only defense is competing services that refine the data even
| more than any general LLM. The solution is almost never
| regulation but competition. Fair competition
| giraffe_lady wrote:
| You're making an ideological argument but not confronting any
| of the business problems raised in the other comment.
| antonf wrote:
| > You're right. That's why scraping must be unlimited and
| legal for all.
|
| Unlimited scraping makes some privacy regulations moot.
| Such as right to erasure (ability to delete personal data
| from a platform).
| fluoridation wrote:
| Not exactly. You can request a site to erase all the data
| it has on you, but not that they erase the memories of
| everyone who has seen this data. How is this any different?
| text0404 wrote:
| scale
| fluoridation wrote:
| So at which scale does the copying of data lower privacy,
| such that humans looking at it and potentially
| screenshotting it doesn't, but automated processes
| copying it does?
| nawgz wrote:
| Your tone implies you're serious, but I struggle to
| believe anyone could possibly equate persisting digital
| media with recalling a memory.
|
| In case you really need an example to elucidate, consider
| reproducing an image. A scraper can quite literally
| accomplish that, trivially; a great artist would still be
| limited in multiple facets of the recreation, such that
| even one with the best memory and hand would find
| themselves far short of pixel-perfect.
| brendamn wrote:
| How many people who have seen that data are acting as a
| service to share it, at scale?
| lelandbatey wrote:
| I don't think that's true. "Right to erasure" still works
| just as well as it always has, but you might need to ask
| the folks who have scraped and are re-sharing your
| information to also delete your personal data. That's not
| an unreasonable thing to have happen, nor is it an
| unreasonable thing to expect.
|
| Let's suppose an embarrassing image of Person X is shared
| on Facebook and Person X uses their right to erasure with
| Facebook to delete their profile. Facebook has no control
| over the folks who may have downloaded or screenshot-ed
| that photo and turned it into subsequent memes. Likewise,
| if someone straight up scrapes and re-shares, that's not
| Facebook's responsibility.
|
| What I _don't_ want to see happen is for:
|
| 1. Facebook to make it somehow impossible for anyone to
| ever copy or screenshot that or any photo, preventing
| anyone from ever doing anything with photos on Facebook
| without Facebook's explicit permission. This would seem to
| be quite the loss of user agency for very little society
| wide benefit (also, how would they do this?)
|
| 2. Facebook to somehow "control" that photo so closely that
| Facebook is able to remotely revoke folks' copies and
| screenshots of said photo in the spirit of "abiding by a
| person's right to erasure"; that'd be a huge overreach, but
| seems like the only other way to approach this (though
| "how" is also an open question).
|
| Even asserting that "unlimited scraping makes some privacy
| regulations moot" seems like an implication that we can
| only have privacy laws by going towards situation #1, and
| that doesn't seem accurate given that folks can use
| existing privacy laws to remove content from any
| distributor (as long as they're compliant).
| MetaWhirledPeas wrote:
| > If you give away your content for free and expect ads to
| sustain you, that will start failing once others get the value
| out of your content without seeing the ads
|
| I don't think a paywall would fix this. One paid account is all
| a scraper needs. It couldn't really even be rate-limited if
| it's just "reading" articles as they become available. After
| the data is acquired it can be dispensed. If directly posting
| it violates copyright, then obscuring it behind AI will do the
| trick just fine.
| MBCook wrote:
| But it stops being trivial. Now to scrape websites en masse you
| have to automate signing up for them, probably paying for it.
|
| And unlike now to sign up you have to agree to a very
| enforceable EULA.
|
| So instead of going to court with "FunAI read my public
| website and is making money off it, which I don't think
| should be fair use", you have "FunAI violated a contract they
| signed and committed fraud by lying on signup".
|
| Seems to me that's much easier.
|
| There will always be people who get the content for free
| somehow. You don't have to stop 100%. Even stopping 95% would
| be a lot better than the current 0%.
| rvnx wrote:
| Sweet memories of Facebook, which was spamming all the contacts
| to invite them to join Facebook.
| ilrwbwrkhv wrote:
| Same as LinkedIn. In fact that was LinkedIn's actual growth
| hack. Now they sell books talking about other things.
___________________________________________________________________
(page generated 2023-08-25 23:00 UTC)