[HN Gopher] So you want to scrape like the big boys (2021)
___________________________________________________________________
So you want to scrape like the big boys (2021)
Author : aragonite
Score : 165 points
Date : 2024-04-27 02:56 UTC (20 hours ago)
(HTM) web link (incolumitas.com)
(TXT) w3m dump (incolumitas.com)
| blantonl wrote:
| This tends to be a _very unpopular opinion_ around here, but in
| almost all cases I find Internet scraping to be unethical and
| downright malicious. I'm not saying all cases, but I'm saying
| almost.
|
| A lot of the actors involved tend to be hustle culture types who
| think they are OWED your data, regardless of the ethics, laws,
| being a good citizen, whatever. They will blatantly disregard
| terms of service and hide behind massive setups such as these to
| circumvent protection etc.
|
| And the problem is, if you run any sort of business or service
| that is data oriented, there will be thousands of people that
| will do this, which will cause you to devote enormous amounts of
| time, effort, money, and infrastructure just to mitigate the
| issues involved with data scraping. That's before you are even
| addressing whether or not these people are "stealing" your data.
| People who feel they are entitled to the crux of your business
| aren't bothered by being nice in the way they take it - they'll
| launch services that will cripple infrastructure.
|
| Whenever I deal with a scraping process that decides it wants my
| entire business, and it wants all of it RIGHT NOW, or in 5
| minutes, I want to find the person and sit them down in a room
| and tell them "hey, develop your own ideas and business. Ok?
| Thanks"
|
| And if you think this was a problem before, it's exponentially
| worse over the past few months with every Tom, Susan, and Harry
| deciding they must have all your data to train their new LLM AI
| model. By the thousands.
| vouaobrasil wrote:
| I absolutely agree. In fact, I think the problem is that like
| everything, there is an optimal point for efficiency, and
| crossing that line by making things "too easy" when it comes to
| data means too much power for one person to handle ethically.
| Absolute power may corrupt absolutely, but near-absolute
| power also corrupts quite nicely, too.
|
| In short, we should have limits on the amount of scraping possible,
| simply because humans can never be trusted past a certain point
| to remain ethical. After all, ethics at its first approximation
| is only a mechanism to improve societal cohesiveness, and it
| only works as long as the person doesn't have enough power to
| "do away" with society.
| jumby wrote:
| Would you make the same argument of the inverse: data
| gathering?
| flir wrote:
| There's a lot of local history locked up in facebook's
| nostalgia groups. I want to archive it in an open format.
|
| I want to grab new rental listings and put them in an RSS feed,
| so I only look at each one once.
|
| That's my uses for data scraping right now. If that destroys
| someone's business, I don't actually care. Maybe it's selfish,
| but my right to re-format data for my own convenience outweighs
| their right to make a profit.
| throwaway11460 wrote:
| Not that I think you shouldn't do it or you're doing
| something wrong, but describing it as a _right_ irks me the
| wrong way. You don't have any right to expect someone else's
| computers to work for you.
| flir wrote:
| I'm not sure how to phrase it except in terms of competing
| rights, but I take your point.
|
| At the point where I'm scraping, the data's on my computer
| though.
| solarkraft wrote:
| You could call them _interests_.
|
| It's often in a business's interest to format data in a
| specific way to make money, for example interlacing it
| with ads.
| flir wrote:
| Nice.
| blantonl wrote:
| _If that destroys someone's business, I don't actually care.
| Maybe it's selfish, but my right to re-format data for my own
| convenience outweighs their right to make a profit._
|
| Exhibit A
| flir wrote:
| Yeah, it's as unsympathetic a framing of my position as I can
| offer.
|
| But it's basically the same question as adblockers: Can I
| do what I want with the 1's and 0's on my own machine?
|
| I'm not going to accept that I owe anyone a business model.
| blantonl wrote:
| I'm not going to disagree with your use case here.
|
| But I'm going to assume that you have some level of
| conscience and you don't really mean you couldn't give 3
| shits about someone else's hard work so you can have some
| satisfaction at home. Because at face value that's
| exactly what you said.
| flir wrote:
| No, I think that's fair. Unsympathetic framing, but not
| inaccurate. It's that whole "information wants to be
| free" thing.
| brigadier132 wrote:
| > hustle culture types
|
| It seems like you have this imaginary strawman that you hate
| and it seems like that's the foundation of why you dislike
| this.
| blantonl wrote:
| No. The foundation of why I dislike it is simple. If I own
| some data, then I get to dictate the terms of how that data
| is used. Period.
|
| "Hustle culture types" is simply a little anecdote about the
| types that would look you in the eye and tell you they are
| entitled to disregard what I said above. They'll usually wrap
| it in some altruistic bs to justify as well.
| throwaway11460 wrote:
| Why do you put it on the open internet if you don't want
| machines to find and read it?
|
| ToS is nice but you can't expect that it applies - the user
| (of the machine doing the scraping) might be a child which
| makes the potential contract automatically void, for
| example. Also, there are people under jurisdictions where
| such things have no power, or that don't recognize your
| rights to the data.
|
| And the whole thing of putting data out publicly and then
| just expecting machines to see the pile of data and go "oh
| so where do I sign the ToS?" is weird...
|
| Just put it behind a rate limited API key...
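One way to read the "rate limited API key" suggestion above is a per-key token bucket. The following is only a minimal sketch under stated assumptions: the class name `ApiKeyRateLimiter`, its parameters, and the in-memory dictionary are all hypothetical and are not taken from the thread or from any particular library.

```python
import time


class ApiKeyRateLimiter:
    """Hypothetical per-API-key token bucket (illustrative only)."""

    def __init__(self, capacity: float, refill_per_second: float):
        self.capacity = capacity                    # maximum burst size per key
        self.refill_per_second = refill_per_second  # sustained requests per second
        self._buckets = {}                          # api_key -> (tokens, last_seen)

    def allow(self, api_key: str, now: float = None) -> bool:
        """Return True if this key may make a request at time `now`."""
        if now is None:
            now = time.monotonic()
        # A key seen for the first time starts with a full bucket.
        tokens, last_seen = self._buckets.get(api_key, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - last_seen)
        tokens = min(self.capacity, tokens + elapsed * self.refill_per_second)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0  # spend one token for this request
        self._buckets[api_key] = (tokens, now)
        return allowed
```

A server using something like this would typically answer a refused request with an HTTP 429 response; the `now` argument exists only so the behaviour can be checked deterministically.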
| blantonl wrote:
| What makes you think putting data on the Internet all of a
| sudden means I unilaterally surrender the rights to my
| intellectual property?
|
| If I choose to make my data available to some businesses
| to make discovery of it easier, and I choose to decline
| to allow others to unilaterally copy my data to develop a
| different business, that's my right. And it is unethical
| and unreasonable for any other person to assume otherwise
| that they are entitled to the same rights I granted
| someone else.
|
| If I own some data, I get to be the arbiter of the
| who/what/when/where on the use of the data. Period.
| throwaway11460 wrote:
| Sure, you can do whatever you like. Cut the connection if
| you don't like it. But I can do whatever I like too -
| read the data that your machine sent me, for example. If
| your machine sends my machine data it's IMHO reasonable
| to expect that you don't care about me having it _unless
| we agreed otherwise_. But in many countries ToS is not
| considered a legal contract at all - just having it on
| your site somewhere is not enough. Sometimes not even
| having users check the ToS checkmark would form a valid
| contract.
|
| There are many kinds of data that can't be owned at all.
| Actually it's the other way around - there is a very
| small subset of data that can be owned. You can try to
| cover it under some kind of a non-disclosure clause in a
| contract, but again - a contract would have to exist.
| blantonl wrote:
| Look, you are trying to argue that you might want to take
| some data from me and use it in a personal, non-
| commercial sense. Cool.
|
| The entire purpose of the OP article is to develop a
| system to directly circumvent data access and protection
| mechanisms for profit. Pure and simple.
|
| Spare me the altruistic BS. No one is developing and
| utilizing a cluster of freaking _distributed servers with
| forty 4G modems_ to do anything other than steal data
| from services that don't want their data stolen, so they
| can _use it for profit_
|
| You have to call a spade a spade here.
| throwaway11460 wrote:
| What I'm saying is - your machine is fully capable of
| providing just the right amount of data to fulfill your
| purposes. If you don't like people taking it all, don't
| build a machine that gives it to them at 1 Gb/s. Stuff
| about some ToS or rights or IP ownership is just noise.
| xyzzyman wrote:
| > What makes you think putting data on the Internet all
| of a sudden means I unilaterally surrender the rights to
| my intellectual property?
|
| Because intellectual property doesn't exist.
| layer8 wrote:
| Scraping doesn't imply IP violation.
| Starman_Jones wrote:
| As an analogy, imagine that a gardener builds a beautiful
| flower garden, bisected by a cute stone path, which she
| invites the public to view freely, save for a single
| restriction; a sign reading "keep off the flower beds."
|
| There is a well-understood social contract here. I should
| not drive my car along the path, even if I don't crush the
| flowers. I shouldn't walk on the flower beds, even if
| that sign isn't legally enforceable. And if a runaway
| lawnmower, RC car, or some other machine of mine does end
| up in the garden, I am responsible, because it was my
| machine.
|
| With websites, there is even a TOS specifically for
| scrapers - robots.txt. The fact that it is easy to bypass
| or ignore is no excuse for actually bypassing or ignoring
| it.
|
| The anonymity of the Internet functions as a ring of
| Gyges, where since people don't face consequences (even
| social ones), they feel entitled to do as they will.
| However, just because you can do something does not mean
| you have a right to do something.
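For readers unfamiliar with the robots.txt convention mentioned above, a crawler can check it before fetching a page. This is a minimal sketch using Python's standard-library `urllib.robotparser`; the sample rules, the `ExampleBot` user agent, and the `example.com` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the site's /robots.txt path instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/public"))
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/page"))
```

The check is purely advisory: nothing in the protocol stops a client from skipping it, which is the point being debated in this thread.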
| throwaway11460 wrote:
| Robots.txt is definitely not any kind of ToS - some
| people (Google) said they will respect it. No reason to
| expect people to even know about the concept -
| practically nobody knows about it, not even most
| developers.
|
| And again - there are countries where any ToS without
| explicit signature or other kind of legal agreement doesn't
| apply at all.
|
| Just like writing "by using the toilet you agree to
| transfer your soul for infinity" on a piece of toilet
| paper taped somewhere in the vicinity of a toilet gives
| you nothing - even if it was a more reasonable contract,
| nobody agreed to anything.
|
| As for your other point, I think this is more like
| standing next to a highway with a sign that reads "don't
| drive cars here" and expecting people to stop and turn
| around. They didn't even see your sign at their speed and
| it's kinda unreasonable to expect they would be checking
| for that kind of a sign on a highway. At least make it
| properly - big, red, reflective (e.g. a Connection Reset,
| or at least 403 Forbidden).
| Starman_Jones wrote:
| Yes, there is no legal enforcement mechanism behind
| robots.txt. Nor do I particularly want there to be.
| However, most people agree that reasonable requests made
| regarding the use of someone's property should be
| followed. The capability to do something without
| consequences is not the same as the right to do
| something.
|
| Our gardener should not need to build a brick wall around
| their public garden to keep your lawnmower out.
| photochemsyn wrote:
| I think this analogy would be improved if the sign said
| "Please don't take any pictures." This is far more
| restrictive than a sign saying "Please don't take any
| seeds or cuttings." The latter is more understandable
| because such activity damages the flower garden
| (particularly if everyone starts taking seeds and
| cuttings).
|
| Now let's say a photographer visits the flower garden,
| takes images, and sells them online as post cards? As
| long as the photographer is not hindering other people
| (flooding the site with repeat requests, in the analogy),
| it doesn't seem to be a problem.
|
| On the other hand, let's say we don't have a flower
| garden, we have an art gallery or a street artist's
| display - or the pages of a recently published book. Now
| the issue is distributing copyrighted material without
| paying the creator... but what if there's a broad social
| consensus that copyright is out of control and should
| have been radically shortened decades ago?
|
| The vast majority of data being scraped is not
| copyrightable creative work, however, so as long as
| you're not obnoxiously hammering a site, scraping seems
| perfectly ethical.
| Dah00n wrote:
| >If I own some data, then I get to dictate the terms of how
| that data is used. Period.
|
| What if you got that data from me/users and I/we claim the
| same rights (like GDPR for example)? Will you still honour
| ownership as above?
| some1else wrote:
| Serving HTML will get you scraped. Your terms don't
| overrule fair use.
| malwrar wrote:
| If your business is just that you have a bundle of information
| and expose it over an open website, I'm not really sure how
| you're able to maintain a mentality that you are somehow
| entitled to ownership of that information. You already put it
| out there, it's now public, any illusion of exclusivity is now
| gone because anyone could come along at any time and make a
| copy without your knowledge. A moral position on this issue is
| even more confusing to me. Do you think that you e.g. own the
| knowledge on which radio frequencies are used where? Do you
| think you have a moral claim on ownership of (presumably
| unpaid) user-submitted information? I think the only legitimate
| moral grievance you have is high traffic volumes from
| inconsiderate scrapers.
| blantonl wrote:
| _Do you think you have a moral claim on ownership of
| (presumably unpaid) user-submitted information?_
|
| You damn right I do. I own, develop, and maintain the entire
| system that enabled the body of works to exist in the first
| place.
|
| Do you think that you have a claim on ownership of the data
| because you drove by, saw what you liked, and decided that
| now you'll just rip the baton out of my hand?
| malwrar wrote:
| > You damn right I do. I own, develop, and maintain the
| entire system that enabled the body of works to exist in
| the first place.
|
| I don't think that meets the bar. Running a website is
| absolutely not equivalent to the collective effort people
| put in to populate that website with the information that
| actually gives the overall artifact its value. There is a
| large history of outrage when similar information
| repository websites with user-generated content violate
| expectations of openness. Nevermind the fact that the
| actual information itself isn't even private or
| proprietary, just obscure and distributed.
|
| > Do you think that you have a claim on ownership of the
| data because you drove by, saw what you liked, and decided
| that now you'll just rip the baton out of my hand?
|
| I wouldn't claim ownership nor want to, when I scrape stuff
| I usually just want information in a different format. I'm
| confused as to how you think you can even "own" data to
| begin with. Suppose that your users uploaded songs instead
| of RF info, do you believe you own their music solely
| because they chose to share it on your site? Do you think
| your users would believe that?
| blantonl wrote:
| _I'm confused as to how you think you can even "own" data
| to begin with._
|
| It's actually very simple. If I'm in a _position_ to
| restrict access to the data, then I own it, unless there
| is some legal authority that has jurisdiction over me
| that says I _must_ make it available to the public.
| malwrar wrote:
| Operating a website doesn't automatically put you in that
| position, as evidenced by the fact that scraping does not
| require your consent to be possible. Ultimately there's
| little practical difference between someone's eyes
| viewing information and a program viewing that same
| information, a copy has been made in some form. Scraping
| a new site takes maybe a few hours of python to
| accomplish, the barrier is low.
| blantonl wrote:
| I don't think you understand. If I decide as the owner of
| a site, that I don't want you scraping my business and I
| block you, then I am in that position. I'm automatically
| in that position because I can implement the blocks
| necessary to uphold the terms of use of my business,
| or I can just do it for arbitrary reasons. Maybe you are
| hammering my server. Maybe I'm in a bad mood this morning
| and don't like that you're using Python.
|
| I can unilaterally decide whether or not you use my
| business, in any way shape or form, even if I just don't
| like you, as long as I don't violate any laws
| (discrimination etc).
| malwrar wrote:
| I absolutely understand, it's just not hard to make
| scraper traffic appear as (or be) legitimate browser
| traffic and/or simply distributed across numerous IPs.
| Other technical controls all have trivial circumvention
| methods. There is legal precedent (at least in the US)
| suggesting that scraping public information may be
| permissible under law (see HiQ Labs v. LinkedIn).
| Scrapers only ever need to succeed once.
|
| Under these circumstances, how can a website operator
| feel any sense of practical control over scrapers?
| icehawk wrote:
| Given that you haven't fixed your problem with scrapers
| (given the complaints you're making right in this
| thread), it's obvious you're not in a position to
| restrict the data-- otherwise you'd not be complaining
| about scrapers, and thus you don't own it.
| jMyles wrote:
| > Do you think that you have a claim on ownership of the
| data because you drove by, saw what you liked, and decided
| that now you'll just rip the baton out of my hand?
|
| Are you just trolling at this point?
|
| _You are handing the baton over_ in an HTTP response. If
| you don't want to do that, then change the logic of your
| server.
|
| Good grief man.
| camgunz wrote:
| I think your basic arguments are either:
|
| - scraping is immoral
|
| - we should bake DRM into the internet
|
| There's no technical or legal difference between a scraping
| request and any other web request, and I can't really believe
| that you think non-scraping web requests are immoral, so I
| think that probably isn't your argument.
|
| Moving onto DRM, I think most people don't want it baked
| into the internet. I think individual entities can choose
| to use it if they want--that's basically how you protect
| against scraping, so I think people irritated by having
| their content copied and thus devalued (or their ads
| replaced) should probably just do that.
| greenbandit wrote:
| I use web scraping to identify and monitor fraud.
|
| Exhibit A: https://archive.ph/0ZUA8
|
| This website is used to recruit people to set up "lead
| generation" Google Business Profiles and leave paid reviews.
|
| Exhibit B: https://archive.ph/WWZuw
|
| This is an example of the Craigslist ad used to initially
| attract people to the website above.
|
| Exhibit C: https://archive.ph/wip/7Xig4
|
| This is one of the Google Maps contributors which left paid
| reviews.
|
| If you start with the reviews on that profile, you'll find a
| network of Google Business Profiles for fake service-area
| businesses connected through paid reviews.
|
| Web scraping allows me to collect this type of data at scale.
|
| I also use scraping to monitor the status of fake listings. If
| they are removed, the actor behind them will often get them
| reinstated. This allows me to report them again.
| blantonl wrote:
| I don't care if you use Web scraping to solve the Israeli /
| Palestinian conflict. You're not _entitled_ to anyone's
| data, computers, services, etc because you've decided for
| altruistic reasons that it is appropriate.
|
| Cool use case. Love it. Fascinating stuff. But if Google told
| you to stop, would you? Or would you instead decide to build
| a 5 server cluster of 200 4G modems spread across continents
| to continue your work? Because if you did I would assume that
| you've decided to move on from a cute little altruistic
| process into a commercial use of someone else's data to make
| a profit.
| greenbandit wrote:
| > cute little altruistic process
|
| Maybe it is not the opinion which is unpopular, but the way
| it is being presented.
| ansc wrote:
| >I don't care if you use Web scraping to solve the Israeli
| / Palestinian conflict.
|
| Maybe you should though. It's always worth it to think
| about which giant's shoulder you're standing on. It's
| giants all the way down.
| dmkii wrote:
| I agree that there is a line at using someone else's data
| to make a profit, but it is kind of ironic that you mention
| Google, because their exact business model is scraping
| websites to feed their search results and litter it with
| ads to make a profit. For me there is a big line between
| aggregating publicly available data (search results,
| reviews, news, job postings, etc. ) and intentionally
| violating terms of service like signing up for fake
| accounts and harvesting user data. So entitled maybe not
| (sites can try to prevent you from scraping), but if you
| make something publicly available you shouldn't be
| surprised when people use it in ways you may not originally
| have intended (within legal boundaries of course).
| jumby wrote:
| Wait - so you are saying that information on the public
| internet isn't public? Man, I wish people would remember
| the origin of the web and the entire reason it exists. If
| you don't want information public, protect it - otherwise,
| I say it's fair game.
| blantonl wrote:
| Remember the OP article is about a system that is
| designed to completely and directly circumvent
| protections.
|
| If an organization puts a series of processes in place to
| prevent scrapers from wholesale taking data in violation
| of terms of service, and you develop a _5 server cluster
| of 200x 4G modems_ it's no longer "fair game" and you're
| directly being unethical in your use of someone else's
| services.
| Spivak wrote:
| Yeah, I think it's fair to say that in the presence of
| anti-bot measures (whether they work or not) that the
| content on the website isn't public anymore.
|
| Available to someone meeting certain criteria (student
| discount, senior discount) doesn't mean available to
| anyone. I see no reason that "not available to be
| consumed by autonomous agents" is somehow invalid in a
| way that unlimited refills is only available to humans
| and not robots.
| tengbretson wrote:
| Is it unethical for a mouse to eat the cheese without
| triggering the trap?
| hipadev23 wrote:
| I find it aptly hilarious that your own business model at
| broadcastify.com is recording publicly accessible radio
| broadcasts and then selling access to those recordings for
| commercial gain.
| blantonl wrote:
| Why is that hilarious? We developed an entire community,
| infrastructure, system, architecture, everything, from
| scratch, and provide access to something that _never existed
| in the first place_ on the Internet. That's a significant
| key difference here.
|
| This would be analogous to you thinking ancestory.com is
| "aptly hilarious" for arguing against someone just scraping
| their site for content.
|
| What makes you think you should be entitled to drive by the
| very unique house that we built, point right at that
| house, and say "I think I'll take all of that for
| myself!"
| hipadev23 wrote:
| Because you fail to see the very obvious parallels to
| scraping. I'm not criticizing your business (I think you
| provide a valuable service) but your hypocritical stance on
| what forms of publicly available information are allowed to
| be gathered and repackaged.
|
| Google's original (and OpenAI's) business model was also
| building a scraping infrastructure, system, and
| architecture, from scratch -- and providing access to
| something that never existed in the first place.
| blantonl wrote:
| It's completely perpendicular, not parallel.
|
| Public safety communications are radio waves that are
| _broadcasted_ and the ability to _passively monitor_ them
| is enshrined in United States law. That is a massively
| key difference.
|
| If I was sending data into your home from my
| infrastructure without any action from you whatsoever,
| and you were reaching up into the air and gathering it
| and repackaging it, AND the law said that I have no
| intellectual property rights to said data, then that's a
| whole different story.
| zarzavat wrote:
| Every time you use Google you benefit from scraping.
| Scraping is how the world has worked for the last 25+ years.
|
| You are trying to draw a distinction between data that is
| pushed and data that is pulled, and maybe there is some
| economic argument there in terms of resource usage, but
| that is very context-dependent.
|
| In UK listening to public radio broadcasts is illegal. I
| think this law is idiotic and ignore it. It seems you do
| too since there appear to be streams from UK on your site
| :)
| edgyquant wrote:
| You are scraping radio signals and selling it. It's an
| exact parallel and if you fail to see this it is indeed
| hilarious.
| throwaway48476 wrote:
| It is difficult to get a man to understand something,
| when his salary depends on his not understanding it.
| blantonl wrote:
| If you don't understand the difference between
| intercepting radio signals and Web scraping, I'd say your
| understanding of physics and technology is pretty
| hilarious.
|
| Look around in your house dude, there are radio signals
| present in your house right now as we speak - you just
| can't see them - the data literally exists right in your
| home without you even having to do anything. And the law
| grants you the unequivocal right in the United States to
| intercept those radio signals.
| papichulo2023 wrote:
| So your only point is that scraping data is bad because of
| the cost? How do you know that the site someone is scraping
| doesn't have fixed costs?
| rmbyrro wrote:
| Why is it ethical if you build upon other people's data,
| but unethical if others do it?
|
| Nobody cares how valuable you think your service is. Who's
| the judge of what's entitled to scrape or not? If you think
| you're the judge, I find it somewhat arrogant.
|
| It is even more hilarious that you defend a position that,
| to me, looks authoritarian and individualistic. Might not
| be your intention, but it's what I read.
| blantonl wrote:
| _Why is it ethical if you build upon other people's
| data, but unethical if others do it?_
|
| Because they GAVE IT TO ME, that's why.
|
| _Who's the judge of what's entitled to scrape or not?
| If you think you're the judge, I find it somewhat
| arrogant._
|
| You find it arrogant that I want to protect my business
| interests from people who solely want to just "take" from
| the hard work my team has put together. Would you be
| arrogant if you built a platform over 20+ years, and then
| scrapers just took the data for themselves?
|
| _...looks authoritarian and individualistic._
|
| These assertions are ridiculous. LOL. Hyperbole at its
| finest.
| dale_glass wrote:
| > Whenever I deal with a scraping process that decides it wants
| my entire business, and it wants all of it RIGHT NOW, or in 5
| minutes, I want to find the person and sit them down in a room
| and tell them "hey, develop your own ideas and business. Ok?
| Thanks"
|
| That's a lot of righteous anger for somebody building a
| business on top of other people's data.
|
| "Broadcastify is the worlds largest source of public safety,
| aircraft, rail, and marine radio live audio streams."
|
| I have no sympathy whatsoever. You're just complaining about
| the very thing you're doing. If it's fair for you to do that,
| it's fair for others to do it to you.
| blantonl wrote:
| They volunteer to provide the data to us. Every single last
| one of them. Nowhere in our business model did we make the
| conscious decision to say "hey, look at that business, they
| have something, and I'm going to take it."
| bsuvc wrote:
| Reading public website data is not "taking it". It is still
| there.
|
| Observing publicly available information is not theft, nor
| is it illegal.
|
| Of course copyright rules apply, but that is for if you
| reproduce something.
| mellosouls wrote:
| A curious title.
|
| "So you want to scrape like the unethical boys?" I guess doesn't
| scan so well. Bad boys maybe?
|
| I'm pretty sure Internet Archive, etc don't in fact misrepresent
| what they are to crawl websites...
| echelon wrote:
| > unethical
|
| Using and transforming information in useful ways is unethical
| if it results in a profit?
|
| That's what our brains do, too.
| llamaimperative wrote:
| No, destroying incentives to produce and share information is
| unethical (and more importantly, self-defeating).
|
| Brains that consume information don't destroy that incentive,
| they produce it.
|
| Intermediating that and capturing all of the value for
| yourself is the unethical part, just like all forms of rent-
| seeking.
| echelon wrote:
| > destroying incentives
|
| Internet usage and content creation are increasing, not
| decreasing.
|
| I continue to publish comments, code, and images that
| presumably get used to train models. My incentive hasn't
| been destroyed.
|
| > rent-seeking
|
| Supply and demand set the prices.
|
| Subscription services provide value and continue to invest
| in their product, catalog, and/or service. Property owners
| handle asset ownership and upkeep problems at scale.
|
| Inefficiencies will be met with competition, and businesses
| not providing value will be out-competed.
|
| Data under-availability is an inefficiency holding us back
| from bigger and better things.
| vasco wrote:
| Tell that to the most used website in the world, which is
| basically a scraping-and-sorting machine.
| blantonl wrote:
| I can commit a code change in 2 seconds that would directly
| tell the most used website in the world to stop scraping and
| sorting my data, and they would honor it and that would be
| the end of that.
|
| I'm under no illusions that they would or would not honor
| that in the future, but that's the state today.
| CoastalCoder wrote:
| > "So you want to scrape like the unethical boys?"
|
| What's considered ethical is a very debated topic.
|
| An assertion that something is simply "unethical" should be
| seen as the starting point of a discussion, not as a self-
| evident fact.
| marginalia_nu wrote:
| If someone tells you to go away via the robots exclusion
| standard, and puts up bot mitigation to prevent you, blocks
| your IPs, etc. then clearly you do not have their consent to
| help yourself to the data.
|
| I find it really hard to see how you could twist ignoring
| this clear lack of consent, and going to great lengths to
| circumvent what was clearly put into place to prevent you
| from doing the very thing you are doing, how you could twist
| that into an ethical action.
|
| It may or may not be technically illegal to do, but
| that is not a statement about what is ethical.
| lambdaba wrote:
| Surely the ethics are more complicated than just following
| robots.txt or not. The intended usage counts, and that
| isn't captured in robots.txt.
| marginalia_nu wrote:
| If you have a noble intent, you ask the webmaster for
| permission to use the data. Surely if they agree with
| your assessment that your intent is indeed noble, then
| you'll be given consent.
|
| I run a search engine and an internet crawler. I do this
| all the time. To this date I've never had a webmaster
| that didn't permit my crawler access when I've asked
| nicely.
| bryanrasmussen wrote:
| If you have a noble intent - identify members of fascist
| organizations - then obviously when you ask the top
| online fascist sites if you may scrape them to build up
| your list of online fascists - they will say no.
|
| OK, less provocative: you have a new algorithm to identify
| inaccessible websites, your automation is scary good,
| crawling a site you can identify many issues that most
| sites would have to pay for a full audit to get, but now
| these sites have problems - if you can identify their
| sites as being inaccessible then they have to fix these
| problems due to various accessibility standards that
| apply in the regions they operate in. But if they don't
| allow you access then they can maybe make an argument
| they are accessible due to an audit they did last year, at
| any rate they don't want to be forced to spend money on
| accessibility issues right now which it sounds like they
| might have to if they let you crawl their site.
|
| Version 2 of above, some years ago I spoke about a job
| with a big time magazine publisher in Denmark and said
| one of the things that would make me a good employee is
| my knowledge of accessibility and their chief of
| development said they didn't have anyone with
| disabilities that used their site - so if I ask that guy
| to crawl their site why say yes? They have no users that
| would benefit!! Stop abusing our bandwidth bleeding heart
| guy.
| marginalia_nu wrote:
| All of these seem like variations of the-ends-justify-
| the-means, which generally tends to cut both ways in
| unanticipated ways.
|
| Bullying websites into accessibility compliance will most
| likely lead to them following the letter of the standard
| without giving a second of thought as to whether the
| content is in fact actually accessible. It's very
| difficult to get someone on board with your cause if your
| initial contact is an antagonistic one.
| greenbandit wrote:
| This might work in cases where those with the data are
| engaged in noble acts, but not every actor is.
|
| I scrape and process websites of actors engaged in fraud.
| I do this to make the data more presentable to the proper
| authorities and to help uncover further evidence of their
| activities.
|
| I suspect that asking for consent would be quickly denied
| and the data/evidence would quickly become inaccessible.
| duggan wrote:
| Ok, you're building a service that scrapes e.g., property
| rental websites to find entries that are trying to scam
| naive renters.
|
| The property websites are incompetent to solve the problem,
| or don't care, but either way they sure don't want you
| scraping their valuable data.
|
| Is it still unethical?
| marginalia_nu wrote:
| That just makes both of you wrong.
| blantonl wrote:
| Agreed. It's kind of like when a non-profit organization
| argues that they are entitled to someone's data because
| "we're not making a profit off of it." That's ridiculous.
|
| Try asking a startup for free software licenses or seats
| or whatever as a non-profit. "We're entitled to 40 seats
| of your SAAS solution because we're a non-profit working
| to solve world peace." It's definitely within the
| startup's purview to respond with a no.
| gwittel wrote:
| I'm really mixed on this. Anti bot stuff is increasingly a pain
| point for security research. Working in this space, I have to
| work against these systems.
|
| Threat actors use Cloudflare and other services to gate their
| payloads. That's a problem for our customers who are trying to
| find/detect things like brand impersonation and credential phish.
| Cloudflare has been completely unhelpful. They just don't care.
| heipei wrote:
| Seconding this. Evading detection has become a real cake-walk
| since threat actors are able to sign up for a free Cloudflare
| account and then put their phishing site on their 2-hour-old
| domain behind a level of protection backed by a $20B company.
| Funny that you almost never see phishing on Akamai ;)
|
| Disclaimer: We operate in this space so we obviously have an
| interest in being able to detect these threats going forward.
| rashkov wrote:
| Why not Akamai?
| nkozyra wrote:
| Cost.
| throwaway48476 wrote:
| Cloudflare is the ultimate example of creating the problem
| and selling the solution.
| zinglersen wrote:
| I was under the (naive?) impression that Cloudflare was a SaaS
| startup poster child. Do you mind expanding on your
| comment?
| throwaway48476 wrote:
| Among other things, cloudflare hosts DoS services while
| selling DoS protection.
| anyfactor wrote:
| I was a professional web scraper. I still keep up to date with
| the industry.
|
| These days, you do not make money by doing web scraping; you make
| money selling services to web scrapers. There are tons of web
| scraping SAAS and services out there, as well as dozens of
| residential proxy providers.
|
| Most anti-bot mechanisms evolve so quickly that you can make a
| decent income just by working in a traditional software
| engineering role dedicated entirely to engineering anti-anti-bot
| solutions. As these mechanisms evolve rapidly, working for a web
| scraping company is more stable than pursuing web scraping as a
| profession.
|
| Web scrapers get paid by projects, making it an unstable job in
| the long run. High-level web scraping requires operational
| investments in residential proxies and renting out servers.
| Additionally, low-end jobs pay very little. Brightdata is hosting a
| conference on web scraping, which should indicate the
| profitability of selling services in large-scale web scraping.
| RockRobotRock wrote:
| I've been writing scrapers on Upwork for many years. I'm sick
| of doing project based work and want to work at/start a
| scraping SaaS. Any advice?
| jimz wrote:
| The irony is that before I realized it was so easy I would just
| open source the code - not on Github, mind you, since the likes
| of Akamai would DMCA pretty quickly, but playing a little bit
| of jurisdictional arbitrage I put it on Gitee - the Chinese
| copycat of Github. I don't have a background in any of this,
| but companies like to brag and it's not hard to put two and
| two together. It also was a practical way to enable me to place
| wagers on sports automatically - which was more or less my
| actual day job - and was pretty good for learning programming
| quickly in your late 20s.
|
| Instead almost immediately I got inundated by sneaker botters
| in China and in English from somewhere that doesn't use it as a
| native language, judging from the idiosyncratic use. I kept the
| code up for a bit but took it down not because of any legal
| threats (good luck with DMCA-ing a platform endorsed by the
| CCP, even though I have no love for the party, I also find the
| American attitude that places intellectual property over real
| property in practice - from my experience as a defense attorney
| - to be just as screwed up in terms of priorities, just a
| matter of degrees). What made me take it down was the fact that
| I did not want to work in a customer service job or really for
| anyone, and judging by the requests, it mostly consisted of
| "you do the work but we'll split the profits", which I can't
| believe anyone would fall for.
|
| But since the internet is forever, some parts of code that
| specifically worked to emulate Cyberfed-Akamai from 0.8 to 2.3
| are probably still floating around. My bad. I don't wear shoes
| normally - flip flops or nothing after having to wear a suit to
| work for a decade - and have no idea beyond what happens in
| NBA2K. Although cybersecurity firms making products that
| someone who learned how to program in their mid 20s could
| bypass, with working code online within 3 years, should be
| pretty ashamed of how much they charge, considering that I
| haven't even taken
| a math course since 11th grade and had too much of an ADHD
| problem to watch videos or even read more than blog posts or
| documentation. Everything I learned, I learned by copying from
| Github and similar services until it worked. There must be a
| lot of snake oil being sold out there, maybe most of it, since
| the insidiousness of the whole thing is that selling bunk
| solutions seldom gets you in trouble anyway, while actual crime
| - rape, murder, robbery and the like - are largely lagging
| because the police simply prefer to complain about culture war
| bs instead of actually, you know, do their jobs. Who knew
| Judith Butler was THIS spot on.
| wanderlust123 wrote:
| How do you keep up with the industry?
| sublinear wrote:
| > Those companies employ ill-adjusted individuals that do nothing
| else than look for the most recent techniques to fingerprint
| browsers ... When normal people are out drinking beers in the pub
| on Friday night, these individuals invent increasingly bizarre
| ways to fingerprint browsers and detect bots ;)
|
| Why not both on a friday night?
| KieranMac wrote:
| I'm a lawyer that works in the web-scraping space, and I always
| chuckle when I read threads like this. Almost every company that
| we now consider a monopolist (or their affiliates) in the tech
| space used scraping as part of their process to build their
| business, and almost every one of those same monopolists now
| prohibits startups and competitors from scraping their data
| (which, invariably, is not actually "their" data in any sort of
| legally cognizable sense). And so perhaps the ethics of web
| scraping are not so straightforward. And neither are the legal
| issues associated with it.
|
| I wrote an article about that last fall that got some attention
| here.
|
| https://news.ycombinator.com/item?id=37264676
| jMyles wrote:
| > And so perhaps the ethics of web scraping are not so
| straightforward.
|
| It strikes me that the _ethics_ of web scraping are extremely
| straightforward and cognizable with a terse analysis:
|
| * You can respond however you like to my HTTP request, and I
| can parse your response however I like.
|
| Simple, traditional, common. This is the way that conversations
| have occurred since the dawn of human communication, no?
|
| > the legal issues associated with it.
|
| But aren't these, without exception, fabrics spun out of the
| cloth that shields established players with the threat of state
| violence? This is not particularly new, and seems to fit in the
| pathetic-and-predictable file.
|
| Moreover, the broader cheap attempt to cast this in
| "intellectual" property terms, and to attach that to protection
| of artists and creators, warrants a very particular eye-roll
| for its illogic.
| elicksaur wrote:
| If I say, "Hey, please don't text me anymore. I'm going to
| block this number," and you respond by buying 500 phones in
| five cities and text me nonstop, is that ethical?
| andai wrote:
| Not sure the metaphor works here. For example most sites
| let Google scrape them as much as it likes, but go out of
| their way to block other robots. By doing so they are
| effectively forcing the whole world to use (or support,
| since smaller search engines have to piggyback on the big
| ones with special status, and pay them) proprietary spyware.
|
| In your analogy, most websites block everyone except the
| biggest pervert known to man.
| eli wrote:
| Isn't that a choice the website owner should be able to
| make?
| jMyles wrote:
| Of course it's your choice to make.
|
| Is someone forcing you to respond to requests you'd
| prefer to ignore?
| paulryanrogers wrote:
| If crawlers are stealth DDoSing my site then I lose the
| ability to respond entirely.
| jMyles wrote:
| It's your job to separate the wheat from the chaff at the
| boundary of your network interface. In fact, personal
| boundaries of all sorts, from informational to emotional to
| physical to economic, are of paramount importance in the
| information age.
|
| Nobody (and certainly not the state) is going to erect your
| personal boundaries for you by ensuring justice in the face
| of spammy text messages (or, for that matter, hypnotic and
| manipulative social media). This is your job - maybe your
| most important job.
|
| Just as it's your job to protect your personal health and
| safety. Nobody (and certainly not the state) is going to do
| that for you.
|
| Is there something about the trajectory of evolution of the
| internet that suggests to you that this is incorrect?
|
| I observe continually (seemingly perpetually) increasing
| traffic, and continually (seemingly perpetually) increasing
| capacity for general purpose computing. I also observe
| enormous empathy and cyberpunk traditions in our
| communities, protecting each other. Do my eyes and ears
| deceive me?
| paulryanrogers wrote:
| Restraining orders are a thing for a reason. It's cheaper
| to harass someone out of business (intentionally or
| otherwise) than to compete on a level playing field.
|
| Being a good neighbor requires restraining oneself and
| making requests with consideration for the other party.
|
| Full disclosure: I worked for a price monitoring service
| that prided itself on crawling up to every 3 hours. Steps
| were always taken to mitigate the impact. Sometimes even
| asking hosts to allow-list the crawlers.
| graemep wrote:
| Anti bot stuff also seems to be a security threat and privacy
| threat: preventing users from accessing your site if using VMs,
| port scanning, various forms of fingerprinting
| Terr_ wrote:
| I prefer the approach of an algorithmic challenge that forces
| the "new visitor" to spend some CPU cycles.
|
| It's a clear process, doesn't involve privacy risks or strange
| sneaky games, and tends to fail in ways that a human can at
| least see and report, as opposed to mysterious outages.
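
A minimal sketch of the kind of CPU-cost challenge described above, assuming a hashcash-style scheme: the server hands out a random string, the visitor must find a nonce whose SHA-256 hash starts with a given number of zero bits, and the server checks the answer with a single hash. The function names and the difficulty value are hypothetical, chosen only for illustration.

```python
import hashlib
import itertools
import os


def make_challenge() -> str:
    """Server side: issue a random challenge string for a new visitor."""
    return os.urandom(16).hex()


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # bit_length() of a non-zero byte locates its highest set bit.
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the visitor's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce, about 2**difficulty hashes on average."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce


challenge = make_challenge()
nonce = solve(challenge, difficulty=12)  # low difficulty so the demo runs quickly
accepted = verify(challenge, nonce, difficulty=12)
```

The asymmetry is the point of the design: solving costs the visitor thousands of hashes, verifying costs the server one, and raising `difficulty` by one bit roughly doubles the visitor's work.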
| dang wrote:
| Discussed at the time:
|
| _Scrape like the big boys_ -
| https://news.ycombinator.com/item?id=29117022 - Nov 2021 (189
| comments)
___________________________________________________________________
(page generated 2024-04-27 23:00 UTC)