[HN Gopher] So you want to scrape like the big boys (2021)
       ___________________________________________________________________
        
       So you want to scrape like the big boys (2021)
        
       Author : aragonite
       Score  : 165 points
       Date   : 2024-04-27 02:56 UTC (20 hours ago)
        
 (HTM) web link (incolumitas.com)
 (TXT) w3m dump (incolumitas.com)
        
       | blantonl wrote:
       | This tends to be a _very unpopular opinion_ around here, but in
       | almost all cases I find Internet scraping to be unethical and
       | downright malicious. I'm not saying all cases, but I'm saying
       | almost.
       | 
       | A lot of the actors involved tend to be hustle culture types who
       | think they are OWED your data, regardless of the ethics, laws,
       | being a good citizen, whatever. They will blatantly disregard
       | terms of service and hide behind massive setups such as these to
       | circumvent protection etc.
       | 
       | And the problem is, if you run any sort of business or service
       | that is data oriented, there will be thousands of people that
       | will do this, which will cause you to devote enormous amounts of
       | time, effort, money, and infrastructure just to mitigate the
       | issues involved with data scraping. That's before you are even
       | addressing whether or not these people are "stealing" your data.
       | People who feel they are entitled to the crux of your business
       | aren't bothered by being nice in the way they take it - they'll
       | launch services that will cripple infrastructure.
       | 
       | Whenever I deal with a scraping process that decides it wants my
       | entire business, and it wants all of it RIGHT NOW, or in 5
       | minutes, I want to find the person and sit them down in a room
       | and tell them "hey, develop your own ideas and business. Ok?
       | Thanks"
       | 
       | And if you think this was a problem before, it's exponentially
       | worse over the past few months with every Tom, Susan, and Harry
       | deciding they must have all your data to train their new LLM AI
       | model. By the thousands.
        
         | vouaobrasil wrote:
         | I absolutely agree. In fact, I think the problem is that like
         | everything, there is an optimal point for efficiency, and
         | crossing that line by making things "too easy" when it comes to
         | data means too much power for one person to handle ethically.
         | Absolute power may corrupt absolutely, but near-absolute power
         | also corrupts quite nicely, too.
         | 
         | In short, we should have limits on the amount of scraping
         | possible, simply because humans can never be trusted past a
         | certain point to remain ethical. After all, ethics at its first
         | approximation is only a mechanism to improve societal
         | cohesiveness, and it only works as long as the person doesn't
         | have enough power to "do away" with society.
        
           | jumby wrote:
           | Would you make the same argument of the inverse: data
           | gathering?
        
         | flir wrote:
         | There's a lot of local history locked up in facebook's
         | nostalgia groups. I want to archive it in an open format.
         | 
         | I want to grab new rental listings and put them in an RSS feed,
         | so I only look at each one once.
         | 
         | Those are my uses for data scraping right now. If that destroys
         | someone's business, I don't actually care. Maybe it's selfish,
         | but my right to re-format data for my own convenience outweighs
         | their right to make a profit.
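
The rental-listings use case above (see each listing only once, read
it as an RSS feed) can be sketched as follows. This is a minimal
illustration, not flir's actual setup: the function names, the
listing dictionaries, and the example.com URLs are all hypothetical,
and fetching the listings is left out entirely.

```python
import xml.etree.ElementTree as ET


def new_listings(listings, seen_urls):
    """Return listings whose URL is not in seen_urls, and record them as seen."""
    fresh = []
    for item in listings:
        if item["url"] not in seen_urls:
            seen_urls.add(item["url"])
            fresh.append(item)
    return fresh


def build_rss(title, link, items):
    """Serialize items (dicts with 'title' and 'url') as a minimal RSS 2.0 feed."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["url"]
        # The URL doubles as the guid so feed readers can also deduplicate.
        ET.SubElement(node, "guid").text = item["url"]
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    seen = set()  # in practice this would be persisted between runs
    batch = [
        {"title": "Flat A", "url": "https://example.com/a"},
        {"title": "Flat B", "url": "https://example.com/b"},
    ]
    print(build_rss("Rentals", "https://example.com", new_listings(batch, seen)))
```

Keeping the set of seen URLs on disk between runs is what makes each
listing show up only once; the feed itself carries no state.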
        
           | throwaway11460 wrote:
           | Not that I think you shouldn't do it or you're doing
           | something wrong, but describing it as a _right_ irks me the
           | wrong way. You don't have any right to expect someone else's
           | computers to work for you.
        
             | flir wrote:
             | I'm not sure how to phrase it except in terms of competing
             | rights, but I take your point.
             | 
             | At the point where I'm scraping, the data's on my computer
             | though.
        
               | solarkraft wrote:
               | You could call them _interests_.
               | 
               | It's often in a business's interest to format data in a
               | specific way to make money, for example interlacing it
               | with ads.
        
               | flir wrote:
               | Nice.
        
           | blantonl wrote:
           | _If that destroys someone's business, I don't actually care.
           | Maybe it's selfish, but my right to re-format data for my own
           | convenience outweighs their right to make a profit._
           | 
           | Exhibit A
        
             | flir wrote:
             | Yeah, it's as unsympathetic a framing of my position as I can
             | offer.
             | 
             | But it's basically the same question as adblockers: Can I
             | do what I want with the 1's and 0's on my own machine?
             | 
             | I'm not going to accept that I owe anyone a business model.
        
               | blantonl wrote:
               | I'm not going to disagree with your use case here.
               | 
               | But I'm going to assume that you have some level of
               | conscience and you don't really mean you could give 3
               | shits about someone else's hard work so you can have some
               | satisfaction at home. Because at face value that's
               | exactly what you said.
        
               | flir wrote:
               | No, I think that's fair. Unsympathetic framing, but not
               | inaccurate. It's that whole "information wants to be
               | free" thing.
        
         | brigadier132 wrote:
         | > hustle culture types
         | 
         | It seems like you have this imaginary strawman that you hate
         | and it seems like that's the foundation of why you dislike
         | this.
        
           | blantonl wrote:
           | No. The foundation of why I dislike it is simple. If I own
           | some data, then I get to dictate the terms of how that data
           | is used. Period.
           | 
           | "Hustle culture types" is simply a little anecdote about the
           | types that would look you in the eye and tell you they are
           | entitled to disregard what I said above. They'll usually wrap
           | it in some altruistic bs to justify as well.
        
             | throwaway11460 wrote:
             | Why do you put it on the open internet if you don't want
             | machines to find and read it?
             | 
             | ToS is nice but you can't expect that it applies - the user
             | (of the machine doing the scraping) might be a child which
             | makes the potential contract automatically void, for
             | example. Also, there are people under jurisdictions where
             | such things have no power, or that don't recognize your
             | rights to the data.
             | 
             | And the whole thing of putting data out publicly and then
             | just expecting machines to see the pile of data and go "oh
             | so where do I sign the ToS?" is weird...
             | 
             | Just put it behind a rate limited API key...
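
One way to read "a rate limited API key" is a per-key token bucket
on the server side. The sketch below is only an illustration under
that assumption: the class name, the parameters, and the injectable
clock are hypothetical and not taken from the thread or the article.

```python
import time


class TokenBucket:
    """Per-key token bucket: bursts up to `capacity`, refilled at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.buckets = {}   # api_key -> (tokens_left, time_of_last_update)

    def allow(self, api_key):
        """Return True and spend one token if the key still has budget, else False."""
        now = self.clock()
        tokens, last = self.buckets.get(api_key, (self.capacity, now))
        # Refill in proportion to elapsed time, never beyond capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[api_key] = (tokens - 1, now)
            return True
        self.buckets[api_key] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = TokenBucket(rate=1.0, capacity=2)
    for attempt in range(4):
        print(attempt, limiter.allow("demo-key"))
```

A server would call `allow()` once per request and answer with an
HTTP 429 (Too Many Requests) when it returns False.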
        
               | blantonl wrote:
               | What makes you think putting data on the Internet all the
               | sudden means I unilaterally surrender the rights to my
               | intellectual property?
               | 
               | If I choose to make my data available to some businesses
               | to make discovery of it easier, and I choose to decline
               | to allow others to unilaterally copy my data to develop a
               | different business, that's my right. And it is unethical
               | and unreasonable for any other person to assume that they
               | are entitled to the same rights I granted someone else.
               | 
               | If I own some data, I get to be the arbiter of the
               | who/what/when/where on the use of the data. Period.
        
               | throwaway11460 wrote:
               | Sure, you can do whatever you like. Cut the connection if
               | you don't like it. But I can do whatever I like too -
               | read the data that your machine sent me, for example. If
               | your machine sends my machine data it's IMHO reasonable
               | to expect that you don't care about me having it _unless
               | we agreed otherwise_. But in many countries ToS is not
               | considered a legal contract at all - just having it on
               | your site somewhere is not enough. Sometimes not even
               | having users check the ToS checkmark would form a valid
               | contract.
               | 
               | There are many kinds of data that can't be owned at all.
               | Actually it's the other way around - there is a very
               | small subset of data that can be owned. You can try to
               | cover it under some kind of a non-disclosure clause in a
               | contract, but again - a contract would have to exist.
        
               | blantonl wrote:
               | Look, you are trying to argue that you might want to take
               | some data from me and use it in a personal, non-
               | commercial sense. Cool.
               | 
               | The entire purpose of the OP article is to develop a
               | system to directly circumvent data access and protection
               | mechanisms for profit. Pure and simple.
               | 
               | Spare me the altruistic BS. No one is developing and
               | utilizing a cluster of freaking _distributed servers with
               | forty 4G modems_ to do anything other than steal data
               | from services that don't want their data stolen, so they
               | can _use it for profit_
               | 
               | You have to call a spade a spade here.
        
               | throwaway11460 wrote:
               | What I'm saying is - your machine is fully capable of
               | providing just the right amount of data to fulfill your
               | purposes. If you don't like people taking it all, don't
               | build a machine that gives it to them at 1 Gb/s. Stuff
               | about some ToS or rights or IP ownership is just noise.
        
               | xyzzyman wrote:
               | > What makes you think putting data on the Internet all
               | the sudden means I unilaterally surrender the rights to
               | my intellectual property?
               | 
               | Because intellectual property doesn't exist.
        
               | layer8 wrote:
               | Scraping doesn't imply IP violation.
        
               | Starman_Jones wrote:
               | As an analogy, imagine that a gardener builds a beautiful
               | flower garden, bisected by a cute stone path, which she
               | invites the public to view freely, save for a single
               | restriction; a sign reading "keep off the flower beds."
               | 
               | There is a well-understood social contract here. I should
               | not drive my car along the path, even if I don't crush the
               | flowers. I shouldn't walk on the flower beds, even if
               | that sign isn't legally enforceable. And if a runaway
               | lawnmower, RC car, or some other machine of mine does end
               | up in the garden, I am responsible, because it was my
               | machine.
               | 
               | With websites, there is even a TOS specifically for
               | scrapers - robots.txt. The fact that it is easy to bypass
               | or ignore is no excuse for actually bypassing or ignoring
               | it.
               | 
               | The anonymity of the Internet functions as a ring of
               | Gyges, where since people don't face consequences (even
               | social ones), they feel entitled to do as they will.
               | However, just because you can do something does not mean
               | you have a right to do something.
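
For scrapers that do want to honor robots.txt, the check is cheap.
A minimal sketch using Python's standard `urllib.robotparser`; the
robots.txt body, the "example-bot" agent name, and the example.com
URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before fetching: is this user agent allowed to request this URL?
print(parser.can_fetch("example-bot", "https://example.com/private/data"))
print(parser.can_fetch("example-bot", "https://example.com/listings"))

# The requested delay between requests, or None if the file sets none.
print(parser.crawl_delay("example-bot"))
```

As the thread notes, nothing enforces this: the parser only reports
what the site asked for, and following it is up to the scraper.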
        
               | throwaway11460 wrote:
               | Robots.txt is definitely not any kind of ToS - some
               | people (Google) said they will respect it. No reason to
               | expect people to even know about the concept -
               | practically nobody knows about it, not even most
               | developers.
               | 
               | And again - there are countries where any ToS without
               | explicit signature or other kind of legal agreement doesn't
               | apply at all.
               | 
               | Just like writing "by using the toilet you agree to
               | transfer your soul for infinity" on a piece of toilet
               | paper taped somewhere in the vicinity of a toilet gives
               | you nothing - even if it was a more reasonable contract,
               | nobody agreed to anything.
               | 
               | As for your other point, I think this is more like
               | standing next to a highway with a sign that reads "don't
               | drive cars here" and expecting people to stop and turn
               | around. They didn't even see your sign at their speed and
               | it's kinda unreasonable to expect they would be checking
               | for that kind of a sign on a highway. At least make it
               | properly - big, red, reflective (e.g. a Connection Reset,
               | or at least 403 Forbidden).
        
               | Starman_Jones wrote:
               | Yes, there is no legal enforcement mechanism behind
               | robots.txt. Nor do I particularly want there to be.
               | However, most people agree that reasonable requests made
               | regarding the use of someone's property should be
               | followed. The capability to do something without
               | consequences is not the same as the right to do
               | something.
               | 
               | Our gardener should not need to build a brick wall around
               | their public garden to keep your lawnmower out.
        
               | photochemsyn wrote:
               | I think this analogy would be improved if the sign said
               | "Please don't take any pictures." This is far more
               | restrictive than a sign saying "Please don't take any
               | seeds or cuttings." The latter is more understandable
               | because such activity damages the flower garden
               | (particularly if everyone starts taking seeds and
               | cuttings).
               | 
               | Now let's say a photographer visits the flower garden,
               | takes images, and sells them online as post cards? As
               | long as the photographer is not hindering other people
               | (flooding the site with repeat requests, in the analogy),
               | it doesn't seem to be a problem.
               | 
               | On the other hand, let's say we don't have a flower
               | garden, we have an art gallery or a street artist's
               | display - or the pages of a recently published book. Now
               | the issue is distributing copyrighted material without
               | paying the creator... but what if there's a broad social
               | consensus that copyright is out of control and should
               | have been radically shortened decades ago?
               | 
               | The vast majority of data being scraped is not
               | copyrightable creative work, however, so as long as
               | you're not obnoxiously hammering a site, scraping seems
               | perfectly ethical.
        
             | Dah00n wrote:
             | >If I own some data, then I get to dictate the terms of how
             | that data is used. Period.
             | 
             | What if you got that data from me/users and I/we claim the
             | same rights (like GDPR for example)? Will you still honour
             | ownership as above?
        
             | some1else wrote:
             | Serving HTML will get you scraped. Your terms don't
             | overrule fair use.
        
         | malwrar wrote:
         | If your business is just that you have a bundle of information
         | and expose it over an open website, I'm not really sure how
         | you're able to maintain a mentality that you are somehow
         | entitled to ownership of that information. You already put it
         | out there, it's now public, any illusion of exclusivity is now
         | gone because anyone could come along at any time and make a
         | copy without your knowledge. A moral position on this issue is
         | even more confusing to me. Do you think that you e.g. own the
         | knowledge on which radio frequencies are used where? Do you
         | think you have a moral claim on ownership of (presumably
         | unpaid) user-submitted information? I think the only legitimate
         | moral grievance you have is high traffic volumes from
         | inconsiderate scrapers.
        
           | blantonl wrote:
           | _Do you think you have a moral claim on ownership of
           | (presumably unpaid) user-submitted information?_
           | 
           | You damn right I do. I own, develop, and maintain the entire
           | system that enabled the body of works to exist in the first
           | place.
           | 
           | Do you think that you have a claim on ownership of the data
           | because you drove by, saw what you liked, and decided that
           | now you'll just rip the baton out of my hand?
        
             | malwrar wrote:
             | > You damn right I do. I own, develop, and maintain the
             | entire system that enabled the body of works to exist in
             | the first place.
             | 
             | I don't think that meets the bar. Running a website is
             | absolutely not equivalent to the collective effort people
             | put in to populate that website with the information that
             | actually gives the overall artifact its value. There is a
             | large history of outrage when similar information
             | repository websites with user-generated content violate
             | expectations of openness. Nevermind the fact that the
             | actual information itself isn't even private or
             | proprietary, just obscure and distributed.
             | 
             | > Do you think that you have a claim on ownership of the
             | data because you drove by, saw what you liked, and decided
             | that now you'll just rip the baton out of my hand?
             | 
             | I wouldn't claim ownership nor want to, when I scrape stuff
             | I usually just want information in a different format. I'm
             | confused as to how you think you can even "own" data to
             | begin with. Suppose that your users uploaded songs instead
             | of RF info, do you believe you own their music solely
             | because they chose to share it on your site? Do you think
             | your users would believe that?
        
               | blantonl wrote:
               | _I'm confused as to how you think you can even "own" data
               | to begin with._
               | 
               | It's actually very simple. If I'm in a _position_ to
               | restrict access to the data, then I own it, unless there
               | is some legal authority that has jurisdiction over me
               | that says I _must_ make it available to the public.
        
               | malwrar wrote:
               | Operating a website doesn't automatically put you in that
               | position, as evidenced by the fact that scraping does not
               | require your consent to be possible. Ultimately there's
               | little practical difference between someone's eyes
               | viewing information and a program viewing that same
               | information, a copy has been made in some form. Scraping
               | a new site takes maybe a few hours of python to
               | accomplish, the barrier is low.
        
               | blantonl wrote:
               | I don't think you understand. If I decide as the owner of
               | a site, that I don't want you scraping my business and I
               | block you, then I am in that position. I'm automatically
               | in that position because I can implement the blocks
               | necessary to uphold the terms of use of my business,
               | or I can just do it for arbitrary reasons. Maybe you are
               | hammering my server. Maybe I'm in a bad mood this morning
               | and don't like that you're using Python.
               | 
               | I can unilaterally decide whether or not you use my
               | business, in any way shape or form, even if I just don't
               | like you, as long as I don't violate any laws
               | (discrimination etc).
        
               | malwrar wrote:
               | I absolutely understand, it's just not hard to make
               | scraper traffic appear as (or be) legitimate browser
               | traffic and/or simply distributed across numerous IPs.
               | Other technical controls all have trivial circumvention
               | methods. There is legal precedent (at least in the US)
               | suggesting that scraping public information may be
               | permissible under law (see HiQ Labs v. LinkedIn).
               | Scrapers only ever need to succeed once.
               | 
               | Under these circumstances, how can a website operator
               | feel any sense of practical control over scrapers?
        
               | icehawk wrote:
               | Given that you haven't fixed your problem with scrapers
               | (given the complaints you're making right in this
               | thread), it's obvious you're not in a position to
               | restrict the data-- otherwise you'd not be complaining
               | about scrapers, and thus you don't own it.
        
             | jMyles wrote:
             | > Do you think that you have a claim on ownership of the
             | data because you drove by, saw what you liked, and decided
             | that now you'll just rip the baton out of my hand?
             | 
             | Are you just trolling at this point?
             | 
             | _You are handing the baton over_ in an HTTP response. If
             | you don't want to do that, then change the logic of your
             | server.
             | 
             | Good grief man.
        
             | camgunz wrote:
             | I think your basic arguments are either:
             | 
             | - scraping is immoral
             | 
             | - we should bake DRM into the internet
             | 
             | There's no technical or legal difference between a scraping
             | request and any other web request, and I can't really believe
             | that you think that non-scraping web requests are immoral, so
             | I think that probably isn't your argument.
             | 
             | Moving onto DRM, I think most people don't want it baked
             | into the internet. I think individual entities can choose
             | to use it if they want--that's basically how you protect
             | against scraping, so I think people irritated by having
             | their content copied and thus devalued (or their ads
             | replaced) should probably just do that.
        
         | greenbandit wrote:
         | I use web scraping to identify and monitor fraud.
         | 
         | Exhibit A: https://archive.ph/0ZUA8
         | 
         | This website is used to recruit people to set up "lead
         | generation" Google Business Profiles and leave paid reviews.
         | 
         | Exhibit B: https://archive.ph/WWZuw
         | 
         | This is an example of the Craigslist ad used to initially
         | attract people to the website above.
         | 
         | Exhibit C: https://archive.ph/wip/7Xig4
         | 
         | This is one of the Google Maps contributors which left paid
         | reviews.
         | 
         | If you start with the reviews on that profile, you'll find a
         | network of Google Business Profiles for fake service-area
         | businesses connected through paid reviews.
         | 
         | Web scraping allows me to collect this type of data at scale.
         | 
         | I also use scraping to monitor the status of fake listings. If
         | they are removed, the actor behind them will often get them
         | reinstated. This allows me to report them again.
        
           | blantonl wrote:
           | I don't care if you use Web scraping to solve the Israeli /
           | Palestinian conflict. You're not _entitled_ to anyone's
           | data, computers, services, etc because you've decided for
           | altruistic reasons that it is appropriate.
           | 
           | Cool use case. Love it. Fascinating stuff. But if Google told
           | you to stop, would you? Or would you instead decide to build
           | a 5 server cluster of 200 4G modems spread across continents
           | to continue your work? Because if you did I would assume that
           | you've decided to move on from a cute little altruistic
           | process into a commercial use of someone else's data to make
           | a profit.
        
             | greenbandit wrote:
             | > cute little altruistic process
             | 
             | Maybe it is not the opinion which is unpopular, but the way
             | it is being presented.
        
             | ansc wrote:
             | >I don't care if you use Web scraping to solve the Israeli
             | / Palestinian conflict.
             | 
             | Maybe you should though. It's always worth it to think
             | about which giant's shoulder you're standing on. It's
             | giants all the way down.
        
             | dmkii wrote:
             | I agree that there is a line at using someone else's data
             | to make a profit, but it is kind of ironic that you mention
             | Google, because their exact business model is scraping
             | websites to feed their search results and litter it with
             | ads to make a profit. For me there is a big line between
             | aggregating publicly available data (search results,
             | reviews, news, job postings, etc.) and intentionally
             | violating terms of service like signing up for fake
             | accounts and harvesting user data. So entitled maybe not
             | (sites can try to prevent you from scraping), but if you
             | make something publicly available you shouldn't be
             | surprised when people use it in ways you may not originally
             | have intended (within legal boundaries of course).
        
             | jumby wrote:
             | Wait - so you are saying that information on the public
             | internet isn't public? Man, I wish people would remember
             | the origin of the web and the entire reason it exists. If
             | you don't want information public, protect it - otherwise,
             | I say it's fair game.
        
               | blantonl wrote:
               | Remember the OP article is about a system that is
               | designed to completely and directly circumvent
               | protections.
               | 
               | If an organization puts a series of processes in place to
               | prevent scrapers from wholesale taking data in violation
               | of terms of service, and you develop a _5 server cluster
               | of 200x 4G modems_ it's no longer "fair game" and you're
               | directly being unethical in your use of someone else's
               | services.
        
               | Spivak wrote:
               | Yeah, I think it's fair to say that in the presence of
               | anti-bot measures (whether they work or not) that the
               | content on the website isn't public anymore.
               | 
               | Available to someone meeting certain criteria (student
               | discount, senior discount) doesn't mean available to
               | anyone. I see no reason that "not available to be
               | consumed by autonomous agents" is somehow invalid in a
               | way that unlimited refills is only available to humans
               | and not robots.
        
         | tengbretson wrote:
         | Is it unethical for a mouse to eat the cheese without
         | triggering the trap?
        
         | hipadev23 wrote:
         | I find it aptly hilarious that your own business model at
         | broadcastify.com is recording publicly accessible radio
         | broadcasts and then selling access to those recordings for
         | commercial gain.
        
           | blantonl wrote:
           | Why is that hilarious? We developed an entire community,
           | infrastructure, system, architecture, everything, from
           | scratch, and provide access to something that _never existed
           | in the first place_ on the Internet. That's a significant
           | key difference here.
           | 
           | This would be analogous to you thinking ancestry.com is
           | "aptly hilarious" for arguing against someone just scraping
           | their site for content.
           | 
           | What makes you think you should be entitled to drive by the
           | very unique house that we built, point right at that house,
           | and say "I think I'll take all of that for myself!"?
        
             | hipadev23 wrote:
             | Because you fail to see the very obvious parallels to
             | scraping. I'm not criticizing your business (I think you
             | provide a valuable service) but your hypocritical stance on
             | what forms of publicly available information are allowed to
             | be gathered and repackaged.
             | 
             | Google's original (and OpenAI's) business model was also
             | building a scraping infrastructure, system, and
             | architecture, from scratch -- and providing access to
             | something that never existed in the first place.
        
               | blantonl wrote:
               | It's completely perpendicular, not parallel.
               | 
               | Public safety communications are radio waves that are
               | _broadcasted_ and the ability to _passively monitor_ them
               | is enshrined in United States law. That is a massively
               | key difference.
               | 
               | If I was sending data into your home from my
               | infrastructure without any action from you whatsoever,
               | and you were reaching up into the air and gathering it
               | and repackaging it, AND the law said that I have no
               | intellectual property rights to said data, then that's a
               | whole different story.
        
               | zarzavat wrote:
               | Every time you use Google you benefit from scraping.
               | Scraping is how the world has worked for the last 25+ years.
               | 
               | You are trying to draw a distinction between data that is
               | pushed and data that is pulled, and maybe there is some
               | economic argument there in terms of resource usage, but
               | that is very context-dependent.
               | 
               | In the UK listening to public radio broadcasts is illegal. I
               | think this law is idiotic and ignore it. It seems you do
               | too since there appear to be streams from the UK on your
               | site :)
        
               | edgyquant wrote:
               | You are scraping radio signals and selling them. It's an
               | exact parallel and if you fail to see this it is indeed
               | hilarious.
        
               | throwaway48476 wrote:
               | It is difficult to get a man to understand something,
               | when his salary depends on his not understanding it.
        
               | blantonl wrote:
               | If you don't understand the difference between
               | intercepting radio signals and Web scraping, I'd say your
               | understanding of physics and technology is pretty
               | hilarious.
               | 
               | Look around in your house dude, there are radio signals
               | present in your house right now as we speak - you just
               | can't see them - the data literally exists right in your
               | home without you even having to do anything. And the law
               | grants you the unequivocal right in the United States to
               | intercept those radio signals.
        
               | papichulo2023 wrote:
               | So your only point is that scraping data is bad because of
               | the cost? How do you know that the site someone is scraping
               | doesn't have a fixed cost?
        
             | rmbyrro wrote:
             | Why is it ethical if you build upon other people's data,
             | but unethical if others do it?
             | 
             | Nobody cares how valuable you think your service is. Who's
             | the judge of what's entitled to scrape or not? If you think
             | you're the judge, I find it somewhat arrogant.
             | 
             | It is even more hilarious that you defend a position that,
             | to me, looks authoritarian and individualistic. Might not
             | be your intention, but it's what I read.
        
               | blantonl wrote:
               | _Why is it ethical if you build upon other people's
               | data, but unethical if others do it?_
               | 
               | Because they GAVE IT TO ME, that's why.
               | 
               |  _Who's the judge of what's entitled to scrape or not?
               | If you think you're the judge, I find it somewhat
               | arrogant._
               | 
               | You find it arrogant that I want to protect my business
               | interests from people who solely want to just "take" from
               | the hard work my team has put together. Would you be
               | arrogant if you built a platform over 20+ years, and then
               | scrapers just took the data for themselves?
               | 
               |  _...looks authoritarian and individualistic._
               | 
               | These assertions are ridiculous. LOL. Hyperbole at its
               | finest.
        
         | dale_glass wrote:
         | > Whenever I deal with a scraping process that decides it wants
         | my entire business, and it wants all of it RIGHT NOW, or in 5
         | minutes, I want to find the person and sit them down in a room
         | and tell them "hey, develop your own ideas and business. Ok?
         | Thanks"
         | 
         | That's a lot of righteous anger for somebody building a
         | business on top of other people's data.
         | 
         | "Broadcastify is the worlds largest source of public safety,
         | aircraft, rail, and marine radio live audio streams."
         | 
         | I have no sympathy whatsoever. You're just complaining about
         | the very thing you're doing. If it's fair for you to do that,
         | it's fair for others to do it to you.
        
           | blantonl wrote:
           | They volunteer to provide the data to us. Every single last
           | one of them. Nowhere in our business model did we make the
           | conscious decision to say "hey, look at that business, they
           | have something, and I'm going to take it."
        
             | bsuvc wrote:
             | Reading public website data is not "taking it". It is still
             | there.
             | 
             | Observing publicly available information is not theft, nor
             | is it illegal.
             | 
             | Of course copyright rules apply, but that is for if you
             | reproduce something.
        
       | mellosouls wrote:
       | A curious title.
       | 
       | "So you want to scrape like the unethical boys?" I guess doesn't
       | scan so well. Bad boys maybe?
       | 
       | I'm pretty sure Internet Archive, etc don't in fact misrepresent
       | what they are to crawl websites...
        
         | echelon wrote:
         | > unethical
         | 
         | Using and transforming information in useful ways is unethical
         | if it results in a profit?
         | 
         | That's what our brains do, too.
        
           | llamaimperative wrote:
           | No, destroying incentives to produce and share information is
           | unethical (and more importantly, self-defeating).
           | 
           | Brains that consume information don't destroy that incentive,
           | they produce it.
           | 
           | Intermediating that and capturing all of the value for
           | yourself is the unethical part, just like all forms of rent-
           | seeking.
        
             | echelon wrote:
             | > destroying incentives
             | 
             | Internet usage and content creation are increasing, not
             | decreasing.
             | 
             | I continue to publish comments, code, and images that
             | presumably get used to train models. My incentive hasn't
             | been destroyed.
             | 
             | > rent-seeking
             | 
             | Supply and demand set the prices.
             | 
             | Subscription services provide value and continue to invest
             | in their product, catalog, and/or service. Property owners
             | handle asset ownership and upkeep problems at scale.
             | 
             | Inefficiencies will be met with competition, and businesses
             | not providing value will be out-competed.
             | 
             | Data under-availability is an inefficiency holding us back
             | from bigger and better things.
        
         | vasco wrote:
         | Tell that to the most used website in the world, which is
         | basically a scraping-and-sorting machine.
        
           | blantonl wrote:
           | I can commit a code change in 2 seconds that would directly
           | tell the most used website in the world to stop scraping and
           | sorting my data, and they would honor it and that would be
           | the end of that.
           | 
           | I'm under no illusions that they would or would not honor
           | that in the future, but that's the state today.
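The "code change" described above is presumably a robots.txt rule. A minimal sketch of how a well-behaved crawler would read one, using Python's standard `urllib.robotparser`. The robots.txt contents, the `is_allowed` helper, and the example.com URL are hypothetical illustrations, not anything taken from the commenter's site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks Googlebot entirely, leaves other crawlers alone.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/feed"))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/feed"))
```

Note that this only expresses a request; honoring it is up to the crawler, which is the point being argued in this thread.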
        
         | CoastalCoder wrote:
         | > "So you want to scrape like the unethical boys?"
         | 
         | What's considered ethical is a very debated topic.
         | 
         | An assertion that something is simply "unethical" should be
         | seen as the starting point of a discussion, not as a self-
         | evident fact.
        
           | marginalia_nu wrote:
           | If someone tells you to go away via the robots exclusion
           | standard, and puts up bot mitigation to prevent you, blocks
           | your IPs, etc. then clearly you do not have their consent to
           | help yourself to the data.
           | 
           | I find it really hard to see how you could twist ignoring
           | this clear lack of consent, and going to great lengths to
           | circumvent what was clearly put into place to prevent you
           | from doing the very thing you are doing, into an ethical
           | action.
           | 
           | It may or may not be technically illegal to do, but that is
           | not a statement about what is ethical.
        
             | lambdaba wrote:
             | Surely the ethics are more complicated than just following
             | robots.txt or not. The intended usage counts, and that
             | isn't captured in robots.txt.
        
               | marginalia_nu wrote:
               | If you have a noble intent, you ask the webmaster for
               | permission to use the data. Surely if they agree with
               | your assessment that your intent is indeed noble, then
               | you'll be given consent.
               | 
               | I run a search engine and an internet crawler. I do this
               | all the time. To this date I've never had a webmaster
               | that didn't permit my crawler access when I've asked
               | nicely.
        
               | bryanrasmussen wrote:
               | If you have a noble intent - identify members of fascist
               | organizations - then obviously when you ask the top
               | online fascist sites if you may scrape them to build up
               | your list of online fascists - they will say no.
               | 
               | OK less provocative, you have a new algorithm to identify
               | inaccessible websites, your automation is scary good,
               | crawling a site you can identify many issues that most
               | sites would have to pay for a full audit to get, but now
               | these sites have problems - if you can identify their
               | sites as being inaccessible then they have to fix these
               | problems due to various accessibility standards that
               | apply in the regions they operate in. But if they don't
               | allow you access then they can maybe make an argument
               | they are accessible due to an audit they did last year, at
               | any rate they don't want to be forced to spend money on
               | accessibility issues right now which it sounds like they
               | might have to if they let you crawl their site.
               | 
               | Version 2 of above, some years ago I spoke about a job
               | with a big time magazine publisher in Denmark and said
               | one of the things that would make me a good employee is
               | my knowledge of accessibility and their chief of
               | development said they didn't have anyone with
               | disabilities that used their site - so if I ask that guy
               | to crawl their site why say yes? They have no users that
               | would benefit!! Stop abusing our bandwidth bleeding heart
               | guy.
        
               | marginalia_nu wrote:
               | All of these seem like variations of the-ends-justify-
               | the-means, which generally tends to cut both ways in
               | unanticipated ways.
               | 
               | Bullying websites into accessibility compliance will most
               | likely lead to them following the letter of the standard
               | without giving a second of thought as to whether the
               | content is in fact actually accessible. It's very
               | difficult to get someone on board with your cause if your
               | initial contact is an antagonistic one.
        
               | greenbandit wrote:
               | This might work in cases where those with the data are
               | engaged in noble acts, but not every actor is.
               | 
               | I scrape and process websites of actors engaged in fraud.
               | I do this to make the data more presentable to the proper
               | authorities and to help uncover further evidence of their
               | activities.
               | 
               | I suspect that asking for consent would be quickly denied
               | and the data/evidence would quickly become inaccessible.
        
             | duggan wrote:
             | Ok, you're building a service that scrapes e.g., property
             | rental websites to find entries that are trying to scam
             | naive renters.
             | 
             | The property websites are incompetent to solve the problem,
             | or don't care, but either way they sure don't want you
             | scraping their valuable data.
             | 
             | Is it still unethical?
        
               | marginalia_nu wrote:
               | That just makes both of you wrong.
        
               | blantonl wrote:
               | Agreed. It's kind of like when a non-profit organization
               | argues that they are entitled to someone's data because
               | "we're not making a profit off of it." That's ridiculous.
               | 
               | Try asking a startup for free software licenses or seats
               | or whatever as a non-profit. "We're entitled to 40 seats
               | of your SAAS solution because we're a non-profit working
               | to solve world peace." It's definitely within the
               | startup's purview to respond with a no.
        
       | gwittel wrote:
       | I'm really mixed on this. Anti bot stuff is increasingly a pain
       | point for security research. Working in this space, I have to
       | work against these systems.
       | 
       | Threat actors use Cloudflare and other services to gate their
       | payloads. That's a problem for our customers who are trying to
       | find/detect things like brand impersonation and credential phish.
       | Cloudflare has been completely unhelpful. They just don't care.
        
         | heipei wrote:
         | Seconding this. Evading detection has become a real cake-walk
         | since threat actors are able to sign up for a free Cloudflare
         | account and then put their phishing site on their 2-hour-old
         | domain behind a level of protection backed by a $20B company.
         | Funny that you almost never see phishing on Akamai ;)
         | 
         | Disclaimer: We operate in this space so we obviously have an
         | interest in being able to detect these threats going forward.
        
           | rashkov wrote:
           | Why not Akamai?
        
             | nkozyra wrote:
             | Cost.
        
           | throwaway48476 wrote:
           | Cloudflare is the ultimate example of creating the problem
           | and selling the solution.
        
             | zinglersen wrote:
             | I was under the (naive?) impression that Cloudflare was a
             | SaaS startup poster child. Do you mind expanding on your
             | comment?
        
               | throwaway48476 wrote:
               | Among other things, cloudflare hosts DoS services while
               | selling DoS protection.
        
       | anyfactor wrote:
       | I was a professional web scraper. I still keep up to date with
       | the industry.
       | 
       | These days, you do not make money by doing web scraping; you make
       | money selling services to web scrapers. There are tons of web
       | scraping SAAS and services out there, as well as dozens of
       | residential proxy providers.
       | 
       | Most anti-bot mechanisms evolve so quickly that you can make a
       | decent income just by working in a traditional software
       | engineering role dedicated entirely to engineering anti-anti-bot
       | solutions. As these mechanisms evolve rapidly, working for a web
       | scraping company is more stable than pursuing web scraping as a
       | profession.
       | 
       | Web scrapers get paid by projects, making it an unstable job in
       | the long run. High-level web scraping requires operational
       | investments in residential proxies and renting out servers.
       | Additionally, low-end jobs pay very little. Brightdata is
       | hosting a conference on web scraping, which should indicate
       | the profitability of selling services in large-scale web
       | scraping.
        
         | RockRobotRock wrote:
         | I've been writing scrapers on Upwork for many years. I'm sick
         | of doing project based work and want to work at/start a
         | scraping SaaS. Any advice?
        
         | jimz wrote:
         | The irony is that before I realized it was so easy I would just
         | open source the code - not on Github, mind you, since the likes
         | of Akamai would DMCA pretty quickly, but playing a little bit
         | of jurisdictional arbitrage I put it on Gitee - the Chinese
         | copycat of Github. I don't have a background in any of this,
         | but companies like to brag and it's not hard to put two and
         | two together. It also was a practical way to enable me to place
         | wagers on sports automatically - which was more or less my
         | actual day job - and was pretty good for learning programming
         | quickly in your late 20s.
         | 
         | Instead almost immediately I got inundated by sneaker botters
         | in China and in English from somewhere that doesn't use it as a
         | native language, judging from the idiosyncratic use. I kept the
         | code up for a bit but took it down not because of any legal
         | threats (good luck with DMCA-ing a platform endorsed by the
         | CCP, even though I have no love for the party, I also find the
         | American attitude that places intellectual property over real
         | property in practice - from my experience as a defense attorney
         | - to be just as screwed up in terms of priorities, just a
         | matter of degrees). What made me take it down was the fact that
         | I did not want to work in a customer service job or really for
         | anyone, and judging by the requests, they mostly consisted of
         | "you do the work but we'll split the profits", which I can't
         | believe anyone would fall for.
         | 
         | But since the internet is forever, some parts of code that
         | specifically worked to emulate Cyberfed-Akamai from 0.8 to 2.3
         | are probably still floating around. My bad. I don't wear shoes
         | normally - flip flops or nothing after having to wear a suit to
         | work for a decade - and have no idea beyond what happens in
         | NBA2K. Although cybersecurity firms making products that
         | someone who learned how to program in their mid 20s and put
         | online within 3 years and had it work should be pretty ashamed
         | of how much they charge, considering that I haven't even taken
         | a math course since 11th grade and had too much of an ADHD
         | problem to watch videos or even read more than blog posts or
         | documentation. Everything I learned, I learned by copying from
         | Github and similar services until it worked. There must be a
         | lot of snake oil being sold out there, maybe most of it, since
         | the insidiousness of the whole thing is that selling bunk
         | solutions seldom gets you in trouble anyway, while actual crime
         | - rape, murder, robbery and the like - are largely lagging
         | because the police simply prefer to complain about culture war
         | bs instead of actually, you know, do their jobs. Who knew
         | Judith Butler was THIS spot on.
        
         | wanderlust123 wrote:
         | How do you keep up with the industry?
        
       | sublinear wrote:
       | > Those companies employ ill-adjusted individuals that do nothing
       | else than look for the most recent techniques to fingerprint
       | browsers ... When normal people are out drinking beers in the pub
       | on Friday night, these individuals invent increasingly bizarre
       | ways to fingerprint browsers and detect bots ;)
       | 
       | Why not both on a Friday night?
        
       | KieranMac wrote:
       | I'm a lawyer that works in the web-scraping space, and I always
       | chuckle when I read threads like this. Almost every company that
       | we now consider a monopolist (or their affiliates) in the tech
       | space used scraping as part of their process to build their
       | business, and almost every one of those same monopolists now
       | prohibits startups and competitors from scraping their data
       | (which, invariably, is not actually "their" data in any sort of
       | legally cognizable sense). And so perhaps the ethics of web
       | scraping are not so straightforward. And neither are the legal
       | issues associated with it.
       | 
       | I wrote an article about that last fall that got some attention
       | here.
       | 
       | https://news.ycombinator.com/item?id=37264676
        
         | jMyles wrote:
         | > And so perhaps the ethics of web scraping are not so
         | straightforward.
         | 
         | It strikes me that the _ethics_ of web scraping are extremely
         | straightforward and cognizable with a terse analysis:
         | 
         | * You can respond however you like to my HTTP request, and I
         | can parse your response however I like.
         | 
         | Simple, traditional, common. This is the way that conversations
         | have occurred since the dawn of human communication, no?
         | 
         | > the legal issues associated with it.
         | 
         | But aren't these, without exception, fabrics spun out of the
         | cloth that shields established players with the threat of state
         | violence? This is not particularly new, and seems to fit in the
         | pathetic-and-predictable file.
         | 
         | Moreover, the broader cheap attempt to cast this in
         | "intellectual" property terms, and to attach that to protection
         | of artists and creators, warrants a very particular eye-roll
         | for its illogic.
        
           | elicksaur wrote:
           | If I say, "Hey, please don't text me anymore. I'm going to
           | block this number," and you respond by buying 500 phones in
           | five cities and text me nonstop, is that ethical?
        
             | andai wrote:
             | Not sure the metaphor works here. For example most sites
             | let Google scrape them as much as it likes, but go out of
             | their way to block other robots. By doing so they are
             | effectively forcing the whole world to use (or support,
             | since smaller search engines have to piggyback on the big
             | ones with special status, and pay them) proprietary spyware.
             | 
             | In your analogy, most websites block everyone except the
             | biggest pervert known to man.
        
               | eli wrote:
               | Isn't that a choice the website owner should be able to
               | make?
        
               | jMyles wrote:
               | Of course it's your choice to make.
               | 
               | Is someone forcing you to respond to requests you'd
               | prefer to ignore?
        
               | paulryanrogers wrote:
               | If crawlers are stealth DDoSing my site then I lose the
               | ability to respond entirely.
        
             | jMyles wrote:
             | It's your job to separate the wheat from the chaff at the
             | boundary of your network interface. In fact, personal
             | boundaries of all sorts, from informational to emotional to
             | physical to economic, are of paramount importance in the
             | information age.
             | 
             | Nobody (and certainly not the state) is going to erect your
             | personal boundaries for you by ensuring justice in the face
             | of spammy text messages (or, for that matter, hypnotic and
             | manipulative social media). This is your job - maybe your
             | most important job.
             | 
             | Just as it's your job to protect your personal health and
             | safety. Nobody (and certainly not the state) is going to do
             | that for you.
             | 
             | Is there something about the trajectory of evolution of the
             | internet that suggests to you that this is incorrect?
             | 
             | I observe continually (seemingly perpetually) increasing
             | traffic, and continually (seemingly perpetually) increasing
             | capacity for general purpose computing. I also observe
             | enormous empathy and cyberpunk traditions in our
             | communities, protecting each other. Do my eyes and ears
             | deceive me?
        
               | paulryanrogers wrote:
               | Restraining orders are a thing for a reason. It's cheaper
               | to harass someone out of business (intentionally or
               | otherwise) than to compete on a level playing field.
               | 
               | Being a good neighbor requires restraining oneself and
               | making requests with consideration for the other party.
               | 
               | Full disclosure: I worked for a price monitoring service
               | that prided itself on crawling up to every 3 hours. Steps
               | were always taken to mitigate the impact. Sometimes even
               | asking hosts to allow-list the crawlers.
        
           | theamk wrote:
           | [delayed]
        
       | graemep wrote:
       | Anti bot stuff also seems to be a security threat and privacy
       | threat: preventing users from accessing your site if using VMs,
       | port scanning, various forms of fingerprinting
        
         | Terr_ wrote:
         | I prefer the approach of an algorithmic challenge that forces
         | the "new visitor" to spend some CPU cycles.
         | 
         | It's a clear process, doesn't involve privacy risks or strange
         | sneaky games, and tends to fail in ways that a human can at
         | least see and report, as opposed to mysterious outages.
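One way the algorithmic challenge described above could work is a hashcash-style proof of work: the server hands out a random challenge, the visitor searches for a nonce whose hash has enough leading zero bits, and the server checks the answer with a single hash. This is only a sketch under assumptions; the function names, the `challenge:nonce` encoding, and the choice of SHA-256 are illustrative and not taken from any particular anti-bot product:

```python
import hashlib
import secrets


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def make_challenge() -> str:
    """Server side: issue a random, single-use challenge string."""
    return secrets.token_hex(16)


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash is enough to check the visitor's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce; cost doubles per difficulty bit."""
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce


if __name__ == "__main__":
    challenge = make_challenge()
    nonce = solve(challenge, difficulty=16)
    print(challenge, nonce, verify(challenge, nonce, 16))
```

The asymmetry is the design point: solving takes on the order of 2^difficulty hashes, verifying takes one, so the cost lands on whoever makes many requests.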
        
       | dang wrote:
       | Discussed at the time:
       | 
       |  _Scrape like the big boys_ -
       | https://news.ycombinator.com/item?id=29117022 - Nov 2021 (189
       | comments)
        
       ___________________________________________________________________
       (page generated 2024-04-27 23:00 UTC)