[HN Gopher] Web scraping is legal, US appeals court reaffirms
___________________________________________________________________
Web scraping is legal, US appeals court reaffirms
Author : spenvo
Score : 432 points
Date : 2022-04-18 19:37 UTC (3 hours ago)
(HTM) web link (techcrunch.com)
(TXT) w3m dump (techcrunch.com)
| bobajeff wrote:
| I had no idea this was even being discussed. I'm glad they are
| reasonable on this. Wish they had been as reasonable on breaking
| encryption/DRM schemes.
| amelius wrote:
| Where is the lobbying? Something seems wrong ...
| 8bitsrule wrote:
| Wondered how this works WRT copyright (since the article did not
| contain the word). Here's Kent State's (short) IPR advice
| [https://libguides.library.kent.edu/data-management/copyright] It
| says "Data are considered 'facts' under U.S. law. They are not
| copyrightable.... Creative arrangement, annotation, or selection
| of data can be protected by copyright."
| ricardo81 wrote:
| For what it's worth, Linkedin was incredibly easy to scrape back
| in the day, wrt profile/email correlation. I can't buy any
| aggressive stance they may have against 'scrapers'.
|
| 2 options.
|
| Their Linkedin ID's are base 12, and would redirect you if you
| simply wanted to enumerate them.
|
| You could also upload your 'contacts', 200-300 at a time and it'd
| leak profile IDs (Twitter and Facebook mitigated this ~5 years
| ago). I still have a @pornhub or some such "contact" that I can't
| delete from testing this.
| infiniteL0Op wrote:
| altdataseller wrote:
| Does the ruling make it illegal to block scrapers?
| ricardo81 wrote:
| Interesting as I've seen a few search engine start ups that seem
| to scrape other search engine results, depending on your
| definition of scraping. My definition would be a user agent that
| doesn't uniquely identify itself that isn't using an authorised
| API.
| cloudyporpoise wrote:
| These anti-scraping corporate activists need to get with the
| times and allow access to their data, legitimately, through an
| API. Third parties will scrape and sell the data regardless, so
| why not just cut them out and even charge for individuals to
| legitimately use the API. API keys could be tied to an individual
| and at least LinkedIn would know who was conducting what action.
|
| Make it easier to get the data through an API than having to
| scrape it.
| CWuestefeld wrote:
| While I have sympathy for what the scrapers are trying to do in
| many cases, it bothers me that this doesn't seem to address what
| happens when badly-behaved scrapers cause, in effect, a DOS on
| the site.
|
| For the family of sites I'm responsible for, bot traffic
| comprises a majority of traffic - that is, to a first
| approximation, the lion's share of our operational costs comes
| from needing to scale to handle the huge amount of bot traffic.
| Even when it's not as big as a DoS, it doesn't seem right to me
| that I
| can't tell people they're not welcome to cause this additional
| system load.
|
| Or even if there were some standardized way that we could provide
| a dumb API, just giving them raw data, so we don't need to incur
| the additional expense of processing the creature comforts on the
| page that are designed to make our users happier but that the
| bots won't notice.
| hardtke wrote:
| The problem with many sites (and LinkedIn in particular) is
| that they whitelist a bunch of specific websites, presumably
| based on their business interests, but disallow everyone else in
| their robots.txt. You should either allow all scrapers that
| respect certain load requirements or allow none. Anything that
| Google is allowed to see and include in their search results
| should be fair game.
|
| Here's the end of LinkedIn's robots.txt:
|
| User-agent: *
| Disallow: /
|
| # Notice: If you would like to crawl LinkedIn,
| # please email whitelist-crawl@linkedin.com to apply
| # for white listing.
| diamondage wrote:
| And this is what the HiQ case hinged on. LinkedIn were
| essentially selectively applying the Computer Fraud and Abuse
| Act based on their business interests - that was never going
| to sit well with judges.
| car_analogy wrote:
| > Even when it's not as big as a DOS, it doesn't seem right to
| me that I can't tell people they're not welcome to cause this
| additional system load.
|
| You _can_ tell them. You just can't prosecute them if they
| don't obey.
| davidhyde wrote:
| I don't know what kind of data you serve up but perhaps you
| could serve low quality or inaccurate content from addresses
| that are guessed from your api. I.e. endpoints not normally
| reachable in the normal functioning of your web app should
| return reasonable junk. A mixture of accurate and inaccurate
| data becomes worthless for bots and worthless data is not worth
| scraping. Just an idea!
| ryan_j_naughton wrote:
| As others have said, (A) there are plenty of countermeasures you
| can take, but also (B) you are frustrated that you are providing
| something free to the public and then annoyed that the "wrong"
| customers are using your product and costing you money. I'm
| sorry, but this is a failure of your business model.
|
| If we were to analogize this to a non-internet example: (1) A
| company throws a free concert/event and believes they will make
| money by alcohol sales. (2) A bunch of sober/non-drinking folks
| attend the concert but only drink water (3) Company blames the
| concert attendees for "taking advantage" of them when they
| really just had poor company policies and a bad business model.
|
| Put things behind authentication and authorization. Add a
| paywall. Implement DDoS detection and banning approaches for
| scrapers. Etc etc.
|
| But don't make something public and then get mad at THE PUBLIC
| for using it. Behind that machine is a person, who happens to
| be a member of the public.
| noisenotsignal wrote:
| There are certain classes of websites where the proposed
| solutions aren't a great fit. For example, a shopping site
| hiding their catalog behind paywalls or authentication would
| raise barriers to entry such that a lot of genuine customers
| would be lost. I don't think the business model is in general
| to be blamed here and it's ok to acknowledge the unfortunate
| overhead and costs added by site usage patterns (e.g.
| scraping) that are counter to the expectation.
| Nextgrid wrote:
| But don't you already have countermeasures to deter DoS attacks
| or malicious _human_ users (what if someone pays or convinces
| people to open your site and press F5 repeatedly)?
|
| If not, you should, and the badly-behaved scrapers are actually
| a good wake-up call.
| colinmhayes wrote:
| I'm sympathetic to this. I built a search engine for my senior
| project and my half-baked scraper ended up taking down Duke
| Law's site during their registration period. Ended up getting a
| not-so-kindly-worded email from them, but honestly this wasn't
| an especially hard problem to solve. All of my traffic was
| coming from the cluster that was on my university's subnet; it
| wouldn't have been that hard for them to apply IP address
| timeouts when my crawler started scraping thousands of pages a
| second on their site. Not to victim blame, this was totally my
| fault, but
| I was a bit surprised that they hadn't experienced this before
| with how much automated scraping goes on.
| brightball wrote:
| I'm honestly more interested in bot detection than anything
| else at this point.
|
| It seems like it should be perfectly legal to detect and then
| hold the connection open for a long period of time without
| giving a useful response. Or even send highly compressed gzip
| responses designed to fill their drives.
|
| Legal or not, I can't see any good reason that we can't make
| it painful.
| fjabre wrote:
| Make it painful if they abuse the site.
|
| We all benefit from open data. Polite scrapers are just
| fine and a natural part of the web ecosystem.
|
| Google has been scraping the web all day every day for
| decades now.
| rmbyrro wrote:
| I have sympathy for your operational issues and costs, but
| isn't this kind of complaint the same as a shopping mall/center
| complaining of people who go in, check some info and go out
| without buying?
|
| I understand that bots have leverage and automation, but so
| do you, to reach a larger audience. Should we continue to
| benefit from one side of the leverage, while complaining about
| the other side?
| wlesieutre wrote:
| It's more like a mall complaining that while they're trying
| to serve 1000 customers, someone has gone and dumped 10000000
| roombas throughout the stores which are going around scanning
| all the price tags.
| kortilla wrote:
| No, because those are people going to the mall. Not robots
| 100x the quantity of real people.
| CWuestefeld wrote:
| No. When I say that bots exceed the amount of real traffic,
| I'm including people "window shopping" on the good side.
|
| My complaint is more like, somebody wants to know the prices
| of all our products, and that we have roughly X products
| (where X is a very large number). They get X friends to all
| go into the store almost simultaneously, each writing down
| the price of the particular product they've been assigned to
| research. When they do this, there's scant space left in the
| store for even the browsing kind of customers to walk in. (of
| course I exaggerate a bit, but that's the idea)
| mrobins wrote:
| I'm sympathetic to the complaints about "rude" scraping
| behavior but there's an easy solution. Rather than make
| people consume boatloads of resources they don't want
| (individual page views, images, scripts, etc.) just build
| good interoperability tools that give the people what they
| want. In the physical example above that would be a product
| catalog that's easily replicated with a CSV product listing
| or an API.
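|
| A minimal sketch of what that could look like (purely
| illustrative; the endpoint, fields, and Flask framing are
| assumptions, not anything the site actually exposes):
|
|     # Hypothetical Flask endpoint exposing the catalog as one CSV
|     # download so scrapers don't have to crawl individual pages.
|     import csv
|     import io
|     from flask import Flask, Response
|
|     app = Flask(__name__)
|
|     # Stand-in for a real catalog lookup; the data here is made up.
|     PRODUCTS = [
|         {"sku": "A-100", "name": "Widget", "list_price": "9.99"},
|         {"sku": "B-200", "name": "Gadget", "list_price": "24.50"},
|     ]
|
|     @app.route("/catalog.csv")
|     def catalog_csv():
|         buf = io.StringIO()
|         writer = csv.DictWriter(buf, fieldnames=["sku", "name", "list_price"])
|         writer.writeheader()
|         writer.writerows(PRODUCTS)
|         return Response(buf.getvalue(), mimetype="text/csv")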
| jensensbutton wrote:
| You don't know why any random scraper is scraping you, and
| thus you don't know what API to build that will keep them
| from scraping. Also, it's likely easier for them to
| continue scraping than to write a bunch of code to
| integrate with your API, so there's no incentive for
| them to do so either.
| mcronce wrote:
| Writing a scraper for a webpage is typically far more
| development effort than writing an API wrapper
| withinboredom wrote:
| Just advertise the API in the headers. Or better yet, set
| the buttons/links to only be accessible via a .usetheapi-
| dammit selector. Lastly, provide an API and a
| "developers.whatever.com" domain to report issues with
| the API, get API keys, and pay for more requests. It
| should be pretty easy to set up, especially if there's an
| internal API available behind the frontend already. I'd
| venture a dev team could devote 20% for a few sprints and
| have an MVP thing up and running.
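|
| For the header part, one way this could look (a sketch only;
| the URL mirrors the made-up "developers.whatever.com" domain
| above, and the hook name is arbitrary):
|
|     # Hypothetical Flask hook that advertises an API in a Link
|     # response header so well-behaved scrapers can discover it.
|     from flask import Flask
|
|     app = Flask(__name__)
|
|     @app.after_request
|     def advertise_api(response):
|         # rel="service-doc" points clients at service documentation
|         response.headers["Link"] = (
|             '<https://developers.whatever.com/api>; rel="service-doc"'
|         )
|         return response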
| mrobins wrote:
| I think lots of website owners know exactly where the
| value in their content exists. Whether or not they want
| to share that in a convenient way, especially to
| competitors etc is another story.
|
| That said if scraping is inevitable, it's immensely
| wasteful effort to both the scraper and the content owner
| that's often avoidable.
| altdataseller wrote:
| For the 2nd part, I have done scraping and would always
| opt for an API if the price is reasonable over paying
| nosebleed amounts for residential proxies
| CWuestefeld wrote:
| Yes, exactly. Nobody is standing up and saying "we're the
| ones doing this, and here's what we wish you'd put in an
| API".
|
| Also, I'm a big Jenson Button fan.
| [deleted]
| bastardoperator wrote:
| Have you considered using a cache service like cloudflare?
| KennyBlanken wrote:
| > While I have sympathy for what the scrapers are trying to do
| in many cases, it bothers me that this doesn't seem to address
| what happens when badly-behaved scrapers cause, in effect, a
| DOS on the site.
|
| Like when Aaron Swartz spent months hammering JSTOR causing it
| to become so slow it was almost unusable, and despite knowing
| that he was causing widespread problems (including the eventual
| banning of MIT's entire IP range) actually worked to add
| additional laptops and improve his scraping speed...all the
| while going out of his way to subvert MIT's netops group trying
| to figure out where he was on the network.
|
| JSTOR, by the way, is a non-profit that provides aggregate
| access to their cataloged archive of journals, for schools and
| libraries to access journals they would otherwise never be able
| to afford. In many cases, free access.
| linuxdude314 wrote:
| If most of your traffic is bots, is the site even worth
| running?
|
| This really is akin to the question, "Should others be allowed
| to take my photo or try to talk to me in public?"
|
| Of course the answer should be yes, the internet is the digital
| equivalent of a public space. If you make it accessible, anyone
| should be able to consume.
|
| If you don't want it scraped add auth!
| [deleted]
| voxic11 wrote:
| The court just ruled that scraping on its own isn't a violation
| of the CFAA. Meaning it doesn't count as the crime of
| "accessing a protected computer without authorization or
| exceeding authorized access and obtaining information".
|
| However presumably all the other provisions of the CFAA still
| apply, so if your scraping damages the functioning of an
| internet service then you still would have committed the crime
| of "Damaging a protected computer by intentional access".
| Negligently damaging a protected computer is punishable by 1
| year in prison on the first offence. Recklessly damaging a
| protected computer is punishable by 1-5 years on the first
| offense. And intentionally damaging a protected computer is
| punishable by 1-10 years for the first offense. These penalties
| can go up to 20 years for repeated offenses.
| gdulli wrote:
| When the original ruling in favor of HiQ came out, it still
| allowed for LinkedIn to block certain kinds of malicious
| scraping. LinkedIn had been specifically blocking HiQ, and was
| ordered to stop doing that.
| kstrauser wrote:
| I've told this story before, but it was fun, so I'm sharing it
| again:
|
| I'll skip the details, but a previous employer dealt with a
| large, then-new .mil website. Our customers would log into the
| site to check on the status of their invoices, and each page
| load would take approximately 1 minute. Seriously. It took
| about 10 minutes to log in and get to the list of invoices
| available to be checked, then another minute to look at one of
| them, then another minute to get out of it and back into the
| list, and so on.
|
| My job was to write a scraper for that website. It ran all
| night to fetch data into our DB, and then our website could
| show the same information to our customers in a matter of
| milliseconds (or all at once if they wanted one big aggregate
| report). Our customers _loved_ this. The .mil website's
| developer _hated_ it, and blamed all sorts of their tech
| problems on us, although:
|
| - While optimizing, I figured out how to skip lots of
| intermediate page loads and go directly to the invoices we
| wanted to see.
|
| - We ran our scraper at night so that it wouldn't interfere
| with their site during the day.
|
| - Because each of our customers had to check each one of their
| invoices every day if they wanted to get paid, and we were
| doing it more efficiently, our total load on their site was
| lower than the total load of our customers would be.
|
| Their site kept crashing, and we were their scapegoat. It was
| great fun when they blamed us in a public meeting, and we
| responded that we'd actually disabled our crawler for the past
| week, so the problem was still on their end.
|
| Eventually, they threatened to cut off all our access to the
| site. We helpfully pointed out that their brand new site wasn't
| ADA compliant, and we had vision-impaired customers who weren't
| able to use it. We offered to allow our customers to run the
| same reports from our website, for free, at no cost to the .mil
| agency, so that they wouldn't have to rebuild their website
| from the ground up. They saw it our way and begrudgingly
| allowed us to keep scraping.
| dwater wrote:
| I have worked with .mil customers who paid us to scrape and
| index their website because they didn't have a better way to
| access their official, public documents.
| oneoff786 wrote:
| Me too but for a private company
|
| In reality it was probably more like org sub group A wanted
| to leverage org sub group B's data but they didn't
| cooperate
| jll29 wrote:
| This is not .mil specific: I've been told of a case where
| an airline first legally attacked a flight search engine
| (Skyscanner) for scraping, and then told them to continue
| when they realized that their own search engine couldn't
| handle all the traffic, and even if it could, it was more
| expensive per query than routing via Skyscanner.
| brailsafe wrote:
| You probably can, at the protocol level, with JSON-LD or other
| rich-data packages that generate XML or standardized JSON
| endpoints. I did this for an open data portal, and this is
| something most G7 governments do with their federal open data
| portals using off-the-shelf packages (that are obviously worth
| researching a bit first), particularly in the Python and Flask
| world. We were still getting hammered by China at our
| Taiwanese-language subdomain, but that was a different concern.
| bequanna wrote:
| As someone that has been on the other end, I can tell you devs
| don't want to use selenium or inspect requests to reverse
| engineer your UI and _wish_ there were more clean APIs.
|
| Have you tried making your UI more challenging to scrape and
| adding a simple API that requires free registration?
| YPCrumble wrote:
| Reading your comment my impression is that this is either an
| exaggeration or a very unique type of site if bots make up the
| majority of traffic to the point that scrapers are anywhere
| near the primary load factor.
|
| Would someone let me know if I'm just plain wrong in this
| assumption? I've run many types of sites and scrapers have
| never been anywhere close to the main source of traffic or even
| particularly noticeable compared to regular users.
|
| Even considering a very commonly scraped site like LinkedIn or
| Craigslist - for any site of any magnitude like this public
| pages are going to be cached so additional scrapers are going
| to have negligible impact. And a rate limit is probably one
| line of config.
|
| I'm not saying you are necessarily wrong, but I can't imagine a
| scenario that you're describing and would love to hear of one.
| CWuestefeld wrote:
| It's a B2B ecommerce site. Our annual revenue from the site
| would put us on the list of top 100 ecommerce sites [1]
| (we're not listed because ecommerce isn't the only business
| we do). With that much potential revenue to steal from us,
| perhaps the stakes are higher.
|
| As described elsewhere, rate limiting doesn't work. The bots
| come from hundreds to thousands of separate IPs
| simultaneously, cooperating in a distributed fashion. Any one
| of them is within reasonable behavioral ranges.
|
| Also, caching, even through a CDN doesn't help. As a B2B
| site, all our pricing is custom as negotiated with each
| customer. (What's ironic is that this means that the pricing
| data that the bots are scraping isn't even representative -
| it only shows what we offer walkup, non-contract customers.)
| And because the pricing is dynamic, it also means that the
| scraping to get these prices is one of the more
| computationally expensive activities they could do.
|
| To be fair, there is some low-hanging fruit in blocking many
| of them. Like, it's easy to detect those that are flooding
| from a single address, or sending SQL injection attacks, or
| just plain coming from Russia. I assume those are just the
| script kiddies and stuff. The problem is that it still leaves
| a whole lot of bad actors once these are skimmed off the top.
|
| [1] https://en.wikipedia.org/wiki/List_of_largest_Internet_compa...
| mindslight wrote:
| > _As a B2B site, all our pricing is custom as negotiated
| with each customer ... the pricing is dynamic_
|
| So your company is deliberately trying to frustrate the
| market, and doesn't like the result of third parties
| attempting to help market efficiency? It seems like this is
| the exact kind of scraping that we generally want more of!
| I'm sorry about your personal technical predicament, but it
| doesn't sound like your perspective is really coming from
| the moral high ground here.
| YPCrumble wrote:
| Thanks for the explanation!
|
| The thing I still don't understand is why (edit server not
| cdn) caching doesn't work - you have to identify customers
| somehow, and provide everyone else a cached response at the
| server level. For that matter, rate limit non-customers
| also.
| CWuestefeld wrote:
| The pages getting most of the bot action are search and
| product details.
|
| Search results obviously can't be cached, as it's
| completely ad hoc.
|
| Product details can't be cached either, or more
| precisely, there are parts of each product page that
| can't be cached because
|
| * different customers have different products in the
| catalog
|
| * different customers have different prices for a given
| product
|
| * different products have customer-specific aliases
|
| * there's a huge number of products (low millions) and
| many thousands of distinct catalogs (many customers have
| effectively identical catalogs, and we've already got
| logic that collapses those in the backend)
|
| * prices are also based on costs from upstream suppliers,
| which are themselves changing dynamically.
|
| Putting all this together, the number of times a given
| [product,customer] tuple will be requested in a
| reasonable cache TTL isn't very much greater than 1. The
| exception being for walk-up pricing for non-contract
| users, and we've been talking about how we might optimize
| that particular case.
| YPCrumble wrote:
| Ahhhhh, search results makes a whole lot more sense!
| Thank you. Search can't be cached and the people who want
| to use your search functionality as a high availability
| API endpoint use different IP addresses to get around
| rate limiting.
|
| The low millions of products also makes some sense I
| suppose but it's hard to imagine why this doesn't simply
| take a login for the customer to see the products if
| they're unique to each customer.
|
| On the other hand, I suspect the price this company is
| paying to mitigate scrapers is akin to a drop of water in
| the ocean, no? As a percent of the development budget it
| might seem high and therefore seem big to the developer,
| but I suspect the CEO of the company doesn't even know
| that scrapers are scraping the site. Maybe I'm wrong.
|
| Thanks again for the multiple explanations in any case,
| it opened my eyes to a way scrapers could be problematic
| that I hadn't thought about.
| toast0 wrote:
| If you've got a site with a _lot_ of pages, bot traffic can
| get pretty big. Things like a shopping site with a large
| number of products, a travel site with pages for hotels and
| things to do, something to do with movies or tv shows and
| actors, basically anything with a large catalog will drive a
| lot of bot traffic.
|
| It's been forever since I worked at Yahoo Travel, but bot
| traffic was significant then, I'd guess roughly 5-10% of the
| traffic was declared bots, but Yandex and Baidu weren't
| aggressive crawlers yet, so I wouldn't be terribly surprised
| if a site with a large catalog that wasn't top 3 with humans
| would have a majority of traffic as bots. For the most part,
| we didn't have availability issues as a result of bot
| traffic, but every once in a while, a bot would really ramp
| up traffic and cause issues, and we would have to carefully
| design our list interfaces to avoid bots crawling through a
| lot of different views of the same list (while also trying to
| make sure they saw everything in the list). Humans may very
| well want to have all the narrowing options, but it's not
| really helpful to expose hotels near Las Vegas starting with
| the letter M that don't have pools to Google.
| YPCrumble wrote:
| I appreciate the response but I'm still perplexed. It's not
| about the percent of traffic if that traffic is cached. And
| rate limiting also prevents any problems. It just doesn't
| seem plausible that scrapers are going to DDoS a site per
| the original comment. I suppose you'd get bad traffic
| reports and other problems like log noise, but claiming it
| to be a general form of DDoS really does sound like
| hyperbole.
| breischl wrote:
| As another example, I used to work on a site that was roughly
| hotel stays. A regular person might search where to stay in a
| small set of areas, dates, and usually the same number of
| people.
|
| Bots would routinely try to scrape pricing for every
| combination of {property, arrival_date, departure_date,
| num_guests} in the next several years. The load to serve this
| would have been _vastly_ higher than real customers, but our
| frontend was mostly pretty good at filtering them out.
|
| We also served some legitimate partners that wanted basically
| the same thing via an API... and the load was in fact
| enormous. But at least then it was a real partner with some
| kind of business case that would ultimately benefit us, and
| we could make some attempt to be smart about what they asked
| for.
| VWWHFSfQ wrote:
| > a very unique type of site if bots make up the majority of
| traffic
|
| Pretty much Twitter and the majority of such websites.
| YPCrumble wrote:
| Do you really believe bots make up a significant amount of
| Twitter's operating cost? Like I said they're just
| accessing cached tweets and are rate limited. How can the
| bot usage possibly be more than a small part of twitter's
| operating cost?
| nojito wrote:
| Bandwidth isn't free.
| YPCrumble wrote:
| I didn't say it is free, I said that the bandwidth for bots
| is negligible compared to that of regular users.
| nojito wrote:
| Negligible isn't free either.
| svnpenn wrote:
| Implement TLS fingerprinting on your server. People can still
| fake that if they are determined, but it should cut the abuse
| way down.
| userbinator wrote:
| TLS fingerprinting is one of the ways minority browsers and
| OS setups get unfairly excluded. I have an intense hatred of
| Cloudflare for popularising that. Yes, there are ways around
| it, but I still don't think I should have to fight to use the
| user-agent I want.
| oh_sigh wrote:
| I don't want to say tough cookies, but if OP's
| characterization isn't hyperbole ("the lion's share of our
| operational costs are from needing to scale to handle the
| huge amount of bot traffic."), then it can be a situation
| where you have to choose between 1) cut off a huge chunk of
| bots, but upset a tiny percent of users, and improve the
| service for everyone else, or 2) simply not provide the
| service at all due to costs.
| nyuszika7h wrote:
| I don't think it's likely to cause issues if implemented
| properly. Realistically you can't really build a list of
| "good" TLS fingerprints because there are a lot of
| different browser/device combinations, so in my experience
| most sites usually just block "bad" ones known to belong to
| popular request libraries and such.
| CWuestefeld wrote:
| No, nor can we just do it by IP. The bots are MUCH more
| sophisticated than that. More often than not, it's a
| cooperating distributed net of hundreds of bots, coming from
| multiple AWS, Azure, and GCP addresses. So they can pop up
| anywhere, and that IP could wind up being a real customer
| next week. And they're only recognizable as a botnet with
| sophisticated logic looking at the gestalt of web logs.
|
| We do use a 3rd party service to help with this - but that on
| its own is imposing a 5- to 6-digit annual expense on our
| business.
| z3c0 wrote:
| Have you considered setting up an API to allow the bots to
| get what they want without hammering your front-end
| servers?
| CWuestefeld wrote:
| Yes. And if I could get the perpetrators to raise their
| hands so I could work out an API for them, it would be
| the path of least resistance. But they take great pains
| to be anonymous, although I know from circumstantial
| evidence that at least a good chunk of it is various
| competitors (or services acting on behalf of competitors)
| scraping price data.
|
| IANAL, but I also wonder if, given that I'd be designing
| something specifically for competitors to query our
| prices in order to adjust their own prices, this would
| constitute some form of illegal collusion.
| marginalia_nu wrote:
| What seems to actually work is to identify the bots and
| instead of giving up your hand by blocking them, to
| quietly poison the data. Critically, it needs to be
| subtle enough that it's not immediately obvious the data
| is manipulated. It should look like a plausible response,
| only with some random changes.
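|
| As a sketch of the idea (illustrative only; the jitter size
| and all names are assumptions): once a request has been
| classified as a bot, apply a small deterministic perturbation
| so the response still looks plausible but can't be trusted.
|
|     import hashlib
|     import random
|
|     def poisoned_price(true_price: float, sku: str, session_id: str) -> float:
|         # Seed from sku+session so the same bot keeps seeing consistent
|         # (but wrong) numbers, making the manipulation harder to spot.
|         seed = hashlib.sha256(f"{sku}:{session_id}".encode()).hexdigest()
|         jitter = random.Random(seed).uniform(-0.04, 0.04)  # +/- 4%
|         return round(true_price * (1 + jitter), 2)
|
|     def price_for(true_price: float, sku: str, session_id: str, is_bot: bool) -> float:
|         return poisoned_price(true_price, sku, session_id) if is_bot else true_price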
| kayodelycaon wrote:
| What makes you think they would use it?
| z3c0 wrote:
| It's in their interest. I've scraped a lot, and it's not
| easy to build a reliable process on. Why parse a human
| interface when there's an application interface
| available?
| thaumaturgy wrote:
| There's a lot of metadata available for IPs, and that
| metadata can be used to aggregate clusters of IPs, and that
| in turn can be datamined for trending activity, which can
| be used to sift out abusive activity from normal browsing.
|
| If you're dropping 6 figs annually on this and it's still
| frustrating, I'd be interested in talking with you. I built
| an abuse prediction system out of this approach for a small
| company a few years back, it worked well and it'd be cool
| to revisit the problem.
| borski wrote:
| You could ban their IPs?
| KMnO4 wrote:
| IP bans are equivalent to residential door locks. They're
| only deterring the most trivial attacks.
|
| In school I needed to scrape a few hundred thousand pages of
| a proteomics database website. For some reason you had to
| view each entry one at a time. There was IP throttling which
| banned you if you made requests too quickly. But slowing the
| script to 1 request per second would have taken days to
| scrape the site. So I paid <$5 for a list of 500 proxy
| servers and distributed it, completing the task in under half
| an hour.
| borski wrote:
| I agree it's not perfect. It's also significantly better
| than nothing.
| l33t2328 wrote:
| Using proxies to hide your identity to get around a denial
| of access seems to get awfully close to violating the
| Computer Fraud and Abuse Act (in the USA, at least).
|
| I'm surprised your school was okay with it.
| throw10920 wrote:
| Have you considered serving a proof-of-work challenge to
| clients accessing your website? Minimal cost on legit users,
| but large costs on large-scale web-scraping operations, and it
| doesn't matter if they split up their efforts across a bunch of
| IP addresses - they're still going to have to do those
| computations.
|
| https://en.wikipedia.org/wiki/Hashcash
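|
| Roughly, the server-side half could look like this (a sketch
| under assumptions: a random per-request challenge, SHA-256,
| and 20 leading zero bits of difficulty; a real deployment
| would tune all of these):
|
|     import hashlib
|     import secrets
|
|     DIFFICULTY_BITS = 20  # more bits = more client CPU per request
|
|     def issue_challenge() -> str:
|         # Sent to the client, which must find a matching solution
|         # (typically in JavaScript) before the real page is served.
|         return secrets.token_hex(16)
|
|     def leading_zero_bits(digest: bytes) -> int:
|         bits = 0
|         for byte in digest:
|             if byte == 0:
|                 bits += 8
|                 continue
|             bits += 8 - byte.bit_length()
|             break
|         return bits
|
|     def verify(challenge: str, solution: str) -> bool:
|         digest = hashlib.sha256(f"{challenge}:{solution}".encode()).digest()
|         return leading_zero_bits(digest) >= DIFFICULTY_BITS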
| nyuszika7h wrote:
| No thanks, as a user I would stay far away from such
| websites. This is akin to crypto miners. I don't need them to
| drive up my electricity costs and also contribute to global
| warming in the process. It's not worth the cost.
| userbinator wrote:
| That's what rate-limiting is for. Don't be so aggressive with
| it that you start hitting the faster visitors, however, or they
| may soon go somewhere else (has happened to me a few times).
| loceng wrote:
| Do you know if there's a way to rate limit logged-in users
| differently than visitors of a site?
| rolph wrote:
| rate limiting can be a double-edged sword; you can be
| better off giving a scraper the highest bandwidth so they are
| gone sooner. otherwise something like making a zip or other
| sort of compilation of the site available may be an option.
|
| just what kind of scraper you have is a concern.
|
| does scraper just want a bunch of stock images;
|
| or does scraper have FOMO on web trinkets;
|
| or does scraper want to mirror/impersonate your site.
|
| the last option is the most concerning because then;
|
| scraper is mirroring bcz your site is cool and local UI/UX
| is wanted;
|
| or is scraper phishing smishing or otherwise duping your
| users.
| loceng wrote:
| Yeah, good points to consider. I think the sites that
| would be scraped the most would be those where the data is
| regularly and reliably up-to-date, and there's a large
| volume of it at that - so not just one scraper but many
| different parties may try to scrape every page on a daily
| or weekly basis.
|
| I feel the ruling should have the caveat that if a fair-
| cost paid API exists for getting the publicly listed data,
| then the scrapers must legally use that (say, priced no
| more than 5% above the CPU/bandwidth/etc. cost of the
| scraping behaviour); ideally also a rule that, at minimum,
| there be a delay if they are republishing that data
| without your permission, so at least you as the
| platform/source/reason for the data being up-to-date
| aren't harmed too - otherwise the source platform may die
| off over time if regular visitors start going to the
| competitor publishing the data.
| patmorgan23 wrote:
| Absolutely you just have to check the session cookie
| minusf wrote:
| nginx can be set up to do that using the session cookie.
| CWuestefeld wrote:
| Rate limiting isn't an effective defense for us.
|
| First, as a B2B site, many of our users from a given customer
| (and with huge customers, that can be many) are coming
| through the same proxy server, effectively presenting to us
| as the same IP.
|
| Second, the bots years back became much more sophisticated
| than a single, or even relatively finite, IP. Today they work
| across AWS, Azure, GCP, and other cloud services. So the IPs
| that they're assigned today will be different tomorrow. Worse,
| the IPs that they're assigned today may well be used by a real
| customer tomorrow.
| gregsadetsky wrote:
| Have you tried including the recaptcha v3 library and
| looking at the distribution of scores? --
| https://developers.google.com/recaptcha/docs/v3 --
| "reCAPTCHA v3 returns a score for each request without user
| friction"
|
| It obviously depends on how motivated the scrapers are
| (i.e. whether their headless browsers are actually
| headless, and/or doing everything they can to not appear
| headless, whether Google has caught on to their latest
| tricks etc. etc.) but it would at least be interesting to
| look at the score distribution and then see whether you can
| cut off or slow down the < 0.3 scoring requests (or
| redirect them to your API docs)
| 9dev wrote:
| It sounds great, until you have Chinese customers. That's
| when you'll figure out Recaptcha just doesn't really work
| in China, and have to begrudgingly ditch it altogether...
| kevincox wrote:
| If your users are logged in you can rate limit by user
| instead of by IP. This mostly solves this problem.
| Generally what I do is for logged in users I rate limit by
| user, then for not-logged-in users I rate limit
| aggressively by IP. If they hit the limit the message lets
| them know that they can get around it by logging in. Of
| course this depends on user accounts having some sort of
| cost to create. I've never actually implemented it but
| considered having only users who have made at least one
| purchase bypass the IP limit or otherwise get a bigger rate
| limit.
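|
| A bare-bones sketch of that split (in-memory and single-
| process, so purely illustrative; the window and limits are
| made-up numbers):
|
|     import time
|     from collections import defaultdict, deque
|     from typing import Optional
|
|     WINDOW_SECONDS = 60
|     LIMITS = {"user": 600, "ip": 30}  # requests per window
|
|     _hits = defaultdict(deque)
|
|     def allow_request(ip: str, user_id: Optional[str]) -> bool:
|         # Key on the account when logged in, otherwise fall back to
|         # a much stricter per-IP bucket.
|         kind, key = ("user", user_id) if user_id else ("ip", ip)
|         now = time.monotonic()
|         bucket = _hits[(kind, key)]
|         while bucket and now - bucket[0] > WINDOW_SECONDS:
|             bucket.popleft()
|         if len(bucket) >= LIMITS[kind]:
|             return False  # 429; anonymous callers can log in for a higher limit
|         bucket.append(now)
|         return True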
| forgotmypw17 wrote:
| Yes, I think working to accommodate the non-humans along with
| the humans is the right approach here.
|
| Scrapers have a limited range of IPs, so rate-limiting them and
| stalling (or dropping) request responses is one way to deal
| with the DoS scenario.
|
| For my sites, I have placed the majority behind HTTP Basic
| Auth...
| KptMarchewa wrote:
| You realistically can't. There are services like [0][1] that
| mean any IP could be a scraper node.
|
| [0] https://brightdata.com/proxy-types/residential-proxies
| [1] https://oxylabs.io/products/residential-proxy-pool
| orlp wrote:
| > How does Bright Data acquire its residential IPs?
|
| > Bright Data has built a unique consumer IP model by which
| all involved parties are fairly compensated for their
| voluntary participation. App owners install a unique
| Software Development Kit (SDK) to their applications and
| receive monthly remuneration based on the number of users
| who opt-in. App users can voluntarily opt-in and are
| compensated through an ad-free user experience or enjoy an
| upgraded version of the app they are using for free. These
| consumers or 'peers' serve as the basis of our network and
| can opt-out at any time. This model has brought into
| existence an unrivaled, first of its kind, ethically sound,
| and compliant network of real consumers.
|
| I don't know how they can say with a straight face that
| this is 'ethically sound'. They have, essentially, created
| a botnet, but apparently because it's "AdTech" and the user
| "opts-in" (read: they click on random buttons until they
| hit one that makes the banner/ad go away) it's suddenly not
| malware.
| TedDoesntTalk wrote:
| NordVPN (Tesonet) has another business doing the same
| thing. They sell the IP addresses/bandwidth of their
| NordVPN customers to anyone who needs bulk mobile or
| residential IP addresses. That's right, installing their
| VPN software adds your IP address to a pool that NordVPN
| then resells. Xfinity/Comcast sort of pioneered this with
| their wifi routers that automatically expose an isolated
| wifi network called 'xfinity' (IIRC) whether you agree or
| not.
| rascul wrote:
| > They sell the IP addresses/bandwidth of their NordVPN
| customers to anyone who needs bulk mobile or residential
| IP addresses
|
| I would be interested in a reference for this if you have
| one.
| duskwuff wrote:
| The Comcast access points do, at least, have the saving
| grace that they're on a separate network segment from the
| customer's hardware, and don't share an IP address or
| bandwidth/traffic limit with the customer.
|
| Tesonet and other similar services (e.g. Luminati) don't
| have that. As far as anyone -- including web services,
| the ISP, or law enforcement -- are concerned, their
| traffic is the subscriber's traffic.
| lazyjeff wrote:
| Now I wonder whether "retrieving" your own OAuth token from an
| app to make REST calls that extract your own data from cloud
| services is legal. It seems to fall under the same guideline,
| that exceeding authorization is not unauthorized access, so even
| though it's usually against the terms of service it doesn't
| violate CFAA?
| MWil wrote:
| Next up, we just need all public court records to be freely
| available to BE scraped and not $3 per page
|
| https://patentlyo.com/patent/2020/08/court-pacer-should.html
| KennyBlanken wrote:
| Really the problem is that PACER has been turned into a cash
| cow for the federal court system, with fees and profits growing
| despite costs being virtually nil.
|
| But yeah, the irony of the federal court system legalizing
| screen scraping, something PACER contractually prohibits.
| jakelazaroff wrote:
| If I have to stake out a binary position here, I'm pro scraping.
| But I really wish we could find a way to be more nuanced here.
| The scraper in question is looking at public LinkedIn profiles so
| that it can snitch to employers about which employees might be
| looking for new jobs. That's not at all the same as archival;
| it's using my data to harm me.
| xboxnolifes wrote:
| It's a public page. Your employer could just as well check your
| page themselves. It may be a tragedy of efficiency, but it's not
| like the scraper is grabbing hidden data. The issue is in
| something else. Maybe it's the fact that your current employer
| would punish you for looking for a new job. Or maybe LinkedIn's
| public "looking for job" status is not sustainable in it's
| current form.
| notch656a wrote:
| Weev was charged, and eventually convicted based merely on
| scraping from AT&T [0]. When the charge was vacated, it was
| only on venue/jurisdiction, not on the basis of the scraping
| being legal. Seems there's precedent merely scraping this
| information is felonious behavior.
|
| https://www.eff.org/deeplinks/2013/07/weevs-case-flawed-begi...
| jakelazaroff wrote:
| Yes, and if I have a public Twitter account it's perfectly
| possible for someone to flood me with spam messages. That
| doesn't mean we should do nothing to prevent it. As I said
| elsewhere, we should strive to make it possible for people to
| exist in public digital spaces without worrying about bad
| actors.
| xboxnolifes wrote:
| Someone can manually spam you, and I don't think that
| should be allowed. That is a separate topic and discussion.
| Unless you are arguing that your employer should not be
| allowed to check your LinkedIn status.
| jakelazaroff wrote:
| I'm just using it as an example of a case in which a
| public profile doesn't automatically mean anything goes.
| I had hoped to generate discussion about how to throw out
| some of the bathwater without throwing out the baby too,
| but I guess no one is really interested.
| brians wrote:
| Yes, but it's specifically using data you published to harm
| you. Compare Blind, which is engineered to not be attributable
| in this way.
| jakelazaroff wrote:
| I understand that I published it. That doesn't mean I should
| accept that hostile parties will use it against me.
|
| This is kinda like telling someone who is being harassed on
| social media that they're consenting to it by having a public
| account. We should strive to make our digital personae safe
| from bad actors, not throw our hands up and say "if you put
| yourself out there, you have no recourse".
| TedDoesntTalk wrote:
| This is great news. A win for the Internet Archive and other
| archivists.
| ghaff wrote:
| IANAL but it's not immediately obvious to me that this ruling
| covers bulk scraping and _republishing_ untransformed. I'm
| genuinely curious about this personally. I presumably can't
| just grab anything I feel like off the web, curate it, and sell
| it.
| 1vuio0pswjnm7 wrote:
| "On LinkedIn, our members trust us with their information, which
| is why we prohibit unauthorized scraping on our platform."
|
| This is an unpersuasive argument because it ignores all the
| computer users who are not "members". Whether or not "members"
| trust LinkedIn should have no bearing on whether other computer
| users who may or may not be "members" can retrieve others' public
| information.
|
| Even more, this statement does not decry so-called scraping,
| only "unauthorised" scraping. Who provides "authorisation"?
| Surely not
| the LinkedIn members.
|
| It is presumptuous if not ridiculous for "tech" companies to
| claim computer users "trust" them. Most of these companies
| receive no feedback from the majority of their "members". Tech
| companies generally have no "customer service" for the members
| they target with data collection.
|
| Further, there is an absence of meaningful choice. It is like
| saying people "trust" credit bureaus with their information.
| History shows these data collection intermediaries could not be
| trusted and that is why Americans have the Fair Credit Reporting
| Act.
| MWil wrote:
| Great point about non-members/members
| ketzu wrote:
| I am not versed in law, especially not in US law, but this case
| seems to be very specific that scraping is no violation of the
| CFAA. I do support this interpretation.
|
| However, the case of scraping I personally find more problematic
| is the use of personal data I provide to one side, then used by
| scrapers without my knowledge or permission. I truly wonder which
| way we are better off on that issue as a society. Independent of
| the current law, should anything that is accessible be
| essentially free-for-all or should there be limitations on what
| you are allowed to do. Cases highlighted in the article: Facial
| recognition by third parties on social media profiles, facebook
| scraping for personal data, search engines, journalists or
| archives. (Not all need to have the same answer to the question
| "do we want this") Besides that, the point I care slightly less
| about is the idea that allowing scaping with very leisure limits
| leads to even more closed up systems.
| snarf21 wrote:
| serious question (ianal): If I write down some information, at
| what point does that information have copyright protection? Do
| I have to claim it with a (c) notice?
| henryfjordan wrote:
| Never. "Mere listings" of data (like the phone book) are not
| copyrightable.
|
| But also anything you write which is copyrightable is
| copyrighted immediately. You can register the work w/ the
| copyright office for some extra perks but it's not strictly
| necessary.
| butlerm wrote:
| You might want to check out Feist v. Rural Telephone Company
| (1991), and also look up the Berne Convention (on copyright),
| which the U.S. joined in 1989.
|
| If by "information" you mean mere facts without creativity in
| selection or arrangement, those are generally not protectable
| by copyright in the United States, although possibly in some
| other countries. Copyright generally protects works of
| authorship, and nothing else. No creativity no copyright.
| henryfjordan wrote:
| "I gave my data to Linkedin and now scrapers are reading it
| from the public web". Be mad at Linkedin before you are mad at
| the scraper.
| ViViDboarder wrote:
| I may edit and delete my information from LinkedIn, but I
| have no idea who has persisted that data beyond there.
|
| There is such a thing as scraping responsibly and
| irresponsibly. Both kinds happen.
| ketzu wrote:
| This seems to have various angles to it.
|
| First, one question is whether the intent of the original owner
| of the data is important. When I put data on linkedin (or
| facebook, my private website, hackernews or my employer's
| website) I
| might have an opinion on who gets to do what with my data
| (see also GDPR discussions). Should I blame linkedin (or
| meta/myself/my employer) to do what I expected them to do, or
| should I blame those that do what I don't want them to do?
| Should I just be blamed directly because I even want to make
| a distinction between those? If I didn't want my data used, I
| could just not provide it (or not participate in/surf the web at
| all if
| we extend the idea to more general data collection).
|
| Secondly, it touches on the idea that linkedin should not
| make the data publicly available (i.e., without
| authentication), and we end up with a less open system. Is
| that better? Is it what we want? Maybe there are also other
| ways that I am not aware just now. (Competing purely on value
| added is probably futile for data aggregators.)
| henryfjordan wrote:
| Your intent as the original owner of the data is important!
| You have to explicitly give Linkedin the right to display
| your data. It's in their Terms of Service. If Linkedin does
| something with your data that is outside the ToS, then that
| is on them, but if they do something within the ToS that
| you don't like then maybe you should not have provided them
| with your data.
|
| As for whether the data should be public, that's a decision
| we each have to make.
| ct0 wrote:
| Consider that scrapers may be far less interested in you as the
| individual than they are regarding your input into the
| aggregated data points of you and those like you.
| altdataseller wrote:
| The scraper in this case, HiQ, had a product that
| tracked employee profile changes to predict employee churn.
|
| So they were specifically interested in you personally, not
| the aggregate.
| xboxnolifes wrote:
| That's the same argument for all major internet tracking
| cookies. I don't think that's going to convince this site's
| userbase.
| nomoreusernames wrote:
| so i dont have the right to not be scraped? thats like sending
| radiowaves and making me pay a license for a radio i dont use.
| same with spam. i should give you a token to send me emails i
| maybe want to look at your stuff.
| notch656a wrote:
| An interesting departure, considering weev was convicted on
| merely scraping [0] AT&T. Although his charge was vacated, it was
| on the venue/jurisdiction, not that scraping was found to be
| legal.
|
| [0] https://www.eff.org/deeplinks/2013/07/weevs-case-flawed-begi...
| whatever1 wrote:
| Is it legal though to log in to a website and then scrape data
| (that would not be accessible if I was just browsing as a guest)?
___________________________________________________________________
(page generated 2022-04-18 23:00 UTC)