[HN Gopher] Scrape like the big boys
___________________________________________________________________
Scrape like the big boys
Author : incolumitas
Score : 286 points
Date : 2021-11-05 09:22 UTC (13 hours ago)
(HTM) web link (incolumitas.com)
(TXT) w3m dump (incolumitas.com)
| InvOfSmallC wrote:
| Where I worked we stopped caring about IPs, browsers, etc.
| because it was just an arms race. What we did instead was
| analyze click behaviour and act on that. When we recognized a
| bot we served it a fake page, which also cut costs a bit
| because the fake pages were static. In general it took them a
| long time to discover the pattern, and it was far more
| manageable for us.
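|
| A minimal sketch of that idea (Python/Flask; the thresholds,
| page contents and the heuristic itself are illustrative
| guesses, not their actual system):
|
|   import time
|   from statistics import pstdev
|   from flask import Flask, session
|
|   app = Flask(__name__)
|   app.secret_key = "change-me"
|
|   FAKE_PAGE = "<html><body>Totally normal page</body></html>"
|
|   def looks_like_bot(clicks):
|       # Humans click irregularly; bots tend to fire at
|       # near-constant intervals. Both thresholds are
|       # illustrative.
|       if len(clicks) < 5:
|           return False
|       gaps = [b - a for a, b in zip(clicks, clicks[1:])]
|       return pstdev(gaps) < 0.05 or min(gaps) < 0.2
|
|   @app.route("/product/<pid>")
|   def product(pid):
|       clicks = session.get("clicks", [])[-19:] + [time.time()]
|       session["clicks"] = clicks
|       if looks_like_bot(clicks):
|           return FAKE_PAGE  # cheap static response
|       return f"<h1>Real page for {pid}</h1>"  # expensive path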
| devops000 wrote:
| Could you share your code for AWS Lambda and Puppeteer? It's
| definitely interesting for other websites.
| incolumitas wrote:
| Sure.
|
| https://github.com/NikolaiT/Crawling-Infrastructure
|
| And here I am writing about it (but it's quite old):
| https://incolumitas.com/2019/08/31/web-scraping-puppeteer-aw...
| joekrill wrote:
| A little pet-peeve I have is when an obscure(ish) acronym is used
| and never defined. Is SERP a well-known acronym? Perhaps this is
| a niche blog and I'm not the intended audience.
| tptacek wrote:
| Yes; a SERP is a Google search result page. It's the most
| important acronym in SEO.
| nomdep wrote:
| I don't remember ever hearing it, and I've been in the
| industry for some time.
| Kiro wrote:
| You can't be serious.
| hollerith wrote:
| Huh. I've never been in the industry, but noticed "SERP" at
| least 15, maybe 20, years ago and have remembered it since.
|
| (If I were writing something to be published, though, I
| would write "search-engine results page" instead of
| "SERP".)
| weird-eye-issue wrote:
| You've been in the SEO industry for some time and never
| heard SERP?
| daveguy wrote:
| I had to look it up.
|
| SERP: Search Engine Results Page
| bgroat wrote:
| Not the OP, but I thought it was well known.
|
| That said, I do a lot of SEO work.
|
| Still, it should be best practice to define any acronym or
| initialism the first time you use it.
| fergie wrote:
| Unintroduced acronyms should always be avoided.
| nsotelo wrote:
| As English speakers we often take for granted acronyms such
| as DB or even USA. For non-native speakers these can be
| inscrutable.
| praptak wrote:
| Depends on the audience-acronym pair. I don't think HTTP
| needs an introduction in a technical article, OTOH (on the
| other hand ;) ) a general newspaper should probably expand
| HTTP but not WWW.
| joncp wrote:
| An all-too-common occurrence in HN comments as well.
| marginalia_nu wrote:
| The word SERP feels like a bit of a shibboleth for SEO people.
| They seem to take it for granted; the rest of the world just
| looks puzzled when they hear it.
| hall0ween wrote:
| Basic question, how does one profit from scraping data and what
| kinda data?
|
| Taking a stab at answering it: you scrape the data and build a
| business around selling it. Stock prices? But that's boring, plus
| how many others are doing it? I bet a lot.
| 323 wrote:
| These are people scraping artificially limited releases of
| clothes/shoes. You buy a shoe at $100 and immediately resell
| it at $1,000.
|
| Artificial scarcity - every week you release a "limited edition
| item", but if you do the math, it's not limited edition at all
| if you integrate over a year.
| throw1234651234 wrote:
| 1. Be a job site. 2. Have employees who cost money call
| facilities to get job listings. 3. Establish relationships
| with facilities to list jobs. 4. Buy job listings from 3rd
| parties. 5. List them for free, hoping to make a margin. 6. A
| scraper steals all the jobs, lags the site, and gets the value
| of your hard work for free.
| hall0ween wrote:
| ahh thanks
| IceWreck wrote:
| The author says proxies are expensive and then proceeds to spend a
| shitton of money buying all that hardware.
| incolumitas wrote:
| 4G proxies are just so much better than so-called
| "residential" or straight datacenter proxies. It makes sense to
| create your own 4G proxy farm if you conduct business in that
| area.
|
| With only 10 dongles and 10 data plans, you can have a lot of IP
| addresses that are extremely hard to block. It's a one-time
| investment, whereas paying proxy providers is an ongoing cost.
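|
| A rotation sketch (Python requests; the gateway addresses are
| made up - each dongle would expose an HTTP proxy on the LAN):
|
|   import itertools
|   import requests
|
|   # One entry per dongle; addresses are illustrative.
|   DONGLES = [f"http://192.168.8.{10 + i}:3128" for i in range(10)]
|   pool = itertools.cycle(DONGLES)
|
|   def fetch(url):
|       # Each request exits through the next dongle's mobile IP.
|       proxy = next(pool)
|       return requests.get(url,
|                           proxies={"http": proxy, "https": proxy},
|                           timeout=30)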
| bsder wrote:
| Where do you get 4G dongles that don't suck nowadays?
|
| We tried to get some, but all of the ones we could get were
| various levels of broken or unsupported.
| palijer wrote:
| That was not the author's main argument against proxies; that
| was just an additional point. You ignored the primary argument
| in your judgment.
|
| >>Because I could not fully trust the other customers with whom
| I shared the proxy bandwidth. What if I share proxy servers
| with criminals that do more malicious stuff than the somewhat
| innocent SERP scraping?
| RandomThrow321 wrote:
| Can they not call out a secondary point?
| ebbp wrote:
| Having spent a week battling a particularly inconsiderate
| scraping attempt, I'm quite unsurprised by the juvenile tone and
| fairly glib approach to the ethics of bots/scraping presented by
| the piece.
|
| For the site I work for, about 20-30% of our monthly hosting
| costs go towards servicing bot/scraping traffic. We've generally
| priced this into the cost of doing business, as we've prioritised
| making our site as freely accessible as possible.
|
| But after this week, where some amateur did real damage to us
| with a ham-fisted attempt to scrape too much too quickly, we're
| forced to degrade the experience for ALL users by introducing
| captchas and other techniques we'd really rather not.
| paco3346 wrote:
| I'm right there with you. I'm the lead engineer for an
| automotive SaaS provider (with ~6000 customers and ~4 billion
| requests per month) and we recently started moving all our
| services to Cloudflare's WAF to take advantage of their bot
| protection. We were getting scrapes from botnets in the
| 100,000+ requests per minute range, which was affecting
| performance.
|
| We chose to switch to the JS challenge screen as it requires no
| human interaction. We now block 75% (estimated to the best of
| our knowledge) of bot traffic but some customers are livid over
| the challenge screen.
| [deleted]
| EdwardDiego wrote:
| What were they scraping, if I can ask? Was it targeted or
| just wget -r style?
| paco3346 wrote:
| It was a hybrid of low-effort vulnerability scanning and
| targeted inventory scraping. Many dealerships in the
| automotive space will pay gray-hat third parties to scrape
| and compile data on their competitors.
|
| The irony for us as a provider is that it's one of our
| customers (party A) paying a third party to scrape data
| from another one of our customers (party B) which in turn
| affects the performance of party A's site. We've started
| blocking these third parties and directing them to paid
| APIs that we offer.
| RobSm wrote:
| And how do you get your 'inventory data'? Aren't you
| scraping (or using scraped data) yourself? Oh the irony
| :)
| paco3346 wrote:
| No, we're a contracted provider for these customers. They
| ingest their data into our network through APIs or CSVs.
| Andoryuuta wrote:
| I'm really surprised that the JS challenges helped so much,
| given that there are open source libraries for bypassing them
| (e.g. cloudscraper[0]).
|
| [0]: https://github.com/venomous/cloudscraper
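|
| For reference, using it is about three lines (per the project
| README; it behaves like a normal requests session):
|
|   import cloudscraper
|
|   # create_scraper() returns a requests.Session subclass that
|   # solves the JavaScript challenge before handing back the
|   # response.
|   scraper = cloudscraper.create_scraper()
|   print(scraper.get("https://example.com").text)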
| paco3346 wrote:
| If someone wanted to get past it they probably could. We've
| had a few sources of traffic that we've had to straight up
| block (as opposed to challenge) because of this exact
| issue. So far it's been a "good enough" solution that
| blocks enough of the bot traffic to be effective.
| RobSm wrote:
| Why do you think those bots were scraping your data in the
| first place?
| devwastaken wrote:
| If an amateur can do that to your service by scraping, imagine
| what someone can do if they actually intend to do you harm.
| With cloud pricing models someone could find a little
| misconfiguration or oversight and put you in the hole in
| operating costs. Anti-abuse is a necessary design when your
| service is exposed to the internet.
|
| Not saying that doesn't suck - it does, it's why many ideas
| don't work in practice as an online service.
| kulikalov wrote:
| Why not create an API endpoint and charge a modest fee for that
| data? You'll make money instead of spending it.
| scarygliders wrote:
| Do you honestly believe all site scraper people/companies are
| ethical enough to go to whoever pays /them/ to scrape data
| from a competitor's site and say "oh they offer an API to
| access this data let's pay for that", instead of "why pay for
| that data when we can scrape it right off their site"?
|
| Also, not all types of company will provide API endpoints. It
| all depends on the type of site - for example, an online shop
| might not wish to provide easily accessible data on offered
| products and prices, to their competitors who may wish to
| undercut them. Why would an online shop do that?
| jadell wrote:
| I run a large scraper farm against several large sites.
| They're not online shops, and we don't compete with them.
| But they do have hundreds of thousands of data points that
| we use to provide reports and analytics for our clients,
| who also do not compete with the sites.
|
| I absolutely would pay for an API that provides that data.
| I'd be willing to pay 10x more than the cost of maintaining
| and running the scrapers.
|
| But the sites being scraped have no interest in that.
| texasbigdata wrote:
| Building and maintaining the scraper is not the cost they
| would use to measure it internally. It's the cost to
| build the API, and support it and perhaps any perverse
| incentive it creates where even more data flows out to
| competitors.
| wolverine876 wrote:
| And the cost of being scraped.
| RobSm wrote:
| Building an API is five times easier than building routes
| for your public web pages, which are basically an 'API' as
| well.
| CWuestefeld wrote:
| Have you tried approaching those sites and asking them to
| provide an API, pointing out that it would be easier for
| both of you in the long run? Or are you just assuming
| they wouldn't do it?
|
| Because right now, the bots - which comprise probably 2/3
| of my traffic - are causing me huge headaches, and I sure
| wish the people doing it would tell me what the heck they
| want.
| zivkovicp wrote:
| Well, you don't need an API, just a CSV file with a
| catalog.
|
| The scraping company WILL use the API/CSV file... they will
| probably also still charge their customer for scraping, so
| it's a win-win :D
|
| You can think of it this way, the prices and product data
| are publicly visible already on the website, there are no
| real secrets, none of it is password protected.
|
| You can be principled and insist on blocking bots and spend
| a lot of time and money on tools, people, and ultimately
| hosting, because the bots will _always_ win; or you can
| offer the data for free or a minimal fee and serve it at
| almost zero cost, cached, from a micro-sized server.
|
| You can always lie about some of the prices if you want,
| but you will just encourage bots again.
|
| Ethics are nice but, let's be honest, sorely lacking.
| Sometimes it's better to be pragmatic.
| scarygliders wrote:
| > You can think of it this way, the prices and product
| data are publicly visible already on the website, there
| are no real secrets, none of it is password protected.
|
| There's the problem right there. The prices and product
| data are publicly visible - because there is a target
| audience of /humans/ for whom the site is designed and
| intended. The site is not there to cater to a
| competitor's scrapers.
|
| I don't care how much people couch their unethical
| behaviour in "the data is publicly available", the
| basic fact is most if not all websites exist for human
| eyeballs to look at them. They do not exist for arseholes
| to DOS them by inundating them with scrapers.
| zivkovicp wrote:
| I agree 100%, but it is a fact of life, and sometimes
| it's better to just minimize the fuss and focus on the
| things that matter.
|
| Your argument is perfectly valid and applies to offline
| activities as well (what stops a competitor from walking
| through the aisles of a Walmart or Costco?), but this is
| a battle that can't be won; there are too many parasitic
| actors. It is human nature.
| mcdonje wrote:
| > (what stops a competitor from walking through the
| aisles of a Walmart or Costco?)
|
| That's a significant portion of Nielsen's business model.
| TeMPOraL wrote:
| > _the basic fact is most if not all websites exist for
| human eyeballs to look at them._
|
| There's a whole ethical subthread here of websites trying
| to make the experience for those humans miserable, and
| taking away the agency necessary to protect oneself from
| that. A browser is _a_ user agent. So is a screen reader.
| So is a script one writes to not deal with bullshit
| fluff, when all one wants is a simple table of products,
| features and prices.
| 0xdeadbeefbabe wrote:
| Let's not encourage these unethical people to even think
| of using human eyeballs and manual data entry for their
| scraping instead of bots. That sounds pretty darn
| unethical.
| zo1 wrote:
| From my perspective, the problem is that the data that is
| offered isn't really "for humans". The data is for
| _convincing_ the humans to buy /pay or worse, browse and
| watch ads as a result.
|
| But overall, information is one of those goods that has
| intrinsic properties like no other. It can be copied,
| infinitely. And we haven't yet figured out the dynamics
| of how to reason about it, so it feels like we're
| pretending it's a physical good.
|
| Edit. Side note. I'd go further and say that some of the
| data is even worse, it's "offered" with the real
| intention being to confuse the users into performing non-
| optimally in the market. Look at
| Amazon/Ebay/AliExpress/Google listings for evidence of
| that. Take Google - an ML and scraping powerhouse - and
| the best it can muster is results spammed with fake
| websites and duplicate/confusing listings.
| TeMPOraL wrote:
| You hit the nail on the head. It's hard to have sympathy
| for site operators complaining about scraping, when
| almost every site does its best[0] to make using it a
| time-consuming, potentially risky and overall annoying
| ordeal. Not to mention, information asymmetry is anathema
| to a well-functioning market, and yet the no. 1 reason for
| fighting bots given in the whole thread here is a desire
| to maintain that information asymmetry.
|
| And that's also the dirty secret behind the "attention
| economy": it's whole point is to make things _as
| inefficient as possible_ , because if you're making money
| on people's attention, you need to first steal it (by
| distracting them from what they're trying to achieve),
| and then either direct it towards your goals (vs. those of
| the users), or stretch it out to maximize their exposure
| to advertising.
|
| --
|
| [0] - Sometimes unintentionally. Unfortunately, the
| overall zeitgeist of UX design is heavily influenced by
| bad players, so default advice in the industry is often
| already intrinsically user-hostile.
| matheusmoreira wrote:
| > Why would an online shop do that?
|
| Because otherwise the HTML will become the API.
| marginalia_nu wrote:
| Bots are one of those things that are easy to build and hard to
| get right, and there's really no way of preparing for the
| chaotic reality of real web pages other than fixing the
| problems as they show up. Weird and unexpected interactions are
| going to happen. Crawling the real web involves navigating a
| fractal of unexpected, undocumented and non-standard corner
| cases. Nobody gets that right on the first try. Because of that
| I do think we need to be a bit patient with bots.
|
| At the same time, even as someone who runs a web crawler, I
| have zero qualms about blocking misbehaving bots.
| chillfox wrote:
| I kinda feel like rate limiting your requests to individual
| domains and IP addresses is an easy thing that goes a long
| way towards getting it right.
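|
| Something like this covers the basic case (sketch for a
| single-threaded crawler):
|
|   import time
|   from urllib.parse import urlsplit
|
|   last_hit = {}  # domain -> time of the previous request
|
|   def polite_wait(url, min_delay=1.0):
|       # Sleep so successive hits to the same domain are at
|       # least min_delay seconds apart.
|       domain = urlsplit(url).netloc
|       elapsed = time.time() - last_hit.get(domain, 0.0)
|       if elapsed < min_delay:
|           time.sleep(min_delay - elapsed)
|       last_hit[domain] = time.time()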
| marginalia_nu wrote:
| There are still snags with that.
|
| Stuff like redirect resolution is very easy to overlook.
| You may think you're fetching 1 URL per second, but if you
| are using the wrong tool and you're on a server that has
| you bouncing around like a pinball and takes you
| through a dozen redirects for every request, the reality
| may be closer to 10 requests per second.
|
| On top of that, sometimes the same server has multiple
| domains. Sometimes the same IP-address serves a large
| number of servers (maybe it's a CDN).
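|
| One way to keep the budget honest is to charge every redirect
| hop against the limit, not just the initial URL (Python
| requests sketch; resp.history lists each intermediate
| response):
|
|   import requests
|
|   def fetch_with_hop_count(url):
|       # Every entry in resp.history was a real request the
|       # target had to serve; count them all.
|       resp = requests.get(url, allow_redirects=True, timeout=30)
|       return resp, len(resp.history) + 1  # hops + final fetch
|
|   # To cap runaway chains outright, set session.max_redirects
|   # on a requests.Session before fetching.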
| RobSm wrote:
| If you build your site in a way that multiplies each
| request 10x, well then that's what you get. Don't do that
| and you won't have issues with requests. Or handle those
| requests properly. There are solutions to that. You know
| how many requests your local google CDN gets? They know
| how to manage load.
| marginalia_nu wrote:
| Most pages have at least an http->https redirect, and many
| contain a lot of old links to http content.
|
| Usually it's error pages that really drive the large
| redirect chains. They often feel like some forgotten
| stopgap put in place to help with a migration to a
| version of the site that no longer exists.
|
| Of course you don't know it's an error page until you
| reach the end of the redirect chain.
| [deleted]
| krzyk wrote:
| As a programmer who sometimes just wants to check whether a given
| item is available in a store, I would like to be able to use an
| API for that. But if one is not available, one has to scrape.
| scarygliders wrote:
| Right with you there.
|
| I had a particularly bad time not so long ago, when a
| customer's site - a shop - was brought to its knees because
| someone, probably a competitor, hired a scraper company of
| some sort to scrape every product and price.
|
| The scraper would systematically go through every single
| product page.
|
| And by scraper, I mean - 100's of them. All at the same time,
| using the old trick of 1 scraper requesting 3 or 4 product
| pages at a time then pausing for a while.
|
| They used umpteen different IP address blocks from all over the
| globe - but mainly using OVH vps IP address blocks from France.
|
| Now, maybe if they'd just thrown, say, 5 or 10 of the scraper
| "units" at the site, no one would have noticed in amongst
| Googlebot (which they wanted to use anyway because they are
| using Google Shopping to try to bring in more sales).
|
| But no. This shower of arseholes threw 100's of scraper "tasks"
| at the site. They got greedy.
|
| Now, the site was robust enough to handle this load - barely -
| massive as it was. But having to do that /and/ also handle
| normal day-to-day traffic? Nah. The bastards got greedy, and
| like you I spent a few days unfucking the damage they were
| causing.
|
| Seriously, I hate scrapers. I hate the people who make
| scrapers. I hate their lack of ethics. Fuck those guys.
| thatwasunusual wrote:
| It sucks when this happens, but it's easily avoidable by
| using a caching frontend of some sort.
|
| My favorite is Varnish,[0] which I have used with great
| success for _many_ web sites throughout the years. Even a web
| site with 10+ million requests per day ran from a single
| web server for a long time, a decade-ish ago.
|
| [0] https://varnish-cache.org/
| mdoms wrote:
| If your site is so poorly written it can't handle a few
| hundred computers trying to do something as simple as loading
| your product pages then sorry, but that's on you. The
| information is on the public web and scrapers are as entitled
| to access it as any web browser.
| [deleted]
| matheusmoreira wrote:
| > Seriously, I hate scrapers. I hate the people who make
| scrapers. I hate their lack of ethics. Fuck those guys.
|
| Not everybody in this space is out to destroy your site. Some
| of us actively try to put as little load on your site as
| possible. My scraper puts less load on sites than I do when I
| browse them normally, I've measured it. Really sucks when we
| get lumped together with the other abusers and blocked.
| ligerzer0 wrote:
| Exactly, some of us use scrapers because while we can't go
| full Richard Stallman, we also don't want to visually sift
| through ridiculous UI just to look at some basic data/text.
| _jal wrote:
| In a past life, we were consulting with a startup that
| offered a subscription data service. They were very sensitive
| about scrapers, especially on the time limited try-before-
| you-buy accounts, which competitors were abusing.
|
| At their request, we built a method to flag accounts for data
| poisoning. Once flagged, those accounts would start getting
| plausible-ish looking garbage data.
|
| It was pretty effective. One competitor went offline for a
| few days about a week after that started, and had a more
| limited offering when they came back up.
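|
| The shape of that is roughly (illustrative sketch; the field
| names and jitter range are made up, and the real thing was
| presumably subtler):
|
|   import hashlib
|   import random
|
|   FLAGGED = {"trial-4921"}  # account ids flagged for poisoning
|
|   def serve_row(row, account_id):
|       if account_id not in FLAGGED:
|           return row
|       # Seed per account and row so re-querying returns the
|       # same plausible-ish wrong numbers - harder to spot
|       # than pure noise.
|       seed = f"{account_id}:{row['id']}".encode()
|       rng = random.Random(hashlib.sha256(seed).hexdigest())
|       fake = dict(row)
|       fake["value"] = round(row["value"] * rng.uniform(0.8, 1.2), 2)
|       return fake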
| scarygliders wrote:
| That's a good way of going about dealing with this kind of
| abuse indeed. Wish I'd thought of doing that at the time,
| but due to the nature of this shop you didn't need a user
| account to browse the products/prices.
|
| I'm now making an entirely new shop for them - I shall bear
| this in mind. Thanks for that!
| brightball wrote:
| Yea. Detect them and mess with them is the only approach
| that seems to work for a lot of abusive activity. Banning
| doesn't work because they will just start over from
| scratch. The only thing you can really do is make them
| think you haven't "caught" them yet and during that stretch
| make sure their time is wasted.
| funnyflamigo wrote:
| > Seriously, I hate scrapers. I hate the people who make
| scrapers. I hate their lack of ethics. Fuck those guys.
|
| Wait till you find out what half of Google's business is
| based on (spoiler - scraping).
|
| I really don't think scraping itself is an issue 90% of the
| time. It's the behavior of the out-of-control scrapers that
| is the problem. A well-behaved scraper should barely be
| noticeable, if at all.
| RobSm wrote:
| Exactly. I am surprised that the 'devs' can't figure out a
| way to block only annoying/excessive scrapers. Most likely
| they are just lazy: slap in a 3rd-party 'solution' and the
| job is done. Pay me.
| jjeaff wrote:
| At least Google's scraping does result in your website
| being discoverable by users. So you get something out of
| it. That's not to say that Google isn't sometimes misusing
| or stealing the data they scrape. But at least there is some
| benefit. Many other scrapers are merely taking the data to
| compete.
| funnyflamigo wrote:
| I strongly feel that if a human can get to it manually,
| we have to accept that either it will be botted or humans
| will be paid to do it by hand (They call these people
| "analysts" or "market researchers").
|
| I might argue that what Google actually uses their
| scraped data for is their search engine - which is
| private. They simply allow us access to specially crafted
| queries, which they can and do manipulate (for many
| reasons, some good some bad).
|
| The only thing I'd say meets that definition would be
| like Common Crawl.
| [deleted]
| [deleted]
| jtdev wrote:
| Considering the demand for your content, why haven't you
| created and provided an API? Maybe you could monetize?
| chewmieser wrote:
| Like everyone and their brother has a web spider. And some of
| them are VERY badly designed. We block them when they use too
| many resources, although we'd rather just let them be.
|
| Can't speak for the OP but we have APIs and move the ones
| scraping and reselling our content to APIs. The majority are
| just a worthless suck on resources though.
| ebbp wrote:
| We do offer an API - presumably the scrapers are trying to
| avoid using it.
| halfmatthalfcat wrote:
| Maybe the API terms/cost are prohibitive? I'm sure there's
| some equilibrium where they would rather pay you than go
| through the trouble of scraping.
| kulikalov wrote:
| Maybe docs or infra are unbearable
| purerandomness wrote:
| Why do you think they are trying to circumvent it?
|
| Does your API provide all the information that can be found
| on the site, or are they scraping because the API is
| incomplete?
|
| We once had to scrape Amazon product pages because they
| have a lot of API endpoints, but those didn't contain the
| data we needed.
| scarygliders wrote:
| Why would Amazon wish to provide you with easy to access
| data on their products and prices when you could either
| be a competitor wishing to undercut those prices, or be a
| scraper company hired by such a competitor?
|
| In what universe is providing such a straightforward way
| of helping a competitor considered sane business
| practice?
| matheusmoreira wrote:
| Because they will get the data regardless of what you do
| and if you don't make an API it will cost you more due to
| overhead.
| jtdev wrote:
| In the end, they still get the data, just in a much less
| desirable way for both you and the customer.
| manquer wrote:
| Most sellers on the Amazon platform give Amazon that
| information and a lot more, knowing full well Amazon will
| use their sales data to launch an Amazon Basics
| competitor.
|
| It is a sane business approach when you are a pragmatic
| business that knows the limits constraining it.
|
| Either the content company builds a simple API (could be
| just a static CSV file hosted on S3 or whatever) with
| useful information, or it tries to monetize/hide this
| information and forces scrapers to use the website.
|
| A bot is always going to win unless you are willing to
| give users a lot of friction too. In the era of deepfakes
| and fairly robust AI tooling, the difference between bot
| action and human action is not all that much.
|
| If you are going to be aggressive with captchas, IP blocks
| and other fingerprinting, users who get falsely flagged or
| annoyed will leave.
|
| When the cost of losing those users is more than the cost
| of allowing scrapers access, you would absolutely set up
| the API.
| weird-eye-issue wrote:
| Man your comment is hilarious because in fact Amazon DOES
| provide an API for exactly that
| scarygliders wrote:
| And yet...
|
| > We once had to scrape Amazon product pages because
| they have a lot of API endpoints, but those didn't
| contain the data we needed.
|
| ...only a couple of comments up.
| matheusmoreira wrote:
| This is the number one reason to scrape websites. It's
| always nice when there's an API with documentation and
| rate limiting rules you can follow. Sometimes the data I
| need just isn't there, though. Then I open up their site
| and find a huge amount of private API endpoints that do
| exactly what I want. Then I open up a ticket about it and
| it gets 200 replies but they ignore it for _years_. It's
| fucking stupid and it's really no wonder people scrape
| their site.
| 1cvmask wrote:
| What is your site may I ask?
|
| Just curious about the difference in value from using your
| API and web scraping as there is a cost to web scraping as
| well.
| bryanrasmussen wrote:
| If you make your scraper well, and it counterfeits being
| a real user believably, you end up with a solution that
| can be tweaked as needed to handle whatever traps people
| put in to try to defeat your scrapers.
|
| If you make your API client well, you don't have the
| problems of a scraper - but if the API owner decides to
| change the rules and you can't do what your business is
| based on being able to do (think of the API owner as
| Twitter), then you need to make a scraper.
| gmanis wrote:
| Is it not viable to put the majority of your data behind a
| login, so the bots only get a very limited snapshot while
| legitimate users get it through a free login?
|
| I'm asking because I'm going through a very similar
| situation and would love to see other opinions around this.
| weird-eye-issue wrote:
| You are defining legitimate users as those that have a
| valid session cookie? Good luck
| aninteger wrote:
| Wait, why wouldn't you have rate limiting on your API?
| Providers like Cloudflare offer this although I guess you
| could roll your own too since our industry loves to
| reinvent the wheel.
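|
| Rolling your own per-key limiter is not much code (token-
| bucket sketch; the rate and burst numbers are arbitrary):
|
|   import time
|
|   class TokenBucket:
|       def __init__(self, rate=5.0, burst=20):
|           self.rate, self.burst = rate, burst  # tokens/sec, cap
|           self.state = {}  # key -> (tokens, last refill time)
|
|       def allow(self, key):
|           tokens, last = self.state.get(key,
|                                         (self.burst, time.time()))
|           now = time.time()
|           tokens = min(self.burst,
|                        tokens + (now - last) * self.rate)
|           ok = tokens >= 1
|           self.state[key] = (tokens - 1 if ok else tokens, now)
|           return ok  # False -> respond 429 Too Many Requests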
| throwaway2993 wrote:
| I wrote a scraper a couple of years ago to get a single data
| point from a website where my client was already a paying
| customer. This website had an API, which they were also
| paying for, but the API didn't cover that data point, so at
| the time they had one of their admin people populating that
| missing piece of data manually, which was taking them around
| ten minutes a day.
|
| I asked them if my customer could pay to access this data
| point via their API and they quoted 3600 EUR/month! Enter the
| scraper...
| [deleted]
| taytus wrote:
| >where some amateur did real damage to us
|
| If an amateur can do damage to you, then I have some bad news
| for you...
| Goronmon wrote:
| _If an amateur can do damage to you, then I have some bad
| news for you..._
|
| I believe the point wasn't surprise that damage occurred at
| all, but frustration that damage can occur just out of
| laziness/ignorance rather than malice.
| scarygliders wrote:
| Indeed, that was precisely their point, and "bad news for
| you" is disingenuous as there are many techniques used by
| incompetent, or just downright unethical and greedy scraper
| companies which, no matter how robust the target is, can
| still give it a major headache.
|
| I've witnessed a site being basically DOS'ed due to
| particularly greedy and aggressive mass scraping attempts.
| convolutionart wrote:
| This is nonsense. It's always easier to destroy than to
| build/maintain. If you've got any real advice, by all means...
| biosed wrote:
| I used to lead Sys Eng for a FTSE 100 company. Our data was
| valuable but only for a short amount of time. We were constantly
| scraped which cost us in hosting etc. We even seen competitors
| use our figures (good ones used it to offset their prices, bad
| ones just used it straight). As the article suggest, we couldn't
| block mobile operator IPs, some had over 100k customers behind
| them. Forcing the users to login did little as the scrapers just
| created accounts. We had a few approaches that minimised the
| scraping:
|
| Rate Limiting by login,
|
| Limiting data to known workflows ...
|
| But our most fruitful effort was when we removed limits and
| started giving "bad" data. By bad I mean altering the price up
| or down by a small percentage. This hit them in the pocket but,
| again, wasn't a silver bullet. If a customer made a transaction
| on the altered figure, we informed them and took it at the
| correct price.
|
| It's a cool problem to tackle but it is just an arms race.
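|
| A sketch of the jitter half (Python; illustrative - seeding
| per login keeps a given scraper's view consistently wrong,
| which is harder to detect than random noise):
|
|   import random
|
|   def displayed_price(true_price, login_id):
|       # Nudge the figure a few percent up or down,
|       # deterministically per login and product, for logins
|       # flagged as scrapers.
|       rng = random.Random(f"{login_id}:{true_price}")
|       return round(true_price * rng.uniform(0.97, 1.03), 2)
|
|   # At transaction time, compare the quoted figure with the
|   # real price and, as described above, inform the customer
|   # and complete the sale at the correct price.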
| wolverine876 wrote:
| > But our most fruitful effort was when we removed limits and
| started giving "bad" data. By bad I mean altering the price up
| or down by a small percentage. ... If a customer made a
| transaction on the altered figure, we informed them and took
| it at the correct price.
|
| Is that legal? It would be a big blow to trust if I was the
| customer, but that's without knowing what you were selling and
| in what market.
| killingtime74 wrote:
| It's legal if it's in the contract. Standard for contracts to
| allow for mistakes and confirmations of prices.
| kwhitefoot wrote:
| It's not mistake if you do it deliberately!
| killingtime74 wrote:
| Yes (not saying it's a mistake), but a confirmation step
| can be in the contract; no law says you only get one
| chance to display a price.
| rootusrootus wrote:
| I know a guy at Nike that had to deal with a similar problem.
| As I recall, they basically gave in -- instead of trying to
| fight the scrapers, they built them an API so they'd quit
| trashing the performance of the retail site with all the
| scraping.
| chadwittman wrote:
| The real Jedi move
| wrycoder wrote:
| Especially if you charge for it, which would save them
| money, because they wouldn't have to redo their code every
| time you changed your website.
| gonzo41 wrote:
| I think there's an opportunity for a new JS framework to have
| something like a randomly generated DOM that will always
| display the page and elements the same to a human but
| constantly break selector paths for scrapers.
|
| Like displaying a table with semantic elements, then divs,
| then using an iframe with CSS grid and floated values over
| the top.
|
| This almost seems like a problem for AI to solve.
| matheusmoreira wrote:
| Yes. That's exactly what everyone should do.
| echelon wrote:
| If data is your competitive advantage or product, then
| what? Accept that your market no longer exists and that
| there's no way to stop theft?
| Grimm1 wrote:
| You're going to need to explain how scraping publicly
| available information on a website is theft.
|
| If information is your competitive advantage maybe you
| shouldn't have it on a publicly accessible website, and
| should instead stick it behind an API with pay tiers and
| a very clear license regarding what you may do with it as
| an end user.
|
| Note, a simple sign up being required to view a website
| makes it not publicly available information any longer
| and you can cover usage, again, in a license.
|
| Then you have a whole bunch of legal avenues you can use
| to protect your work. Assuming you can afford it that is.
| achillesheels wrote:
| It is copyrighted information, no? So technically it is
| intellectual property theft if the scraped data is used
| for commercial purposes.
| Grimm1 wrote:
| No? If you place information publicly on a website it's
| pretty much fair game - no copyright violation, especially
| regarding user-generated information. That's my take, but
| legally it's a gray area and it's still going back and
| forth in the courts (at least in the US). For a while,
| before a decision was vacated by the Supreme Court,
| scraping publicly available information on a site was
| legally protected, seemingly in line with my thoughts
| on it.
| ransom1538 wrote:
| I love the honeypot approach. Put tons of valuable-looking
| hrefs on the page that are invisible (CSS) and that only a
| scraper would find. Then just rate limit that IP address and
| randomize the data coming back. Profit.
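|
| Wired up, it can be as small as this (Flask sketch; the trap
| URL and responses are illustrative):
|
|   from flask import Flask, abort, request
|
|   app = Flask(__name__)
|   trapped = set()
|
|   # The page embeds a link no human ever sees or clicks, e.g.
|   #   <a href="/specials-2021" style="display:none">deals</a>
|   # Only a crawler ignoring the rendered CSS will follow it.
|
|   @app.before_request
|   def check_trap():
|       if request.remote_addr in trapped:
|           abort(429)  # or serve randomized junk data instead
|
|   @app.route("/specials-2021")
|   def honeypot():
|       trapped.add(request.remote_addr)
|       abort(404)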
| endymi0n wrote:
| > It's a cool problem to tackle but it is just an arms race.
|
| Plus, it's one you're going to lose. I was once asked at an
| All-Hands why we don't defend ourselves against bots even more
| vigorously.
|
| My answer was: "Because I don't know how to build a publicly
| available website that I could not scrape myself if I really
| wanted to."
| DeathArrow wrote:
| You can put in some wasm crypto-mining code and at least profit
| from the bots. :D
| abc03 wrote:
| I scrape government sites a lot as they don't provide APIs. For
| mobile proxies, I use the proxidize dongles and mobinet.io (free,
| with Android devices). As stated in the article, with CGNAT it's
| basically impossible to block them - in my case, half the country
| couldn't access the sites anymore (if you place them in several
| locations and use one carrier each there).
| kerokerokero wrote:
| Thanks for the share. Great stuff.
|
| I used to scrape websites to generate content for higher SERPs.
|
| Ended up going into the adult industry lols.
| (https://javfilms.net)
| anon9001 wrote:
| Neat! I've run across your site organically :P
|
| I've always wondered, and since you're right here... how do
| sites like this make money?
|
| It looks like you're probably crawling all the JAV vendors,
| finding free clips of today's releases, embedding them in your
| own site to draw traffic, and making money with affiliate links
| to buy the full content?
|
| Am I missing anything? It seems hard to believe you'd get
| enough affiliate signups to make it worthwhile.
|
| I can imagine your site as being a few hours a year of script
| maintenance and a money printer, or a 40hr/week SEO job with
| 1000s of similar sites across the adult industry.
|
| I'd love to know anything you're willing to share about how the
| business works.
| wilg wrote:
| Not the same kind of scraping, but does anyone have
| thoughts/resources/best practices for doing link previews (like
| Twitter/iMessage/Facebook)?
| mrg3_2013 wrote:
| wow! That was an interesting read.
| neals wrote:
| For a particularly hard-to-scrape website, using some kind of bot
| protection that I just couldn't reliably get around (if anybody
| wants to know what it was exactly, I'll go and check), I now
| have a small Intel NUC running Firefox that listens to a local
| server and uses Tampermonkey to perform commands. Works like a
| charm, and I can actually see what it's doing and where it's
| going wrong (though it's not scalable, of course).
|
| We use it for data-entry on a government website. A human would
| average around 10 minutes of clicking and typing, where the bot
| takes maybe 10 seconds. Last year we did 12000 entries. Good bot.
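|
| The glue can be as simple as a local HTTP queue that the
| Tampermonkey script polls from the real browser (Python/Flask
| sketch of the server half; endpoint names are made up):
|
|   import queue
|   from flask import Flask, jsonify, request
|
|   app = Flask(__name__)
|   commands, results = queue.Queue(), queue.Queue()
|
|   @app.route("/next")
|   def next_command():
|       # The userscript polls this and executes what comes back.
|       try:
|           return jsonify({"cmd": commands.get_nowait()})
|       except queue.Empty:
|           return jsonify({"cmd": None})
|
|   @app.route("/result", methods=["POST"])
|   def post_result():
|       results.put(request.get_json())
|       return "", 204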
| nkozyra wrote:
| You can use Chromium/Chrome via CDP with headless turned off
| and see the same thing.
| funnyflamigo wrote:
| I'm curious what bot protection it was. It couldn't have been
| trying very hard unless you were employing multiple anti-
| fingerprinting techniques. I'm assuming you used Firefox's
| built-in anti-fingerprinting?
___________________________________________________________________
(page generated 2021-11-05 23:00 UTC)