[HN Gopher] We accidentally burned through 200GB of proxy bandwi...
___________________________________________________________________
We accidentally burned through 200GB of proxy bandwidth in 6 hours
Author : suchintan
Score : 49 points
Date : 2024-09-19 16:15 UTC (6 hours ago)
(HTM) web link (blog.skyvern.com)
(TXT) w3m dump (blog.skyvern.com)
| 8organicbits wrote:
| What infrastructure is this using? Bandwidth seems pretty pricy
| tobyjsullivan wrote:
| No kidding. AWS's notoriously expensive data transfer is only
| $0.09/GB. Who's charging $2.50/GB? Are they running on a
| cellular SIM with no data plan?
| mikeocool wrote:
| Sounds like they are running a web scraping business -- so
| maybe? Using a cellular connection would be one way to help
| not get immediately captcha-ed by every site using Cloudflare.
| blitzar wrote:
| They should really set up their scraper and (exfil the data)
| via regular connections.
| ronsor wrote:
| Residential rotating proxy providers charge very high rates
| for data, on the order of $1 - $10 per GB. (These providers
| often do run their proxies through the cellular network,
| actually.)
| SteveNuts wrote:
| Is this something where end users can get paid for doing
| nothing other than proxying some traffic through their ISP?
| ronsor wrote:
| That's probably where some of the proxies come from.
| r1ch wrote:
| The end user typically has their device compromised by
| using free apps where the developers were bribed $$$ to
| add the proxy "SDK". The botnet operator then rents out
| the bandwidth at exorbitant rates to anyone who will pay
| for it.
|
| Chrome extensions are also a huge source of this, they
| look for extensions with a large install base and then
| make an offer to buy it to turn all the users into
| proxies.
| slt2021 wrote:
| end users install shady VPN apps/extensions to watch
| pirated content, and become part of residential proxy
| mesh/botnet
| ipython wrote:
| If by "some" traffic you mean botnets, sneaker and ticket
| scalpers, scammers, content scrapers, credential stuffers
| ... generally scummy stuff, sure.
|
| Based on this blog post I would not do any business with
| Skyvern, if they indeed do business with this underworld
| of bottom feeders.
| mrguyorama wrote:
| Sure, if you want a whole bunch of legitimately malicious
| traffic to be attributed to your internet account.
| floam wrote:
| Yes. Google "honeygain"
| perks_12 wrote:
| 200GB for $500? What cloud is this?
| hooverd wrote:
| api.skyvern.com is a CNAME to an EC2 ALB, but even using a NAT
| Gateway ($$$) I can't make more than $1/GB add up.
| intunderflow wrote:
| Looks like webshare from the screenshot
| jsnell wrote:
| I don't think it's a cloud. It's more likely a residential
| proxy network, which are typically created by installing
| malware on users' machines.
|
| The operators of these proxy networks want to avoid detection
| by both the users whose bandwidth they're stealing, and by the
| companies whose data is being scraped. So they want to make the
| bandwidth very expensive. And that expensive bandwidth in turn
| means that their only clients are dodgy as well. Either people
| looking to scrape data without consent and monetize it, or
| outright criminals.
| bscphil wrote:
| It's kind of surprising that a presumptively legitimate
| company (and YC-funded startup) would out themselves as
| buying black market residential proxy bandwidth, isn't it?
| jsheard wrote:
| Their frontpage also advertises the ability to pass
| CAPTCHAs, whether by automation or more likely by
| delegating them to third-world CAPTCHA farms. If that's a
| major selling point for your automation service then your
| target market probably ranges from dubious (e.g. data
| scrapers trying to get around limits) to extremely dubious
| (e.g. ticket scalpers, spammers, click fraud, etc).
| xp84 wrote:
| Just because something can be used for sketchy purposes
| doesn't mean that's the only purpose of it. There are
| thousands of situations where people are forced to
| interact with a shitty website 100x per day and the site
| won't provide an api. Imagine if your job was booking
| plane tickets all day. United could provide you an API
| key to do so via an API, but in practice they won't, only
| some enterprisey travel software company can get that
| kind of access, for a steep fee. You could build a tool
| which automatically puts together an itinerary based on
| rules and books it, through a tool like this. Perhaps a
| slightly contrived example but I believe things like this
| definitely happen.
| dontlikeyoueith wrote:
| > United could provide you an API key to do so via an
| API, but in practice they won't, only some enterprisey
| travel software company can get that kind of access, for
| a steep fee. You could build a tool which automatically
| puts together an itinerary based on rules and books it,
| through a tool like this. Perhaps a slightly contrived
| example but I believe things like this definitely happen.
|
| And you think that's NOT sketchy?
|
| I'm almost afraid to ask where you think the bar is...
| stickfigure wrote:
| It's exactly as sketchy as having a hypothetical robot
| sit down at a console and type it out. Which, IMO, is not
| very sketchy at all.
| rty32 wrote:
| Imagine a legitimate travel agency cannot book 100 United
| tickets a day via methods outlined in business contracts
| and needs to resort to shady practices.
|
| Dude, please provide some real solid evidence to back
| this up, and perhaps come up with another realistic
| scenario where bypassing captcha is justified.
| xp84 wrote:
| > Imagine a legitimate travel agency cannot book 100
| United tickets a day
|
| That's the whole point, I never said travel agency, I was
| thinking a company with travelling consultants.
|
| How TF is it "shady" to purchase and use airfare?
|
| And again, bypassing captcha, say, to purchase tickets
| isn't evil either, if you are purchasing them for use and
| not for resale. It would just allow a person to book
| tickets for 50 people without wasting 6 hours to complete
| 25 CAPTCHAS and type in my information 25 times.
|
| CAPTCHA is a blunt instrument deployed in an attempt to
| mitigate _abuse,_ but it has a massive bad side effect
| that for every heavy user (not just evil users), it
| requires a human butt to be in a seat somewhere to do
| mindless busywork that could otherwise be automated.
| Working around that (sounds like OP agrees to do so on a
| case by case basis) is not inherently evil. It's as evil
| (or benign) as whatever you're using it for.
| suchintan wrote:
| Agreed. Just for reference, one of our most popular use-
| cases is automating data entry into CRMs without APIs...
| No one wants to be doing this stuff manually, and
| automating it has some serious positive QoL impact
|
| We get a lot of requests for bad usage (i.e. spinning up
| upvote rings on Reddit) but we don't want to support
| things like that
| mrguyorama wrote:
| How long have you been here? It's not surprising at all. HN
| and YC have not demonstrated an aversion to "uh, greyhat"
| activity.
|
| If it were 2000, people would be sharing their ad clicking
| startups.
|
| YC has funded a looooooot of sketchy companies.
| dewey wrote:
| Residential proxies are not necessarily "black market".
| asmor wrote:
| It's almost never done with the full understanding of the
| person providing the proxy. It doesn't matter whether they get
| promised some change, their browser addons betray them, or
| they install bundleware/adware.
|
| I'd say it has about the same moral standing as a payday
| loan.
| dewey wrote:
| There are other ways, for example through mislabeled
| "residential" blocks, or "residential" proxies that are
| sold by ISPs to vendors.
| iforgotpassword wrote:
| I use one. I run a bot on IRC that extracts the <title> of
| every link posted (or downloads the image/whatever and
| extracts metadata) and announces that to the channel. It has
| become more and more pointless to run this on a vps.
| Google/YouTube block the IP range, a lot of websites return
| the cloudflare security check, Amazon works on some days and
| doesn't on others... Ever since I started proxying via residential
| proxies it just works. I'm a smooth criminal. :>
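For readers wondering what the title-extraction step of such a bot might look like, here is a minimal sketch using only the Python standard library. It is not the commenter's actual bot: the names `TitleParser` and `extract_title` are made up for illustration, and fetching the page (directly or through a proxy) is deliberately left out.

```python
from html.parser import HTMLParser
from typing import Optional


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        # Collapse runs of whitespace so the title fits on one chat line.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html: str) -> Optional[str]:
    """Return the page title, or None if the document has no usable <title>."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

Usage would be something like `extract_title(page_source)`, with the result announced to the channel when it is not `None`.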
| morkalork wrote:
| So much for the open internet.
| nolist_policy wrote:
| You can thank the spammers.
| derekzhouzhen wrote:
| I feel your pain, but I refuse to cave. Say, 10% of the
| links fail to load, so what? It is their loss, not mine.
| floam wrote:
| It's not necessarily malware. There are services that are
| pretty upfront and pay cash money for residential US
| bandwidth. That said, naive people might be surprised when
| their IP starts getting blocked.
|
| e.g. https://www.honeygain.com/ (something like 100GB = $20).
| peab wrote:
| how does expensive bandwidth equate to dodgy clients? There
| are lots of valid use cases for scraping data, and it's
| legal to scrape publicly available data, even if the websites
| hosting it try to block it (try a curl request to reddit, for
| example)
| patmcc wrote:
| >>>and it's legal to scrape publicly available data, even
| if the websites hosting it try to block it
|
| Is that something that's been fully decided?
| https://en.wikipedia.org/wiki/Craigslist_Inc._v._3Taps_Inc.
| is the most relevant case I'm aware of, and it suggests it
| might actually be illegal (if you know you've been blocked,
| at least).
| suchintan wrote:
| https://techcrunch.com/2024/01/24/court-rules-in-favor-
| of-a-...
|
| This is another interesting example where it was allowed
| morkalork wrote:
| Aren't there also some suspiciously cheap VPNs that do that
| in the background?
| ThePowerOfFuet wrote:
| Yes
| hypeatei wrote:
| Yeah, the author confirmed it in this thread actually:
|
| https://news.ycombinator.com/item?id=41594713
| miohtama wrote:
| Here's more on "free VPNs"
|
| https://www.kaspersky.com/blog/what-is-wrong-with-free-
| vpn-s...
|
| Usually such proxy networks are outright criminal (even if
| users are not).
| dewey wrote:
| There are many reputable residential proxy networks too, and
| usually there's a lot of vetting involved, as they don't
| want people running illegal activities through their network.
|
| It's almost a necessity these days to have access to that due
| to how much datacenter ranges are blocked.
| tux3 wrote:
| Absolutely wild. A normal price for bandwidth before volume
| discounts is 1c/GB, or 10 bucks per TB
| jsheard wrote:
| They're in the business of scraping/botting sites that don't
| want to be scraped/botted, and bandwidth that looks "legit"
| comes at a premium.
| baq wrote:
| I downloaded World of Warcraft the other day, 100GB, took less
| than 3 hours and you can be sure it didn't cost Blizzard $0.05.
| roywiggins wrote:
| Blizzard quite famously used BitTorrent to save bandwidth,
| dunno if they still do:
|
| https://wowpedia.fandom.com/wiki/Blizzard_Downloader
| bdcravens wrote:
| Residential proxy service
|
| https://smartproxy.com/proxies/residential-proxies/pricing
|
| (may not be this service, but this is an example, and the price
| is consistent with their larger commitments)
| tristor wrote:
| I would have liked to see a bit more of 5 Whys here. It seems
| like a consistent lesson that startups have to learn over and
| over is how to manage external dependencies, and particularly the
| dangers of having Google as a dependency. This is new
| Chrom(e|ium) behavior, and it has a real cost, both for this
| company and for users, which may or may not be worth the ROI, but
| this is what happens when you have a large scale external
| dependency: stuff moves without your knowledge, consent, or
| control.
|
| Instead of Always. Be. Closing. it should be Always. Be.
| Mitigating. Dependencies. for startups.
| suchintan wrote:
| This is a great callout.
|
| We had an internal discussion about how to manage dependencies
| effectively, and we made the decision to accept the risk that
| comes with blindly relying on Chrome for now, instead of
| investing heavily in mitigating that risk today.
|
| The main motivator was for us to continue moving fast, and
| accept that we have a few hard dependencies in our business.
|
| The goal is to find product market fit, then allocate time to
| de-risk some of these hard dependencies. If we fail to find
| product market fit, this may not matter at all
| tristor wrote:
| I think that's a fair strategy. Strong PMF generally
| overcomes weak execution; the challenge is that when you have
| hard dependencies on entities like Google or Apple it can
| easily become existential. Even if you choose to move forward
| with this dependency you should establish guard rails within
| your system to ensure you catch potentially impactful shifts
| faster and have a plan for mitigation. For instance, you
| should identify key points of integration and possible
| alternatives even if you choose not to migrate now, so that a
| future migration is better understood and can be discussed
| intelligently in the heat of the moment. Even internal
| documentation can assist as a mitigation for dependency risk.
| suchintan wrote:
| Yeah exactly. One action item from this is that we need to
| add anomaly detection to our proxy usage metrics so we can
| catch this in 15 minutes instead of 6 hours :)
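One way the anomaly detection mentioned above could be sketched: compare each interval's proxy usage against a rolling median of recent intervals and flag anything far above it. This is only an illustration under stated assumptions, not Skyvern's implementation. The class name, the 5x ratio, and the window size are all hypothetical, and how per-interval byte counts are obtained from the proxy provider is not shown.

```python
from collections import deque
from statistics import median


class UsageAnomalyDetector:
    """Flags an interval whose usage is far above the recent median.

    Assumes the caller feeds one number per fixed interval (say, bytes
    used in the last 5 minutes). All thresholds are illustrative.
    """

    def __init__(self, window=12, ratio=5.0, min_samples=4):
        self.samples = deque(maxlen=window)  # recent "normal" intervals
        self.ratio = ratio                   # how many times the median counts as a spike
        self.min_samples = min_samples       # do not alert until a baseline exists

    def observe(self, bytes_used):
        """Record one interval's usage; return True if it looks anomalous."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            baseline = median(self.samples)
            anomalous = baseline > 0 and bytes_used > self.ratio * baseline
        if not anomalous:
            # Keep spikes out of the baseline so a sustained incident
            # does not quietly become the new normal.
            self.samples.append(bytes_used)
        return anomalous
```

With 5-minute intervals, a sudden jump would be flagged on the first interval after it starts, well inside the 15-minute target. A median is used rather than a mean so that one odd interval does not skew the baseline.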
| ang_cire wrote:
| Blocking Google from downloading anything onto your computer
| without consent is always a good idea.
| suchintan wrote:
| We were pretty careful about what we were blocking here -- had
| the exact same concern. Hopefully it doesn't come back to bite
| us in the future (new blogpost incoming?)
| hypeatei wrote:
| Especially if you're using expensive bandwidth from botnets.
| keepamovin wrote:
| you shouldn't be paying by the terabyte. Colocate and just pay
| for the maximum throughput. Far better rates
| skeeter2020 wrote:
| doesn't work when the sites you're scraping block the IPs/range
| of your server. They're using a proxy botnet that costs a
| premium
| bradley13 wrote:
| "We run leverage proxy networks and run headful browser
| instances"
|
| Um...say what? I'm pretty broadly based in IT, and I have no idea
| what that means.
| suchintan wrote:
| Haha, apologies for the language!
|
| We use residential proxy networks when running Skyvern to help
| simulate real human behaviour (because that's what Skyvern is
| trying to do).
|
| We run headful browser instances (meaning a real Chrome
| instance running with a real viewport) for the same reason!
| rustdeveloper wrote:
| You guys should look into some unlimited bandwidth options. I use
| https://scrapingfish.com/unlimited
| suchintan wrote:
| This is really cool! I'll check it out :)
| olliej wrote:
| Honestly given many of these stories, $500 seems to be getting
| off pretty lightly.
|
| It's still absurd to me that many (most?) of these
| hosting/bandwidth providers don't seem to allow automatic cut
| offs and such
| suchintan wrote:
| It definitely could have been much worse. We burned through our
| monthly allocation in 6 hours HAHA, I'm grateful that our
| allocation wasn't something like 10TB
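Where a provider offers no automatic cut-off, a client-side cap is one possible stopgap. The sketch below is a hypothetical guard that a scraper could consult after each proxied response. The names `BandwidthBudget` and `BandwidthBudgetExceeded` are invented for illustration, and the byte counts are assumed to come from the caller's own accounting of response sizes.

```python
class BandwidthBudgetExceeded(RuntimeError):
    """Raised when recorded usage passes the configured cap."""


class BandwidthBudget:
    """Tracks bytes sent through a proxy and refuses to go past a limit."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    @property
    def remaining_bytes(self):
        return max(0, self.limit_bytes - self.used_bytes)

    def record(self, n_bytes):
        """Add one transfer to the running total; raise once over the cap."""
        self.used_bytes += n_bytes
        if self.used_bytes > self.limit_bytes:
            raise BandwidthBudgetExceeded(
                f"used {self.used_bytes} of {self.limit_bytes} bytes"
            )
```

A caller would catch `BandwidthBudgetExceeded` and stop routing traffic through the proxy until someone has looked at the usage. This only bounds what the client itself counts, so it is a complement to provider-side limits, not a replacement.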
| omoikane wrote:
| The discussion linked in the post is from 2022, and the
| corresponding issue has already been fixed:
|
| https://issues.chromium.org/issues/40220332
|
| I wonder if there is a more recent bug related to this?
| meindnoch wrote:
| >200GB of proxy bandwidth
|
| Gigabyte is a measure of information.
|
| Bandwidth is information transmitted over time.
| metadat wrote:
| 200GB is nothing since 2018 when AT&T mass introduced their 1-gig
| symmetric fiber. Any single common gigabit link can run 200GB in
| under half an hour.
|
| On any gig link, over the course of 6 hours you can transmit
| about 2.7TB one way, which is more than 13x more.
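A quick sanity check of the link arithmetic, assuming decimal units (1 GB = 10^9 bytes), a link that sustains exactly 1 Gbit/s, and no protocol overhead. It also works out the average rate implied by moving 200GB in 6 hours, which is the difference between a data volume and a bandwidth. The function names are made up for illustration.

```python
GB = 10**9          # bytes, decimal
TB = 10**12         # bytes, decimal
GIGABIT = 10**9     # bits per second


def transfer_seconds(n_bytes, link_bits_per_s):
    """Time to move n_bytes over a link running at full rate."""
    return n_bytes * 8 / link_bits_per_s


def bytes_in(seconds, link_bits_per_s):
    """Volume a saturated link can move in the given time."""
    return link_bits_per_s * seconds / 8


def average_mbit_per_s(n_bytes, seconds):
    """Average rate implied by moving n_bytes in the given time."""
    return n_bytes * 8 / seconds / 1e6


six_hours = 6 * 3600

minutes_for_200gb = transfer_seconds(200 * GB, GIGABIT) / 60   # about 26.7 minutes
tb_in_six_hours = bytes_in(six_hours, GIGABIT) / TB            # 2.7 TB
avg_rate = average_mbit_per_s(200 * GB, six_hours)             # about 74 Mbit/s
```

Real links carry framing and protocol overhead, so achievable numbers would be somewhat lower than these ideal figures.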
| Johnny555 wrote:
| Too bad AWS didn't get that memo, 200GB would cost $18 there,
| and somehow the company in the original post is paying $500 for
| that bandwidth with whoever their proxy host is.
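The price gap can be worked out from the figures quoted in this thread ($0.09/GB for AWS data transfer, $500 for 200GB of proxy traffic). The sketch below is plain arithmetic; the helper names are hypothetical and the rates are the thread's numbers, not current price lists.

```python
def cost_usd(gigabytes, usd_per_gb):
    """Total cost at a flat per-GB rate."""
    return gigabytes * usd_per_gb


def effective_rate(total_usd, gigabytes):
    """Per-GB rate implied by a total bill."""
    return total_usd / gigabytes


AWS_RATE = 0.09                            # USD per GB, as quoted in the thread

aws_cost = cost_usd(200, AWS_RATE)         # about $18
proxy_rate = effective_rate(500, 200)      # $2.50 per GB
markup = proxy_rate / AWS_RATE             # roughly 28x
```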
| suchintan wrote:
| Haha unfortunately we use residential proxies under the hood
| to simulate real users (as you'd expect from AI agents),
| where bandwidth is significantly more expensive!
| donmcronald wrote:
| How does a residential proxy work? Do people rent out their
| internet connections to commercial services?
| patmcc wrote:
| I'm now expecting we'll see a couple things in the next few
| years:
|
| 1. An explosion of residential proxy networks and other stuff to
| circumvent blocking of cloud IP ranges, for all the various AI
| scraping tools to use.
|
| 2. A corresponding explosion of countermeasures to the above.
| Instead of blocking suspicious IPs, maybe they get a 3GB file on
| their request to /scrape-target.html
| mrtesthah wrote:
| I think that may be against the ToS of most residential ISPs.
| bdcravens wrote:
| Perhaps, but it's already fairly prodigious. Among "ethical"
| providers, it's often bundled as a background service in a
| lot of clickwrap "freeware". (To say nothing of compromised
| computers in a botnet)
| bdcravens wrote:
| Perhaps an explosion of usage. There are already a few very large
| residential proxy networks.
| tim_at_ping wrote:
| Hello,
|
| A (different) proxy company owner here. This sucks! Sorry that
| you lost out on so much bandwidth.
|
| Feel free to reach out to me at tim@pingproxies.com and I'd be
| happy to get you set up on our service and credit you with 100GB
| of free bandwidth to help soften the blow. I'll also be able to
| get you pricing a little better than you're currently on if you
| are interested ;)
|
| Within the next few months we're also releasing a bunch of tools
| to help stop things like this happening on our residential
| network such as some intelligent routing logic, spend controls
| and a few other things.
|
| You may also want to look into Static Residential ISP Proxies -
| we charge these per IP address rather than bandwidth and they
| often end up more economical. We work with carriers like
| Spectrum, Comcast & AT&T directly to get IP addresses on their
| networks so they look like residential connections but host them
| in datacenters - this way you get 99.99%+ availability, 1G+
| throughput, stable IP addresses and have unlimited bandwidth.
|
| @ everyone else in the thread; if you run a start-up and need
| proxies then email me - happy to credit you with 50GB free
| residential bandwidth + give some advice on infra if needed.
|
| Cheers, Tim at Ping
| SteveNuts wrote:
| I'm interested to know how your residential connections are
| sourced.
|
| It says they're "ethically sourced", but it seems like
| malware/botnet like behavior.
|
| Are these residential users aware their traffic is siphoned off
| for this purpose?
| chimen wrote:
| They are never ethically sourced. Ethically for them means
| placing a phrase in a 10k word TOS when victims install app
| X, game y which loads their sdk. Ethically here means "we
| warned them in a TOS"
___________________________________________________________________
(page generated 2024-09-19 23:00 UTC)