[HN Gopher] We accidentally burned through 200GB of proxy bandwi...
       ___________________________________________________________________
        
       We accidentally burned through 200GB of proxy bandwidth in 6 hours
        
       Author : suchintan
       Score  : 49 points
       Date   : 2024-09-19 16:15 UTC (6 hours ago)
        
 (HTM) web link (blog.skyvern.com)
 (TXT) w3m dump (blog.skyvern.com)
        
       | 8organicbits wrote:
       | What infrastructure is this using? Bandwidth seems pretty pricy
        
         | tobyjsullivan wrote:
         | No kidding. AWS's notoriously expensive data transfer is only
         | $0.09/GB. Who's charging $2.50/GB? Are they running on a
         | cellular SIM with no data plan?
        
           | mikeocool wrote:
           | Sounds like they are running a web scraping business -- so
           | maybe? Using a cellular connection would be one way to help
           | not get immediately captcha-ed by every site using cloudflare.
        
             | blitzar wrote:
             | They should really set up their scraper and (exfil the data)
             | via regular connections.
        
           | ronsor wrote:
           | Residential rotating proxy providers charge very high rates
           | for data, on the order of $1 - $10 per GB. (These providers
           | often do run their proxies through the cellular network,
           | actually.)
        
             | SteveNuts wrote:
             | Is this something where end users can get paid for doing
             | nothing other than proxying some traffic through their ISP?
        
               | ronsor wrote:
               | That's probably where some of the proxies come from.
        
               | r1ch wrote:
               | The end user typically has their device compromised by
               | using free apps where the developers were bribed $$$ to
               | add the proxy "SDK". The botnet operator then rents out
               | the bandwidth at exorbitant rates to anyone who will pay
               | for it.
               | 
               | Chrome extensions are also a huge source of this, they
               | look for extensions with a large install base and then
               | make an offer to buy it to turn all the users into
               | proxies.
        
               | slt2021 wrote:
               | end users install shady VPN apps/extensions to watch
               | pirated content, and become part of residential proxy
               | mesh/botnet
        
               | ipython wrote:
               | If by "some" traffic you mean botnets, sneaker and ticket
               | scalpers, scammers, content scrapers, credential stuffers
               | ... generally scummy stuff, sure.
               | 
               | Based on this blog post I would not do any business with
               | Skyvern, if they indeed do business with this underworld
               | of bottom feeders.
        
               | mrguyorama wrote:
               | Sure, if you want a whole bunch of legitimately malicious
               | traffic to be attributed to your internet account.
        
               | floam wrote:
               | Yes. Google "honeygain"
        
       | perks_12 wrote:
       | 200GB for $500? What cloud is this?
        
         | hooverd wrote:
         | api.skyvern.com is a CNAME to an EC2 ALB, but even using a NAT
         | Gateway ($$$) I can't make more than $1/GB add up.
        
         | intunderflow wrote:
         | Looks like webshare from the screenshot
        
         | jsnell wrote:
         | I don't think it's a cloud. It's more likely a residential
         | proxy network, which are typically created by installing
         | malware on users' machines.
         | 
         | The operators of these proxy networks want to avoid detection
         | by both the users whose bandwidth they're stealing, and by the
         | companies whose data is being scraped. So they want to make the
         | bandwidth very expensive. And that expensive bandwidth in turn
         | means that their only clients are dodgy as well. Either people
         | looking to scrape data without consent and monetize it, or
         | outright criminals.
        
           | bscphil wrote:
           | It's kind of surprising that a presumptively legitimate
           | company (and YC-funded startup) would out themselves as
           | buying black market residential proxy bandwidth, isn't it?
        
             | jsheard wrote:
             | Their frontpage also advertises the ability to pass
             | CAPTCHAs, whether by automation or more likely by
             | delegating them to third-world CAPTCHA farms. If that's a
             | major selling point for your automation service then your
             | target market probably ranges from dubious (e.g. data
             | scrapers trying to get around limits) to extremely dubious
             | (e.g. ticket scalpers, spammers, click fraud, etc).
        
               | xp84 wrote:
               | Just because something can be used for sketchy purposes
               | doesn't mean that's the only purpose of it. there are
               | thousands of situations where people are forced to
               | interact with a shitty website 100x per day and the site
               | won't provide an api. Imagine if your job was booking
               | plane tickets all day. United could provide you an API
               | key to do so via an API, but in practice they won't, only
               | some enterprisey travel software company can get that
               | kind of access, for a steep fee. You could build a tool
               | which automatically puts together an itinerary based on
               | rules and books it, through a tool like this. Perhaps a
               | slightly contrived example but I believe things like this
               | definitely happen.
        
               | dontlikeyoueith wrote:
               | > United could provide you an API key to do so via an
               | API, but in practice they won't, only some enterprisey
               | travel software company can get that kind of access, for
               | a steep fee. You could build a tool which automatically
               | puts together an itinerary based on rules and books it,
               | through a tool like this. Perhaps a slightly contrived
               | example but I believe things like this definitely happen.
               | 
               | And you think that's NOT sketchy?
               | 
               | I'm almost afraid to ask where you think the bar is...
        
               | stickfigure wrote:
               | It's exactly as sketchy as having a hypothetical robot
               | sit down at a console and type it out. Which, IMO, is not
               | very sketchy at all.
        
               | rty32 wrote:
               | Imagine a legitimate travel agency cannot book 100 United
               | tickets a day via methods outlined in business contracts
               | and need to resort to shady practice.
               | 
               | Dude, please provide some real solid evidence to back
               | this up, and perhaps come up with another realistic
               | scenario where bypassing captcha is justified.
        
               | xp84 wrote:
               | > Imagine a legitimate travel agency cannot book 100
               | United tickets a day
               | 
               | That's the whole point, I never said travel agency, I was
               | thinking a company with travelling consultants.
               | 
               | How TF is it "shady" to purchase and use airfare?
               | 
               | And again, bypassing captcha, say, to purchase tickets
               | isn't evil either, if you are purchasing them for use and
               | not for resale. It would just allow a person to book
               | tickets for 50 people without wasting 6 hours to complete
               | 25 CAPTCHAS and type in my information 25 times.
               | 
               | CAPTCHA is a blunt instrument deployed in an attempt to
               | mitigate _abuse,_ but it has a massive bad side effect
               | that for every heavy user (not just evil users), it
               | requires a human butt to be in a seat somewhere to do
               | mindless busywork that could otherwise be automated.
               | Working around that (sounds like OP agrees to do so on a
               | case by case basis) is not inherently evil. It's as evil
               | (or benign) as whatever you're using it for.
        
               | suchintan wrote:
               | Agreed. Just for reference, one of our most popular use-
               | cases is automating data entry into CRMs without APIs...
               | No one wants to be doing this stuff manually, and
               | automating it has some serious positive QoL impact
               | 
               | We get a lot of requests for bad usage (ie spinning up
               | upvote rings on Reddit) but we don't want to support
               | things like that
        
             | mrguyorama wrote:
             | How long have you been here? It's not surprising at all. HN
             | and YC have not demonstrated an aversion to "uh, greyhat"
             | activity.
             | 
             | If it were 2000, people would be sharing their ad clicking
             | startups.
             | 
             | YC has funded a looooooot of sketchy companies.
        
             | dewey wrote:
             | Residential proxies are not necessarily "black market".
        
               | asmor wrote:
               | It's almost never done with the full understanding of the
               | person providing the proxy, doesn't matter if they get
               | promised some change, their browser addons betray them or
               | they install bundleware/adware.
               | 
               | I'd say it has about the same moral standing as a payday
               | loan.
        
               | dewey wrote:
               | There are other ways, for example through mislabeled
               | "residential" blocks, or "residential" proxies that are
               | sold by ISPs to vendors.
        
           | iforgotpassword wrote:
           | I use one. I run a bot on IRC that extracts the <title> of
           | every link posted (or downloads the image/whatever and
           | extracts Metadata) and announces that to the channel. It has
           | become more and more pointless to run this on a vps.
           | Google/YouTube block the IP range, a lot of websites return
           | the cloudflare security check, Amazon works on some days and
           | doesn't on others... Ever since I proxy via residential
           | proxies it just works. I'm a smooth criminal. :>
        
             | morkalork wrote:
             | So much for the open internet.
        
               | nolist_policy wrote:
               | You can thank the spammers.
        
             | derekzhouzhen wrote:
             | I feel your pain, but I refuse to cave. Say, 10% of the
             | links fail to load, so what? It is their loss, not mine.
        
           | floam wrote:
           | It's not necessarily malware. There are services that are
           | pretty upfront and pay cash money for residential US
           | bandwidth. That said, naive people might be surprised when
           | their IP starts getting blocked.
           | 
           | e.g. https://www.honeygain.com/ (something like 100GB = $20).
        
           | peab wrote:
           | how does expensive bandwidth equate to dodgy clients? There
           | are lots of valid use cases for scraping data, and it's
           | legal to scrape publicly available data, even if the websites
           | hosting it try to block it (try a curl request to reddit, for
           | example)
        
             | patmcc wrote:
             | >>>and it's legal to scrape publicly available data, even
             | if the websites hosting it try to block it
             | 
             | Is that something that's been fully decided?
             | https://en.wikipedia.org/wiki/Craigslist_Inc._v._3Taps_Inc.
             | is the most relevant case I'm aware of, and it suggests it
             | might actually be illegal (if you know you've been blocked,
             | at least).
        
             | suchintan wrote:
             | https://techcrunch.com/2024/01/24/court-rules-in-favor-
             | of-a-...
             | 
             | This is another interesting example where it was allowed
        
           | morkalork wrote:
           | Aren't there also some suspiciously cheap VPNs that do that
           | in the background?
        
             | ThePowerOfFuet wrote:
             | Yes
        
           | hypeatei wrote:
           | Yeah, the author confirmed it in this thread actually:
           | 
           | https://news.ycombinator.com/item?id=41594713
        
           | miohtama wrote:
           | Here more on "free VPNs"
           | 
           | https://www.kaspersky.com/blog/what-is-wrong-with-free-
           | vpn-s...
           | 
           | Usually such proxy networks are outright criminal (even if
           | users are not).
        
           | dewey wrote:
           | There's many reputable residential proxy networks too,
           | usually there's a lot of vetting involved too as they don't
           | want people running illegal activities through their network.
           | 
           | It's almost a necessity these days to have access to that due
           | to how much datacenter ranges are blocked.
        
         | tux3 wrote:
         | Absolutely wild. A normal price for bandwidth before volume
         | discounts is 1c/GB, or 10 bucks per TB
        
           | jsheard wrote:
           | They're in the business of scraping/botting sites that don't
           | want to be scraped/botted, and bandwidth that looks "legit"
           | comes at a premium.
        
         | baq wrote:
         | I downloaded world of Warcraft the other day, 100GB, took less
         | than 3 hours and you can be sure it didn't cost blizzard $0.05.
        
           | roywiggins wrote:
           | Blizzard quite famously used BitTorrent to save bandwidth,
           | dunno if they still do:
           | 
           | https://wowpedia.fandom.com/wiki/Blizzard_Downloader
        
         | bdcravens wrote:
         | Residential proxy service
         | 
         | https://smartproxy.com/proxies/residential-proxies/pricing
         | 
         | (may not be this service, but this is an example, and the price
         | is consistent with their larger commitments)
        
       | tristor wrote:
       | I would have liked to see a bit more of 5 Whys here. It seems
       | like a consistent lesson that startups have to learn over and
       | over is how to manage external dependencies, and particularly the
       | dangers of having Google as a dependency. This is new
       | Chrom(e|ium) behavior, and it has a real cost, both for this
       | company and for users, which may or may not be worth the ROI, but
       | this is what happens when you have a large scale external
       | dependency: stuff moves without your knowledge, consent, or
       | control.
       | 
       | Instead of Always. Be. Closing. it should be Always. Be.
       | Mitigating. Dependencies. for startups.
        
         | suchintan wrote:
         | This is a great callout.
         | 
         | We had an internal discussion about how to manage dependencies
         | effectively, and we made the decision to accept the risk that
         | comes with blindly relying on Chrome for now, instead of
         | investing heavily in mitigating that risk today.
         | 
         | The main motivator was for us to continue moving fast, and
         | accept that we have a few hard dependencies in our business.
         | 
         | The goal is to find product market fit, then allocate time to
         | de-risk some of these hard dependencies. If we fail to find
         | product market fit, this may not matter at all
        
           | tristor wrote:
           | I think that's a fair strategy. Strong PMF generally
           | overcomes weak execution, the challenge is that when you have
           | hard dependencies on entities like Google or Apple it can
           | easily become existential. Even if you choose to move forward
           | with this dependency you should establish guard rails within
           | your system to ensure you catch shifts faster that may be
           | impactful and have a plan for mitigation. For instance, you
           | should identify key points of integration and possible
           | alternatives even if you choose not to migrate now, so that a
           | future migration is better understood and can be discussed
           | intelligently in the heat of the moment. Even internal
           | documentation can assist as a mitigation for dependency risk.
        
             | suchintan wrote:
             | Yeah exactly. One action item from this is that we need to
             | add anomaly detection to our proxy usage metrics so we can
             | catch this in 15 minutes instead of 6 hours :)
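
One way the anomaly detection mentioned above could look, as a minimal sketch only: compare each interval's proxy usage against the median of recent intervals and flag a large jump. The class name, the thresholds, and the assumption that usage is sampled as bytes per fixed interval are all hypothetical and not taken from the post.

```python
from collections import deque
from statistics import median


class UsageAnomalyDetector:
    """Flags an interval whose usage far exceeds the recent median.

    Hypothetical helper: the window size, factor, and floor are
    illustrative defaults, not values from the original post.
    """

    def __init__(self, history=12, factor=5.0, min_baseline_bytes=50_000_000):
        self.samples = deque(maxlen=history)   # recent normal intervals
        self.factor = factor                   # how large a jump counts as anomalous
        self.min_baseline_bytes = min_baseline_bytes  # floor to avoid noisy alerts

    def observe(self, bytes_used):
        """Return True if this interval's usage looks anomalous."""
        anomalous = False
        # Only judge once a full window of history has been collected.
        if len(self.samples) == self.samples.maxlen:
            baseline = max(median(self.samples), self.min_baseline_bytes)
            anomalous = bytes_used > self.factor * baseline
        # Keep anomalous intervals out of the baseline so a long spike
        # does not become the new normal.
        if not anomalous:
            self.samples.append(bytes_used)
        return anomalous
```

With usage sampled every 5 minutes, a sustained spike would be flagged on the first bad interval after the window fills, which is well inside a 15 minute target. A median is used rather than a mean so that one earlier burst does not inflate the baseline.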
        
       | ang_cire wrote:
       | Blocking Google from downloading anything onto your computer
       | without consent is always a good idea.
        
         | suchintan wrote:
         | We were pretty careful about what we were blocking here -- had
         | the exact same concern. Hopefully it doesn't come back to bite
         | us in the future (new blogpost incoming?)
        
         | hypeatei wrote:
         | Especially if you're using expensive bandwidth from botnets.
        
       | keepamovin wrote:
       | you shouldn't be paying by the terabyte. Colocate and just pay
       | for the maximum throughput. Far better rates
        
         | skeeter2020 wrote:
         | doesn't work when the sites you're scraping block the IPs/range
         | of your server. They're using a proxy botnet that costs a
         | premium
        
       | bradley13 wrote:
       | "We run leverage proxy networks and run headful browser
       | instances"
       | 
       | Um...say what? I'm pretty broadly based in IT, and I have no idea
       | what that means.
        
         | suchintan wrote:
         | Haha, apologies for the language!
         | 
         | We use residential proxy networks when running Skyvern to help
         | simulate real human behaviour (because that's what Skyvern is
         | trying to do).
         | 
         | We run headful browser instances (meaning a real chrome
         | instance running with a real viewport) for the same cause!
        
       | rustdeveloper wrote:
       | You guys should look into some unlimited bandwidth options. I use
       | https://scrapingfish.com/unlimited
        
         | suchintan wrote:
         | This is really cool! I'll check it out :)
        
       | olliej wrote:
       | Honestly given many of these stories, $500 seems to be getting
       | off pretty lightly.
       | 
       | It's still absurd to me that many (most?) of these
       | hosting/bandwidth providers don't seem to allow automatic cut
       | offs and such
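
Where a provider offers no automatic cut-off, a client-side cap is one possible stopgap. The sketch below is illustrative only: `BandwidthBudget` is a hypothetical name, and it assumes the caller can measure the bytes each proxied request transferred.

```python
class BandwidthBudget:
    """Tracks proxied bytes against a hard cap (hypothetical helper)."""

    def __init__(self, cap_bytes):
        self.cap_bytes = cap_bytes
        self.used_bytes = 0

    def record(self, n_bytes):
        """Add the bytes transferred by one completed request."""
        self.used_bytes += n_bytes

    def allow_request(self):
        """Return False once the cap has been reached."""
        return self.used_bytes < self.cap_bytes


# Example: a 10 GB cap (decimal gigabytes), checked before each request.
budget = BandwidthBudget(cap_bytes=10 * 10**9)
if budget.allow_request():
    # ... send the request through the proxy, then record its size ...
    budget.record(2_500_000)
```

The cap is checked before a request is sent, so a single large response can still overshoot it; a stricter version would also abort transfers in flight.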
        
         | suchintan wrote:
         | It definitely could have been much worse. We burned through our
         | monthly allocation in 6 hours HAHA, I'm grateful that our
         | allocation wasn't something like 10TB
        
       | omoikane wrote:
       | The discussion linked in the post is from 2022, and the
       | corresponding issue has already been fixed:
       | 
       | https://issues.chromium.org/issues/40220332
       | 
       | I wonder if there is a more recent bug related to this?
        
       | meindnoch wrote:
       | >200GB of proxy bandwidth
       | 
       | Gigabyte is a measure of information.
       | 
       | Bandwidth is information transmitted over time.
        
       | metadat wrote:
       | 200GB is nothing since 2018 when AT&T mass introduced their 1-gig
       | symmetric fiber. Any single common gigabit link can move 200GB in
       | under 30 minutes.
       | 
       | On any gig link, over the course of 6 hours you can transmit about
       | 2.7TB one way, which is roughly 13x more.
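
A quick check of the gigabit-link arithmetic, as a sketch assuming an ideal 1 Gbit/s line rate, decimal units (1 GB = 8 Gbit), and no protocol overhead. The function names are illustrative.

```python
def transfer_seconds(size_gb, link_gbps):
    """Ideal seconds to move size_gb gigabytes over a link_gbps link."""
    return size_gb * 8 / link_gbps


def transferable_gb(hours, link_gbps):
    """Ideal gigabytes a link_gbps link can move in the given hours."""
    return link_gbps / 8 * hours * 3600


minutes_for_200gb = transfer_seconds(200, 1.0) / 60   # roughly 27 minutes
gb_in_six_hours = transferable_gb(6, 1.0)             # 2700 GB, i.e. 2.7 TB
ratio = gb_in_six_hours / 200                         # 13.5x the 200 GB burned
```

Real links carry framing and TCP overhead, so achievable figures are somewhat lower than these ideal ones.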
        
         | Johnny555 wrote:
         | Too bad AWS didn't get that memo, 200GB would cost $18 there,
         | and somehow the company in the original post is paying $500 for
         | that bandwidth with whoever their proxy host is.
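
The price gap discussed in this thread can be worked through directly. This sketch uses only figures quoted above ($0.09/GB for AWS data transfer, $500 for 200GB on the proxy); the variable names are illustrative.

```python
def cost_usd(gigabytes, usd_per_gb):
    """Cost of transferring the given volume at a flat per-GB rate."""
    return gigabytes * usd_per_gb


aws_cost = cost_usd(200, 0.09)       # about $18 at the quoted AWS rate
proxy_rate = 500 / 200               # $2.50 per GB on the proxy
markup = proxy_rate / 0.09           # roughly 28x the AWS rate
```

The effective proxy rate of $2.50/GB falls inside the $1 - $10 per GB range quoted earlier in the thread for rotating residential proxies.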
        
           | suchintan wrote:
           | Haha unfortunately we use residential proxies under the hood
           | to simulate real users (as you'd expect from AI agents),
           | where bandwidth is significantly more expensive!
        
             | donmcronald wrote:
             | How does a residential proxy work? Do people rent out their
             | internet connections to commercial services?
        
       | patmcc wrote:
       | I'm now expecting we'll see a couple things in the next few
       | years:
       | 
       | 1. An explosion of residential proxy networks and other stuff to
       | circumvent blocking of cloud IP ranges, for all the various AI
       | scraping tools to use.
       | 
       | 2. A corresponding explosion of countermeasures to the above.
       | Instead of blocking suspicious IPs, maybe they get a 3GB file on
       | their request to /scrape-target.html
        
         | mrtesthah wrote:
         | I think that may be against the ToS of most residential ISPs.
        
           | bdcravens wrote:
           | Perhaps, but it's already fairly prodigious. Among "ethical"
           | providers, it's often bundled as a background service in a
           | lot of clickwrap "freeware". (To say nothing of compromised
           | computers in a botnet)
        
         | bdcravens wrote:
         | Perhaps an explosion of usage. There's already a few very large
         | residential proxy networks.
        
       | tim_at_ping wrote:
       | Hello,
       | 
       | A (different) proxy company owner here. This sucks! Sorry that
       | you lost out on so much bandwidth.
       | 
       | Feel free to reach out to me at tim@pingproxies.com and I'd be
       | happy to get you set up on our service and credit you with 100GB
       | of free bandwidth to help soften the blow. I'll also be able to
       | get you pricing a little better than you're currently on if you
       | are interested ;)
       | 
       | Within the next few months we're also releasing a bunch of tools
       | to help stop things like this happening on our residential
       | network such as some intelligent routing logic, spend controls
       | and a few other things.
       | 
       | You may also want to look into Static Residential ISP Proxies -
       | we charge these per IP address rather than bandwidth and they
       | often end up more economical. We work with carriers like
       | Spectrum, Comcast & AT&T directly to get IP addresses on their
       | networks so they look like residential connections but host them
       | in datacenters - this way you get 99.99%+ availability, 1G+
       | throughput, stable IP addresses and have unlimited bandwidth.
       | 
       | @ everyone else in the thread; if you run a start-up and need
       | proxies then email me - happy to credit you with 50GB free
       | residential bandwidth + give some advice on infra if needed.
       | 
       | Cheers, Tim at Ping
        
         | SteveNuts wrote:
         | I'm interested to know how your residential connections are
         | sourced.
         | 
         | It says they're "ethically sourced", but it seems like
         | malware/botnet like behavior.
         | 
         | Are these residential users aware their traffic is siphoned off
         | for this purpose?
        
           | chimen wrote:
           | They are never ethically sourced. Ethically for them means
           | placing a phrase in a 10k word TOS when a victim installs app
           | X, game y which loads their sdk. Ethically here means "we
           | warned them in a TOS"
        
       ___________________________________________________________________
       (page generated 2024-09-19 23:00 UTC)