[HN Gopher] We survived 10k requests/second: Switching to signed...
___________________________________________________________________
We survived 10k requests/second: Switching to signed asset URLs in
an emergency
Author : dyogenez
Score : 63 points
Date : 2024-08-15 20:36 UTC (2 hours ago)
(HTM) web link (hardcover.app)
(TXT) w3m dump (hardcover.app)
| dyogenez wrote:
| Earlier this week someone started hitting our Google Cloud
| Storage bucket with 10k requests a second... for 7 hours. I
| realized this while working from a coffee shop and spent the rest
| of the day putting in place a fix.
|
| This post goes over what happened, how we put a solution in
| place within hours, and how we landed on the route we took.
|
| I'm curious to hear how others have solved this same problem -
| generating authenticated URLs when you have a public API.
| tayo42 wrote:
| > I'm curious to hear how others have solved this same problem
|
| I think this is an interesting question to ask, because I often
| have problems where I'm almost certain they've been solved
| before - it's just that people don't bother to write about it.
| Where can people congregate to discuss questions like this?
| dyogenez wrote:
| Hopefully here. Sometimes the best way to get people to
| respond is to be wrong. I'm sure I've done a bunch of things
| wrong.
| wordofx wrote:
| > I'm curious to hear how others have solved this same problem
|
| Not use Google to start with. And not make S3 buckets public.
| Must be accessed via CloudFront or CF Signed URLs. Making stuff
| public is dumb.
| wrs wrote:
| It sounds like you had public _list_ access to your bucket,
| which is always bad. However, you can prevent list access, but
| keep _read_ access to individual objects public. As long as
| your object names are unguessable (say, a 16-byte random
| number), you won't have the problem you had.
|
| I haven't used Rails since they integrated storage, but gems
| like Paperclip used to do this for you by hashing the image
| parameters with a secret seed to generate the object name.
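|
| For illustration, a minimal sketch of that idea (the names here
| are made up, not Paperclip's actual API): derive the object key
| from the image parameters plus a server-side secret, so keys are
| deterministic but unguessable.
|
|     require "openssl"
|
|     # Server-side secret seed; never exposed to clients.
|     SECRET = ENV.fetch("OBJECT_NAME_SECRET")
|
|     # Deterministic, unguessable object key for an image variant.
|     def object_key_for(book_id, variant)
|       digest = OpenSSL::HMAC.hexdigest("SHA256", SECRET,
|                                        "#{book_id}/#{variant}")
|       "images/#{digest}.jpg"
|     end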
|
| Using signed URLs is solving a different problem: making people
| hit your API at least once a day to get a working GCS URL for
| the image. It's not clear that's an actual problem, though - if
| people want to enumerate your API (as opposed to your bucket),
| they can do that with the new system too.
|
| That aside, I'm confused about the 250ms thing. You don't have
| to hit a Google API to construct a signed URL. It should just
| be a signature calculation done locally on your server. [0]
|
| https://cloud.google.com/storage/docs/access-control/signing...
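|
| For reference, a minimal sketch of that local signing with the
| google-cloud-storage gem (project, bucket, and object names are
| placeholders); the V4 signature is computed from the service
| account key with no call out to Google:
|
|     require "google/cloud/storage"
|
|     storage = Google::Cloud::Storage.new(project_id: "my-project")
|     bucket  = storage.bucket("my-assets", skip_lookup: true)
|
|     # Signed locally from the service account key; no network
|     # round trip to Google is involved.
|     url = bucket.signed_url("covers/abc123.jpg",
|                             method:  "GET",
|                             expires: 3600,   # seconds
|                             version: :v4)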
| EGreg wrote:
| We've designed our system for this very use case. Whether it's on
| commodity hardware or in the cloud, whether or not it's using a
| CDN and edge servers, there are ways to "nip things in the bud",
| as it were, by rejecting requests without a proper signed
| payload.
|
| For example, the value of session ID cookies should actually be
| signed with an HMAC, and checked at the edge by the CDN. Session
| cookies that represent an authenticated session should also look
| different from unauthenticated ones. The checks should all happen
| at the edge, at your reverse proxy, without doing any I/O or
| calling your "fastcgi" process manager.
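|
| As a rough sketch of that cookie check (the names and format are
| illustrative, not a specific framework's scheme): the cookie
| value carries its own HMAC, so an edge proxy can reject forgeries
| with a pure CPU check and no I/O.
|
|     require "openssl"
|
|     COOKIE_SECRET = ENV.fetch("SESSION_COOKIE_SECRET")
|
|     def signed_cookie_value(session_id)
|       mac = OpenSSL::HMAC.hexdigest("SHA256", COOKIE_SECRET,
|                                     session_id)
|       "#{session_id}--#{mac}"
|     end
|
|     # The same recompute-and-compare can run at the edge. Use a
|     # constant-time comparison in production.
|     def valid_cookie?(value)
|       session_id, mac = value.split("--", 2)
|       return false if session_id.nil? || mac.nil?
|       mac == OpenSSL::HMAC.hexdigest("SHA256", COOKIE_SECRET,
|                                      session_id)
|     end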
|
| But let's get to the juicy part... hosting files. Ideally, you
| shouldn't have "secret URLs" for files, because then they can be
| shared and even (gasp) hotlinked from websites. Instead, you
| should use features like X-Accel-Redirect in NGINX to let your
| app server determine access to these gated resources. Apache has
| similar things.
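|
| A small sketch of how that looks from a Rails controller (the
| model and paths are hypothetical; the matching NGINX location
| must be marked `internal`): the app does the access check, then
| hands the actual byte-serving back to NGINX.
|
|     class DownloadsController < ApplicationController
|       def show
|         # Access check happens in the app...
|         file = current_user.files.find(params[:id])
|
|         # ...then NGINX serves the file from an `internal`
|         # location that clients cannot request directly.
|         response.headers["X-Accel-Redirect"] =
|           "/protected/#{file.key}"
|         response.headers["Content-Type"] = file.content_type
|         head :ok
|       end
|     end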
|
| Anyway, here is a write-up which goes into much more detail:
| https://community.qbix.com/t/files-and-storage/286
| dyogenez wrote:
| Ohh, using the session ID in the URL would be a nice addition
| to this. We already generate session tokens for every user -
| guests and logged in users. We could pass that through and
| segment on it rather than on IP address.
| Waterluvian wrote:
| Do any cloud providers have a sensible default or easy-to-enable
| mode for "you literally cannot spend one penny until you set
| specific quotas/limits for each resource you're allocating"?
| paxys wrote:
| No, because surprise runaway costs are their entire business
| model.
| ksnsnsj wrote:
| Not really, because those clients will be unhappy and cause
| trouble.
|
| They like the clients which expand slowly.
|
| So they want to avoid a customer accidentally going from $100 to
| $100k in a month, while still being able to go from $1k to $100k
| in a year.
| the8472 wrote:
| The dreaded C10k problem, remaining unsolved to this day.
| ksnsnsj wrote:
| Unlike the original c10k problem, serving that number of
| connections has now morphed from a technical problem into an
| economic one.
| the8472 wrote:
| [delayed]
| flockonus wrote:
| Have you considered putting cloudflare or similar CDN with
| unlimited egress in front of your bucket?
|
| Reading your blog post, I don't fully get how the current signing
| implementation can halt massive downloads - wouldn't the
| "attacker"(?) just adapt their methods to fetch the signed URLs
| first and then download whatever they're after anyway?
| paxys wrote:
| Yup. The only mitigation here is that there is a limit to how
| many _different_ asset URLs they will be able to generate, but
| if they want to be malicious they can download the same file
| over and over again and still make you rack up a huge bill.
| ezekg wrote:
| Honestly, I would just move to R2 and save on egress fees even
| without the CDN. Runaway egress bills are no fun.
|
| I saved myself thousands of $/mo by moving to R2.
| ksnsnsj wrote:
| What is R2?
| ezekg wrote:
| Cloudflare's S3-compatible offering with zero egress fees:
| https://www.cloudflare.com/developer-platform/r2/
| l5870uoo9y wrote:
| 10k req/s could potentially crash the Ruby proxy server, halting
| image serving.
|
| Cloudflare is the way to go. I generally serve heavy files,
| e.g. videos, from a Cloudflare bucket to avoid expensive bills
| from the primary host.
| jiripospisil wrote:
| You cannot just put Cloudflare in front of your Google-hosted
| bucket - that's against CF's terms of service. To do that you
| would have to also host the content itself on Cloudflare
| R2/Images etc. There also used to be an HTML-only restriction,
| but that's no longer the case.
|
| > Next, we got rid of the antiquated HTML vs. non-HTML
| construct, which was far too broad. Finally, we made it clear
| that customers can serve video and other large files using the
| CDN so long as that content is hosted by a Cloudflare service
| like Stream, Images, or R2.
|
| https://blog.cloudflare.com/updated-tos/
| JohnMakin wrote:
| This is absolutely nuts to me and would immediately rule out
| ever hosting anything on google storage for me
| krzys wrote:
| It's Cloudflare's ToS, which prohibits usage not directly
| related to hosting web content.
| ghayes wrote:
| Where is this against the GCP or CloudFlare's TOS?
| voxic11 wrote:
| Lots of people do this, so you definitely _can_ do it even if
| it's against CF's terms of service - which is something I can't
| find evidence of anyway.
| jiripospisil wrote:
| > Cloudflare reserves the right to disable or limit your
| access to or use of the CDN, or to limit your End Users'
| access to certain of your resources through the CDN, if you
| use or are suspected of using the CDN without such Paid
| Services to serve video or a disproportionate percentage of
| pictures, audio files, or other large files.
|
| If you're putting the CDN in front of a bucket with nothing
| but images, you're automatically in breach.
|
| https://www.cloudflare.com/service-specific-terms-
| applicatio...
| KomoD wrote:
| You totally can, just not a "disproportionate percentage"
| jiripospisil wrote:
| That's when you're caching a whole page which contains
| images. The OP is talking about putting the CDN in front of
| a bucket which doesn't serve anything but images (= 100%).
| dyogenez wrote:
| Putting a CDN in front would prevent this at the bucket level,
| but then someone could still hit the CDN at 10k
| requests/second. We could rate limit it there though, which
| would be nice.
|
| The downside is that people already have the direct URLs for the
| existing bucket, so we'd need to change those either way.
|
| The reason the attacker couldn't just hit the API to get the
| signed URLs is the rate limiting I go over in the post, using the
| rack-attack Ruby gem. Since that's limited to 60/second, that's
| more like 43k images/day max.
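|
| For reference, a throttle along those lines with rack-attack
| might look like the sketch below (the path and numbers are
| illustrative, not necessarily Hardcover's exact config):
|
|     # config/initializers/rack_attack.rb
|     class Rack::Attack
|       # Cap signed-URL requests per client IP.
|       throttle("images/ip", limit: 60, period: 1.second) do |req|
|         req.ip if req.path.start_with?("/images")
|       end
|     end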
| paxys wrote:
| Quick feedback - you've used the term "signed URL" over 50 times
| in the post without once explaining what it is or how it works.
| shortrounddev2 wrote:
| Rather than allowing any object on a bucket to be downloaded by
| its raw URL (i.e: http://mycdn.io/abcdefg.jpeg), the backend
| service needs to generate a "signed" url, which is a short
| lived URL that grants the user a single request against that
| resources (GET, POST, PUT, etc.) (i.e:
| http://mycdn.io/abcdefg.jpeg?signed={securerandomstring}) So
| you can only use the URL to download it once, and you need to
| go through the backend API to generate the presigned URL. This
| could result in your backend getting hammered but you can also
| use DDOS protection to prevent 10k requests a second from going
| through your backend
|
| They're also a good way to allow users to upload images to your
| CDN without having to actually upload that data to your web API
| backend; you just give the user a presigned PUT request URL and
| they get a one-time ticket to upload to your bucket
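|
| A sketch of that upload flow with the google-cloud-storage gem
| (bucket and key names are placeholders): the server mints a
| short-lived PUT URL and the client sends the bytes straight to
| the bucket.
|
|     require "google/cloud/storage"
|     require "securerandom"
|
|     storage = Google::Cloud::Storage.new(project_id: "my-project")
|     bucket  = storage.bucket("user-uploads", skip_lookup: true)
|
|     # Short-lived URL the client can PUT the file body to.
|     key = "avatars/#{SecureRandom.hex(16)}.jpg"
|     upload_url = bucket.signed_url(key,
|                                    method:  "PUT",
|                                    expires: 600,   # 10 minutes
|                                    version: :v4)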
| taeric wrote:
| Worth calling out that the big benefit is you basically lean
| on the service provider for streaming the data, without
| having to form a trust relationship between them and the
| receiver of the data.
|
| That is, the entire point is to not put more compute between
| the requester and the data. The absolute worst place to be
| would be having your own compute stream the data from the
| provider just so it can stream it on to the end user.
|
| Right?
| ddorian43 wrote:
| It's not single-use - it's valid until a TTL expires.
| telotortium wrote:
| Until the author fixes the post, this is what they're talking
| about: https://cloud.google.com/storage/docs/access-
| control/signed-.... Essentially, it ensures that a URL is
| invalid unless the server signs it with a secret key controlled
| by the server, which means that clients can't access your
| assets just by guessing the URL. In addition to signing the
| URL, the signature can contain metadata such as permissions and
| expiration time.
| dyogenez wrote:
| Ohh good catch. Just updated the post with a section mentioning
| what signed URLs are before jumping into the solution.
| andrewstuart wrote:
| I'm always surprised to read how much money companies are willing
| to spend on things that can be done for essentially nothing.
|
| I had a look at the site - why does this need to run on a major
| cloud provider at all? Why use VERY expensive cloud storage at 9
| cents per gigabyte? Why use very expensive image conversion at
| $50/month when you can run sharp on a Linux server?
|
| I shouldn't be surprised - the world is all in on very expensive
| cloud computing.
|
| There's another way though, assuming you are running something
| fairly "normal" (whatever that means) - run your own Linux
| servers and serve data from those machines. Use Cloudflare R2 to
| serve your files - it's free. You probably don't need most of
| your fancy architecture - run a fast server on Ionos or Hetzner
| or something and stop angsting about budget alerts from Google
| for things that should be free and running on your own computers
| - simple, straightforward, and without IAM spaghetti and all
| that garbage.
|
| EDIT: I just had a look at the architecture diagram - this is
| over-architected. This is a single-server application that barely
| needs any architecture - Caddy as a web server, a local queue,
| images served from R2 - and it should be running on a single
| machine on a host that charges nothing, or a trivial amount, for
| data.
| BigParm wrote:
| How much does it cost to have an ISP let you do that? What are
| the barriers generally?
| andrewstuart wrote:
| Let you do what? What barriers do you see?
| jazir wrote:
| > run your own Linux servers
|
| He might have thought it meant running servers on a home
| network instead of managing remote Linux servers.
| hypeatei wrote:
| If you're referring to hosting on a home network, you'll
| probably be behind CGNAT. Your ISP can give you a dedicated
| IP but it'll most likely cost something.
| jiripospisil wrote:
| > I use CloudFlare R2 to serve your files - its free.
|
| I mean technically it's not free. It's just that they have a
| very generous "Forever Free" number of read operations
| (10M/month, $0.36 per million after).
| blibble wrote:
| yeah, as a crotchety old unix guy, 10k requests a second was a
| benchmark 30 years ago on an actual server
|
| today a raspberry pi 5 can do 50k/s with TLS no sweat
| dyogenez wrote:
| If you're able to do that, then that's a huge skill! I'm not
| much of a devops engineer myself, so I'm leveraging work done
| by others. My skills are in application design. For hosting I
| try to rely on what others have built and host there.
|
| If I had your skills then our costs would be much smaller. As
| it stands now we pay about $700/month for everything - the bulk
| of it for a 16gb ram / 512gb space database.
| ksnsnsj wrote:
| I have read this argument before. Of course you can do
| everything yourself, _but it is not free_.
|
| You are missing both the development cost and, much more
| importantly, the opportunity cost.
|
| If I spend a person-year on a cheap-to-run architecture while my
| competitor spends a person-year on value-add features, he will
| win.
| Spivak wrote:
| Don't use cloud, use these two other clouds. This right here is
| the issue: the skills and know-how to buy hardware, install it
| in a data center, and get it on the internet are niche beyond
| niche.
|
| Entering the world where you're dealing with Cogent, your Dell
| and Fortinet reps, suddenly having strong opinions about iDRAC
| vs iLO and hardware RAID is well beyond what anyone wants to
| care about just to run some web servers.
|
| When people talk about major cloud providers being expensive,
| the alternative is never /really/ to do it yourself but to move
| to a discount hosting provider. And it's not as if there aren't
| savings to be found there, but it's just another form of cloud
| optimization. We're talking about a story where $100 of spend
| triggers an alert. The difference is so minuscule.
| 1a527dd5 wrote:
| I don't understand, why wasn't there a CDN in front of the public
| GCS bucket resources?
| ksnsnsj wrote:
| While this is normally done due to the reasons mentioned, to me
| that is a significant downside.
|
| Why can't GCS act as a CDN, too?
| qaq wrote:
| Beauty of cloud :) This could be easily served by a $100/month DO
| droplet with 0 worries about $.
| paxys wrote:
| Does DO have free bandwidth? If not how exactly does that solve
| the problem?
| Alifatisk wrote:
| I don't think they have unmetered bandwidth?
| jsheard wrote:
| They don't, although their overage rates are pretty
| reasonable compared to the big clouds at 1 cent per gig.
| It's hard to beat Hetzner's 0.1 cents per gig, though.
|
| I'd rather pay pennies for bandwidth than rely on
| "unmetered" bandwidth which tends to suddenly stop being
| unmetered if you use it too much.
| atrus wrote:
| Not on DO. ~$100 a month droplet gets you about 5TB of transfer
| out. They pulled 15TB in 7 hours. That's ~1,440,000 (16 * 3 * 30)
| in overage, or about $15k extra.
| sroussey wrote:
| I used to have my own half server rack and unlimited
| bandwidth for $500/mo.
|
| My own machines, of course.
| daemonologist wrote:
| Doesn't DO charge $0.01/GB for egress overage? That's $150,
| not $15k. (Although Hetzner or something would've been even
| less.)
| atrus wrote:
| The formatting ate my math. It's 1,440,000 GB of transfer per
| month (16 x 3 x 30 TB). That's $14.4k.
| qaq wrote:
| Didn't pay attention to the transfer figure - let's switch from
| DO to a CCX43 on Hetzner for $50 more.
| 0xbadcafebee wrote:
| I think you miss the point of the cloud. It's not supposed
| to be cheaper. If you want cheap, yeah, run on Hetzner. If
| you want to deploy a WAF with complex rules to route
| specific traffic to either a multi-region ALB or a bucket
| with a WAF built in, and do it in 10 minutes, you use the
| cloud.
| qaq wrote:
| I really don't :) I work on this day in and day out. It's
| just that 90% of projects don't need any of the above.
| rsstack wrote:
| DO _is_ cloud. Using their droplets instead of something more
| sophisticated on GCP is an engineering choice, but both are
| cloud and both have upsides and downsides. One needs to
| understand their own needs to make the right decision, both
| among the different providers and, within a provider, on the
| right setup.
| ponytech wrote:
| I rent a bare metal server for $50/month with unlimited
| bandwidth...
| kawera wrote:
| Where?
| ksnsnsj wrote:
| There is no such thing as unlimited bandwidth.
|
| What I'm aware of are services which do not charge extra for
| egress but severely limit your egress bandwidth (like 10 Gbit
| peak, 100 Mbit avg)
|
| And limiting egress bandwidth is better done per client in the
| service than by the hoster for your whole system.
| languagehacker wrote:
| Did this guy just write a blog post about how he completely
| rewrote a functional feature to save $800?
|
| In all seriousness, the devil is in the details around this kind
| of stuff, but I do worry that doing something not even clever,
| but just nonstandard, introduces a larger maintenance effort than
| necessary.
|
| Interesting problem, and an interesting solution, but I'd
| probably rather just throw money at it until it gets to a scale
| that merits further bot prevention measures.
| underwater wrote:
| It was $800 _so far_.
|
| Your point is valid for normal usage patterns where there is a
| direct relationship between active users and cost. But the
| attack meant OP's costs were skyrocketing even though usage
| was flat.
| dyogenez wrote:
| If this were a business and someone else's money I'd do the
| same. This is a bootstrapped side project coming out of my own
| wallet.
|
| If money wasn't an issue, I'd probably just allow people to
| download images for free.
| languagehacker wrote:
| Good point! My POV assumed some amount of revenue generation.
| Alifatisk wrote:
| I can't describe my surprise when I saw RoR being mentioned -
| that was unexpected, but it made the article way more exciting
| to read.
|
| Wouldn't this be solved by using Cloudflare R2 though?
| dyogenez wrote:
| That's good to hear! I'll take any chance to bring in Ruby.
|
| I'm not familiar with Cloudflare R2, so I'll have to check it
| out. I do like that we can rate limit based on either the user
| ID requesting an image from the API or by IP address. I'm not
| sure how we'd handle segmenting by user ID with a CDN (but I'd
| have to read more to understand if that's a possibility).
| paulddraper wrote:
| Remember kids, CDNs are your friend.
|
| You can roll/host your own anything. Except CDN (if you care
| about uptime).
| arcfour wrote:
| I immediately groaned when I read "public bucket."
|
| On AWS you'd put CloudFront in front of the (now-private) bucket
| as a CDN, then use WAF for rate limiting, bot control, etc. In my
| experience GCP's services work similarly to AWS, so...is this not
| possible with GCP, or why wasn't this the setup from the get-go?
| That's the proper way to do things IMO.
|
| Signed URLs I only think of when I think of like, paid content or
| other "semi-public" content.
| dyogenez wrote:
| That's a good idea. I probably could've put a CDN in front of
| this and rate limited there while keeping things public. That
| might've been faster than using Ruby to be honest. The downside
| was that our API already shared the non-CDN URLs, so that would
| leave the problem open for anyone who already had that data.
| arcfour wrote:
| The bucket is private though, only accessible through the
| CDN. The old URLs would cease to function. On AWS this is
| implemented through OAI/OAC, granting the CloudFront
| distribution access via its own unique principal. AWS has had
| a baseline security recommendation for years now to disable
| S3 public access at the account/org level.
|
| Maybe this breaks things, maybe you need to expire some
| caches, but (forgive me for being blunt, I can't think of a
| better way to say it) that's the cost of not doing things
| correctly to begin with.
|
| My first thought as a security engineer when setting
| something up to be public has always been "how hard could
| someone hit this, and how much would it cost/affect
| availability?"
| 0xbadcafebee wrote:
| Google Cloud makes it insanely difficult/non-obvious what
| services you should use to solve these problems (or how to use
| them, because they're always difficult to use). They have a
| maze of unintuitive product names, sub-products, and sub-sub-
| products; finding them in the UI is ridiculous, there are no
| useful tips/links/walkthroughs in the wizards, and their docs
| are terrible. It's like being trapped in the goddamn catacombs
| of Paris. On AWS, using buckets with a CDN, ALB & WAF is obvious
| and easy, but on GCP it's a quagmire.
|
| The other thing is, AWS WAF was released in 2015, and the
| Google Cloud Armor WAF feature (the what now?) was released in
| 2020.
| written-beyond wrote:
| Honestly this is exactly how I felt about GCP when I was
| building something that would be used by millions of people.
| At that scale it's very easy to shoot yourself in the foot
| and boy does Google make that easy.
|
| There were so many things that were outright wrong in their
| documentation that caused me many sleepless nights. Like not
| recommending the use of a pool, or closing Cloud SQL
| connections, in serverless functions, because they'll be closed
| automatically when the instance spins down.
|
| Don't get me wrong, I had used pools extensively before, and I
| knew you had to close connections, but their docs and examples
| would explicitly show the connections not being closed - just
| left to close when the instance spins down.
|
| Idk why they never considered that an instance might never spin
| down if it's getting hammered with requests, so you end up with
| hundreds of open connections across multiple instances until GCP
| starts killing your requests, telling you "out of connections"
| in a serverless instance. It's the vaguest possible error; after
| a lot of debugging you learn that you can't have more than 100
| open connections on a single function instance, even though you
| were technically never supposed to have more than one open at
| any given time.
|
| _sigh_
| antihero wrote:
| That said, if you use CF in front of S3 (which you should),
| anyone with a gigabit connection can easily cost you hundreds
| of dollars. I know this because I did this to myself
| accidentally.
| hypeatei wrote:
| So your fix was to move the responsibility to the web server and
| Redis instance? I guess that works but introduces a whole lot
| more complexity (you mentioned adding rate limiting) and
| potential for complete outage in the event a lot of requests for
| images come in again.
| dyogenez wrote:
| That's my worry too. The load on our Rails server hasn't gone
| up even though our throughput has maxed out at 76k
| requests/second (which I think is a bunch of people from Hacker
| News going to the Hardcover homepage and downloading 100
| images).
|
| I don't like that if Rails goes down our images go down. I'd
| much prefer to separate these out and show the signed URLs in
| Next.js and be able to generate them through the API. I think
| we'll get there, but that's a bigger change than I could
| reliably make in a day.
| 0xbadcafebee wrote:
| Rate limiting (and its important cousin, back-off retries) is an
| important feature of any service being consumed by an "outside
| entity". There are many different reasons you'll want rate
| limiting at every layer of your stack, for every request you
| have: brute-force resistance, [accidental] DDoS protection,
| resiliency, performance testing, service quality, billing/quotas,
| and more.
|
| Every important service always eventually gets rate limiting. The
| more of it you have, the more problems you can solve. Put in the
| rate limits you think you need (based on performance testing) and
| only raise them when you need to. It's one of those features
| nobody adds until it's too late. If you're designing a system
| from scratch, add rate limiting early on. (you'll want to control
| the limit per session/identity, as well as in bulk)
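|
| As a small illustration of the back-off half of that (the error
| class and client call are hypothetical), a well-behaved client
| retries rate-limited requests with exponential delay and jitter
| instead of hammering the service:
|
|     # Retry a rate-limited call with exponential backoff + jitter.
|     def with_backoff(max_attempts: 5)
|       attempts = 0
|       begin
|         yield
|       rescue RateLimitedError   # hypothetical "429" error class
|         attempts += 1
|         raise if attempts >= max_attempts
|         sleep((2 ** attempts) + rand)
|         retry
|       end
|     end
|
|     with_backoff { api_client.fetch_signed_url(image_id) }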
| upon_drumhead wrote:
| Given that you want to be good stewards of book data, have you
| considered publishing bulk snapshots to archive.org on a set
| cadence? It would greatly reduce the need for any sort of bulk
| scraping and also ensure that should something happen to your
| service, the data isn't lost forever.
| taeric wrote:
| I'm confused, isn't this literally the use case for a CDN?
|
| Edit: I see this is discussed in other threads.
| dyogenez wrote:
| That would solve some of the problems. If the site was
| previously behind a CDN with a rate limit, I don't think we
| would have even had this problem.
|
| Given that we have the problem now, and that people already
| have the non-CDN URLs, we needed a solution that allowed us to
| roll out something ASAP, while allowing people that use our API
| to continue using the image URLs they've downloaded.
| taeric wrote:
| Makes sense. And kudos on getting a solution that works for
| you! :D
| Sytten wrote:
| I really hope this is not the whole of your code, otherwise you
| have a nice open redirect vulnerability on your hands, and
| possibly a private bucket leak if you don't check which bucket
| you are signing the request for. Never, for the love of
| security, take a URL as input from a user without doing a whole
| lot of checks and sanitization. And don't expect your language's
| URL parser to be perfect - Orange Tsai demonstrated they can get
| confused [1].
|
| [1] https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-
| Ne...
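|
| A minimal sketch of the kind of check being described (the host,
| prefix, and helper name are illustrative): reject anything that
| isn't a plain http(s) URL pointing at the one bucket path you
| intend to sign for.
|
|     require "uri"
|
|     ALLOWED_HOST   = "storage.googleapis.com"
|     ALLOWED_PREFIX = "/my-public-assets/"   # illustrative path
|
|     def safe_to_sign?(raw_url)
|       uri = URI.parse(raw_url)
|       return false unless uri.is_a?(URI::HTTP)    # http(s) only
|       return false unless uri.host == ALLOWED_HOST
|       return false unless uri.path.start_with?(ALLOWED_PREFIX)
|       return false if uri.path.include?("..")     # no traversal
|       true
|     rescue URI::InvalidURIError
|       false
|     end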
| dyogenez wrote:
| I left off the method that generates the signed URL. It limits
| the bucket to a specific one per env and blocks some protected
| folders and file types. I left that out in case someone used it
| to find an opening to attack.
| sakopov wrote:
| I must be missing something obvious, but what do signed URLs have
| to do with requests going directly to resources in a bucket
| instead of a CDN of some sort like Cloudflare? Signed URLs are
| typically used to provide secure access to a resource in a
| private bucket. But it seems like they're being used as a cache
| of sorts here?
___________________________________________________________________
(page generated 2024-08-15 23:00 UTC)