[HN Gopher] Serverless at Scale: Lessons from 200M Lambda Invoca...
___________________________________________________________________
Serverless at Scale: Lessons from 200M Lambda Invocations
Author : thunderbong
Score : 56 points
Date : 2023-11-11 03:39 UTC (1 day ago)
(HTM) web link (insights.adadot.com)
(TXT) w3m dump (insights.adadot.com)
| Aeolun wrote:
| In other words, they had an average of 6.3 lambda invocations per
| second.
|
| Why make it sound so sensational? I did much more than that on
| a single Xeon machine.
| geraldwhen wrote:
| Yes, but that wouldn't require jumping through so many hoops to
| make it work, and it would probably cost a lot less.
|
| You have to use lambda so you can overcome artificial
| engineering constraints so that you can write blog posts.
| BiteCode_dev wrote:
| A cheap VPS could output much more, for 5 dollars a month.
| QuadrupleA wrote:
| Most of our apps, on a single t3a.nano (about $3/mo), can
| handle about 250 req/sec in stress tests. In sluggish Python,
| no less. People don't seem to understand modern compute speeds.
| paulddraper wrote:
| Does your app do more or less work?
| citrin_ru wrote:
| Xeon has a long history, but even around 2005, 50 rps from a
| single box running a Perl web app was not considered high load.
| tomnipotent wrote:
| They claim to need 25 medium or 6 xlarge EC2 instances to
| handle ~17M monthly invocations, which seems insane. I don't
| know everything they're doing under the hood, but I'd expect to
| be able to handle billions of requests with that much hardware
| considering the product offering.
| Nihilartikel wrote:
| I'm similarly mystified. I have first-hand experience with
| that volume of traffic...
|
| Per day. Per host. In 2009. On Python.
|
| How is everyone making everything so slow?
| kgeist wrote:
| I thought lambdas are practical when they're called rarely,
| with an occasional surge of traffic? That way you don't pay for
| the servers when they're unused, and you can occasionally
| withstand Black Friday traffic.
|
| If a lambda is called 6 times per second, I suspect the
| underlying VMs/containers that power lambdas are rarely shut
| down (I don't know how AWS works, but that's how it works in
| another cloud provider I'm familiar with - they wait a little
| while for new requests before shutting down the container). So
| they might as well have just used an always-on server.
|
| I also wonder why their calculations show that 6 rps (that's
| what their "17 mln monthly lambda invocations" really means)
| would require 25 servers. We have a single mediocre VM which
| serves around 6 rps on average as well, without issues...
| Although, of course, it all depends on what kind of load each
| request has. We don't do number crunching and most of the time
| is spent in the database.
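|
| Back-of-the-envelope for those rates (a rough sketch, not from
| the article):
|
|     # ~200M invocations over a year vs. ~17M per month
|     print(200_000_000 / (365 * 24 * 3600))  # ~6.3 per second
|     print(17_000_000 / (30 * 24 * 3600))    # ~6.6 per second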
| civilitty wrote:
| If they're called rarely, they have poor latency due to cold
| start times.
|
| IMO lambdas are most practical when someone else in the org
| is responsible for spinning up VMs and your boss is in a
| pissing match with their boss, preventing you from getting
| any new infrastructure to get work done. The technical merits
| rarely have anything to do with it.
| steveBK123 wrote:
| "it costs 2x as much but doesn't require me to navigate my
| messed up IT org" is an underappreciated motivation for a
| lot of architectures you see people pump.
|
| Your IT org political problems may not match, and therefore
| their architecture may not either. Also there's the
| question of what problem you are trying to scale at what
| scale, with what resiliency/uptime requirements, etc...
| tetha wrote:
| Mh, we've been looking at e.g. OpenFaaS to implement rare or
| slow and resource intensive processes as functions so they
| don't have to run all the time. Think of customer
| provisioning, larger data exports, model trainings and such
| things. Here, a slow startup might add a few seconds to a few
| minute long process.
|
| But our conclusion is: outside of really heavyweight processes
| like model training, it's a lot of infrastructure effort to
| run something like this on our own systems, as opposed to just
| sticking that code into a rarely called REST endpoint in some
| application we're already running anyway. We'd need a much
| higher volume of rarely executed tasks to make it worth
| running.
| islewis wrote:
| > _why their calculations show that 6rps (that 's what their
| "17 mln monthly lambda invocations" really means) would
| require 25 servers_
|
| I'm guessing that it has something to do with the average job
| taking 15 minutes. 6 rps represents 6 jobs being created per
| second, but each one takes 15 minutes to run to completion.
| Another way to look at it: each second, 90 minutes of lambda
| work is created.
|
| If you consider 15 minutes the rolling window, which looks
| like a fair assumption based on the graph provided, there
| could be up to 5400 (15 min x 60 sec x 6 rps) functions
| running at once. Working backwards, 25 medium instances provide
| (25 instances x 4 GB mem) a 100 GB memory pool, or roughly
| 100,000 MB. That leaves around ~18 MB for each of those 5400
| jobs, if you don't consider OS resource overhead.
|
| Looking at averages in this situation can very well give a
| warped perception of reality, but 25 instances doesn't seem
| out of the realm of possibility. I'm sure they have much more
| relevant metrics to back up that number as well.
|
| Whether the functions really need this much time to run is
| another issue entirely, and hard to answer with the
| information given.
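|
| Spelling out the estimate above (a rough sketch; the 15-minute
| average job and 4 GB per medium instance are the assumptions
| already stated):
|
|     rps = 6                        # new jobs per second
|     job_seconds = 15 * 60          # average job duration
|     concurrent = rps * job_seconds       # jobs in flight
|     pool_mb = 25 * 4 * 1000              # ~100,000 MB across fleet
|     print(concurrent, pool_mb // concurrent)  # 5400 jobs, ~18 MB each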
| tuetuopay wrote:
| They talk about lambdas that exceed the max runtime of 15
| minutes. For a single invocation. They go to great lengths
| talking about cron jobs, background jobs triggered by events,
| etc. I highly doubt the bulk of the lambdas they talk about are
| simple API calls. Otherwise yes, 6-7 rps is peanuts if we're
| talking about API calls. And since the very first point of the
| article is them highlighting what goes to Lambda and what does
| not (public API calls go to a dedicated box), I think it's safe
| to say those 6 invocations/sec are definitely not API calls.
|
| TL;DR: your Xeon box is their always-on API box.
| reese_john wrote:
| For our region that limit is set to 1000. That might sound like a
| lot when you start, but you quickly realise it's easy to reach
| once you have enough Lambdas and you scale up. We found
| ourselves hitting that limit a lot once our traffic, and
| therefore the demands on our system, started scaling.
|
| You can file a support ticket to have that limit raised.
|
| https://docs.aws.amazon.com/servicequotas/latest/userguide/r...
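|
| If you'd rather script the request than click through the
| console, something along these lines works with boto3 (a
| sketch; it looks up the "Concurrent executions" quota code
| rather than hard-coding it):
|
|     import boto3
|
|     sq = boto3.client("service-quotas")
|     for q in sq.list_service_quotas(ServiceCode="lambda")["Quotas"]:
|         if q["QuotaName"] == "Concurrent executions":
|             sq.request_service_quota_increase(
|                 ServiceCode="lambda",
|                 QuotaCode=q["QuotaCode"],
|                 DesiredValue=3000.0,
|             )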
| Jamie9912 wrote:
| That's really annoying though... Why should I have to go out of
| my way to increase capacity, given that I'm paying for it
| anyway?
| ebbp wrote:
| Generally these limits exist so customers don't accidentally
| spend more than they intend to -- e.g. implementing a sort of
| infinite loop where Lambdas call each other constantly.
| Sounds implausible but I've seen that more than once!
| maccard wrote:
| > Sounds implausible but I've seen that more than once!
|
| The textbook example of this going wrong is a lambda that
| is invoked on uploading to S3 that writes the result to S3.
| There's even an AWS article on it - [0]
|
| [0] https://aws.amazon.com/blogs/compute/avoiding-recursive-invo...
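|
| The usual mitigation is to make sure the function's output can
| never match the event filter that triggers it, e.g. by writing
| to a separate prefix or bucket. A minimal sketch (the
| "processed/" prefix and the transformation are made up):
|
|     import boto3
|
|     s3 = boto3.client("s3")
|
|     def handler(event, context):
|         for record in event["Records"]:
|             bucket = record["s3"]["bucket"]["name"]
|             key = record["s3"]["object"]["key"]
|             # Guard: never reprocess our own output
|             if key.startswith("processed/"):
|                 continue
|             body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
|             s3.put_object(Bucket=bucket,
|                           Key="processed/" + key,
|                           Body=body.upper())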
| easton wrote:
| We actually got an email from AWS recently at work that
| said "hey! Your lambda writes to a queue that invokes the
| same lambda, that seems wack". We need it that way, but
| it's enough of a problem that they built a way to detect
| it automatically.
| redhale wrote:
| I might believe this if AWS allowed customers to specify
| their own self-imposed billing limit. Do they have that
| feature yet?
| TexanFeller wrote:
| IMO that's not why they _really_ do it. They have limits on
| everything because even at their scale they can't instantly
| accommodate sudden scaling, and they need to prevent "noisy
| neighbor" situations where your sudden excessive usage impacts
| others' workloads. They still have to do relatively short-term
| capacity planning to accommodate you. I work for only a
| medium-large company, and AWS has quoted us lead times of
| _weeks_ to make the instances we need for a workload available.
| We only needed 200-300 EC2 instances, and they weren't even
| super unusual types. I think their infinite-scaling-on-a-dime
| claims are pure marketing jibber-jabber.
| adobrawy wrote:
| One of the official reasons for the quota is to protect
| customers from shooting themselves in the foot when they
| configure something incorrectly and start consuming the maximum
| available autoscaling resources, which quickly makes the bill
| explode.
| danielklnstein wrote:
| I understand the sentiment behind your frustration - but it's
| worth noting that these support tickets are usually answered
| really quickly.
|
| Specifically as it relates to Lambdas there's solid rationale
| behind these limits, but I agree that in many other cases the
| limits seem arbitrary and annoying.
| tuetuopay wrote:
| The quotas are there for one good reason: to keep the system
| from running wild and consuming way too much.
|
| - For limited resources like IPs, it keeps one customer from
| eating all the stock. Yes, he's paying for them, but other
| customers wouldn't be able to get any anymore, generating
| frustrated users and revenue loss.
|
| - For most other "infinite stock" resources, it keeps the bill
| from exploding. That's good for the customer, but also for the
| provider, since they're sure to be paid rather than take a
| billing decline or suck up all of a startup's money.
| dig1 wrote:
| > Serverless architecture promises flexibility, infinite
| scalability, fast setups, cost efficiency, and abstracting
| infrastructure, allowing us to focus on the code.
|
| The only things I know serverless architecture promises are
| big bills and a steady income for the cloud provider. I'd be
| happy to see a serverless setup that won't be blown away by a
| (way cheaper) small or medium-sized VM.
| insanitybit wrote:
| You have to manage a VM. For example, ensuring that the VM has
| an up to date OS. If you don't care about that, ok, but that's
| something that Lambda offers.
|
| Ephemerality is a plus as well. Just from a security
| standpoint, having an ephemeral system means persistence is not
| possible.
| QuadrupleA wrote:
| sudo yum update, once a week
| insanitybit wrote:
| If it's one VM for a personal website that'll be fine. Good
| luck explaining that to a SOC2 auditor, or managing that
| across a fleet.
| steveBK123 wrote:
| Nooo I can't run that, I need to pay for Bezos 3rd yacht!
| flavius29663 wrote:
| That is not nearly enough. You need to make sure the system
| still works after the update, so you need to carefully control
| all the versions that go in and test them in lower
| environments. Also, for any serious application, you will need
| to do this multiple times, for hundreds or thousands of
| machines, even at small companies.
|
| I am on your side, actually; I think managing machines is
| better than serverless, but it's not _that_ easy.
| declan_roberts wrote:
| This really doesn't work in practice. We brought down our
| whole compute cluster once when someone ran update and it
| changed some stupid thing.
|
| It will happen to you sooner or later also. Updates are
| always out of band for this reason.
|
| That's why everybody does builds and isolates updates to that
| process.
| insanitybit wrote:
| > We brought down our whole compute cluster once when
| someone ran update and it changed some stupid thing
|
| I've seen this happen twice now, as well.
| buffet_overflow wrote:
| I feel like we're close to that with automated testing
| and rollbacks. Still, it seems like a ton of complexity
| for what feels like a fundamental need.
| joshuanapoli wrote:
| a recipe for getting exploited by a supply chain attack
| dig1 wrote:
| These things can be easily automated, and many Linux LTS
| distros come with a preconfigured system auto-update.
|
| > Just from a security standpoint, having an ephemeral system
| means persistence is not possible.
|
| You still have to persist something somewhere, and there is a
| higher chance someone will figure out SQL injection or
| unfiltered POST request through your app than hack SSH access
| to the box. If someone wants to do any real damage, they'd
| just continuously DDoS that serverless setup, and the cloud
| provider will kill the company with the bill.
| insanitybit wrote:
| > These things can be easily automated,
|
| Is this something people are out there believing? That
| patching is something that's easy to automate? I find that
| kind of nuts, I thought everyone understood that this is,
| in fact, the opposite of easily automated...
|
| > You still have to persist something somewhere, and there
| is a higher chance someone will figure out SQL injection or
| unfiltered POST request through your app than hack SSH
| access to the box.
|
| "This entirely separate attack exists therefor completely
| removing an entire attack primitive haves no value" - how I
| read this comment.
| tg180 wrote:
| > "This entirely separate attack exists therefor
| completely removing an entire attack primitive haves no
| value"
|
| It has value, but it's also true that trusting a cloud
| provider's serverless infrastructure introduces additional
| sets of vulnerabilities, for various reasons.
|
| eg: https://sysdig.com/blog/exploit-mitigate-aws-lambdas-mitre/
|
| Reading your comments, I get the impression that you are
| used to dealing with clients whose infrastructure
| management skills are lacking, and they are making a mess
| of things.
|
| While serverless infrastructures certainly eliminate a
| range of vulnerability classes, their adoption is unlikely
| to be sufficient to secure platforms that are inadequate
| for the threats they face.
|
| At the end of the day, someone has to put in the work to
| ensure that things are patched, safe, and secure, whether
| the computing model is serverless or not.
| insanitybit wrote:
| > Reading your comments, I get the impression that you
| are used to dealing with clients whose infrastructure
| management skills are lacking, and they are making a mess
| of things.
|
| I mean, I worked at Datadog when this happened:
| https://www.datadoghq.com/blog/engineering/2023-03-08-deep-d...
|
| Multi-day outage because of an apt update.
|
| Not the only one I've seen, and it's by no means the only
| issue that occurs with patching (it's extremely common for
| companies not to even know whether they're patched for a
| given vuln).
| estebarb wrote:
| That scares me away from using things like Firebase or
| serverless. I'm trusting the one who bills me to protect me
| from overbilling.
| kikimora wrote:
| I disagree; it is much, much more likely that a former dev
| would leak an SSH key than that someone would bother to find
| an SQL injection.
| _fat_santa wrote:
| Personally I found serverless incredibly useful for very
| simple tasks, tasks where even a t2.micro would be overkill. I
| have a couple of websites that are mostly static but still
| occasionally have to do "server stuff" like sending an email or
| talking to a database. For those cases a Lambda is incredibly
| useful because it costs you next to nothing compared to an EC2
| and it's less maintenance. But for bigger setups I agree it
| would be easier to just host on a small-to-medium VM (and I say
| that as someone who's got an entire API of 200+ endpoints
| deployed in Lambda).
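|
| That pattern often boils down to a handler this small (a
| sketch, assuming SES is already set up; the addresses are
| placeholders):
|
|     import boto3
|
|     ses = boto3.client("ses")
|
|     def handler(event, context):
|         # e.g. a contact-form POST forwarded by API Gateway
|         ses.send_email(
|             Source="noreply@example.com",
|             Destination={"ToAddresses": ["me@example.com"]},
|             Message={
|                 "Subject": {"Data": "Contact form"},
|                 "Body": {"Text": {"Data": event.get("body", "")}},
|             },
|         )
|         return {"statusCode": 200, "body": "ok"}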
| willio58 wrote:
| I've seen many claim this but I just haven't experienced it in
| production and I'm not sure why. We use Lambda at my work and
| we serve several million users a month. Our Lambda bill comes
| out to around $200 a month, not exaggerating. API Gateway ends
| up being more expensive every month than Lambda.
|
| I'm not asking this to be snarky; I genuinely want to know what
| is making people hit high prices with Lambda. Is it functions
| that are super computationally intensive and require queued
| Lambda invocations?
| joshuanapoli wrote:
| In my business, we heavily use FaaS, and I agree that it
| seems economical. It's a little surprising, though. AWS
| Lambda costs 5 to 6 times more per second than an EC2
| instance for a given level of memory and CPU. Our application
| simply doesn't need much CPU time. Other aspects (database,
| storage) are more expensive.
|
| The main advantage, though, is predictability of operations.
| The FaaS services "just work". If we accidentally make a
| change to one endpoint that makes it consume too many
| resources, it
| doesn't affect anything else. It's great for allowing fast
| changes to new functionality without much risk of breaking
| mature features.
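|
| Rough numbers behind that multiple (a sketch using us-east-1
| x86 list prices, which may have changed):
|
|     lambda_gb_s = 0.0000166667           # Lambda, per GB-second
|     m5_large_hr, m5_large_gb = 0.096, 8  # on-demand m5.large
|     ec2_gb_s = m5_large_hr / 3600 / m5_large_gb
|     print(lambda_gb_s / ec2_gb_s)        # ~5x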
| toomuchtodo wrote:
| A use case where you're executing arbitrary code provided by
| users and you don't want to have to maintain the environment
| for doing so (reliability, security boundaries, etc).
| timenova wrote:
| A reasonably priced serverless (kinda) setup is possible on
| Fly.io.
|
| Fly charges for a VM by the second, and when a VM is off, RAM
| and CPU are not charged (storage is still charged). They also
| allow you to easily configure machines to shut down when there
| are no active requests (right from their config file
| `fly.toml`), and they support a more advanced method in which
| your application terminates itself when there's no work
| remaining, which stops the VM. When a new request arrives, it
| starts back up.
|
| Here are the docs [0]. And here's a blog post on how to
| terminate the VM from inside a Phoenix app for example [1].
|
| So essentially, you can write an app which processes multiple
| requests on the same VM (so not really serverless), but also
| saves costs when it's not in use (essentially the promise of
| serverless).
|
| [0] https://fly.io/docs/apps/autostart-stop/
|
| [1] https://fly.io/phoenix-files/shut-down-idle-phoenix-app/
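|
| For reference, the relevant bit of `fly.toml` looks roughly
| like this (per the docs in [0]; exact keys may have changed):
|
|     [http_service]
|       internal_port = 8080
|       auto_stop_machines = true
|       auto_start_machines = true
|       min_machines_running = 0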
| cyanf wrote:
| That's an on-demand server, not serverless.
| kikimora wrote:
| Your devs would need one such VM each for testing. At the very
| least you'll need 2 VMs - staging and production.
| jeswin wrote:
| Given that AWS still has no billing caps (despite it being one of
| the most requested features), you're exposing yourself to
| uncapped downside.
|
| In addition to lambdas being a poor architectural choice in most
| cases, that is.
| insanitybit wrote:
| AFAIK you can pretty easily cap the number of concurrent lambda
| executions. Of all of AWS's services, Lambda is probably the
| easiest one to configure limits on.
| dmattia wrote:
| For Lambdas in particular, you can set reserved concurrency,
| which caps how many instances of a particular Lambda can run
| concurrently at any point in time:
| https://docs.aws.amazon.com/lambda/latest/dg/configuration-c...
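|
| E.g., capping a single function (a sketch; the function name
| is a placeholder):
|
|     import boto3
|
|     boto3.client("lambda").put_function_concurrency(
|         FunctionName="my-function",
|         ReservedConcurrentExecutions=100,
|     )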
| vcryan wrote:
| The conclusions are things that are already widely understood
| about Lambdas. They could have just researched the topic up
| front and chosen a better architecture.
| ecshafer wrote:
| The biggest lesson I learned when I was in an org that started
| using serverless heavily (because it's the future!) is that
| it's an unmaintainable mess. You end up with code spaghetti,
| but now it's split across 100 repos that might be hard to find,
| and designed purely on architecture diagrams.
|
| From what I can see it's basically recreating mainframe batch
| processing, but in the cloud. X happens, which triggers Y job,
| which triggers Z job, and so on.
| politelemon wrote:
| That sounds like microservices in general rather than Lambda,
| the AWS service, specifically. I've seen the same unsustainable
| mess with k8s-crazy teams.
|
| The lesson I've learned instead is to start boring and
| traditional, then reach for serverless tech when you hit a
| problem the current setup cannot solve.
| ecshafer wrote:
| I agree that it's like microservices, but the problem is
| turned up to 11 with serverless. One microservice now becomes
| 10 lambdas. The issue is fundamentally one of discovery, and
| as teams churn out more and more functions that aren't all in
| the same repo, you're bound to lose track of what's happening.
| losteric wrote:
| Serverless has its faults, but spaghetti code spread across
| 100 repos is definitely a "user error"...
| ecshafer wrote:
| How do you solve the discoverability issue when there are 1000
| serverless functions written by 10 different teams? Serverless
| worsens the problem of keeping knowledge of the entire system,
| I think, and I don't think it's even solvable.
| inhumantsar wrote:
| Lambdas are rarely entirely standalone, they support a
| larger service or glue services together.
|
| Creating one repo per Lambda is going to make things messy,
| of course, just as breaking every little internal library
| out into its own repo would.
|
| Regardless of the system or what it runs on, it's an easy
| trap to fall into but it's absolutely solvable with some
| technical leadership around standards.
| cassianoleal wrote:
| > Lambdas are rarely entirely standalone, they support a
| larger service or glue services together.
|
| I wish that were the case in real life. Unfortunately, the
| trend I've been noticing is to run anything that's an API
| call in Lambda, and then chain multiple Lambdas to process
| whatever is needed for that API call.
| latchkey wrote:
| One serverless function is effectively an HTTP router that
| knows how to call the appropriate code path to reach the
| 1000 handlers.
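|
| i.e. a single handler dispatching on method and path, roughly
| like this (a sketch assuming an API Gateway proxy event; the
| routes are made up):
|
|     def list_users(event):
|         return {"statusCode": 200, "body": "[]"}
|
|     def create_order(event):
|         return {"statusCode": 201, "body": "created"}
|
|     ROUTES = {
|         ("GET", "/users"): list_users,
|         ("POST", "/orders"): create_order,
|     }
|
|     def handler(event, context):
|         route = ROUTES.get((event["httpMethod"], event["path"]))
|         if route is None:
|             return {"statusCode": 404, "body": "not found"}
|         return route(event)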
| politelemon wrote:
| I think this does a decent job of talking about the tech
| without hyping it up or shitting on it, just speaking
| matter-of-factly. They seem to have found a decent fit for the
| tech, are able to recognize where it works and where it
| doesn't, and are still able to 'diversify' to other things.
| Good post, thanks for sharing.
| robbles wrote:
| > 10GB is hardly enough. Once you import Pandas you're on the
| limit. You can forget Pandas and scipy at the same Lambda.
|
| This sounds way off to me. 10 GB to install a Python library?
| acd wrote:
| The name serverless is misleading. Of course there are
| functions written in programming languages that run on real
| physical servers in data centers.
|
| This should be called Function as a Service, acronym FaaS.
| hobobaggins wrote:
| > serverless.. as long as we have enough money to throw on it
|
| This one speaks truth.
___________________________________________________________________
(page generated 2023-11-12 23:02 UTC)