[HN Gopher] Serverless at Scale: Lessons from 200M Lambda Invoca...
       ___________________________________________________________________
        
       Serverless at Scale: Lessons from 200M Lambda Invocations
        
       Author : thunderbong
       Score  : 56 points
       Date   : 2023-11-11 03:39 UTC (1 day ago)
        
 (HTM) web link (insights.adadot.com)
 (TXT) w3m dump (insights.adadot.com)
        
       | Aeolun wrote:
       | In other words, they had an average of 6.3 lambda invocations per
       | second.
       | 
       | Why make it sound so sensational? I did much more than that on a
       | single Xeon machine.
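       | 
       | That average assumes the 200M invocations span roughly a year:
       | 
       |     >>> round(200_000_000 / (365 * 24 * 3600), 1)
       |     6.3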
        
         | geraldwhen wrote:
         | Yes, but that wouldn't require jumping through so many hoops to
         | make it work, and it would probably cost a lot less.
         | 
         | You have to use lambda to overcome artificial engineering
         | constraints so that you can write blog posts.
        
         | BiteCode_dev wrote:
         | A cheap VPS could output much more, for 5 dollars a month.
        
         | QuadrupleA wrote:
         | Most of our apps, on a single t3a.nano (about $3/mo), can handle
         | about 250 req/sec in stress tests. In sluggish Python, no
         | less. People don't seem to understand modern compute speeds.
        
           | paulddraper wrote:
           | Does your app do more or less work?
        
         | citrin_ru wrote:
         | Xeon has a long history, but even around 2005, 50 rps from a
         | single box running a Perl web app was not considered high load.
        
         | tomnipotent wrote:
         | They claim to need 25 medium or 6 xlarge EC2 instances to
         | handle ~17M monthly invocations, which seems insane. I don't
         | know everything they're doing under the hood, but I'd expect to
         | be able to handle billions of requests with that much hardware
         | considering the product offering.
        
           | Nihilartikel wrote:
           | I'm similarly mystified. I have first-hand experience with
           | that volume of traffic...
           | 
           | Per day. Per host. In 2009. On Python.
           | 
           | How is everyone making everything so slow?
        
         | kgeist wrote:
         | I thought lambdas were practical when they're called rarely,
         | with an occasional surge of traffic? So that you don't pay for
         | the servers when they're unused, but can still withstand Black
         | Friday traffic.
         | 
         | If a lambda is called 6 times per second, I suspect the
         | underlying VMs/containers that power lambdas are rarely shut
         | down (I don't know how AWS works but that's how it works in
         | another cloud provider I'm familiar with - they wait a little
         | while for new requests before shutting down the container). So
         | they might as well have just used an always-on server.
         | 
         | I also wonder why their calculations show that 6rps (that's
         | what their "17 mln monthly lambda invocations" really means)
         | would require 25 servers. We have a single mediocre VM which
         | serves around 6 rps on average as well without issues...
         | Although, of course, it all depends on what kind of load each
         | request has. We don't do number crunching and most of the time
         | is spent in the database.
        
           | civilitty wrote:
           | If they're called rarely, they have poor latency due to cold
           | start times.
           | 
           | IMO lambdas are most practical when someone else in the org
           | is responsible for spinning up VMs and your boss is in a
           | pissing match with their boss, preventing you from getting
           | any new infrastructure to get work done. The technical merits
           | rarely have anything to do with it.
        
             | steveBK123 wrote:
             | "it costs 2x as much but doesn't require me to navigate my
             | messed up IT org" is an underappreciated motivation for a
             | lot of architectures you see people pump.
             | 
             | Your IT org's political problems may not match, and therefore
             | their architecture may not either. Also there's the
             | question of what problem you are trying to scale at what
             | scale, with what resiliency/uptime requirements, etc...
        
           | tetha wrote:
           | Mh, we've been looking at e.g. OpenFaaS to implement rare, or
           | slow and resource-intensive, processes as functions so they
           | don't have to run all the time. Think of customer
           | provisioning, larger data exports, model trainings and such
           | things. Here, a slow startup might add a few seconds to a
           | process that already takes minutes.
           | 
           | But our outcome is: outside of really heavyweight processes
           | like model trainings, it's a lot of infrastructural effort to
           | run something like this on our own systems, as opposed to just
           | sticking that code into a rarely called REST endpoint in some
           | application we're already running anyway. We'd need a much
           | higher volume of rarely executed tasks to make it worth
           | running.
        
           | islewis wrote:
           | > _why their calculations show that 6rps (that's what their
           | "17 mln monthly lambda invocations" really means) would
           | require 25 servers_
           | 
           | I'm guessing that it has something to do with the average job
           | taking 15 minutes. 6rps represents 6 jobs being created per
           | second, but each one takes 15 minutes to run until
           | completion. Another way to look at it is that each second, 90
           | minutes of lambda work is created.
           | 
           | If you consider 15 minutes the rolling window, which looks
           | like a fair assumption based on the graph provided, there
           | could be up to 5400 (15 min x 60 sec x 6 rps) functions
           | running at once. Working backwards, 25 medium instances
           | provide (25 instances x 4 GB mem) a 100 GB memory pool, or
           | 100,000 MB. That leaves ~18 MB for each of those 5400 jobs,
           | before you even consider OS resource overhead.
           | 
           | Looking at averages in a situation like this can easily give
           | a warped perception of reality, but 25 instances doesn't seem
           | out of the realm of possibility. I'm sure they have much more
           | relevant metrics to back up that number as well.
           | 
           | Whether the functions really need this much time to run is
           | another issue entirely, and hard to answer with the
           | information given.
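           | 
           | The arithmetic as a quick Python sketch (all numbers are the
           | assumptions above, not figures from the article):
           | 
           |     rps = 6                    # new jobs per second
           |     job_secs = 15 * 60         # assumed avg job duration
           |     # Little's law: concurrency = arrival rate x duration
           |     concurrent = rps * job_secs    # 5400 jobs in flight
           |     pool_mb = 25 * 4_000           # 25 mediums x 4 GB each
           |     print(pool_mb / concurrent)    # ~18.5 MB per job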
        
         | tuetuopay wrote:
         | They talk about lambdas that exceed the max runtime of 15
         | minutes. For a single invocation. They go to great lengths
         | talking about cron jobs, background jobs triggered by events,
         | etc. I highly doubt the bulk of the lambdas they talk about are
         | simple API calls. Otherwise yes, 6-7rps is peanuts, if we're
         | talking about API calls. And since the very first point of the
         | article is them highlighting what goes to lambda and what does
         | not (public api calls go to a dedicated box), I think it's safe
         | to say those 6 invocations/sec are definitely not API calls.
         | 
         | TL;DR your Xeon box is their always-on api box.
        
       | reese_john wrote:
       | For our region that limit is set to 1000. That might sound like a
       | lot when you start, but you quickly realise it's easy to reach
       | once you have enough Lambdas and you scale up. We found ourselves
       | hitting that limit a lot once our traffic and therefore our
       | demands from our system started scaling.
       | 
       | You can file a support ticket to have that limit raised.
       | 
       | https://docs.aws.amazon.com/servicequotas/latest/userguide/r...
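       | 
       | For example, with boto3 (the quota code below is what I believe
       | "Concurrent executions" maps to; verify it with
       | list_service_quotas before relying on it):
       | 
       |     import boto3
       | 
       |     sq = boto3.client("service-quotas")
       |     # ask AWS to raise the account-wide Lambda concurrency limit
       |     sq.request_service_quota_increase(
       |         ServiceCode="lambda",
       |         QuotaCode="L-B99A9384",  # "Concurrent executions"
       |         DesiredValue=5000,
       |     )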
        
         | Jamie9912 wrote:
         | That's really annoying though. Why should I have to go out of
         | my way to increase capacity given that I'm paying for it
         | anyway?
        
           | ebbp wrote:
           | Generally these limits exist so customers don't accidentally
           | spend more than they intend to -- e.g. implementing a sort of
           | infinite loop where Lambdas call each other constantly.
           | Sounds implausible but I've seen that more than once!
        
             | maccard wrote:
             | > Sounds implausible but I've seen that more than once!
             | 
             | The textbook example of this going wrong is a lambda that
             | is invoked on an upload to S3 and writes its result back to
             | S3. There's even an AWS article on it - [0]
             | 
             | [0] https://aws.amazon.com/blogs/compute/avoiding-recursive-invo...
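             | 
             | The usual guard is to write output under a prefix the
             | trigger ignores, and to skip it defensively in case an
             | event for it ever arrives anyway. A minimal sketch (the
             | bucket layout and process() are made up):
             | 
             |     import boto3
             | 
             |     s3 = boto3.client("s3")
             | 
             |     def handler(event, context):
             |         for rec in event["Records"]:
             |             bucket = rec["s3"]["bucket"]["name"]
             |             key = rec["s3"]["object"]["key"]
             |             # never process our own output, or the
             |             # upload re-triggers this function forever
             |             if key.startswith("processed/"):
             |                 continue
             |             result = process(bucket, key)  # hypothetical
             |             s3.put_object(Bucket=bucket,
             |                           Key="processed/" + key,
             |                           Body=result)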
        
               | easton wrote:
               | We actually got an email from AWS recently at work that
               | said "hey! Your lambda writes to a queue that invokes the
               | same lambda, that seems wack". We need it that way, but
               | it's enough of a problem that they built a way to detect
               | it automatically.
        
             | redhale wrote:
             | I might believe this if AWS allowed customers to specify
             | their own self-imposed billing limit. Do they have that
             | feature yet?
        
             | TexanFeller wrote:
             | IMO that's not why they _really_ do it. They have limits on
             | everything because even at their scale they can't instantly
             | accommodate sudden scaling needs, and they need to prevent
             | "noisy neighbor" situations where one customer's sudden
             | excessive usage impacts others' workloads. They still have
             | to do relatively short-term capacity planning to
             | accommodate you. Like, I work for only a medium-large sized
             | company and AWS has quoted us lead times of _weeks_ to make
             | the instances we need for a workload available. We only
             | needed 200-300 EC2 instances and they weren't even super
             | unusual types. I think their infinite scaling on a dime
             | claims are pure marketing jibber jabber.
        
           | adobrawy wrote:
           | One of the official reasons for the quota is to protect
           | consumers from shooting themselves in the foot when they
           | configure something incorrectly and start using the maximum
           | available autoscaling resources, which quickly makes the bill
           | explode.
        
           | danielklnstein wrote:
           | I understand the sentiment behind your frustration - but it's
           | worth noting that these support tickets are usually answered
           | really quickly.
           | 
           | Specifically as it relates to Lambdas there's solid rationale
           | behind these limits, but I agree that in many other cases the
           | limits seem arbitrary and annoying.
        
           | tuetuopay wrote:
           | The quotas are there for one good reason: to stop the system
           | from running wild and consuming way too much.
           | 
           | - For limited resources like IPs, they keep one customer from
           | eating all the stock. Yes, they're paying for them, but other
           | customers wouldn't be able to get any anymore, generating
           | frustrated users and revenue loss.
           | 
           | - For most other "infinite stock" resources, they keep the
           | bill from exploding. That's good for the customer, but also
           | for the provider, who is sure to be paid rather than hit a
           | billing decline or suck up all of a startup's money.
        
       | dig1 wrote:
       | > Serverless architecture promises flexibility, infinite
       | scalability, fast setups, cost efficiency, and abstracting
       | infrastructure, allowing us to focus on the code.
       | 
       | The only things I know serverless architecture promises are big
       | bills and a steady income for the cloud provider. I'd be happy
       | to see a serverless setup that wouldn't be blown away by a (way
       | cheaper) small/medium-sized VM.
        
         | insanitybit wrote:
         | You have to manage a VM. For example, ensuring that the VM has
         | an up-to-date OS. If you don't care about that, ok, but that's
         | something that Lambda handles for you.
         | 
         | Ephemerality is a plus as well. Just from a security
         | standpoint, having an ephemeral system means persistence is not
         | possible.
        
           | QuadrupleA wrote:
           | sudo yum update, once a week
        
             | insanitybit wrote:
             | If it's one VM for a personal website that'll be fine. Good
             | luck explaining that to a SOC2 auditor, or managing that
             | across a fleet.
        
             | steveBK123 wrote:
             | Nooo I can't run that, I need to pay for Bezos 3rd yacht!
        
             | flavius29663 wrote:
             | That is not nearly enough. You need to make sure the system
             | still works after the update, so you need to carefully
             | control all the versions that go in, test them in lower
             | environments. Also, for any serious application, you will
             | need to do this multiple times, for hundreds or thousands
             | of machines even for small companies.
             | 
             | I am on your side, actually, I think managing machines is
             | better than serverless, but it's not _that_ easy.
        
             | declan_roberts wrote:
             | This really doesn't work in practice. We brought down our
             | whole compute cluster once when someone ran an update and it
             | changed some stupid thing.
             | 
             | It will happen to you sooner or later also. Updates are
             | always out of band for this reason.
             | 
             | That's why everybody does builds and isolates updates to
             | that process.
        
               | insanitybit wrote:
               | > We brought down our whole compute cluster once when
               | someone ran update and it changed some stupid thing
               | 
               | I've seen this happen twice now, as well.
        
               | buffet_overflow wrote:
               | I feel like we're close to that with automated testing
               | and rollbacks. Still, it seems like a ton of complexity
               | for what feels like a fundamental need.
        
             | joshuanapoli wrote:
             | a recipe for getting exploited by a supply chain attack
        
           | dig1 wrote:
           | These things can be easily automated, and many Linux LTS
           | distros come with a preconfigured system auto-update.
           | 
           | > Just from a security standpoint, having an ephemeral system
           | means persistence is not possible.
           | 
           | You still have to persist something somewhere, and there is a
           | higher chance someone will figure out an SQL injection or an
           | unfiltered POST request through your app than hack SSH access
           | to the box. If someone wants to do any real damage, they'd
           | just continuously DDoS that serverless setup, and the cloud
           | provider will kill the company with the bill.
        
             | insanitybit wrote:
             | > These things can be easily automated,
             | 
             | Is this something people are out there believing? That
             | patching is something that's easy to automate? I find that
             | kind of nuts; I thought everyone understood that this is,
             | in fact, the opposite of easily automated...
             | 
             | > You still have to persist something somewhere, and there
             | is a higher chance someone will figure out an SQL injection
             | or an unfiltered POST request through your app than hack SSH
             | access to the box.
             | 
             | "This entirely separate attack exists therefor completely
             | removing an entire attack primitive haves no value" - how I
             | read this comment.
        
               | tg180 wrote:
               | > "This entirely separate attack exists therefor
               | completely removing an entire attack primitive haves no
               | value"
               | 
               | It has value, but it's also true that trusting a cloud
               | provider's serverless infrastructure introduces additional
               | sets of vulnerabilities, for various reasons.
               | 
               | eg: https://sysdig.com/blog/exploit-mitigate-aws-lambdas-mitre/
               | 
               | Reading your comments, I get the impression that you are
               | used to dealing with clients whose infrastructure
               | management skills are lacking and who are making a mess
               | of things.
               | 
               | While serverless infrastructures certainly eliminate a
               | range of vulnerability classes, their adoption is unlikely
               | to be sufficient to secure platforms that are otherwise
               | inadequate for the threats they face.
               | 
               | At the end of the day, someone has to put in the work to
               | ensure that things are patched, safe, and secure, whether
               | the computing model is serverless or not.
        
               | insanitybit wrote:
               | > Reading your comments, I get the impression that you
               | are used to dealing with clients whose infrastructure
               | management skills are lacking, and they are making a mess
               | of things.
               | 
               | I mean, I worked at Datadog when this happened:
               | https://www.datadoghq.com/blog/engineering/2023-03-08-deep-d...
               | 
               | Multi-day outage because of an apt update.
               | 
               | Not the only one I've seen, and it's by no means the only
               | issue that occurs with patching (extremely common that
               | companies don't even know if they're patched for a given
               | vuln).
        
             | estebarb wrote:
             | That scares me away from using things like Firebase or
             | serverless. I'm trusting the party who bills me to protect
             | me from overbilling.
        
             | kikimora wrote:
             | I disagree, it is much, much more likely that a former dev
             | would leak an SSH key than that someone would bother to find
             | an SQL injection.
        
         | _fat_santa wrote:
         | Personally I found serverless incredibly useful for very simple
         | tasks, tasks where even a t2.micro would be overkill. I have a
         | couple of websites that are mostly static but still
         | occasionally have to do "server stuff" like sending an email or
         | talking to a database. For those instances a Lambda is
         | incredibly useful because it costs next to nothing compared to
         | an EC2 and is less maintenance. But for bigger setups I agree
         | it would be easier to just host on a small-medium VM (and I say
         | that as someone who's got an entire API of like 200+ endpoints
         | deployed in Lambda).
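         | 
         | A minimal sketch of that kind of "server stuff" (SES via boto3;
         | the addresses and event shape are placeholders):
         | 
         |     import boto3
         | 
         |     ses = boto3.client("ses")
         | 
         |     def handler(event, context):
         |         # e.g. a contact form POST proxied by API Gateway
         |         ses.send_email(
         |             Source="noreply@example.com",
         |             Destination={"ToAddresses": ["me@example.com"]},
         |             Message={
         |                 "Subject": {"Data": "Contact form"},
         |                 "Body": {"Text": {"Data": event["body"]}},
         |             },
         |         )
         |         return {"statusCode": 200, "body": "ok"}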
        
         | willio58 wrote:
         | I've seen many claim this but I just haven't experienced it in
         | production and I'm not sure why. We use lambda at my work and
         | we serve several million users a month. The Lambda bill comes
         | out to around $200 a month, not exaggerating. API Gateway ends
         | up being more expensive every month than Lambda.
         | 
         | I'm not asking this to be snarky; I truly want to know what is
         | making people hit high prices with Lambda. Is it functions that
         | are super computationally intensive and require queued Lambda
         | invocations?
        
           | joshuanapoli wrote:
           | In my business, we heavily use FaaS, and I agree that it
           | seems economical. It's a little surprising, though. AWS
           | Lambda is 5 to 6 times the price per second of an EC2
           | instance for a given level of memory and CPU. Our application
           | simply doesn't need much CPU time. Other aspects (database,
           | storage) are more expensive.
           | 
           | The main advantage, though, is predictability of operations.
           | The FaaS services "just work". If we accidentally make a
           | change to one endpoint that consumes too many resources, it
           | doesn't affect anything else. It's great for allowing fast
           | changes to new functionality without much risk of breaking
           | mature features.
        
         | toomuchtodo wrote:
         | A use case where you're executing arbitrary code provided by
         | users and you don't want to have to maintain the environment
         | for doing so (reliability, security boundaries, etc).
        
         | timenova wrote:
         | A reasonably priced serverless (kinda) setup is possible on
         | Fly.io.
         | 
         | Fly charges for a VM by the second, and when a VM is off, RAM
         | and CPU are not charged (storage is still charged). They also
         | allow you to easily configure shutting down machines when there
         | are no active requests (right from their config file
         | `fly.toml`), and support a more advanced method which involves
         | your application essentially terminating itself when there's no
         | work remaining, which kills the VM. When a new request arrives,
         | it starts back up.
         | 
         | Here are the docs [0]. And here's a blog post on how to
         | terminate the VM from inside a Phoenix app for example [1].
         | 
         | So essentially, you can write an app which processes multiple
         | requests on the same VM (so not really serverless), but also
         | saves costs when it's not in use (essentially the promise of
         | serverless).
         | 
         | [0] https://fly.io/docs/apps/autostart-stop/
         | 
         | [1] https://fly.io/phoenix-files/shut-down-idle-phoenix-app/
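         | 
         | The exit-on-idle pattern itself is simple. A rough Python
         | sketch (the timings and handler hook are made up; see the Fly
         | docs above for the real mechanics):
         | 
         |     import os, threading, time
         | 
         |     last_hit = time.monotonic()
         | 
         |     def reaper(idle_secs=60):
         |         while True:
         |             time.sleep(5)
         |             if time.monotonic() - last_hit > idle_secs:
         |                 # exiting kills the VM; Fly boots a new one
         |                 # when the next request arrives
         |                 os._exit(0)
         | 
         |     threading.Thread(target=reaper, daemon=True).start()
         | 
         |     # and in each request handler:
         |     #     global last_hit
         |     #     last_hit = time.monotonic()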
        
           | cyanf wrote:
           | That's an on-demand server, not serverless.
        
         | kikimora wrote:
         | Your devs would each need one such VM for testing. And at the
         | very least you'll need 2 VMs: staging and production.
        
       | jeswin wrote:
       | Given that AWS still has no billing caps (despite it being one of
       | the most requested features), you're exposing yourself to
       | uncapped downside.
       | 
       | In addition to lambdas being a poor architectural choice in most
       | cases, that is.
        
         | insanitybit wrote:
         | AFAIK you can pretty easily cap the number of concurrent lambda
         | executions. Of all of AWS's services, Lambda is probably the
         | easiest one to configure limits on.
        
         | dmattia wrote:
         | For lambdas in particular, you can set reserved concurrency,
         | which caps how many instances of a particular lambda can run
         | concurrently at any point in time:
         | https://docs.aws.amazon.com/lambda/latest/dg/configuration-c....
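         | 
         | e.g. with boto3 (the function name is a placeholder):
         | 
         |     import boto3
         | 
         |     lam = boto3.client("lambda")
         |     # hard-cap this function at 50 concurrent executions;
         |     # invocations beyond that are throttled
         |     lam.put_function_concurrency(
         |         FunctionName="my-function",
         |         ReservedConcurrentExecutions=50,
         |     )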
        
       | vcryan wrote:
       | The conclusions are what is already widely understood about
       | Lambdas. They could have just researched the topic up front and
       | chosen a better architecture.
        
       | ecshafer wrote:
       | The biggest lesson I learned when I was in an org that started
       | using serverless heavily (because it's the future!) is that it's
       | an unmaintainable mess. You end up with code spaghetti, but now
       | it's split across 100 repos that might be hard to find, and
       | designed purely on architecture diagrams.
       | 
       | From what I can see it's basically recreating mainframe batch
       | processing, but in the cloud: X happens, which triggers Y job,
       | which triggers Z job, and so on.
        
         | politelemon wrote:
         | That sounds like microservices in general rather than Lambda,
         | the AWS service, specifically. I've seen the same unsustainable
         | mess with k8s-crazy teams.
         | 
         | The lesson I've learned instead is to start boring and
         | traditional, then use serverless tech when you hit a problem
         | the current setup cannot solve.
        
           | ecshafer wrote:
           | I agree that it's like microservices, but the problem is
           | turned up to 11 with serverless: 1 microservice now becomes
           | 10 lambdas. The issue is fundamentally one of discovery, and
           | as teams churn out more and more functions that aren't all in
           | the same repo, you're bound to lose track of what's happening.
        
         | losteric wrote:
         | Serverless has its faults, but spaghetti code spread across
         | 100 repos is definitely a "user error"...
        
           | ecshafer wrote:
           | How do you solve the discoverability issue when there are
           | 1000 serverless functions written by 10 different teams?
           | Serverless worsens the issue of having knowledge of the
           | entire system, I think, and I don't think this is even
           | solvable.
        
             | inhumantsar wrote:
             | Lambdas are rarely entirely standalone, they support a
             | larger service or glue services together.
             | 
             | Creating one repo per Lambda is going to make things messy
             | of course, just as breaking every little internal library
             | out into its own repo does.
             | 
             | Regardless of the system or what it runs on, it's an easy
             | trap to fall into but it's absolutely solvable with some
             | technical leadership around standards.
        
               | cassianoleal wrote:
               | > Lambdas are rarely entirely standalone, they support a
               | larger service or glue services together.
               | 
               | I wish that were the case in real life. Unfortunately,
               | the trend I've been noticing is to run anything that's an
               | API call in Lambda, and then chain multiple Lambdas in
               | order to process whatever is needed for that API call.
        
             | latchkey wrote:
             | One serverless function is effectively an HTTP router that
             | knows how to call the appropriate code path to reach the
             | 1000 handlers.
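             | 
             | A minimal sketch of that pattern (the routes and handlers
             | are made up):
             | 
             |     def list_users(event):  # one of the 1000 handlers
             |         return {"statusCode": 200, "body": "[]"}
             | 
             |     ROUTES = {
             |         ("GET", "/users"): list_users,
             |         # ... hundreds more entries
             |     }
             | 
             |     def handler(event, context):
             |         fn = ROUTES.get((event["httpMethod"], event["path"]))
             |         if fn is None:
             |             return {"statusCode": 404, "body": ""}
             |         return fn(event)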
        
       | politelemon wrote:
       | I think this does a decent job of talking about the tech without
       | hyping up or shitting on it, just speaking matter-of-factly. They
       | seem to have found a decent fit for the tech, and are able to
       | recognize where it works and doesn't work, and are still able to
       | 'diversify' to other things. Good post, thanks for sharing.
        
       | robbles wrote:
       | > 10GB is hardly enough. Once you import Pandas you're on the
       | limit. You can forget Pandas and scipy at the same Lambda.
       | 
       | This sounds way off to me. 10 GB to install a Python library?
        
       | acd wrote:
       | The name serverless is misleading. Of course there are functions
       | written in programming languages that run on real physical
       | servers in data centers.
       | 
       | This should be called Function as a Service, acronym FaaS.
        
       | hobobaggins wrote:
       | > serverless.. as long as we have enough money to throw on it
       | 
       | This one speaks truth.
        
       ___________________________________________________________________
       (page generated 2023-11-12 23:02 UTC)