[HN Gopher] Leveraging mispriced AWS spot instances
___________________________________________________________________
Leveraging mispriced AWS spot instances
Author : ericpauley
Score : 105 points
Date : 2022-10-21 13:33 UTC (9 hours ago)
(HTM) web link (pauley.me)
(TXT) w3m dump (pauley.me)
| bluelightning2k wrote:
| The article makes a HUGE assumption.
|
| They spot an inconsistency between two prices, and decide that
| the fair market value must be the very highest part of the
| spread. Anything under this is therefore "under-priced".
|
| Is it not possible instead that people are _overpaying_ for the
| popular ones through sub-optimal bids, rather than simply
| assuming that only these inflexible / least sophisticated bidding
| strategies represent the fair market value?
|
| They actually go further. They assume that AWS could realize this
| value, and that encouraging more flexible bids through tooling
| etc. would move everyone to the top of the spread, instead of
| smoothing it out towards the average. And that what is
| essentially a price-increase can be achieved without hurting the
| overall value (price-performance vs. flexibility). Given the
| entire point of this is auctioning unused cycles at a discount,
| clearly any overall increase would decrease the overall demand.
|
| Having said this it's a great article. I think the overall
| quality of the article made it so surprising to see this missed.
| bushbaba wrote:
| There's underlying capacity as well. Would you rather pay a bit
| more to get 100 r6g.4xl OR pay a bit less to have 90 r6g.4xl +
| 10 r6gd.4xl.
|
| Many workloads do not have deployment configuration supporting
| a non-homogeneous fleet of instances. Over time this will be
| addressed, but it could be a current major contributor to the
| discrepancies viewed.
| ericpauley wrote:
| Author here. This is definitely a big assumption. I cut the
| price differences in half to account for market movement, but
| the price difference could definitely be more or less
| especially as these pools are probably thinner markets.
| bluelightning2k wrote:
| As I said - it's a great article! This was just one thing I
| noticed which I pointed out as it made me think.
|
| Keep up the good work
| benlivengood wrote:
| Interestingly GCP already offers over 75% discounts for n2d (AMD)
| spot instances that don't rely on any internal market, and the
| discounts for other families are fairly close.
|
| We see individual spot instances go away every few days which
| works pretty well for GKE. The older preemptible class of
| instances restarted every 24 hours which was more of a pain
| (mitigated a bit with a preemptible killer to spread the restarts
| out).
| plantain wrote:
| I've spent a lot of time trying to capitalize on these mispricings
| - and often they're priced like that because the capacity in that
| region/configuration is much lower and you are exposed to
| more preemptions than in higher priced region/configurations.
| JoachimSchipper wrote:
| > Compute-optimized (C) instances can substitute for
| general-purpose (M) instances of half the size
|
| They do have the same amount of memory (and twice the CPU). But
| if you run a workload that automatically scales to the number of
| available cores, starting twice the number of processes / threads
| might well run you out of memory.
|
| The article is interesting, but blindly running your code on
| unexpected instance types may be more "exciting" than the author
| makes it sound.
| ericpauley wrote:
| Author here. If you design the workload you can ignore the
| extra cores. You can actually hide these CPU cores from
| instances within the AWS API (see setting instance vCPU) so it
| is truly transparent.
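
A workload-side guard for the concern raised above can be sketched as follows: cap the number of workers by memory as well as by core count, so a substitute instance with twice the cores but the same memory does not start twice the processes. The function name `worker_count` and the 4 GiB-per-worker figure are assumptions for illustration, not anything from the article.

```python
import os


def worker_count(mem_total_gib, mem_per_worker_gib, cores=None):
    """Return how many workers fit, bounded by both cores and memory."""
    if cores is None:
        # Fall back to the cores visible to this process's host.
        cores = os.cpu_count() or 1
    by_memory = int(mem_total_gib // mem_per_worker_gib)
    # Never return fewer than one worker.
    return max(1, min(cores, by_memory))


# m6g.xlarge has 4 vCPUs and 16 GiB; c6g.2xlarge has 8 vCPUs and 16 GiB.
# With a hypothetical 4 GiB per worker, both end up with the same count.
print(worker_count(16, 4, cores=4))
print(worker_count(16, 4, cores=8))
```

With this kind of cap the extra cores on the compute-optimized substitute simply go unused instead of exhausting memory.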
| hosh wrote:
| One thing to note is volatility. Spot instances are great for
| workloads that can absorb spot instance interruptions, and those
| interruptions tend to happen more if everyone else is trying to
| get spot instances at that time. Stateless web workloads that can
| startup and shutdown fast are a good example.
|
| Some workloads might not. You wouldn't want to run stateful
| workloads on spot, for instance. In our case, we have something
| that doesn't handle bootup under load very well, and until we can
| improve that, the overall reliability is not as good.
|
| I also like GCP's way of pricing these: you say whether your
| workload is preemptible or not, and you get discounts. You
| automatically get discounts if you run the workload for a long
| time.
| StratusBen wrote:
| As someone who spends entirely too much time thinking about cloud
| infrastructure costs ( I'm co-founder of https://www.vantage.sh/
| which maintains https://ec2instances.info/ ) I just want to
| recognize the amount of effort that went into this blog post to
| collect the data and express an interesting perspective for a
| fairly complicated topic.
|
| Kudos to the author on producing this.
| pradn wrote:
| Great work on vantage.sh and ec2instances.info!
|
| Quick, small fix: This instance is shown as having 0 GBs of
| memory, but in fact it has 0.5 GB.
| https://instances.vantage.sh/aws/ec2/t4g.nano
| StratusBen wrote:
| It's community supported! We just pay the bills and maintain
| hosting it :)
|
| Do you mind opening an issue on the repo here?
| https://github.com/vantage-sh/ec2instances.info
|
| Thank you for the report!
| awsthrowawy5767 wrote:
| You should know we use this tool _inside_ of AWS as well. Not
| in EC2 itself, but many many other places
| alex_duf wrote:
| Oh I've used ec2instances.info very _very_ often, so thank you
| for that. So useful!
| kureikain wrote:
| I run a service with an API that can help get spot prices:
| https://ec2.shop/
|
| Simply do:
|
| curl 'https://ec2.shop?region=us-west-2&filter=m5&json' | jq
|
| You can pipe it into whatever store your system uses to get the
| real-time price without dealing with the AWS Price API
| Havoc wrote:
| Quite surprised that there is this degree of mispricing. I would
| have thought it's a market that is big and diverse enough to iron
| that out. Especially given that the participants in question
| would tend toward the analytical side of things
| milesvp wrote:
| I was thinking the same thing. I'm wondering if the price
| differences are reflecting a general demand for certain sizes?
| When I was maintaining AWS servers, I don't think it would have
| been easy for me to take advantage of spot prices that were
| outside of the sizes I was already using. I'd tuned things such
| that I knew the sizes I tended to need to have the redundancy I
| needed, and then could auto scale when necessary. Which means,
| I would never have bid on spot instances that were bigger than
| what I needed, because it would have been way more complicated
| to analyze the state of the system as a whole and make sure
| scaling happened when it needed to. Which also introduces risk
| that probably was never worth the savings. So if you had a lot
| of people like me, you'd get m3.large (or whatever current
| naming) as the thing that gets bid up the most, because it hit
| an autoscaling sweet spot
| Havoc wrote:
| > it would have been way more complicated to analyze the
| state of the system as a whole and make sure scaling happened
| when it needed to
|
| Yeah that's probably what's going on here. Complexity & that
| it's just a bit counterintuitive
| pid-1 wrote:
| > For example, you can make all these substitutions:
|
| >c6g.2xlarge-c6g.4xlarge-m6g.4xlarge-r6g.4xlarge-r6gd.4xlarge
|
| A long standing ticket in my personal project backlog is
| comparing different instance types' performance. I'm not sure this
| equivalence is without caveats.
|
| Anyhow, the reason "misprices" exist is because:
|
| - Many AWS products are elastic but only allow one to choose a
| single instance type. So you need to guess the best instance for
| a workload and stick with it.
|
| - No AWS product exposes a "Just give me the cheapest VM with x
| CPU and Y memory" API
| universa1 wrote:
| The end of the article shows a request for just that, doesn't
| it? No clue about the API though...
|
| Depending on your workload you might be able to actually
| substitute a single 8xlarge with two 4xlarge for example... A
| while back I was actually doing something like that to save
| some money :-)
| pid-1 wrote:
| Wow totally missed that. Cool stuff!
| alFReD-NSH wrote:
| Autoscaling group with mixed instance types spot strategy does
| that. You can even give weights to instance types, giving more
| performant/higher capacity higher weight and it can choose the
| cheapest one with the weight in mind.
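
The weighting idea described above boils down to comparing price per unit of weight rather than raw price. A minimal sketch, assuming made-up spot prices and weights (none of these numbers come from AWS or the article, and `cheapest_per_unit` is a hypothetical helper, not an AWS API):

```python
def cheapest_per_unit(pools):
    """Return the pool name with the lowest price per unit of weight.

    pools maps an instance type to (spot_price_per_hour, weight).
    """
    return min(pools, key=lambda name: pools[name][0] / pools[name][1])


# Hypothetical prices; an 8xlarge is weighted as two 4xlarge units.
pools = {
    "m6g.4xlarge": (0.30, 1),
    "m6g.8xlarge": (0.50, 2),
    "r6g.4xlarge": (0.28, 1),
}
print(cheapest_per_unit(pools))
```

Here the larger instance wins because its per-unit price (0.25) undercuts both 4xlarge pools, even though its raw hourly price is the highest.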
| pclmulqdq wrote:
| AWS doesn't want you to have that API! That's a significant
| part of their margin.
| jftuga wrote:
| I wrote a program to get AWS spot instance pricing. This program
| is similar to using "aws ec2 describe-spot-price-history" but is
| faster and has a few more options.
|
| https://github.com/jftuga/spotprice
| Moissanite wrote:
| It is completely incorrect to characterize these observations as
| "mispricing" - this is a quirk of automatically-determined prices
| across very different products. If the author actually tried to
| use these instances in any significant volume they would
| understand the driver - capacity pools are nowhere near equal,
| and not as interchangeable for AWS as the article implies they
| would be for a user. Prices reflect demand munged with available
| capacity - uncommon instance types are uncommon precisely because
| they aren't used as much, so there aren't the same signals to
| drive the price up and down automatically.
|
| Instances with attached NVMe are available in much lower volumes
| than others, as are AMD instances. Obviously these pools cannot
| be used as a drop-in replacement for non-"d" instances or Intel
| families.
| ericpauley wrote:
| Author here. The key here is that customers can leverage these
| pools in addition to their existing pools, improving capacity
| and price. AWS actually supports this out of the box (including
| substituting instances with drives) by specifying core and
| memory requirements directly instead of instance types.
| snake_doc wrote:
| > Across all AWS availability zones instances are mispriced
| by roughly $400/hr at any given time. This means that, with
| just a single instance of each type, Amazon is missing out on
| $200/hr or roughly $1.7 million each year. This is over
| roughly 15,000 pools of instances. Given Amazon controls
| roughly 100 million IPs, we can guess that each instance pool
| probably has on the order of 1000 instances (more for smaller
| instances, less for larger instances). Given this, the
| average mispriced pool might have hundreds of instances,
| meaning hundreds of millions each year in missed revenue due
| to mispriced spot instances. Because amazon keeps their
| number of instances a secret, it's difficult to make a
| precise estimate from the outside, but the missed revenue
| probably falls somewhere in this range.
|
| You are hypothesizing that the price differences produce
| "lost" revenues.
|
| An alternative hypothesis can be that the price differences
| produce similar or higher level of revenues for AWS through
| price segmentation, with Amazon recognizing the lack of
| adoption of certain spot instance bidding features and
| auction markets reacting appropriately.
|
| Unless you have the capacity and quantity demanded for each
| instance type, you can't prove your hypothesis. You are
| assuming scenario 3 (below) with no insights into price
| elasticity of the underlying customers.
|
| Example:
|
| Baseline:
|
| Instance types A and B are equivalent.
|
| A is priced at $3, with capacity of 1000, quantity demanded
| of 800. B is priced at $2, with capacity of 1000, quantity
| demanded of 200. Total quantity demanded = 1,000.
|
| Revenues from instance type A = $3 x 800 = $2,400
| Revenues from instance type B = $2 x 200 = $400
|
| Total revenues = $2,800
|
| Scenario 1: All customers purchase instance B instead due to
| better price discovery.
|
| Revenues from instance type A = $3 x 0 = $0
| Revenues from instance type B = $2 x 1,000 = $2,000
| Total quantity demanded = 1,000.
|
| Total revenues = $2,000
|
| Amazon loses $800 in revenues, there are no "lost" revenues
| recovered.
|
| Scenario 2: Amazon changes instance type B price to $3. Total
| quantity demand decreases to 900 due to price elasticity of
| instance type B customers.
|
| Revenues from instance type A = $3 x 800 = $2,400
| Revenues from instance type B = $3 x 100 = $300
|
| Total revenues = $2,700
|
| Amazon loses $100 in revenues, there are no "lost" revenues
| recovered.
|
| Scenario 3: Amazon changes instance type B price to $3. Total
| quantity demand remains at 1,000.
|
| Revenues from instance type A = $3 x 800 = $2,400
| Revenues from instance type B = $3 x 200 = $600
|
| Total revenues = $3,000
|
| Amazon recovers $200 in "lost" revenues.
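
The baseline and three scenarios above reduce to the same two-pool revenue formula with different inputs. The sketch below only re-runs the commenter's own numbers; the helper name `revenue` is an assumption for illustration.

```python
def revenue(price_a, qty_a, price_b, qty_b):
    """Total revenue across two equivalent instance pools A and B."""
    return price_a * qty_a + price_b * qty_b


baseline = revenue(3, 800, 2, 200)

scenarios = {
    "1 (everyone moves to B at $2)": revenue(3, 0, 2, 1000),
    "2 (B repriced to $3, B demand falls to 100)": revenue(3, 800, 3, 100),
    "3 (B repriced to $3, demand unchanged)": revenue(3, 800, 3, 200),
}

print(f"Baseline: ${baseline:,}")
for name, total in scenarios.items():
    # Show each scenario's total and its difference from the baseline.
    print(f"Scenario {name}: ${total:,} ({total - baseline:+,} vs. baseline)")
```

Only the third scenario, which assumes demand does not react to the price increase, comes out ahead of the baseline.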
| ericpauley wrote:
| The missing component of your analysis is that Amazon has a
| 4th option: re-sell instances of B as instances of A when A
| is more expensive, and otherwise allow the market to
| adjust. The analysis is strictly limited to instances where
| amazon could, in theory, do this (e.g., reselling c6gd as
| c6g).
|
| Assuming the market is in equilibrium, the above scenarios
| aren't realistic, as demand at the market price would equal
| supply at the current price ( _roughly_ , of course).
|
| Suppose there are 1000 c6g and 200 c6gd, with equilibrium
| price of $3 and $2, respectively (i.e., all instances have
| demand). Amazon re-SKUs c6gd as c6g until there are 1100
| c6g selling for $2.90 and 100 c6gd selling at $2.90. Total
| revenue is $3480 vs. $3400. Of course it's impossible to
| know the true numbers without hidden knowledge of the
| market, but this is more akin to what would occur. Amazon
| effectively has a risk-free arbitrage opportunity here, so
| it stands to reason that there is revenue to be made.
| Customers don't have this option (since you can't short
| spot instances), so the best you can do is diversify and
| save money.
|
| Edit: Actually, the AWS spot market is often out of
| equilibrium in a way that makes this reselling _even more
| effective_. For instance, in the example in the article the
| c6gd instance is actually pegged at the minimum price, so
| some number of those instances could be resold as c6g
| without moving the c6gd price _at all_.
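
The re-SKU arithmetic in this reply can be checked with a few lines. This is a sketch that reuses the illustrative pool sizes and prices from the comment (they are the commenter's hypothetical numbers, not real market data); prices are kept in integer cents to avoid floating-point noise, and `total_cents` is a name made up for this example.

```python
def total_cents(pools):
    """Sum revenue over (instance_count, price_in_cents) pairs."""
    return sum(count * price for count, price in pools)


# Before: 1000 c6g at $3.00 and 200 c6gd at $2.00.
before = total_cents([(1000, 300), (200, 200)])
# After re-SKUing 100 c6gd as c6g: both pools clear at $2.90.
after = total_cents([(1100, 290), (100, 290)])

print(f"before: ${before / 100:,.2f}")
print(f"after:  ${after / 100:,.2f}")
print(f"gain:   ${(after - before) / 100:,.2f}")
```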
| snake_doc wrote:
| I think you're thinking about the revenue functions for spot
| instances in isolation of the larger supply base of all
| instances. Spot instances are already a result of revenue
| management of a fixed supply base that increases in
| discrete increments over time. Instance capacity overall
| usually leads instance demand, shortage costs are very
| high in data centers.
|
| Spot instance capacities are a function of the total
| instance capacity for the same type and on-demand
| instance usage. Spot instance pricing can influence the
| quantity demanded of on-demand instances of the same
| type, and vice-versa.
|
| Anyhow, there's no way we can figure out whether you're
| right or wrong with any reasonable level of certainty.
| ericpauley wrote:
| While it's tough to say with certainty how much revenue
| is lost, there is certainly lost revenue. Consider that
| many substitute instances are available at the minimum
| allowable price (i.e., won't go any lower, there is
| unused capacity). These could be resold without moving
| the substitute market.
| pclmulqdq wrote:
| The mispricing is likely good for Amazon. It indicates that
| most people aren't doing this arbitrage, so Amazon can milk
| them for extra money.
| Moissanite wrote:
| Totally agree with that; it is a pretty common approach. The
| only part I don't agree with is calling out the price
| differences as some kind of "gotcha" that AWS somehow missed,
| particularly given the speculative "lost revenue" data which
| have no basis in reality.
| ericpauley wrote:
| See the emphasis on transparent substitutes in the article.
| This analysis is limited _strictly_ to sets of instances
| that are fully hardware compatible, meaning AWS could
| resell one instance as another. There are way more savings
| to be had as a customer by leveraging instances that aren't
| transparent substitutes.
| Moissanite wrote:
| I read it all, and don't agree with your interpretation
| of "transparent substitutes" in several of the cases.
| ericpauley wrote:
| Which instances are not transparent substitutes, in your
| opinion? Keep in mind the definition here is that _Amazon_
| could substitute the image transparently, e.g., by
| ignoring the additional resources in hypervisor, not that
| the instances are by default indistinguishable.
|
| That being said, the substitute instances considered
| could be trivially accepted by any task running on the
| original instance, so long as it doesn't misbehave when
| given too many resources. In the case of vCPU, you can
| even hide extra vCPU cores, so a c6g.2xlarge can be made
| effectively indistinguishable from an m6g.xlarge by
| disabling the vCPUs at the hypervisor level.
| pclmulqdq wrote:
| In financial markets, this quirk of automatically-determined
| prices across different products is frequently called
| "mispricing" when those products logically _should_ have a
| relationship with each other.
|
| Straightforwardly: All hosts with space for a c6gd spot
| instance have space for a c6g instance. If Amazon is willing to
| host a c6gd instance in that slot for $X, they should be
| willing to also host a c6g instance there for $X.
|
| In financial markets, the way this gets handled is through
| arbitrage: someone will buy the equivalent of the c6gd
| instance, and sell the c6g part for the higher price (they may
| also sell the "d" part for even more money). This has the
| effect of "correcting" the price. The AWS spot market does not
| allow you to do arbitrage, and AWS doesn't appear to do the
| arbitrage for you.
|
| AWS probably likes this inefficiency in their market: some
| instance types are more popular than others, and some customers
| make assumptions that require them to use a very specific
| instance type (ie a c6gd would not work as a substitute for
| their c6g instance). However, the vast majority of users
| probably could work just fine if their c6g instance were a
| c6gd, and don't look for the arbitrage opportunity. That means
| Amazon gets paid extra.
| Moissanite wrote:
| > If Amazon is willing to host a c6gd instance in that slot
| for $X, they should be willing to also host a c6g instance
| there for $X.
|
| The reality is that direct c6gd demand might be an order of
| magnitude lower than c6g direct demand - if AWS can get some
| more flexible people to adopt c6gd by offering a lower price,
| c6g capacity is slightly stabilized for on-demand usage by
| people who don't value the flexibility.
|
| Also note that c6g to c6gd has a non-zero switching cost -
| extra NVMe on the instance adds a new source of potential
| hardware failure, increasing the probability of termination
| very slightly. There might be other software-related costs
| depending on whether your application makes any ill-advised
| assumptions about attached storage during setup.
|
| So overall, I would just be happier to read this article if
| it was framed as "PSA: having more features in an ec2
| instance is sometimes cheaper! Don't rule yourself out of
| extra savings by making overly-constrained fleet requests."
| The extra commentary about foregone revenue makes too many
| assumptions and detracts from the core point.
| pclmulqdq wrote:
| The point is that Amazon doesn't have to fill that slot
| with a c6gd. They can also fill it with a c6g. They just
| choose not to.
|
| The fact that you have to host a c6gd to get that price
| instead of a c6g is an inefficiency in the spot market that
| likely makes Amazon money, but is a little customer-
| hostile. I think the article is probably wrong that Amazon
| is foregoing revenue due to this. This is a form of price
| discrimination and it is likely making Amazon money, but in
| a scummy way.
| ericpauley wrote:
| Agreed that it's definitely difficult to know the true
| missed revenue here without internal data, and even then
| you'd be making some assumptions. I am confident there is
| _some_ missed revenue here, as amazon routinely has spot
| capacity constraints under existing prices so could
| definitely sell some substitute instances without moving
| the original instance market (even one instance per pool
| substituted equates to >$1M per year). In either case, a
| savvy organization can definitely benefit from the price
| discrepancy even if Amazon couldn't.
| Moissanite wrote:
| I can agree that there is missed revenue - but
| realistically it would make much more sense to sell that
| capacity via Fargate (which is closer to undifferentiated
| generic compute and RAM) rather than monkeying with the
| spot pricing algorithm.
| ericpauley wrote:
| Great point on Fargate, I'd be very curious on whether
| they select capacity for that from EC2 capacity or if
| there's a separate physical footprint for it.
| phamilton wrote:
| We run a very large installation 100% on spot and have done for a
| few years. We serve our web traffic, do background work, etc. all
| on spot instances.
|
| We see similar mismatched pricing all the time and take advantage
| of it. One additional area not called out here is the difference
| between c5.24xlarge and c5.metal instance pricing. These are
| pretty much identical hardware but metal instances are often
| cheaper.
|
| As you go down this path, do expect to see a lot of weird things
| that you'll have to track down. For example, when we introduced
| metal instances we found that the default ubuntu AMI launched
| with a powersave cpu governor. Non-metal instances don't support
| CPU throttling so it never came up with c5.24xlarges. When we
| first launched metal instances the performance per instance was
| significantly worse and took a bit of work to track down.
|
| Recently we've seen a lot more spot interruptions and it's
| pushing us to incorporate more 6th gen instances to get us more
| diversity. We've also temporarily switched to capacity optimized
| over price optimized and we've enabled capacity rebalancing.
|
| It's absolutely a win for us from a pricing perspective. Our
| traffic is extremely variable each day and very seasonal
| throughout the year. RIs don't make sense given <12 hrs daily
| peak and 10x difference between July and September. However, just
| plan for some odd surprises along the way.
| Moissanite wrote:
| Have you observed metal instances taking longer to boot? I did
| last time I checked, and the difference was big enough to
| affect pricing in a non-trivial way, given that performance is
| the same and that you start paying immediately.
| TheP1000 wrote:
| If you want to leverage cheap spot, use us-east-2 / Ohio region.
| The prices are typically half of what you see in us-east-1.
|
| Also, it really helps to analyze at the AZ level. Certain AZs
| lack instances or have very low spot availability and contrary to
| recommended best practice, reducing AZs can sometimes be
| beneficial (I am looking at you eu-central-1a).
|
| While lowest price sounds nice, they can be really messy in terms
| of spot interruption rate. It is much better to set a max price
| and choose capacity optimized with as many instances as possible.
| playingalong wrote:
| > eu-central-1a
|
| FYI, AZ names are not universal. Your eu-central-1a might be
| someone else's eu-central-1b.
| bscanlan wrote:
| Fun article, the phenomenon is interesting to see in practice,
| I've seen it regularly with newer instance types as it can take
| time for people to add them to their configurations.
|
| We're heavy users of spot here in Intercom. I spot-checked our
| biggest workload, and this week we could have paid around 10%
| less if we were able to get the cheapest spot host possible in
| us-east-1 that is suitable for our workload (all 16xlarge
| Gravitons). However that would be at the cost of fleet stability,
| I think that to run relatively large production services used in
| realtime on spot you need to prioritise fleet stability, so
| choosing the "Capacity Optimized" strategy. We've seen incessant
| fleet churn when trying out cost optimised strategies.
| socialismisok wrote:
| Is there tooling to find the global minimum price for an instance
| with certain characteristics?
|
| I found it easy enough to do that in one region, but I've got
| some compute workloads that just read/write from S3 and are not
| latency sensitive.
|
| They do need 128 GB RAM and ephemeral disks.
| DJBunnies wrote:
| > compute workloads that just read/write from S3
|
| > need 128 GB RAM
|
| Eh?
| tyingq wrote:
| I took "just read/write from S3" to mean that they didn't
| interact with any other AWS services apart from S3. Such that
| they didn't care where in the world it ran.
|
| Not that they didn't do anything memory intensive.
| socialismisok wrote:
| You got it. It's some drone image processing. Read in data
| from S3, do analysis, write results.
| ericpauley wrote:
| Spot fleet requests allow you to set minimum specs for
| instances, and the fleet will be composed of any instances that
| meet the spec. If it's asynchronous work, you could pick lowest
| price allocation and not worry too much about interruptions. In
| fact, if your work is tolerant of interruptions (batch size
| <2min), you can actually save even more by being interrupted,
| as you don't get billed for partial hours:
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/billing-...
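
The "minimum specs" approach mentioned above can be sketched as a request fragment. The field names below (`InstanceRequirements`, `VCpuCount`, `MemoryMiB`, `LocalStorage`) are recalled from the EC2 Fleet attribute-based instance selection API and should be checked against the current AWS documentation before use. The helper `fleet_override` and the 4 vCPU minimum are assumptions; the 128 GB and ephemeral disk requirements come from the question in this thread.

```python
import json


def fleet_override(min_vcpus, min_memory_gib, need_local_disk=False):
    """Build one launch-template override that selects instances by spec."""
    requirements = {
        "VCpuCount": {"Min": min_vcpus},
        # The API expects memory in MiB.
        "MemoryMiB": {"Min": min_memory_gib * 1024},
    }
    if need_local_disk:
        # Only match instance types with instance-store volumes.
        requirements["LocalStorage"] = "required"
    return {"InstanceRequirements": requirements}


override = fleet_override(4, 128, need_local_disk=True)
print(json.dumps(override, indent=2))
```

The resulting dictionary would sit in the overrides list of a fleet request instead of a fixed list of instance types, so any pool meeting the spec becomes a candidate.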
___________________________________________________________________
(page generated 2022-10-21 23:00 UTC)