       ___________________________________________________________________
        
       Using AZs can eat up your budget - From Prometheus to
       VictoriaMetrics
        
       Author : shscs911
       Score  : 65 points
       Date   : 2024-12-26 08:09 UTC (3 days ago)
        
 (HTM) web link (engineering.prezi.com)
 (TXT) w3m dump (engineering.prezi.com)
        
       | tomalaci wrote:
        | I've used VictoriaMetrics in the past (~4 years ago) to
        | collect not just service monitoring data but also network
        | switch and cell tower module metrics. At the time I found it
        | to be the most efficient Prometheus-like service in terms of
        | query speed, data compression and, more importantly, the
        | ability to handle high cardinality (tens or hundreds of
        | millions of series).
       | 
        | However, I later switched to Clickhouse because I needed the
        | extra flexibility of running occasional async updates or
        | deletes. In VictoriaMetrics you usually need to wipe out the
        | entire series and re-ingest it. That may not be possible, or
        | would be quite annoying, if you are dealing with a long
        | history and just want to update/delete some bad data in a
        | single month.
       | 
        | So, if you want a more efficient Prometheus drop-in
        | replacement and don't think the limited update/delete ability
        | is an issue, then I highly recommend VictoriaMetrics.
        | Otherwise, Clickhouse (larger scale) or Timescale (smaller
        | scale) has been my go-to for anything time series.
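        | 
        | To illustrate: the delete API VictoriaMetrics exposes drops
        | whole series matching a selector rather than a time range, so
        | fixing one bad month means deleting and re-ingesting the full
        | history. A minimal sketch in Python (the endpoint is real; the
        | host, port and metric selector are placeholders):
        | 
        |     import requests
        | 
        |     # Single-node VictoriaMetrics; host/port and the selector
        |     # below are illustrative placeholders.
        |     VM = "http://localhost:8428"
        | 
        |     # Drops EVERY sample of the matching series, not just a
        |     # time range -- the limitation described above.
        |     resp = requests.post(
        |         f"{VM}/api/v1/admin/tsdb/delete_series",
        |         params={"match[]": '{__name__="switch_rx_bytes"}'},
        |     )
        |     resp.raise_for_status()
        | 
        |     # The corrected history must then be re-ingested in full,
        |     # e.g. via the /api/v1/import endpoint.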
        
         | brunoqc wrote:
          | Btw, both Clickhouse and Timescale are open core, if you
          | care about that.
        
           | thayne wrote:
           | So is VictoriaMetrics
        
             | brunoqc wrote:
             | You are right. I guess I just saw the Apache 2 license and
             | assumed it was foss.
        
           | hipadev23 wrote:
            | Is there a reason you drop this comment on every product
            | mention that's not 100% OSS?
        
             | presspot wrote:
             | Because it's helpful and adds context? Why do you care?
        
               | hipadev23 wrote:
               | Because it's frustrating. They do it to belittle the
               | projects and shame the authors for trying to make a
               | living.
               | 
                | Not every project wants to end up as another piece of
                | bloated abandonware in the Apache Software Foundation.
        
               | simfree wrote:
               | FOSS washing software is similarly frustrating.
               | 
               | When I see a license on a project I expect that project
               | will provide the code under that license and function
               | fully at runtime, not play games of "Speak to a sales rep
               | to flip that bit or three to enable that codepath".
        
       | raffraffraff wrote:
       | I'd love to see a comparison with Mimir. Some of the problems
       | that this article describes with Prometheus are also solved by
       | Mimir. I'm running it in single binary mode, and everything is
       | stored in S3. I'm deploying Prometheus in agent mode so it just
       | scrapes and remote writes to Mimir, but doesn't store anything.
       | The helm chart is a bit hairy because I have to use a fork for
       | single binary mode, but it has actually been extremely stable and
       | cheap to run. The same AZ cost saving rules apply, but my traffic
       | is low enough right now for it not to matter. But I suppose I
       | could also run ingesters per AZ to eliminate cross-AZ traffic.
        
         | thelittleone wrote:
         | Interesting. I'm fairly new to the field, but would this
         | configuration help reduce the cost of logging security events
         | from multiple zones/regions/providers to a colocated cluster?
        
           | raffraffraff wrote:
           | Not really. On AWS, you're always going to pay an egress cost
           | to get those logs out of AWS to your colo. If you were to
           | ship your security logs to S3 and host your security log
           | indexing and search services on EC2 within the same AWS
           | region as the S3 bucket, you wouldn't have to worry about
           | egress.
        
         | FridgeSeal wrote:
         | I was on a team once where we ran agent-mode Prometheus into a
         | Mimir cluster and it was endless pain and suffering.
         | 
         | Parts of it would time out and blow up, one of the dozen
         | components (slight hyperbole) they have you run would go down
         | and half the cluster would go with it. It often had to be
          | nursed back to health by hand, it was expensive to run, and
          | queries were not even that fast.
         | 
         | Absolutely would not repeat the experience. We cheered the
         | afternoon we landed the PR to dump it.
        
           | raffraffraff wrote:
           | I definitely think that running the microservices deployment
           | of Mimir (and Loki) looks hairy. But the monolithic
           | deployments can handle pretty large volumes.
        
       | dantillberg wrote:
        | This excessive inter-AZ data transfer pricing is distorting
        | engineering best practices. It _should_ be cheap to operate HA
        | systems across 2-3 AZs, but because of this price distortion,
        | we lean towards designs that either silo data within an AZ or
        | leverage S3 and other hosted services as a sort of accounting
        | workaround (i.e. there are no data transfer charges to
        | read/write an S3 bucket from any AZ in the same region).
       | 
        | While AWS egress pricing gets a lot of attention, I think the
        | high cost of inter-AZ traffic is much less defensible. This is
        | transfer over short, fat pipes completely owned by Amazon. And
        | at $0.01/GB, it's 2-10x what smaller providers charge for
        | _internet_ egress.
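        | 
        | For a sense of scale, a rough back-of-the-envelope in Python
        | (assuming the usual $0.01/GB billed in each direction, i.e.
        | $0.02/GB effective between two instances; the throughput
        | figure is hypothetical):
        | 
        |     # Steady cross-AZ replication at a sustained 1 Gbps.
        |     gbps = 1
        |     gb_per_month = gbps / 8 * 3600 * 24 * 30  # ~324,000 GB
        |     cost = gb_per_month * 0.02                # out + in
        |     print(f"${cost:,.0f}/month")              # ~$6,480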
        
         | hipadev23 wrote:
         | I assume it's to discourage people from architecting designs
         | that abuse the network. A good example would be collecting
         | every single metric, every second, across every instance for no
         | real business reason.
        
           | thayne wrote:
           | Or maybe it is price discrimination. A way to extract more
           | money from customers that need higher availability and
           | probably have higher budgets.
        
             | hipadev23 wrote:
             | Price discrimination is when you charge different amounts
              | for the same thing to different customers. And usually
              | the difference in those prices is not made apparent,
              | like when travel websites quote iOS users higher prices
              | than Android users because they can generally afford to
              | pay more.
             | 
             | This is just regular ole pricing.
        
               | thayne wrote:
               | So what is the correct term for "charge an extremely high
               | markup for a feature that some, but not all, of your
               | customers need"?
        
               | spondylosaurus wrote:
               | Price gouging?
        
               | jchanimal wrote:
               | I came here to say the same thing. When you're selling
               | cloud services, the hardest thing to do is segment your
               | customers by willingness to pay.
               | 
               | Cross AZ traffic is exactly the sort of thing companies
               | with budgets need, that small projects don't.
        
               | mcmcmc wrote:
               | Supply and demand
        
               | hansvm wrote:
               | It's a bit of a mix, but price discrimination isn't far
               | off. It's like the SSO tax; all organizations are paying
               | for effectively the same service, but the provider has
               | found a minor way to cripple the service that selectively
               | targets people who can afford to pay more.
               | 
               | If we want to call this just regular ole pricing, it's
               | not a leap to call most textbook cases of price
               | discrimination "regular ole pricing" as well. An online
               | game charges more if your IP is from a certain geography?
               | That's not discrimination; we've simply priced the
                | product differently if you live in Silicon Valley;
                | don't buy it if you don't want it.
        
             | cowsandmilk wrote:
             | Sending traffic between AZs doesn't necessarily improve
                | availability and can decrease it. Each of your
                | services can be multi-AZ, but with hosts that talk
                | only to other endpoints in their own AZ.
        
               | thayne wrote:
               | Unless your app is completely stateless, you will need
               | some level of communication across AZs.
               | 
               | And you often want cross-zone routing on your load
                | balancers so that if you lose all the instances in
                | one AZ, traffic will still get routed to healthy
                | instances.
        
             | KaiserPro wrote:
             | > Or maybe it is price discrimination.
             | 
              | It very much is, because scaling bandwidth between
              | physical datacenters that are not located next to each
              | other is very expensive. So pricing it means that people
              | don't use it as much as they would if it were free.
        
               | mcmcmc wrote:
               | That's not what price discrimination is
        
           | koolba wrote:
            | You can still do that if you buffer it and push it to S3.
            | Traffic from any AZ to S3 is free, and the net result is
            | the same.
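            | 
            | A minimal sketch of that buffering pattern (boto3's
            | put_object is the real API; the bucket name, key scheme
            | and batch size are placeholders):
            | 
            |     import json, time
            |     import boto3
            | 
            |     # S3 as the cross-AZ hand-off point: writers in any
            |     # AZ upload batches, readers in any AZ fetch them,
            |     # and no inter-AZ transfer is billed.
            |     s3 = boto3.client("s3")
            |     buf = []
            | 
            |     def record(metric: dict) -> None:
            |         buf.append(metric)
            |         if len(buf) >= 10_000:   # flush in big batches
            |             flush()
            | 
            |     def flush() -> None:
            |         body = "\n".join(json.dumps(m) for m in buf)
            |         s3.put_object(
            |             Bucket="example-metrics-buffer",
            |             Key=f"batch/{int(time.time()*1000)}.ndjson",
            |             Body=body.encode(),
            |         )
            |         buf.clear()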
        
         | KaiserPro wrote:
          | _I don't work for AWS._
          | 
          | However, I do work for a company with >1 million servers.
          | Scaling inter-datacentre bandwidth is quite hard. Sure, the
          | datacentres might be geographically close, but laying
          | network cables over distance is expensive. Moreover, unless
          | you spend uber millions, you're never going to get as much
          | bandwidth as you have inside the datacentre.
         | 
         | So you either apply hard limits per account, or price it so
         | that people think twice about using it.
        
           | themgt wrote:
            | OK, but $10/TB has gotta be like >99% profit margin for
            | AWS. _After_ massively jacking up their prices, Hetzner
            | internet egress is only EUR1/TB. Also, AWS encourages, and
            | in some cases practically forces, you to go multi-AZ.
           | 
            | I remember switching to autoscaling spot instances to save
            | a few bucks; then spot spinup would occasionally fail due
            | to lack of availability within an AZ, so I enabled
            | multi-AZ spot. Then I got hit with the inter-AZ bandwidth
            | charges and wasn't actually saving any money vs single-AZ
            | reserved instances. This was about the point I decided DIY
            | Kubernetes was simpler to reason about.
        
             | everfrustrated wrote:
              | Apples and oranges. Hetzner doesn't even have multiple
              | AZs by AWS's definition - all of Hetzner's DCs, e.g.
              | Falkenstein 1-14, would be the same AZ.
             | 
              | AWS's network is designed with a lot more internal
              | capacity and reliability than Hetzner's, which costs a
              | lot more: multiple uplinks to independent switches, etc.
             | 
              | AWS is also buying current-gen network gear, which is
              | much more pricey - Hetzner is mostly doing 1 Gig ports,
              | or 10 Gig at a push, which means they can get away with
              | >10-year-old switches (if you think they buy new
              | switches, I have a bridge you might be interested in
              | buying). This costs at least an order of magnitude more.
        
               | iscoelho wrote:
               | I agree with this post that Hetzner is a bad example.
               | They are focused on a budget deployment.
               | 
                | I do not agree that a state-of-the-art, high-capacity
                | deployment is as expensive as you think it is. If an
               | organization pays MSRP on everything, has awful
               | procurement with nonexistent negotiation, and multiple
               | project failures, sure, maybe. In the real world though,
               | we're not all working for the federal government (-:
        
               | dantillberg wrote:
               | While your caveats are all noteworthy, I'll add that
               | Hetzner also offers unlimited/free bandwidth between
               | their datacenters in Germany and Finland. That's sort of
               | like AWS offering free data transfer between us-east-1
               | and us-east-2.
        
           | iscoelho wrote:
            | In Ashburn, VA, I can buy dark fiber for $750 MRC (monthly
            | recurring cost) to any datacenter in the same city, and
            | for $3-5K MRC to any random building in the same city.
            | 
            | That duplex dark fiber with DWDM can run 4 Tbps of
            | capacity at 100GE (40x 100GE). Each 100GE transceiver
            | costs $2-4K NRC (non-recurring cost) depending on the
            | manufacturer - $160K NRC for 40x. (There are higher
            | densities as well, like 200/400/800GE; 100GE is just
            | getting cheap.)
            | 
            | In AWS, fully utilizing 1x 100GE will cost you >$1MM MRC.
            | For significantly less than that - let's say an absolutely
            | worst-case $5K MRC + $200K NRC - you can get 40x 100GE.
           | 
           | Now you have extra money for 4x redundancy, fancy routers,
           | over-spec'd servers, world-class talent, and maybe a yacht if
           | your heart desires.
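            | 
            | Sanity-checking that >$1MM figure (assuming $0.01/GB
            | billed on each side, i.e. $0.02/GB effective, and a
            | saturated duplex link):
            | 
            |     # Saturated 100GE duplex priced as AWS inter-AZ
            |     # traffic: 100 Gbps each way, $0.02/GB effective.
            |     gbits_per_sec = 100 * 2
            |     gb_moved = gbits_per_sec / 8 * 3600 * 24 * 30
            |     print(f"${gb_moved * 0.02:,.0f}/month")  # ~$1.3MM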
        
             | bobbob1921 wrote:
              | I'm just throwing out a hypothetical, so I may be
              | completely off base: perhaps AWS charges high inter-AZ
              | bandwidth prices to keep users from tunneling traffic
              | between availability zones to arbitrage lower
              | internet/egress costs in AZ 1 vs AZ 3.
              | 
              | Outside of my statement above, I do agree that the cost
              | Amazon pays for bandwidth between their sites has to be
              | practically nothing at their scale/size (and thus they
              | should charge their customers very little for it,
              | especially considering easy multi-AZ is a big
              | differentiator for cloud vs self-hosting/colo). The user
              | above's dark fiber MRC prices are spot on.
        
           | dilyevsky wrote:
            | You should do the math, though, because it's expensive but
            | nowhere near $0.01/GB expensive.
        
       | thayne wrote:
       | > while it's tempting to use the infinitely-scalable object
       | storage (like S3), the good old block storage is just cheaper and
       | more performant
       | 
        | How is it cheaper? Object storage is cheaper per GB. Does
        | using S3 have another component that is more expensive, maybe
        | a caching layer? Is the storage format significantly less
        | efficient? Are you not using a VPC endpoint to avoid egress
        | charges?
        
         | jdreaver wrote:
          | You are correct that storage is cheaper in S3, but S3
          | charges per request for GET, LIST, POST, COPY, etc. on
          | objects in your bucket. Block storage can be cheaper when
          | you are frequently modifying or querying your data.
        
           | thayne wrote:
           | That's a lot of requests.
        
             | hansvm wrote:
              | It is, but it's not _that_ many. AWS pricing is
              | complicated, but for fairly standard services, and
              | assuming bulk discounts at the ~100TB level, your
              | break-even points for requests/network vs storage happen
              | at:
             | 
             | 1. (modifications) 4200 requests per GB stored per month
             | 
             | 2. (bandwidth) Updating each byte more than once every 70
             | days
             | 
             | You'll hit the break-even sooner, typically, since you
             | incur both bandwidth and request charges.
             | 
             | That might sound like a lot, but updating some byte in each
             | 250KB chunk of your data once a month isn't that hard to
             | imagine. Say each user has 1KB of data, 1% are active each
             | month, and you record login data. You'll have 2.5x the
             | break-even request count and pay 2.5x more for requests
             | than storage, and that's only considering the mutations,
             | not the accesses.
             | 
             | You can reduce request costs (not bandwidth though) if you
             | can batch them, but that's not even slightly tenable till a
             | certain scale because of latency, and even when it is you
             | might find that user satisfaction and retention are more
             | expensive than the extra requests you're trying to avoid.
             | Batching is a tool to reduce costs for offline workloads.
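              | 
              | A quick sketch of that arithmetic (the prices are
              | assumptions, roughly S3 list pricing with volume
              | discounts at the time):
              | 
              |     # Break-even estimates above, from assumed prices:
              |     storage = 0.021          # $/GB-month at volume
              |     put_req = 0.005 / 1000   # $/PUT-class request
              |     egress = 0.05            # $/GB, discounted tier
              | 
              |     # 1. Writes per stored GB per month before request
              |     #    charges equal storage charges:
              |     print(storage / put_req)        # ~4,200 requests
              | 
              |     # 2. Days between full re-transfers of each byte
              |     #    before bandwidth charges equal storage:
              |     print(30 * egress / storage)    # ~71 days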
        
       | nathan_jr wrote:
        | Is it possible to query CloudWatch to calculate the current
        | cost attributed to inter-AZ traffic?
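        | 
        | (CloudWatch itself doesn't expose billing data, but the Cost
        | Explorer API can filter by usage type. A hedged boto3 sketch;
        | the usage-type names are assumptions and vary by region, so
        | confirm the exact strings in your own bill:)
        | 
        |     import boto3
        | 
        |     # Inter-AZ traffic shows up under usage types like
        |     # "DataTransfer-Regional-Bytes" (region-prefixed outside
        |     # us-east-1, e.g. "USW2-DataTransfer-Regional-Bytes").
        |     ce = boto3.client("ce")
        |     resp = ce.get_cost_and_usage(
        |         TimePeriod={"Start": "2024-12-01",
        |                     "End": "2024-12-29"},
        |         Granularity="MONTHLY",
        |         Metrics=["UnblendedCost"],
        |         Filter={"Dimensions": {
        |             "Key": "USAGE_TYPE",
        |             "Values": ["DataTransfer-Regional-Bytes"],
        |         }},
        |         GroupBy=[{"Type": "DIMENSION",
        |                   "Key": "USAGE_TYPE"}],
        |     )
        |     for g in resp["ResultsByTime"][0]["Groups"]:
        |         print(g["Keys"],
        |               g["Metrics"]["UnblendedCost"]["Amount"])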
        
       | sgarland wrote:
       | My only hope is that as more and more companies find stuff like
       | this out, there will be a larger shift towards on-prem / colo.
       | Time is a circle.
        
       ___________________________________________________________________
       (page generated 2024-12-29 23:01 UTC)