[HN Gopher] Prometheus, but Bigger
       ___________________________________________________________________
        
       Prometheus, but Bigger
        
       Author : kiyanwang
       Score  : 43 points
       Date   : 2021-06-13 12:25 UTC (10 hours ago)
        
 (HTM) web link (luizrojo.medium.com)
 (TXT) w3m dump (luizrojo.medium.com)
        
       | Ne02ptzero wrote:
       | > For the monthly expenses, with most of the components running
       | on-premises, there was a 90.61% cost reduction, going from US$
       | 38,421.25 monthly to US$ 3,608.99, including the AWS services
       | cost.
       | 
        | I might be missing something, but $3k/month (let alone
        | $38k/month) sounds absolutely insane to me for how few metrics
        | they're collecting (4.5k metrics per second, 2.7 TB of data per
        | year). Is the money going to network bandwidth or something
        | along those lines?
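A quick sanity check on the quoted figures (the dollar amounts and rates come from the quote above; the ~1-2 bytes/sample figure for compressed Prometheus TSDB data is general knowledge, not from the article):

```python
# Figures quoted in the parent comment.
before = 38_421.25   # US$/month before the migration
after = 3_608.99     # US$/month after

reduction = (before - after) / before
print(f"cost reduction: {reduction:.2%}")      # 90.61%, matching the article

# 2.7 TB/year at 4,500 samples/sec works out to ~19 bytes per sample --
# well above the ~1-2 bytes/sample Prometheus's TSDB typically compresses
# to, so the yearly volume is presumably counted before compression.
samples_per_year = 4_500 * 365 * 24 * 3600
bytes_per_sample = 2.7e12 / samples_per_year
print(f"{bytes_per_sample:.1f} bytes/sample")  # ~19.0
```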
        
         | lcw wrote:
          | I agree. I'm working with a metrics system that ingests just
          | over 1 million metrics a second, and it has a similar run rate
          | of ~$38k a month.
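For scale, a rough cost-per-ingest comparison of the two systems (reading the parent's "~38k a month" as US$38,000; both figures are monthly run rates taken from the comments above):

```python
# Dollars per month per (sample/sec) of ingest, from the figures above.
article = 38_421.25 / 4_500        # the article's pre-migration setup
parent = 38_000 / 1_000_000        # the parent commenter's system

print(f"article setup is ~{article / parent:.0f}x pricier per sample")  # ~225x
```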
        
         | ericbarrett wrote:
         | AWS, for example, charges for cross-AZ data transfers. Naive
         | setups with multiple AZs (us-west-1a, 1b, etc.) and a
         | centralized Prometheus setup will rack up quite the cost.
        
           | marcinzm wrote:
            | They say they ingest 226 GB per month. The cross-AZ transfer
            | cost is $0.01 per GB, so that should come out to only
            | $2.26/month for them.
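The arithmetic checks out one way; note, though, that AWS bills intra-region cross-AZ traffic at $0.01/GB in each direction, so a flow crossing AZs effectively pays $0.02/GB:

```python
gb_per_month = 226                 # ingest volume from the article
one_way = gb_per_month * 0.01      # $0.01/GB, single direction
both_ways = gb_per_month * 0.02    # billed on both ends of the transfer

print(f"${one_way:.2f}/month one-way, ${both_ways:.2f}/month round-trip")
# $2.26/month one-way, $4.52/month round-trip -- negligible either way
```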
        
           | gregimba wrote:
            | Reducing cross-AZ data transfer on one service resulted in
            | low-six-figure annual savings. It's something we overlooked
            | during initial setup, and now it's something I check for
            | when dealing with AWS networking.
        
             | otterley wrote:
             | Out of curiosity, what's your plan for recovery during an
             | AZ or regional outage?
        
               | manyxcxi wrote:
                | If RDS or other "hard to replicate quickly during a
                | disaster" infra is being run, I personally would still
                | have cross-AZ replication at minimum. To reduce network
                | costs, I would configure the "other zone" as a backup
                | replica only, not for performance clustering.
                | 
                | With automation we can spin up full new compute stacks,
                | including load balancers and DNS, in about 5-10 minutes
                | per "unique" environment configuration.
                | 
                | While it guarantees we could never have a zero-downtime
                | failover, we're okay with that, and we have more than
                | halved our network costs (which admittedly were only
                | about number 8 on our AWS bill by cost).
        
       | nwmcsween wrote:
        | I don't understand the need to NIH things. Why not ClickHouse
        | with weekly, monthly, etc. aggregation? It comes with built-in
        | sharding, can store time-series data somewhat efficiently, and
        | doesn't need a hacky HA setup (generally, unlike Thanos).
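Not ClickHouse itself, but a minimal pure-Python sketch of the rollup idea the comment describes: raw samples averaged into coarser fixed-width buckets, which is what a ClickHouse materialized view keyed on e.g. toStartOfWeek(timestamp) would maintain incrementally:

```python
from collections import defaultdict

def downsample(samples, bucket_seconds):
    """Average raw (unix_ts, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

# Two minutes of 1 Hz data cycling 0..9, rolled up into 1-minute buckets.
raw = [(t, float(t % 10)) for t in range(120)]
print(downsample(raw, 60))   # {0: 4.5, 60: 4.5}
```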
        
       | [deleted]
        
       | benchess wrote:
       | The key constraint here is that the author has no access to
       | persistent disks and can only use object storage for persistence.
       | Otherwise Thanos would be extreme overkill for this number of
       | metrics.
       | 
       | Single-node VictoriaMetrics can easily handle 1M metrics/sec
        
         | zzyzxd wrote:
         | > Thanos would be extreme overkill for this number of metrics
         | 
         | Data volume is just one thing. Thanos makes Prometheus
         | stateless and easy to shard, all in a non-invasive approach
          | that is solid, boring, and just works. The architecture works
          | well even in small-scale systems: I use it in a single-node
          | k8s cluster in my homelab, paying only about $1 a month for
          | Backblaze B2, so I never worry about data retention or disk
          | usage.
         | 
         | > Single-node VictoriaMetrics can easily handle 1M metrics/sec
         | 
          | Even if I had disk access, I would think twice before
          | deploying a database and managing it myself when I don't have
          | to. Besides the maintenance burden and potential scaling
          | issues down the road, block storage like EBS may cost you
          | more than S3.
         | 
          | Also, Prometheus's memory overhead for remote write can be
          | wild[1], so good luck with capacity planning and config
          | tweaking.
         | 
         | 1. https://prometheus.io/docs/practices/remote_write/
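A back-of-envelope for that remote-write overhead: each shard buffers pending samples in memory, so the worst case scales with max_shards × capacity (both real queue_config fields under remote_write). The concrete values below are illustrative assumptions, not Prometheus defaults:

```python
# Illustrative values -- tune to your own queue_config, not defaults.
max_shards = 200          # queue_config.max_shards (assumed)
capacity = 10_000         # samples buffered per shard (assumed)
bytes_per_sample = 100    # rough in-memory footprint per sample (assumed)

worst_case = max_shards * capacity * bytes_per_sample
print(f"worst-case shard buffers: {worst_case / 1e9:.1f} GB")   # 0.2 GB
```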
        
         | znpy wrote:
         | > Single-node VictoriaMetrics can easily handle 1M metrics/sec
         | 
          | Yeah, sure, but then what will you post on LinkedIn with
          | buzzwords? "I installed a boring piece of software on a
          | single machine because it works, performs well enough, and
          | it's cheap"? What are you going to say, "I take daily
          | snapshots of the VM because a 1-day RPO is fine for me"?
          | 
          | How will you get them likes??
          | 
          | (This is an ironic comment, before anyone starts a flame
          | war.)
        
       ___________________________________________________________________
       (page generated 2021-06-13 23:01 UTC)