[HN Gopher] We compress Pub/Sub messages and more, saving a load...
       ___________________________________________________________________
        
       We compress Pub/Sub messages and more, saving a load of money
        
       Author : kiyanwang
       Score  : 32 points
       Date   : 2021-01-03 13:04 UTC (9 hours ago)
        
 (HTM) web link (blog.lawrencejones.dev)
 (TXT) w3m dump (blog.lawrencejones.dev)
        
       | memetherapy wrote:
       | Genuine question from someone from an entirely different world -
       | why on earth do you have 10 billion log entries? What is in them
       | and do you ever do anything with them that requires you to store
       | so much data rather than just a representative subset?
        
         | xeromal wrote:
          | Any time it's something ridiculous like this, I assume it's
          | for compliance. A few industries require all info to be
          | retained for 7 years.
        
         | user5994461 wrote:
         | >>> why on earth do you have 10 billion log entries?
         | 
         | It's pretty low volume actually. A small company with < 100
         | developers and servers can generate a billion logs over a few
         | weeks.
         | 
         | Normal logs from the system, syslog, applications, databases,
         | web servers... nothing fancy really. It's common practice to
         | centralize all these into ElasticSearch or Splunk.
         | 
          | Their scale of 10 billion logs / 60 TB means they're a regular
          | small-to-medium company.
        
           | Kranar wrote:
            | This seems suspect: that works out to approximately 25 log
            | messages per developer per second, assuming a 10 hour work
            | day.
           | 
           | I work in a tightly regulated industry (finance), and even my
           | company doesn't have a need to log 25 messages per second per
           | person.
           | 
           | Is anyone else able to validate this claim that regular small
           | companies log this much data?
        
           | lawrjone wrote:
           | You've nailed this!
           | 
           | This logging system was for all https://gocardless.com/
            | systems. We're a B2B company, which means we have different
            | economies of scale than many scale-ups of our size, but you
            | were close with your guess:
           | 
           | Currently 450 people worldwide, ~150 in product development,
           | of which ~100 fulltime developers.
        
         | lawrjone wrote:
         | Author here! These 10B log lines are from the last 60 days of
         | activity from https://gocardless.com/ systems.
         | 
         | It includes:
         | 
         | - System logs, such as our Kubernetes VM host logs, or our Chef
         | Postgres machines
         | 
         | - Application logs from Kubernetes pods
         | 
         | - HTTP and RPC logs
         | 
         | - Audit logs from Stackdriver (we use GCP for all our
         | infrastructure)
         | 
         | > do you ever do anything with them that requires you to store
         | so much data rather than just a representative subset?
         | 
         | Some of the logs are already sampled, such as VPC flow logs,
         | but the majority aim for 100% capture.
         | 
         | Especially for application logs, which are used for audit and
         | many other purposes, developers expect all of their logs to
         | stick around for 60d.
         | 
         | Why we do this is quite simple: for the amount of value we get
         | from storing this data, in terms of introspection,
         | observability and in some cases differentiated product
         | capabilities like fraud detection, the cost of running this
         | cluster is quite a bargain.
         | 
         | I suspect we'll soon cross a threshold where keeping everything
         | will cost us more than it's worth, but I'm confident we can
         | significantly reduce our costs with a simple tagging system,
         | where developers mark logs as requiring shorter retention
         | windows.
         | 
         | Hopefully that gives you a good answer! In case you're
         | interested, my previous post mentioned how keeping our HTTP
         | logs around in a queryable form was really useful for helping
         | make a product decision:
         | 
         | https://blog.lawrencejones.dev/connected-data/
        
           | memetherapy wrote:
           | Thanks for the response, really interesting to see how this
           | stuff is used.
        
       | lawrjone wrote:
        | Hey all - I'm the author and just noticed this post. Thanks for
        | the repost!
       | 
       | If you're interested, there was a nice discussion in /r/devops
       | about this the other day:
       | https://www.reddit.com/r/devops/comments/kmltbx/how_we_compr...
        
       | 29athrowaway wrote:
        | Compression has a CPU time cost, though. You spend less on
        | storage but use your CPUs more. Does the extra load from
        | compression cause your cluster to autoscale? If so, you may not
        | be saving money.
        
         | lawrjone wrote:
         | Author here! A few others have pointed this out, but to
         | restate: in our situation, compression was costing us almost
         | nothing.
         | 
         | 60TB of logs was 60 days worth of retention, so 1TB a day. That
         | means we process about 11MB/s on average, peaking at 100MB/s.
         | 
         | A single CPU core can manage 100MB/s of compression, so if you
         | assume we compress and decompress in multiple places, let's say
         | we're paying about 4 CPU cores constantly for this.
         | 
          | That's a pretty worst-case scenario, and it would cost us
          | $37.50/month on GCP for those cores, in order to save about
          | 100x that amount.
         | 
          | The takeaway (for me at least) is that compressing an in-flight
          | transfer is almost always worthwhile if you're working in the
          | cloud and the transfer will eventually be stored somewhere. The
          | economics of CPU cost vs. storage cost make it a no-brainer.
         | 
         | Hope that makes sense!
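The arithmetic in the comment above can be sketched as follows, a back-of-envelope check that reuses the comment's own figures (60 TB over 60 days, 4 cores, $37.50/month, ~100x savings); none of the prices here are rate-card quotes:

```python
# Back-of-envelope check of the figures in the comment above. All inputs
# are the commenter's numbers, not independently sourced rates.
TB = 10**12

total_bytes = 60 * TB            # 60 TB of stored logs...
retention_days = 60              # ...covering a 60-day retention window

bytes_per_day = total_bytes / retention_days        # 1 TB/day
avg_throughput = bytes_per_day / (24 * 60 * 60)     # bytes/second, average

cores = 4                        # compress + decompress at several hops
monthly_cpu_cost = 37.50         # commenter's GCP estimate for those cores
monthly_saving = 100 * monthly_cpu_cost             # "save about 100x"

print(f"average throughput: {avg_throughput / 1e6:.1f} MB/s")
print(f"CPU cost ${monthly_cpu_cost}/month vs ~${monthly_saving:.0f} saved")
```

The average lands at roughly 11.6 MB/s, matching the "about 11MB/s" in the comment, so a single 100 MB/s core has ample headroom even at the 100 MB/s peak.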
        
         | zbjornson wrote:
         | FTA: "This post aimed to cover a scenario where the cost of
         | compression, in both compute resource and time-to-build, was
         | significantly outweighed by the savings it would make."
        
         | dilatedmind wrote:
         | my intuition is you can save cpu time when compressing before
         | sending over the network (and certainly wall time).
         | 
          | a quick test copying a 24M file (with similar compression
          | ratios) to s3 showed a 6% decrease in cpu time when piping
          | through gzip.
        
           | saurik wrote:
           | Data sent to S3 is usually hashed (depending on
           | authentication type) in addition to being transport
           | encrypted; I imagine the majority of this cost here is the
           | encryption of a larger payload (which many would consider
           | indispensable, but I point this out because I do not
           | generally assume this when I merely consider "over the
           | network").
        
             | user5994461 wrote:
              | You assume incorrectly. SSL encryption is on the order of
              | 1 GB/s on a recent CPU with AES instructions (anything
              | from this decade).
              | 
              | Gzip is on the order of 10 MB/s with default settings,
              | down to 1 MB/s with the strongest compression setting.
              | It's really really slow.
        
               | Kranar wrote:
               | GNU gzip the application is slow on the order of 10 MB/s
               | because of how it does file IO, but the DEFLATE algorithm
               | that gzip is based off of is much faster than 10 MB/s at
               | the default "level 6". For example the slz implementation
               | of DEFLATE compresses text at 1 GB/s [1]. Even the fairly
               | common zlib implementation can compress text at close to
               | 300 MB/s.
               | 
               | http://www.libslz.org/
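The gap between gzip-the-tool and DEFLATE-the-algorithm is easy to probe with Python's stdlib `zlib` bindings. A rough benchmark sketch follows; absolute throughput is entirely machine-dependent, and the repetitive log-like payload flatters every level, so only the relative gap between levels is meaningful:

```python
import time
import zlib

# Measure zlib (DEFLATE) throughput at a fast, default, and slow level on
# repetitive, log-like JSON lines. Numbers vary by machine; the point is
# the spread between level 1 and level 9, echoing the comment above.
payload = (b'{"ts":"2021-01-03T13:04:00Z","level":"info","msg":"served"}\n'
           * 100_000)  # ~6 MB of JSON-ish log lines

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    mb_per_s = len(payload) / elapsed / 1e6
    ratio = len(payload) / len(compressed)
    print(f"level {level}: {mb_per_s:7.1f} MB/s, ratio {ratio:.0f}x")

# Round-trip check: decompression must recover the original bytes.
assert zlib.decompress(compressed) == payload
```

Running the same payload through the `gzip` command line adds file I/O and framing overhead on top of this, which is part of why the tool benchmarks so much slower than the library.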
        
           | EricBurnett wrote:
           | Depends on selected compressor, but yes, you can. I've
           | definitely observed zstd-1 to be a net savings, where
           | compression/decompression costs were offset by pushing fewer
           | bytes through the RPC and network layers - and this was only
           | from observing the endpoints, not even counting CPU on
           | intermediate proxies/firewalls/etc.
           | 
           | I wouldn't normally expect gzip to be a net savings (it's
           | comparatively more expensive), but depending on compression
           | ratio achieved and what layers you're passing the bytes
           | through, I'd definitely believe it can be in some contexts.
        
       | nextweek2 wrote:
        | Beware of on-the-fly compression: it can add to network latency
        | if you aren't careful. It's an important metric that gets
        | overlooked in many articles on compression.
        
       | EricBurnett wrote:
       | Related tip: anywhere you're looking to deploy compression in a
       | backend, consider zstd (https://facebook.github.io/zstd/). I've
       | found it suitable for streaming compression (Gbps+ flows without
       | exorbitant CPU usage), storage compression (high compression
       | ratios at higher levels; fast decompression), and found
       | implementations available to every language I've needed. It comes
       | out strictly better than gzip ~across the board in my experience,
       | and should be the default compressor to choose or start
       | evaluations from if you have no other constraints.
       | 
       | I don't think it's yet deployed in browsers, so I'm restricting
       | my recommendation to tools and backends. IIRC brotli is worth
       | considering for browsers, but I haven't deployed it myself.
        
       | dilatedmind wrote:
       | did you use any kind of message framing? Got bit by this at a
       | previous job where we needed to change the message format to
       | improve compression. Wound up figuring something out, but would
       | have been easier if we had reserved a byte for versioning.
        
         | lawrjone wrote:
         | I'm not quite sure what you mean by message framing.
         | 
          | If you mean marking messages as having been compressed, then
          | absolutely yes. The Pub/Sub messages were tagged with
          | `compress=true|false` so we could update downstream and
          | upstream independently.
         | 
         | If you mean buffering log messages into a batched 'frame', then
         | yes we did do this. We were taking about 500 log entries and
         | compressing them into a single message, which was part of why
         | the compression was so effective.
         | 
         | If you mean something different, then I'm at a loss and very
         | interested in what you meant by the term!
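The batching-plus-tagging scheme described above can be sketched like this. It is an illustrative reconstruction with hypothetical names (`encode_batch`, `decode_batch`), not GoCardless's actual code, and a plain dict stands in for a Pub/Sub message with its `data` payload and `attributes` map:

```python
import gzip
import json

# Batch log entries into one payload, gzip it, and tag the message with a
# `compress` attribute so producers and consumers can be upgraded
# independently: a consumer only decompresses when the tag says to.
BATCH_SIZE = 500  # the comment above cites ~500 entries per message

def encode_batch(entries, compress=True):
    payload = "\n".join(json.dumps(e) for e in entries).encode()
    if compress:
        payload = gzip.compress(payload)
    return {"data": payload, "attributes": {"compress": str(compress).lower()}}

def decode_batch(message):
    payload = message["data"]
    if message["attributes"].get("compress") == "true":
        payload = gzip.decompress(payload)
    return [json.loads(line) for line in payload.decode().splitlines()]

entries = [{"seq": i, "msg": "request served"} for i in range(BATCH_SIZE)]
message = encode_batch(entries)
assert decode_batch(message) == entries
print(len(message["data"]), "compressed bytes for", BATCH_SIZE, "entries")
```

Batching is what makes the compression so effective: 500 similar JSON entries share far more structure than any single entry does on its own, and the attribute gives the versioning escape hatch the parent comment wished it had reserved.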
        
       ___________________________________________________________________
       (page generated 2021-01-03 23:02 UTC)