[HN Gopher] Saving Three Months of Latency with a Single OpenTel...
       ___________________________________________________________________
        
       Saving Three Months of Latency with a Single OpenTelemetry Trace
        
       Author : serverlessmom
       Score  : 81 points
       Date   : 2024-06-06 21:52 UTC (1 day ago)
        
 (HTM) web link (www.checklyhq.com)
 (TXT) w3m dump (www.checklyhq.com)
        
       | simonbarker87 wrote:
       | I wish posts like this would explore the relative savings rather
        | than the absolute. On its own that saving doesn't really tell me
        | much; taken to the extreme, you could just not run the service at
        | all and save all the time - a tongue-in-cheek example, but in
        | context is this saving a big deal, or is it just engineering
        | looking for small efficiencies to justify their time?
        
         | tnolet wrote:
         | Hey, I work at Checkly and asked my coworker (who wrote the
         | post) to give some more background on this. I can assure you,
          | we're busy and this was not done for some vanity prize!
        
         | janOsch wrote:
         | I'm the author of the post. You raise a good point about
         | relative savings. Based on last week's data, our change reduced
         | the task time by 40ms from an average of 3440ms, and this task
         | runs 11 million times daily. This translates to a saving of
         | about 1% on compute.
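          | 
          | For anyone who wants to sanity-check that arithmetic, a rough
          | back-of-the-envelope sketch using only the figures above
          | (variable names are just for illustration):
          | 
          |   const savedMsPerRun = 40;
          |   const avgMsPerRun = 3_440;
          |   const runsPerDay = 11_000_000;
          | 
          |   const relativeSaving = savedMsPerRun / avgMsPerRun;
          |   // ~0.0116, i.e. "about 1%" of task time
          | 
          |   const savedHoursPerDay =
          |     (savedMsPerRun * runsPerDay) / 1000 / 3600;
          |   // ~122 compute-hours shaved off per day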
        
           | simonbarker87 wrote:
           | Thanks for the follow up, sounds like a decent saving and
           | investment of time then.
        
             | janOsch wrote:
             | Fun fact: it probably took more time to write up and refine
             | the blog post than it did to hunt down that sneaky 40ms
             | savings.
        
               | simonbarker87 wrote:
               | True but the value of the hunt and fix may really come
               | from this blog post long term. Content marketing and all
               | that
        
           | hiatus wrote:
           | > This translates to a saving of about 1% on compute.
           | 
           | Does this translate to any tangible savings? I'm not sure
           | what the checkly backend looks like but if tasks are running
           | on a cluster of hosts vs invoked per-task it seems hard to
           | realize savings. Even per-task, 40 ms can only be realized on
           | a service like Lambda--ECS minimum billing unit is 1 second
           | afaik.
        
             | serverlessmom wrote:
              | I think that's a flawed analysis. If you're running FaaS
              | then sure, you can fail to see benefit from small
              | improvements in time (AWS Lambda changed their billing
              | resolution a few years back, and before then the Go services
              | didn't save much money despite being faster), but if you're
              | running thousands of requests and speeding them all up, you
              | should be able to realize tangible compute savings whatever
              | your platform.
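              | 
              | To illustrate the billing-resolution point with a rough
              | sketch (if I remember right, Lambda used to bill in 100 ms
              | increments and now bills in 1 ms increments; the durations
              | below are made up):
              | 
              |   // Billed duration = actual duration rounded up to the
              |   // billing increment.
              |   const billedMs = (actualMs: number, incrementMs: number) =>
              |     Math.ceil(actualMs / incrementMs) * incrementMs;
              | 
              |   billedMs(3460, 100); // 3500 ms billed
              |   billedMs(3420, 100); // still 3500 ms: the 40 ms cut vanishes
              |   billedMs(3420, 1);   // 3420 ms: with 1 ms increments it counts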
        
               | hiatus wrote:
               | Help me to understand, then. If this stuff is being done
               | on an autoscaling cluster, I can see it, but if you are
               | just running everything on an always-on box for instance,
               | it is less clear to me.
               | 
               | edit: Do you have an affiliation with the blog? I ask
               | because you have submitted several articles from checkly
               | in the past.
        
               | tnolet wrote:
                | Hey, Checkly founder here. We changed our infra quite a
                | bit over the last ~1 year. Still, it's mostly ephemeral
                | compute. We actually started on AWS Lambda. We are on a
               | mix of AWS EC2 and EKS now, all autoscaled per region (we
               | run 20+ of them).
               | 
               | It seems tiny, but in aggregate this will have an impact
               | on our COGS. You are correct that if we had a fixed fleet
                | of instances, the impact would not have been super
                | interesting.
               | 
               | But still, for a couple of hours spent, this saves us
                | quite a few $1Ks per year.
        
               | serverlessmom wrote:
               | Yes I work at Checkly, though I didn't answer
               | authoritatively since this one wasn't written by me!
        
         | brunoarueira wrote:
          | I agree, but this post looks like an advertisement for the
          | service itself.
        
           | ebcase wrote:
           | It's literally on the company's blog, which is partially
           | about promoting the company's service. What's the issue with
           | that?
           | 
           | (Long time happy Checkly user here, the service is fantastic)
        
             | brunoarueira wrote:
              | Not a problem, but the OP is questioning the savings!
              | 
              | I, for example, like to dig deeper into insights like
              | relative vs. absolute savings to learn the approaches other
              | engineers take! It's all about which metrics we should pay
              | attention to.
              | 
              | (I'll put this service on my list to try someday, it does
              | look fantastic indeed)
        
         | redman25 wrote:
          | I often ask myself the same question. We have some user-facing
          | queries that slow the frontend down. I've fixed some slowness
          | but it's definitely not a priority. I wonder how much speed
          | improvements correlate with increased revenue from happy
          | customers.
        
           | encoderer wrote:
           | Think of this like changing the oil in your car.
           | 
           | Over-optimizing is not going to help you at all but if you
           | ignore it eventually it will all seize up.
           | 
           | You have to keep that stuff in check.
        
         | dmurray wrote:
          | The units seem wrong in any case. It's 3 months of compute _per
          | day_, which is actually much more impressive.
          | 
          | If we think about the business impact, we don't usually think
          | of compute expenditure per day, so you might reasonably say the
          | fix saves 90 years of compute annually. Looks better in your
          | promotion packet, too.
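          | 
          | (Taking that headline figure at face value, the conversion is
          | just:
          | 
          |   const monthsSavedPerYear = 3 * 365;                // 1095 months
          |   const yearsSavedPerYear = monthsSavedPerYear / 12; // ~91 years
          | 
          | which is where the ~90-year number comes from.)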
        
       | cdelsolar wrote:
        | µs isn't picoseconds, it's microseconds, which are a million
        | times bigger...
        
         | janOsch wrote:
          | Thank you for pointing that out! You are correct, µs stands for
          | microseconds, not picoseconds. I've corrected the mistake, and
         | the update should be visible as soon as the CDN cache
         | invalidates.
        
           | serverlessmom wrote:
           | Every day I have more sympathy for the Mars Climate Orbiter
           | team. https://science.nasa.gov/mission/mars-climate-orbiter/
        
       | harisamin wrote:
        | On NodeJS auto-instrumentation: it is indeed very noisy out of
        | the box. A bunch of other people and I finally got the project to
        | allow you to select which instrumentations to enable via
        | configuration. Saves having to create your own tracer.ts/js file.
       | 
       | Here's the PR that got merged earlier in the year:
       | https://github.com/open-telemetry/opentelemetry-js-contrib/p...
       | 
       | The env var config is `OTEL_NODE_ENABLED_INSTRUMENTATIONS`
       | 
        | Anyways, love OpenTelemetry success stories. Been working hard on
        | it at my current company and it's already bearing fruit :)
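        | 
        | For anyone reaching for this: if I'm reading the contrib docs
        | right, the env var takes a comma-separated list of short
        | instrumentation names (without the @opentelemetry/instrumentation-
        | prefix), so usage with the register hook looks roughly like
        | 
        |   OTEL_NODE_ENABLED_INSTRUMENTATIONS="http,express,pg" \
        |     node --require @opentelemetry/auto-instrumentations-node/register app.js
        | 
        | where the instrumentation names and app.js are just placeholders.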
        
         | roboben wrote:
          | This is great. Last time I tried this I couldn't even find a
          | way in code to disable some of them.
        
         | tnolet wrote:
         | That is awesome. Had no idea this was available as an env var.
         | After diving into OTel for our backend, we also found some of
         | this stuff is just too noisy. We switch it of using this code
         | snippet, for anyone bumping into this thread:
         | instrumentations: [getNodeAutoInstrumentations({
         | '@opentelemetry/instrumentation-fs': {             enabled:
         | false,           },           '@opentelemetry/instrumentation-
         | net': {             enabled: false,           },
         | '@opentelemetry/instrumentation-dns': {             enabled:
         | false,           },
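          | 
          | For context, a sketched version of where that snippet might
          | live (assuming the NodeSDK from @opentelemetry/sdk-node and an
          | OTLP/HTTP exporter; adjust the exporter and package names to
          | your own setup):
          | 
          |   import { NodeSDK } from '@opentelemetry/sdk-node';
          |   import {
          |     getNodeAutoInstrumentations,
          |   } from '@opentelemetry/auto-instrumentations-node';
          |   import {
          |     OTLPTraceExporter,
          |   } from '@opentelemetry/exporter-trace-otlp-http';
          | 
          |   // Keep the default auto-instrumentations, but silence the
          |   // noisiest ones (fs, net, dns).
          |   const sdk = new NodeSDK({
          |     traceExporter: new OTLPTraceExporter(),
          |     instrumentations: [
          |       getNodeAutoInstrumentations({
          |         '@opentelemetry/instrumentation-fs': { enabled: false },
          |         '@opentelemetry/instrumentation-net': { enabled: false },
          |         '@opentelemetry/instrumentation-dns': { enabled: false },
          |       }),
          |     ],
          |   });
          | 
          |   sdk.start();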
        
           | harisamin wrote:
           | yeah I totally turned all of those off...way too noisy :)
        
         | serverlessmom wrote:
         | This is so cool! I've had this exact problem before.
        
         | Veserv wrote:
         | Why would you disable instrumentation instead of just filtering
         | the recorded log?
         | 
          | That only makes sense if the instrumentation overhead itself is
          | significant. But, for an efficient recording implementation,
          | that should only really start being a problem when your average
          | span is ~1 us.
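          | 
          | If anyone does want to filter rather than disable in the JS
          | SDK, one sketch of that idea is a span processor that drops
          | spans from noisy scopes before export (type and property names
          | from memory; they may differ between SDK versions):
          | 
          |   import { Context } from '@opentelemetry/api';
          |   import {
          |     ReadableSpan,
          |     Span,
          |     SpanProcessor,
          |   } from '@opentelemetry/sdk-trace-base';
          | 
          |   // Wraps another processor and silently drops spans whose
          |   // instrumentation scope is on the "noisy" list, so they are
          |   // recorded but never exported.
          |   class DropNoisyScopesProcessor implements SpanProcessor {
          |     constructor(
          |       private readonly inner: SpanProcessor,
          |       private readonly noisyScopes: Set<string>,
          |     ) {}
          | 
          |     onStart(span: Span, parentContext: Context): void {
          |       this.inner.onStart(span, parentContext);
          |     }
          | 
          |     onEnd(span: ReadableSpan): void {
          |       // `instrumentationScope` in newer SDK releases.
          |       if (this.noisyScopes.has(span.instrumentationLibrary.name)) {
          |         return; // e.g. drop '@opentelemetry/instrumentation-fs'
          |       }
          |       this.inner.onEnd(span);
          |     }
          | 
          |     shutdown(): Promise<void> {
          |       return this.inner.shutdown();
          |     }
          | 
          |     forceFlush(): Promise<void> {
          |       return this.inner.forceFlush();
          |     }
          |   }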
        
           | tnolet wrote:
           | Oh, simple answer. The tools you use to inspect those traces
            | just blow up with noise. Like a trace that shows 600+ file
            | reads that all take less than half a millisecond.
           | 
           | This is all noise when you are trying to debug more common
           | issues than your FS being too slow.
           | 
            | + also storage cost. Most vendors charge by MB stored or per
            | span recorded.
        
             | Veserv wrote:
             | That is why I mentioned post-filtering the recording as the
             | alternative. Grab the full recording then filter to just
             | the relevant results before inspection.
             | 
             | For that matter, why are a few hundred spans a problem? Are
             | the visualizers that poor? I usually use function tracing
             | where hundreds of millions to billions of spans per second
             | are the norm and there is no difficulty managing or
             | understanding those.
        
               | tnolet wrote:
               | In most cases these traces are shipped over the wire to a
                | vendor. That alone will cost $$. Then, not all vendors
               | have tail sampling as a "free" feature. So, in many cases
               | it's better to not record at all.
        
               | Veserv wrote:
               | That sounds positively dystopian. Is it really that hard
               | to dump to private/non-vendor storage for local analysis
               | using your own tools?
               | 
               | I do not do cloud or web development, so this is just
               | totally alien. I generate multi-gigabyte logs with
               | billions of events for just seconds of execution and get
               | to slice them however I want when doing performance
               | analysis. The inability to even process your own logs
               | seems crazy.
        
               | cweld510 wrote:
               | You can absolutely dump the traces somewhere and analyze
               | them yourself. The problem is that this falls apart with
               | scale. You are maybe serving thousands of requests per
               | second. Your service has a ton of instances. Capturing
               | all trace data for all requests from all services is just
               | difficult. Where do you store all of it? How do you
               | quickly find what you need? It gets very annoying very
               | fast. When you pay a vendor, you pay them to deal with
               | this.
        
       | bushbaba wrote:
        | Just a friendly call out that Checkly is an awesome service.
        
       | AtlasBarfed wrote:
        | It's AWS: if you shake a stick at some network transfer
        | optimization or storage/EBS/S3, you'll save three engineers'
        | salaries.
        
         | tnolet wrote:
          | This deserves an updoot. We've reached a level of scale at
          | Checkly where all of these things start adding up. We moved
         | workloads off of S3 to Cloudflare R2 because of this.
        
           | tmpz22 wrote:
           | > We moved workloads off of S3 to Cloudflare R2 because of
           | this.
           | 
           | So you moved from a mature but expensive storage solution to
           | a younger currently subsidized storage solution? What happens
           | when R2 jacks up pricing?
        
             | tnolet wrote:
              | Let me nuance that a bit: 99% of our workload is
              | write-heavy and is still on S3. We run monitoring checks that
             | snap a screenshot and record a video. We write that to S3.
             | Most folks will never view any of that as most checks pass
             | and these artefacts only become interesting when things
             | fail.
             | 
             | Enter a new product feature we launched (Visual Regression
             | Testing) which requires us to fetch an image from storage
             | on every "run" we do. These could be every 10sec. This is
             | where R2 shines. No egress cost for us. It's been rock
             | solid and saved us about 60x compared to AWS. Still, we run
             | most of our infra on AWS.
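              | 
              | To illustrate why egress dominates here, a rough sketch with
              | purely made-up numbers (and a ballpark ~$0.09/GB for S3
              | internet egress; check current pricing):
              | 
              |   // One check fetching a ~1 MB image every 10 seconds,
              |   // over a 30-day month.
              |   const fetchesPerMonth = (30 * 24 * 3600) / 10;   // 259,200
              |   const gbPerMonth = (fetchesPerMonth * 1) / 1024; // ~253 GB
              | 
              |   const s3EgressPerGb = 0.09; // ballpark, per GB
              |   const r2EgressPerGb = 0.0;  // R2 charges no egress
              | 
              |   const s3EgressCost = gbPerMonth * s3EgressPerGb; // ~$23/month
              |   const r2EgressCost = gbPerMonth * r2EgressPerGb; // $0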
        
       | philsnow wrote:
       | Is latency the same thing as duration? I think of latency as
       | being more like a vector-with-starting-point (a "ray segment"?)
        | than a scalar; it's "rooted" to a point in time, so it doesn't
        | make sense to sum them.
        
       | Cthulhu_ wrote:
        | Given how high-frequency this thing is, I'd say it's worth
       | exploring moving away from Node; I don't associate Node with high
       | performance / throughput myself.
        
       ___________________________________________________________________
       (page generated 2024-06-07 23:01 UTC)