[HN Gopher] Saving Three Months of Latency with a Single OpenTel...
___________________________________________________________________
Saving Three Months of Latency with a Single OpenTelemetry Trace
Author : serverlessmom
Score : 81 points
Date : 2024-06-06 21:52 UTC (1 days ago)
(HTM) web link (www.checklyhq.com)
(TXT) w3m dump (www.checklyhq.com)
| simonbarker87 wrote:
| I wish posts like this would explore the relative savings rather
| than the absolute. On its own, that saving doesn't really tell me
| much; taken to the extreme, you could just not run the service at
| all and save all the time. That's a tongue-in-cheek example, but
| in context, is this saving a big deal, or is it just engineering
| looking for small efficiencies to justify their time?
| tnolet wrote:
| Hey, I work at Checkly and asked my coworker (who wrote the
| post) to give some more background on this. I can assure you,
| we're busy and this was not done for some vanity prize!
| janOsch wrote:
| I'm the author of the post. You raise a good point about
| relative savings. Based on last week's data, our change reduced
| the task time by 40ms from an average of 3440ms, and this task
| runs 11 million times daily. This translates to a saving of
| about 1% on compute.
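| Spelling out the arithmetic behind that figure (rounding a bit):
|
|       40 ms / 3440 ms ~= 1.2% of each task's run time
|       40 ms x 11,000,000 runs/day = 440,000 s/day, which is
|       roughly 5 days of accumulated task time saved per day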
| simonbarker87 wrote:
| Thanks for the follow up, sounds like a decent saving and
| investment of time then.
| janOsch wrote:
| Fun fact: it probably took more time to write up and refine
| the blog post than it did to hunt down that sneaky 40ms
| savings.
| simonbarker87 wrote:
| True but the value of the hunt and fix may really come
| from this blog post long term. Content marketing and all
| that
| hiatus wrote:
| > This translates to a saving of about 1% on compute.
|
| Does this translate to any tangible savings? I'm not sure
| what the Checkly backend looks like, but if tasks are running
| on a cluster of hosts rather than being invoked per task, it
| seems hard to realize savings. Even per task, 40 ms can only
| be realized on a service like Lambda; ECS's minimum billing
| unit is 1 second afaik.
| serverlessmom wrote:
| I think that's a flawed analysis. If you're running FaaS,
| then sure, you can fail to see a benefit from small
| improvements in time (AWS Lambda changed its billing
| resolution a few years back, but before then the Go services
| didn't save much money despite being faster). But if you're
| running thousands of requests and speeding them all up, you
| should be able to realize tangible compute savings whatever
| your platform.
| hiatus wrote:
| Help me to understand, then. If this stuff is being done
| on an autoscaling cluster, I can see it, but if you are
| just running everything on an always-on box for instance,
| it is less clear to me.
|
| edit: Do you have an affiliation with the blog? I ask
| because you have submitted several articles from checkly
| in the past.
| tnolet wrote:
| Hey, Checkly founder here. We changed our infra quite a
| bit over the last ~1 year, but it's still mostly ephemeral
| compute. We actually started on AWS Lambda and are now on
| a mix of AWS EC2 and EKS, all autoscaled per region (we
| run 20+ of them).
|
| It seems tiny, but in aggregate this will have an impact
| on our COGS. You are correct that if we had a fixed fleet
| of instances, the impact would not have been super
| interesting.
|
| But still, for a couple of hours spent, this saves us a
| good few $1Ks per year.
| serverlessmom wrote:
| Yes I work at Checkly, though I didn't answer
| authoritatively since this one wasn't written by me!
| brunoarueira wrote:
| I agree, but this post looks like an advertisement for the
| service itself.
| ebcase wrote:
| It's literally on the company's blog, which is partially
| about promoting the company's service. What's the issue with
| that?
|
| (Long time happy Checkly user here, the service is fantastic)
| brunoarueira wrote:
| Not a problem, but the OP is questioning the savings!
|
| I, for example, like to dive deeper into insights like the
| relative savings vs. the absolute, to learn the approaches
| other engineers take! It's all about which metrics we
| should pay attention to.
|
| (I'll put this service on my list to try someday, it does
| look fantastic indeed)
| redman25 wrote:
| I often ask myself the same question. We have some user-facing
| queries that slow the frontend down. I've fixed some of the
| slowness, but it's definitely not a priority. I wonder how much
| speed improvements correlate with increased revenue from happy
| customers.
| encoderer wrote:
| Think of this like changing the oil in your car.
|
| Over-optimizing is not going to help you at all but if you
| ignore it eventually it will all seize up.
|
| You have to keep that stuff in check.
| dmurray wrote:
| The units seem wrong in any case. It's 3 months of compute
| _per day_, which is actually much more impressive.
|
| If we think about the business impact, we don't usually think
| of compute expenditure per day, so you might reasonably say
| the fix saved 90 years of annual compute. Looks better in your
| promotion packet, too.
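| (Roughly: 3 months per day is 0.25 years per day, and 0.25 x
| 365 ~= 91 years of compute saved over a full year.)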
| cdelsolar wrote:
| µs isn't picoseconds, it's microseconds, which are a million
| times bigger...
| janOsch wrote:
| Thank you for pointing that out! You are correct, µs stands for
| microseconds, not picoseconds. I've corrected the mistake, and
| the update should be visible as soon as the CDN cache
| invalidates.
| serverlessmom wrote:
| Every day I have more sympathy for the Mars Climate Orbiter
| team. https://science.nasa.gov/mission/mars-climate-orbiter/
| harisamin wrote:
| On the noisy NodeJS auto-instrumentation: it is indeed very noisy
| out of the box. A bunch of other people and I finally got the
| project to allow you to select the instrumentations via
| configuration, which saves having to create your own
| tracer.ts/js file.
|
| Here's the PR that got merged earlier in the year:
| https://github.com/open-telemetry/opentelemetry-js-contrib/p...
|
| The env var config is `OTEL_NODE_ENABLED_INSTRUMENTATIONS`
|
| Anyways, love OpenTelemetry success stories. Been working hard on
| it at my current company and it's already bearing fruit :)
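|
| For anyone hunting for it, usage is roughly as follows (the
| names go without the "@opentelemetry/instrumentation-" prefix;
| "http,express,pg" is just an example list, and you still need
| your usual exporter settings):
|
|       OTEL_NODE_ENABLED_INSTRUMENTATIONS="http,express,pg" \
|         node --require \
|         @opentelemetry/auto-instrumentations-node/register app.js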
| roboben wrote:
| This is great. Last time I tried this I couldn't even find a
| way to disable some of them in code.
| tnolet wrote:
| That is awesome. Had no idea this was available as an env var.
| After diving into OTel for our backend, we also found some of
| this stuff is just too noisy. We switch it off using this code
| snippet, for anyone bumping into this thread:
|
|       instrumentations: [
|         getNodeAutoInstrumentations({
|           '@opentelemetry/instrumentation-fs': { enabled: false },
|           '@opentelemetry/instrumentation-net': { enabled: false },
|           '@opentelemetry/instrumentation-dns': { enabled: false },
|         }),
|       ],
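|
| For context, a minimal tracer.ts along these lines might look
| roughly like the following (just a sketch, not our exact setup;
| 'my-service' is a placeholder and the OTLP endpoint comes from
| the usual OTEL_EXPORTER_OTLP_* env vars):
|
|       import { NodeSDK } from '@opentelemetry/sdk-node';
|       import { OTLPTraceExporter } from
|         '@opentelemetry/exporter-trace-otlp-http';
|       import { getNodeAutoInstrumentations } from
|         '@opentelemetry/auto-instrumentations-node';
|
|       const sdk = new NodeSDK({
|         serviceName: 'my-service',
|         traceExporter: new OTLPTraceExporter(),
|         instrumentations: [
|           getNodeAutoInstrumentations({
|             // the noisiest ones for us
|             '@opentelemetry/instrumentation-fs': { enabled: false },
|             '@opentelemetry/instrumentation-net': { enabled: false },
|             '@opentelemetry/instrumentation-dns': { enabled: false },
|           }),
|         ],
|       });
|
|       sdk.start();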
| harisamin wrote:
| yeah I totally turned all of those off...way too noisy :)
| serverlessmom wrote:
| This is so cool! I've had this exact problem before.
| Veserv wrote:
| Why would you disable instrumentation instead of just filtering
| the recorded log?
|
| That only makes sense if the instrumentation overhead itself is
| significant. But for an efficient recording implementation, that
| should only really start being a problem when your average span
| is ~1 us.
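|
| To make the suggestion concrete: in the JS SDK, filtering
| before export could look something like this custom
| SpanProcessor (a sketch; the span-name check and the 1 ms
| threshold are made-up examples, not a recommendation):
|
|       import { Context } from '@opentelemetry/api';
|       import { hrTimeToMilliseconds } from '@opentelemetry/core';
|       import {
|         ReadableSpan, Span, SpanProcessor,
|       } from '@opentelemetry/sdk-trace-base';
|
|       // Forwards only spans that pass a predicate to the wrapped
|       // processor (e.g. a BatchSpanProcessor with an exporter).
|       class FilteringSpanProcessor implements SpanProcessor {
|         constructor(
|           private delegate: SpanProcessor,
|           private keep: (span: ReadableSpan) => boolean,
|         ) {}
|         onStart(span: Span, ctx: Context) {
|           this.delegate.onStart(span, ctx);
|         }
|         onEnd(span: ReadableSpan) {
|           if (this.keep(span)) this.delegate.onEnd(span);
|         }
|         shutdown() { return this.delegate.shutdown(); }
|         forceFlush() { return this.delegate.forceFlush(); }
|       }
|
|       // Example: drop sub-millisecond fs spans, keep the rest.
|       const keepSpan = (span: ReadableSpan) =>
|         !(span.name.startsWith('fs') &&
|           hrTimeToMilliseconds(span.duration) < 1);
|
| You would then register it with the tracer provider in place of
| a plain BatchSpanProcessor.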
| tnolet wrote:
| Oh, simple answer. The tools you use to inspect those traces
| just blow up with noise. Like a trace that shows 600+ file
| reads that all take less than half a millisecond.
|
| This is all noise when you are trying to debug more common
| issues than your FS being too slow.
|
| Plus there's also storage cost. Most vendors charge per MB
| stored or per span recorded.
| Veserv wrote:
| That is why I mentioned post-filtering the recording as the
| alternative. Grab the full recording then filter to just
| the relevant results before inspection.
|
| For that matter, why are a few hundred spans a problem? Are
| the visualizers that poor? I usually use function tracing
| where hundreds of millions to billions of spans per second
| are the norm and there is no difficulty managing or
| understanding those.
| tnolet wrote:
| In most cases these traces are shipped over the wire to a
| vendor. That alone will cost $$. Also, not all vendors
| have tail sampling as a "free" feature. So, in many cases
| it's better not to record at all.
| Veserv wrote:
| That sounds positively dystopian. Is it really that hard
| to dump to private/non-vendor storage for local analysis
| using your own tools?
|
| I do not do cloud or web development, so this is just
| totally alien. I generate multi-gigabyte logs with
| billions of events for just seconds of execution and get
| to slice them however I want when doing performance
| analysis. The inability to even process your own logs
| seems crazy.
| cweld510 wrote:
| You can absolutely dump the traces somewhere and analyze
| them yourself. The problem is that this falls apart with
| scale. You are maybe serving thousands of requests per
| second. Your service has a ton of instances. Capturing
| all trace data for all requests from all services is just
| difficult. Where do you store all of it? How do you
| quickly find what you need? It gets very annoying very
| fast. When you pay a vendor, you pay them to deal with
| this.
| bushbaba wrote:
| Just a friendly call out that checkly is an awesome service.
| AtlasBarfed wrote:
| It's AWS: if you shake a stick at some network transfer
| optimization or storage/EBS/S3, you'll save three engineers'
| salaries.
| tnolet wrote:
| This deserves an updoot. We've reached a level of scale at
| Checkly where all of these things start adding up. We moved
| workloads off of S3 to Cloudflare R2 because of this.
| tmpz22 wrote:
| > We moved workloads off of S3 to Cloudflare R2 because of
| this.
|
| So you moved from a mature but expensive storage solution to
| a younger, currently subsidized storage solution? What happens
| when R2 jacks up its pricing?
| tnolet wrote:
| Let me nuance that a bit. 99% of our workload is
| write-heavy and is still on S3. We run monitoring checks
| that snap a screenshot and record a video, and we write
| that to S3. Most folks will never view any of that, as
| most checks pass and these artefacts only become
| interesting when things fail.
|
| Enter a new product feature we launched (Visual Regression
| Testing), which requires us to fetch an image from storage
| on every "run" we do. These runs can happen every 10
| seconds. This is where R2 shines: no egress cost for us.
| It's been rock solid and saved us about 60x compared to
| AWS. Still, we run most of our infra on AWS.
| philsnow wrote:
| Is latency the same thing as duration? I think of latency as
| being more like a vector-with-starting-point (a "ray segment"?)
| than a scalar; it's "rooted" to a point in time, so it doesn't
| make sense to sum them.
| Cthulhu_ wrote:
| Given how high-frequency this task is, I'd say it's worth
| exploring moving away from Node; I don't associate Node with
| high performance / throughput myself.
___________________________________________________________________
(page generated 2024-06-07 23:01 UTC)