hngopher.com

       [HN Gopher] Datadog's $65M/year customer mystery solved
       ___________________________________________________________________
        
       Datadog's $65M/year customer mystery solved
        
       Author : thunderbong
       Score  : 92 points
       Date   : 2025-06-30 18:31 UTC (4 hours ago)
        
 (HTM) web link (blog.pragmaticengineer.com)
 (TXT) w3m dump (blog.pragmaticengineer.com)
        
       | mrkramer wrote:
       | And who says that SaaS doesn't pay off?! It pays off like hell!
        
       | delichon wrote:
       | > For observability, Coinbase spun up a dedicated team with the
       | goal of moving off of Datadog, and onto a
       | Grafana/Prometheus/Clickhouse stack.
       | 
       | We recently did the same, and our Datadog bill was only five
       | figures. We're finding the new stack to not be a poor man's
       | anything, but more flexible, complete and manageable than yet
       | another SaaS. With just a little extra learning curve
       | observability is a domain where open source trounces proprietary,
       | and not just if you don't have money to set on fire.
        
         | oulipo wrote:
         | Have you tried the ClickStack?
         | https://news.ycombinator.com/item?id=44194082
        
       | asnyder wrote:
       | There's also https://openobserve.ai. While not as stable as
       | Grafana/Prometheus/Clickhouse, it feels a bit easier to set up
       | and manage. It still has a ways to go, but it does the basics
       | and more without issue.
       | 
       | Crazy that they spent so much on observability. Even with
       | Datadog they could've optimized that spend. Datadog does lots of
       | bad things with billing: by default, especially with on-demand
       | instances, you get charged significantly more than you should,
       | as they have (had?) pretty deficient counting of instance hours
       | and instances.
       | 
       | For example, rather than run the agent (which counts as an
       | instance regardless of if it's on for a minute), you can send the
       | logs, metrics, etc. directly to their ingestion endpoints and not
       | have those instances counted towards their usage other than log
       | and metric usage.
       | 
       | Maybe at that level they don't even get into actual usage-based
       | billing anymore, and they just negotiate arbitrary amounts for
       | some absurd quota of use.
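
As a rough illustration of the "send metrics directly to the ingestion endpoint" idea in the comment above, here is a minimal Python sketch that builds (but does not send) a metric submission. It assumes Datadog's v2 series intake on the default site and an API key passed in the `DD-API-KEY` header; the metric name and tags are hypothetical. Whether agentless submission actually changes what gets billed is the commenter's claim and is not something this sketch verifies.

```python
import json
import time
import urllib.request

# Assumption: Datadog's v2 metrics intake on the default site.
SERIES_URL = "https://api.datadoghq.com/api/v2/series"
GAUGE = 3  # type code for a gauge in the v2 series payload


def build_series_request(api_key, metric, value, tags, timestamp=None):
    """Build (but do not send) a POST that submits one gauge point."""
    point_time = int(timestamp if timestamp is not None else time.time())
    payload = {
        "series": [
            {
                "metric": metric,
                "type": GAUGE,
                "points": [{"timestamp": point_time, "value": value}],
                "tags": tags,
            }
        ]
    }
    return urllib.request.Request(
        SERIES_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )


# Hypothetical metric name, tags, and placeholder key, for illustration only.
request = build_series_request(
    "<your-api-key>", "batch.jobs.duration", 12.5, ["env:demo"]
)
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the actual submission; the sketch stops short of that so it can be run without credentials.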
        
       | ljm wrote:
       | I wonder how much that no-expense-spared, money-is-no-object
       | attitude to buying SaaS impacts an engineer's ability to make
       | sensible decisions around infra and architecture. Coinbase might
       | have been fine blowing 65 mil but take that approach to a new
       | startup and you could trivially eat up a significant amount of
       | runway with it.
       | 
       | I won't single out Datadog on this because the exact same thing
       | happens with cloud spend, and it's very literally burning money.
        
         | swyx wrote:
         | the visible cost of burning runway on a bill is very often far
         | less than the invisible cost of burning engineer time
         | rebuilding undifferentiated heavy lifting rather than working
         | on product/customer needs
        
           | 9283409232 wrote:
           | People say this but I wonder about this from time to time. I
           | don't think anyone is asking to rebuild datadog from scratch
           | for your company but surely it's worth it to migrate to
           | something not as expensive even if it takes a bit of elbow
           | grease.
        
             | closeparen wrote:
             | Assuming there's nothing else you could do with that elbow
             | grease that would create more value than the SaaS bill
             | costs.
        
           | pphysch wrote:
           | Most of the complexity in observability is clientside.
           | 
           | It is not hard to spin up Grafana and VictoriaMetrics (and
           | now VictoriaLogs) and keep them running. It is not hard to
           | build a Grafana dashboard that correlates data across both
           | metrics and logs sources, and alerting functionality is
           | pretty good now.
           | 
           | The "heavy lift" is instrumenting your applications and
           | infrastructure to provide valuable metrics and logs without
           | exceeding a performance budget. I'm skeptical that Datadog
           | actually does much of that heavy-lifting and that they are
           | actually worth the money. You can probably save 10x with
           | same/better outcomes by paying for managed Grafana + managed
           | DBs and a couple FTEs as observability experts.
        
             | lerchmo wrote:
             | You could hire 100 people to manage your timeseries data
             | and save 70%
        
         | viccis wrote:
         | > I wonder how much that no-expense-spared, money-is-no-object
         | attitude to buying SaaS impacts an engineer's ability to make
         | sensible decisions around infra and architecture
         | 
         | I saw this a lot at a previous company. Being able to just
         | "have more Lambdas scale up to handle it" got some very
         | mediocre engineers past challenges they encountered. But it did
         | so at the cost of wasting VAST amounts of money and saddling
         | themselves with tech debt that completely hobbled the company's
         | ability to scale.
         | 
         | It was very frustrating to be too junior to be able to change
         | minds. Even basic things like "I know it worked for you with
         | old on-prem NFS designs but we shouldn't be storing our data in
         | 100kb files in S3 and firing off thousands of Lambda
         | invocations to process workloads, we should be storing it in
         | 100mb files and using industry leading ETL frameworks on it".
         | They were old school guys who hadn't adjusted to best practices
         | for object storage and modern large scale data loads (this was
         | a 1M event per second system) and so the company never really
         | succeeded despite thousands of customers and loads of revenue.
         | 
         | I consider cost consideration and profiling to be an essential
         | skill that any engineer working in cloud style environments
         | should have, but it's especially important that a staff
         | engineer or person in a similar position have this skill set
         | and be ready to grill people who come up with wasteful
         | solutions.
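
The file-size point in the comment above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only illustrative: the 1M events per second figure comes from the comment, while the 1,000 bytes per event is a hypothetical assumption chosen to keep the numbers round. It compares how many objects per second would be written at 100 KB versus 100 MB per object.

```python
def objects_per_second(events_per_second, bytes_per_event, object_bytes):
    """Objects written per second if events are packed into fixed-size objects."""
    bytes_per_second = events_per_second * bytes_per_event
    return bytes_per_second / object_bytes


EVENTS_PER_SECOND = 1_000_000   # from the comment
BYTES_PER_EVENT = 1_000         # hypothetical assumption, not from the thread
SMALL_OBJECT = 100_000          # 100 KB (decimal units)
LARGE_OBJECT = 100_000_000      # 100 MB (decimal units)

small = objects_per_second(EVENTS_PER_SECOND, BYTES_PER_EVENT, SMALL_OBJECT)
large = objects_per_second(EVENTS_PER_SECOND, BYTES_PER_EVENT, LARGE_OBJECT)

print(f"100 KB objects: {small:,.0f} per second")
print(f"100 MB objects: {large:,.0f} per second")
print(f"ratio: {small / large:,.0f}x")
```

The ratio between the two object counts does not depend on the assumed event size, only on the two object sizes, so the thousandfold difference in objects (and in per-object requests or invocations) holds whatever the real payload was.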
        
         | JohnMakin wrote:
         | > Coinbase might have been fine blowing 65 mil but take that
         | approach to a new startup and you could trivially eat up a
         | significant amount of runway with it.
         | 
         | Most startups are not going to have anywhere near the scale to
         | generate anything approaching this bill.
         | 
         | > I won't single out Datadog on this because the exact same
         | thing happens with cloud spend, and it's very literally burning
         | money.
         | 
         | Unless you're in the business of deploying and maintaining
         | production-ready datacenters at scale, it very literally isn't.
        
         | closeparen wrote:
         | That's the point of usage-based pricing: it's cheap to adopt
         | when you're small.
        
       | abxyz wrote:
       | (May 2023)
        
       | everfrustrated wrote:
       | >Originally published on 11 May 2023
        
       | cybice wrote:
       | An article that's basically an ad for Datadog: Pay us a ton of
       | money - it's still cheaper in the long run.
        
       | decimalenough wrote:
       | > _Assume that Datadog cuts the number of outages by half, by
       | preventing them with early monitoring. That would mean that
       | without Datadog, we'd look at 24 hours' worth of downtime, not
       | 12. Let's also assume that using Datadog results in mitigating
       | outages 50% faster than without - thanks to being able to connect
       | health metrics with logs, debug faster, pinpoint the root cause
       | and mitigate faster. In that case, without Datadog, we could be
       | looking at 36 hours worth of total downtime, versus the 12 hours
       | with Datadog. To put it in numbers: the company would make around
       | $9M in revenue it would otherwise lose, Now that $10M /year fee
       | practically pays for itself!_
       | 
       | Those are some pretty heroic assumptions. In particular, they
       | assume the only options are Datadog or nothing, when there are
       | far cheaper alternatives like the Prometheus/Grafana/Clickhouse
       | stack mentioned in the article itself.
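
To see how much the quoted argument leans on its assumptions, the arithmetic can be written out. The Python sketch below uses only the figures in the quote (12 hours of downtime with Datadog, half of outages prevented, mitigation "50% faster", $9M of revenue at stake). Reading "50% faster" as "1.5 times as long without the tool" is an interpretation, chosen because it reproduces the quoted 36 hours.

```python
def downtime_without_tool(hours_with_tool, prevented_fraction, slowdown):
    """Downtime implied if the tool's assumed benefits were removed."""
    # If the tool prevents this fraction of outages, the observed hours
    # are only the remainder of what would otherwise have occurred.
    after_prevention = hours_with_tool / (1 - prevented_fraction)
    # Assumption: without the tool each outage lasts (1 + slowdown) times as long.
    return after_prevention * (1 + slowdown)


HOURS_WITH_TOOL = 12          # from the quote
REVENUE_AT_STAKE = 9_000_000  # from the quote, in dollars

hours_without = downtime_without_tool(HOURS_WITH_TOOL, 0.5, 0.5)
extra_hours = hours_without - HOURS_WITH_TOOL
implied_hourly_revenue = REVENUE_AT_STAKE / extra_hours

print(hours_without)           # 36.0
print(extra_hours)             # 24.0
print(implied_hourly_revenue)  # 375000.0
```

Changing either 0.5 moves the result a lot: with no prevention benefit and the same slowdown, for example, the function returns 18 hours rather than 36, which is why the choice of baseline matters so much.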
        
         | passivepinetree wrote:
         | Another assumption that bothers me here is that the $9M in
         | revenue would be completely lost during an outage. I imagine
         | many customers would simply wait until the outage was resolved
         | before performing their intended transactions, meaning far less
         | than $9M would be lost.
        
           | calt wrote:
           | On the other hand, customers can become frustrated at being
           | unable to trade when they need to during an outage and go to
           | a competitor.
        
         | secondcoming wrote:
         | We are moving from Datadog to Prometheus/Grafana and it's
         | really not all a bed of roses. You'll need monitoring on your
         | monitoring.
        
       | cloudking wrote:
       | What problems does Datadog solve that you can't solve with
       | cheaper solutions?
        
       | therein wrote:
       | I should have known it was Coinbase. I know that Coinbase used to
       | spend $35,000 a month to back up the data directory of ETH nodes.
        
       | aeyes wrote:
       | > we really work with customers to restructure their contracts
       | 
       | Does anyone have such an experience with Datadog? A few million
       | wasn't enough to get them to talk about anything. We always paid
       | list price, and there was no negotiating either when they
       | restructured their pricing.
        
       | GuinansEyebrows wrote:
       | > To put it in numbers: the company would make around $9M in
       | revenue it would otherwise lose, Now that $10M/year fee
       | practically pays for itself!
       | 
       | am i misunderstanding, or is the author saying it's better to
       | spend $10m than $9m?
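
The comparison being questioned here is simple enough to write down. The Python sketch below takes the two quoted figures ($9M of revenue protected, $10M annual fee) and adds one hypothetical knob, `deferred_fraction`, for the point raised elsewhere in the thread that some customers may just transact later rather than not at all. The 0.5 used in the second call is purely illustrative.

```python
def net_benefit(revenue_protected, annual_fee, deferred_fraction=0.0):
    """Revenue genuinely saved by the tool, minus what the tool costs."""
    # Revenue that customers would have brought back after the outage
    # was never really at risk, so it does not count as saved.
    truly_saved = revenue_protected * (1 - deferred_fraction)
    return truly_saved - annual_fee


REVENUE_PROTECTED = 9_000_000  # from the quote, in dollars
ANNUAL_FEE = 10_000_000        # from the quote, in dollars

print(net_benefit(REVENUE_PROTECTED, ANNUAL_FEE))       # -1000000.0
print(net_benefit(REVENUE_PROTECTED, ANNUAL_FEE, 0.5))  # -5500000.0
```

On the quoted numbers alone the fee exceeds the revenue it protects by $1M, and any deferral of transactions widens that gap.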
        
       | gneray wrote:
       | This person is like the Gossip Guy of tech. Who cares?
        
       | generalpf wrote:
       | When did this guy stop writing about engineering and start
       | running a tech gossip rag?
        
       | willejs wrote:
       | I have run ELK, Grafana + Prom, Grafana + Thanos/Cortex, New
       | Relic and all of the more traditional products for
       | monitoring/observability. More recently, in the last few years,
       | I have been running full observability stacks via either the
       | Grafana LGTM stack or Datadog at reasonable scale and
       | complexity. Ultimately you want one tool that can alert you off
       | a metric, present you some traces, and drill down into logs, all
       | the way down the stack.
       | 
       | I have found Datadog to be hands down the best developer
       | experience from the get-go. The way it glues its mostly decent
       | products together is unparalleled compared to other products
       | (Grafana Cloud/LGTM). I usually say that if you're at a small to
       | medium scale business it just makes sense, IF you understand the
       | product and configure it correctly, which is reasonably easy.
       | The seamless integration between tracing, logging and metrics in
       | the platform, which you can then easily combine with alerts, is
       | great. However, it's easy to misconfigure it and spend a lot of
       | money on seemingly nothing. If you do not implement tracing and
       | structured logs (at the right volume and level) with trace/span
       | ids etc. all the way through your services, it's hard to see the
       | value, and it seems expensive. It requires some good knowledge
       | and configuration of the product to make it pay off. The rest of
       | the product features are generally good; for example, their
       | security suite is a good entry level to cloud security
       | monitoring and SIEM too.
       | 
       | However, when you get to a certain scale, the cost of APM and
       | Infrastructure hosts in Datadog can become somewhat
       | prohibitive. Also, Datadog's custom metrics pricing is somewhat
       | expensive, and its query language capabilities do not quite
       | match the power of PromQL, which you start to find yourself
       | needing to debug issues. At that point, the self-hosted LGTM
       | stack starts to make sense. However, it involves a lot more
       | education for end users in both integration (a little less now
       | that OTel is popular) and querying/building dashboards etc., as
       | well as running it yourself. The Grafana Cloud platform is more
       | attractive though.
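
The "structured logs with trace/span ids" practice mentioned in the comment above might look roughly like the following. This is a minimal sketch using only Python's standard `logging` and `json` modules, not any vendor's SDK; the logger name, field names, and the `handle_request` function are hypothetical, and a real service would normally take its ids from a tracing library rather than generate them by hand.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying trace/span ids."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Set through the `extra` argument; None if the caller omitted them.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def make_logger(stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout-demo")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, trace_id):
    """Log two steps of one request under the same trace and span."""
    ids = {"trace_id": trace_id, "span_id": uuid.uuid4().hex[:16]}
    logger.info("order received", extra=ids)
    logger.info("payment authorized", extra=ids)


buffer = io.StringIO()
handle_request(make_logger(buffer), trace_id=uuid.uuid4().hex)
print(buffer.getvalue(), end="")
```

Because every line carries the same `trace_id`, a backend that indexes that field can pivot from a trace to the exact log lines it produced, which is the correlation the comment describes as the source of the product's value.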
        
       ___________________________________________________________________
       (page generated 2025-06-30 23:00 UTC)