[HN Gopher] Launch HN: Dashdive (YC W23) - Track your cloud cost...
       ___________________________________________________________________
        
       Launch HN: Dashdive (YC W23) - Track your cloud costs precisely
        
        Hi, HN. We (Adam, Micah and Ben) are excited to show you Dashdive
        (https://www.dashdive.com/), which calculates the cloud cost
        incurred by each user action taken in your product. There's a demo
        video at https://www.dashdive.com/#video and an interactive demo
        here: https://demo.dashdive.com.

        We talked to dozens of software engineers and kept hearing about
        three problems caused by poor cloud cost observability:

        (1) _Cost anomalies are slow to detect and hard to diagnose._ For
        example, a computer vision company noticed their AWS costs spiking
        one month. Costs accrued until they identified the culprit: one of
        their customers had put up a life-size cutout of John Wayne, and
        they were running non-stop facial recognition on it.

        (2) _No cost accountability in big orgs._ For example, a public
        tech company's top priority last year was to increase gross
        margin. But they had no way to identify the highest-cost
        managers/products or measure improvement despite tagging efforts.

        (3) _Uncertain and variable per-customer gross margins._ For
        example, a SaaS startup had one customer generating >50% of its
        revenue. That customer's usage of certain features had recently
        1,000x'ed, and they weren't sure the contract was still
        profitable.

        (If you've had an experience like this, we'd love to hear about it
        in the comments.)

        We built Dashdive because none of the existing cloud cost
        dashboard products solves all three of these problems, which often
        requires _sub-resource_ cost attribution. Existing tools combine
        AWS, GCP, Datadog, Snowflake, etc. cost data in a single dashboard
        with additional features like alerting and cost-cutting
        recommendations. This is sufficient in many cases, but it falls
        short when a company (a) wants per-customer, per-team or
        per-feature cost visibility and (b) has a multitenant
        architecture.

        By contrast, Dashdive uses observability tools to collect granular
        cloud usage data at the level of individual user actions (e.g.
        each API call or database transaction). We attribute this activity
        to the corresponding feature, customer and team, and estimate its
        cost based on the applicable rate. The result is more detailed
        cost and usage data than can be obtained with tagging. This
        information can be used to detect anomalies in real time and to
        identify costly teams, features and customers. One of our
        customers is even using Dashdive to charge its own customers for
        their cloud usage.

        We use Kafka to ingest large volumes (>100m/day) of product usage
        events, and our web dashboard supports real-time querying thanks
        to ClickHouse. This makes it fast and easy to answer questions
        like: "Over the past 14 days, how much vCPU time did customer X
        use on Kubernetes cluster A, and how much did that cost me?" You
        can answer such questions even when the same container or pod is
        shared by multiple customers, features and/or teams.
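
        As a rough illustration (the table and column names below are
        simplified placeholders, not our exact schema), that kind of
        question boils down to a single aggregation over the ingested
        events, e.g. via the clickhouse-connect Python client:

            import clickhouse_connect

            # Hypothetical table: one row per attributed usage event.
            client = clickhouse_connect.get_client(
                host="clickhouse.internal", username="default", password=""
            )
            sql = """
                SELECT
                    sum(vcpu_seconds)                 AS total_vcpu_seconds,
                    sum(vcpu_seconds * vcpu_rate_usd) AS total_cost_usd
                FROM usage_events
                WHERE customer_id = 'customer_x'
                  AND cluster     = 'cluster_a'
                  AND event_time >= now() - INTERVAL 14 DAY
            """
            vcpu_seconds, cost_usd = client.query(sql).result_rows[0]
            print(f"{vcpu_seconds / 3600:.1f} vCPU-hours, ${cost_usd:.2f}")
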
        You can test drive the product with example data here:
        https://demo.dashdive.com/. Given the high per-customer cost of
        our infrastructure and the manual steps required for setup on our
        part, we don't offer self-serve onboarding or a public "free tier"
        to monitor your own cloud usage, but this demo gives a basic view
        of our product.

        Right now, Dashdive supports S3 and S3-compatible object storage
        providers. We're working to add support for other providers and
        services, particularly compute services (EC2, GCP VMs, ECS, EKS,
        GKE, etc.). If there's any service in particular you want to see
        supported, please tell us in the comments. We're eager to see your
        comments, questions, concerns, etc.
        
       Author : ashug
       Score  : 56 points
       Date   : 2024-01-29 17:03 UTC (5 hours ago)
        
       | ericb wrote:
       | Very cool!
       | 
       | Feature request: I have really struggled with turning the thing
       | costing me money _off_ in AWS.
       | 
        | If, with the right master credentials, I could consistently
        | and easily do that somehow, that'd be a 10x feature. If you
        | made that use case free, you'd get tons of installations at
        | the top of your sales funnel from people who desperately need
        | this.
       | 
       | edit: This used to say "in your app" and that wasn't quite what I
       | want, so I changed that language, but jedberg's objections in the
       | comments below were, and are, valid concerns with what I was
       | stating and any implementation.
        
         | jedberg wrote:
         | That would be a security nightmare. You don't want to give such
         | powerful credentials to anyone, much less a 3rd party.
         | 
         | But a good stopgap would be a feature to spit out an API
         | command that someone could run (or a CloudFormation or TF file)
         | where you can put your own credentials in and run it yourself.
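          | 
          | For example, a minimal sketch of such a run-it-yourself
          | command with boto3 (the instance ID is a placeholder, and
          | credentials come from your own environment or profile, not
          | from the vendor):
          | 
          |   import boto3
          | 
          |   # Runs with whatever credentials the operator already has;
          |   # nothing is handed to a third party.
          |   ec2 = boto3.client("ec2", region_name="us-east-1")
          |   ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])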
        
           | abraae wrote:
            | If you could only turn things off, then perhaps that's
            | less of a nightmare in some settings.
        
             | jedberg wrote:
             | The only way it would be effective is if that credential
             | had broad abilities to destroy, and I wouldn't want such a
              | credential to get stolen. It would be bad enough for
              | your most trusted operator to have it, honestly.
             | 
              | The best way to do it would be to run the delete with
              | _no_ access, see what permission errors you get, and
              | then only give those permissions until you've
              | successfully deleted the object.
             | 
             | The safest way (but obviously more work) to do one off work
             | like this is start permissionless and slowly open up. There
             | are tools that can help with this, extracting the
             | permission errors and generating the files to update the
             | permissions.
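              | 
              | As an illustration only (the instance ARN, user name and
              | action below are placeholders for whatever the error
              | messages actually named), the policy you end up with
              | after that loop can be as narrow as this, attached via
              | boto3:
              | 
              |   import json
              |   import boto3
              | 
              |   # Hypothetical end state of the "grant only what the
              |   # errors demanded" loop: permission to stop exactly
              |   # one instance, nothing else.
              |   policy = {
              |       "Version": "2012-10-17",
              |       "Statement": [{
              |           "Effect": "Allow",
              |           "Action": "ec2:StopInstances",
              |           "Resource": "arn:aws:ec2:us-east-1:123456789012"
              |                       ":instance/i-0123456789abcdef0",
              |       }],
              |   }
              |   boto3.client("iam").put_user_policy(
              |       UserName="cost-cleanup-bot",      # placeholder
              |       PolicyName="stop-one-instance",
              |       PolicyDocument=json.dumps(policy),
              |   )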
        
         | nathanwallace wrote:
         | Perhaps give Flowpipe [1] a try? It provides "Pipelines for
         | DevOps", including a library of AWS actions [2], that can be
         | run on a schedule (e.g. daily) or a trigger (e.g. instance
         | start) to do things like turn off, update or delete expensive
          | assets. It can also be combined with Steampipe [3] and the
          | queries from the AWS Thrifty mod [4] to run common cost
          | checks. We'd love your feedback if you give it a spin!
         | 
          | 1 - https://github.com/turbot/flowpipe
          | 2 - https://github.com/turbot/flowpipe-mod-aws
          | 3 - https://github.com/turbot/steampipe
          | 4 - https://github.com/turbot/steampipe-mod-aws-thrifty
        
         | scapecast wrote:
         | We solved that exact problem with our open source tool Resoto.
         | Specifically our "Defrag" module, which cleans up unused and
         | expired resources:
         | 
         | https://resoto.com/defrag
         | https://github.com/someengineering/resoto
         | 
          | The magic behind the cleanup is Resoto's inventory graph -
          | the graph captures the cleanup steps for each individual AWS
          | resource.
         | 
          | One of Resoto's users, D2iQ (now part of Nutanix), reduced
          | their monthly cloud bill by ~78%, from $561K to $122K per
          | month. There's a step-by-step tutorial on our blog showing
          | how they did it.
         | 
          | I don't mean to steal Dashdive's thunder here, though -
          | congrats on the launch!
        
         | ashug wrote:
         | This is an interesting point, and we could definitely consider
         | something like this. Where exactly do you run into problems?
         | 
         | For example, let's say you've figured out that a particular EC2
         | instance or database is too costly. What is the sticking point
         | for you in turning it off? Is it that the resource has other
         | essential functions unrelated to the cost spike? Or is it the
         | identification of the exact resource that's the problem?
        
           | ericb wrote:
            | When we evolved to SSO and subaccounts with various
            | roles/access, there were resources running/used by
            | different accounts. I would see the instance in the costs.
            | But then I'd try to find that instance, started by another
            | dev, and get access to shut it off. And even though I'm
            | the main account owner--which, to me, means I should be
            | able to nuke whatever, since I pay the bills--I always had
            | trouble getting to it and getting the permissions for what
            | I wanted.
           | 
           | I used Vantage, which helped me see the problem, but then
           | taking action on it was traumatic.
           | 
           | The barriers are:
           | 
           | - who owns it?
           | 
           | - what service is it in (if it is logs, for example)?
           | 
           | - where is the screen it is on?
           | 
           | - how do I get the permissions to kill it?
        
             | ashug wrote:
             | Makes a lot of sense - in my opinion AWS doesn't do the
             | best job with this. We had a similar problem with EKS where
             | even as the root user I couldn't view cluster details
             | (https://medium.com/@elliotgraebert/comparing-the-top-
             | eight-m....).
             | 
             | I agree that this would be a great feature. To be honest,
             | our product isn't currently focused on this sort of
             | automatic management of resource lifecycle; we're much
             | better at data collection. But thank you for flagging this!
             | We'll definitely keep it in mind as we add support for
             | compute services (right now we only support S3 and there's
             | nothing to "spin down").
             | 
             | Edit: The part less related to permissions (how can I kill
             | it) and more related to discoverability (which resource is
             | it) is more adjacent to what we've already built and is
             | something we can take a look at soon. Perhaps we can take a
             | crack at the permissions aspect afterwards.
        
         | kapilvt wrote:
          | The CNCF project Cloud Custodian fits these sorts of use
          | cases pretty well, and supports periodic or event-based
          | triggers. https://cloudcustodian.io/docs/aws/examples/
        
       | learner007 wrote:
        | Looks like a good idea; my only gripe is that the base plan is
        | too expensive for MVP products that aren't earning anything
        | yet.
        
       | ravivyas wrote:
        | Do you plan to support on-prem usage (if not cost) in the
        | future?
        | 
        | Also, what about utilisation rates?
        
         | ashug wrote:
         | Usage is what we track directly, and then we apply the cloud
         | provider's billing rules for the given service (e.g. S3) to
         | calculate the resulting costs. So utilization rates for
         | something like a Kubernetes cluster are easy to derive with the
         | data already collected - just take the usage we've tracked and
         | divide by the total resources available to the cluster. We
         | haven't finished the k8s offering yet, but this would be a
         | great feature / view for us to include (the same goes for most
         | other compute offerings, e.g. EC2, ECS).
         | 
         | The same goes for on prem. We don't have any on prem customers
         | currently, but it would be easy to add a feature where you
         | input the total capacity and/or monthly cost of your on prem
         | infrastructure, and use the collected usage data to calculate
         | utilization rate and "effective cost" incurred by each feature,
         | customer, etc. Thanks for the questions!
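          | 
          | As a back-of-the-envelope sketch (the numbers below are made
          | up, purely for illustration), the arithmetic would be:
          | 
          |   # Hypothetical month of tracked usage for one cluster.
          |   tracked_vcpu_seconds = {"customer_a": 5.0e7,
          |                           "customer_b": 2.0e7}
          |   # 4 nodes x 8 vCPUs, available for 30 days.
          |   cluster_vcpu_seconds = 4 * 8 * 3600 * 24 * 30
          |   monthly_cluster_cost = 2000.0  # USD, user-supplied
          | 
          |   used = sum(tracked_vcpu_seconds.values())
          |   utilization = used / cluster_vcpu_seconds
          |   # Cost share = fraction of total capacity each one used.
          |   effective_cost = {
          |       c: monthly_cluster_cost * s / cluster_vcpu_seconds
          |       for c, s in tracked_vcpu_seconds.items()
          |   }
          |   print(f"utilization: {utilization:.1%}", effective_cost)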
        
       | helloericsf wrote:
        | Many engineering teams unquestionably find this challenging.
        | Just a quick question: does it solely track usage, or can it
        | also aid in cutting down costs?
        
         | ashug wrote:
         | You can use the usage and cost data Dashdive collects to
         | identify cost spikes or ongoing inefficiencies (e.g. this
         | particular feature is using more vCPU than should be
          | necessary). But we won't do any automatic cost cutting for
          | you (some products allow you to buy reserved instances or
          | rightsize directly within their app).
        
       | neom wrote:
       | I've been using vantage.sh - how are you thinking about
       | differentiation there?
        
         | ashug wrote:
         | The key differentiation from Vantage and other similar products
         | is the level of granularity.
         | 
         | Vantage is in the category mentioned above: it combines AWS,
         | GCP, Datadog, Snowflake, etc. cost data in a single dashboard
         | and supports tagging. For example, if I have a single tenant
         | architecture where every customer has their own Postgres RDS
         | instance, I can tag each RDS node with `customerId:XXX`. Then I
         | can get cost broken down by customer ID in Vantage.
         | 
         | However, if my entire app (including every customer) uses the
         | same large RDS instance, or if I'm using a DBaaS like Supabase,
         | tools like Vantage, which rely on tagging at the resource
         | level, cannot show a breakdown of usage or cost per customer.
         | By contrast, we record each query to your monolith or
         | "serverless" DB (SELECT/INSERT/UPDATE) along with the vCPU and
         | memory consumed, tag the query with `customerId` and other
         | relevant attributes, and calculate the cost incurred based on
         | the billing rules of RDS or Supabase.
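          | 
          | To make that concrete (field names here are hypothetical
          | placeholders, not our exact schema), each recorded query
          | ends up as an attributed event roughly like:
          | 
          |   event = {
          |       "event_time": "2024-01-29T17:03:00Z",
          |       "service": "rds",             # or "supabase", "s3", ...
          |       "operation": "SELECT",
          |       "customer_id": "cust_123",    # from request context
          |       "feature": "report_export",
          |       "team": "analytics",
          |       "vcpu_seconds": 0.042,
          |       "memory_mb_seconds": 310.0,
          |       # usage x the provider's applicable billing rate
          |       "estimated_cost_usd": 7.1e-06,
          |   }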
        
       | kingnothing wrote:
       | How does your product differentiate from Yotascale?
        
         | ashug wrote:
         | Yotascale is in many ways similar to Vantage, so a similar
         | answer to this one
         | (https://news.ycombinator.com/item?id=39178753#39181486)
         | applies.
         | 
         | When we were originally researching to see if anyone had done
         | something like Dashdive already (i.e., specifically applying
         | observability tools / high volume event ingestion to cloud
         | costs), I did manage to find a Yotascale video in which they
         | mentioned defining custom usage events in Kubernetes for a
         | specific customer. It seemed more like a custom feature than a
         | generally available part of their product, but I could be
         | mistaken.
        
       | jrhizor wrote:
       | How does this compare to Ternary?
        
         | ashug wrote:
         | Ternary is similar to Vantage in its offering so this answer
         | (https://news.ycombinator.com/item?id=39178753#39181486) also
         | applies here.
         | 
         | There are quite a few cloud cost tools out there which use
         | AWS's cost and usage reports (or GCP/Azure equivalents) as
         | their sources of truth, and as a consequence their data is
         | largely based on tagging cloud resources with attributes of
         | interest (e.g. EC2 instance XXX has tags customerId:ABC,
         | teamId:DEF, featureId:GHI). These include Ternary, Vantage,
         | Yotascale, and others (CloudChipr, CloudZero, Archera,
         | Cloudthread, Finout, Spot by NetApp, DoiT, Tailwarden, CAST AI,
         | Densify, GorillaStack, Economize Cloud). Some of these offer
         | AI-based automatic tagging as well.
         | 
          | But even so, if I - for example - have lots of k8s replica
          | Pods which serve most of my application traffic, I can't use
         | any of these products to figure out which customers, API
         | endpoints, code paths, features, etc. are costing me the most.
         | At best I could tag the entire Deployment, or maybe even each
         | Pod. But the problem is that every Pod is serving lots of
         | endpoints, customers, and features. However, Dashdive can give
         | you this info.
         | 
         | From a technical implementation standpoint, Dashdive is much
         | closer to application performance monitoring (APM) products or
         | usage based billing products than it is to most cloud cost
         | dashboard products.
        
       | i_like_pie1 wrote:
       | cool/nice work
        
       | darkbatman wrote:
        | How are you ingesting from Kinesis to ClickHouse? Are you
        | using some custom sink connector, or processes on EC2 or
        | Lambda?
        
         | ashug wrote:
         | We actually use Kafka rather than Kinesis, although they're
         | very similar. For writing to ClickHouse from Kafka, we use the
         | ClickHouse Kafka sink connector:
         | https://github.com/ClickHouse/clickhouse-kafka-connect.
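          | 
          | For anyone curious, registering that sink boils down to a
          | small config map submitted to Kafka Connect's REST API. A
          | rough sketch (host names are placeholders, and the exact
          | property names should be checked against the connector's
          | README):
          | 
          |   import requests
          | 
          |   # Hypothetical sink config for the ClickHouse connector.
          |   config = {
          |       "connector.class":
          |           "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
          |       "tasks.max": "2",
          |       "topics": "usage_events",
          |       "hostname": "clickhouse.internal",  # placeholder
          |       "port": "8443",
          |       "database": "default",
          |       "username": "default",
          |       "password": "***",
          |   }
          |   # Create or update the connector via the Connect REST API.
          |   requests.put(
          |       "http://kafka-connect:8083/connectors/clickhouse-sink/config",
          |       json=config, timeout=10,
          |   ).raise_for_status()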
        
       ___________________________________________________________________
       (page generated 2024-01-29 23:00 UTC)