[HN Gopher] Show HN: Infracost (YC W21): Be proactive with your ...
___________________________________________________________________
Show HN: Infracost (YC W21): Be proactive with your cloud costs
Hi, we are Ali, Hassan, and Alistair, co-founders of Infracost
(https://www.infracost.io/). Infracost helps engineers see the cost
of each Terraform change before launching resources. When changes
are made, it posts a comment with the cloud cost impact. For
example, "you've added 2 instances and volumes, and changed an
instance type from medium to large, your bill will increase by 25%
next month, from $1000 to $1250 per month". We launched in
February 2021 (https://news.ycombinator.com/item?id=26064588), and
Infracost is now being actively used by over 3,000 companies.

However, there is a shift happening in the cloud cost management
space. New teams, called FinOps teams (a combination of "Finance"
and "DevOps"), are being formed within companies to manage cloud
costs. One of the first tasks assigned to these teams is to
determine "who is using what" - that is, which teams, business
units, products, etc. are spending the most on cloud. To accomplish
this, they use tags. Tags are key-value labels that every cloud
resource should have. For example, a server could be tagged with:
product=HackerNews; environment=production; team=blueTeam. If
resources are not tagged properly, you can't tell who is using
what.

FinOps teams face challenges, though, because their tools are
reactive. These tools begin by analyzing cloud bills and providing
visibility of tags from there. This means that they are looking at
resources that are already running in production and costing money.
A customer recently shared, "I want all resources to be properly
tagged. But if they are not, I would rather a resource not be
tagged at all than be tagged incorrectly." That was my "aha"
moment! FinOps teams can define a tagging policy that can be
validated in CI/CD before resources are launched. This is important
because if code is shipped with the wrong tags, FinOps teams will
have to fight for sprint time to fix them. Even if you shut down an
untagged resource directly in the cloud, the next time Terraform
runs, the resource will launch again with no tag. You need to fix
the issue at its root.

I'd love your feedback on our solution to the tagging problem. You
define your tag key-value policy in our SaaS product, and Infracost
checks all Terraform resources per change. If anything fails the
policy, it posts a comment with the details of which resources need
tags, and what the allowed values are. Once fixed, it will let the
code be shipped to production.

Try it out by going to https://dashboard.infracost.io/, setting up
with the GitHub app or GitLab app, and defining your tagging
policy. It will then scan your repository and inform you of any
missing tags and their file and line number. You can use the free
trial, but if you need more time, please message me and I'll extend
it for you.

I would also love to hear how others ensure that the correct tag
keys and values are applied to all resources, and whether this is
done proactively or reactively. Additionally, I would be interested
in hearing about any lessons learned in the process. Cheers
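
To make the idea of a tag key-value policy concrete, here is a minimal sketch of such a check, assuming the Terraform resources have already been parsed into plain dictionaries. The policy shape, the `check_tags` helper, and the resource addresses are hypothetical illustrations and are not Infracost's implementation or API.

```python
# Hypothetical policy: required tag keys mapped to their allowed values.
# None means any non-empty value is accepted for that key.
POLICY = {
    "environment": {"production", "staging", "dev"},
    "team": None,
    "product": None,
}


def check_tags(resources, policy=POLICY):
    """Return human-readable failures for resources that break the policy."""
    failures = []
    for res in resources:
        tags = res.get("tags") or {}
        for key, allowed in policy.items():
            value = tags.get(key)
            if not value:
                failures.append(f"{res['address']}: missing tag '{key}'")
            elif allowed is not None and value not in allowed:
                failures.append(
                    f"{res['address']}: tag '{key}'='{value}' "
                    f"not in {sorted(allowed)}"
                )
    return failures


# Made-up resources standing in for a parsed Terraform change.
resources = [
    {
        "address": "aws_instance.web",
        "tags": {
            "environment": "production",
            "team": "blueTeam",
            "product": "HackerNews",
        },
    },
    {
        "address": "aws_instance.worker",
        "tags": {"environment": "prod", "team": "blueTeam"},
    },
]

failures = check_tags(resources)
for failure in failures:
    print(failure)
```

A CI job could fail, or post a comment, whenever `failures` is non-empty, which is the "validate before launch" behaviour the post describes.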
Author : hkh
Score : 72 points
Date : 2023-08-09 13:01 UTC (10 hours ago)
| alexambarch wrote:
| I've seen Infracost around and think it looks cool, do you have
| any plans to add support for Pulumi? One advantage Terraform
| seems to have over Pulumi is the ecosystem of tools that support
| it.
| hkh wrote:
| Yep, for sure! It's on the roadmap. We are friends with the
| folks at Pulumi too. Love what they are building, so hopefully
| we will get some bandwidth and add support. And CloudFormation
| too. Azure ARM ... haha there is a lot more to build :)
| jaxxstorm wrote:
| I'm a Pulumi employee. If you'd like to chat about how we
| can help add support, email me at lbriggs[at]pulumi.com
| beaviskhan wrote:
| I have used this tool in the past, though free tier only. It was
| easy to get up and running and easy to plug into a CICD pipeline.
| The problem we had with it in practice was that we largely
| preferred serverless technologies in AWS where the cost depended
| mostly or even completely on actual usage - things like Lambda
| invocations, SQS operations, or autoscaling ECS services, for
| example. In this case the estimates we got from Infracost were
| not very useful. Providing a meaningful cost estimate requires
| projecting usage, which is something that our development teams
| were very bad at, if they could be bothered to care at all.
|
| I like the idea of implementing tagging enforcement in the
| pipeline. In a perfect world you would use cloud policies to do
| this, but in practice this is a big loser in AWS where a
| staggering number of resources are created by one API call and
| then tagged as a followup API call, meaning an SCP to prevent
| launch of untagged resources won't ever work.
| aliscott wrote:
| Great point about the multiple API calls. One of the big
| problems we've heard about using SCPs is that they are too
| late. If a deployment fails because of them the developer needs
| to go through another pull request/code review.
|
| Estimating costs for serverless technologies upfront is
| definitely challenging. We're thinking of bringing in the last
| 30 days of usage for these resources to give engineers some
| visibility.
| easton wrote:
| > but in practice this is a big loser in AWS where a staggering
| number of resources are created by one API call and then tagged
| as a followup API call
|
| We have a bot at work that sends you (or a DL with a bunch of
| people) a nastygram if you forget to tag your resources, but it
| doesn't know this. So if CloudFormation isn't done, you'll get
| the email and then have to respond to everyone with a
| screenshot showing that you didn't in fact goof it up. I wonder
| if you can make it so EventBridge (or however it's implemented,
| I'm not sure) delays an event for 30 seconds so they don't
| actually look until CF is done tagging.
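
The grace-period idea in the comment above could look roughly like this on the bot's side. This is only a sketch under stated assumptions: `should_alert`, its parameters, and the 30-second window are hypothetical, and nothing here relies on a specific EventBridge feature.

```python
from datetime import datetime, timedelta, timezone

# Assumed grace period, taken from the 30 seconds suggested in the comment.
GRACE_PERIOD = timedelta(seconds=30)


def should_alert(created_at, tags, required_keys, now=None):
    """Alert only if the resource is past the grace period and still
    lacks at least one required tag key."""
    now = now or datetime.now(timezone.utc)
    if now - created_at < GRACE_PERIOD:
        # Too early: tagging may still arrive in a follow-up API call.
        return False
    return any(key not in tags for key in required_keys)
```

With this shape, a resource that is created untagged and tagged a few seconds later never triggers an email, while one that stays untagged past the window still does.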
| danpalmer wrote:
| I've not used the product, so it may already do this, but does
| it ask you for the data it needs in the Pull Request?
|
| I have experience interacting with a logging system, where any
| diff to the logged data would need a tag like
| `log_size_increase=3 bytes` - the CICD system would then turn
| this, with the data already available, into an estimate of the
| overall extra storage needed.
|
| Perhaps the same could be done. Rather than figuring out
| "usage" of some serverless systems, which is a very vague
| question and therefore hard to answer, perhaps it could be more
| specific. For example, how many requests per second is it
| expected to receive? Or, which other serverless functions call
| it (and therefore which will it necessarily have the scale of).
| Or, what increase in usage would be expected for this change.
| beaviskhan wrote:
| It's been a while since I used this tool, but as best I can
| recall there was a way to provide usage estimates to feed the
| variable cost calculations. The biggest problem we had was
| getting development teams to know and care enough to provide
| accurate numbers. The suggestion in the post below to provide
| 30 days historical data as a starting point could be a great
| way to have a meaningful baseline. If someone had better
| projections, they could provide them, but at least it
| wouldn't be a total crapshoot.
| superdeeda wrote:
| Sounds useful!
|
| We're using service control policies to enforce tagging on
| certain resource types, and retroactively for the rest.
|
| Considering using a "shift-left" tool as well, but it would need
| to support Terraform, CDK, Serverless and CloudFormation.
| aliscott wrote:
| Awesome, yeah we've seen people using this method and the main
| complaint we've heard is this is annoying for developers since
| it blocks their deployments when they run `terraform apply`, so
| they need to create new pull requests and wait for another code
| review. Combining both can definitely help with this.
| cube2222 wrote:
| Not sure if with shift-left you mean specifically shifting left
| infracost and FinOps or general Infrastructure-as-Code shift
| left.
|
| In case it's the latter, I can recommend Spacelift[0] - a
| specialized CI/CD tool for IaC that supports all the tools
| you've mentioned. It basically helps you build policies and
| orchestrate your infra (don't want to go into too much detail
| in this comment) to scale it to bigger teams and setups.
| Policies to enforce tagging would indeed be a good example.
|
| It integrates with infracost too, but obviously just for the
| tools infracost works with, no CloudFormation.
|
| Disclaimer: Work at Spacelift so obviously take the
| recommendation with a grain of salt, but I do legitimately
| think it's a great tool.
|
| [0]: https://spacelift.io
|
| P.S. Congrats on the Show HN Infracost team!
| lispisok wrote:
| Cloud costs can easily balloon out of control and I bet this is
| helping companies save money, but this FinOps stuff also seems
| like something straight out of an HBO Silicon Valley skit.
| rchandna wrote:
| The Infracost Terraform Cloud run task is awesome!
| akh wrote:
| Thanks! Yep, we're partners with HashiCorp and worked on that
| integration early on :)
| l-a wrote:
| So as a developer advocate and a tinkerer with a little home lab,
| I am often setting up and tearing down infrastructure to test
| things out. I use AWS because that's what I am most familiar with and
| I try to be super careful about not running up a crazy bill, but
| I am still occasionally caught off guard. Now I am thinking about
| testing out Infracost to help prevent unwanted and unintentional
| spending.
|
| As far as a solution to consistent tagging -- if I am
| understanding the problem space correctly -- something like Cloud
| Custodian could possibly help. It's open source and you can set
| up auto-tagging policies as well as use Cloud Custodian to
| backfill tags. These policies use Lambda functions to respond to
| certain actions (i.e., spinning up an EC2 instance, etc.) and
| auto-tag with the resource creator/owner.
| hkh wrote:
| Bingo - so Infracost will tell you before you launch anything
| how much it'll cost. Now scale that to a few thousand
| developers across a large company, and it's very impactful.
|
| Backfilling tags works, but the issue is if Terraform isn't
| updated, it causes drift - it's much better to fix it at the
| root, so that's what Infracost helps with
| toshk wrote:
| I love how we just build complexity upon complexity. A tool for
| all the problems created by the new tool that was meant to solve
| the problems of all these other tools. A never-ending mountain of
| complexity. In that sense coding (and hosting) is like the law.
| The entire ecosystem will just keep expanding in complexity
| decade by decade.
| nine_zeros wrote:
| How else will people make money (if they are a startup) or get
| promotions (if they are in a corporation)?
| GrandPoobahLOL wrote:
| Oh, this is interesting. We're currently using Vantage
| (https://vantage.sh). How would you say Infracost compares?
| hkh wrote:
| Vantage is awesome - I've talked with Ben (their CEO) a few
| times. There are a lot of tools that start from the cloud
| bills, and give you visibility of everything (Vantage,
| Cloudability, Cloud Health Tech, Flexera etc) - all of these
| tools are reactive in nature as they start from the cloud
| bills. Infracost sits where your code sits, and therefore it
| can be proactive; before anything is launched and costs money,
| it'll tell you how much it is going to cost. So if you have a
| budget of $1K, and you try to launch a 24xl instance, it'll
| tell you that your budget will be blown, before you've launched
| the resource. Making it all proactive.
| plasma wrote:
| I'm not a target user, but you mentioned the tagging problem and
| git integration, perhaps you could infer at least the git user
| responsible for each resource cost (git blame the TF file and
| identify the username who added the resource) as a minimum amount
| of detail provided out of the box?
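
A rough sketch of the git-blame idea above, assuming `git` is on the PATH and the Terraform file is tracked. The helper names are hypothetical. Note that `git blame` reports whoever last modified the `resource` line, which is not necessarily the person who first added the resource.

```python
import re
import subprocess

HEX = set("0123456789abcdef")
RESOURCE_RE = re.compile(r'^\s*resource\s+"([^"]+)"\s+"([^"]+)"')


def parse_line_porcelain(output):
    """Map final line number -> author from `git blame --line-porcelain`."""
    authors = {}
    current_line = None
    for line in output.split("\n"):
        if line.startswith("\t"):
            current_line = None  # the file-content line closes an entry
            continue
        parts = line.split(" ")
        if (current_line is None and len(parts) >= 3
                and len(parts[0]) == 40 and set(parts[0]) <= HEX
                and parts[2].isdigit()):
            current_line = int(parts[2])  # header: sha, orig line, final line
        elif line.startswith("author ") and current_line is not None:
            authors[current_line] = line[len("author "):]
    return authors


def blame_authors(path):
    """Run git blame on a tracked file and return {line number: author}."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain", "--", path],
        capture_output=True, text=True, check=True,
    )
    return parse_line_porcelain(result.stdout)


def resource_owners(tf_text, authors):
    """Map 'type.name' -> author of the line declaring that resource."""
    owners = {}
    for number, line in enumerate(tf_text.split("\n"), start=1):
        match = RESOURCE_RE.match(line)
        if match:
            address = f"{match.group(1)}.{match.group(2)}"
            owners[address] = authors.get(number, "unknown")
    return owners
```

For a file such as `main.tf`, calling `resource_owners(open("main.tf").read(), blame_authors("main.tf"))` would give a first-pass owner per resource without any tags.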
| akh wrote:
| Interesting idea! The pull request authors are shown out of the
| box but we hadn't thought of using git to find the user for
| each resource on the main branch. Most organizations end up
| tagging the resources with some sort of owner or team so they
| can group the costs using that and track it per
| team/service/product over time. That's often how FinOps teams
| start to create a sense of ownership for cloud costs amongst
| teams.
| frellus wrote:
| Would like to know more about how Infracost does dynamic cost
| estimation, for example if I allocate an S3 bucket I have no idea
| how much it'll grow to, so what does it show? Or what about EC2
| w/ batch, or Lambda? Does it force the developer to estimate the
| usage pattern, or...?
| cbcoutinho wrote:
| Yes, usage metrics are set via a configuration file, which you
| can also check into git. Changes to resources as well as usage
| estimates contribute to the forecasted costs
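
To illustrate how usage estimates can feed a variable-cost forecast, here is a small sketch for a Lambda-style function. The prices are illustrative assumptions (they are hard-coded, ignore any free tier, and may not match current AWS pricing), and the dictionary keys are hypothetical rather than Infracost's actual usage-file format.

```python
# Illustrative prices in USD; treat them as assumptions, not a price list.
PRICE_PER_MILLION_REQUESTS = 0.20
PRICE_PER_GB_SECOND = 0.0000166667


def lambda_monthly_cost(monthly_requests, avg_duration_ms, memory_mb):
    """Estimate monthly cost from request count, duration, and memory."""
    request_cost = monthly_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    gb_seconds = (
        monthly_requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    )
    return request_cost + gb_seconds * PRICE_PER_GB_SECOND


# Hypothetical usage estimate that a team might keep in version control.
usage = {
    "monthly_requests": 10_000_000,
    "avg_duration_ms": 250,
    "memory_mb": 512,
}
cost = lambda_monthly_cost(**usage)
print(f"${cost:.2f} per month")
```

Changing any of the three usage numbers in a pull request changes the forecast, which is why checking the estimates into git alongside the resources is useful.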
| keepamovin wrote:
| I love it! But as human processes go, it will need to surmount
| the "flaky tests" problem of, "let's just turn off this test
| because it's flaky and we need to merge this branch". I guess
| that means FinOps teams will still have to fight to be heard, but
| I think you are helping shift a lot of their burden!
|
| What remains seems more like organizational dynamics, but what
| are your thoughts?
| akh wrote:
| Great point - indeed FinOps teams consistently rank "empowering
| engineers to take action" as their number 1 challenge
| (https://data.finops.org) - and by that they mean the human and
| organization dynamics of the culture change they want to create
| across the org.
|
| The testing analogy is a good one as this feature also shows
| the engineers the current "failing policies" on the main branch
| too, so whilst they could merge the pull request without fixing
| the tagging issue, it'll just get added to the list. And maybe
| like tests, they group them into one task and go through to fix
| them all every so often to get the main branch back to green!
| keepamovin wrote:
| Nice! What did you start out doing, if you don't mind me
| asking? And how did you come to this pivot, if that's what
| it is?
| akh wrote:
| We started out with the Infracost CLI showing engineers
| cost estimates in the terminal before they deployed their
| code. The learning was that it also makes sense to check
| for other things like tagging policy issues and best
| practices not being followed as these things are more
| actionable than showing engineers a cost estimate. The cost
| estimate is actually more useful to trigger notifications
| on, e.g. if an engineer is adding $10K worth of databases,
| let the engineering management or FinOps teams know so
| they're not surprised by the spike in the bill and can
| adjust budgets if needed.
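
The notification idea above amounts to comparing the cost before and after a change against a threshold. A minimal sketch follows, with hypothetical function names and a threshold borrowed from the $10K example in the comment:

```python
def cost_change(before, after):
    """Return (absolute delta, percent change) between two monthly costs."""
    delta = after - before
    percent = (delta / before * 100) if before else None
    return delta, percent


def needs_notification(before, after, threshold=10_000):
    """True when the monthly increase reaches the notification threshold."""
    delta, _ = cost_change(before, after)
    return delta >= threshold


delta, percent = cost_change(1000, 1250)
print(f"Monthly cost changes by ${delta} ({percent:.0f}%)")
```

The sample call uses the figures from the original post ($1000 to $1250 per month); a change that adds $10K or more would additionally return `True` from `needs_notification`.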
| vasco wrote:
| For anyone following at home, once you've identified a test as
| flaky, your next action should be to turn it off. Nothing good
| comes from keeping flaky tests around. Detect them as soon as
| you can and either fix them _right there_ or skip them.
|
| I've used this in practice in a company of ~80 developers at
| the time, applied it because I read about it in some Dropbox
| papers, and have since seen it work in 2 other companies. Skip
| your flaky tests!!
| akh wrote:
| I suppose the difference between flaky tests and typos in
| tags/missing tags is that the latter is less about flakiness
| and more about the engineer deciding not to fix the
| tagging issue and merging anyway. In Terraform, tags are
| fairly easy to fix and don't require the resource to be
| recreated so it feels like it should be a quicker fix than
| fixing/refactoring tests.
|
| I think the easier we make it for engineers to fix tagging
| issues, the more likely it'll be for engineers to take
| action. Send me an email asking me to read the company's wiki
| page on tagging policy and I'll delete the email; tell me I
| have a typo on line 8 as soon as I open my pull request, I'll
| fix it and move on.
| haxiel wrote:
| Hi, Azure admin here. The Azure Policy service includes a set of
| built-in policies to handle tags. There's one policy that
| requires new resource groups to be created with specific tags.
| Another policy allows resources within the resource group to
| inherit the same tags. I think this combination of policies would
| solve the tagging problem quite neatly, though I haven't tested
| it myself.
| hkh wrote:
| Hi, I think the key issue with both the Azure policy, and the
| Amazon services is that they only work after a pull request has
| been merged. Then the build fails, and the engineer has to come
| back to their code, make a new Pull Request and then send it
| again, till it passes.
|
| That's the feedback we got from users, so with Infracost, the
| Pull Request itself tells the engineer what needs to be done,
| along with exact code line numbers etc before going any
| further, so everything is fixed within the same pull request.
| Also, it works across all cloud providers, so FinOps can set
| central tags in a uniform manner no matter where the engineers
| are launching resources.
___________________________________________________________________
(page generated 2023-08-09 23:02 UTC)