[HN Gopher] Terraform best practices for reliability at any scale
       ___________________________________________________________________
        
       Terraform best practices for reliability at any scale
        
       Author : holoway
       Score  : 79 points
       Date   : 2023-08-04 19:22 UTC (3 hours ago)
        
 (HTM) web link (substrate.tools)
 (TXT) w3m dump (substrate.tools)
        
       | Terretta wrote:
       | This should be mandatory reading for anyone doing IaC, using TF
       | and AWS or not, less for how you do it, more for what and why.
       | 
       | _// shout out to AWS CAB alums_
        
       | thunfisch wrote:
       | We're using Terragrunt with hundreds of AWS accounts and
       | thousands of Terraform deployments/states.
       | 
       | I'll never want to do this without Terragrunt again. The
       | suggested method of referencing remote states, and writing out
       | the backends will fall apart instantly at that scale. It's just
       | way too brittle and unwieldy.
       | 
       | Terragrunt with some good defaults that will be included, and
       | separated states for modules (which makes partial applies a
       | breeze) as well as autogenerated backend configs (let Terragrunt
       | inject it for you, with templated values) is the way to go.
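
For readers who haven't seen it, a minimal sketch of the autogenerated backend setup described above. It assumes an S3 backend; the bucket, lock table, region, and module path are hypothetical placeholders, not the commenter's actual configuration.

```hcl
# --- terragrunt.hcl at the repository root (hypothetical layout) ---
# Terragrunt generates a backend.tf for every child configuration that
# includes this file, so no backend block is written by hand.
remote_state {
  backend = "s3"

  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket         = "example-terraform-state" # hypothetical bucket
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "example-terraform-locks" # hypothetical lock table
  }
}

# --- app1/staging/vpc/terragrunt.hcl (one child configuration) ---
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../modules/vpc" # hypothetical module path
}
```

Because the state key is derived from each child's relative path, every directory gets its own state file without repeating backend settings.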
        
         | ckdarby wrote:
         | Have you spent any time with Pulumi?
         | 
         | I've kind of found terraform is dying and encourages a lot of
         | bad practices but everyone agrees with them because HCL and it
         | is transferable as most companies are just using TF.
        
         | DelightOne wrote:
         | Do you need to chain multiple Terragrunt executions to first
         | bring the Kubernetes cluster up and then the containers, or
         | does Terragrunt fix that?
        
           | miduil wrote:
           | Yes, with terragrunt you can do a `terragrunt run-all apply`
           | and based on `output` to `variable` in each module data can
           | be passed from one state/module to the next one, terragrunt
           | knows how to run them in the right order so you can bootstrap
           | your EKS cluster by having one module which bootstraps the
           | account, then another one which bootstraps EKS, then one that
           | configures the cluster, installs your "base pods" and then
           | later everything else.
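
A rough sketch of the output-to-variable wiring described above, using Terragrunt's `dependency` block. The directory names, module path, and output names are assumptions for illustration only.

```hcl
# --- cluster-config/terragrunt.hcl (hypothetical; sits next to ../eks) ---
terraform {
  source = "../../modules/cluster-config" # hypothetical module path
}

# Declares that this configuration consumes outputs from ../eks.
# Terragrunt uses this to order the runs.
dependency "eks" {
  config_path = "../eks"
}

# Outputs of the eks configuration become input variables here.
inputs = {
  cluster_name     = dependency.eks.outputs.cluster_name     # assumed output name
  cluster_endpoint = dependency.eks.outputs.cluster_endpoint # assumed output name
}
```

Running `terragrunt run-all apply` from the parent directory should then apply `eks` before `cluster-config`.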
        
       | swozey wrote:
       | This was a good read but really if you already follow the common
       | best practices of IAC/terraform/aws multi-account I don't think
       | you're going to learn much.
       | 
       | The comments in here kind of made me think I was going to hop in
       | and take away some huge wins I hadn't considered. But I have been
       | working with Terraform and AWS for a very long time.
       | 
       | If you're unfamiliar with AWS multi-account best practices this
       | is a good read.
       | 
       | https://aws.amazon.com/organizations/getting-started/best-pr...
        
       | xyzzy123 wrote:
       | Here's my #1 tip, most important:
       | 
       | Try to keep your stateful resources / services in different
       | "stacks" than your stateless things.
       | 
       | Absolutely 100% completely obvious, maybe too obvious? Because
       | none of these guides ever mention it.
       | 
       | If you have state in a stack it becomes 10x more expensive and
       | difficult to "replace your way out of trouble" (aka destroy and
       | recreate as last resort). You want as much as possible in
       | stateless, disposable stacks. DONT put that customer exposed
       | bucket or DB in the same state/stack as app server resources.
       | 
       | I don't care about your folder structure, I care about what % of
       | the infra I can reliably burn to the ground and replace using
       | pipelines and no manual actions.
        
         | sverhagen wrote:
         | Is a "stack" here a (root) folder on which you'd do a
         | "terraform apply"? I've never known what to call those, surely
         | they aren't "modules".
         | 
         | And, so, you're saying: try to have a separate deployment
         | (stack then?) that contains the state, so you can wipe away
         | everything else if you want to, without having to manage the
         | state?
        
           | xyzzy123 wrote:
           | It's not exactly about the folder, the IaC from a single
           | folder / project can be instantiated in multiple places. Each
           | time you do that, it has a unique state file, so I usually
           | hear it referred to as a "state". In cfn you can similarly
           | deploy the same thing lots of times and each instantiation is
           | called a "stack", so stack/state tend to get used
           | interchangeably.
           | 
           | And yes, that's a succinct rephrasing.
           | 
           | When you first use iac it maybe seems logical to put your db
           | and app server in the same "thing" (stack or state file) but
           | now that thing is "pet like" and you have to take care of it
           | forever. You can't safely have a "destroy" action in your
           | pipeline as a last resort.
           | 
           | If you put the stateful stuff in a separate stack you can
           | freely modify the things in the stateless one with much less
           | worry.
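
One possible shape for this split, as a sketch only. It assumes two root modules in sibling directories, each with its own state file; all names are hypothetical. The `prevent_destroy` guard is an extra precaution that the comment above does not mention.

```hcl
# --- stateful/main.tf (root module with its own state, changed rarely) ---
resource "aws_s3_bucket" "customer_uploads" {
  bucket = "example-customer-uploads" # hypothetical bucket name

  lifecycle {
    # Assumption, not from the thread: Terraform rejects any plan that
    # would destroy this resource.
    prevent_destroy = true
  }
}

output "uploads_bucket_arn" {
  value = aws_s3_bucket.customer_uploads.arn
}

# --- stateless/main.tf (separate root module, safe to destroy and recreate) ---
variable "uploads_bucket_arn" {
  type        = string
  description = "ARN of the bucket owned by the stateful root module"
}

resource "aws_iam_policy" "app_uploads" {
  name = "example-app-uploads" # hypothetical policy name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject", "s3:PutObject"]
      Resource = "${var.uploads_bucket_arn}/*"
    }]
  })
}
```

With this layout, a `terraform destroy` in `stateless/` cannot touch the bucket, because the bucket is not in that state file at all.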
        
             | swozey wrote:
             | Can you elaborate on this? I've never heard of this IAC
             | structure and I'm trying to figure out what the
             | benefit/cons are. Maybe it's just Friday and I'm checked
             | out already.
             | 
             | If you run a terraform apply and only update microservices
             | but you also have your dbs/stateful things in the same
             | stack/app, you're only updating the microservices so how
             | would this affect the db/stateful at all?
             | 
             | On the opposite end - I feel like there would be scenarios
             | where I needed to update the stateful AND stateless
             | services with the same terraform apply. Maybe I'm adding a
             | new cluster and adding a db region/replica/securitygroup
             | and that new cluster needs to point at the new db region.
             | 
             | In your scenario I would have updated microservices trying
             | to reach out to a db in a region that doesn't exist yet
             | because I have to terraform apply two different stacks. How
             | would you deal with a depends_on?
             | 
             | Maybe I'm misunderstanding this.
        
               | rcrowley wrote:
               | (Hi, I'm one of the authors of the article at the root of
               | this thread.)
               | 
               | Considering your hypothetical stateless microservice
               | change in the same root module as stateful services,
               | problems arise when _someone else_ has merged changes
               | that concern the stateful services, leaving you little
               | room to apply your changes individually.
               | 
               | It's also worth remembering that, even if a stateless
               | service and a stateful service are managed in the same
               | root module, applying changes is absolutely not atomic.
               | Applying tightly coupled changes to two services "at the
               | same time" is likely to result in brief service
               | interruptions, even if everything returns to normal as
               | soon as the whole changeset is applied.
        
               | swozey wrote:
               | Ok I think we're talking about two separate things here -
               | you're referencing a root module and not a "stack", as in
               | a stack is a full service/application that uses multiple
               | modules to deploy. Your db module, eks module, etc. All
               | independent modules, not combined into one singular
               | module. Say it's sitting in the
               | /terraform/app1/services/db(&)app folders type of
               | scenario.
               | 
               | I _think_ you're talking about putting stateful and
               | stateless objects inside of a single module. So you've
               | got /terraform/modules/mybigapp/main.tf that has your
               | microservice + database inside of it.
               | 
               | If I'm right and that's what you mean that's really
               | interesting I don't think I've ever seen or done that but
               | now I'm curious. I'm pretty sure I've never created an
               | "app1" module with all of its resources.
               | 
               | Am I totally off here?
        
               | [deleted]
        
               | rcrowley wrote:
               | I stuck with my typical term, root module, synonymous
               | with how folks are using "stack" and "state" in various
               | parts of this thread.
               | 
               | A module is any directory with Terraform code in it. A
               | root module is one that has a configuration for how to
               | store and lock Terraform state, providers, and possibly
               | tfvars files. Both modules and root modules may reference
               | other modules. You run `terraform init|plan|apply` in
               | root modules.
               | 
               | I think my comment makes sense in that if you mix two
               | services into the same root module (directly or via any
               | amount of modules-instantiating-other-modules) you can
               | end up with changes from two people affecting both
               | services that you can't easily sever.
               | 
               | Happy to clarify further if I'm still not addressing your
               | original comment.
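
To make the definition above concrete, a minimal sketch of what a root module might contain. It assumes an S3 backend with DynamoDB locking; the bucket, table, state key, and module path are hypothetical.

```hcl
# --- app1/staging/main.tf (hypothetical root module) ---
terraform {
  required_version = ">= 1.0"

  # Where state is stored and how it is locked: the part that makes this
  # directory a root module rather than a plain module.
  backend "s3" {
    bucket         = "example-terraform-state" # hypothetical bucket
    key            = "app1/staging/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-terraform-locks" # hypothetical lock table
    encrypt        = true
  }

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# A root module may reference ordinary (non-root) modules.
module "service" {
  source      = "../../modules/service" # hypothetical module path
  environment = "staging"               # assumed input variable of that module
}
```

`terraform init`, `plan`, and `apply` are run in this directory, not in `modules/service`.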
        
               | swozey wrote:
               | @rcrowley -- I'm going to preface this with I'm a Staff
               | SRE at an adtech corp that does billions and have been a
               | k8s and terraform contributor since 2015 (k8s 1.1 I
               | forget the tf versions). I don't mean this to brag I just
               | want to set my experience expectation since I'm a random
               | name on hn who you'd never know.
               | 
               | I think calling a service/stack (or whatever, app, etc) a
               | "root module" is a very, very confusing thing to do.
               | Terraform has actual micro objects called modules. We
               | work with them every day. I get how you could consider
               | encompassing an entire chunk of Terraform code that calls
               | various modules a "root module".. but I think this is
               | just going to lead to absolute confusion to anyone not
               | familiar with your terminology. I don't know every TF
               | conversation but I can't think of a single time where
               | I've heard root module in that context. Very good chance
               | I've just missed those conversations and am ignorant of
               | them.
               | 
               | I'm currently hiring SRE 2s and 3s so I've been
               | interviewing lots of terraform writers and one of my tech
               | questions is to ask someone what makes them decide to
               | write a terraform module and what type of modules they've
               | written - it's always ALBs, EKS, dbs, etc. components
               | independently that go into creating a service/stack. I've
               | definitely not heard anyone mention that they write "root
               | modules" in the sense of an entire service/stack.
               | 
               | I don't mean you're right or wrong, maybe more people are
               | aware of that verbiage than I am. I just wanted to mention
               | that in my personal case I think it's confusing so I
               | would assume that there are a lot of people in my shoes
               | who would also be confused by it.
        
               | [deleted]
        
             | waffletower wrote:
             | That's right, stacks can be instantiated across repos even
             | depending upon the organization (both meanings of
             | 'organization' are valid here).
        
         | robertlagrant wrote:
         | Makes sense, but how do you connect the two so e.g. credentials
         | from one are surfaced in the other?
        
           | dharmab wrote:
           | Use Data Sources to reference resources in a different state:
           | https://developer.hashicorp.com/terraform/language/data-
           | sour...
        
           | [deleted]
        
           | paulddraper wrote:
           | terraform_remote_state
           | 
           | The root module can have outputs just like any other module.
           | These outputs can be accessed from other stacks from the
           | backend.
           | 
           | And if you use CDKTF the references are handled
           | transparently.
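
A small sketch of the `terraform_remote_state` pattern, applied to the credentials question above. It assumes both root modules use an S3 backend; the bucket, state key, and secret name are hypothetical.

```hcl
# --- stateful/main.tf (producer root module, hypothetical) ---
resource "aws_secretsmanager_secret" "db_credentials" {
  name = "example/app1/db-credentials" # hypothetical secret name
}

# Root-module outputs are stored in the state file, which is what makes
# them readable from other root modules.
output "db_credentials_secret_arn" {
  value = aws_secretsmanager_secret.db_credentials.arn
}

# --- stateless/main.tf (consumer root module, hypothetical) ---
data "terraform_remote_state" "stateful" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state" # hypothetical bucket
    key    = "app1/stateful/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  db_credentials_secret_arn = data.terraform_remote_state.stateful.outputs.db_credentials_secret_arn
}
```

The sketch passes the secret's ARN rather than its value. Anyone who can read remote state can read the whole state snapshot, so keeping secret values out of outputs is the more cautious choice.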
        
       | pezh0re wrote:
       | This is a great read, but I always seem to run into cases where I
       | need to define something like a security group and then reference
       | it when deploying ec2 instances. I'd love to decouple to reduce
       | my plan time, but I haven't figured a way out as of yet.
       | 
       | To be fair, I haven't used terraform -chdir yet.
        
         | c0Re69 wrote:
         | Try Terragrunt https://terragrunt.gruntwork.io/docs/
        
         | JohnMakin wrote:
         | you can pull it in via a data source, but then of course this
         | creates a coupling between multiple modules/state files.
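
A sketch of the data source approach for the security group case. It assumes the group is created in another root module and can be looked up by name; the group name and AMI ID are placeholders.

```hcl
# Looks up a security group that is managed in a different state file.
data "aws_security_group" "app" {
  name = "example-app-sg" # hypothetical group name
}

resource "aws_instance" "app" {
  ami                    = "ami-00000000000000000" # placeholder AMI ID
  instance_type          = "t3.micro"
  vpc_security_group_ids = [data.aws_security_group.app.id]
}
```

The coupling mentioned above shows up at plan time: if the security group has not been created yet, the lookup fails and so does the plan.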
        
       | spicyusername wrote:
       | Everybody in here is recommending Terragrunt, but I'm not sure
       | what value it provides over regular terraform.
       | 
       | After using it for a few months, all of the features found in
       | Terragrunt are in terraform.
        
       | cube2222 wrote:
       | The article recommends to split up your state files for various
       | advantages, but also expands into how to manage it later in a
       | custom way.
       | 
       | I agree with the splitting, but based on many home-grown
       | automation systems I've seen around this I'd really recommend you
       | to use one of the specialized CI/CD systems that are built around
       | automating these kinds of workflows. Once you reach the "many
       | state files" phase, you'll save a lot of engineering time this
       | way.
       | 
       | They'll take care of, among others, running the right state
       | files, in the right order, with the right parameters. But they'll
       | also take care of many other things you need to run Terraform at
       | scale and with big amounts of engineers (happy to expand but
       | don't want to kitchen-sink this comment).
       | 
       | Disclaimer: Take this with a sensible grain of salt, as I work at
       | Spacelift[0] - one of the TACOS (and of course the one I'll
       | shamelessly link and recommend!).
       | 
       | But really, don't use tools like Jenkins for this as you scale,
       | it'll likely hurt you in the long run.
       | 
       | [0]: https://spacelift.io
        
         | swozey wrote:
         | I'm sure that you have no control over this but I really wish
         | Spacelift would increase the cost of its cloud tier and lower
         | the cost of Ent. I'm in the anti-goldilocks zone. Ent seems
         | priced for large teams when I practically fit into the cloud
         | offering sans missing a few required features.
         | 
         | Great product though from what I've experienced.
        
           | sausagefeet wrote:
           | Disclaimer: Co-founder of Terrateam.
           | 
           | For Terrateam[0], we have probably 70% of the enterprise
           | offering but at around 1/10th the price. If there are any
           | features that are deal breaker, feel free to reach out to me
           | and we'll see what we can do. That being said, Spacelift is a
           | much more luxurious piece of software than us. We are very
           | utilitarian, but we have to rationalize that low price-point
           | somehow.
           | 
           | [0] https://terrateam.io
        
           | cube2222 wrote:
           | Sorry to hear that! Pricing is hard.
           | 
           | If you haven't yet, please try talking to our sales team.
           | There's usually a way to make all sides happy with some
           | custom agreements - after all, we'd love for you to be able
           | to use our product as much as you need.
        
           | carty7 wrote:
           | Hi swozey. Spacelift sales leader here. Let's have a
           | conversation and I'll work with you to find the goldilocks
           | zone that you are looking for. Grab a demo with us and
           | mention this post and my name "Ryan". We can dive into the
           | features you require.
        
       | gerl1ng wrote:
       | The solution at the end almost looks like the manual setup of
       | terragrunt which we are using to manage lots of base infra in
       | many different accounts.
       | 
       | What would be interesting here would be to see how they actually
       | reference the outputs from one layer onto the next layer. That is
       | something that is not even solved nicely in terragrunt and one of
       | the major annoyances for me there. Using dependencies and the
       | mock_output option is creating lots of noise in the plan outputs
       | as the dependencies are only completely resolved when terragrunt
       | applies all the modules.
       | 
       | But it seems I also missed a few additions to terraform - so
       | probably there are better ways to take outputs from one terraform
       | run into another one.
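
For context, a minimal sketch of the mock outputs mechanism this comment refers to. The paths, output name, and mock value are hypothetical.

```hcl
# --- app/terragrunt.hcl (hypothetical; depends on ../vpc) ---
terraform {
  source = "../../modules/app" # hypothetical module path
}

dependency "vpc" {
  config_path = "../vpc"

  # Placeholder values used while ../vpc has not been applied yet.
  # They appear in plan output, which is the noise described above.
  mock_outputs = {
    vpc_id = "vpc-00000000" # fake ID, only ever seen in plans
  }

  # Restrict the mocks to commands that do not change real infrastructure.
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  vpc_id = dependency.vpc.outputs.vpc_id
}
```

Once `../vpc` has been applied, the real output replaces the mock and the plan reflects actual values.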
        
       | waffletower wrote:
       | While combining the word "best" with "Terraform" in a sentence is
       | more than likely to result in an oxymoron, it is counterproductive
       | not to attempt to organize and utilize terraform as
       | elegantly and DRY as possible. We interact with stacks (which we
       | call projects typically) via Terragrunt and have a very large
       | surface of modules as we do have a fair amount of infrastructure
       | pieces. But we also try to expose Terraform infrastructure
       | changes by use of Atlantis; though bulky, github does provide a
       | reasonable means to dialogue and manage changes made by multiple
       | teams. The use of modules also helps us encapsulate
       | infrastructure, and state problems are rare with these
       | approaches, but the data sprawl inherent to Terraform is very
       | unwieldy regardless of so called "best" practices. The language
       | features are weak, awkward and directly encourage repetition and
       | specification bloat. We have had some success via Data Sources to
       | export logic outside of Terraform and provide much needed sanity
       | when interacting with very verbose infrastructure such as Lake
       | Formation.
        
       | mdhb wrote:
       | Sorry if this comes across weird or snotty it's not supposed to.
       | 
       | But I'm coming at this from a GCP lens and got half way through
       | the article about how the recommended unit of isolation in the
       | AWS environment is entirely different AWS accounts and I'm kind
       | of hung up on that. Is that really a thing people tend to do
       | often? Doesn't it get super unwieldy? How does billing work? What
       | about identity? I have so many questions.
       | 
       | EDIT: Despite the fact that the root resource in both GCP and AWS
       | is an organization, when I heard "account" I mistook that to be
       | AWS terminology for an organization.
        
         | swozey wrote:
         | The way this works with AWS is similar to you making a GCP
         | project.
         | 
         | At the top level you have an organization account, which is
         | where billing occurs.
         | 
         | From this org account you create accounts for the following
         | (typically):
         | 
         | 1. Security - AKA the account your USERS are in
         | 2. Ops - The account your monitoring, etc are in
         | 
         | From here where a lot of people seem to deviate (I've been
         | interviewing level 2-3 SREs for the last 3 weeks and have heard
         | all about different AWS structures that I don't like) is how to
         | break up your applications into their own accounts for a low
         | blast radius.
         | 
         | What I DO, and is well known as being the best practice, is to
         | create an AWS account for each environment of each application.
         | 
         | App1-sandbox
         | App1-staging
         | App1-production
         | 
         | Then your terraform is also structured by
         | application/environment/service. Each environment and
         | application has its own state in s3 and dynamodb.
         | 
         | And so on.
         | 
         | Is this unwieldy? I have 40-50 AWS accounts and no it's not
         | unwieldy at all IMO. Cross account IAM and trust relationships
         | are set up very early on and they don't need to be modified
         | much if any at all until you create another AWS account.
         | Creating a new AWS account is kind of annoying, though. I need
         | to automate that process better.
         | 
         | https://aws.amazon.com/organizations/getting-started/best-pr...
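
A sketch of how one root module per application environment might target its own AWS account through a cross-account role. The role name, variable name, and directory layout are assumptions, not details from the comment above.

```hcl
# --- app1/staging/providers.tf (hypothetical layout: app/environment/service) ---
variable "account_id" {
  type        = string
  description = "ID of the AWS account dedicated to this app environment"
}

provider "aws" {
  region = "us-east-1"

  # Assumes a role that the account running Terraform is trusted to assume.
  # This is the cross-account trust relationship that is set up once, early on.
  assume_role {
    role_arn = "arn:aws:iam::${var.account_id}:role/example-terraform" # hypothetical role
  }
}
```

Each environment directory would supply its own `account_id`, for example through a `terraform.tfvars` file, so the same code applies to sandbox, staging, and production accounts.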
        
           | mdhb wrote:
           | Cool, that was a genuinely fascinating window into AWS for
           | me. Thank you for sharing
        
             | swozey wrote:
             | FWIW I loathe AWS IAM and miss GCPs organization.
        
               | wpietri wrote:
               | Yeah, I started with AWS and then spent a year on GCP and
               | next time I'd much rather do GCP. It felt much more
               | manageable and supportive to me.
        
               | mdhb wrote:
               | I'm quite into learning a lot of cloud native security
               | stuff and I have to say my first impression was that it
               | seemed so much harder to think about creating a secure
               | environment using AWS IAM. I couldn't tell if it was just
               | a case of familiarity or not.
        
               | swozey wrote:
               | I'm sure it's because of its age and them kind of
               | creating their version of IAM from scratch (someone
               | correct me if they copied this structure from elsewhere)
               | but you have to do a lot of goofy obtuse work with IAM
               | automation. There are times I have to go into the
               | console/cli and grab some sort of specific UID for an
               | object instead of using its name, things like that that
               | just make it annoying. Sometimes you can't use an account
               | name and have to use the org ID... I could go on. You
               | just kind of deal with it.
               | 
               | I haven't worked on GCP since maybe 2016-17 so I'm not
               | sure how it's going over there anymore.
        
               | mdhb wrote:
               | It really does sound like an entirely different level of
               | complexity.
               | 
               | GCP native API is basically the same thing as knative in
               | most ways. Just a bunch of various services and resources
               | that you all call and authenticate and even often
               | provision the same way.
               | 
               | As an example of that since we are talking about
               | infrastructure management I would say at its "smoothest"
               | level of integration there is a service you can use (or
               | host it yourself on Kubernetes if that's your thing for
               | some reason) where like any other Kubernetes resource I
               | would just "declare" what I wanted.
               | 
               | So now I'm not messing around with complicated Terraform
               | logic at all (Google got really good with automation, I
               | don't think there is anything close to an equivalent for
               | this is there?). I just declare say a BigQuery resource
               | or a Project (AWS Account equivalent) resource and the
               | service will do all the hard work of making sure that's
               | the state my account is in at any given point.
               | 
               | I can also stick policy controls around it like I would
               | with K8s so only certain people can create certain
               | resources under certain conditions.
               | 
               | It's really easy to just stick that into a git repo and
               | still do all of the IAC stuff mentioned in this article
               | but it's also easy to do the cross environment stuff and
               | manage the roll out between each of them.
               | 
               | Overall, it's very predictable, the IAM is really
               | intuitive but also incredibly granular so it's very easy
               | to model things on top of and to feel fairly confident
               | that I'm not accidentally doing something stupid so I
               | really like it from that point of view.
               | 
               | My number one bit of advice for GCP is see how easily you
               | can architect your way into using Cloud Run as much as
               | possible unless you have some really wild use case. You
               | can get to a really sophisticated set up with only a tiny
               | team. Followed by read Google's API guidelines (aip.dev)
               | to understand how to build things in a way where you're
               | going to continue having a good time.
        
         | dragonwriter wrote:
         | > But I'm coming at this from a GCP lens and got half way
         | through the article about how the recommended unit of isolation
         | in the AWS environment is entirely different AWS accounts and
         | I'm kind of hung up on that. Is that really a thing people tend
         | to do often? Doesn't it get super unwieldy?
         | 
         | There are AWS systems above the account level for managing it
         | (Organizations), so it's not quite as bad as it might naively
         | seem, but, yes, it's more unwieldy than GCP's projects.
        
           | mdhb wrote:
           | Oh thank god, that's much better than I naively thought.
           | Thanks for the heads up.
        
         | stock_toaster wrote:
         | You can have sub-accounts that roll up billing to a main
         | account. Still messy, but probably cleaner (security policy
         | wise) and possibly safer (fewer production impacting accidental
         | config changes?) than having a single giant account with _lots_
         | of things mixed together.
        
         | cube2222 wrote:
         | It's extremely common and recommended.
         | 
         | Billing works by having a billing aws account that all other
         | accounts are in a sense "children" of.
        
         | lgreiv wrote:
         | You should be able to organize accounts hierarchically using
         | AWS Organizations, which allows you to have cost centers and
         | centralized billing (and some imposed policies over all
         | accounts).
        
       | badblock wrote:
       | Some of this seems like old advice, instead of having directories
       | per environment you should be using workspaces to keep your
       | environments consistent so you don't forget to add your new
       | service to prod.
        
         | rcrowley wrote:
         | (Hi, I'm one of the authors of the article at the root of this
         | thread.)
         | 
         | I've gone back and forth on workspaces versus more root
         | modules. On balance, I like having more root modules because I
         | can orient myself just by my working directory instead of both
         | my working directory and workspace. Plus, I feel better about
         | stuffing more dimensions of separation into a directory tree
         | than into workspace names. YMMV.
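
For comparison, a minimal sketch of the workspace approach from this subthread, where one directory serves every environment and the selected workspace picks the environment. The queue name is a hypothetical example.

```hcl
locals {
  # terraform.workspace is "default" unless another workspace is selected.
  environment = terraform.workspace
}

resource "aws_sqs_queue" "jobs" {
  name = "example-jobs-${local.environment}" # hypothetical naming scheme
}
```

After `terraform workspace new staging`, an apply in the same directory writes to a separate state and creates `example-jobs-staging`. The trade-off described above is that the directory alone no longer tells you which environment you are about to change.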
        
       | time0ut wrote:
       | I've been using Terragrunt [0] for the past three years to manage
       | loosely coupled stacks of Terraform configurations. It allows you
       | to compose separate configurations almost as easily as you
       | compose modules within a configuration. It's got its own learning
       | curve, but it's a solid tool to have in the tool box.
       | 
       | Gruntwork is a really cool company that makes other tools in this
       | space like Terratest [1]. Every module I write comes with
       | Terratest powered integration tests. Nothing more satisfying than
       | pushing a change, watching the pipeline run the test, and then
       | automatically release a new version that I know works (or at
       | least what I tested works).
       | 
       | [0] https://terragrunt.gruntwork.io/
       | 
       | [1] https://terratest.gruntwork.io/
        
       ___________________________________________________________________
       (page generated 2023-08-04 23:00 UTC)