[HN Gopher] Terraform best practices for reliability at any scale
___________________________________________________________________
Terraform best practices for reliability at any scale
Author : holoway
Score : 79 points
Date : 2023-08-04 19:22 UTC (3 hours ago)
(HTM) web link (substrate.tools)
(TXT) w3m dump (substrate.tools)
| Terretta wrote:
| This should be mandatory reading for anyone doing IaC, using TF
| and AWS or not, less for how you do it, more for what and why.
|
| _/ / shout out to AWS CAB alums_
| thunfisch wrote:
| We're using Terragrunt with hundreds of AWS accounts and
| thousands of Terraform deployments/states.
|
| I'll never want to do this without Terragrunt again. The
| suggested method of referencing remote states, and writing out
| the backends will fall apart instantly at that scale. It's just
| way too brittle and unwieldy.
|
| Terragrunt with some good defaults that will be included, and
| separated states for modules (which makes partial applies a
| breeze) as well as autogenerated backend configs (let Terragrunt
| inject it for you, with templated values) is the way to go.
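|
| A minimal sketch of what "let Terragrunt inject it for you" can
| look like in a root terragrunt.hcl. The bucket, lock table, and
| region are placeholders for illustration, not values from this
| comment.

```hcl
# Root terragrunt.hcl (hypothetical layout): every child configuration that
# includes this file gets a generated backend.tf with its own state key.
remote_state {
  backend = "s3"

  # Terragrunt writes backend.tf into each module's working directory.
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket         = "example-terraform-state" # placeholder bucket name
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"               # placeholder region
    encrypt        = true
    dynamodb_table = "example-terraform-locks" # placeholder lock table
  }
}
```

| Because the key is derived from each child's relative path, every
| module ends up with a separate state without a hand-written
| backend block.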
| ckdarby wrote:
| Have you spent any time with Pulumi?
|
| I've kind of found terraform is dying and encourages a lot of
| bad practices, but everyone goes along with them because of HCL
| and because it is transferable, as most companies are just using
| TF.
| DelightOne wrote:
| Do you need to chain multiple Terragrunt executions to first
| bring the Kubernetes cluster up and then the containers, or
| does Terragrunt fix that?
| miduil wrote:
| Yes, with terragrunt you can do a `terragrunt run-all apply`
| and based on `output` to `variable` in each module data can
| be passed from one state/module to the next one, terragrunt
| knows how to run them in the right order so you can bootstrap
| your EKS cluster by having one module which bootstraps the
| account, then another one which bootstraps EKS, then one that
| configures the cluster, installs your "base pods" and then
| later everything else.
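|
| As a rough sketch of that `output` to `variable` wiring, a child
| terragrunt.hcl might look like the following. The directory
| names and the `cluster_name` output are assumptions made up for
| the example.

```hcl
# cluster-config/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

# Declares that this module depends on the sibling "eks" module. With
# `terragrunt run-all apply`, the eks module is applied first.
dependency "eks" {
  config_path = "../eks"

  # Placeholder values used while the eks module has not been applied yet.
  mock_outputs = {
    cluster_name = "mock-cluster"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

# Outputs of the eks module become input variables of this module.
inputs = {
  cluster_name = dependency.eks.outputs.cluster_name
}
```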
| swozey wrote:
| This was a good read but really if you already follow the common
| best practices of IAC/terraform/aws multi-account I don't think
| you're going to learn much.
|
| The comments in here kind of made me think I was going to hop in
| and take away some huge wins I hadn't considered. But I have been
| working with Terraform and AWS for a very long time.
|
| If you're unfamiliar with AWS multi-account best practices this
| is a good read.
|
| https://aws.amazon.com/organizations/getting-started/best-pr...
| xyzzy123 wrote:
| Here's my #1 tip, most important:
|
| Try to keep your stateful resources / services in different
| "stacks" than your stateless things.
|
| Absolutely 100% completely obvious, maybe too obvious? Because
| none of these guides ever mention it.
|
| If you have state in a stack it becomes 10x more expensive and
| difficult to "replace your way out of trouble" (aka destroy and
| recreate as last resort). You want as much as possible in
| stateless, disposable stacks. DON'T put that customer exposed
| bucket or DB in the same state/stack as app server resources.
|
| I don't care about your folder structure, I care about what % of
| the infra I can reliably burn to the ground and replace using
| pipelines and no manual actions.
| sverhagen wrote:
| Is a "stack" here a (root) folder on which you'd do a
| "terraform apply"? I've never known what to call those, surely
| they aren't "modules".
|
| And, so, you're saying: try to have a separate deployment
| (stack then?) that contains the state, so you can wipe away
| everything else if you want to, without having to manage the
| state?
| xyzzy123 wrote:
| It's not exactly about the folder, the IaC from a single
| folder / project can be instantiated in multiple places. Each
| time you do that, it has a unique state file, so I usually
| hear it referred to as a "state". In cfn you can similarly
| deploy the same thing lots of times and each instantiation is
| called a "stack", so stack/state tend to get used
| interchangeably.
|
| And yes, that's a succinct rephrasing.
|
| When you first use iac it maybe seems logical to put your db
| and app server in the same "thing" (stack or state file) but
| now that thing is "pet like" and you have to take care of it
| forever. You can't safely have a "destroy" action in your
| pipeline as a last resort.
|
| If you put the stateful stuff in a separate stack you can
| freely modify the things in the stateless one with much less
| worry.
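|
| A sketch of what the separate stateful stack could look like,
| assuming an S3 backend. All names are placeholders, and the
| `prevent_destroy` guard is an optional extra that this comment
| does not itself mention.

```hcl
# stateful/main.tf (hypothetical root module holding only long-lived data)
terraform {
  backend "s3" {
    bucket = "example-terraform-state"         # placeholder bucket name
    key    = "app1/stateful/terraform.tfstate" # state separate from the app servers
    region = "us-east-1"
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "customer_data" {
  bucket = "example-app1-customer-data" # placeholder bucket name

  # Optional guard: Terraform refuses any plan that would destroy this bucket.
  lifecycle {
    prevent_destroy = true
  }
}

# Exposed so that a stateless stack can look the bucket up without owning it.
output "customer_data_bucket" {
  value = aws_s3_bucket.customer_data.bucket
}
```

| The app servers would then live in their own root module with a
| different state key, which is the part a pipeline can destroy and
| recreate.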
| swozey wrote:
| Can you elaborate on this? I've never heard of this IAC
| structure and I'm trying to figure out what the
| benefit/cons are. Maybe it's just Friday and I'm checked
| out already.
|
| If you run a terraform apply and only update microservices
| but you also have your dbs/stateful things in the same
| stack/app, you're only updating the microservices so how
| would this affect the db/stateful at all?
|
| On the opposite end - I feel like there would be scenarios
| where I needed to update the stateful AND stateless
| services with the same terraform apply. Maybe I'm adding a
| new cluster and adding a db region/replica/securitygroup
| and that new cluster needs to point at the new db region.
|
| In your scenario I would have updated microservices trying
| to reach out to a db in a region that doesn't exist yet
| because I have to terraform apply two different stacks. How
| would you deal with a depends_on?
|
| Maybe I'm misunderstanding this.
| rcrowley wrote:
| (Hi, I'm one of the authors of the article at the root of
| this thread.)
|
| Considering your hypothetical stateless microservice
| change in the same root module as stateful services,
| problems arise when _someone else_ has merged changes
| that concern the stateful services, leaving you little
| room to apply your changes individually.
|
| It's also worth remembering that, even if a stateless
| service and a stateful service are managed in the same
| root module, applying changes is absolutely not atomic.
| Applying tightly coupled changes to two services "at the
| same time" is likely to result in brief service
| interruptions, even if everything returns to normal as
| soon as the whole changeset is applied.
| swozey wrote:
| Ok I think we're talking about two separate things here -
| you're referencing a root module and not a "stack", as in
| a stack is a full service/application that uses multiple
| modules to deploy. Your db module, eks module, etc. All
| independent modules, not combined into one singular
| module. Say it's sitting in the
| /terraform/app1/services/db(&)app folders type of
| scenario.
|
| I _think_ you're talking about putting stateful and
| stateless objects inside of a single module. So you've
| got /terraform/modules/mybigapp/main.tf that has your
| microservice + database inside of it.
|
| If I'm right and that's what you mean that's really
| interesting I don't think I've ever seen or done that but
| now I'm curious. I'm pretty sure I've never created an
| "app1" module with all of its resources.
|
| Am I totally off here?
| [deleted]
| rcrowley wrote:
| I stuck with my typical term, root module, synonymous
| with how folks are using "stack" and "state" in various
| parts of this thread.
|
| A module is any directory with Terraform code in it. A
| root module is one that has a configuration for how to
| store and lock Terraform state, providers, and possibly
| tfvars files. Both modules and root modules may reference
| other modules. You run `terraform init|plan|apply` in
| root modules.
|
| I think my comment makes sense in that if you mix two
| services into the same root module (directly or via any
| amount of modules-instantiating-other-modules) you can
| end up with changes from two people affecting both
| services that you can't easily sever.
|
| Happy to clarify further if I'm still not addressing your
| original comment.
| swozey wrote:
| @rcrowley -- I'm going to preface this with I'm a Staff
| SRE at an adtech corp that does billions and have been a
| k8s and terraform contributor since 2015 (k8s 1.1 I
| forget the tf versions). I don't mean this to brag I just
| want to set my experience expectation since I'm a random
| name on hn who you'd never know.
|
| I think calling a service/stack (or whatever, app, etc) a
| "root module" is a very, very confusing thing to do.
| Terraform has actual micro objects called modules. We
| work with them every day. I get how you could consider
| encompassing an entire chunk of terraform code that calls
| various modules a "root module".. but I think this is
| just going to lead to absolute confusion to anyone not
| familiar with your terminology. I don't know every TF
| conversation but I can't think of a single time where
| I've heard root module in that context. Very good chance
| I've just missed those conversations and am ignorant of
| them.
|
| I'm currently hiring SRE 2s and 3s so I've been
| interviewing lots of terraform writers and one of my tech
| questions is to ask someone what makes them decide to
| write a terraform module and what type of modules they've
| written - it's always ALBs, EKS, dbs, etc. components
| independently that go into creating a service/stack. I've
| definitely not heard anyone mention that they write "root
| modules" in the sense of an entire service/stack.
|
| I don't mean you're right or wrong, maybe more people are
| aware of that verbiage than I am. I just wanted to mention
| that in my personal case I think it's confusing so I
| would assume that there are a lot of people in my shoes
| who would also be confused by it.
| [deleted]
| waffletower wrote:
| That's right, stacks can be instantiated across repos even
| depending upon the organization (both meanings of
| 'organization' are valid here).
| robertlagrant wrote:
| Makes sense, but how do you connect the two so e.g. credentials
| from one are surfaced in the other?
| dharmab wrote:
| Use Data Sources to reference resources in a different state:
| https://developer.hashicorp.com/terraform/language/data-
| sour...
| [deleted]
| paulddraper wrote:
| terraform_remote_state
|
| The root module can have outputs just like any other module.
| These outputs can be accessed from other stacks from the
| backend.
|
| And if you use CDKTF the references are handled
| transparently.
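|
| A minimal sketch of reading another stack's outputs with
| `terraform_remote_state`. The bucket, key, and the
| `customer_data_bucket` output name are assumptions for
| illustration; the other root module has to declare that output.

```hcl
# In the consuming (stateless) root module.
data "terraform_remote_state" "stateful" {
  backend = "s3"

  # Must match the backend settings of the root module being read.
  config = {
    bucket = "example-terraform-state"         # placeholder bucket name
    key    = "app1/stateful/terraform.tfstate" # placeholder state key
    region = "us-east-1"
  }
}

locals {
  # Only root-level outputs of the other state are visible here.
  customer_data_bucket = data.terraform_remote_state.stateful.outputs.customer_data_bucket
}

output "customer_data_bucket" {
  value = local.customer_data_bucket
}
```

| Note that reading remote state requires read access to the whole
| state object, which is one reason some teams prefer provider data
| sources for this.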
| pezh0re wrote:
| This is a great read, but I always seem to run into cases where I
| need to define something like a security group and then reference
| it when deploying ec2 instances. I'd love to decouple to reduce
| my plan time, but I haven't figured a way out as of yet.
|
| To be fair, I haven't used terraform -chdir yet.
| c0Re69 wrote:
| Try Terragrunt https://terragrunt.gruntwork.io/docs/
| JohnMakin wrote:
| you can pull it in via a data source, but then of course this
| creates a coupling between multiple modules/state files.
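|
| A sketch of the data source approach for the security group
| case, assuming the group is created in a different state and
| carries a known Name tag. The tag value and the `ami_id` variable
| are hypothetical.

```hcl
variable "ami_id" {
  type        = string
  description = "AMI to launch (supplied by the caller)."
}

# Looks up a security group that is managed in another state file.
data "aws_security_group" "app" {
  tags = {
    Name = "example-app-sg" # placeholder tag value
  }
}

resource "aws_instance" "app" {
  ami                    = var.ami_id
  instance_type          = "t3.micro"
  vpc_security_group_ids = [data.aws_security_group.app.id]
}
```

| The coupling is now on the tag rather than on the other state
| file, so renaming the tag in one place breaks the lookup in the
| other.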
| spicyusername wrote:
| Everybody in here is recommending Terragrunt, but I'm not sure
| what value it provides over regular terraform.
|
| After using it for a few months, it seems to me that all of the
| features found in Terragrunt are in terraform.
| cube2222 wrote:
| The article recommends splitting up your state files for various
| advantages, but also expands into how to manage it later in a
| custom way.
|
| I agree with the splitting, but based on many home-grown
| automation systems I've seen around this I'd really recommend
| that you use one of the specialized CI/CD systems that are built
| around automating these kinds of workflows. Once you reach the
| "many state files" phase, you'll save a lot of engineering time
| this way.
|
| They'll take care of, among others, running the right state
| files, in the right order, with the right parameters. But they'll
| also take care of many other things you need to run Terraform at
| scale and with large numbers of engineers (happy to expand but
| don't want to kitchen-sink this comment).
|
| Disclaimer: Take this with a sensible grain of salt, as I work at
| Spacelift[0] - one of the TACOS (and of course the one I'll
| shamelessly link and recommend!).
|
| But really, don't use tools like Jenkins for this as you scale,
| it'll likely hurt you in the long run.
|
| [0]: https://spacelift.io
| swozey wrote:
| I'm sure that you have no control over this but I really wish
| Spacelift would increase the cost of its cloud tier and lower
| the cost of Ent. I'm in the anti-goldilocks zone. Ent seems
| priced for large teams when I practically fit into the cloud
| offering sans missing a few required features.
|
| Great product though from what I've experienced.
| sausagefeet wrote:
| Disclaimer: Co-founder of Terrateam.
|
| For Terrateam[0], we have probably 70% of the enterprise
| offering but at around 1/10th the price. If there are any
| features that are deal breakers, feel free to reach out to me
| and we'll see what we can do. That being said, Spacelift is a
| much more luxurious piece of software than us. We are very
| utilitarian, but we have to rationalize that low price-point
| somehow.
|
| [0] https://terrateam.io
| cube2222 wrote:
| Sorry to hear that! Pricing is hard.
|
| If you haven't yet, please try talking to our sales team.
| There's usually a way to make all sides happy with some
| custom agreements - after all, we'd love for you to be able
| to use our product as much as you need.
| carty7 wrote:
| Hi swozey. Spacelift sales leader here. Let's have a
| conversation and I'll work with you to find the goldilocks
| zone that you are looking for. Grab a demo with us and
| mention this post and my name "Ryan". We can dive into the
| features you require.
| gerl1ng wrote:
| The solution at the end almost looks like the manual setup of
| terragrunt which we are using to manage lots of base infra in
| many different accounts.
|
| What would be interesting here would be to see how they actually
| reference the outputs from one layer onto the next layer. That is
| something that is not even solved nicely in terragrunt and one of
| the major annoyances for me there. Using dependencies and the
| mock_outputs option is creating lots of noise in the plan outputs
| as the dependencies are only completely resolved when Terragrunt
| applies all the modules.
|
| But it seems I also missed a few additions to terraform - so
| probably there are better ways to take outputs from one terraform
| run into another one.
| waffletower wrote:
| While combining the word "best" with "Terraform" in a sentence is
| more than likely to result in an oxymoron, it is counter-
| productive not to attempt to organize and utilize terraform as
| elegantly and DRY as possible. We interact with stacks (which we
| call projects typically) via Terragrunt and have a very large
| surface of modules as we do have a fair amount of infrastructure
| pieces. But we also try to expose Terraform infrastructure
| changes by use of Atlantis; though bulky, github does provide a
| reasonable means to dialogue and manage changes made by multiple
| teams. The use of modules also helps us encapsulate
| infrastructure, and state problems are rare with these
| approaches, but the data sprawl inherent to Terraform is very
| unwieldy regardless of so called "best" practices. The language
| features are weak, awkward and directly encourage repetition and
| specification bloat. We have had some success via Data Sources to
| export logic outside of Terraform and provide much needed sanity
| when interacting with very verbose infrastructure such as Lake
| Formation.
| mdhb wrote:
| Sorry if this comes across as weird or snotty; it's not supposed to.
|
| But I'm coming at this from a GCP lens and got half way through
| the article about how the recommended unit of isolation in the
| AWS environment is entirely different AWS accounts and I'm kind
| of hung up on that. Is that really a thing people tend to do
| often? Doesn't it get super unwieldy? How does billing work? What
| about identity? I have so many questions.
|
| EDIT: Despite the fact that the root resource in both GCP and AWS
| is an organization, when I heard "account" I mistook that to be
| AWS terminology for an organization.
| swozey wrote:
| The way this works with AWS is similar to you making a GCP
| project.
|
| At the top level you have an organization account, which is
| where billing occurs.
|
| From this org account you create accounts for the following
| (typically):
|
| 1. Security - AKA the account your USERS are in
| 2. Ops - The account your monitoring, etc. are in
|
| From here where a lot of people seem to deviate (I've been
| interviewing level 2-3 SREs for the last 3 weeks and have heard
| all about different AWS structures that I don't like) is how to
| break up your applications into their own accounts for a low
| blast radius.
|
| What I DO, and is well known as being the best practice, is to
| create an AWS account for each environment of each application.
|
| App1-sandbox
| App1-staging
| App1-production
|
| Then your terraform is also structured by
| application/environment/service. Each environment and
| application has its own state in s3 and dynamodb.
|
| And so on.
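|
| One way to sketch that layout is a backend block whose state key
| mirrors the application/environment/service path. The bucket,
| table, and service names here are placeholders, not the
| commenter's actual configuration.

```hcl
# app1/staging/service-a/backend.tf (hypothetical path)
terraform {
  backend "s3" {
    bucket         = "example-app1-staging-terraform-state" # one bucket per account/environment
    key            = "app1/staging/service-a/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-terraform-locks"              # DynamoDB table used for state locking
    encrypt        = true
  }
}
```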
|
| Is this unwieldy? I have 40-50 AWS accounts and no it's not
| unwieldy at all IMO. Cross account IAM and trust relationships
| are set up very early on and they don't need to be modified
| much if any at all until you create another AWS account.
| Creating a new AWS account is kind of annoying, though. I need
| to automate that process better.
|
| https://aws.amazon.com/organizations/getting-started/best-pr...
| mdhb wrote:
| Cool, that was a genuinely fascinating window into AWS for
| me. Thank you for sharing
| swozey wrote:
| FWIW I loathe AWS IAM and miss GCPs organization.
| wpietri wrote:
| Yeah, I started with AWS and then spent a year on GCP and
| next time I'd much rather do GCP. It felt much more
| manageable and supportive to me.
| mdhb wrote:
| I'm quite into learning a lot of cloud native security
| stuff and I have to say my first impression was that it
| seemed so much harder to think about creating a secure
| environment using AWS IAM. I couldn't tell if it was just
| a case of familiarity or not.
| swozey wrote:
| I'm sure it's because of its age and them kind of
| creating their version of IAM from scratch (someone
| correct me if they copied this structure from elsewhere)
| but you have to do a lot of goofy obtuse work with IAM
| automation. There are times I have to go into the
| console/cli and grab some sort of specific UID for an
| object instead of using its name, things like that that
| just make it annoying. Sometimes you can't use an account
| name and have to use the org ID... I could go on. You
| just kind of deal with it.
|
| I haven't worked on GCP since maybe 2016-17 so I'm not
| sure how it's going over there anymore.
| mdhb wrote:
| It really does sound like an entirely different level of
| complexity.
|
| GCP native API is basically the same thing as knative in
| most ways. Just a bunch of various services and resources
| that you call, authenticate, and often even provision the
| same way.
|
| As an example of that since we are talking about
| infrastructure management I would say at its "smoothest"
| level of integration there is a service you can use (or
| host it yourself on Kubernetes if that's your thing for
| some reason) where like any other Kubernetes resource I
| would just "declare" what I wanted.
|
| So now I'm not messing around with complicated Terraform
| logic at all (Google got really good with automation, I
| don't think there is anything close to an equivalent for
| this is there?). I just declare say a BigQuery resource
| or a Project (AWS Account equivalent) resource and the
| service will do all the hard work of making sure that's
| the state my account is in at any given point.
|
| I can also stick policy controls around it like I would
| with K8s so only certain people can create certain
| resources under certain conditions.
|
| It's really easy to just stick that into a git repo and
| still do all of the IAC stuff mentioned in this article
| but it's also easy to do the cross environment stuff and
| manage the roll out between each of them.
|
| Overall, it's very predictable, the IAM is really
| intuitive but also incredibly granular so it's very easy
| to model things on top of and to feel fairly confident
| that I'm not accidentally doing something stupid so I
| really like it from that point of view.
|
| My number one bit of advice for GCP is see how easily you
| can architect your way into using Cloud Run as much as
| possible unless you have some really wild use case. You
| can get to a really sophisticated set up with only a tiny
| team. After that, read Google's API guidelines (aip.dev)
| to understand how to build things in a way where you're
| going to continue having a good time.
| dragonwriter wrote:
| > But I'm coming at this from a GCP lens and got half way
| through the article about how the recommended unit of isolation
| in the AWS environment is entirely different AWS accounts and
| I'm kind of hung up on that. Is that really a thing people tend
| to do often? Doesn't it get super unwieldy?
|
| There are AWS systems above the account level for managing it
| (Organizations), so it's not quite as bad as it might naively
| seem, but, yes, it's more unwieldy than GCP's projects.
| mdhb wrote:
| Oh thank god, that's much better than I naively thought.
| Thanks for the heads up.
| stock_toaster wrote:
| You can have sub-accounts that roll up billing to a main
| account. Still messy, but probably cleaner (security policy
| wise) and possibly safer (fewer production impacting accidental
| config changes?) than having a single giant account with _lots_
| of things mixed together.
| cube2222 wrote:
| It's extremely common and recommended.
|
| Billing works by having a billing aws account that all other
| accounts are in a sense "children" of.
| lgreiv wrote:
| You should be able to organize accounts hierarchically using
| AWS Organizations, which allows you to have cost centers and
| centralized billing (and some imposed policies over all
| accounts).
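|
| For illustration, member accounts can also be created from
| Terraform itself when it runs with credentials for the
| organization's management account. The names, the email
| address, and the `organization_root_id` variable are
| hypothetical.

```hcl
variable "organization_root_id" {
  type        = string
  description = "ID of the organization root that the OU is created under."
}

# Groups all accounts belonging to one application.
resource "aws_organizations_organizational_unit" "app1" {
  name      = "app1"
  parent_id = var.organization_root_id
}

# One member account per environment; billing rolls up to the organization.
resource "aws_organizations_account" "app1_staging" {
  name      = "app1-staging"
  email     = "aws+app1-staging@example.com" # must be unique across AWS accounts
  parent_id = aws_organizations_organizational_unit.app1.id
}
```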
| badblock wrote:
| Some of this seems like old advice. Instead of having directories
| per environment, you should be using workspaces to keep your
| environments consistent so you don't forget to add your new
| service to prod.
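|
| A small sketch of the workspace approach, assuming workspaces
| named "staging" and "production" (created with `terraform
| workspace new`). The instance sizes and counts are invented for
| the example.

```hcl
variable "ami_id" {
  type        = string
  description = "AMI to launch (supplied by the caller)."
}

locals {
  # The currently selected workspace doubles as the environment name.
  environment = terraform.workspace

  instance_count_by_env = {
    staging    = 1
    production = 3
  }

  # Any other workspace (including "default") falls back to one instance.
  instance_count = lookup(local.instance_count_by_env, local.environment, 1)
}

resource "aws_instance" "app" {
  count         = local.instance_count
  ami           = var.ami_id
  instance_type = "t3.micro"

  tags = {
    Name = "app1-${local.environment}-${count.index}"
  }
}
```

| The same code then applies to every environment after `terraform
| workspace select production`, so a new service cannot be added to
| staging and forgotten in prod.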
| rcrowley wrote:
| (Hi, I'm one of the authors of the article at the root of this
| thread.)
|
| I've gone back and forth on workspaces versus more root
| modules. On balance, I like having more root modules because I
| can orient myself just by my working directory instead of both
| my working directory and workspace. Plus, I feel better about
| stuffing more dimensions of separation into a directory tree
| than into workspace names. YMMV.
| time0ut wrote:
| I've been using Terragrunt [0] for the past three years to manage
| loosely coupled stacks of Terraform configurations. It allows you
| to compose separate configurations almost as easily as you
| compose modules within a configuration. It's got its own learning
| curve, but it's a solid tool to have in the tool box.
|
| Gruntwork is a really cool company that makes other tools in this
| space like Terratest [1]. Every module I write comes with
| Terratest powered integration tests. Nothing more satisfying than
| pushing a change, watching the pipeline run the test, and then
| automatically release a new version that I know works (or at
| least what I tested works).
|
| [0] https://terragrunt.gruntwork.io/
|
| [1] https://terratest.gruntwork.io/
___________________________________________________________________
(page generated 2023-08-04 23:00 UTC)