I recently had the joy of stepping into a role where I get to start almost from scratch in terms of Infrastructure-as-Code and CI/CD. Much of our infrastructure was set up manually, but the services are mature. This puts me in the unusual position of already knowing which dependencies and intricacies exist in the services that need to be recreated in a fully automated fashion using IaC.

This is both a challenge and a pleasure. Having never had total control over a solution at this scale before, I did a great deal of testing and research to settle on an approach.

I am used to creating solutions that deploy to multiple cloud accounts such as Dev, QA, and Production, managing the config and deployments using Terraform workspaces and tfvars files. For our setup, however, that approach would need more overhead and duplication than I would like before it ran smoothly in CI/CD, for the reasons explained below.

Though I have not worked with Terragrunt before, it has proven to be a brilliant solution for us.

There are a few reasons why using Terragrunt matched our use-case:

  • Separate AWS Account per Environment (Dev, QA, Prod, etc.)
    This factor alone is hardly uncommon, and Terraform Workspaces and tfvars files would be able to handle this. However, with the next points in mind…
  • Duplicate Environments within some AWS Accounts
    For example, in our Dev AWS Account we maintain more than one full set of our infrastructure to allow for segregated testing. It is also useful to be able to spin up an entirely new set of services in one of our accounts on demand for specific testing.
  • Multi-region requirements
    Due to factors such as GDPR we require the ability to deploy duplicates of some, but not all, of our services in EU regions as well as the US. This requirement is only strictly necessary in Production where real client data is stored.

We are certainly not the only organisation that needs a mixture of environments and regions like this, but I struggled to find fully fleshed-out end-to-end examples of this kind of setup elsewhere. After some trial and error, I managed to get a proof of concept working.

Technical Summary

Our organisation’s services can be developed and deployed separately, despite all talking to each other in the Cloud. This means we can neatly separate the services into logical bundles of Terraform infrastructure, each in its own module repository.

So for example, service_a, service_b, and service_c would all live in their own repos containing the Terraform code. They are created from a Template I set up to establish a standard format with best practices, and are versioned separately (we have automatic semantic version tagging using GitHub Actions, which is a modified implementation of this article).

We then have a separate ‘Central Infrastructure’ repository which uses Terragrunt to tie all these services together. This is key to Terragrunt’s purpose: Keep your code DRY (Don’t Repeat Yourself).

Our central infrastructure repository ends up containing no actual Terraform files, but simply the required Terragrunt configuration to pull in all the modules for each service:

An example of what dev/foo/service_a/terragrunt.hcl might look like:
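The original snippet is not reproduced here, so the following is only a minimal sketch of what such a file could contain, assuming the standard Terragrunt pattern of an include block, a pinned module source, and inputs. The organisation name, repository URL, version tag, and input names are all hypothetical placeholders.

```hcl
# Hypothetical sketch of dev/foo/service_a/terragrunt.hcl.
# The organisation, repository URL, version tag, and inputs are placeholders.

# Inherit shared settings (for example the generated provider config)
# from the nearest parent terragrunt.hcl, such as the account-level file.
include "root" {
  path = find_in_parent_folders()
}

# Pull a pinned, semantically versioned release of the service module
# from its own repository.
terraform {
  source = "git::https://github.com/example-org/service_a.git?ref=v1.2.3"
}

# Environment-specific values, passed to the module as Terraform variables.
inputs = {
  environment   = "foo"
  instance_type = "t3.small"
}
```

Under this layout, deploying the same service to another environment or region would mean adding another small file like this one with a different ref or inputs, rather than copying any Terraform code.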

As described in the Terragrunt documentation, we use generate blocks in the account-level hcl files, such as production/terragrunt.hcl, to set the role used for authentication by every module deployed to that account.

In this way, our modules for our services can be totally agnostic of things like the provider config for AWS Authentication.
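As a rough illustration of that idea, an account-level file might generate the AWS provider configuration along these lines. This is a sketch rather than our actual config: the region, account ID, and role name are made-up placeholders.

```hcl
# Hypothetical sketch of an account-level file such as production/terragrunt.hcl.
# The region, account ID, and role name below are placeholders.

# Write a provider.tf into every module deployed under this account folder,
# so the modules themselves never declare how to authenticate.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"

  # Assume the deployment role that belongs to this account.
  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/example-deploy-role"
  }
}
EOF
}
```

Because each account folder carries its own copy of this block with a different role ARN, the same module source can be deployed to Dev, QA, or Production without modification.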

With this set up, our Terraform changes and work can all be done in the separate services’ module repositories, and Terragrunt works from the Central Infrastructure repo as a sort of air traffic control, sending each service off to be deployed in the right place with the relevant config.

I really like how flexible this makes things; we can pick and choose which services to deploy to which Environments and Regions extremely easily, and none of the core Terraform code has to be duplicated to achieve this.

This may also allow developers to get hands-on with the Terraform more easily, as they can look at the Terraform for their team’s service in isolation without potentially affecting other services, or worrying about deploy-time complexities.

CI/CD in GitHub Actions

The GitHub Actions magic lives in the Central Infrastructure repository, as this is where all our modules get pulled together and deployed.

Pre-Requisite: Authentication from GitHub to AWS

Thanks to the semi-recent release of GitHub OIDC Authentication to AWS, GitHub Actions became the easy winner for choice of CI/CD tool, considering we don’t already have another tool established for this purpose.

In short, I chose to set up the OIDC authentication to a role in our central Shared Services Account (where no actual infrastructure lives other than our State S3 Bucket), and from there that role can assume roles in our other accounts:

[Figure: High-level auth flow for GitHub Actions to our AWS Accounts (only showing Dev and QA as examples)]

Deployment Workflow

[Figure: High-level flow of the GitHub Actions Workflow]

I created a reusable composite action for Terragrunt whose inputs dictate whether it runs only a plan, or an apply as well.

In the interest of brevity, I have removed some lines of code which are specific to our setup (variable names and such), or are self-explanatory based on context, as shown by <...>

Note that as all our Service modules live in Private repositories, we have set up a GitHub App for authentication to allow the pipeline to clone the module code (Generate Token and Replace git config url steps above).

Our main deployment Workflow yaml can use this action in its jobs like this:

Thanks to GitHub Enterprise allowing Environments to be set up on Private repositories, we also have approval gates at the apply steps.

The workflow ends up looking something like the following:

[Figure: Example GitHub Actions screenshot from my tests (which only went as far as QA env)]

Summary and future

While I did get a proof of concept working, I have not yet finished this implementation across all our live environments. There may still be things to iron out, and there will certainly be more improvements to make along the way.

For example, I do think the highly distributed setup could cause some difficulties, and we will have to be careful to keep our modules to a good standard and not let the number of variables get out of hand.

As a clean slate from which to start our Infrastructure Automation, I am happy with the result. Hopefully my summary is helpful to some of you.