Recently I worked on a governance project and decided to take a look at AWS Control Tower. I found it much more mature than it was a few years ago, and well documented. Also, Landing Zone is now integrated into Control Tower, which is really nice.
Having said that, Control Tower still needs some improvements, especially when it comes to automation as code. At the time of writing there is essentially no API for Control Tower, so things such as creating and managing guardrails can’t be done in code.
That said, some operations in Control Tower can be automated. One of the nice things in Control Tower is Account Factory, which lets us create and manage AWS accounts in an organization’s landing zone. Even better, there is a GitOps model that automates account provisioning and account customization in AWS Control Tower, named Account Factory for Terraform (AFT). The official documentation explains how to deploy AFT very well. The deployment creates a couple of Lambda functions and pipelines, and also configures AWS Step Functions and AWS Service Catalog to process requests to create and manage AWS accounts using Terraform.
Once AFT is deployed, we work with four repositories, each of which triggers a pipeline and is responsible for a specific operation. For example, to create a new AWS account you use the aft-account-request repository. It uses a Terraform module with the same name and usually works well, provided you follow the documentation.
So far so good, but if something goes wrong, troubleshooting is a bit hard: the Terraform module is very simple, and you need a good understanding of the whole workflow to troubleshoot it.
Let me give an example. When I pushed a request to create a new AWS account, I wrote the name of the organizational unit (OU) in all lower case, while the OU name is apparently case sensitive. The Terraform pipeline succeeded with no error, because it only inserts a key/value pair into a DynamoDB table, but no account was created. To find the root cause, I had to go through the whole procedure and check the logs of the Lambda functions and CodeBuild projects. It was a great experience for me and gave me a deep understanding of the procedure, but it can be difficult for operators who don’t have access to those Lambda functions and CodePipeline pipelines, which are located in sensitive AWS accounts.
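One cheap safeguard against a mistake like this is a pre-flight check that compares the OU name in the account request (the `ManagedOrganizationalUnit` value in `control_tower_parameters`) with the real OU names before the request is pushed. Below is a minimal Python sketch, not part of AFT itself: the helpers `check_ou_name` and `fetch_ou_names` are hypothetical, and the lookup assumes boto3 credentials that are allowed to call the AWS Organizations read APIs (typically the management account).

```python
from __future__ import annotations


def check_ou_name(requested: str, existing: list[str]) -> str | None:
    """Return None if `requested` exactly matches an existing OU name, else an error message."""
    if requested in existing:
        return None
    # Flag names that differ only by case, since the OU name match is case sensitive.
    near = [name for name in existing if name.lower() == requested.lower()]
    if near:
        return (
            f"OU '{requested}' not found; did you mean '{near[0]}'? "
            "OU names are case sensitive."
        )
    return f"OU '{requested}' not found."


def fetch_ou_names() -> list[str]:
    """Collect the names of all OUs in the organization (assumes Organizations read access)."""
    import boto3  # imported here so check_ou_name can be used without AWS access

    org = boto3.client("organizations")
    root_id = org.list_roots()["Roots"][0]["Id"]
    paginator = org.get_paginator("list_organizational_units_for_parent")
    names: list[str] = []
    pending = [root_id]
    # Walk the OU tree depth-first, starting from the root.
    while pending:
        parent_id = pending.pop()
        for page in paginator.paginate(ParentId=parent_id):
            for ou in page["OrganizationalUnits"]:
                names.append(ou["Name"])
                pending.append(ou["Id"])
    return names
```

In a CI step before the push you could call `check_ou_name(requested, fetch_ou_names())` and fail the build when it returns a message, so a lower-case OU name is caught before the request ever reaches the DynamoDB table.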
As this example shows, if you are planning to use AFT, don’t rely only on the Terraform pipelines; think about ways to make troubleshooting easier. For example, I would recommend publishing all the logs of the Lambda functions, AWS Step Functions state machines and AWS CodeBuild projects to a log collector, so operators can observe those logs even when terraform apply succeeds.
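The recommendation above could be sketched with boto3: find the CloudWatch Logs groups that belong to AFT and attach a subscription filter that forwards them to a log collector. This is a minimal sketch under assumptions, not an AFT feature: the log group prefixes, the filter name `aft-log-collector` and both helper functions are hypothetical, and `destination_arn` is whatever your collector exposes (for example a Kinesis Data Firehose stream or a cross-account CloudWatch Logs destination). It would run with credentials in the account where the AFT Lambda functions and CodeBuild projects live. Step Functions execution logs only reach CloudWatch Logs if logging is enabled on the state machine, so those log groups would need to be added separately.

```python
from __future__ import annotations

# Assumption: AFT's Lambda functions and CodeBuild projects are named with an
# "aft-" prefix, so their log groups start with these prefixes. Verify in your deployment.
AFT_LOG_GROUP_PREFIXES = ("/aws/lambda/aft-", "/aws/codebuild/aft-")


def select_aft_log_groups(
    names: list[str], prefixes: tuple[str, ...] = AFT_LOG_GROUP_PREFIXES
) -> list[str]:
    """Return the de-duplicated, sorted log group names that start with one of the prefixes."""
    return sorted({name for name in names if name.startswith(prefixes)})


def forward_aft_logs(destination_arn: str, role_arn: str | None = None) -> list[str]:
    """Subscribe every matching AFT log group to a log collector destination."""
    import boto3  # imported here so select_aft_log_groups has no AWS dependency

    logs = boto3.client("logs")
    paginator = logs.get_paginator("describe_log_groups")
    found: list[str] = []
    for prefix in AFT_LOG_GROUP_PREFIXES:
        for page in paginator.paginate(logGroupNamePrefix=prefix):
            found.extend(group["logGroupName"] for group in page["logGroups"])

    selected = select_aft_log_groups(found)
    for name in selected:
        params = {
            "logGroupName": name,
            "filterName": "aft-log-collector",  # hypothetical filter name
            "filterPattern": "",  # an empty pattern forwards every log event
            "destinationArn": destination_arn,
        }
        # A role is only needed when delivering to a Kinesis/Firehose stream in the same account.
        if role_arn:
            params["roleArn"] = role_arn
        logs.put_subscription_filter(**params)
    return selected
```

Because subscription filters keep forwarding new events as they arrive, operators can watch the collector for Lambda or CodeBuild errors even when the aft-account-request pipeline reports success.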