Managing Multiple AWS Environments with Ease: Sceptre and AWS CDK in Harmony

You probably know AWS CDK: “The AWS Cloud Development Kit (AWS CDK) lets you define your cloud infrastructure as code in one of its supported programming languages.” Sceptre (an open source tool by Cloudreach) adds an extra layer of power for DevOps engineers: “Sceptre is a tool to drive CloudFormation. Sceptre manages the creation, update and deletion of stacks while providing meta commands which allow users to retrieve information about their stacks.”

As you can see, they have something important in common: CloudFormation. The output of an AWS CDK program is an AWS CloudFormation template, and Sceptre leverages CloudFormation templates to streamline stack deployments. Think of Sceptre as the conductor that orchestrates your CDK-defined infrastructure across multiple environments: it lets you focus on writing concise CDK code while Sceptre handles the environment-specific configurations and deployments. I found this article on the Dev Community website a good and brief explanation of Sceptre. The good news is that Sceptre supports AWS CDK as a templating source.

Managing diverse environments such as Development, Test, Acceptance, Staging and numerous production clusters with unique configurations can be a nightmare. Sceptre saved my day by simplifying deployments across them all using the same infrastructure code (plain CloudFormation templates, Troposphere or CDK)! Separating the configuration plane from the code plane, plus the ability to pass parameters across stacks, makes it very useful for DevOps teams. I used Troposphere for templating, but its development lost momentum and couldn’t keep up with official CloudFormation updates. So, when I heard about AWS CDK support in Sceptre, it was like a dream come true!

I gave it a try and I’m very happy with it. I didn’t have to change anything in the existing code; instead, I leveraged AWS CDK for new stacks. Amazingly, cross-stack references, hooks and parameter handling all worked as before. This is possible because of Sceptre’s modular design and reusability.

Now, let’s have a look at an example. The following Python code creates a simple OpenSearch cluster using its CDK construct:

from aws_cdk import CfnParameter
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_iam as iam
from aws_cdk import aws_opensearchservice as opensearch
from sceptre_cdk_handler import SceptreCdkStack


# Important: Notice how it subclasses SceptreCdkStack and passes **kwargs into the base class's
# __init__(). This is important to maintain compatibility with the different deployment_types.
class OpenSearchStack(SceptreCdkStack):
    def __init__(self, scope, id: str, sceptre_user_data: dict, **kwargs):
        super().__init__(scope, id, sceptre_user_data, **kwargs)
        # If you want to pass parameters like you do elsewhere in Sceptre, this works great!
        self.num_data_nodes = CfnParameter(self, 'NumOfDataNodes', type='Number')
        self.data_node_type = CfnParameter(self, 'DataNodeType')
        self.data_node_size = CfnParameter(self, 'DataNodeSize', type='Number')
        self.os_username = CfnParameter(self, 'Username', default='flybits-admin')
        self.warm_node_type = CfnParameter(self, 'WarmNodeType', default='ultrawarm1.medium.search')
        self.master_node_type = CfnParameter(self, 'MasterNodeType', default='r5.large.search')

        self.num_az = int(sceptre_user_data['NumOfAZs'])
        self.version = sceptre_user_data['Version']
        self.num_of_warm_nodes = sceptre_user_data['NumOfWarmNodes']
        self.num_of_master_nodes = sceptre_user_data['NumOfMasterNodes']

        # sceptre_user_data passes the literal string 'None'; map it to Python's None
        if self.num_of_warm_nodes == 'None':
            self.num_of_warm_nodes = None
        if self.num_of_master_nodes == 'None':
            self.num_of_master_nodes = None

        # The service-linked role for OpenSearch must exist before the domain is created
        slr = iam.CfnServiceLinkedRole(
            self, "Service Linked Role",
            aws_service_name="es.amazonaws.com"
        )

        self.os_domain = opensearch.Domain(self, "OS_Domain",
            version=opensearch.EngineVersion.open_search(version=self.version),
            enable_version_upgrade=True,
            capacity=opensearch.CapacityConfig(
                data_node_instance_type=self.data_node_type.value_as_string,
                data_nodes=self.num_data_nodes.value_as_number,
                warm_nodes=self.num_of_warm_nodes,
                warm_instance_type=self.warm_node_type.value_as_string,
                master_nodes=self.num_of_master_nodes,
                master_node_instance_type=self.master_node_type.value_as_string,
            ),
            zone_awareness=opensearch.ZoneAwarenessConfig(
                availability_zone_count=self.num_az
            ),
            ebs=opensearch.EbsOptions(
                volume_type=ec2.EbsDeviceVolumeType.GP3,
                volume_size=self.data_node_size.value_as_number
            ),
            encryption_at_rest=opensearch.EncryptionAtRestOptions(
                enabled=True
            ),
            node_to_node_encryption=True,
            enforce_https=True,
            fine_grained_access_control=opensearch.AdvancedSecurityOptions(
                master_user_name=self.os_username.value_as_string,
            )
        )

        self.resource_arn = self.os_domain.domain_arn + "/*"
        self.os_domain.add_access_policies(iam.PolicyStatement(
            actions=["es:*"],
            effect=iam.Effect.ALLOW,
            principals=[iam.AnyPrincipal()],
            resources=[self.resource_arn]
        ))

Its Sceptre configuration looks like the following:

template:
  type: cdk
  # The path is always within your project's templates/ directory.
  path: logging/opensearch.py
  deployment_type: bootstrapless
  bootstrapless_config:
    file_asset_bucket_name: !stack_attr template_bucket_name
    # It can be useful to apply the same prefix as your template_key_prefix to ensure your
    # assets are namespaced similarly to the rest of Sceptre's uploaded artifacts.
    file_asset_prefix: "cdk-assets"
  class_name: OpenSearchStack

# Parameters are DEPLOY-TIME values passed to the CloudFormation template. Your CDK stack construct
# needs to have CfnParameters in order to support this, though.
parameters:
  NumOfDataNodes: "2"
  DataNodeType: "r6g.large.search"
  DataNodeSize: "100"

sceptre_user_data:
  NumOfAZs: "2"
  RoleNeeded: true
  Version: "2.7"
  NumOfWarmNodes: None
  NumOfMasterNodes: None

In particular, look at the flexibility of introducing values for parameters in code. In Sceptre we have two ways of passing values: the first is parameters, which is equivalent to setting values for CloudFormation parameters; the other is sceptre_user_data, which is specific to Sceptre and whose resolvers take care of assigning the values in code. I used both in the code above to show their use cases. In particular, for variables like NumOfWarmNodes, whose value can be a number or None, you can’t use parameters because a CloudFormation parameter doesn’t accept two different types; instead, sceptre_user_data is very helpful and you can easily assign None or a number (please see the code).
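The 'None'-string handling shown in the stack code can be factored into a small helper. Here is a minimal sketch; the helper name is mine, not part of Sceptre:

```python
def normalize_user_data_value(value):
    """Map the literal string 'None' (as YAML passes it through
    sceptre_user_data) to Python's None; otherwise return an int."""
    if value is None or value == 'None':
        return None
    return int(value)

print(normalize_user_data_value('None'))  # None
print(normalize_user_data_value('3'))     # 3
```

This keeps the stack's __init__() free of repeated if/else blocks when more optional numeric values are added.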

In conclusion, if you are already using CloudFormation or AWS CDK and you are looking for a flexible and structured way of managing multiple environments with the same infrastructure code, a combination of Sceptre and AWS CDK is highly recommended.

Is Microservices Architecture Cheap or Expensive?

In the ever-evolving landscape of software architecture, the debate surrounding microservices and monolithic architecture has raged on for years. Back in 2015, I remember discussing Kubernetes with a software architect. He was understandably very interested in migrating applications to a microservices-based architecture and adopting Kubernetes. One of his most important selling points was cost. However, as time has shown, the reality of microservices architecture isn’t always as cost-effective as it might seem at first glance.

Microservices architecture has been touted for its ability to enhance flexibility and increase productivity and efficiency. Yet, the hidden costs associated with this approach have become apparent over time. Let’s dissect some of the key financial aspects.

Data Transfer Costs: In a microservices environment, communication between different services is frequent and essential. However, this constant communication leads to data transfer costs, which can significantly impact your budget. While techniques like Topology Aware Hints (TAH) can help mitigate these costs to some extent, they remain a noteworthy expenditure.
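To make the data transfer point concrete, here is a back-of-envelope sketch. The traffic volume is hypothetical, and the price assumes the common rate of about $0.01/GB charged in each direction for cross-AZ traffic in most AWS regions (check current pricing for your region):

```python
def monthly_cross_az_cost(gb_per_day, price_per_gb_each_way=0.01):
    """Cross-AZ traffic is billed on both the sending and the receiving
    side, hence the factor of 2. Assumes a 30-day month."""
    return gb_per_day * 30 * price_per_gb_each_way * 2

# e.g. 500 GB/day of east-west traffic between services
print(round(monthly_cross_az_cost(500), 2))  # 300.0
```

Even a modest amount of chatter between services adds up to a real line item on the bill.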

Operational Overhead: Microservices require the use of sidecars, service meshes, and a host of tools for operational and management purposes. These components are often more complex and resource-intensive than their counterparts used in monolithic applications, contributing to increased operational costs.

So, does this mean microservices architecture is more expensive? Or, as David Heinemeier Hansson (creator of Ruby on Rails) says, that microservices don’t make sense?

Well, in my opinion it depends (I know, the cliché!). If we compare microservices architecture to a well-designed monolithic architecture, its direct costs are in general higher. However, the benefits of microservices, such as flexibility and increased productivity, often lead to lower indirect costs associated with management and operations.

Consider the example of AWS Prime Video: by transitioning away from microservices to a monolithic architecture, the team reportedly achieved a 90% cost reduction. This demonstrates that simplicity can be a compelling factor in favor of monolithic architecture in certain scenarios.

Nevertheless, it’s important to note that many applications thrive in Kubernetes or other orchestrators, with stakeholders satisfied with their choice. The CNCF (Cloud Native Computing Foundation) community acknowledges the cost challenges and actively works on mitigating them through initiatives like TAH (Topology Aware Hints) and Karpenter.

In conclusion, the debate over whether microservices architecture is more expensive than monolithic architecture isn’t easily settled. Both approaches have their merits and drawbacks, and the cost-effectiveness of each depends on the specific use case and how well they are implemented.

As technology evolves, we can expect ongoing discussions about the financial aspects of software architecture. With initiatives like TAH and Karpenter, the CNCF community is actively addressing the cost challenges associated with microservices. Therefore, the future promises further exploration and innovation in the quest for cost-effective software solutions. Stay tuned for more developments in the years to come.

Karpenter in EKS and CloudTrail Events

We recently started to migrate from CAS (Kubernetes Cluster Autoscaler) and EC2 ASGs (Auto Scaling Groups) to Karpenter in some of our EKS clusters. So far so good and I’m happy with the results, especially because of the excellent use of Spot instances 🙂 and the reduction in our EC2 costs, but I noticed something interesting about CloudTrail logs.

I noticed that our CloudTrail costs in the accounts with Karpenter increased slightly. Looking closer, I saw a lot of UpdateInstanceInformation events, and the identity source for these events was the Karpenter node role making calls to AWS SSM. It makes sense because the nodes Karpenter provisions come with the SSM agent, and the SSM agent calls this Systems Manager API every 5 minutes to provide heartbeat information. So, if you configured CloudTrail to log all management events, you will see this event more often when you have an EKS cluster with Karpenter.
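If you want to see how much of your trail a given API call accounts for, you can tally events by name. A minimal sketch over records shaped like those returned by CloudTrail's lookup_events() (the sample records here are made up):

```python
from collections import Counter

def count_event_names(events):
    """Tally CloudTrail events by their EventName field."""
    return Counter(e['EventName'] for e in events)

# Hypothetical sample of lookup_events() records
sample = [
    {'EventName': 'UpdateInstanceInformation'},
    {'EventName': 'UpdateInstanceInformation'},
    {'EventName': 'RunInstances'},
]
print(count_event_names(sample))
```

In a real account you would feed this the Events list from boto3's cloudtrail.lookup_events(), or records parsed from the S3 log delivery, and watch UpdateInstanceInformation dominate.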

Use vault-cli in a pipeline

The other day I wanted to apply some HashiCorp Vault configurations in a pipeline to automate some operations, but Vault was a private deployment in a Kubernetes cluster. I was a bit lazy (not recommended) and also curious about this new (well, not anymore) trend which is ChatGPT! 🙂 So, here is my conversation, and it worked. It’s not complicated but it was helpful.

“I want to access an internal kubernetes service endpoint in a script. In specific, we have hashicorp vault inside kubernetes. I want to have a script that port forward to this service and then I use my local endpoint to run a couple of commands using ‘vault cli’ in the same script. can you please give me an example script?”

and the answer:

Certainly! Here’s an example script that demonstrates how you can port forward to a Kubernetes service endpoint and use the HashiCorp Vault CLI to run commands:

#!/bin/bash

# Port forward to the Vault UI service
kubectl port-forward -n hashicorp service/vault-ui 8200:8200 &
sleep 5 # Wait for port forwarding to be established

# Set the Vault endpoint to the local forwarded address
export VAULT_ADDR="http://127.0.0.1:8200"

# Authenticate to Vault (assuming token-based authentication)
export VAULT_TOKEN="your_vault_token"

# Run Vault CLI commands
vault status
vault secrets list

# Stop the port forwarding
kill %1

Using Account Factory for Terraform in Control Tower

Recently I worked on a governance project and decided to take a look at AWS Control Tower. I found it much more mature than a few years ago, with good documentation. Also, Landing Zone is integrated into Control Tower, which is really nice.

Having said that, especially when it comes to automation using code, Control Tower needs some improvements. Basically, there is no API for Control Tower, and things such as creating and managing guardrails can’t be done with code.

That said, some operations in Control Tower can be automated. One of the nice things in Control Tower is Account Factory, which enables us to create and manage AWS accounts in an organization’s landing zone. Even nicer, there is a GitOps model to automate account provisioning and customization in AWS Control Tower, named Account Factory for Terraform (AFT). The official documentation explains how to deploy AFT very well. It will create a couple of Lambda functions and pipelines; Step Functions and Service Catalog are also configured to process requests to create and manage AWS accounts using Terraform.

When Account Factory is deployed, we need to work with four repositories that trigger pipelines; each repository is responsible for a specific operation. For example, to create a new AWS account you have to use the aft-account-request repository. It uses a Terraform module with the same name and usually works well, provided you followed the documentation.

So far so good, but if something goes wrong, troubleshooting is a bit hard because this Terraform module is very simple and you need a good understanding of the whole workflow to be able to troubleshoot. Let me give an example: when I pushed a request to create a new AWS account, I wrote the name of the organizational unit (OU) in lower case, while apparently OU names are case sensitive. When I pushed the code, the Terraform pipeline succeeded with no error because it just inserts a key/value pair into a DynamoDB table, but obviously no account was created. To troubleshoot, I had to go over the whole procedure and check the logs of the Lambda functions and CodeBuild projects to figure out the root cause. It was a great experience and gave me deep knowledge of the procedure, but it can be difficult for operators who don’t have access to those Lambda functions and CodePipelines, which live in sensitive AWS accounts.
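A cheap guard against this class of mistake is a pre-flight check in the pipeline before the request is pushed. Here is a hypothetical sketch (the helper and its error messages are mine, not part of AFT); in real use the list of OU names could come from the Organizations API:

```python
def check_ou_name(requested, existing_ous):
    """Validate an OU name before submitting an aft-account-request.
    Raises with a helpful hint when only the letter case differs."""
    if requested in existing_ous:
        return requested
    for ou in existing_ous:
        if ou.lower() == requested.lower():
            raise ValueError(
                f"OU names are case sensitive: did you mean '{ou}'?")
    raise ValueError(f"Unknown OU '{requested}'")

print(check_ou_name('Sandbox', ['Sandbox', 'Workloads']))  # Sandbox
```

Failing fast in the request pipeline is far kinder to operators than a silently succeeding terraform apply followed by an account that never appears.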

So, if you are planning to use AFT, don’t rely only on Terraform pipelines; think about ways to facilitate troubleshooting. For example, I would recommend publishing all the logs of the Lambda functions, AWS Step Functions and AWS CodeBuild projects to a log collector and letting operators observe those logs even when terraform apply succeeds.

Do Math with AWS CloudWatch

We all know AWS CloudWatch: a very good monitoring service which we use to check metrics (or logs) of AWS resources. Personally, I always used CloudWatch in its simplest way: choosing a namespace, a dimension and then the desired metric (AWS or custom) and playing with the time frame.

Sometimes you need a more complex operation to monitor a situation than a simple metric, especially when you want to create an alert. CloudWatch has the capability to do math operations; you can find more information here, but I will explain a use case I faced recently. It’s related to the long-awaited RabbitMQ broker of Amazon MQ.

Not long ago, AWS announced the availability of RabbitMQ as a broker for the Amazon MQ service. RabbitMQ is very popular in distributed systems, so a managed service from AWS will help DevOps teams a lot 🙂

This is familiar to those who know RabbitMQ: in our use case we wanted to receive an alert when the rate of acknowledging messages falls considerably below the rate of publishing messages to queues. We came up with the following CloudFormation code to implement this alarm:

  AckRateAlarm:
    Type: 'AWS::CloudWatch::Alarm'
    Properties:
      AlarmDescription: Rate of Ack is considerably less than rate of publishing
      AlarmName: RabbitMQAckRateAlarm
      Metrics:
        - Id: a1
          Expression: "IF(ma1 > ma2 + 1000, 1, 0)"
          Label: "Ack rate vs Publish rate"
        - Id: ma1
          MetricStat:
            Metric: 
              MetricName: PublishRate
              Namespace: AWS/AmazonMQ
              Dimensions:
                - Name: Broker
                  Value: Foo
            Period: 300
            Stat: Average
          ReturnData: false
        - Id: ma2
          MetricStat:
            Metric: 
              MetricName: AckRate
              Namespace: AWS/AmazonMQ
              Dimensions:
                - Name: Broker
                  Value: Foo
            Period: 300
            Stat: Average
          ReturnData: false
      EvaluationPeriods: '2'
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - SNS_ARN
      OKActions:
        - SNS_ARN

This use case sounds simple, but I think it shows the benefit of using math in CloudWatch metrics well.
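The metric-math expression itself is easy to sanity-check offline. This sketch mirrors IF(ma1 > ma2 + 1000, 1, 0) in plain Python over two hypothetical 5-minute averages (the numbers are made up):

```python
def ack_lag_alarm_signal(publish_rate, ack_rate, gap=1000):
    """Mirror of the CloudWatch expression IF(ma1 > ma2 + 1000, 1, 0):
    returns 1 when publishing outpaces acking by more than `gap`."""
    return 1 if publish_rate > ack_rate + gap else 0

# Two consecutive 5-minute periods (hypothetical averages)
samples = [(5200, 3900), (5100, 4900)]
print([ack_lag_alarm_signal(p, a) for p, a in samples])  # [1, 0]
```

With EvaluationPeriods set to 2, the alarm only fires when the signal stays at 1 for two consecutive periods, which filters out short bursts.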

AWS Managed Prometheus and Grafana

AWS re:Invent 2020 is in progress and it’s full of new service introductions. For people like me who work day by day on both public cloud services (AWS in particular) and the cloud native ecosystem, the introduction of two observability services was very interesting: Amazon Managed Service for Prometheus and Amazon Managed Grafana.

They are in preview, but AWS already provides them to all users. Of course, they may be subject to change and are not recommended for production.

I really like such services because there is no vendor lock-in concern and, like other managed services, they can be used at scale. Other advantages are integrations with other services, including IAM. There are self-managed solutions for scaling or securing Prometheus and Grafana, but they are not easy to implement and the cost of maintaining them can be high.

I’m really excited to see how they work in practice but I wanted to share my view with you, maybe you are quicker than me in employing them and playing around 🙂

Using Flux in Microk8s

This post is a rather short one, like a real tweet, but I hope it helps people who might be in the same situation as me and can’t find much information online.

This time we are talking about Microk8s and Flux. Microk8s is a great implementation of Kubernetes, especially useful for edge and IoT, and an approved CNCF project. Flux is an implementation of GitOps by Weaveworks, which is also a candidate CNCF project. It’s a nice tool which brings Kubernetes configuration management to another level.

They work well together, but a small tweak is required in one of the steps. When you want to authorize Flux to access your GitHub repository (SSH based), you need its public key to be introduced as a Deploy Key in GitHub. The command which works for Microk8s is the following:

microk8s.kubectl -n flux --kubeconfig /var/snap/microk8s/current/credentials/client.config logs deployment/flux | grep identity.pub | cut -d '"' -f2

That’s it and you will get the required public key.
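To see what the grep/cut part of that pipeline actually extracts, here it is run against a hypothetical flux log line (the real line comes from the flux deployment's logs; the key here is a placeholder):

```shell
# Hypothetical flux log line containing the deploy key
line='ts=2020-12-01T00:00:00Z caller=main.go component=cluster identity.pub="ssh-rsa EXAMPLEKEY flux"'

# grep keeps only the line containing the key; cut takes the text between
# the first pair of double quotes, i.e. the public key itself
echo "$line" | grep identity.pub | cut -d '"' -f2
```

The output is just the ssh-rsa public key, ready to paste into GitHub's Deploy Key form.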

Integrating Kibana in Elastic Cloud with Okta – Step by Step

Implementing SSO is very useful when teams grow. Okta is well known as an identity provider and specifically for SSO. Elastic is also well known for their great products, including Elasticsearch and Kibana! Elastic started its hosted service (Elastic Cloud) and added nice features such as hot/warm deployments which made it popular. Both have good documentation, but when it comes to this specific integration, things are not clear. I spent some time and communicated with support on both sides, and in this post I will show how to integrate Kibana hosted by Elastic Cloud with Okta as IdP, step by step:

  • The first step is to configure the Okta side to get the assertion XML. Go to the Okta Admin page and add an application. Choose a SAML 2.0 app. Then you have to specify some basic information such as the app name and logo. Next is SAML Settings, which is the important part. In particular, the following parameters should be defined:
    1. Single sign on URL: for Elastic Cloud the format is:
      https://YOUR_CLUSTER_ADDRESS:9243/api/security/v1/saml
      Please note that /api/security/v1/saml is fixed (at least at the time this post is written)
    2. Audience URI (SP Entity ID): this is exactly the URL of your Kibana in Elastic Cloud, but please don’t forget the / at the end:
      https://YOUR_CLUSTER_ADDRESS:9243/
    3. Name ID Format: depends on your Okta usernames. In my case it’s EmailAddress

      [Screenshot: SAML Settings]
    4. Group Attribute Statements: this is very important for granular access management and role mapping in Kibana.
      For the Name, specify groups. Then, for better management, you can specify a filter to match groups that contain kibana (as an example). It will filter the groups you created in your Okta directory and will help in mapping them to Kibana/Elasticsearch X-Pack roles. For example, you can create a group in Okta with a name like kibana_admins and add the Okta users that you want to have superuser privileges in Elasticsearch to this group. We will come back to this mapping later.
      [Screenshot: Group Attribute Statements]
    5. It’s almost done now on the Okta side. You can review and check the guide given by Okta about how to introduce the assertion and metadata to the service provider (Kibana/Elasticsearch)
  • The next step is configuring Elasticsearch. The main guide for doing this on Elastic Cloud is the following:
    https://www.elastic.co/guide/en/cloud/current/ec-securing-clusters-SAML.html
    You can edit elasticsearch.yml by going to the Cloud Console, choosing your deployment, then the Edit option and then `User Setting Overrides`:
    [Screenshot: editing elasticsearch.yml]

    The following values are important:
    1. attributes.principal: the guide explains what this is, but in the case of Okta its value should apparently be nameid
    2. attributes.groups: it should be in line with item 4 of the previous step (Okta), but I recommend using exactly groups as the value
    3. idp.metadata.path: it is something like the following:
      https://YOURCOMPANY.okta.com/app/OKTA_APP_ID/sso/saml/metadata
      but you can get it by visiting the Sign On configuration in Okta. In the following picture, see the link pointing to the metadata in blue at the bottom.
      [Screenshot: metadata link]
    4. idp.entity_id: if you check the Setup Instructions in the Sign On configuration (picture above), it’s mentioned there; it looks like the following:
      http://www.okta.com/OKTA_APP_ID
    5. sp.entity_id: as the guide says, it should be:
      "KIBANA_ENDPOINT_URL/", but keep in mind that it should align with item 2 of the previous step in Okta, and again, don’t forget the trailing / 🙂
    6. The rest is straightforward and you won’t miss anything
  • Tips:
    • This is very important and took me and Elastic support a lot of time to troubleshoot: if you have hot and warm nodes, you should apply the configuration to both types of nodes; there are separate elasticsearch.yml files in the Cloud Console
    • The value for attributes.principal should be exactly nameid; nameid:persistent won’t work.
    • You must use the SAML realm name cloud-saml (mentioned in the guide)
  • The next step is role mapping. You can read about it here:
    https://www.elastic.co/guide/en/elastic-stack-overview/7.3/saml-role-mapping.html
    So far we have set up Okta to send some metadata along with the auth response; using these APIs we have to map the groups in Okta to roles in Elasticsearch. For example, I have two groups in Okta named kibana_operators and kibana_admins. Using the following mappings, I map them to the Monitor and superuser roles in Elasticsearch:
######## Role Mapping 1 - Operators #######
PUT /_security/role_mapping/saml-kibana-operators
{
  "roles": [ "Monitor" ],
  "enabled": true,
  "rules": { "all" : [
      { "field": { "realm.name": "cloud-saml" } },
      { "field": { "groups": "kibana_operators" } }
  ]},
  "metadata": { "version": 1 }
}

######## Role Mapping 2 - Admins #######
PUT /_security/role_mapping/saml-kibana-admins
{
  "enabled": true,
  "roles": [ "superuser" ],
  "rules": { "all" : [
      { "field": { "realm.name": "cloud-saml" } },
      { "field": { "groups": "kibana_admins" } }
  ]},
  "metadata": { "version": 1 }
}
  • Finally, you have to configure Kibana to use SAML as the authentication mechanism. This step is straightforward, as mentioned in Step 6 of the guide. Just edit kibana.yml using User Setting Overrides in the Elastic Cloud Console and specify the values accordingly.
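Since the two role-mapping bodies differ only in the Okta group and the Elasticsearch roles, they can be generated instead of hand-written. A small sketch; the helper name is mine, not an Elastic API:

```python
import json

def saml_role_mapping(group, roles, realm='cloud-saml'):
    """Build the JSON body for PUT /_security/role_mapping/<name>,
    matching the structure of the two mappings shown above."""
    return {
        'enabled': True,
        'roles': roles,
        'rules': {'all': [
            {'field': {'realm.name': realm}},
            {'field': {'groups': group}},
        ]},
        'metadata': {'version': 1},
    }

print(json.dumps(saml_role_mapping('kibana_admins', ['superuser']), indent=2))
```

This is handy when you have more than a couple of Okta groups to map and want the mappings to stay consistent.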

And that’s it! Now you should be able to create users in Okta and add them to the appropriate group to give them access to Kibana!

AWS ECS CloudFormation Timeout

This is more an informational post that may help others to feel less miserable in the same situation as I was! The scenario is this:

You are updating an ECS cluster via AWS CloudFormation, but for whatever reason the cluster doesn’t stabilize. You see the stack in the UPDATE_IN_PROGRESS state and you don’t receive any message on the CloudFormation Events page. If you can’t troubleshoot the issue with ECS and take no action, it will take 3 hours before CloudFormation times out and displays a message! At this point, as you can guess, CloudFormation will roll back. The situation can be even worse if the rollback cannot proceed successfully (in our case, a lack of resources prevented both the update and the rollback). Again, CloudFormation will be stuck in the UPDATE_ROLLBACK_IN_PROGRESS state and will time out after 3 hours! In a conversation I had with AWS support, they said this time is hard-coded and can’t be changed at the moment.
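One thing you can do is stop depending on CloudFormation's hard-coded timeout and poll the stack status yourself with your own deadline. A minimal sketch with an injectable status function so it can be exercised without AWS; in real use get_status would wrap boto3's cloudformation.describe_stacks():

```python
import time

def wait_for_stack(get_status, timeout_s=900, poll_s=15, sleep=time.sleep):
    """Poll stack status and give up long before CloudFormation's own
    3-hour timeout. `get_status` returns a status string such as
    'UPDATE_IN_PROGRESS' or 'UPDATE_COMPLETE'."""
    waited = 0
    while waited <= timeout_s:
        status = get_status()
        if not status.endswith('_IN_PROGRESS'):
            return status
        sleep(poll_s)
        waited += poll_s
    raise TimeoutError(f'stack still in progress after {timeout_s}s')

# Simulated status sequence instead of real API calls
seq = iter(['UPDATE_IN_PROGRESS', 'UPDATE_IN_PROGRESS', 'UPDATE_COMPLETE'])
print(wait_for_stack(lambda: next(seq), sleep=lambda s: None))  # UPDATE_COMPLETE
```

With a deadline like this in your deployment pipeline, you can start digging into ECS events after 15 minutes instead of discovering the problem 3 hours later.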

So, in such a situation: Keep Calm And Troubleshoot!