AWS Managed Prometheus and Grafana

AWS re:Invent 2020 is in progress and it’s full of introducing new services. For people like me who work day by day on both public cloud services (AWS in specific) and cloud native ecosystem, introductions of 2 services related to Observability were very interesting:

They are in Preview but AWS provides it to all users already. Of course they may be subject to changes and not recommended for production.

I really like such services because there is no vendor locking concern and like other managed services can be used at scale. Other advantages are integration with other services including IAM. Although there are solutions for scalability or securing these services but they are not easy to implement and the cost of maintaining them can be high.

I’m really excited to see how they work in practice but I wanted to share my view with you, maybe you are quicker than me in employing them and playing around ūüôā

Using Flux in Microk8s

This post is a rather short one like a real tweet but I hope it help people who might be in the same situation as me and can’t find much information online.

This time we are talking about Microk8s and Flux. Microk8s is a great implementation of Kubernetes especially useful for edge and IoT which is approved project in CNCF. Flux is an implementation for GitOps by Weaveworks which is also a candidate project of CNCF. It’s also a nice tool which brings kubernetes configuration management into another level.

They work well together but there is a small tweak required in one of the steps. When you want to authorize your flux to access your github repository (SSH based) you need public key to be introduced as Deploy Key in Github. The command which works for microk8s is the following:

microk8s.kubectl -n flux --kubeconfig /var/snap/microk8s/current/credentials/client.config logs deployment/flux | grep | cut -d '"' -f2

That’s it and you will get the required public key.

Integrating Kibana in Elastic Cloud with Okta – Step by Step

Implementing SSO is very useful when teams grow. Okta is well known as Identity provider and in specific for SSO. Elastic is also well known for their great products including Elasticsearch and Kibana! Elastic started its hosted service (Elastic Cloud) and they added nice features such as Hot/Warm deployments which made it popular. They both have good documentation but when it comes to this specific integration, things are not clear. I spent some time and communicated with support on both sides and in this post I will show how to integrate Kibana hosted by Elastic Cloud with Okta as IdP, Step by Step:

  • First step is to configure Okta side to get the Assertion XML. Go to Okta Admin page and Add an application. Choose a SAML 2.0 App. Then you have to specify some basic information such as App name and Logo. Next is SAML Settings which is the important part. In specific the following parameters should be defined:
    1. Single sign on URL: for Elastic cloud the format is:
      please note that /api/security/v1/saml is fixed (at least by the time this post is written)
    2. Audience URI (SP Entity ID): This is exactly the URL of your Kibana in Elastic Cloud but please don’t forget / at the end:
    3. Name ID Format: depends on your Okta usernames. In my case it’s EmailAddress

      SAML setting
    4. Group Attribute Statements: This is very important for granular Access management and role mapping in Kibana.
      For the Name specify groups. This is important and then for better management you can specify a filter to map groups that contain kibana (as an example). It will filter groups that you created in Okta directory and will help in mapping with Kibana/Elasticsearch x-pack roles. for example you can create a group in Okta with a name like kibana_admins and add Okta users that you want to have superuser privilege in Elasticsearch to this group. We will come back to this mapping later.
      group attribute
    5. It’s almost done now at Okta side. You can review and check the guide which is given by Okta about how to introduce Assertion and Metadata to service provider (Kibana/Elasticsearch)
  • in Next step is configuring Elasticsearch. The main guide to do this on Elastic Cloud is the following:
    You can edit elasticsearch.yml by going to Cloud Console and choosing your deployment and then Edit option and then `User Setting Override`:
    edit elasticsearch

    The following values are important:
    1. attributes.principal: you can see the explanation what this is but in case of Okta, apparently its value should be nameid
    2. attributes.groups: It should be in line with item 4 of previous step (Okta) but I recommend to use exactly groups as value
    3. idp.metadata.path: It is something like the following:
      but you can get it by visiting Sign On configuration in Okta. In the following picture see the link pointing to Metadata in blue at the bottom.
      metadata link
    4. idp.entity_id: If you check Setup Instructions in Sign On configuration (picture above), it’s mentioned but it looks like the following:
    5. sp.entity_id: As the guide says it should be:
      "KIBANA_ENDPOINT_URL/" but keep in mind that it should align with item 2 of previous step in Okta and again, don’t forget / ūüôā
    6. The rest is straight forward and you won’t miss them
  • Tips:
    • This is very important and took me and Elastic support a lot of time to troubleshoot: If you have Hot and Warm nodes, you should apply the configuration to both type of nodes and there are separate elasticsearch.yml files on Cloud Console
    • The value for attributes.principal should be exactly nameid and nameid:persistent won’t work.
    • You must use the SAML realm name¬†cloud-saml (mentioned in the guide)
  • Next step is to do role mapping. You can read about it here:
    So far we setup Okta to send some metadata along with the auth response and using these API’s we have to map the groups in Okta with Roles in ElasticSearch. For example I have 2 groups in Okta named: kibana_operators and kibana_admins. using the following mappings I map them to Monitor and superuser roles in ElasticSearch:
######## Role Mapping 1 - Operators #######
PUT /_security/role_mapping/saml-kibana-operators
  "roles": [ "Monitor" ],
  "enabled": true,
  "rules": { "all" : [
      { "field": { "": "cloud-saml" } },
      { "field": { "groups": "kibana_operators" } }
  "metadata": { "version": 1 }

######## Role Mapping 2 - Admins #######
PUT /_security/role_mapping/saml-kibana-admins
   "enabled": true,
    "roles": [ "superuser" ],
    "rules": { "all" : [
        { "field": { "": "cloud-saml" } },
        { "field": { "groups": "kibana_admins" } }
    "metadata": { "version": 1 }
  • Finally, you have to configure Kibana to use SAML as authentication mechanism. This step is straightforward as mentioned as Step 6 of this guide. Just Edit kibana.yml by using User setting Overrides in Elastic Cloud Console and specify the values accordingly.

And that’s it! Now you should be able to create users in Okta and add them to appropriate group to enable them to access Kibana!

AWS ECS CloudFormation Timeout

This is more an informational post that may help others to feel less miserable in the same situation as I was! The scenario is this:

You are updating an ECS cluster via AWS CloudFormation but for whatever reason the cluster doesn’t stabilize. So, you see the stack is in UPDATE_IN_PROGRESS¬†state and you don’t receive any message in CloudFormation¬†Events¬†page. If you can’t troubleshoot the issue with ECS and take no action, It will take 3 hours before CloudFormation timeouts and display a message! At this point, as you can guess, CloudFormation will rollback. Situation can be even worse if Rollback can not be proceeded successfully (in our case, there was a lack of resources preventing update and rollback). Again, CloudFormation will stuck in UPDATE_ROLLBACK_IN_PROGRESS¬†state and will timeout after 3 hours! In a conversation I had with AWS support, they said this time is hard-coded and can’t be changed at the moment!

So, in such a situation: Keep Calm And Troubleshoot!

Lambda function for RDS Slow Query

Lambda functions are just another great tool provided by AWS to solve issues in a modern way! Using Lambda functions, you can run a micro service without a need to have a server and think of how to configure and maintain it!

There are lots of use cases for Lambda functions; here I used it to implement a service which sends alerts in case there is a slow query running in RDS. Of course slow queries are important for developers as it helps them to debug better and improve performance of the application. You can find the code here but there are some other things to be considered:

  • As you may know, there are some ways to trigger a Lambda function. In this case, using CloudWatch Events to schedule it periodically makes sense.
  • The lamda function should have some permissions to get RDS Logs and send alerts using SNS. To find out how to define required rules, please see this¬†AWS documentation. You are also asked to do this when creating Lambda function.
  • There is a parameter named ‘distinguisher’ which is actually the keyword specifying the occurrence of slow query. For ‘Postgresql’ RDS it can be ‘
  • Parameters Group in RDS should be configured to log slow queries. To know how to do this please see AWS documentation or this guide:Enabling slow query log on Amazon RDS

ElasticSearch snapshot on S3

If you use ElasticSearch for Log analysis, you probably need to have backup and retirement strategy. It’s very handy to store a backup on a S3 bucket and configure lifecycle on that S3 bucket. I know there is a plugin (curator) that can do this but I preferred to use another approach and use ElasticSearch REST API’s. Here is a step to step guide about how to achieve this:

1) install AWS plugin:

2) create repository in your Elasticsearch cluster:

curl -XPUT 'localhost:9200/_snapshot/backup_s3_repository?pretty' -d'
"type": "s3",
"settings": {
"bucket": "BUCKETNAME",
"region": "REGION",


  • AWS plugin should be installed on all nodes and services should be restarted to recognize plugin; otherwise you will get this error:

“Unknown [repository] type [s3]”

3) create snapshot:

curl -k -XPUT ‘https://localhost:9200/_snapshot/backup_s3_repository/snapshot_name?pretty?wait_for_completion=true’

4) create a cron job for taking snapshots (for step 3). You can skip `wait_for_completion=true` in cron job

5) Configure Lifecycle for that S3 bucket.

Healthcheck with Serverspec

As I mentioned in my previous post, I would like to introduce a project I have recently started which is a Chef cookbook about system healthcheck.

The idea is to generate a healthcheck page based on results of running¬†Serverspec tests. ¬†Specifically, it would be helpful for load balancers and more specifically for AWS ELB healthcheck. According to¬†Serverspec: it tests your servers’ actual state by executing command locally, via SSH, via WinRM, via Docker API and so on. So you don’t need to install any agent softwares on your servers and can use any configuration management tools. But the true aim of Serverspec is to help refactoring infrastructure code. Load Balancers check health of system using this healthcheck URL and stop sending traffic if it’s not healthy. The generated health check page can be handy¬†for this URL then. If self-check fails, it will return an error status code, otherwise 200 status code. In this cookbook, 2 concepts are used:

  • Infrastructure test as code using Serverspec: It’s best if we have a test-driven infrastructure as code but even if there is no code for infrastructure, this cookbook¬†can still help!
  • Loadbalancer healthcheck: if healthchek page is comprehensive enough, this check can help in having a high available system. In case of AWS, when ELB is accompanied by Auto Scaling Group, a new healthy instance will replace the bad one if ELB notices an issue.

So, if you are interested in test driven coding especially in infrastructure, please have a look. Looking forward to hear your feedbacks.

Open Source Concept and Public Cloud

I’m really a big fan of open source concept as I’ve seen its benefits in improving quality of the world we are living in. Especially in a country where access to high quality materials is hard, it’s the only legal and fair way to learn more and be able to contribute. The other way to access valuable data/tools actually is cheating! and I really hate that. I’m proud that I have never cheated in my whole educational life whereas it was really common to cheat and win! Open source development is also taking and giving which satisfies your spirit!

One of the things I really love working with public cloud (in specific, AWS) is its openness and nice documentation which enables everyone¬†to implement their ideas that¬†is in line¬†with Open Source concept. In addition, AWS emphasis on DevOps and its integration with open source tools such as Chef, Packer, … has helped in fortifying open source culture in public cloud context. You can find great tools and utilities which are developed for AWS. In my next post, I will introduce a project that I have started and any contribution is more than welcomed!

Chocolatey cookbook issue with Packer

I recently had difficulties using Chocolatey chef cookbook to install packages on Windows 2012 R2 EC2 instances via Packer. For those who have issues, I would recommend to use an older version of chocolatey cookbook (12_5_fix). A solution is to modify Berksfile with the following:

cookbook 'chocolatey', git: "", branch: "12_5_fix"

AWS Solutions Architect Certification

On the new exciting journey that I started in public cloud, today I earned AWS Solutions Architect (Associate Level) certificate. It was a bit more challenging than what I expected but it was fun! For those who want to pass the exam, in my opinion despite what’s said about the focus of exam on VPC, RDS, high availability and scalability; the truth is that you should get familiar with almost all of the services and update yourself with new ones. For example, to my surprise I didn’t have any direct question about RDS but instead 5 or 6 questions about SQS and SWF and 1 question about Kinesis! I suppose questions are randomly selected and others may have different experiences but it’s a good idea to know basics of all the services. Of course, VPC, security, high availability and scalability are super important and you must be fluent in them but all I say is that they are not enough for passing the exam. Also, expect more scenario sort of questions which include different concepts rather than direct one sentence questions that you may find in internet.

So, if you are preparing for this exam, work harder and good luck!