About this blog post

Monitoring the performance and availability of your infrastructure is crucial for ensuring the smooth operation of your applications and services.

At Sipfront, we are fully dedicated to monitoring telecommunications systems 24/7, with functional and performance tests at your service, giving carriers and developers the peace of mind they so desperately need.

That raises the question: who watches the watchmen?

We monitor our own infrastructure 24/7 as well, of course, and our tool of choice for this task is Datadog. In this blog post, we describe how we manage our Datadog monitors using Terraform, since we follow a strict Infrastructure-as-Code paradigm.

TL;DR

Managing infrastructure and Datadog monitors in terraform is AWESOME!

--> GitHub

Why Datadog?

Datadog is a powerful monitoring tool that can help you collect, analyze, and visualize metrics, traces, and logs from your entire stack. In this blog post, we’ll explore how you can use Terraform to create Datadog monitors, which can help you detect and alert on critical events in your infrastructure.

What are Datadog monitors?

Datadog monitors are automated alerts that can be triggered based on a wide range of conditions, such as specific metric thresholds, anomalies, changes in the state of your infrastructure, or custom events. Monitors can be configured to notify you via email, SMS, Slack, or other channels when a threshold is exceeded or an event occurs. Monitors can also be used to trigger automated actions, such as running a remediation script or scaling up/down a resource.

Why use Terraform to create Datadog monitors?

Terraform is an infrastructure-as-code tool that allows you to define your infrastructure in a declarative way using a domain-specific language (DSL). Terraform supports a wide range of cloud providers, services, and tools, including Datadog. By using Terraform to define your Datadog monitors, you can:

  - Define your monitors in code, making it easier to version, review, and collaborate on changes.
  - Automate the creation, modification, and deletion of monitors, reducing the risk of manual errors and ensuring consistency across environments.
  - Store your monitors in a version control system, making it easier to track changes and roll back to a previous state if needed.

How to use Terraform to create Datadog monitors?

To use Terraform to create Datadog monitors, you’ll need to follow these general steps:

  1. Set up your Datadog account and API key: You’ll need to have a Datadog account and an API key with the appropriate permissions to create and manage monitors.

  2. Install and configure the Datadog provider for Terraform: The Datadog provider is a third-party Terraform plugin that allows you to manage Datadog resources, including monitors. You’ll need to install and configure the provider by adding it to your Terraform configuration file (e.g., main.tf) and setting your Datadog API key as a provider-level variable.

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key # an application key is also required to manage monitors
}
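
With current versions of the provider you also need to declare its source, and you will want variables for both keys. A minimal sketch (the variable names datadog_api_key and datadog_app_key are our choice, not dictated by the provider):

terraform {
  required_providers {
    datadog = {
      source = "DataDog/datadog"
    }
  }
}

variable "datadog_api_key" {
  type      = string
  sensitive = true
}

variable "datadog_app_key" {
  type      = string
  sensitive = true
}
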
  3. Define your monitor resource: You’ll need to create a Terraform resource block that defines your monitor. The resource block should include the type of monitor, the conditions that trigger the monitor, and the notification channels that should receive alerts. Here’s an example of a monitor that triggers an alert when the CPU usage of a specific host exceeds 80% for 5 consecutive minutes:
resource "datadog_monitor" "webserver-cpu-monitor-prod-us-east-1" {
  name    = "Webserver CPU usage on Prod us-east-1"
  type    = "query alert"
  query   = "avg(last_5m):avg:system.cpu.user{env:production,region:us-east-1} by {host} > 80"
  message = "CPU usage on {{host.name}} is above 80% for 5 consecutive minutes"
  monitor_thresholds {
    warning  = 70
    critical = 80
  }
  include_tags = true
  tags         = ["region:us-east-1", "env:production", "service:webserver"]
}
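
The example above does not yet tell Datadog whom to alert. Notification targets are simply @-handles embedded in the monitor message, so extending the message is all it takes. A sketch, assuming a configured Slack integration with a channel named ops-alerts (both handles are placeholders):

message = <<-EOT
  CPU usage on {{host.name}} is above 80% for 5 consecutive minutes.
  Notify: @support@yourdomain @slack-ops-alerts
EOT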

For details, please refer to the official documentation. There are also plenty of tutorials and examples out there on the web.

How to create bucketloads of monitors with terraform?

In most use cases, infrastructure isn’t just a few machines, load balancers and a database. Production and development infrastructure consists of numerous instances, VMs, databases and services.

To monitor CPU and memory usage of our ECS infrastructure alone, we need 120 individual monitors. There is no sane way to create and manage these monitors manually.

To tackle this problem, we make use of Terraform’s for_each meta-argument and nested for expressions.

First, we define all the environments, services and regions we want to monitor.

list_of_environments = ["development", "production"]
list_of_ecs_services = ["app", "website", "wizard", "api"]
list_of_regions      = ["eu-central-1", "us-east-1"]

Second, we define the monitor types, each with its warning and critical thresholds.

map_of_monitor_types = {
  cpuutilization = {
    name           = "cpuutilization",
    critical_level = "80",
    warning_level  = "70"
  },
  memory_utilization = {
    name           = "memory_utilization",
    critical_level = "80",
    warning_level  = "70"
  }
}

Get some notification E-Mail addresses and a service tag-name in there:

notify_emails = ["@support@yourdomain", "@theman@yourdomain"]
aws_service   = "ECS"
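
For reference, here is roughly how the pieces above fit together in one locals block. The nesting under a map called ecs is just our convention, and it matches the local.ecs.* and local.notify_emails references used in the next step:

locals {
  notify_emails = ["@support@yourdomain", "@theman@yourdomain"]

  ecs = {
    list_of_environments = ["development", "production"]
    list_of_ecs_services = ["app", "website", "wizard", "api"]
    list_of_regions      = ["eu-central-1", "us-east-1"]

    map_of_monitor_types = {
      cpuutilization = {
        name           = "cpuutilization",
        critical_level = "80",
        warning_level  = "70"
      },
      memory_utilization = {
        name           = "memory_utilization",
        critical_level = "80",
        warning_level  = "70"
      }
    }

    aws_service   = "ECS"
    notify_emails = local.notify_emails
  }
}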

Now the magic happens: we collect and concatenate all this data into one list of objects:

all_ecs_monitor_data = distinct(flatten([
  for env in local.ecs.list_of_environments : [
    for service in local.ecs.list_of_ecs_services : [
      for type in local.ecs.map_of_monitor_types : [
        for region in local.ecs.list_of_regions : {
          dd_type        = type.name
          name           = "${service} ${type.name} ${env} ${region}"
          env            = env
          service        = service
          region         = region
          warning_level  = type.warning_level
          critical_level = type.critical_level
          tags           = ["env:${env}", "region:${region}", "service:${local.ecs.aws_service}", "name:${service}", "type:${type.name}"]
          notify_emails  = local.notify_emails
          message        = "${type.name} of ${service} too high, Notify: ${join(",", local.ecs.notify_emails)}"
          query          = "avg(last_30m):avg:aws.ecs.service.${type.name}.maximum{servicename:${service},env:${env},region:${region}} > ${type.critical_level}"
        }
      ]
    ]
  ]
]))

all_ecs_monitor_data now contains every unique combination of the given data, where one entry looks like this:

{
    "critical_level" = "80"
    "dd_type" = "memory_utilization"
    "env" = "production"
    "message" = "memory_utilization of website too high, Notify: @support@yourdomain,@theman@yourdomain"
    "name" = "website memory_utilization production us-east-1"
    "notify_emails" = [
      "@support@yourdomain",
      "@theman@yourdomain",
    ]
    "query" = "avg(last_30m):avg:aws.ecs.service.memory_utilization.maximum{servicename:website,env:production,region:us-east-1} > 80"
    "region" = "us-east-1"
    "service" = "website"
    "tags" = [
      "env:production",
      "region:us-east-1",
      "service:ECS",
      "name:website",
      "type:memory_utilization",
    ]
    "warning_level" = "70"
}

HINT: You can use an output value to print the whole dataset.
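
A minimal sketch of such an output, dumping the list built above:

output "all_ecs_monitor_data" {
  value = local.all_ecs_monitor_data
}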

Since this is a list of objects, and the for_each meta-argument of a Terraform resource only accepts a map (or a set of strings), we need to transform this data into a map:

for_each = {
  for index, monitor in local.all_ecs_monitor_data :
  monitor.name => monitor
}

Let’s create the monitors, shall we?

resource "datadog_monitor" "sipfront_ecs_monitor_automation" {
  for_each = {
    for index, monitor in local.all_ecs_monitor_data :
    monitor.name => monitor
  }

  name               = each.value.name
  type               = "query alert"
  query              = each.value.query
  message            = each.value.message
  escalation_message = each.value.message
  monitor_thresholds {
    warning  = each.value.warning_level
    critical = each.value.critical_level
  }

  include_tags = true
  tags         = each.value.tags
}

That’s it. Run terraform plan first to check that everything is OK.

You can find the full example on GitHub.

We hope you find this blog post about our work helpful! Happy Monitoring!
