Automated alerting for data loss with Elastic Watcher

Watcher is a feature of the Elastic Stack that lets you send notifications or trigger remote actions based on certain events or thresholds within your data.

In this blog post we will show how to create a watch that sends an automated notification to one of your Slack users or channels when an index no longer receives data, so the issue can be quickly remediated. We strongly suggest having one of these watches for each of your indices so that you are notified promptly and can prevent further data loss.

In order to send Slack messages via Watcher you need to configure the Slack integration appropriately. This guide will not cover that integration in detail, but you can read more about it here (https://www.elastic.co/guide/en/elastic-stack-overview/current/actions-slack.html#configuring-slack).
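
For reference, on recent 7.x releases the Slack incoming webhook URL is typically stored in the Elasticsearch keystore rather than in elasticsearch.yml. A minimal sketch, assuming an account label of "monitoring" (the label is your choice):

bin/elasticsearch-keystore add xpack.notification.slack.account.monitoring.secure_url

# elasticsearch.yml – optionally make this account the default one used by Slack actions
xpack.notification.slack.default_account: monitoring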

A watch is composed of four main parts: trigger, input, condition and action. Let’s take a look at each of these components.


"trigger": {
"schedule": {
"interval": "1h"
}
}

The trigger defines how often the watch runs. The schedule supports multiple options such as hourly, daily, weekly, monthly, yearly, a fixed interval or even cron expressions. How often the watch should run depends on the use case you are trying to cover; in our case of checking whether data is coming into an index, the interval should reflect how often data is streamed to that particular index. For this example we will use a one hour interval for the check.
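
If your data arrives on a fixed schedule rather than a steady stream, the trigger could instead use a cron expression. A minimal sketch (note that Watcher cron expressions include a leading seconds field; this one fires at the top of every hour):

"trigger": {
  "schedule": {
    "cron": "0 0 * * * ?"
  }
}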


"input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "random-index*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "from": "now-1h",
                      "to": "now"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  }

The input of the watch contains the search query that determines which data is taken into consideration when the watch runs. Let’s have a look at some of the more important elements:

  • “search_type”: “query_then_fetch” – the search type used for the request. The query is executed and the matching documents are fetched into the watch payload, which is what lets you populate the Slack message with dynamic values from your data.
  • “indices” – the index (or index pattern) the watch will query.
  • “rest_total_hits_as_int”: true – a setting introduced in Elasticsearch 7.0. Setting it to true returns hits.total as a plain integer rather than an object, so the total number of documents matched by the query can be compared directly in the condition.
  • “query”: { “bool”: { “filter”: [ { “range”: { “@timestamp”: { “from”: “now-1h”, “to”: “now” } } } ] } } – the query the watch will run. Your options here are endless when combining various types of filters and aggregations, depending on your use case (see the sketch after this list). For our example we only set a range filter on the @timestamp field, which holds the time when an event took place. This lets us look at the last hour of data and determine whether any documents were received in the index.
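
To illustrate how the input can be narrowed further, here is a variation of the request body that also filters on a specific field value. This is only a sketch: the field name "event.module" and the value "apache" are hypothetical and should be replaced with a field from your own data.

"body": {
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "from": "now-1h",
              "to": "now"
            }
          }
        },
        {
          "term": {
            "event.module": "apache"
          }
        }
      ]
    }
  }
}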

"condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "lt": 1
      }
    }
  }

The condition defines whether a watch’s action should be triggered or not. The condition can be of multiple types such as always, never, compare, array_compare or a script.

For our example we will use a compare condition and match it to the situation in which the number of documents received in an index is less than 1 which would indicate that the index is no longer receiving data.

The important part of the condition is accessing the watch context, which holds multiple values that can be used to determine whether the action should be triggered. For our example we want to know if any documents at all were received in the index, so we look at the total number of hits for our query and require that this number be less than 1.
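
If you ever need more complex logic than a single comparison, the same check can also be written as a script condition. A minimal sketch, equivalent to the compare condition above:

"condition": {
  "script": {
    "source": "return ctx.payload.hits.total < 1"
  }
}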


"actions" : {
  "notify-slack" : 
    "throttle_period" : "60m",
    "slack" : {
      "message" : {
        "to" : [ "#Index-monitoring" ], 
        "text" : "There are  {{ctx.payload.hits.total.value}} documents received in the past hour!" 
      }
    }
  }
}

The action defines what happens when the watch’s condition is met. Actions can be of multiple types such as email, Slack message, logging, indexing, PagerDuty, Jira or webhook. For our use case we want to send a Slack message to a certain channel in order to notify our engineers when an index stops receiving data so that they can investigate the issue.

This action sends a message to the Slack channel index-monitoring stating that 0 documents were received in the past hour. We could also send the message to a specific user by adding the “@” prefix and the user ID (see the sketch below); note that if you use the user name instead, the notification will not go through, you specifically need the user ID from Slack. The {{ctx.payload.hits.total}} placeholder is a dynamic value populated from the watch context. It shows the total number of documents matched by the query, but because the action only triggers when fewer than 1 document was received, it will always show 0.
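
For example, to also notify an individual engineer directly, the "to" field can list a user ID alongside the channel. A sketch in which "U012AB3CD" is a placeholder for a real Slack member ID:

"slack" : {
  "message" : {
    "to" : [ "#index-monitoring", "@U012AB3CD" ],
    "text" : "There are {{ctx.payload.hits.total}} documents received in the past hour!"
  }
}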

Putting it all together


{
  "trigger": {
    "schedule": {
      "interval": "1h"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "random-index*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "from": "now-1h",
                      "to": "now"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "lt": 1
      }
    }
  },
"actions" : {
  "notify-slack" : 
    "throttle_period" : "60m",
    "slack" : {
      "message" : {
        "to" : [ "#Index-monitoring" ], 
        "text" : "There are  {{ctx.payload.hits.total.value}} documents received in the past hour!" 
      }
    }
  }
}
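
One way to deploy and test the watch is through the Watcher API, for example from Kibana Dev Tools. A sketch, where the watch ID "random-index-data-loss" is just an example name:

# create (or update) the watch under an ID of your choosing
PUT _watcher/watch/random-index-data-loss
{ ...the full watch body shown above... }

# run the watch once on demand to verify the input, condition and Slack action
POST _watcher/watch/random-index-data-loss/_execute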

With a watch like this in place for each of your indices, you will be quickly notified of any data loss that might occur. As we all know, keeping a steady and constant flow of data is paramount for accurate analysis.