WHY IS THIS APP USEFUL?
Have you ever tried to use Splunk to solve a problem, only to find that the logs you need had stopped coming in? After hearing “where’s my data?” from clients many times, we decided to tackle this problem. After a few years of internal development and refinement, we released the Broken Hosts App for Splunk on Splunkbase.
The Broken Hosts App for Splunk monitors data going into Splunk and alerts you when hosts stop sending data. It tracks the last time that Splunk received data for each combination of host, sourcetype, and index. If data for a given combination is arriving later than expected, it sends an alert so that the issue can be resolved.
This is the first part in a multi-part series. In the rest of the series, we’ll dive into the nitty-gritty details of how to set up the app and do the initial configuration, and discuss how to respond when it alerts you to an issue.
APP COMPONENTS OVERVIEW
Three main components work together to make the Broken Hosts App for Splunk work. The first is a set of four search macros that define some base settings. The second is the “expectedTime” lookup table, which contains any tuning settings that are needed. The last is a saved search, which uses the macros, the lookup table, and data from the Splunk indexes to determine if an alert needs to be generated.
Let's dive into these components in more detail. Below are the four macros that set up some base settings that will be used by the app. The first two can be overridden by settings in the “expectedTime” lookup table.
- default_contact - The email address that alert emails are sent to - This can be overridden in the lookup table - (default is “email@example.com”)
- default_expected_time - The default amount of time (in seconds) that a host can be late before it alerts - This can be overridden in the lookup table - (default is “14400” - 4 hours)
- ignore_after - Any host/index/sourcetype combination that hasn’t sent any events in this amount of time (in seconds) will be ignored and will NOT alert - (default is “2592000” - 30 days)
- search_additions - Additional SPL commands that are added near the beginning of the search to perform custom actions - (default is “noop”)
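To make the relationship between the macro defaults and the lookup table concrete, here is a minimal Python sketch of how a default could be overridden by a lookup row. It is an illustration only, not the app's actual SPL: the function name `effective_settings` and the dict form of a lookup row are hypothetical, and it assumes that the lookup's lateSecs field is what overrides default_expected_time.

```python
from typing import Optional

# Default macro values quoted in the list above.
DEFAULT_CONTACT = "email@example.com"
DEFAULT_EXPECTED_TIME = 14400  # seconds (4 hours)
IGNORE_AFTER = 2592000         # seconds (30 days)


def effective_settings(row: Optional[dict]) -> dict:
    """Overlay a matching lookup row (hypothetical dict form) on the defaults.

    Assumption: a blank or missing lookup value falls back to the macro default.
    """
    row = row or {}
    late_secs = row.get("lateSecs") or DEFAULT_EXPECTED_TIME
    contact = row.get("contact") or DEFAULT_CONTACT
    return {"lateSecs": int(late_secs), "contact": contact}
```

For example, a row with lateSecs set to 600 and a blank contact would alert after 10 minutes and still send email to the default_contact address.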
The “expectedTime” lookup table contains the settings that can be tuned for your environment. This is where most of the work with this app happens. The lookup table is processed from the top down, and the first matching row wins, so more specific entries should be placed above broader wildcard entries. There are seven fields in this lookup table (all fields are NOT case sensitive):
- index - The index for the data that you would like to match - this field does accept wildcards - this field is required
- sourcetype - The sourcetype for the data that you would like to match - this field does accept wildcards - this field is required
- host - The host for the data that you would like to match - this field does accept wildcards - this field is required
- lateSecs - The amount of time (in seconds) that the index/sourcetype/host combination is allowed to be late before it alerts - this field is required
- suppressUntil - Alerts for the index/sourcetype/host combination will be suppressed until this date - since we use the “convert auto()” function for this field, you can use any date format that converts to a number - we recommend: “MM/DD/YYYY HH:MM:SS” or epoch time - this field is optional
- contact - The email address where you would like the alert to be sent - if this is blank, the email address from the default_contact macro will be used - this field is optional
- comments - Any comments that you would like to add for that line of the lookup table - this information is not used in the alert - typically used to record why the entry is needed, when it was added, who added it, or any other details - this field is optional
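The top-down, first-match behavior described above can be sketched in Python. This is an illustration of the matching idea, not how Splunk implements lookups: the helper names `wildcard_match` and `first_match` and the sample rows are hypothetical, and it assumes that “not case sensitive” means values are compared case-insensitively and that `*` is the only wildcard character.

```python
import re
from typing import Optional


def wildcard_match(pattern: str, value: str) -> bool:
    """Return True if value matches pattern, where '*' matches any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, value, flags=re.IGNORECASE) is not None


def first_match(rows: list, index: str, sourcetype: str, host: str) -> Optional[dict]:
    """Walk the rows from the top down and return the first one that matches."""
    for row in rows:
        if (wildcard_match(row["index"], index)
                and wildcard_match(row["sourcetype"], sourcetype)
                and wildcard_match(row["host"], host)):
            return row
    return None


# Hypothetical lookup contents: a specific entry placed above a catch-all.
rows = [
    {"index": "main", "sourcetype": "syslog", "host": "web*", "lateSecs": "3600"},
    {"index": "*", "sourcetype": "*", "host": "*", "lateSecs": "86400"},
]
```

With these rows, a host named web01 in index main with sourcetype syslog picks up the one-hour entry, while every other combination falls through to the catch-all row. Reversing the row order would make the catch-all win every time, which is why ordering matters.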
The saved search pulls the latest time for each index/sourcetype/host combination. Then, it runs the `search_additions` macro (for any custom parsing that you need). At this point, it runs the index, host, and sourcetype combination through the expectedTime lookup table to see if there is an entry that matches. It uses the information from the lookup table to determine if an alert needs to be generated. Two things will cause an alert to be generated: 1) if the last time an event was received is older than the amount allowed by lateSecs, or 2) if the last time an event was received is in the future. The latter case indicates a timezone or timestamp extraction issue.
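The decision logic described above can be summarized in a short Python sketch. This is not the app's saved search: the function name `alert_reason`, its parameters, and the order in which the checks run are assumptions made for illustration. All times are epoch seconds.

```python
from typing import Optional


def alert_reason(last_time: float, now: float, late_secs: int,
                 ignore_after: int = 2592000,
                 suppress_until: Optional[float] = None) -> Optional[str]:
    """Return "late", "future", or None (no alert) for one host/sourcetype/index."""
    # Assumption: suppression is checked first and silences both alert types.
    if suppress_until is not None and now < suppress_until:
        return None
    age = now - last_time
    if age < 0:
        # Latest event is timestamped in the future: likely a timezone or
        # timestamp extraction issue.
        return "future"
    if age > ignore_after:
        # Silent for longer than ignore_after: treated as retired, no alert.
        return None
    if age > late_secs:
        return "late"
    return None
```

For example, with the default lateSecs of 14400, a combination last seen 20000 seconds ago would return "late", while one last seen 3000000 seconds ago (beyond the 30-day ignore_after default) would return None.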