Introducing "Data Latency"
The purpose of working with a real-time architecture like Splunk is that you can rely on the software to send you alerts as events occur. When the time at which Splunk indexes an event differs from the timestamp in the event itself, it can be a major thorn in your side. This is an issue I'll refer to as "data latency". Receiving alerts hours or days later, or potentially not at all, could be the difference between a quickly mitigated issue and a majorly bad day.
In the world of statistics and big data, when we incorrectly identify something as absent, we call it a "false negative". It’s like saying “nope, that person didn’t try to break into our system” - when really they did and we just never got alerted about it. Or “we're confident that web server didn’t go down” - when it did and our logs just haven’t shown it yet. Again, when teams miss alerts, bad things can happen.
What I'm really here to tell you...
Not only am I going to share with you how to measure data latency in Splunk, but I'm also here to let you in on some important things that data latency can tell you about your data.
1.) Data latency measurements in Splunk can tell you if there is perceived latency or not
Although this may be the most basic fact about data latency and Splunk, it’s still important to point out. To start, let’s come up with a baseline for measuring the latency of our data. Take a look at the following search:
index=* | eval time=_time | eval indextime=_indextime | eval latency=(indextime-time) | stats count, avg(latency), min(latency), max(latency) by sourcetype
This search is going to look over all of your data and look at three important parts:
_time - the time that Splunk believes the event took place
_indextime - the time that the data was indexed into Splunk
latency - the difference between the time the event was indexed and the time Splunk believes it took place (in seconds). Ideally, you’ll want latency to be as close to 0 as possible.
2.) Latency can tell you if you have an obvious timestamp issue
Here are a couple of facts that are important to keep in mind:
- _indextime and _time are two different timestamps
- Theoretically, _indextime should always be later than _time, or as close to it as possible. An event has to happen on a remote system, be written to disk or sent to an indexer, and be parsed before it's indexed. If our systems are lightning fast, maybe the difference is 0, but probably not.
That being said, with our measurement search we would expect all of our measurements to be exactly 0 in a perfect world, and realistically to be positive, very low numbers.
If we run the search above and see that our average latency is -5000 seconds, for example, then we know something is seriously wrong with our data. A latency of -5000 seconds would mean that we were indexing data before it was even created on the originating host. In my experience, this usually means you’ll want to look at your timestamps. Timezones, especially in Splunk, are very important.
If your data is coming in as UTC (Coordinated Universal Time), you’ll want to make sure it is marked as such. If it is not marked as UTC, you can create a props.conf file to force Splunk to recognize it as UTC. The following two lines on your indexer or Heavy Forwarder can make a world of difference for data that is in UTC.
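The two lines themselves are not reproduced in this text. As a minimal sketch of what such a props.conf stanza typically looks like, assuming a hypothetical sourcetype named my_utc_sourcetype (substitute your own sourcetype, source, or host stanza):

```ini
# props.conf on the indexer or Heavy Forwarder
# "my_utc_sourcetype" is a placeholder name, not taken from the original post
[my_utc_sourcetype]
# Treat timestamps in this sourcetype as UTC
TZ = UTC
```

The change applies only to data indexed after the parsing instance picks up the new configuration; events that are already indexed keep the _time they were given.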
If you’ve cleared up your timezone issue and you're still seeing latency, also take a look at your internal logs. The following search will tell you if Splunk is having issues understanding the format of your timestamp:
index=_internal component=DateParserVerbose log_level=WARN
3.) Latency can tell you if you have an issue with your limits.conf file in Splunk
Keeping in mind the facts already highlighted, a very high positive latency can be symptomatic of other configuration-related issues. The following is a sample event that will appear in your Splunk _internal logs if the maxKBps setting in your limits.conf file is too low:
INFO ThruputProcessor - Current data throughput (262 kb/s) has reached maxKBps. As a result, data forwarding may be throttled. Consider increasing the value of maxKBps in limits.conf.
If you see this message, you’ll probably want to look at increasing the maxKBps setting in your limits.conf file.
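As a rough sketch, the setting lives in the [thruput] stanza of limits.conf on the forwarder that logs the message. The value below is purely illustrative and should be sized to what your network and indexers can absorb:

```ini
# limits.conf on the forwarder that is being throttled
[thruput]
# Illustrative value only. Universal forwarders default to 256;
# 0 removes the limit entirely.
maxKBps = 1024
```

Raising the limit moves the load downstream, so it's worth keeping an eye on indexer queues after making the change.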
4.) Latency can tell you if you have a performance problem
Another cause of data latency can be that your data is taking far too long to go through the data pipeline. The following article from Splunk will give you some great information about how you can view your indexing performance: "Managing Indexers and Clusters of Indexers: Use the DMC to view indexing performance".
For optimal performance of your data, you can set the following settings for your sourcetype in props.conf:
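The list of settings is not included in this text. The sketch below shows props.conf settings that are commonly tuned for this purpose; it is an assumption about what was intended, not the author's original list. The sourcetype name and the timestamp format (for example 2016-01-31 13:45:00 at the start of each event) are hypothetical:

```ini
# props.conf (hypothetical sourcetype and timestamp layout)
[my_sourcetype]
# The timestamp starts at the beginning of the event
TIME_PREFIX = ^
# Spell out the format so Splunk doesn't have to guess it
TIME_FORMAT = %Y-%m-%d %H:%M:%S
# Only scan as many characters as the timestamp occupies (19 here)
MAX_TIMESTAMP_LOOKAHEAD = 19
# One event per line: skip line merging and break on newlines
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
# Maximum line length in bytes before an event is truncated
TRUNCATE = 10000
```

The general idea is to tell Splunk explicitly where events break and where the timestamp sits, so it spends less time working those things out for every event.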
For a more in-depth description, the following article for Hunk talks about how much of a performance gain you can see from setting these: "Hunk User Manual: Performance best practices".
Ultimately, data latency is important in the world of Splunk
You’ll probably still find some value in Splunk without getting into this level of detail; however, these tips become increasingly important when you’re looking at apps like Splunk Enterprise Security and want to alert in real time. In addition, these tips will help you be more aware of your data and understand it in greater detail, which is arguably one of the more important steps when considering storing and searching data at this scale.