During a recent "Splunk Answers" discussion, I was talking with a user about Splunk’s Universal Forwarder and how it operates when your indexer is offline. This turned into a healthy enough discussion that I felt it was share-worthy, hence this blog post.
Question at hand:
What happens to inputs for file monitor, WinEventLog, perfmon, etc... when the indexer goes offline?
First of all, it's important to note that the Universal Forwarder, like Splunk Enterprise, has a data pipeline. I won’t go into too much detail about the specifics here, but if you’re interested in finding out more about data pipelines, you can check out this great .conf talk from 2014.
For the purposes of this discussion, there are three important things to know about the data pipeline and the Splunk Universal Forwarder:
- Data enters the pipeline
- Data exits the pipeline
- A maximum of 500KB of data (by default) is in-memory while in the pipeline
So, what exactly happens when the indexer is offline? The short answer is simple: data can’t leave the pipeline.
The longer answer is that everything upstream of the output is affected. The data in the pipeline starts to back up, and once that 500KB limit is reached, no more data can enter the pipeline.
Think of it this way...
Imagine a conveyor belt, or the line at your favorite amusement park. You entering the line is the data input, you getting on the roller coaster is the output, and you and your fellow park-goers are the data in-memory as you bake in the hot summer sun while waiting to get on the ride. Eventually, there’s a cap on how many people can enter the line. Everyone else in the park is just a park-goer (data) until they can squeeze into the (data pipe) line.
Interestingly enough, our discussion led to an even less explored question, but definitely one to pay attention to:
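To make the back-pressure idea concrete, here is a minimal Python sketch of a queue with a byte cap. It is a toy model and not how the Universal Forwarder is actually implemented: the `BoundedPipeline` class, its method names, and the 1KB event size are all made up for illustration, and I'm assuming 1KB = 1024 bytes.

```python
from collections import deque


class BoundedPipeline:
    """Toy in-memory pipeline with a byte cap (illustrative only, not Splunk code)."""

    def __init__(self, max_bytes=500 * 1024):
        self.max_bytes = max_bytes      # cap on in-memory data, 500KB as in the post
        self.queue = deque()            # events waiting for the output
        self.used = 0                   # bytes currently held in memory

    def offer(self, event: bytes) -> bool:
        """Try to admit an event; return False when the queue is full (blocked)."""
        if self.used + len(event) > self.max_bytes:
            return False                # input must wait; the event stays at the source
        self.queue.append(event)
        self.used += len(event)
        return True

    def drain(self, indexer_up: bool) -> int:
        """Send everything to the indexer if it is reachable; return events sent."""
        if not indexer_up:
            return 0                    # output blocked: nothing leaves the pipeline
        sent = len(self.queue)
        self.queue.clear()
        self.used = 0
        return sent


pipeline = BoundedPipeline()
event = b"x" * 1024                     # hypothetical 1KB event

# Indexer offline: inputs keep offering events until the cap is hit.
accepted = 0
while pipeline.offer(event):
    accepted += 1
    pipeline.drain(indexer_up=False)    # every attempt to send fails

print(accepted)                         # number of events admitted before blocking
print(pipeline.offer(event))            # False: the input side is now blocked

# Indexer back online: the queue drains and inputs can resume.
pipeline.drain(indexer_up=True)
print(pipeline.offer(event))            # True
```

The point of the sketch is that the rejected event is not thrown away. It simply never gets into the line, which is why the source of the data matters so much in the questions that follow.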
What happens if the Universal Forwarder fails during a time when queues are blocked?
Bad news first
If the Universal Forwarder is shut down unexpectedly, Splunk will lose the data that is in-memory. In fact, that has nothing to do with queues being blocked. If the Universal Forwarder is shut down unexpectedly and data hasn’t left the data pipeline yet, the data that is in-memory will be destroyed, regardless of whether the indexer is up or not. The problem is amplified when the topic of blocked queues comes up, simply because a blocked output means that the in-memory queue upstream of it will fill up to its max.
Again, using the amusement park scenario, if you’re in line and suddenly the roller coaster breaks and the amusement park workers say “come back this afternoon”, that line is going to disperse - the people in line might even leave the park because they’re so mad. In the Splunk case, that data is lost forever since it was in-memory.
The good news
500KB is the max and is generally a very small amount of data when Splunk handles gigabytes, terabytes, and even petabytes of data per day. If that ride shuts down and 30 people can’t make it on, an amusement park that takes in trillions of guests every day isn’t going to be extremely concerned that 30 people didn’t get to ride.
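For a rough sense of scale, the arithmetic can be sketched in a few lines of Python. The daily volumes below are hypothetical examples rather than measurements, and I'm assuming 1KB = 1024 bytes.

```python
MAX_IN_MEMORY_BYTES = 500 * 1024        # 500KB cap from the post (assumes 1KB = 1024 bytes)


def worst_case_loss_fraction(daily_bytes):
    """Fraction of one day's volume that a single full in-memory queue represents."""
    return MAX_IN_MEMORY_BYTES / daily_bytes


# Hypothetical daily indexing volumes, for illustration only.
for label, daily_bytes in [("1 GB/day", 1024**3), ("1 TB/day", 1024**4)]:
    print(f"{label}: {worst_case_loss_fraction(daily_bytes):.6%}")
```

Even at a modest 1 GB per day, a completely full queue works out to well under a tenth of a percent of the day's data.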
What about the data that hasn’t entered the pipeline yet?
When we’re talking about file monitoring, Splunk keeps track of its place in the file using a pointer. So, when the Universal Forwarder is started back up, Splunk will continue indexing right from where it left off and data that never entered the pipeline will index as normal (provided the queues aren’t blocked). And the people in the amusement park? They’ll get in line for the roller coaster as expected when the ride starts back up.
One thing you will notice, if the Universal Forwarder is offline for a while, is that there is a larger difference between your _indextime and _time fields. Typically, if we’re looking at real-time data these should be identical (or close to it). Any difference between them is how you can measure latency. When queues get blocked and data is delayed, then you’ll see a greater difference between these two fields.
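The "pointer" idea can be illustrated with a small Python sketch that remembers a byte offset and resumes reading from it. This is only a simplified stand-in and not Splunk's actual mechanism: the `read_new_lines` helper and the `app.log` file are hypothetical, and a real implementation would also need to persist the offset to disk and cope with things like log rotation and partially written lines.

```python
import os
import tempfile


def read_new_lines(path, offset):
    """Return (lines, new_offset) for data appended to `path` after `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)                  # jump to where the last read stopped
        data = f.read()
        new_offset = f.tell()           # remember the new position in the file
    return data.decode().splitlines(), new_offset


with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "app.log")  # hypothetical monitored file
    with open(log, "w") as f:
        f.write("event 1\nevent 2\n")

    # First pass: everything in the file is new.
    first, checkpoint = read_new_lines(log, 0)

    # The forwarder is down, but the application keeps writing.
    with open(log, "a") as f:
        f.write("event 3\n")

    # After a restart, reading resumes from the saved checkpoint.
    second, checkpoint = read_new_lines(log, checkpoint)

print(first)
print(second)
```

Because the file itself still holds the data, nothing written while the reader was down is lost. It just arrives late.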
You can have a look at your data latency using the following search:
index=_internal source=*splunkd.log host=<universal_forwarder_hostname> | eval latency=(_indextime - _time) | stats count, avg(latency), min(latency), max(latency) by host
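If it helps to see the arithmetic outside of a search, the same calculation looks like this in Python. The timestamps are invented sample values in epoch seconds and were not pulled from a real forwarder.

```python
from statistics import mean

# Hypothetical (event_time, index_time) pairs in epoch seconds,
# standing in for the _time and _indextime fields.
events = [
    (1_700_000_000, 1_700_000_001),
    (1_700_000_010, 1_700_000_012),
    (1_700_000_020, 1_700_000_320),     # arrived late, e.g. after a blocked queue cleared
]

# Latency is simply index time minus event time.
latencies = [index_time - event_time for event_time, index_time in events]

summary = {
    "count": len(latencies),
    "avg": mean(latencies),
    "min": min(latencies),
    "max": max(latencies),
}
print(summary)
```

A single delayed event is enough to pull the average and maximum well away from the minimum, which is the pattern to look for after an outage.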
What this all amounts to isn't the initial question, but the real question:
Should I be concerned about this?
Should I be worried about my Universal Forwarders when my indexer is offline? For the most part, the answer is “no”. I say this mostly because 500KB is generally not “big” data, especially since this is the maximum possible data that can be lost, and also because relative to the amount of data you’re probably indexing using Splunk, you won’t even notice it’s gone.
The final question, for those whose business can't afford to lose that 500KB of data, is: "can we combat this?" Can we do anything so that we don’t lose this 500KB of data, or maybe at least reduce the impact? There are two settings I would encourage you to explore, both of which Splunk documents well. Each comes with its own set of pros and cons (and additional overhead). For the sake of keeping this blog post at a reasonable length I won’t go into them, but I encourage you to read up more on Protecting Against Loss of In-Flight Data and Using Persistent Queues. These can help greatly reduce the impact of indexers going offline.
If you have more questions about Splunk's data pipeline I would encourage you to visit answers.splunk.com and ask them!