The scenario I’m about to discuss is one of the biggest pain points for clients lately: the frustration that can follow Splunk’s reference hardware recommendations. Let me share a real-life example to put this in perspective before we fully dive in.
For all the Mexican food lovers out there, I think we can agree that being recommended some awesome Mexican restaurant we’ve never been to can be exciting (it’s the simple things, right?). For me though, unfortunately, I’m not the biggest fan of sour cream. So, let’s say I go to that new place, sit down, open the menu, close my eyes, point to something on the menu, and they bring it out with a dollop of sour cream on top… I think you get the picture. I can’t get mad at my friend for recommending the restaurant; that wouldn’t make sense. Also, if I have certain food aversions, then it’s on me to read the menu (documentation), and if I don’t, well, then I probably deserve some extra sour cream on my plate.
Similarly, I’m not going to get mad at Splunk when they recommend reference hardware and I’m the one who flips on 100 real-time searches, or blindly installs 30 apps on a single search head, and (consequently) it doesn’t perform the way I want it to. I really can’t, especially when the documentation very clearly indicates that these can be bad moves.
The idea that "your mileage may vary" with Splunk hardware recommendations can, and has, caused quite a few frustrations. Not many people understand what this means, and though it’s documented well for some apps (like Splunk Enterprise Security), it can be difficult to quantify for others, especially given the number of app developers and the flexibility of the software.
So, how do you work through performance problems when you suspect a bottleneck on your search head, even though you followed the reference hardware guidance and the documentation for the apps you installed?
Here are five tips I’ve found useful for quantifying some common search head performance issues.
(Note: These points assume you haven’t buried yourself so deep into a problem that you can’t even search in Splunk. If you can’t get off the ground, this article may not be too helpful.)
1.) Start by Searching Splunk Internal Logs
Assuming your Splunk system is performing well enough for you to search, start by looking at Splunk’s internal logs and see if you have any skipped jobs. This is usually a telltale sign something is wrong.
index=_internal status=skipped host=<your_splunk_server> source="/opt/splunk/var/log/splunk/scheduler.log" earliest=-24h@h
2.) Search for Installed Apps
If you’re finding skipped jobs, there could be a couple of things wrong, but we’ll start by looking at the number of apps you have installed.
| rest /services/apps/local splunk_server=<your_search_head> | search disabled=0 | table label
You can derive a lot of insight from this search. For example, if this is a Splunk Enterprise Security search head or a Splunk App for PCI Compliance search head, you’re going to want those apps installed on their own, as documented here. If you find other apps that aren’t ES or PCI related installed side-by-side, you’ll want to move them to an ad-hoc search head.
3.) Search for Apps with Most Scheduled Jobs
Now, there is no hard-and-fast rule for how many apps can be installed on an ad-hoc search head at one time, as much as I know a lot of you would like one. But you can get an idea of which apps are giving you the most headaches by using the following search:
|rest /services/saved/searches splunk_server=<your_search_head> | search is_scheduled=1 | stats count by eai:acl.app splunk_server | sort - count
If you have several apps with a lot of scheduled jobs, this may be an issue you’ll want to look at. Splunk has limits on the number of jobs it can run at any one time, which you can read more about here. Two of the settings I’d point you to are the following:
base_max_searches = <int>
max_searches_per_cpu = <int>
Warning: increasing these settings directly affects the overall performance of the system, since it raises the maximum number of searches Splunk can run concurrently. If you’d like an in-depth look at the searches that are running, I’d recommend two final searches.
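To see how these two settings interact, here’s a minimal Python sketch of the formula Splunk’s limits.conf documentation gives for the maximum number of concurrent historical searches (the defaults of 6 and 1 below are the documented defaults; verify them against your Splunk version):

```python
def max_concurrent_searches(num_cpus, base_max_searches=6, max_searches_per_cpu=1):
    """Maximum concurrent historical searches, per the limits.conf formula:
    max_searches_per_cpu * number_of_cpus + base_max_searches."""
    return max_searches_per_cpu * num_cpus + base_max_searches

# Example: a 16-core search head with default settings
print(max_concurrent_searches(16))  # 22
```

This is why raising either setting has a system-wide effect: every extra allowed search is another process competing for the same CPU cores.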
4.) Search for Real-Time Scheduled Searches on your Search Head
|rest /services/saved/searches splunk_server=<your_search_head> | search is_scheduled=1 dispatch.earliest_time=*RT* | rename title AS savedsearch_name | rename eai:acl.app as title | table title savedsearch_name cron_schedule dispatch.earliest_time dispatch.latest_time search
5.) Search for All Scheduled Jobs and Compare Run-times to Cron Schedules
The following search requires a bit of critical thinking, but it will give you some interesting information you can use to start tuning your scheduled (non-real-time) searches:
index=_internal host=<your_search_head> source="/opt/splunk/var/log/splunk/scheduler.log" _raw!="*queued*" | stats avg(run_time) max(run_time) min(run_time) by savedsearch_name user | rename avg(run_time) as average_run_time | rename max(run_time) as max_run_time | rename min(run_time) as min_run_time | eval average_run_time = average_run_time/60 | eval min_run_time = min_run_time/60 | eval max_run_time = max_run_time/60 | sort - average_run_time | join savedsearch_name [| rest /services/saved/searches splunk_server=<your_search_head> | search is_scheduled=1 dispatch.earliest_time!=*RT* | rename title AS savedsearch_name | rename eai:acl.app as title | table title savedsearch_name cron_schedule search]
What this search will give you:
- savedsearch_name - Name of the Search
- user - User that runs the search
- average_run_time - Average run time of the search over the specified time
- max_run_time - Maximum run time of the search over the specified time
- min_run_time - Minimum run time of the search over the specified time
- cron_schedule - How often the search runs
- search - Search String
- title - Name of the app
Some insights you can derive from this:
- Is a search scheduled to run more often than it takes to complete? For example, is the cron schedule set to every 5 minutes while the average run time of the search is 30 minutes?
- What are your longest-running searches? Can you improve them using this helpful article?
- Do you have any users who are scheduling long running jobs on an ES Search Head that could be moved somewhere else?
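The first insight above can be checked mechanically. Here’s a minimal sketch (plain Python, not Splunk code) that flags searches whose average run time exceeds their scheduling interval; the search names and numbers are hypothetical, and the interval_minutes values are something you’d derive from each search’s cron_schedule:

```python
# Each tuple: (savedsearch_name, interval_minutes, average_run_time_minutes).
# Hypothetical values standing in for the output of the scheduler.log search.
searches = [
    ("threat_intel_correlation", 5, 30.0),
    ("daily_license_report", 1440, 12.5),
]

def overscheduled(searches):
    """Return the searches that cannot finish before their next scheduled run."""
    return [name for name, interval, avg_run in searches if avg_run > interval]

print(overscheduled(searches))  # ['threat_intel_correlation']
```

Searches flagged this way pile up in the scheduler and are a common source of the skipped jobs from tip 1, so they’re a good place to start tuning.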
Hope This Helps!
Hopefully this helps some of you with any search head performance nightmares you might be experiencing. If you have any questions, comments, or other searches that have helped you in the past, feel free to let us know. Also, feel free to e-mail me any good Mexican restaurant recommendations.