Big Data isn’t a set-it-and-forget-it endeavor.
There are a number of topics surrounding Big Data that need to be revisited as an organization progresses. For example, you should re-evaluate predictive models for accuracy on a regular schedule, and you should spot-check your data to see whether its quality is holding up over time. You also need to make sure your systems are performing well. I’m not only talking about questions like “Do I have enough CPUs?” and “Is my storage fast enough?” The questions I find more commonly overlooked are “Is my software running optimally?” and “Am I using the resources I have as well as I possibly can?” These are the questions I want to give Splunk users more insight into, starting with the Splunk Scheduler and a specific problem I see very frequently that I call “Skipped Searches”.
In short, the Splunk Scheduler is the backbone of Splunk, as well as of a number of apps and add-ons. It runs searches automatically, without someone needing to have a web browser open typing out Splunk Search Processing Language (SPL). This is how automated searches are run and alerts are sent. For example, if I want to know when my LIFX light bulbs are powered on, I can schedule a search for that and have Splunk email me when it happens.
One of the issues that occurs as environments grow and more users start using Splunk is that the Splunk Scheduler becomes overburdened or bloated. Fortunately, Splunk has self-imposed limits so that users don’t take down a Splunk environment by running an extreme number of poorly performing searches. Unfortunately, these limits can cause delays in things that matter, like Notable Events in Splunk Enterprise Security. They can also cause some apps that use too many real-time searches (I’m looking at you, Palo Alto Networks) to be “cut off at the knees” and never work at all until those limits are increased.
This guide should serve as a jumping-off point for solving the issue of “Skipped Searches”. It’s unfortunately a complex problem in some environments, but hopefully this blog will lend some guidance on how to make your Splunk Scheduler run optimally.
1.) Detection of Skipped Searches.
First and foremost, you’re going to want to detect whether you have a problem with “Skipped Searches”. The following simple search returns the scheduler’s events from the last 24 hours for searches that were skipped. If it returns results, you have skipped searches:
index=_internal earliest=-24h status=skipped sourcetype=scheduler
2.) Identification of what apps the Skipped Searches are coming from.
This search will give you a count of skipped searches by host and the app that the searches are a part of. It will also sort by the host/app with the largest number of skipped searches.
index=_internal earliest=-24h status=skipped sourcetype=scheduler | stats count by host app | sort - count
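If the stats and sort steps are unfamiliar, the following is a rough Python equivalent of what `stats count by host app | sort - count` computes. It is only a sketch to illustrate the grouping; the sample events and the host and app names in it are invented for illustration and are not output from a real Splunk instance.

```python
from collections import Counter


def count_skipped_by_host_app(events):
    """Mimic `status=skipped | stats count by host app | sort - count`."""
    counts = Counter(
        (event["host"], event["app"])
        for event in events
        if event.get("status") == "skipped"
    )
    # most_common() orders the (host, app) pairs from highest to lowest count
    return [(host, app, n) for (host, app), n in counts.most_common()]


# Hypothetical scheduler events, for illustration only
sample_events = [
    {"host": "sh01", "app": "app_a", "status": "skipped"},
    {"host": "sh01", "app": "app_a", "status": "skipped"},
    {"host": "sh01", "app": "app_a", "status": "skipped"},
    {"host": "sh01", "app": "app_a", "status": "success"},
    {"host": "idx01", "app": "app_a", "status": "skipped"},
    {"host": "idx01", "app": "app_a", "status": "skipped"},
    {"host": "sh01", "app": "search", "status": "skipped"},
]

for host, app, count in count_skipped_by_host_app(sample_events):
    print(host, app, count)
```

The host/app pair at the top of the list is the one to look at first in the next step.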
3.) Determining what to do with apps.
“If you don’t need it, get rid of it”
Start by asking yourself this question first: “Do I need this app?” This is a minimalistic approach, but it is also one of the most important questions to answer. Plenty of times I’ve encountered environments with hundreds of thousands of skipped searches per day for apps that the customer wasn’t even using and that weren’t bringing them any value. The easiest way to deal with a search that is skipping, and to help other searches run more efficiently, is to disable it if you don’t need it. I don’t know how many times in my Splunk career I’ve uttered the phrase: “If you don’t need it, get rid of it”.
“Is this app installed in the correct location?”
This is another important question to think about. I always trust the Splunk documentation, and for a lot of apps there is great documentation that tells you exactly where an app needs to be installed. Often there is even a matrix that outlines it clearly. I’ve encountered a number of environments where that documentation wasn’t followed and an app wasn’t installed correctly. This is another easy way to correct skipped searches: install the app correctly.
For example, in a distributed environment with a Search Head and an Indexer, a lot of apps don’t need to be installed on the Indexer. The most common reason is that the scheduled searches need to run on the Search Head, where they are visible, and shouldn’t be duplicated unnecessarily on less visible components of the environment such as the Indexer.
“Can this app be optimized for where it is installed?”
In cases where app documentation inadvertently does state that the app needs to be on an indexer (again, I’m looking at you Palo Alto Networks) you can feel free to disable scheduled searches on the Indexer. In 90% of cases, your Search Heads are going to be connected to all of your Indexers and have access to all of the same data so the searches will only need to run there. There will be no reason to have Scheduled Searches performing the exact same tasks on Indexers and Search Heads.
4.) Identification of bad searches.
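The following search will help you identify problem searches on a given server. For each scheduled search it lists the average, maximum and minimum run time over the last 30 days, the cron schedule, whether the search is real-time, and the app it belongs to. Make sure to replace <server_name> with the server you care about before running it. It appears twice in the search.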
index=_internal sourcetype=scheduler status!=queued earliest=-30d@d host=<server_name> | eval is_realtime=if(searchmatch("sid=rt* OR concurrency_category=real-time_scheduled"),"yes","no") | fields run_time is_realtime savedsearch_name user | stats avg(run_time) as average_run_time max(run_time) as max_run_time min(run_time) as min_run_time max(is_realtime) as is_realtime by savedsearch_name user | eval average_run_time = average_run_time/60| eval min_run_time = min_run_time/60 | eval max_run_time = max_run_time/60 | sort - average_run_time | join savedsearch_name [|rest /servicesNS/-/-/saved/searches splunk_server=<server_name> | search is_scheduled=1 | rename title AS savedsearch_name | rename eai:acl.app as title| table splunk_server title savedsearch_name cron_schedule search ] | rename title as "App Name" | fields splunk_server savedsearch_name user average_run_time max_run_time min_run_time cron_schedule is_realtime search "App Name" | sort splunk_server, -average_run_time
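The is_realtime part of that search can be hard to read in one line. As a rough sketch of the same rule in Python: an event is treated as real-time if its search ID starts with "rt" or its concurrency_category is real-time_scheduled. The function name and the sample events below are hypothetical and only illustrate the matching logic.

```python
def is_realtime(event):
    """Approximate searchmatch("sid=rt* OR concurrency_category=real-time_scheduled")."""
    sid = event.get("sid", "")
    category = event.get("concurrency_category", "")
    return sid.startswith("rt") or category == "real-time_scheduled"


# Hypothetical scheduler events, for illustration only
realtime_event = {"sid": "rt_scheduler__admin__search__example_1"}
historical_event = {
    "sid": "scheduler__admin__search__example_2",
    "concurrency_category": "historical_scheduled",
}

print(is_realtime(realtime_event))
print(is_realtime(historical_event))
```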
5.) Determining what to do with searches.
“Is this a scheduled, real-time search?”
If so, get rid of it. This is the number one worst search you can run. Even scheduling the search to run every 1 or 5 minutes is better than scheduling it in real-time.
“If you don’t need it, get rid of it”
Again, as with apps, if you don’t need a search, get rid of the search. Time and time again I have seen people set up “test” searches and then forget about them. Asking a user to clean up a search is perfectly acceptable and will help everyone in your environment.
“Can I optimize this search?”
The output of Step 4 will give you two interesting fields:
average_run_time - how long the search took to run on average (in minutes, because the search in Step 4 divides run_time by 60)
cron_schedule - how often the search is scheduled to run
If you notice that the average run time is longer than the interval between scheduled runs in cron_schedule, you’ll want to either improve the search’s performance or schedule it to run less frequently, at an interval that is still acceptable. If you want help with writing better searches, check out this Splunk doc on how to Write Better Searches.
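As a minimal sketch of that comparison, the Python below works out the interval for a few simple cron patterns and checks it against the average run time. Both function names are made up for this example. It assumes the run time is in minutes (the Step 4 search divides run_time by 60) and it deliberately ignores schedules that restrict the hour, day or month.

```python
import re


def cron_interval_minutes(cron_schedule):
    """Return minutes between runs for a few simple cron patterns, else None."""
    fields = cron_schedule.split()
    if len(fields) != 5 or fields[1:] != ["*", "*", "*", "*"]:
        return None  # hour/day/month restrictions are out of scope for this sketch
    minute = fields[0]
    if minute == "*":
        return 1  # runs every minute
    step = re.fullmatch(r"\*/(\d+)", minute)
    if step and int(step.group(1)) > 0:
        # "*/5" runs every 5 minutes (approximate when the step doesn't divide 60)
        return int(step.group(1))
    if minute.isdigit():
        return 60  # e.g. "15 * * * *" runs once an hour
    return None


def overruns_schedule(average_run_time_minutes, cron_schedule):
    """True if the search usually takes longer than the gap between its runs."""
    interval = cron_interval_minutes(cron_schedule)
    if interval is None:
        return None  # schedule too complex for this sketch; check it by hand
    return average_run_time_minutes > interval


print(overruns_schedule(7.5, "*/5 * * * *"))
print(overruns_schedule(2.0, "*/5 * * * *"))
```

Any search for which this returns True is a candidate for optimizing or for a less frequent schedule.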
6.) A Breakdown of the Scheduler Limits.
Note: If you’ve followed all of the steps above, then your last option is to increase limits on a system. Please treat this as a last resort, though it does commonly need to happen, especially for Splunk Enterprise Security.
Splunk has a few different limits to consider. It’s important to understand them before making any changes, so I’ll show you how they are calculated on a 16-core system. For more specific information, please visit the page for limits.conf.
Out of the box on a 16-core system, Splunk can run 22 historical searches at any one time. That is calculated using the following formula:
max_hist_searches = max_searches_per_cpu (default of 1) x number_of_cpus (16) + base_max_searches (default of 6) = 22
Of those 22 searches, the scheduler is allocated 50 percent of that number by default (so 11 searches) according to the setting max_searches_perc.
Of those 11 searches that the scheduler can run, auto summarization (things like Data Models) is allowed 50% by default, according to the setting auto_summary_perc. Half of 11 is 5.5, so we end up with about 6 searches that can run at a time for your Data Models.
That being said, if your Data Models are taking an extremely long time to run, or are consistently skipping and not completing, increasing base_max_searches from 6 to 12 will only get you to 7 auto summary searches at one time (16 + 12 = 28 in total, 14 for the scheduler, 7 for auto summarization). You may want to raise auto_summary_perc instead, so that more of the scheduler’s searches are dedicated to your Data Models. If you have the CPU resources, you may also want to raise max_searches_per_cpu to something like 2. With base_max_searches left at 6, that gives an overall maximum of 38 historical searches (2 x 16 + 6), 19 of them for the scheduler, and about 10 Data Model summarizations at any one time.
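To try other combinations before touching limits.conf, the arithmetic can be sketched in a few lines of Python. The function name is made up for this example, and the rounding is an assumption: it rounds up so that the results match the figures quoted in this post (5.5 becomes 6, 9.5 becomes 10). Splunk’s own rounding may differ, so treat the output as an estimate.

```python
def scheduler_limits(number_of_cpus, max_searches_per_cpu=1, base_max_searches=6,
                     max_searches_perc=50, auto_summary_perc=50):
    """Return (max_hist_searches, scheduler_searches, auto_summary_searches)."""

    def percent_of(value, perc):
        # Integer ceiling of value * perc / 100 (assumed rounding, see above)
        return -((-value * perc) // 100)

    max_hist_searches = max_searches_per_cpu * number_of_cpus + base_max_searches
    scheduler_searches = percent_of(max_hist_searches, max_searches_perc)
    auto_summary_searches = percent_of(scheduler_searches, auto_summary_perc)
    return max_hist_searches, scheduler_searches, auto_summary_searches


print(scheduler_limits(16))                          # defaults on 16 cores
print(scheduler_limits(16, base_max_searches=12))    # base_max_searches raised to 12
print(scheduler_limits(16, max_searches_per_cpu=2))  # max_searches_per_cpu raised to 2
```

The three calls correspond to the three scenarios described above: the defaults, base_max_searches raised to 12, and max_searches_per_cpu raised to 2.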
The Splunk Scheduler is, in fact, a pretty exact science when you break it down. Understanding it, however, can be quite complex. Hopefully this article gives you some tips and tricks for making sense of it all. If you have any additional methods for wrangling the Splunk Scheduler, or any questions or comments, please feel free to reach out. I’d love to hear from other people who have tackled this issue.