Splunk Search Optimization: A Paleo Diet for SPL

Sometimes optimizing a Splunk search means going back to your primal roots. But what do these primal instincts look like? This blog post boils them down into a few simple rules to improve your search performance.

Venturing back to our primal roots...

There are plenty of tutorials out there that explain how to optimize your Splunk search, and for the most part they do a really good job. However, as with any situation, there are edge cases: cases where you need to search through 13 months of WinEventLog data, totalling over 14TB; cases where even the most seasoned SPL (Search Processing Language) authors run screaming. It's times like this when it becomes important to hearken back to your primal roots and forgo modern conveniences like field extractions and tags, in an attempt to get back to the basics of searching. But what do these primal instincts look like? I’ve boiled them down into a few simple rules for turning an “All Time” search over index=wineventlog from a “Nightmare on SPL Street” into “Done in 5600 seconds” (are these movie puns doing anything for you? We should hang out more).

As I go through these rules, think about how you would apply them to this search in order to improve its performance. As it stands, this search will basically run forever when subjected to “All Time” on 13 months / 14TB of data:

sourcetype=WinEventLog:Security eventtype=windows_logon_success user="alice" OR user="bob" host="*"
| dedup _raw
| lookup logon_types.csv Logon_Type
| lookup user_info_all.csv user
| eval my_action=case(isnotnull(logon_description),logon_description,EventCode==4647,"User Logged Off",EventCode==4800,"User Locked PC")
| search my_action="Logon - Interactive" OR my_action="Logon - Unlock" OR my_action="Logon - Remote Interactive" OR my_action="User Logged Off" OR my_action="User Locked PC"
| table _time, host, user, my_action, employee, email, division, bunit, department
| rename user as Username, host as Host, my_action as Action, employee as Employee, email as "Email Address", bunit as "Business Unit", division as "Division", department as Department

Rule #1: Limit the data Splunk touches:

Adding "index=" to your search is the single best thing you can do to improve your search. I was able to convert a never-ending search to one that completed in less than 60 seconds by simply adding an index restriction to it. Other useful fields are "sourcetype", "source" and "host". These are "indexed" fields, meaning that they are actually stored to disk when the data is received, rather than being calculated at search time. Sometimes other fields are indexed as well, depending on the source (fields from structured logs such as CSV or JSON are often configured as indexed fields), but most fields are "search time" fields. Think of indexed fields like the columns in an “index” on a database table: using them is a really fast way to access your data.
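
To get a feel for how cheap indexed fields are, here is a minimal sketch using tstats, which works from indexed fields rather than reading raw events. It assumes the same index=wineventlog used throughout this post; swap in your own index name.

```spl
| tstats count where index=wineventlog by sourcetype host
```

If this returns far sooner than a regular search over the same time range, that gap is roughly the difference between consulting the index and reading the events themselves.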

Rule #2: Avoid wildcards like the plague:

Splunk lets you use wildcards, but it doesn’t use them very efficiently. “sourcetype=rsa*” doesn’t mean “look at my list of sourcetypes, get the ones that start with rsa, and search for those”. Unfortunately, it means “look at all of my events and discard the ones whose sourcetype doesn’t start with rsa”. This is disk I/O that could be better spent on just about anything else.
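
One way to drop a wildcard is to look up the real names once and then spell them out. The sketch below uses the metadata command to list sourcetypes; the index name auth is purely a placeholder for wherever your RSA data lives. The wildcard here only filters a handful of summary rows, not events.

```spl
| metadata type=sourcetypes index=auth
| search sourcetype="rsa*"
| table sourcetype totalCount lastTime
```

From there, the wildcard in the real search can be replaced with an explicit list along the lines of (sourcetype=rsa_one OR sourcetype=rsa_two), where both names are made up for illustration.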

Rule #3: Beware your SPL ordering:

Splunk commands come in lots of shapes and sizes. Two of those shapes are “centralized streaming” and "transforming". These commands are basically the arch-nemeses of MapReduce. Splunk achieves a good deal of its performance by distributing part of the search out to the indexers (the map step), and then aggregating the results on the search head (the reduce step). Centralized streaming and transforming commands run in the reduce part of the process, on the search head, and force any SPL that comes after them in the pipeline to do the same. The more work you can push out to the indexers, the better the search performance, because it's distributed amongst more systems. Some examples of these commands include dedup, table, and stats. This may seem counter-intuitive, as “dedup” and “stats” sound like excellent ways to limit the amount of data the rest of your search needs to process. In practice, however, due to their implementation, these commands end up being a bottleneck in the search.
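
As a rough illustration of the ordering idea, the sketch below keeps the per-event work (eval and where) ahead of the transforming command, so stats is the last thing in the pipeline. The field name logon_kind is invented for this example; the Logon_Type values and labels are the ones used elsewhere in this post.

```spl
index=wineventlog sourcetype=WinEventLog:Security EventCode=4624
| eval logon_kind=case(Logon_Type==2, "Logon - Interactive", Logon_Type==7, "Logon - Unlock", Logon_Type==10, "Logon - Remote Interactive")
| where isnotnull(logon_kind)
| stats count by host logon_kind
```

Had stats (or dedup, or table) come first, everything after it would have been pinned to the search head.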

Rule #4: Knowledge objects are the devil, part one:

I like to refer to this as Splunk's "big ugly elephant in the room". Just like wildcards, you should avoid both tags and eventtypes like the plague. Under the hood, when you search for "tag=authentication", what Splunk actually does is grab every search string that has that tag and join them together with "OR", until you end up with a MASSIVE list of search parameters. This is especially bad when you haven't followed rule #1 and narrowed your search to a single index (let’s face it, sometimes that’s just not an option). Now Splunk is forced to look at even MORE events that you don't care about, to see if the field extractions (which it’s applying just-in-time as you search) match your filter. Using tags and/or eventtypes in a search, especially a long-running one, is easily the biggest thing that could negatively impact search performance, and it’s also the most inconvenient thing to have to live without.
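
If you want to see what an eventtype expands to before deciding whether to inline it, something like the following should work, assuming your role is allowed to use the rest command. It reads the eventtype definitions from the local search head and shows the search string behind one of them.

```spl
| rest /servicesNS/-/-/saved/eventtypes splunk_server=local
| search title="windows_logon_success"
| table title search eai:acl.app
```

The search column is the string Splunk substitutes wherever the eventtype appears.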

Rule #5: Knowledge objects are the devil, part two:

Just like with part one, this is another thing you just don’t talk about at parties. The more data you're looking at, the less you should be relying on field extractions. In order to filter data on these fields, Splunk has to actually read EVERY event that matches the indexed fields, apply the regular expressions or eval statements or lookups, and then determine if it matches your filter. This is where your primal instincts come into play. Splunk is actually REALLY good at doing simple text searching. So, if you provide a bunch of text that appears in the events you’re interested in, you’re going to limit the events in a much more efficient manner. Sure, Splunk still has some disk I/O to do, but it’s not also applying field extractions to the data to tell if it matches. Much, much faster.
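
Here is a small sketch of the idea. The quoted text and bare usernames do the heavy filtering as plain text; the trailing search on the user field is an optional extra of mine, there only to weed out events where "alice" or "bob" shows up somewhere other than the user field. It assumes "EventCode=4624" appears literally in the raw events, as it does in the example data for this post.

```spl
index=wineventlog sourcetype=WinEventLog:Security "EventCode=4624" (alice OR bob)
| search user="alice" OR user="bob"
```

Whether the second line costs more than it saves depends on your data; leave it off if the text terms alone are precise enough.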

Rule #6: Always look at the job inspector:

This is less of a tip for how to optimize searches, and more of a tip for how to get better at it. Next time you run a search, look under the search box -- you’ll see a link that says “Job”; when you click that, you’ll get an option that says “Inspect Job”. The job inspector provides a wealth of information about your search, including where it spent the most time and how it parsed your search. Some things of particular interest:

  • “normalizedSearch” - This is the result of Splunk compiling your search into the internal syntax it uses to filter data
  • “remoteSearch” - This is the part of the search that will run on the indexers
  • “reportSearch” - This is the part of the search that will run on the search head

Remember, you want to force as much into “remoteSearch” as you can. Try searching for a common tag, like authentication or network, on an Enterprise Security search head, and see what pops up in normalizedSearch.
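
If you would rather compare several recent jobs side by side than click through the inspector one at a time, a sketch like this may help. It assumes your role can query the search jobs REST endpoint and that the jobs have not yet expired; the field names are the job properties as I understand the endpoint to return them, so check them against your own output.

```spl
| rest /services/search/jobs splunk_server=local
| table sid runDuration scanCount eventCount normalizedSearch remoteSearch reportSearch
| sort - runDuration
```

A large gap between scanCount and eventCount is usually a hint that the search is reading many events only to throw them away.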

How do you think you can improve this search?

Let’s go step-by-step through each of the rules to see how we can get that search to return data. We were able to shrink the runtime of this search down to 5,518 seconds.

1.) First, let’s restrict the search to only the index we care about. Let’s add “index=wineventlog” to the search. Now we have:

sourcetype=WinEventLog:Security index=wineventlog eventtype=windows_logon_success user="alice" OR user="bob" host="*"
| dedup _raw
| lookup logon_types.csv Logon_Type
| lookup user_info_all.csv user
| eval my_action=case(isnotnull(logon_description),logon_description,EventCode==4647,"User Logged Off",EventCode==4800,"User Locked PC")
| search my_action="Logon - Interactive" OR my_action="Logon - Unlock" OR my_action="Logon - Remote Interactive" OR my_action="User Logged Off" OR my_action="User Locked PC"
| table _time, host, user, my_action, employee, email, division, bunit, department
| rename user as Username, host as Host, my_action as Action, employee as Employee, email as "Email Address", bunit as "Business Unit", division as "Division", department as Department

2.) Now, let’s remove wildcards. This is easy, since the only wildcard we have is in the “host” field. See below:

sourcetype=WinEventLog:Security index=wineventlog eventtype=windows_logon_success user="alice" OR user="bob"
| dedup _raw
| lookup logon_types.csv Logon_Type
| lookup user_info_all.csv user
| eval my_action=case(isnotnull(logon_description),logon_description,EventCode==4647,"User Logged Off",EventCode==4800,"User Locked PC")
| search my_action="Logon - Interactive" OR my_action="Logon - Unlock" OR my_action="Logon - Remote Interactive" OR my_action="User Logged Off" OR my_action="User Locked PC"
| table _time, host, user, my_action, employee, email, division, bunit, department
| rename user as Username, host as Host, my_action as Action, employee as Employee, email as "Email Address", bunit as "Business Unit", division as "Division", department as Department

3.) Reordering the SPL is going to totally change this search around. This one is tricky: we remove the dedup, the “logon_types.csv” lookup, the eval for my_action, and the second search command, and replace them with restrictions in the first search:

sourcetype=WinEventLog:Security index=wineventlog eventtype=windows_logon_success EventCode=4647 OR EventCode=4800 OR Logon_Type=2 OR Logon_Type=7 OR Logon_Type=10 user="alice" OR user="bob"
| lookup user_info_all.csv user
| table _time, host, user, my_action, employee, email, division, bunit, department
| rename user as Username, host as Host, my_action as Action, employee as Employee, email as "Email Address", bunit as "Business Unit", division as "Division", department as Department

4.) Now, let’s get rid of tags and eventtypes as much as possible. The “windows_logon_success” eventtype comes from the Splunk Add-on for Windows, and expands to “sourcetype=*:Security (signature_id=4624 OR signature_id=528 OR signature_id=540)”. We’re going to clean up the search too when we inline it: the two EventCodes we listed (4647 and 4800) can never match the eventtype’s signature IDs, so we can remove them. There’s also a duplicate sourcetype match we’ll remove, as shown below:

sourcetype=WinEventLog:Security index=wineventlog (signature_id=4624 OR signature_id=528 OR signature_id=540) Logon_Type=2 OR Logon_Type=7 OR Logon_Type=10 user="alice" OR user="bob"
| lookup user_info_all.csv user
| table _time, host, user, my_action, employee, email, division, bunit, department
| rename user as Username, host as Host, my_action as Action, employee as Employee, email as "Email Address", bunit as "Business Unit", division as "Division", department as Department

5.) Lastly, let’s replace field extractions and see the final result. Each match on a search time field extraction was replaced with whatever that value would appear as, literally, in the event:

sourcetype=WinEventLog:Security index=wineventlog ("EventCode=4624" OR "EventCode=528" OR "EventCode=540") ("Logon Type: 2" OR "Logon Type: 7" OR "Logon Type: 10") (alice OR bob)
| lookup user_info_all.csv user
| table _time, host, user, my_action, employee, email, division, bunit, department
| rename user as Username, host as Host, my_action as Action, employee as Employee, email as "Email Address", bunit as "Business Unit", division as "Division", department as Department
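
One loose end worth noting: my_action is still listed in the table and rename commands, but since the eval was removed in step 3 nothing populates it, so the Action column will come back empty. If you need it, a possible variant is sketched below. It re-adds the logon_types.csv lookup after the base search and copies logon_description (the field the original eval read from that lookup) into my_action. This is my own suggestion, not part of the timed search above, and its effect on runtime is untested.

```spl
index=wineventlog sourcetype=WinEventLog:Security ("EventCode=4624" OR "EventCode=528" OR "EventCode=540") ("Logon Type: 2" OR "Logon Type: 7" OR "Logon Type: 10") (alice OR bob)
| lookup logon_types.csv Logon_Type
| eval my_action=logon_description
| lookup user_info_all.csv user
| table _time, host, user, my_action, employee, email, division, bunit, department
| rename user as Username, host as Host, my_action as Action, employee as Employee, email as "Email Address", bunit as "Business Unit", division as "Division", department as Department
```

The lookup and eval sit before table on purpose, in keeping with rule #3.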

So... how’d you do?

Don’t be disappointed if you weren’t able to simplify that search down to the basics on the first try. Even our expert search developers were surprised by the results. With practice and discipline, though, you’ll be eating Splunk searches just like your caveman ancestors.


