Identifying Splunk Search Duplicates with jellyfish and Jaro-Winkler

Managing Splunk across several search heads can be a challenge. Part of that challenge is an issue that easily goes unseen: search heads performing near-duplicate work, which can hurt performance. To help identify these situations, Tim created a script.



Sniffing out a way to improve ES performance

In large Splunk environments with many users, it becomes more likely that your search infrastructure is working double time running duplicate searches. This can degrade performance, especially on Enterprise Security (ES) search heads, which carry the added overhead of the extra tools included with ES.

To counteract this problem, we developed a script to help sniff out searches which may be duplicated across search heads.

A word about jellyfish and Jaro-Winkler

To compare two searches, we need a metric that is easy to interpret. Enter: Jaro-Winkler.

Jaro-Winkler is an algorithm that takes two strings and returns a value between 0 and 1 representing how similar they are, with 1 being an exact match. It is a modification of the Jaro algorithm that boosts the score of strings sharing a common prefix, giving greater weight to the beginning of the string. This suits Splunk searches well: a search is a pipeline whose commands execute in order, so searches that begin the same way start out doing the same work.
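To make the scoring concrete, here is a minimal pure-Python 3 sketch of the textbook Jaro and Jaro-Winkler definitions. It is for illustration only and is not the code the script runs; the function names, the prefix weight of 0.1, and the four-character prefix cap are the conventional defaults assumed here, and library implementations may differ in details such as when the prefix boost is applied.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they sit within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk the matched characters in order; each out-of-order pair is half a transposition.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the shared prefix (capped at max_prefix)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)
```

Calling `jaro_winkler("index=main error | stats count by host", "index=main error | stats count by source")` scores two searches that differ only in their final field; because the shared prefix is long, the Winkler boost pushes the score closer to 1 than plain Jaro would.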

Instead of implementing this algorithm ourselves, we can use the Python library jellyfish, which implements several useful string comparison algorithms, including Jaro-Winkler.

Note: Please ensure you are downloading the correct library. Recently, a malicious library named “jeIlyfish” (a capital “i” replaces the first “l”) was found to be stealing the SSH and GPG keys of its users. This shouldn’t be a problem if you use pip install (as a typo is unlikely), but if you regularly set up new Python libraries by hand, take special care.

The Script

Python 2

import jellyfish
import splunklib.results as results
import splunklib.client as client
import re
import pprint


class SplunkHost:
    def __init__(self, host, host_port, username, password):
        self.host = host
        self.host_port = host_port
        self.username = username
        self.password = password


def splunk_host_input(prompt, host_type=None, port_type=None, min_port=None, max_port=None):

    while True:
        host_info = raw_input(prompt)
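        # Split on runs of semicolons, commas, or whitespace so "host, 8089" yields no empty fields.
        host_info = re.split(r'[;,\s]+', host_info.strip())
        if len(host_info) != 4:
            print("Expected four values: host, REST port, username, and password.")
            continue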
        if host_type is not None:
            try:
                host_info[0] = host_type(host_info[0])
            except ValueError:
                print("Host input type must be {0}.".format(host_type.__name__))
                continue
        if port_type is not None:
            try:
                host_info[1] = port_type(host_info[1])
            except ValueError:
                print("Port input type must be {0}.".format(port_type.__name__))
                continue
        if min_port is not None and host_info[1] < min_port:
            print("Port must be greater than or equal to {0}.".format(min_port))
        elif max_port is not None and host_info[1] > max_port:
            print("Port must be less than or equal to {0}.".format(max_port))
        else:
            return host_info


def main():

    ad_hoc = splunk_host_input("Enter the first Splunk host, REST port, username, and password to check (separated by spaces or commas): ", str, int, 1, 65535)
    ad_hoc = SplunkHost(ad_hoc[0], ad_hoc[1], ad_hoc[2], ad_hoc[3])
    es = splunk_host_input("Enter the second Splunk host, REST port, username, and password to check (separated by spaces or commas): ", str, int, 1, 65535)
    es = SplunkHost(es[0], es[1], es[2], es[3])

    ad_hoc_service = client.connect(
        host=ad_hoc.host,
        port=ad_hoc.host_port,
        username=ad_hoc.username,
        password=ad_hoc.password)

    es_service = client.connect(
        host=es.host,
        port=es.host_port,
        username=es.username,
        password=es.password)

    searchQuery = """| rest /servicesNS/-/-/saved/searches 
                     | search is_scheduled=1
                     | table title search disabled is_scheduled """

    kwargs_oneshot = {"earliest_time": "-24h",
                      "latest_time": "now",
                      "count": 0}

    ad_hoc_oneshotsearch_results = ad_hoc_service.jobs.oneshot(searchQuery, **kwargs_oneshot)
    es_oneshotsearch_results = es_service.jobs.oneshot(searchQuery, **kwargs_oneshot)

    ad_hoc_reader = results.ResultsReader(ad_hoc_oneshotsearch_results)
    es_reader = results.ResultsReader(es_oneshotsearch_results)

    answer_dict = {}
    ad_hoc_dict = {}
    es_dict = {}

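    for item in ad_hoc_reader:
        # ResultsReader can also yield diagnostic Message objects; keep only result rows.
        if not isinstance(item, dict):
            continue
        ad_hoc_search_title = item['title'] + " (ad-hoc)"
        ad_hoc_search_string = item['search']
        ad_hoc_dict[ad_hoc_search_title] = ad_hoc_search_string

    for es_item in es_reader:
        if not isinstance(es_item, dict):
            continue
        es_search_title = es_item['title'] + " (es)"
        es_search_string = es_item['search']
        es_dict[es_search_title] = es_search_string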

    for ad_hoc_search_title, ad_hoc_search_string in ad_hoc_dict.iteritems():
        for es_search_title, es_search_string in es_dict.iteritems():
            similarity = jellyfish.jaro_winkler(unicode(ad_hoc_search_string, 'utf-8'), unicode(es_search_string, 'utf-8'))
            if similarity > 0.9:
                answer_dict[ad_hoc_search_title + " - " + es_search_title] = similarity

    # The with block closes output.txt even if pretty-printing fails.
    with open("output.txt", 'w') as output_file:
        pp = pprint.PrettyPrinter(indent=4, stream=output_file)
        pp.pprint(answer_dict)


if __name__ == "__main__":
    main()

Python 3

import jellyfish
import splunklib.results as results
import splunklib.client as client
import re
import pprint


class SplunkHost:
    def __init__(self, host, host_port, username, password):
        self.host = host
        self.host_port = host_port
        self.username = username
        self.password = password


def splunk_host_input(prompt, host_type=None, port_type=None, min_port=None, max_port=None):

    while True:
        host_info = input(prompt)
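        # Split on runs of semicolons, commas, or whitespace so "host, 8089" yields no empty fields.
        host_info = re.split(r'[;,\s]+', host_info.strip())
        if len(host_info) != 4:
            print("Expected four values: host, REST port, username, and password.")
            continue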
        if host_type is not None:
            try:
                host_info[0] = host_type(host_info[0])
            except ValueError:
                print("Host input type must be {0}.".format(host_type.__name__))
                continue
        if port_type is not None:
            try:
                host_info[1] = port_type(host_info[1])
            except ValueError:
                print("Port input type must be {0}.".format(port_type.__name__))
                continue
        if min_port is not None and host_info[1] < min_port:
            print("Port must be greater than or equal to {0}.".format(min_port))
        elif max_port is not None and host_info[1] > max_port:
            print("Port must be less than or equal to {0}.".format(max_port))
        else:
            return host_info


def main():

    ad_hoc = splunk_host_input("Enter the first Splunk host, REST port, username, and password to check (separated by spaces or commas): ", str, int, 1, 65535)
    ad_hoc = SplunkHost(ad_hoc[0], ad_hoc[1], ad_hoc[2], ad_hoc[3])
    es = splunk_host_input("Enter the second Splunk host, REST port, username, and password to check (separated by spaces or commas): ", str, int, 1, 65535)
    es = SplunkHost(es[0], es[1], es[2], es[3])

    ad_hoc_service = client.connect(
        host=ad_hoc.host,
        port=ad_hoc.host_port,
        username=ad_hoc.username,
        password=ad_hoc.password)

    es_service = client.connect(
        host=es.host,
        port=es.host_port,
        username=es.username,
        password=es.password)

    searchQuery = """| rest /servicesNS/-/-/saved/searches 
                     | search is_scheduled=1
                     | table title search disabled is_scheduled """

    kwargs_oneshot = {"earliest_time": "-24h",
                      "latest_time": "now",
                      "count": 0}

    ad_hoc_oneshotsearch_results = ad_hoc_service.jobs.oneshot(searchQuery, **kwargs_oneshot)
    es_oneshotsearch_results = es_service.jobs.oneshot(searchQuery, **kwargs_oneshot)

    ad_hoc_reader = results.ResultsReader(ad_hoc_oneshotsearch_results)
    es_reader = results.ResultsReader(es_oneshotsearch_results)

    answer_dict = {}
    ad_hoc_dict = {}
    es_dict = {}

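    for item in ad_hoc_reader:
        # ResultsReader can also yield diagnostic Message objects; keep only result rows.
        if not isinstance(item, dict):
            continue
        ad_hoc_search_title = item['title'] + " (ad-hoc)"
        ad_hoc_search_string = item['search']
        ad_hoc_dict[ad_hoc_search_title] = ad_hoc_search_string

    for es_item in es_reader:
        if not isinstance(es_item, dict):
            continue
        es_search_title = es_item['title'] + " (es)"
        es_search_string = es_item['search']
        es_dict[es_search_title] = es_search_string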

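    # Newer jellyfish releases expose Jaro-Winkler as jaro_winkler_similarity;
    # fall back to the older jaro_winkler name when that attribute is missing.
    jaro_winkler = getattr(jellyfish, "jaro_winkler_similarity", None) or jellyfish.jaro_winkler

    for ad_hoc_search_title, ad_hoc_search_string in ad_hoc_dict.items():
        for es_search_title, es_search_string in es_dict.items():
            similarity = jaro_winkler(ad_hoc_search_string, es_search_string)
            if similarity > 0.9:
                answer_dict[ad_hoc_search_title + " - " + es_search_title] = similarity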

    # The with block closes output.txt even if pretty-printing fails.
    with open("output.txt", 'w') as output_file:
        pp = pprint.PrettyPrinter(indent=4, stream=output_file)
        pp.pprint(answer_dict)


if __name__ == "__main__":
    main()

To run this script, your account must be able to execute remote REST searches. Run the script from the command line and enter the connection details when prompted. Output is written to the file output.txt in the current directory.

The format of the output is as follows:

{ <search name 1> (ad-hoc) - <search name 2> (es) : <Jaro-Winkler score>,
…
}
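Because output.txt holds a pretty-printed Python dictionary, it can be read back and ranked with the standard library alone. The following is a small sketch, not part of the original script; the helper name `load_ranked_pairs` is hypothetical, and it assumes the file contains exactly what the script's pprint call wrote.

```python
import ast


def load_ranked_pairs(path="output.txt"):
    """Read the pprint-formatted dict and return (pair, score) tuples, highest score first."""
    with open(path) as handle:
        # literal_eval parses the dict literal without executing arbitrary code.
        pairs = ast.literal_eval(handle.read())
    return sorted(pairs.items(), key=lambda entry: entry[1], reverse=True)
```

Run in the directory where the script was executed, `for pair, score in load_ranked_pairs(): print(score, pair)` lists the most similar pairs first, which makes the strongest duplicate candidates easy to review.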

The script only records pairs that score above 0.9. In testing, I found that a score above this value should prompt serious consideration of whether you need to run the search on both search heads.
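If you want to experiment with a different cutoff, the comparison loop can be factored into a function that takes the threshold as a parameter. This is a sketch under assumptions: `find_similar` and its arguments are hypothetical names, and the scoring function is passed in by the caller (for example, the Jaro-Winkler function from jellyfish) rather than hard-coded.

```python
from itertools import product


def find_similar(ad_hoc, es, score, threshold=0.9):
    """Return {"<ad-hoc title> - <es title>": similarity} for pairs scoring above threshold.

    ad_hoc and es map search titles to search strings; score is any callable
    that takes two strings and returns a similarity between 0 and 1.
    """
    matches = {}
    # product() visits every ad-hoc search paired with every ES search.
    for (a_title, a_search), (e_title, e_search) in product(ad_hoc.items(), es.items()):
        similarity = score(a_search, e_search)
        if similarity > threshold:
            matches[a_title + " - " + e_title] = similarity
    return matches
```

Lowering the threshold surfaces more candidate pairs at the cost of more false positives, so it is worth reviewing a sample of results before settling on a value.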

Happy Splunking!

Hopefully, this script will help you eliminate any parts of your Splunk search infrastructure that are working unneeded overtime, thereby improving your performance.



