Splunk Case Study: Indexed Extractions vs. Search-Time Extractions

Introduction

Tucked away in the Splunk documentation is a setting that can be extremely helpful but can also come at a cost: Indexed Extractions. An entire book chapter could probably be written on this topic, so I’m going to focus on the high-level points that tend to come up when Indexed Extractions are discussed.

Case Study Baseline

Before getting started, I want to baseline the included case study. I needed a usable dataset and, to make sure I wasn’t using any sensitive information, I chose a freely available one from Yelp. This dataset is available for download from the Yelp Dataset Challenge Round 9 website.

I chose a specific file from their download titled yelp_academic_dataset_review.json. This file is 3.3GB in size and contains 4,153,150 rows of data. Each row is a different review about a specific business. This felt like a sizable and complex enough set of data for the purpose of this case study. Not only does the data mimic the three V’s of Big Data (Volume, Variety, and Velocity), it is also structured data, which Splunk has a lot of great mechanisms for handling.
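
Before indexing a file this large, it can be worth confirming its size and row count. The following is a minimal sketch that assumes the file holds one JSON object per line; the function name summarize_ndjson and the /tmp path are illustrative choices, not part of the original test.

```python
import json
import os


def summarize_ndjson(path, sample_rows=1):
    """Return (size_bytes, row_count, field_names) for a JSON-lines file.

    Assumes one JSON object per line (an assumption about the file layout).
    """
    size_bytes = os.path.getsize(path)
    row_count = 0
    field_names = set()
    # Stream line by line so a multi-gigabyte file never sits in memory.
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            if row_count < sample_rows:
                # Only the first few rows are parsed to learn the field names.
                field_names.update(json.loads(line).keys())
            row_count += 1
    return size_bytes, row_count, sorted(field_names)


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever the download was saved.
    path = "/tmp/yelp_academic_dataset_review.json"
    if os.path.exists(path):
        size, rows, fields = summarize_ndjson(path)
        print(f"{size / 1e9:.1f} GB, {rows:,} rows, fields: {fields}")
```

If the reported row count differs from 4,153,150, the file is probably a different round of the dataset or was truncated during download.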

The most important question I wanted to answer was “What performs better?” Would it be index-time extractions, or letting the Splunk Search Head handle the data extraction at search time? By setting up a very controlled test case, I was able to examine this question.

Let’s collect the data

To start out, I needed two differently named, but otherwise identical, data sets. When collecting those data sets with Splunk, I had to ensure that they differed only in where the extractions took place.

The steps for doing this included:

  1. Placing the original file on my Search Head.
  2. Deploying two sets of configurations to my Search Head and Indexer so that, once collected, one data set would use Indexed Extractions and the other would use Search-Time Extractions. (Those configurations are listed below.)

inputs.conf

My inputs.conf file was set up to monitor two specific files and assign them to two specific sourcetypes: yelp_no_indexed_extract and yelp_indexed_extract.

[monitor:///tmp/yelp_data_no_indexed_extract.json]
index=test
sourcetype=yelp_no_indexed_extract
disabled=0

[monitor:///tmp/yelp_data_indexed_extract.json]
index=test
sourcetype=yelp_indexed_extract
disabled=0

props.conf

My props.conf files are where the rubber meets the road for this test, and where I actually specify the type of extractions. The first stanza sets KV_MODE=JSON, which tells only the Search Head to make sense of our JSON-formatted data. The second intentionally shuts that off (KV_MODE=none) and tells the indexer to extract the data using INDEXED_EXTRACTIONS=JSON.

[yelp_no_indexed_extract]
TRUNCATE = 0
CHARSET = UTF-8
KV_MODE=JSON
SHOULD_LINEMERGE=false
DATETIME_CONFIG=CURRENT

[yelp_indexed_extract]
TRUNCATE = 0
CHARSET = UTF-8
KV_MODE=none
INDEXED_EXTRACTIONS=JSON
SHOULD_LINEMERGE=false
DATETIME_CONFIG=CURRENT
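
Because the test is only as controlled as these two stanzas, it can help to confirm that they differ in nothing but the extraction settings. The sketch below uses Python's standard configparser to diff them. It assumes the stanzas parse as plain INI (true for the two shown here, not for every Splunk .conf file), and stanza_differences is a hypothetical helper name.

```python
import configparser

# The two stanzas exactly as listed in props.conf above.
PROPS_CONF = """
[yelp_no_indexed_extract]
TRUNCATE = 0
CHARSET = UTF-8
KV_MODE=JSON
SHOULD_LINEMERGE=false
DATETIME_CONFIG=CURRENT

[yelp_indexed_extract]
TRUNCATE = 0
CHARSET = UTF-8
KV_MODE=none
INDEXED_EXTRACTIONS=JSON
SHOULD_LINEMERGE=false
DATETIME_CONFIG=CURRENT
"""


def stanza_differences(conf_text, first, second):
    """Return {setting: (value_in_first, value_in_second)} for settings that differ."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the upper-case setting names as written
    parser.read_string(conf_text)
    a, b = dict(parser[first]), dict(parser[second])
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


differences = stanza_differences(
    PROPS_CONF, "yelp_no_indexed_extract", "yelp_indexed_extract"
)
for setting, (search_time, index_time) in differences.items():
    print(f"{setting}: search-time={search_time!r}, index-time={index_time!r}")
# INDEXED_EXTRACTIONS: search-time=None, index-time='JSON'
# KV_MODE: search-time='JSON', index-time='none'
```

Only KV_MODE and INDEXED_EXTRACTIONS show up in the output, which is what a controlled comparison needs.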

Tests

With my configurations in place, I was able to set up and run two very controlled tests. I ran these at completely different times so that they didn’t interfere with one another (the boring details are listed below).

Test #1 - Non-Indexed Extractions test

Start time: 11:10AM EST
Size of Index Before: 0GB

Test Setup:

On the Indexer I ran the following commands to clean my index.

/opt/splunk/bin/splunk stop
/opt/splunk/bin/splunk clean eventdata -index test
/opt/splunk/bin/splunk start

On the Search Head I ran the following commands to copy the dataset to the monitored path, which tells Splunk to start indexing it.

cd /tmp
cp yelp_academic_dataset_review.json yelp_data_no_indexed_extract.json

Size of Index After Indexing Completed: 2.6GB

Test #2 - Indexed Extractions test

Start time: 11:21AM EST
Size of Index Before: 0GB

Test Setup:

On the Indexer I ran the following commands to clean my index.

/opt/splunk/bin/splunk stop
/opt/splunk/bin/splunk clean eventdata -index test
/opt/splunk/bin/splunk start

On the Search Head I ran the following commands to copy the dataset to the monitored path, which tells Splunk to start indexing it.

cd /tmp
cp yelp_academic_dataset_review.json yelp_data_indexed_extract.json

Size of Index After Indexing Completed: 9.5GB

CPU Performance Results

CPU performance while indexing with Indexed Extractions was slightly worse than while indexing without them.

The charts below show the CPU usage during both tests for the parsing and indexing queues.

[Chart: Parsing Queue]

[Chart: Indexing Queue]

Storage Performance Results

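Indexed Extraction storage performance was much worse than Search-Time Extractions. Below is a table that shows the size of the index before and after indexing, with indexed extractions and without.

Test                                 Size Before   Size After
Test #1 - Non-Indexed Extractions    0GB           2.6GB
Test #2 - Indexed Extractions        0GB           9.5GB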
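
To put those sizes in proportion, here is a short sketch that works out the ratios from the numbers reported in this case study (a 3.3GB raw file, and index sizes of 2.6GB and 9.5GB). The variable and function names are illustrative.

```python
# Sizes reported in this case study, in GB.
RAW_FILE_GB = 3.3               # yelp_academic_dataset_review.json
SEARCH_TIME_INDEX_GB = 2.6      # index size after Test #1
INDEXED_EXTRACT_INDEX_GB = 9.5  # index size after Test #2


def size_ratio(size_gb, baseline_gb):
    """Return size_gb as a multiple of baseline_gb, rounded to two decimals."""
    return round(size_gb / baseline_gb, 2)


search_time_vs_raw = size_ratio(SEARCH_TIME_INDEX_GB, RAW_FILE_GB)
indexed_vs_raw = size_ratio(INDEXED_EXTRACT_INDEX_GB, RAW_FILE_GB)
indexed_vs_search_time = size_ratio(INDEXED_EXTRACT_INDEX_GB, SEARCH_TIME_INDEX_GB)

print(f"Search-time index vs raw file:       {search_time_vs_raw}x")      # 0.79x
print(f"Indexed-extraction index vs raw:     {indexed_vs_raw}x")          # 2.88x
print(f"Indexed-extraction vs search-time:   {indexed_vs_search_time}x")  # 3.65x
```

In other words, the search-time index ended up smaller than the raw file, while the indexed-extraction index was nearly three times the raw file and more than three and a half times the search-time index.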

Simple Search Result Tests

For simple searches, the best performance you’ll find is with indexed extractions. The worst performance you’ll find is when you search data that was indexed using Indexed Extractions but use the “=” operator rather than “::”.

More Complex Search Result Tests

For more complex searches, the best performance you’ll find is again with indexed extractions. In fact, in this test case, results returned about twice as fast. Keep in mind, though, that the “::” operator behaves much like an equals sign and does not allow for a lot of flexibility. So, although we got great results with this test, there were plenty of tests we were unable to run.
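
One way to picture this trade-off is with a toy model. The sketch below is not how Splunk is implemented; it only illustrates the general idea that fields extracted at index time can be stored as exact field::value tokens, while search-time extraction parses the raw event on every search. The sample events, the field names stars and text, and the function names are all assumptions made for illustration.

```python
import json
from collections import defaultdict

# Three made-up events standing in for Yelp reviews (field names are assumed).
RAW_EVENTS = [
    '{"stars": 5, "text": "great food"}',
    '{"stars": 1, "text": "not good"}',
    '{"stars": 5, "text": "very friendly"}',
]


def search_time_match(raw_events, field, predicate):
    """Parse every raw event at search time, then apply any predicate."""
    hits = []
    for position, raw in enumerate(raw_events):
        fields = json.loads(raw)  # extraction cost is paid on each search
        if field in fields and predicate(fields[field]):
            hits.append(position)
    return hits


def build_token_index(raw_events):
    """Parse once at index time and store 'field::value' tokens (extra storage)."""
    index = defaultdict(list)
    for position, raw in enumerate(raw_events):
        for field, value in json.loads(raw).items():
            index[f"{field}::{value}"].append(position)
    return index


token_index = build_token_index(RAW_EVENTS)

# Exact match: one dictionary lookup, no JSON parsing at search time.
print(token_index.get("stars::5", []))  # [0, 2]

# A range such as stars >= 4 has no single token to look up,
# but the search-time approach handles it with an ordinary predicate.
print(search_time_match(RAW_EVENTS, "stars", lambda value: value >= 4))  # [0, 2]
```

The token index answers exact matches quickly but takes extra space and only knows the exact values it stored, which mirrors both the storage growth and the limited flexibility seen in these tests.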

Other findings

One very large downside to consider when moving forward with Indexed Extractions is that calculated fields will not work with Indexed Extractions. Two visualizations I was able to make with the search time extracted data are shown below. These are word clouds for ratings that people gave for 1-star and 5-star Yelp Reviews.

As you can see, these visualizations show different words being used for each type of review. You can see the word “not” on the left for the 1-star reviews. On the right you’ll notice “very” and “great” for the 5-star reviews.

These types of visualizations are not possible with the data that purely uses Indexed Extractions. The reason is that calculated fields only work on search-time extractions.

Summary

To summarize, Indexed Extractions should be used with caution. Splunk gives a pretty fair warning against using them in almost any doc that references Indexed Extractions, including their definition on Splexicon.

Though they can have some seriously awesome benefits with searching in very specific use cases, overall they add a lot of overhead in terms of both CPU and storage costs. In our testing, the index was almost 3 times the size of the actual raw data (9.5GB versus 3.3GB). In addition, the data in this test case ended up being less flexible, and I was unable to use some very common and useful SPL commands, such as eval, on index-extracted fields.

Hopefully this will bring some clarity as to why Splunk warns against Indexed Extractions so much. If anyone has any questions or comments on this case study, please feel free to reach out.