In my previous blog post I discussed some considerations for designing and sizing a Splunk environment, and the resources that servers will need for a mock deployment. However, one topic that I did not focus on was the storage requirements. There’s enough to consider that this topic warrants an entirely separate blog post.
While storage seems simple at face value, a few factors matter when sizing a Splunk environment: they determine whether Splunk performs well and keeps data searchable for an appropriate period of time.
Factors Impacting Storage Requirements
The two most important factors impacting Splunk storage are the daily ingestion rate and the desired retention period. These two values work together to account for the bulk of the calculation of how much space you will need allocated.
Daily Ingestion Rate
The daily ingestion rate is simple - how much data is Splunk consuming each day, before compression. Generally, it’s safe to use the licensed capacity as an upper limit for this value, as you should not consistently be exceeding your license. However, there are cases where this is not always the best method.
According to Splunk’s current Pricing FAQ, volume discounts are offered for larger license sizes. This means that the cost per gigabyte of daily ingestion decreases with a larger license. In many cases, it can make more sense to buy a larger license with the intent to grow into it, resulting in a lower average cost per unit of data in Splunk. For more details on this, talk to someone who drives a fancier car and uses larger words with less actual meaning than me (aka your sales rep).
This could mean that your organization ends up purchasing a much larger Splunk license than what you might anticipate using for a few years. In this case, it may make sense to purchase storage based on your anticipated ingestion rate and budget to increase the storage allocated in subsequent years. This would be a situation where your calculations might not be based simply on the license size.
Retention Period
At face value, retention period is simple: how long data is available in Splunk. However, there are some nuances to this value to consider.
The retention period is typically at least somewhat dictated by compliance. Many compliance standards, such as PCI, include verbiage regarding how long logs must be maintained and how much data must be immediately available (PCI DSS, 10.7). In terms of Splunk, immediately available would mean that the data must be immediately searchable (and not frozen or otherwise archived).
Another consideration around retention would be the intended use for the data from a search perspective. For Splunk Enterprise Security Suite (ES) or Splunk IT Service Intelligence (ITSI) use cases, Splunk runs tons of searches nearly constantly over a relatively small range of data. You may have other reporting or business analytics use cases that require a larger set of data to be available quickly. Planning the storage around the typical search behavior can result in better system performance due to the tiered method that Splunk uses.
So we just do some math and buy some disks, right?
Without knowing any better, you might think that a Splunk disk calculation would work something like this:
- You have a 10gb license
- Your compliance requirement stipulates that you need 90 days of logs immediately available
- You math those two numbers together (yes, I’m using math as a verb here) and determine you need 900gb of disk space
- You buy a 1TB hard drive, and never think of Splunk storage again
Unfortunately, it’s not quite that simple. What actually happens in this case:
- All of your users hate you forever
- Splunk performs poorly
- You end up with more capacity than you actually might need
To better understand how this translates into disk utilization and storage performance, let’s first explore how Splunk stores data.
How Splunk Stores Data
Splunk stores data in different indexes, and each index is composed of a number of buckets, which essentially are files written to the filesystem in a format Splunk understands. Buckets consist of compressed raw data as well as structures that allow Splunk to quickly determine if terms (words) are included within a bucket. (Note: this is a gross oversimplification, as just this topic could be multiple blog posts).
Searching Splunk for data requires a few steps (again, this is a simplification of the process):
- Using the time range selected, Splunk knows what buckets to search based on the timestamps in their filenames
- Splunk quickly determines if the desired data might be contained in a bucket, and skips those that are irrelevant
- Buckets containing data are decompressed and matching results are obtained
- Analysis/processing/transformation of the data is performed
- Results are returned to the user
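The first step in that list can be sketched in a few lines of Python. This is only an illustration of the idea, not how Splunk implements it: it assumes warm and cold bucket directories are named like `db_<newest_epoch>_<oldest_epoch>_<id>`, and the helper names and sample buckets are made up for the example.

```python
def parse_bucket(name):
    """Return (newest, oldest) epoch seconds parsed from a bucket directory name."""
    parts = name.split("_")
    return int(parts[1]), int(parts[2])


def buckets_in_range(names, earliest, latest):
    """Keep only the buckets whose time span overlaps [earliest, latest]."""
    selected = []
    for name in names:
        newest, oldest = parse_bucket(name)
        # A bucket is relevant if it starts before the search window ends
        # and ends after the search window starts.
        if oldest <= latest and newest >= earliest:
            selected.append(name)
    return selected


# Hypothetical bucket names, each covering roughly one hour.
buckets = [
    "db_1500003600_1500000000_1",
    "db_1500007200_1500003601_2",
    "db_1500010800_1500007201_3",
]

print(buckets_in_range(buckets, 1500004000, 1500005000))
# ['db_1500007200_1500003601_2']
```

Only one of the three buckets overlaps the search window, so the other two never need to be opened at all.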
This means that, in order to work efficiently, Splunk is not simply storing data on the disk in a raw, uncompressed format. In fact, the raw data is compressed, and typically compresses down to approximately 15% of the original data size when stored on disk.
But there’s more to Splunk than just compressing the data and calling it a day - the largest percentage of storage space consumed by Splunk actually consists of searchable metadata - essentially, the data used by Splunk to quickly determine if a file needs to be decompressed to return results when a search is run. This accounts for approximately 35% of the original data size when stored to disk.
By adding these together, we see that the actual disk space used by Splunk when storing data amounts to roughly 50% of the original data size, before any replication. Obviously, your mileage may vary depending on the type of data being stored: some data compresses quite well and other data compresses extremely poorly, so your deployment may see different results. However, if you don’t have a great idea of what to expect, using the 50% value is a good way to start.
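As a rough sketch, that rule of thumb looks like this in Python. The 15% and 35% figures are the estimates from this post, not measured values, and the function name is just something I picked for the example. Both percentages are parameters so you can plug in your own numbers once you know how well your data compresses.

```python
RAW_PCT = 15    # compressed raw data, as a percentage of ingested size
INDEX_PCT = 35  # searchable metadata, as a percentage of ingested size


def daily_disk_gb(ingest_gb_per_day, raw_pct=RAW_PCT, index_pct=INDEX_PCT):
    """Estimate disk consumed per day, before any replication."""
    return ingest_gb_per_day * (raw_pct + index_pct) / 100


print(daily_disk_gb(10))                            # 5.0
print(daily_disk_gb(10, raw_pct=10, index_pct=30))  # 4.0
```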
Splunk buckets also have the concept of a lifecycle, which determines the state of the bucket. When a bucket is being written to, it is said to be in the hot state. Hot buckets are the only ones where data is actively being written. Once a bucket is full (or a Splunk restart happens or the bucket is manually closed), the bucket will roll to the warm state, and will be closed for future writing. After a period of time, Splunk will roll the buckets to cold, and after a longer period of time, Splunk will roll the cold buckets to frozen (which by default, results in these buckets being deleted).
Without going into too much detail, the benefit of the hot -> warm -> cold -> frozen lifecycle is that it allows storage to be tuned to better match the desired use case for the environment and also to manage storage costs.
Warm Storage is where both hot and warm buckets reside. It is also the only storage where new/incoming data is written. This type of storage should be the fastest available to your Splunk system: Splunk requires a minimum of 800 IOPS for this storage. If practical, it makes sense to spring for NVMe or SSD drives for this. If spinning disks are used, a RAID 10 is generally required to achieve acceptable performance (in my experience this will require an 8-drive array to reach these numbers on all but the fastest 15k SAS disks). SAN storage is completely acceptable provided it can accommodate the amount of IOPS that Splunk requires (test your storage first, don’t just believe your SAN vendor). Shared storage via NFS should never be used for Splunk warm.
That said, assuming we are using a flash-backed system for our warm storage - we likely can’t store everything there due to the cost of these drives. This is where understanding our Splunk use case comes in: if for example our primary use case for Splunk is a Security Operations Center (SOC) one, we might find that the overwhelming majority of our searches won’t involve more than a week of data. This means that we might consider using very fast storage for this amount of data and understanding that searches over a longer period of time will use slower storage and not complete as quickly.
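To put a number on that trade-off, here is a small sketch that splits searchable retention between fast and slow storage. It assumes roughly 5gb/day of disk usage (a 10gb/day license at the 50% estimate), one week on fast storage, and 90 days searchable in total; the function name is hypothetical. Treat it as a capacity estimate rather than a description of how Splunk decides when to roll a bucket.

```python
def tier_split_gb(daily_disk_gb, warm_days, searchable_days):
    """Split searchable retention into fast (hot/warm) and slower (cold) capacity."""
    if warm_days > searchable_days:
        raise ValueError("warm_days cannot exceed searchable_days")
    warm_gb = daily_disk_gb * warm_days
    cold_gb = daily_disk_gb * (searchable_days - warm_days)
    return warm_gb, cold_gb


warm, cold = tier_split_gb(daily_disk_gb=5, warm_days=7, searchable_days=90)
print(warm, cold)  # 35 415
```

In this example only 35gb needs to live on the expensive flash tier, with the remaining 415gb on cheaper disks.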
Cold storage is the second class of storage defined in Splunk and is the last class that is actively searchable. Cold storage is often used for data that needs to be searchable without additional steps (such as in the case of the “immediately available” requirement for PCI), but where decreased search performance is acceptable. Typically, this type of storage is used as a trade-off, especially when a high cost/low capacity type of storage such as an NVMe SSD backend is used for Splunk warm and it’s not practical to keep all of the searchable data on this type of storage.
In some cases, you might simply be allocating a single large pool of storage to Splunk. In this case, there isn’t any difference between the warm and cold storage in terms of performance. Splunk still will roll buckets to cold under the hood, so you will just want to symlink your cold mount point to your warm storage.
I haven’t found any official recommendation on the required performance for cold storage, since it’s generally more trouble to get the required storage performance from warm storage unless flash-based storage is being used. It should at least be mentioned that this storage needs to perform well enough to handle bucket rolling activities (copying indexed data from warm to cold) as well as any search requests that make use of this data.
Since cold storage is a lower class of storage, it’s tempting to use a network based storage such as NFS for this (NFS is *not* supported for warm storage, but it is supported for cold). Bear in mind that this exposes your environment to stability issues due to NFS failures - I have seen entire indexers malfunction (leading to a cascading Splunk infrastructure failure) due to NFS issues accessing cold storage.
There’s a third class of storage, frozen, that is significantly less convenient than warm or cold storage but can be leveraged as a cost management solution where longer retention periods are needed for compliance or legal reasons.
By default, Splunk does not use frozen storage - the frozen behavior instead deletes the data, once the configured retention period for cold has been exceeded. A Splunk administrator can override this behavior to specify a location where data is stored when it rolls to frozen.
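As a sketch of what that override might look like, the Python below renders an `indexes.conf` stanza using the `frozenTimePeriodInSecs` and `coldToFrozenDir` settings. The index name and the archive path are made up for the example, and the 90 day value is just an assumption; check the `indexes.conf` spec for your Splunk version before relying on it.

```python
SECONDS_PER_DAY = 86400


def frozen_stanza(index_name, searchable_days, frozen_dir):
    """Render an indexes.conf stanza that archives buckets instead of deleting them."""
    lines = [
        f"[{index_name}]",
        # Buckets roll to frozen once their newest event is older than this.
        f"frozenTimePeriodInSecs = {searchable_days * SECONDS_PER_DAY}",
        # With this set, frozen buckets are copied here rather than deleted.
        f"coldToFrozenDir = {frozen_dir}",
    ]
    return "\n".join(lines)


# Hypothetical index name and archive location.
print(frozen_stanza("pci_logs", 90, "/mnt/splunk_frozen/pci_logs"))
```

For 90 days this prints a `frozenTimePeriodInSecs` of 7776000, which is simply 90 * 86400.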
Frozen data has the benefit of taking up significantly less space than other Splunk indexed data, generally around 15% of the original data size. If you think that number sounds familiar, you’re right - that’s the same 15% in the first calculation we did for determining how much space the raw data will take up when data is indexed. Rolling to frozen doesn’t magically give us more space - we’re deleting the searchable metadata (35% of the original data size), which renders our data non-searchable in Splunk, and archiving the compressed raw data.
In order for frozen data to become searchable in Splunk, the data must be thawed. This means that the data is moved to a location on the indexer where it can be read and that Splunk re-builds all of the metadata needed to make the data searchable. Depending on the data, this can be a very time intensive (and manual) process.
Additionally, once data is frozen by Splunk, it is no longer tracked. Cleaning up the frozen data (such as deleting it once it is beyond the duration where it must be retained) must be accomplished using a process external to Splunk.
As such, frozen data is best suited as an archive that is not expected to be accessed during normal business operations, but must be kept for compliance or legal purposes. Regularly accessing frozen data in Splunk is not a very efficient use of your admins’ time.
Data Model Acceleration
One often forgotten component of your storage calculation is the space needed to store data model acceleration data, which is commonly used by the Splunk Enterprise Security Suite (ES). The official calculation for this data works out to be 3.4x the daily usage to account for storage for one year of accelerated data model data. This means that a year of data model acceleration on a 10gb/day license would use up approximately 34gb of disk space. This isn’t a huge consideration on a small deployment but will add up as an environment scales.
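To show how this scales, here is a minimal sketch that applies the 3.4x multiplier to a few license sizes. The multiplier is the one-year figure quoted above; the function name and the list of license sizes are just examples I chose.

```python
DMA_MULTIPLIER = 3.4  # one year of accelerated data, as a multiple of daily ingest


def dma_storage_gb(ingest_gb_per_day):
    """Estimate disk needed for one year of data model acceleration."""
    return ingest_gb_per_day * DMA_MULTIPLIER


for license_gb in (10, 100, 500):
    print(f"{license_gb}gb/day -> {dma_storage_gb(license_gb):.0f}gb of DMA storage")
```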
Now that we’ve covered the classes of storage, let’s work through a few examples, first using a small 10gb/day environment and then expanding to cover the sample environment (500gb/day) detailed in part 1 of the Splunking Responsibly blog.
Storage Calculation: Single Instance, 10gb/day
Let’s start with a simple example:
- Splunk licensed capacity: 10gb/day
- Single server, no replication
At this capacity, we can expect to consume 5gb/day of actual disk space:
- 10gb/day * .15 (raw data archive) = 1.5gb/day disk usage
- 10gb/day * .35 (search metadata) = 3.5gb/day disk usage
- 1.5gb + 3.5gb = 5gb/day disk usage
Now that we know how much space we need per day, we can extrapolate our storage requirements out using retention numbers. Assuming we are keeping everything in warm or cold storage, this is a simple calculation:
- Storage per day * number of days
In Our 10gb/day Example:
90 days retention:
- 5gb daily usage * 90 days = 450gb disk
1 year retention:
- 5gb daily usage * 365 days = 1825gb disk
Data model acceleration (assuming 1 year):
- 10gb * 3.4 = 34gb
1 year retention + DMA:
- 1825gb + 34gb = 1859gb
In this example, I would recommend the customer have 2TB of space allocated for Splunk data.
If we want to reduce our storage requirements by using frozen data, but need to keep 90 days of data “immediately available for analysis” with 1 year of data available (in line with what is required by PCI DSS 10.7), our calculation would look something like this:
- 5gb daily usage * 90 days = 450gb
- 1.5gb daily usage * 275 days = 412.5gb
- 450gb (warm/cold) + 412.5gb (frozen) + 34gb (DMA) = 896.5gb
- In this example, I would recommend the customer have 1TB of space allocated for Splunk data
As you can see, using frozen to archive 9 months of data significantly reduces the storage requirements, at the expense of no longer having that data readily searchable.
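Both single-instance scenarios can be reproduced with one small function. This is a sketch using the estimates from this post (15% raw data, 35% search metadata, 3.4x for a year of DMA), and the function and parameter names are my own.

```python
def single_instance_storage_gb(ingest_gb, searchable_days, frozen_days=0,
                               raw_pct=15, index_pct=35, dma_multiplier=3.4):
    """Estimate storage for one server with no replication."""
    # Warm/cold data keeps both the compressed raw data and the search metadata.
    searchable = ingest_gb * (raw_pct + index_pct) / 100 * searchable_days
    # Frozen data keeps only the compressed raw data.
    frozen = ingest_gb * raw_pct / 100 * frozen_days
    dma = ingest_gb * dma_multiplier
    return {"searchable": searchable, "frozen": frozen, "dma": dma,
            "total": searchable + frozen + dma}


all_searchable = single_instance_storage_gb(10, searchable_days=365)
with_frozen = single_instance_storage_gb(10, searchable_days=90, frozen_days=275)

print(f"{all_searchable['total']:.1f}gb")  # 1859.0gb
print(f"{with_frozen['total']:.1f}gb")     # 896.5gb
```

The two totals match the hand calculations above, which is a handy sanity check before trying other retention periods.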
Storage Calculation: Multiple Indexers, 500gb/day + clustering
Now for a more complicated example. Here we will reference the 500gb/day deployment in my previous post. This environment consisted of 8 indexers. For the sake of this example, we’ll consider a perfect distribution of data across all of the indexers, with a search and replication factor of 2:2 at each site.
Let’s start with calculating our base storage requirements:
- 500gb/day * .15 (raw data archive) = 75gb/day disk usage
- 500gb/day * .35 (search metadata) = 175gb/day disk usage
- 75gb + 175gb = 250gb/day disk usage (pre-replication)
- 75gb * 2 (replication factor) + 175gb * 2 (search factor) = 500gb/day
- 500gb/day / 8 indexers (assuming even distribution) = 62.5gb/day/indexer
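The replication step can be sketched as follows: the compressed raw data is multiplied by the replication factor and the search metadata by the search factor. The percentages are this post's estimates, the function name is hypothetical, and the second call (a replication factor of 3) is just an extra example, not part of the sample environment.

```python
def clustered_daily_gb(ingest_gb, replication_factor, search_factor,
                       raw_pct=15, index_pct=35):
    """Estimate cluster-wide disk consumed per day, including replicated copies."""
    raw = ingest_gb * raw_pct / 100 * replication_factor
    index = ingest_gb * index_pct / 100 * search_factor
    return raw + index


indexers = 8
for rf, sf in ((2, 2), (3, 2)):
    total = clustered_daily_gb(500, replication_factor=rf, search_factor=sf)
    print(f"RF={rf} SF={sf}: {total:.1f}gb/day, {total / indexers:.3f}gb/day/indexer")
```

With a replication and search factor of 2 this gives 500gb/day across the cluster and 62.5gb/day per indexer, the same as the hand calculation.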
At this point, we continue our calculation based on desired retention:
90 days retention:
- 500gb daily usage * 90 days = 45TB disk
- 45 TB / 8 indexers = 5.625 TB disk per indexer
1 year retention:
- 500gb daily usage * 365 days = 182.5 TB disk
- 182.5 TB / 8 indexers = ~ 23 TB disk per indexer
Data model acceleration (assuming 1 year):
- 500gb * 3.4 = 1700gb
1 year retention + DMA:
- 182.5 TB + 1.7 TB = 184.2 TB
- 184.2 TB / 8 indexers = ~23 TB disk per indexer (slightly above this)
- In this example, I would recommend the customer have a bit more than 23TB of space allocated for Splunk data on each indexer
This becomes more complicated if you go with a 90 days warm/cold + 275 days frozen calculation, but since we’ve done everything up to this point, why not:
- 500gb daily usage * 90 days = 45 TB disk
- 45 TB / 8 indexers = 5.625 TB disk per indexer
- 75 gb daily usage * 275 days = 20.625 TB disk
- 20.625 TB / 8 indexers = 2.578 TB disk per indexer
- Across the environment: 45 TB (warm/cold) + 20.625 TB (frozen) + 1.7 TB (DMA) = 67.325 TB
- Per indexer: 5.625 TB (warm/cold) + 2.578 TB (frozen) + 0.2125 TB (DMA) = ~8.42 TB per indexer
- In this example, I would recommend the customer have about 9TB of space allocated for Splunk data for each indexer
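The clustered calculation can be wrapped up the same way. This sketch follows the arithmetic above rather than any official formula: it assumes an even spread across indexers, 1 TB = 1000gb, DMA that is not replicated, and a single archived copy of frozen data (the hypothetical `frozen_copies` parameter is there so you can change that).

```python
def cluster_storage_tb(ingest_gb, indexers, searchable_days, frozen_days=0,
                       replication_factor=2, search_factor=2, frozen_copies=1,
                       raw_pct=15, index_pct=35, dma_multiplier=3.4):
    """Return (cluster-wide TB, TB per indexer) for the given retention plan."""
    raw_gb = ingest_gb * raw_pct / 100
    index_gb = ingest_gb * index_pct / 100
    # Warm/cold: raw data is replicated, search metadata follows the search factor.
    searchable = (raw_gb * replication_factor + index_gb * search_factor) * searchable_days
    # Frozen: compressed raw data only.
    frozen = raw_gb * frozen_copies * frozen_days
    dma = ingest_gb * dma_multiplier
    total_tb = (searchable + frozen + dma) / 1000
    return total_tb, total_tb / indexers


year_searchable = cluster_storage_tb(500, indexers=8, searchable_days=365)
with_frozen = cluster_storage_tb(500, indexers=8, searchable_days=90, frozen_days=275)

print(f"all searchable: {year_searchable[0]:.3f} TB total, {year_searchable[1]:.3f} TB per indexer")
print(f"90 days + frozen: {with_frozen[0]:.3f} TB total, {with_frozen[1]:.3f} TB per indexer")
```

The per-indexer results line up with the roughly 23 TB and 8.42 TB figures worked out by hand.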
But Math is Hard!
It’s important to understand how storage is calculated and to be able to do it by hand if necessary (I have used many a whiteboard to go through this example with customers, and being able to do this manually really helps them to “get it”). That said, once you’re comfortable with the process, tools exist to make this a lot easier.
By far the best tool I’ve come across yet is https://splunk-sizing.appspot.com/. This doesn’t cover every possible combination (such as some daily consumption values or accounting for data model acceleration), but it will quickly allow you to get a guesstimate of how much space a client will need for their deployment.
That said, just like in school, be able to show your work and defend your calculations if a client has questions.