Splunking Responsibly: Considerations for managing limited system resources

A distributed Splunk environment, especially a large one, can turn into quite a sprawl of systems to manage. This blog post is meant to help you make informed decisions about how to allocate resources across your Splunk systems.

When designing a Splunk deployment, especially a large distributed one, it can seem like Splunk was purely developed by hardware vendors to sell more servers. While this is not actually true (conspiracy theorists aside), a typical distributed Splunk environment can result in quite a sprawl of systems to manage.

A sample Splunk deployment

For the sake of example, let’s consider an imaginary new Splunk customer. This customer is looking to use Splunk for compliance and security purposes, and wants to use the Splunk Enterprise Security Suite (Splunk ES) in order to generate actionable security alerts from their log data. To get started, they want to be able to search their data and have security alerting, with a handful of Splunk users for each environment. However, they also expect Splunk usage to take off once everyone sees how awesome of a tool it is, and want to have room for additional growth included in the original design. They also have multiple sites and believe Splunk will need to support a failover to DR once it becomes a business critical application.

After a sizing exercise, this customer decided upon purchasing a modest 500gb/day Splunk license with Splunk Enterprise Security (ES), with the option of expanding this in the future if data ingestion increases.  

Sizing the environment

Indexers

Depending on who you ask, the recommendation for an ES deployment is that a single indexer handle no more than 75-100GB of data per day, so that it can keep up with incoming data and service the significant search volume that comes with ES and its associated correlation searches and data models. Regardless of what sizing number you use, you should never assume that any single indexer will be able to handle more than 100GB per day in an ES environment. This means our 500GB/day environment will require 5-7 indexers, each meeting or exceeding the minimum system requirements, which at the time of writing are 16 cores and 32GB of RAM per indexer. (For more on Splunk Enterprise system requirements check out the docs). With this number of indexers, you’ll likely be deploying some form of Splunk indexer replication clustering, which requires a cluster master. For budgeting purposes, we planned for the high end of 7 indexers and decided to get an extra indexer up front to allow for some additional capacity. So right off the bat, we’re already at 9 servers (8 indexers plus the cluster master), just for data retention.
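The indexer math above can be sketched in a few lines of Python. This is only an illustration of the sizing arithmetic in this post, not an official Splunk sizing tool; the function name and variable names are made up for the example.

```python
import math

def indexers_needed(license_gb_per_day, gb_per_indexer_per_day):
    """Round up: a fraction of an indexer still means one more server."""
    return math.ceil(license_gb_per_day / gb_per_indexer_per_day)

low = indexers_needed(500, 100)   # optimistic: 100 GB/day per indexer
high = indexers_needed(500, 75)   # conservative: 75 GB/day per indexer

spare_indexers = 1    # extra capacity bought up front
cluster_masters = 1   # needed once indexer clustering is in play

# Plan for the conservative number, then add the spare and the cluster master.
total = high + spare_indexers + cluster_masters
print(f"{low}-{high} indexers, {total} servers for the indexing tier")
```

Planning around the conservative 75GB/day figure is what gets us to 7 indexers before the spare is added.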

Search Heads

But Splunk requires more than just indexers - we still need to search. It is generally considered best practice to separate premium apps onto a different search head, especially since these apps often require a large number of searches running on a nearly constant basis. This ES deployment will require at least two search heads: one for Enterprise Security and one for ad-hoc searches. This separates the heavy ES search load from other ad-hoc searches and reports, and allows you to run additional Splunk apps that would not interact well with ES when run on the same system. There’s another two servers.

Data Ingestion Tier

But wait - there’s more. This organization probably has some syslog data coming in. It’s not considered best practice to send syslog directly to Splunk. (Splunk has some additional "Tips and Tricks" for using syslog-ng on their website, check it out). Instead Splunk recommends deploying systems running syslog-ng to collect syslog and then have Splunk configured to read the log files written by syslog-ng. Typically, a Splunk Universal Forwarder or Heavy Forwarder will be deployed for this task and dedicated for syslog. Since UDP syslog data is stateless, you’ll often want to distribute this syslog across multiple syslog receivers so that there is failover capability in the event of a system failure (and also the ability to perform maintenance on one system without losing log data). So there’s another two systems (not counting the load balancer).

Deployment Server

Next, we have configuration management. It’s a best practice to use a Splunk Universal Forwarder to collect data whenever possible. An organization with a 500GB license likely has at least a couple of other servers, each of which would be running the Splunk UF. In order to manage these forwarders, a Splunk Deployment Server (DS) is typically used - each UF checks in with the DS on a regular basis, looking for new or updated configurations. Which is another server. For those of you keeping score, we’re up to 14.

Search Head Cluster

Splunk usage is contagious. Everyone is going to want to use Splunk for more and more tasks within your organization. More users means more interactive and scheduled searches, which may push your single search heads to the limit. To handle this, you deploy a search head cluster (SHC). This type of clustering technology is based on the Raft protocol, and requires a minimum of 3 search heads to ensure enough are available to elect one in charge (known as the Captain). In order to manage configurations on the SHC, a separate server, known as a deployer, is also required. So that’s 3 more servers in addition to the search head you already had (two more search heads plus the deployer). And if we want to do the same for ES - you guessed it, two additional search heads and a separate deployer. 20 servers.
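To see why three members is the floor, here is a rough sketch of the majority math behind a Raft-style election. It assumes a captain needs votes from a strict majority of all cluster members; the function names are hypothetical and only there for illustration.

```python
def votes_needed(members):
    """Strict majority of the cluster."""
    return members // 2 + 1

def failures_tolerated(members):
    """Members that can be down while an election can still succeed."""
    return members - votes_needed(members)

for n in (2, 3, 5):
    print(f"{n} members: {votes_needed(n)} votes needed, "
          f"{failures_tolerated(n)} failure(s) tolerated")
```

With two members, losing either one makes a majority impossible, so a third member is the smallest cluster that can survive a single failure.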

Index Cluster

Since Splunk is storing all of your critical business data, keeping you in compliance, and running so many critical business processes, you want to ensure that an effective DR strategy is in place. Splunk multi-site clustering is a great way to handle this - it allows your indexers to replicate data between locations, such as a primary and secondary datacenter. For the sake of this example, let’s assume that Splunk needs to perform identically in the event of a DR event. You set up a multisite cluster to match production - 8 more indexers. Fortunately, you can reuse the same cluster master node you already had. 28 servers.

Monitoring Console

Finally, you want to ensure that your whole Splunk environment is running smoothly. Fortunately, Splunk provides a built in Monitoring Console (formerly Distributed Monitoring Console) to do just that. Since this system needs to directly run searches and REST commands against all of the Splunk systems, it needs to be a dedicated search head - so there’s another system.  

That brings our total to 29 servers for a 500gb/day Splunk environment running ES with a heavy search, alert, and user volume, and some room for future growth. And there are plenty of possible cases I haven’t even covered yet, such as having syslog at the DR site - we don’t have syslog receivers for that in the design currently. We also haven’t considered the possibility of spanning the Search Head Clusters across multiple sites (or having completely different SHCs for each site), both of which are options for adding search redundancy. The possibilities are pretty much endless.
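For those keeping score, the running count from the sections above adds up like this. The role labels are just descriptive strings for this example; the numbers come straight from the walkthrough.

```python
# Server count per role, in the order they were introduced above.
servers = {
    "indexers, primary site (7 + 1 spare)": 8,
    "cluster master": 1,
    "search heads (ES + ad-hoc)": 2,
    "syslog receivers": 2,
    "deployment server": 1,
    "ad-hoc SHC (2 more members + deployer)": 3,
    "ES SHC (2 more members + deployer)": 3,
    "indexers, DR site": 8,
    "monitoring console": 1,
}

total = sum(servers.values())
for role, count in servers.items():
    print(f"{count:>2}  {role}")
print(f"{total:>2}  total")
```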

For a large organization, 29 servers to run a mission-critical application might not be that big of a deal. However, plenty of organizations don’t have this level of Splunk usage but still want to leverage many of the features outlined in this example. In fact, many of these Splunk options integrate well into a smaller deployment with similar requirements, albeit at a lower usage level.

I don’t have a 500gb/day license - why does any of this matter to me?

Since a Splunk Enterprise license is sold based on data indexed per day, any Splunk Enterprise license includes all of these features - so if you want to design a 20gb/day environment to have the same redundancy and capacity as our 500gb/day example, go right ahead. Just don’t come to me crying when the hardware purchase doesn’t get approval.  

All kidding aside - a common situation I face when working with clients is effectively managing resources. As much as we love complicated multi-terabyte per day deployments, those are significantly less common than much smaller ones, in the 20-50GB/day range.  

How do we keep the Splunk functionality we want while not dedicating half of the computing capacity in the datacenter to Splunk?

You have a couple of options - the first would be Splunk Cloud. Splunk Cloud is a great way to get all of the functionality you love about Splunk, without having to manage the servers yourself. In this example, I’m going to assume we are working with an on-prem Splunk deployment, where all of the indexers and search heads are systems under your control.  

In order to increase server numbers without purchasing an equal amount of hardware, virtualization is frequently employed. Many Splunk systems work very well when virtualized (such as search head cluster deployers and cluster masters), but others are better suited to being on bare-metal hardware or at least having firm resource reservations when virtualized (such as indexers and search heads).

Splunk Reference Hardware

Splunk Reference Hardware is broken down into two categories: reference hardware for a single-instance deployment, and reference hardware for a distributed deployment. A single-instance Splunk deployment is one in which all of your Splunk roles exist on one server. Splunk reference hardware for a single-instance deployment, at the time of this writing, is a system with 12 CPU cores and 12GB of RAM (which we refer to as a 12 x 12). This system is recommended for deployments where less than 2GB of data per day is being indexed, and where only two users need to be signed in and using Splunk at any one time. (Visit Splunk's "Capacity Planning Manual" for more on determining when to scale your Splunk Enterprise deployment). So, if you have more than 2 users who want to be using Splunk at the same time, or you plan to index more than 2GB of data per day, a single-instance deployment is not for you.

A distributed deployment is a set of Splunk Enterprise instances working together, which allows for a more scalable Splunk solution. It’s exactly what we described in the first portion of this blog, where we had 29 servers working together rather than 1. Reference hardware for a distributed deployment is broken down into Search Heads and Indexers. In a distributed deployment, Splunk recommends a Search Head with 16 CPU cores and 12GB of RAM. As for Indexers, those are broken down into three different levels of recommendations shown below:

Distributed Deployment - Indexer Reference Hardware:

If there aren’t any specific system requirements (such as the aforementioned 16x32 for ES indexers and search heads), then the hardware specifications listed here would be what a standard Splunk system is recommended to look like. For resource-constrained Splunk systems, you may want to look at the Minimum Recommendations to start out with, and grow from there.

For a smaller deployment, this level of resources per system may not be something that can be practically virtualized. In a lot of cases, it makes sense to dedicate physical hardware to roles that require it (indexers and, depending on your environment, search heads as well) and supplement the other components of a distributed environment with virtual machines. If we have a limited number of resources in the virtual environment that can be dedicated to Splunk, how do we decide how to allocate those resources without impacting the Splunk experience?

Making the right decision

In a resource-constrained virtualization environment, there may be plenty of pressure to not assign the recommended resources to any given Splunk instance. Fortunately, some Splunk systems generally will show lower utilization compared to others, but there are caveats associated with reducing the resources allocated to each.  

In general, we do not recommend reducing the system resources of any Splunk instance below the recommended reference hardware. Most of the systems that show lower utilization have usage patterns that are bursty in nature, as opposed to a constant load. So even though a server may show low usage at times, the potential for bursts and growth in a deployment is very real.

Ad hoc search heads

The ad-hoc search head (or ad-hoc search head cluster) consists of search heads not running a premium application such as Splunk Enterprise Security (ES), Splunk App for PCI Compliance, or Splunk IT Service Intelligence (ITSI). This is where any ad-hoc reports or searches will run. In the case of a search head cluster, members work together to handle scheduled searches, whereas an ad-hoc search is executed locally by whatever member the user is directed to by your load balancer. Overall, the search load on either a single search head or a search head cluster varies with scheduled search load and user volume. Generally speaking, the load on an ad-hoc search head is lower than on search heads for premium apps.

One of the key factors to consider is the scheduler. The search head scheduler is the backbone of your search head as it prioritizes concurrently running jobs. If you saturate your Splunk system with searches, you’ll typically run into issues like searches skipping and not completing. This is the primary reason why it is recommended to have separate Search Heads - one for ad-hoc searches, and one for premium apps. Inherently, premium apps are going to come with a healthy amount of scheduled searches so it’s best to split those two apart.

If a search head or search head cluster is consistently seeing skipped searches, something is wrong. There are many reasons this can happen, including inefficiently configured searches (real-time ones are the biggest culprit) or a potential sizing issue. Always look for skipped searches when troubleshooting any search performance related issue. We have a fairly comprehensive read on this topic that can be found here.

That said, not following the Splunk reference hardware for a Search Head and allocating too few resources will decrease the number of searches that can run concurrently on a search head. This is due to how Splunk calculates limits on the scheduler. The number of CPUs is used to calculate the number of historical searches that can be run, as well as several other limits.

Reducing the CPU cores would significantly reduce the search capacity and potentially result in skipped searches, but the risk would be localized to whatever search head had reduced CPU allocations.
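As a rough sketch of how core count feeds into these limits, the calculation looks something like the following. The parameter names mirror limits.conf settings (max_searches_per_cpu, base_max_searches, and the scheduler’s max_searches_perc), and the default values shown are the defaults as we understand them at the time of writing; treat both as assumptions and check limits.conf for your version. The function itself is hypothetical, and rounding the scheduler share down is also an assumption.

```python
def search_limits(cpu_cores, max_searches_per_cpu=1, base_max_searches=6,
                  max_searches_perc=50):
    """Estimate concurrent search limits from the number of CPU cores."""
    # Total concurrent historical searches allowed on the search head.
    historical = max_searches_per_cpu * cpu_cores + base_max_searches
    # Share of that total the scheduler may use for scheduled searches.
    scheduled = historical * max_searches_perc // 100
    return historical, scheduled

for cores in (16, 8, 4):
    historical, scheduled = search_limits(cores)
    print(f"{cores:>2} cores: {historical} concurrent historical searches, "
          f"{scheduled} available to the scheduler")
```

Under these assumptions, dropping from 16 cores to 4 cuts the scheduler’s share from 11 concurrent searches to 5, which is where skipped searches start to show up.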

If you are considering running a search head below the reference hardware recommendations, keep a few things in mind.

  • Ensure your search requirements are very light and data volume to be searched is low
  • Don’t have a lot of Splunk users that need to use the system concurrently
  • Don’t have a ton of saved searches
  • Don’t run a premium app (ES, ITSI, etc.) on a search head with resources below the Reference Hardware
  • Accept the risk that search performance may suffer and scheduled searches may be skipped

With these limitations, it may not make sense to jeopardize the usability of a Splunk instance by reducing the resources of a search head.

Cluster Master Node

The master node is responsible for directing replication and search traffic in the environment. It is a critical system in terms of maintaining environment stability and redundancy. Realistically, the role of the server is generally going to be pretty light except when performing indexer bucket fix-up tasks. We would not recommend skimping resources on this box even if it appears to have low utilization since its role is inherently important for ensuring integrity of the data in Splunk across the cluster. 

Deployment server

The deployment server is a system that we commonly see undersized. Ryan O'Connor has a blog post on our site that covers this in pretty good depth - Splunk Answers: Dealing with undersized deployment servers.

To summarize - the performance of the DS is highly dependent on the number of clients you have checking in, and the resources allocated to it impact the ability to make configuration changes to Universal Forwarders. Starving this system for resources can result in issues with pushing out configuration changes to forwarders (I have seen undersized systems become unusable in this situation). Increasing the phoneHomeIntervalInSecs setting can be an effective way to better spread out the load, but it does delay the propagation of changes to UFs.

As an example, with ~500 clients, a deployment isn’t at the 2,000 client interval that Splunk uses for their 12c/12GB recommendation, but this is still a sizable number. If resources are reduced on the deployment server (which should never be less than 4c/4GB), we would want to pair this with increasing the phone home interval - to something significantly higher, like 5-10 minutes.
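To get a feel for what the phone home interval does to the load on the DS, here is a back-of-the-envelope sketch. It assumes check-ins are spread evenly across the interval and that the starting point is a 60 second interval (which we believe is the default for phoneHomeIntervalInSecs in deploymentclient.conf); the function name is made up for the example.

```python
def checkins_per_second(clients, phone_home_interval_secs):
    """Average check-in rate if clients are evenly spread over the interval."""
    return clients / phone_home_interval_secs

clients = 500
for interval in (60, 300, 600):  # assumed default, 5 minutes, 10 minutes
    rate = checkins_per_second(clients, interval)
    print(f"{interval:>3}s interval: {rate:.2f} check-ins/sec")
```

The trade-off is the one described above: at a 10 minute interval the DS handles roughly a tenth of the check-ins, but a configuration change can take up to 10 minutes to reach a given forwarder.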

Heavy Forwarders

Heavy forwarders are systems that are primarily used for tasks such as receiving syslog or calling APIs. In the case of syslog, two components generally work together to ingest logs - syslog-ng and Splunk (either a UF or an HF). As a result, system load ends up being dependent on the volume of syslog and on the choice you make between HF and UF (https://www.splunk.com/blog/2016/12/12/universal-or-heavy-that-is-the-question.html). There is an inherent risk of data loss if a Heavy Forwarder is starving for resources and unable to consume all incoming syslog. This is due to the stateless nature of some syslog (UDP vs TCP), and also due to the fact that Splunk may not be able to process all of the data as fast as it would need to, resulting in blocked queues.

So how can you be more lean with a Heavy Forwarder? With a lower volume of syslog and your Universal Forwarders sending logs directly to your indexers (and not leveraging a Heavy Forwarder), we have seen systems smaller than 8c/8GB work successfully at some clients. Typically these are environments with licenses below 50GB/day, where syslog data makes up a small percentage of the overall data ingested.

We certainly would not recommend anything smaller than 4c/4GB for a Heavy Forwarder, and you must understand that this number is well below the Splunk reference hardware, so your results may vary. This type of system specification would require some potentially advanced management and fine tuning of data ingestion. One thing that has helped systems sustain proper data ingestion and cut down on resources is the approach in the aforementioned blog on using a Universal Forwarder in place of a Heavy Forwarder. It’s a detailed topic, but we did include it because it will give you the ability to move a lot of the heavy lifting of your data to your indexers and allow for a leaner syslog collector.

Where to not cut corners

Some Splunk systems should always meet or exceed the minimum system requirements - this is especially important for indexers (don’t ignore the 800-1,200 IOPS requirement, SSDs are your friends here) and search heads (especially those running premium apps such as Splunk Enterprise Security). While it’s tempting to think your smaller environment will do fine on an indexer with a lower capacity or a search head with less than 16 cores - ES still runs the same type of searches regardless of your license size. Out of the box, it even runs some searches in real-time. The number one reason we see for poor ES performance (and even a desire to abandon the product entirely) is lack of sufficient resources on indexers and search heads powering ES.

Conclusion

Hopefully this helps you make an informed decision on what changes to make to the resources allocated to your Splunk systems. If you have any specific questions about your deployment situation, don’t hesitate to reach out. Thanks!


