Brief Note: Check out Splunk Docs if you haven't already
Lately, I’ve encountered more than one environment with a Deployment Server that I would classify as “undersized”. How do I know it’s undersized? Great question. Let’s go to the Splunk docs.
Generally speaking, Splunk Docs is going to be your one source of truth for the latest official information about any apps, add-ons, or Splunk components. This might not be the popular opinion, but there are a lot of docs from Splunk I really like, and this one about Deployment Server sizing is one of them.
Sizing Challenge: No two Deployment Servers look the same
Sizing a Deployment Server isn’t the easiest task. I’ve seen quite a few people stumble on this, largely because sizing a Deployment Server isn’t an exact science. Customization is inherently built in: a Deployment Server deploys specific apps that fit a unique business need, and it deploys them to servers and clients that are managed according to each business’s IT policies and procedures.
As a result, we have to give the caveat that “your mileage may vary” across the answers in this post. However, this post should serve as a general guideline for you and answer some common Deployment Server questions. That is, of course, assuming you aren’t doing anything too outlandish with your Deployment Server.
Q: When do you need to separate your Deployment Server from a Search Head or Indexer?
A: Any time you have over 50 clients.
Q: What resources should you dedicate to your Deployment Server?
A: Though the doc above doesn’t explicitly state that “you should have 12 CPU and 12 GB of RAM”, you should. The doc alludes to it, but that's about the best we’re going to get.
Q: What if you feel 12 CPU/12GB of RAM is too large a system and can’t justify dedicating that many resources to a box that isn’t even performing searches?
A: Please refer to the previous answer. You should have 12 CPU/12GB of RAM.
Note: If you’re thinking of cutting corners (which is what you’d be doing if you chose to go with anything less), the lowest I would recommend is 8 CPU/8GB of RAM. And, if you do that, keep in mind that the 12 CPU/12GB of RAM number was tested with default Deployment Server settings for up to 2000 clients. So, you shouldn’t have nearly that number of clients checking in to your undersized DS, unless you’re just looking for heartache.
Q: So, are you saying I’ll be completely fine with 8 CPU/8GB of RAM?
A: No. Please don’t take my recommendation as the one and only source of truth. This is spoken from experience with implementing Splunk for a lot of different environments and from what I’ve seen people get by with. 8 CPU/8GB of RAM may work fine for your environment, especially when you have a great MSP to manage it for you; however, I wouldn’t recommend it in general.
Q: How are people getting by with 8 CPU/8GB of RAM when this page seems to indicate that 12 CPU/12GB of RAM is the recommended spec?
A: If you read the doc, you’ll see mentions of things like “If you are deploying to more than 2000 clients, you might experience significantly improved performance by increasing the phone home interval to five minutes (300 seconds).” What this means is that there are some improvements you can make to fine-tune your Deployment Methodology.
Fine Tuning: 3 Deployment Methodology improvements you can make
1.) Update your phoneHomeIntervalInSecs. If you take a look at the docs for the deploymentclient.conf file, you’ll see a couple of key attributes.
phoneHomeIntervalInSecs = <number in seconds>
* Defaults to 60.
* Fractional seconds are allowed.
* This determines how frequently this deployment client should check for new content.

handshakeRetryIntervalInSecs = <number in seconds>
* Defaults to one fifth of phoneHomeIntervalInSecs
* Fractional seconds are allowed.
* This sets the handshake retry frequency.
* Could be used to tune the initial connection rate on a new server
These two attributes are excellent ways to keep your clients from accidentally overwhelming an undersized Deployment Server. As stated in the doc, you should raise phoneHomeIntervalInSecs to at least 300 seconds if your deployment grows beyond 2000 clients. I would even recommend doing this if you have fewer clients than that. A 5-minute check-in interval is still completely acceptable.
As you can see from the definition, handshakeRetryIntervalInSecs defaults to ⅕ of phoneHomeIntervalInSecs unless you set it explicitly. So, generally speaking, increasing phoneHomeIntervalInSecs has two effects. Your clients check in less often, which saves CPU cycles on the DS. And during the initial startup of a DS, or after a “./splunk reload deploy-server”, your clients won’t all be trying to perform handshakes within such a small timeframe.
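As a rough sketch of this first improvement, a client-side deploymentclient.conf that raises the phone home interval to five minutes might look like the following. The two interval attributes are the ones described above; the host name ds.example.com and the explicit 60-second handshake retry are placeholders assumed for illustration, so adjust them for your environment.

```ini
# deploymentclient.conf on each deployment client (for example, a Universal Forwarder)
[deployment-client]
# Check in every 5 minutes instead of the 60-second default.
phoneHomeIntervalInSecs = 300
# Optional: if omitted, this defaults to one fifth of phoneHomeIntervalInSecs.
# 60 is an illustrative value (it matches that default for a 300-second interval).
handshakeRetryIntervalInSecs = 60

[target-broker:deploymentServer]
# Hypothetical host name; 8089 is the default management port.
targetUri = ds.example.com:8089
```

To put numbers on it: with 2000 clients, moving from a 60-second to a 300-second interval drops the average check-in rate from roughly 33 per second (2000 / 60) to roughly 7 per second (2000 / 300).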
2.) Try to limit the size of the apps being deployed to your clients. One common practice: if you need to deploy an add-on, deploy the add-on on its own. Then take any inputs.conf file associated with the add-on (which may get updated more frequently), place it in its own app, and deploy that app separately. That way, your clients have less to download when you need to push out an update.
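One way to sketch that split is shown below. The app names (Splunk_TA_example and org_example_inputs), the server class name, and the whitelist pattern are all hypothetical and chosen only for illustration; treat this as a starting point rather than a definitive layout.

```ini
# Layout on the Deployment Server (app names are hypothetical):
#   $SPLUNK_HOME/etc/deployment-apps/Splunk_TA_example/     <- the add-on, rarely changes
#   $SPLUNK_HOME/etc/deployment-apps/org_example_inputs/
#       local/inputs.conf                                    <- small, changes often

# serverclass.conf
[serverClass:example_linux_hosts]
whitelist.0 = linux-*

# The add-on itself: clients download it again only when it changes.
[serverClass:example_linux_hosts:app:Splunk_TA_example]
restartSplunkd = true

# The inputs-only app: a small download for clients on each change.
[serverClass:example_linux_hosts:app:org_example_inputs]
restartSplunkd = true
```

With a layout like this, changing an input means clients re-download only the small inputs app rather than the whole add-on.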
3.) Do not restart or reload your entire DS for an app update. When you update a Deployment App, there is no need to restart the entire Deployment Server. Instead, reload only the server classes affected by the change. This looks like the following on a Linux machine:
$SPLUNK_HOME/bin/splunk reload deploy-server -class <your_server_class_here>
This will only reload the classes that you specify, which saves you from having every single one of your Universal Forwarders try and perform a new handshake.
Hope this helps get rid of some headaches!
I’m hopeful these answers will help eliminate some of the struggle surrounding Deployment Servers. However, I’m sure I haven’t answered everything on this topic, so if you have more questions or something additional to contribute, feel free to leave us a comment.