Control Your DNS: Using Splunk to See Suspicious DNS Activity

A host suddenly spewing a ton of DNS queries can be an indicator of either compromise or misconfiguration. Toby's tutorial takes a look at a method that will alert you when a machine deviates from its normal behavior, so you can easily spot suddenly voluminous DNS traffic.


  • Toby Deemer
  • May 07, 2018
  • Tested on Splunk Version: N/A

Building on the stock ES "Excessive DNS Queries" search to look for suspicious volumes of DNS traffic

Starting from the assumption that a host suddenly spewing a ton of DNS queries can be an indicator of either compromise or misconfiguration, we need a mechanism to tell us when this happens. Here we will look at a method to find suspicious volumes of DNS activity while trying to account for normal activity.

Splunk ES comes with an "Excessive DNS Queries" search out of the box, and it's a good starting point. However, the stock search only looks for hosts making more than 100 queries in an hour. This presents a couple of problems. For most large organizations with busy users, 100 DNS queries in an hour is an easy threshold to break. Throw in some server systems doing backups of remote hosts or multifunction printers (MFPs) trying to send scans to user machines, and we suddenly have thousands of machines breaking that 100/hour limit for DNS activity. This makes the search not only excessively noisy, but also very time-consuming to tune into something an analyst can act on or even want to look at.

What we really need is a way to look at how a machine typically behaves during its normal activity, and then alert us when that machine suddenly deviates from its average. The solution here is actually pretty simple once it's written out in Splunk SPL (search processing language). So we need to do a few things (sketched in SPL just after this list):

  • Determine a working time window to use in calculating the average
  • Establish a baseline for a machine's DNS activity
  • Compare the established average against individual slices of the whole window
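
Put in SPL terms, the skeleton of what we are about to build looks something like this. This is a stripped-down sketch only: it uses an eight-hour averaging window and the 2.5x multiplier we settle on later, and it leaves out the exclusions and tuning that the rest of this post adds piece by piece.

| tstats summariesonly=true allow_old_summaries=true
  	count from datamodel=Network_Resolution.DNS
  	where "DNS.message_type"="QUERY"
  	by "DNS.src", _time span=60m
| rename "DNS.src" as "src"
| bucket _time AS bucket span=8h
| eventstats avg(count) as avg_count by bucket src
| where count > (avg_count * 2.5)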

Let’s Look at a Real-Life Example

Take an average workday of 8 hours. Conveniently, a 24-hour day also splits into three 8-hour chunks, so this is a safe window to use as a first draft. This window can be adjusted to better suit the needs of any specific environment. Make it wider for a less sensitive alert; make it narrower for a more sensitive alert. What we need to do is look at that eight-hour span and get a count of DNS events per host, per hour.

Search Part 1: Pulling The Basic Building Blocks

In this first part of our search, we are pulling our basic building blocks, including:

  • Hosts making the DNS queries
  • Original sourcetype of the DNS events (useful for later drilldown searching)
  • Starting timestamp of each hour-window
  • DNS server(s) handling the queries
  • Total count for that query src within that hour

| tstats summariesonly=true allow_old_summaries=true
  	values(host) as host
  	count from datamodel=Network_Resolution.DNS
  	where "DNS.message_type"="QUERY"
  	by "DNS.src",sourcetype, _time span=60m

The results come back as one row per query source, per sourcetype, per hour, with the count of queries that source made during that hour.

Search Part 2: Starting to Clean Up Results

From here, we need to do a couple things to clean up our results. The Network Resolution Data Model includes all DNS traffic that it sees, so if your infrastructure is properly set up, that can easily be millions of events a day. Most of that is expected, so we don't need to care about it in this search. Let's drop those out:

| tstats summariesonly=true allow_old_summaries=true
  	values(host) as host
  	count from datamodel=Network_Resolution.DNS
  	where "DNS.message_type"="QUERY"
  	AND DNS.query!="*.hurricanelabs.net"
  	AND DNS.query!="*.in-addr.arpa"
  	AND DNS.query!="*.ubuntu.com"
  	AND DNS.query!="*.splunkcloud.com"
  	by "DNS.src",sourcetype, _time span=60m

This takes our results from 2906 down to 2791. Not a huge improvement, but it does drop the things we know for sure are expected and not worth an alert trigger. Next, we'll constrain the search to hosts that have made more than 100 queries in any hour (like the original ES search does), and we'll also drop out a common noisy host that does lots of DNS lookups - the backup server. This takes us from 2791 down to 2315. A bit better, but still a big number.

Notice that most of the counts on same-host line items are very similar, often identical. That is what will become our average. In a later step, I'll create a macro that we can use to drop out known query sources and another to drop out known domain lookups, so that these exclusions don't have to be contained in the search itself.

Search Part 3: Starting to Look at Averages

Now that we have our query counts broken out per host, per hour, we want to look at a wider window to average these counts, which brings us back to that eight-hour timeframe I mentioned earlier. Let's add that:

| tstats summariesonly=true allow_old_summaries=true
  	values(host) as host
  	count from datamodel=Network_Resolution.DNS
  	where "DNS.message_type"="QUERY"
  	AND DNS.query!="*.hurricanelabs.net"
  	AND DNS.query!="*.in-addr.arpa"
  	AND DNS.query!="*.ubuntu.com"
  	AND DNS.query!="*.splunkcloud.com"
  	by "DNS.src",sourcetype, _time span=60m
| rename "DNS.src" as "src"
| search count>100 src!=10.166.40.79
| bucket _time AS bucket span=8h

We use the bucket command here to assign each line item to an eight-hour window. From there, it's an easy average function to see whether a host has deviated from its established pattern. We'll use the eventstats command to run an average per src and bucket:

| tstats summariesonly=true allow_old_summaries=true
  	values(host) as host
  	count from datamodel=Network_Resolution.DNS
  	where "DNS.message_type"="QUERY"
  	AND DNS.query!="*.hurricanelabs.net"
  	AND DNS.query!="*.in-addr.arpa"
  	AND DNS.query!="*.ubuntu.com"
  	AND DNS.query!="*.splunkcloud.com"
  	by "DNS.src",sourcetype, _time span=60m
| rename "DNS.src" as "src"
| search count>100 src!=10.166.40.79
| bucket _time AS bucket span=8h
| eventstats avg(count) as avg_count by bucket src

Search Part 4: Finding The Anomaly

And that brings us to the grand finale. We want to look for moments when a host has far outstripped its average. So let's compare:

| tstats summariesonly=true allow_old_summaries=true
  	values(host) as host
  	count from datamodel=Network_Resolution.DNS
  	where "DNS.message_type"="QUERY"
  	AND DNS.query!="*.hurricanelabs.net"
  	AND DNS.query!="*.in-addr.arpa"
  	AND DNS.query!="*.ubuntu.com"
  	AND DNS.query!="*.splunkcloud.com"
  	by "DNS.src",sourcetype, _time span=60m
| rename "DNS.src" as "src"
| search count>100 src!=10.166.40.79
| bucket _time AS bucket span=8h
| eventstats avg(count) as avg_count by bucket src
| where count > (avg_count * 2.5)

So here we see that one host, in the 9am window of 04/30/2018, made 2124 DNS queries, whereas this machine's average is only 775. What is this spike in activity? If we then add a field to display the queries made by the host, we can see that this machine belongs to a user who is in need of an ad-blocker. In this case, it shows us that the typical morning web browsing also indirectly pulls in thousands of other DNS queries in order to satisfy all the ad-network traffic that's hosted on an average website. So, small side note: a corporate-wide ad-blocker policy will not only keep your user machines safer, it will lessen the load on your DNS infrastructure and make it easier to pinpoint these anomalies.

I would not recommend adding this display function to an alert search, as it will make the notable event output /very/ large and also slow the search down significantly. Use it as a visibility tool /after/ the alert trips. Here's the search with the query field added:

| tstats summariesonly=true allow_old_summaries=true
  	values(host) as host
  	values(DNS.query) AS query
  	count from datamodel=Network_Resolution.DNS
  	where "DNS.message_type"="QUERY"
  	AND DNS.query!="*.hurricanelabs.net"
  	AND DNS.query!="*.in-addr.arpa"
  	AND DNS.query!="*.ubuntu.com"
  	AND DNS.query!="*.splunkcloud.com"
  	by "DNS.src",sourcetype, _time span=60m
| rename "DNS.src" as "src"
| search count>100 src!=10.166.40.79
| bucket _time AS bucket span=8h
| eventstats avg(count) as avg_count by bucket src
| where count > (avg_count * 2.5)

Search Part 5: Adding Mechanisms for Ease of Maintenance

Now let's add those macros I mentioned to help us clean up the search and more easily manage any necessary exclusions we want to account for. In this search, we have two things we want to exclude: known domain queries, and heavy query sources that we know are not an issue. To do this, we'll create two macros and drop them into the tstats portion of our search:
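
For reference, a minimal sketch of what those two macros could look like in macros.conf is below. The stanza names match how they're called in the search that follows; the exclusion values are just the example domains and the backup-server IP we filtered out earlier, so substitute your own.

# macros.conf (sketch - adjust the values to your environment)
[known_dns_query(1)]
args = field
definition = ($field$="*.hurricanelabs.net" OR $field$="*.in-addr.arpa" OR $field$="*.ubuntu.com" OR $field$="*.splunkcloud.com")

[known_dns_src(1)]
args = field
definition = ($field$="10.166.40.79")

You can also create these through the UI under Settings > Advanced search > Search macros instead of editing the conf file directly.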

The way these macros work is that, when you call one in a search, the field name you put in the macro parentheses is substituted for the $field$ argument in the macro's definition, so the values supplied in the config get matched against that field:

| tstats summariesonly=true allow_old_summaries=true
  	values(host) as host
  	count from datamodel=Network_Resolution.DNS
  	where "DNS.message_type"="QUERY"
  	NOT `known_dns_query("DNS.query")`
  	NOT `known_dns_src("DNS.src")`
  	by "DNS.src",sourcetype, _time span=60m
| rename "DNS.src" as "src"
| search count>100
| bucket _time AS bucket span=8h
| eventstats avg(count) as avg_count by bucket src
| where count > (avg_count * 2.5)

So in the above code we've dropped out all the DNS queries and query-source hosts that we'd previously had to exclude directly in the search. And because the exclusions now live in the macro config, we can easily add future entries that we may want to drop out of our results. Say we add a new primary DNS host or a new mail server: just drop the IP into a new line item in the "known_dns_src" macro, and your correlation search is automagically updated. I've used this same approach to easily drop RFC1918 addresses out of searches when I'm looking for external address activity in a log type or datamodel. This method also carries the added benefit that it works in tstats searches as well as normal searches, so you're less likely to trip up on the very specific logic formatting in tstats functions.

If you want to do something similar, that macro is `hdsi_rfc1918(1)`, defined as:

($field$="10.*" OR
$field$="172.16.*" OR
$field$="172.17.*" OR
$field$="172.18.*" OR
$field$="172.19.*" OR
$field$="172.20.*" OR
$field$="172.21.*" OR
$field$="172.22.*" OR
$field$="172.23.*" OR
$field$="172.24.*" OR
$field$="172.25.*" OR
$field$="172.26.*" OR
$field$="172.27.*" OR
$field$="172.28.*" OR
$field$="172.29.*" OR
$field$="172.30.*" OR
$field$="172.31.*" OR
$field$="192.168.*")

With macro argument "field", called in searches like:

`hdsi_rfc1918("All_Traffic.src")`
NOT `hdsi_rfc1918("All_Traffic.dest")`
`hdsi_rfc1918("dest_ip")`
etc

Search Part 6: Looking Further Into Our Results

Looking back at our DNS traffic from this host, though, we see all that ad-network traffic. It's not an emergency, but it is something we'll want to look at cleaning up. But how can we parse through all that ad-network noise and see if there's something more malicious hiding in there? Well, we often see machine-generated domains pop up in DNS traffic when something's not right on a host, and these machine-generated domains are typically quite long. So we can parse the queries here and look for long domain segments.

Take our existing search, and add some functions to the bottom:

| tstats summariesonly=true allow_old_summaries=true
  	values(host) as host
  	values(DNS.query) AS query
  	count from datamodel=Network_Resolution.DNS
  	where "DNS.message_type"="QUERY"
  	NOT `known_dns_query("DNS.query")`
  	NOT `known_dns_src("DNS.src")`
  	by "DNS.src",sourcetype, _time span=60m
| rename "DNS.src" as "src"
| search count>100
| bucket _time AS bucket span=8h
| eventstats avg(count) as avg_count by bucket src
| where count > (avg_count * 2.5)
--new lines below--
| mvexpand query
| `truncate_domain(query, domain)`
| search NOT [| inputlookup alexa_lookup_by_str where rank<=10000 | fields + domain] NOT (query=*amazon* OR query=*amazonaws*)
| eval p=split(query, ".")
| mvexpand p
| eval l=len(p)
| where l>=25
| rename p AS long_segment l AS segment_length
| fields + src sourcetype _time domain long_segment segment_length count

What we do there is an mvexpand to split our previous multi-value query field into one line item per query. Then we use the truncate_domain macro to get a clean query domain without the extra URL characters. We also drop out Amazon and the Alexa top 10,000 sites to lessen the typical noise; the more you can lower your noise floor, the more you can focus on the signals. Next, we split each query up at the dots so we can look at each domain segment on its own. Stay with me. And /then/ we look for any segment that is 25 characters or longer. Finally, we keep just the fields we care about, and we see that there were two domains out of the ~2100 results from this host that meet the long-segment criteria. You can of course adjust that threshold to something wider or narrower to fit your particular environment.

Just a final detail on the search: since it uses tstats against the data model and then does fairly basic manipulations at search time, it's a fairly lightweight cost to your ES environment. In our final example, my search log shows:

This search has completed and has returned 2 results by scanning 10,742,117 events in 4.296 seconds.

This makes it not only useful as an alert, but also cheap enough to run as needed for in-depth investigations.
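
If you want to run it on a schedule outside of the ES correlation search editor, a bare-bones savedsearches.conf sketch is below. The search name, cron schedule, and dispatch window are placeholders of mine, not anything prescribed; you would still add whatever alert actions or notable-event settings fit your workflow.

# savedsearches.conf (sketch - name, schedule, and time range are placeholders)
[Excessive DNS Queries - Deviation From Average]
enableSched = 1
# run at 5 past the hour, every 8 hours, over the previous 8-hour window
cron_schedule = 5 */8 * * *
dispatch.earliest_time = -8h@h
dispatch.latest_time = now
search = | tstats summariesonly=true allow_old_summaries=true values(host) as host count from datamodel=Network_Resolution.DNS where "DNS.message_type"="QUERY" NOT `known_dns_query("DNS.query")` NOT `known_dns_src("DNS.src")` by "DNS.src", sourcetype, _time span=60m | rename "DNS.src" as "src" | search count>100 | bucket _time AS bucket span=8h | eventstats avg(count) as avg_count by bucket src | where count > (avg_count * 2.5)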

So now you should be able to easily spot suddenly voluminous DNS traffic from your internal hosts. Go give 'em a query response.
