(Hi all–welcome to the latest installment in the series of technical blog posts from members of the SplunkTrust, our Community MVP program. We’re very proud to have such a fantastic group of community MVPs, and are excited to see what you’ll do with what you learn from them over the coming months and years.
–rachel perkins, Sr. Director, Splunk Community)
This is part 3 of a series.
Find part 1 here: http://blogs.splunk.com/2016/02/11/whats-next-next-level-splunk-sysadmin-tasks-part-1/.
Find part 2 here: http://blogs.splunk.com/2016/02/16/whats-next-next-level-splunk-sysadmin-tasks-part-2/
Hi, I’m Mark Runals, Lead Security Engineer at The Ohio State University, and member of the SplunkTrust.
There can be numerous challenges involved with ingesting data into your local Splunk environment. Because Splunk works so well out of the box against so many types and formats of data, it can be easy to overlook the complexity of what is happening behind the scenes.
So far in this series we’ve talked about ways to validate some of the basic assumptions people make as they search in and look at data in Splunk – these events happened on that server at this time. In retrospect I should have used that line at the beginning of this series. In part 1 I talked through a way to make sure the values in the host field are correct, and in part 2 how to verify that local server time is set correctly.
Time issues with your data go beyond simply making sure local server clocks are right. However, getting that set correctly is like buttoning your shirt starting with the correct first button and hole. Once that is addressed, the next step is to identify cases where there is an extreme or significant gap between when the data was generated and when it comes into Splunk. This is part art, part science. At a base level, the ‘science’ is pretty easy – subtract _time from _indextime. The art is masking your ire when you talk to system administrators about how they haven’t been managing their systems correctly! I kid, I kid. Actually the art is trying to identify which systems or data sources are having time or other data ingestion issues, whether the cause is server or Splunk related, and where to apply a fix.
The two categories of time issues
I tend to lump time issues into two categories: availability and integrity.
Let’s say you have an alert set up to run every 15 minutes looking at the last 15 minutes’ worth of logs from a particular sourcetype – only it takes 20 minutes or more for the data to come in. The data will eventually be placed in its chronologically correct position but your alert will never fire. Availability.
Conversely, let’s say you are investigating an outage or security issue that happened at a particular time, only one of the data types is generated in a different, and unaccounted for, time zone compared to the rest of the data – you will likely miss related events. Integrity.
Solutions and resources
There are more issues and possible solutions in this area than I could possibly cover in one or even several blog posts. As a quick start, let’s look at a few Splunk configuration items to check. The first is that forwarders are limited by default to sending only 256 KBps (kilobytes per second) of data. A server generating more data than the forwarder can push is one reason you might see a delay in data being ingested. This can be found with a query like the following:
index=_internal sourcetype=splunkd "current data throughput" | rex "Current data throughput \((?<kb>\S+)" | eval rate=case(kb < 500, "256", kb > 499 AND kb < 520, "512", kb > 520 AND kb < 770 ,"768", kb>771 AND kb<1210, "1024", 1=1, "Other") | stats count sparkline by host, rate | where count > 4 | sort -rate,-count
If a forwarder has just been restarted it will likely have to catch up, which is why the query has its where statement. The sparkline output looks funky in email form, but my team has this query run over a midnight-to-midnight stretch. Where the counts fall within the sparkline can give insight into when the limit was hit and what might be happening – e.g., a busy server that was rebooted, or a forwarder that is consistently hitting the limit and needs its throughput cap raised. That cap is adjusted via the forwarder’s limits.conf > [thruput] > maxKBps setting. There is a dashboard related to this and other forwarder issues in the Forwarder Health Splunk app.
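If you do decide to raise the cap, a minimal sketch of the change on the forwarder (typically in $SPLUNK_HOME/etc/system/local/limits.conf, or pushed out in a deployment app) looks like the following; 1024 is just an illustrative value, so pick something that fits your environment, and note that 0 removes the limit entirely:
# forwarder limits.conf
[thruput]
# default is 256; 0 = unlimited
maxKBps = 1024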
There is some anecdotal evidence that some of the newest forwarders might not be generating this internal message, or that the conditions for the event’s generation have changed, which hopefully isn’t the case(!!). I have an open case with Splunk looking into this and will update this post if something is determined one way or the other.
The next thing to do is update the time-related props.conf settings for your sourcetypes, especially TIME_FORMAT. This and related settings make sure Splunk understands your timestamps correctly. This subtopic alone can be long and involved. While I hate to hawk my own crap, the OSU team has had to do a lot of work in this space (months of work and ongoing /shudder) so I’ll refer you to the Data Curator app. I’m not sure whether Splunk dropping events or recognizing timestamps incorrectly is worse, but either way, if events aren’t where you expect them it’s bad. The following is one of the queries in the Data Curator app that looks for dropped events due to timestamp issues; you could run it over the last 7 days or so:
index=_internal sourcetype=splunkd DateParserVerbose "too far away from the previous event's time" OR "outside of the acceptable time window" | rex "source::(?<Source>[^\|]+)\|host::(?<Host>[^\|]+)\|(?<Sourcetype>[^\|]+)" | rex "(?<msgs_suppressed>\d+) similar messages suppressed." | eval msgs_suppressed = if(isnull(msgs_suppressed), 1, msgs_suppressed) | timechart sum(msgs_suppressed) by Sourcetype span=1d usenull=f
Besides the Data Curator app, I recommend other Splunk resources like Andrew Duca’s Data Onboarding presentation from .conf15.
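To make the TIME_FORMAT point above a bit more concrete, here is a minimal props.conf sketch for a hypothetical sourcetype; the stanza name, format string, and TZ value are placeholders you would match to your actual data:
# props.conf on the parsing tier (indexer or heavy forwarder)
[my:custom:sourcetype]
# timestamp sits at the start of each event, e.g. 2016-03-01 14:07:32
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 19
# declare the source time zone if the device doesn't log one
TZ = America/New_York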
So now let’s generically say Splunk is configured to recognize your timestamp formats correctly and the forwarders are able to send data just as quickly as their little digital hearts can pump it out. As mentioned above, we need to look at the _indextime field. To find the delta, or ‘lag’, between event generation and ingestion, you simply subtract one from the other via a basic eval ( | eval lag = _indextime - _time ). If you want to see what that index time actually is, you’d need to create a field to operate as a surrogate, like | eval index_time = _indextime, and then maybe a | convert ctime(index_time) unless you are a Matrix-like prodigy who can convert the epoch time number into a meaningful date in your head.
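Putting those pieces together, a minimal sketch might look like the following; the index and sourcetype names here are placeholders for whatever data you want to inspect:
index=my_index sourcetype=my_sourcetype | eval lag = _indextime - _time | eval index_time = _indextime | convert ctime(index_time) | table _time index_time lag | sort -lag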
A basic and fairly generic query to review your data on the whole might be something like this, though it tends toward looking for time zone issues. If you want to have it look a bit broader at probable time issues, adjust the hrs eval to round(delay/3600,1) or just remove the search command right after that eval.
index=* | eval indexed_time = _indextime | eval delay = _indextime - _time | eval hrs = round(delay/3600) | search hrs > 0 OR hrs < 0 | rex field=source "(?<path>\S[^.]+)" | eval when = case(delay < 0, "Future", delay > 0, "Past", 1=1, "fixme") | stats avg(delay) as avgDelaySec avg(hrs) max(hrs) min(hrs) by sourcetype index host path when | eval avgDelaySec = round(avgDelaySec, 1)
One thing I’m trying to do with the rex command in this query is cut out cases where a date is appended to the source path, in an effort to cut down on granular noise. Note that this query will take some time to churn through depending on your environment, so I recommend a relatively small time slice of not more than 5 minutes or so. In reviewing this post, fellow Community Trustee Martin Müller pointed out that a query using tstats would be more efficient. I would agree, though I feel what we are talking about is chapter 3 or 4 material and tstats is like chapter 6 :). At any rate, a rough query he threw together is:
| tstats max(_indextime) as max_index min(_indextime) as min_index where index=* by _time span=1s index host sourcetype source | eval later = max_index - _time | eval sooner = min_index - _time | where later > 60 OR sooner > 10
An additional tip: if you are on the North/South America side of the planet and want a quick way to look for unaccounted-for UTC logs, you could do the following. This will show logs coming in from the ‘future’:
index=* earliest=+1m latest=+24h
When it comes to investigating an overloaded forwarder or sourcetype in particular, a quick go-to for me is:
host=foo source=bar (and/or sourcetype) | eval delta = _indextime - _time | timechart avg(delta) p95(delta) max(delta)
What I’m looking for here are basic visual trends like: is there a constant delay or does the delay subside at night/during slow periods?
Overall, time issues can be somewhat troublesome to find and ultimately fix. The fix might involve adjusting the limits on forwarders as we’ve talked about, adding additional forwarders on a server to split up the monitoring load (e.g., a busy centralized syslog server), updating props settings, or having conversations with device/server admins. I’ve not worked in a Splunk environment that collects data from multiple time zones. If you do, and you’ve worked through those particular challenges, please share the strategies you used in the comments!
Hopefully you’ve found this series useful. It has been fun to write and share!