Splunk Monitoring and Alerts

What is Monitoring in Splunk?

Monitoring refers to reports you can watch visually; alerting refers to conditions that Splunk monitors for you and that can automatically trigger actions. These recipes are brief solutions to common monitoring and alerting problems. Each recipe includes a problem statement followed by a description of how to use Splunk to solve the problem.

Monitoring Recipes

Monitoring can help you see what is happening in your data. In addition to recipes that monitor various conditions, this section provides recipes that describe how to use search commands to extract fields from semi-structured and structured data.

Monitoring Concurrent Users

Problem
You need to determine how many concurrent users you have at any particular time. This can help you gauge whether some hosts are overloaded and enable you to better provision resources to meet peak demand.
Solution
First, perform a search to retrieve relevant events. Next, use the concurrency command to find the number of users that overlap. Finally, use the timechart reporting command to display a chart of the number of concurrent users over time.

Let’s say you have the following events, which specify date, time, request duration, and username:
5/10/10 1:00:01 ReqTime=3 User=jsmith
5/10/10 1:00:01 ReqTime=2 User=rtyler
5/10/10 1:00:01 ReqTime=50 User=hjones
5/10/10 1:00:11 ReqTime=2 User=rwilliams
5/10/10 1:00:12 ReqTime=3 User=apond
You can see that, at 1:00:01, there are three concurrent requests (jsmith, rtyler, hjones); at 1:00:11, there are two (hjones, rwilliams); and at 1:00:12, there are three (hjones, rwilliams, apond).
Use this search to show the maximum concurrent users for any particular time:

<your search here> sourcetype=login_data
| concurrency duration=ReqTime
| timechart max(concurrency)
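The overlap counting described above can be sketched in Python. This is a simplified model rather than Splunk's implementation: times are written as seconds after 1:00:00, a request is assumed to occupy the interval from its start up to (but not including) start plus ReqTime, and the function name is made up for illustration. Splunk's own handling of events that share a timestamp may differ.

```python
# Simplified model of concurrency counting; not Splunk's implementation.
# Each tuple is (start_seconds_after_1:00:00, ReqTime, User) from the sample events.
events = [
    (1, 3, "jsmith"),
    (1, 2, "rtyler"),
    (1, 50, "hjones"),
    (11, 2, "rwilliams"),
    (12, 3, "apond"),
]


def concurrency_at_start(events):
    """For each event, count the requests in flight at that event's start time."""
    counts = []
    for start, _, _ in events:
        in_flight = sum(
            1 for s, duration, _ in events if s <= start < s + duration
        )
        counts.append(in_flight)
    return counts


counts = concurrency_at_start(events)
print(counts)       # [3, 3, 3, 2, 3]
print(max(counts))  # 3
```

The per-event counts match the walkthrough above: three requests at 1:00:01, two at 1:00:11, and three at 1:00:12.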

Monitoring Inactive Hosts

Problem
You need to determine which hosts have stopped sending data. A host might stop logging events if the server, or application producing logs, has crashed or been shut down. This often indicates a serious problem. If a host stops logging events, you’ll want to know about it.

Solution
Use the metadata command, which reports high-level information about hosts, sources, and source types in the Splunk indexes. This is what is used to create the Summary Dashboard.

Note that the pipe character is at the beginning of this search because we’re not retrieving events from a Splunk index; instead, we’re calling a data-generating command (metadata).

Use the following search to take the information on hosts, sort it so the least recently referenced hosts are first, and display the time in a readable time format:

| metadata type=hosts
| sort recentTime
| convert ctime(recentTime) as Latest_Time

You’ll quickly see which hosts haven’t logged data lately.
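As a rough illustration of what the sort and convert steps do, the following Python sketch orders hosts by their last-seen time and renders that time in readable form. The host names and epoch values are hypothetical, and the dictionary only stands in for the output of the metadata command.

```python
from datetime import datetime, timezone

# Hypothetical host -> last-seen time (epoch seconds), standing in for metadata output.
recent_time = {
    "web01": 1_700_000_300,
    "db09": 1_700_000_000,
    "app02": 1_700_000_200,
}


def least_recent_first(recent_time):
    """Return (host, readable UTC time) pairs, oldest last-seen time first."""
    rows = sorted(recent_time.items(), key=lambda item: item[1])
    return [
        (host, datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S"))
        for host, ts in rows
    ]


for host, latest in least_recent_first(recent_time):
    print(host, latest)
```

The host at the top of the output is the one that has been silent the longest.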

Reporting on Categorized Data

Problem
You need to report on segments of your data that aren’t neatly defined.
Solution
To search for specific parts of your data, classify your events using tags and event types. Tags are simpler but event types are more powerful.

Using Tags

You can classify simple field=value pairs using tags. For example, classify events that have host=db09 as a database host by tagging that field value. This creates a tag::host field having a value of database, on events with host=db09. You can then use this custom classification to generate reports.

Here is an example that uses tags. To show the top ten host types (good for bar or pie charts):
… | top 10 tag::host

Using Event Types

When you use event types, instead of tags, to classify events, you are not limited to a simple field=value. You can use the full power of the search command, including Boolean operations, phrase matching, and wildcards.

You could make an event type called database_host with a definition of “host=db* OR host=orcl*”, and another event type called web_host. Repeat the same searches as you did for tags, but replace tag::host with eventtype. For example, to show the top ten event types:
… | top 10 eventtype
Because event types are not specific to a dimension, such as hosts, user type, or error codes, they are all in a common namespace, jumbled together.

A search for top eventtypes might return database_host and web_error, which is probably not what you want because you’d be comparing apples to oranges. Fortunately you can filter which event types you report on, using the eval command, if you use a common naming convention for your event types.
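The idea of filtering by naming convention can be sketched in Python. This only illustrates the logic and is not a Splunk search: it assumes event type names end with a suffix naming their dimension (for example _host or _error), and login_error is a made-up name added for the example.

```python
# Event type names following a "<name>_<dimension>" convention (login_error is hypothetical).
event_types = ["database_host", "web_host", "web_error", "login_error"]


def of_dimension(names, suffix):
    """Keep only the event types whose name ends with the given suffix."""
    return [name for name in names if name.endswith(suffix)]


print(of_dimension(event_types, "_host"))  # ['database_host', 'web_host']
```

With a consistent suffix, a report on host types never mixes in error types.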

Comparing Today’s Top Values to Last Month’s

Problem
You need to know the top N values today and how they compare to last month’s values. This can answer questions like, which products, or database errors, are suddenly becoming more popular than they used to be?

Solution
For this solution, we’ll use the example of music data to show the top 10 most played artists today and their average position for the month. Assume the events have an artist field and a sales field that tells how many units were sold at a particular time. We’ll use the sum of sales as our metric—sum(sales)—but we could use any other metric.

The full search looks daunting at first, but you can break it down into simple steps:

Get the monthly rankings by artist.
Get the daily rankings by artist and append them to the results.
Use stats to join the monthly and daily rankings by artist.
Use sort and eval to format the results.

Get the monthly rankings
Use this search to find the 10 biggest monthly sales by artist:

sourcetype=music_sales earliest=-30d@d
| stats sum(sales) as month_sales by artist
| sort 10 - month_sales
| streamstats count as MonthRank

The earliest=-30d@d tells Splunk to retrieve events starting at 30 days ago (in other words, get events from the last month). stats calculates the sums of sales for each artist as the month_sales field.

You now have a row for each artist, with two columns: month_sales and artist. sort 10 - month_sales keeps only those rows with the ten largest month_sales values, in sorted order from largest to smallest.

The streamstats command adds one or more statistics to each event, based on the current value of the aggregate at the time the event is seen (not on the results as a whole, like the stats command does). Effectively, streamstats count as MonthRank assigns the first result MonthRank=1, the second result MonthRank=2, and so on.

Get yesterday’s rankings
Make three small changes to the monthly-rankings search to get yesterday’s rank:

Change the value for earliest from -30d@d to -1d@d to get the rankings from yesterday.

Change every instance of “month” in the search to “day”.

Wrap the search in an append command so that the results are appended to the results from the first search.

append [
search sourcetype=music_sales earliest=-1d@d
| stats sum(sales) as day_sales by artist
| sort 10 - day_sales
| streamstats count as DayRank
]

Use stats to join the monthly and daily ranks by artist
Use the stats command to join the results by artist, putting the first monthly and daily rankings into one result.
stats first(MonthRank) as MonthRank first(DayRank) as DayRank by artist
Format the output
Finally, we’ll calculate the difference in ranking between the monthly and daily rank, sort the results by the daily rank, and display the fields in music billboard order (rank, artist, change in rank, old rank):

eval diff=MonthRank-DayRank
| sort DayRank
| table DayRank, artist, diff, MonthRank
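To make the four steps concrete, here is a minimal Python sketch of the same logic: sum sales per artist, rank the top totals, join the two rankings by artist, and compute the difference. The function names and the sample sales figures are invented for illustration, and for brevity the sketch lists only artists that have a daily rank.

```python
from collections import defaultdict


def top_ranks(sales_events, limit=10):
    """Sum sales per artist and rank the largest totals, starting at 1."""
    totals = defaultdict(int)
    for artist, sales in sales_events:
        totals[artist] += sales
    ordered = sorted(totals, key=lambda a: totals[a], reverse=True)[:limit]
    return {artist: rank for rank, artist in enumerate(ordered, start=1)}


def compare(month_events, day_events):
    """Return (DayRank, artist, diff, MonthRank) rows sorted by DayRank."""
    month_rank = top_ranks(month_events)
    day_rank = top_ranks(day_events)
    rows = []
    for artist, day in day_rank.items():
        month = month_rank.get(artist)  # None if outside the monthly top ranks
        diff = None if month is None else month - day
        rows.append((day, artist, diff, month))
    return sorted(rows, key=lambda row: row[0])


# Made-up (artist, sales) events for the month and for the day.
month_events = [("A", 100), ("B", 80), ("C", 60), ("A", 10)]
day_events = [("C", 9), ("A", 5), ("D", 7)]
for row in compare(month_events, day_events):
    print(row)
```

In this sample, artist C climbs from third place for the month to first place for the day, so its diff is 2, while artist D has no monthly rank and therefore no diff.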

Finding Metrics That Fell by 10% in an Hour

Problem
You want to know about metrics that have dropped by 10% in the last hour. This could mean fewer customers, fewer web page views, fewer data packets, and the like.
Solution
To see a drop over the past hour, we’ll need to look at results for at least the past two hours. We’ll look at two hours of events, calculate a separate metric for each hour, and then determine how much the metric has changed between those two hours. The metric we’re looking at is the count of events in each of those two hours.
This search compares the count by host of the previous hour with the current hour and filters those where the count dropped by more than 10%:
earliest=-2h@h latest=@h
| stats count by date_hour,host
| stats first(count) as previous, last(count) as current by host
| where current/previous < 0.9
The first condition (earliest=-2h@h latest=@h) retrieves two hours’ worth of data, snapping to hour boundaries (e.g., 2-4pm, not 2:01-4:01pm). We then get a count of those events per hour and host. Because there are only two hours (two hours ago and one hour ago), stats first(count) returns the count from two hours ago and last(count) returns the count from one hour ago. The where clause keeps only those hosts whose count for the most recent hour is less than 90% of the previous hour’s count (that is, the count dropped by more than 10%).
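The comparison in the where clause can be sketched in Python as follows. The host names and counts are made up, and the sketch treats a host that is missing from the current hour as having a count of zero, which is an assumption of this example rather than the behavior of the search above.

```python
def dropped_hosts(previous, current, threshold=0.9):
    """Return hosts whose current count is below threshold times the previous count."""
    dropped = []
    for host, prev_count in previous.items():
        cur_count = current.get(host, 0)  # assumption: a missing host counts as 0
        if prev_count > 0 and cur_count / prev_count < threshold:
            dropped.append(host)
    return sorted(dropped)


# Hypothetical event counts per host for the two hours.
previous = {"web01": 1000, "web02": 500, "db09": 200}
current = {"web01": 950, "web02": 400, "db09": 190}
print(dropped_hosts(previous, current))  # ['web02']
```

Only web02 is reported, because its count fell to 80% of the previous hour while the other hosts stayed at 95%.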

Charting Week Over Week Results

Problem
You need to determine how this week’s results compare with last week’s.
Solution
First, run a search over all the events and mark whether they belong to this week or last week. Next, adjust the time value of last week’s events to look like this week’s events (so they graph over each other on the same time range). Finally, create a chart.
Let’s get results from the last two weeks, snapped to the beginning of the week:

earliest=-2w@w latest=@w

Mark events as being from this week or last week:

eval marker = if (_time < relative_time(now(), "-1w@w"),
"last week", "this week")

Adjust last week’s events to look like they occurred this week:

eval _time = if (marker=="last week",
_time + 7*24*60*60, _time)

Chart the desired metric, using the week marker we set up, such as a timechart of the average bytes downloaded for each week:
timechart avg(bytes) by marker
This produces a timechart with two labeled series: “last week” and “this week”.
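The marking and time-shifting steps can be illustrated with a short Python sketch. It assumes events are (epoch seconds, value) pairs and that the boundary argument is the start of the more recent week; both names are chosen for this example only.

```python
WEEK_SECONDS = 7 * 24 * 60 * 60


def overlay_weeks(events, boundary):
    """Label each (time, value) event and shift last week's events forward one week."""
    shifted = []
    for t, value in events:
        if t < boundary:
            shifted.append((t + WEEK_SECONDS, value, "last week"))
        else:
            shifted.append((t, value, "this week"))
    return shifted


# Two hypothetical events, exactly one week apart.
events = [(100, 5), (100 + WEEK_SECONDS, 7)]
print(overlay_weeks(events, boundary=WEEK_SECONDS))
```

After the shift, both events carry the same timestamp, so the two series line up on the same time axis and differ only in their marker.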

Identify Spikes in Your Data

Problem
You want to identify spikes in your data. Spikes can show you where you have peaks (or troughs) that indicate that some metric is rising or falling sharply. Traffic spikes, sales spikes, spikes in the number of returns, spikes in database load—whatever type of spike you are interested in, you want to watch for it and then perhaps take some action to address those spikes.
Solution
Use a moving trendline to help you see the spikes. Run a search followed by the trendline command using a field you want to create a trendline for.
For example, on web access data, we could chart an average of the bytes field:

sourcetype=access* | timechart avg(bytes) as avg_bytes

To add another line/bar series to the chart for the simple moving average (sma) of the last 5 values of bytes, use this command:

trendline sma5(avg_bytes) as moving_avg_bytes

If you want to clearly identify spikes, you might add an additional series for spikes—when the current value is more than twice the moving average:

eval spike=if(avg_bytes > 2 * moving_avg_bytes, 10000, 0)

The 10000 here is arbitrary and you should choose a value relevant to your data that makes the spike noticeable. Changing the formatting of the Y-axis to Log scale also helps.
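The moving-average test can be sketched in Python. This is an approximation of the idea, not Splunk's trendline implementation: it assumes the average covers the current value and the four before it, and that no average exists until five values are available. The sample series is invented.

```python
def sma(values, window=5):
    """Simple moving average; None until a full window of values is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def spikes(values, window=5, factor=2):
    """Return the indices where a value exceeds factor times its moving average."""
    averages = sma(values, window)
    return [
        i
        for i, (value, avg) in enumerate(zip(values, averages))
        if avg is not None and value > factor * avg
    ]


# Hypothetical avg_bytes series with one obvious spike.
avg_bytes = [100, 110, 90, 105, 95, 100, 900, 100]
print(spikes(avg_bytes))  # [6]
```

Because the spike itself raises the moving average, only values well above their neighbors pass the test.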

Compacting Time-Based Charting

Problem
You would like to be able to visualize multiple trends in your data in a small space. This is the idea behind sparklines—small, time-based charts displayed within cells of your results table.
Solution
To produce these sparklines in your tables, simply enclose your stats or chart functions in the sparkline() function.
Here, we’ll use the example of web access logs. We want to create a small graph showing how long it took for each of our web pages to respond (assuming the field spent is the amount of time spent serving that web page). We have many pages, so we’ll sort them to find the pages accessed the most (i.e., having the largest count values). The 5m tells Splunk to show details down to a 5-minute granularity in the sparklines.
sourcetype=access*
| stats sparkline(avg(spent),5m), count by file
| sort - count
Run this search over the last hour. The result is a series of mini graphs showing how long it took each page to load on average, over time.

Reporting on Fields Inside XML or JSON

Problem
You need to report on data formatted in XML or JSON.
Solution
Use the spath command to extract values from XML- and JSON-formatted data. In this example, we’ll assume a source type of book data in XML or JSON. We’ll run a search that returns XML or JSON as the event’s text, and use the spath command to extract the author name:
sourcetype=books
| spath output=author path=catalog.book.author
When called with no path argument, spath extracts all fields from the first 5000 characters (this limit is configurable), creating fields for each path element. Paths have the form foo.bar.baz. Each level can have an optional array index, indicated by curly braces (e.g., foo{1}.bar). All array elements can be represented by empty curly braces (e.g., foo{}). The final level for XML queries can also include an attribute name, prefaced with @ and also enclosed in curly braces (e.g., foo.bar{@title}).

After you have the extracted field, you can report on it:
… | top author
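To show how a dotted path selects values from nested data, here is a small Python sketch for JSON only. It is not how spath is implemented: it supports plain levels and the empty-braces form for all array elements, but not numeric indexes or XML attributes. The function name and the sample event are hypothetical.

```python
import json


def extract_path(document, path):
    """Resolve a dotted path; a trailing '{}' on a level selects every array element."""
    current = [document]
    for level in path.split("."):
        every = level.endswith("{}")
        key = level[:-2] if every else level
        next_values = []
        for node in current:
            if isinstance(node, dict) and key in node:
                value = node[key]
                if every and isinstance(value, list):
                    next_values.extend(value)
                else:
                    next_values.append(value)
        current = next_values
    return current


# Hypothetical event text for a book catalog.
event = json.loads('{"catalog": {"book": [{"author": "Ann"}, {"author": "Bob"}]}}')
print(extract_path(event, "catalog.book{}.author"))  # ['Ann', 'Bob']
```

A path that does not exist in the event simply yields no values.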

Extracting Fields from an Event

Problem
You want to search for a pattern and extract that information from your events.
Solution
Using commands to extract fields is convenient for quickly extracting fields that are needed temporarily or that apply to specific searches and are not as general as a source or source type.
Regular Expressions
The rex command facilitates field extraction using regular expressions. For example, the following search uses rex to extract the from and to fields from email data:

sourcetype=sendmail_syslog
| rex "From: (?<from>.*) To: (?<to>.*)"
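The same named-group extraction can be tried outside Splunk with Python's re module, which may help when testing a pattern. Note that Python spells named groups as (?P<name>...), and that the sample event text below is made up.

```python
import re

# Python writes named groups as (?P<name>...); the pattern is otherwise the same.
PATTERN = re.compile(r"From: (?P<from>.*) To: (?P<to>.*)")


def extract_from_to(event):
    """Return the 'from' and 'to' fields, or an empty dict if the event does not match."""
    match = PATTERN.search(event)
    return match.groupdict() if match else {}


# Hypothetical event text for illustration.
print(extract_from_to("From: alice@example.com To: bob@example.com"))
# {'from': 'alice@example.com', 'to': 'bob@example.com'}
```

Events that do not contain both markers produce no fields.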

Delimiters
If you’re working with multiple fields that have delimiters around them, use the extract command to extract them. Suppose your events look like this:

|height:72|age:43|name:matt smith|

Extract the event fields, without the delimiters, using:

... | extract pairdelim="|" kvdelim=":"

The result is what you would expect: height=72, age=43, and name=matt smith.
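The two delimiters play different roles: one separates pairs, the other separates a key from its value. A minimal Python sketch of that splitting logic, with a function name chosen for this example and parameter names borrowed from the extract options, looks like this:

```python
def extract_pairs(event, pairdelim="|", kvdelim=":"):
    """Split an event into key/value fields using the pair and key-value delimiters."""
    fields = {}
    for pair in event.split(pairdelim):
        if kvdelim in pair:
            key, value = pair.split(kvdelim, 1)
            fields[key.strip()] = value.strip()
    return fields


print(extract_pairs("|height:72|age:43|name:matt smith|"))
# {'height': '72', 'age': '43', 'name': 'matt smith'}
```

The empty pieces before the first delimiter and after the last one contain no key-value separator, so they are skipped.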

Splunk Alerts

An alert is made up of two parts:

A condition: an interesting thing you want to know about.
An action: what to do when that interesting thing happens.

In addition, you can use throttling to prevent over-firing of repeated alerts of the same type.
For example:

I want to get an email whenever one of my servers has a load above a certain percentage.

I want to get an email listing all servers whose load is above a certain percentage, but without spamming my inbox, so throttle the alert to fire at most once every 24 hours.
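The relationship between condition, action, and throttling can be sketched in a few lines of Python. This is only a model of the concept, not Splunk's alerting mechanism; the class name, the 80% threshold, and the list that stands in for sending an email are all assumptions made for the example.

```python
class ThrottledAlert:
    """Fire an action when a condition holds, at most once per suppress window."""

    def __init__(self, condition, action, suppress_seconds):
        self.condition = condition
        self.action = action
        self.suppress_seconds = suppress_seconds
        self.last_fired = None

    def check(self, value, now):
        """Return True if the action fired for this check."""
        if not self.condition(value):
            return False
        if self.last_fired is not None and now - self.last_fired < self.suppress_seconds:
            return False  # throttled: still inside the suppress window
        self.last_fired = now
        self.action(value)
        return True


sent = []  # stands in for an outgoing email
alert = ThrottledAlert(lambda load: load > 80, sent.append, 24 * 60 * 60)
alert.check(95, now=0)      # fires
alert.check(97, now=3600)   # suppressed, one hour after the first alert
alert.check(50, now=90000)  # condition not met
alert.check(91, now=90000)  # fires again, more than 24 hours later
print(sent)  # [95, 91]
```

The second high reading is dropped because it arrives inside the 24-hour window, which is the over-firing that throttling is meant to prevent.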

Alerting by Email when a Server Hits a Predefined Load

Problem
You want to be notified by email when a server load goes above 80%.
Solution
The following search retrieves events with load averages above 80% and calculates the maximum value for each host. The “top” source type comes with the Splunk Unix app (available at splunkbase.com), and is fed data from the Unix top command every 5 seconds:
sourcetype=top load_avg>80
| stats max(load_avg) by host
Set up the alert in the following way:

Alert condition: alert if the search returns at least one result.
Alert actions: email and set subject to: Server load above 80%.
Suppress: 1 hour.

Alerting When Web Server Performance Slows

Problem
You want to be notified by email whenever the 95th percentile response time of your web servers is above a certain number of milliseconds.
Solution
The following search retrieves weblog events, calculates the 95th percentile response time for each unique web address (uri_path), and finally filters out any values where the 95th percentile is less than 200 milliseconds:
sourcetype=weblog
| stats perc95(response_time) AS resp_time_95 by uri_path
| where resp_time_95>200
Set up the alert in the following way:

Alert condition: alert if the search returns at least X results (the number of slow web requests you think merit an alert being fired).
Alert actions: email, with subject set to: “Web servers running slow.” If you’re running in the cloud (for example, on Amazon EC2™), maybe start new web server instances.
Suppress: 1 hour.

Shutting Down Unneeded EC2 Instances

Problem
You want to shut down underutilized EC2 instances.
Solution
The following search retrieves weblog events and returns a table of hosts that have fewer than 10000 requests (over the timeframe that the search runs):

sourcetype=weblog
| stats count by host
| where count<10000

Set up the alert in the following way:

Alert condition: alert if the search returns at least X results (the number of hosts you think merit an alert being fired).
Alert actions: trigger a script that removes servers from the load balancer and shuts them down.
Suppress: 10 minutes.

Converting Monitoring to Alerting

The monitoring recipes produce useful reports, valuable in themselves. But, if you take a second look, many of these can also be the basis for setting up alerts, enabling Splunk to monitor the situation for you.

Monitoring Concurrent Users

This recipe can be made into an alert by using its search with a custom alert condition of “where max(concurrency) > 20”. This alerts you if too many concurrent users are logged in.
Variations: Consider calculating the average concurrency as well and alerting if the max is twice the average.

Monitoring Inactive Hosts

A custom alert condition of where now() - recentTime > 60*60 alerts you if a host has not been heard from in over an hour.

Comparing Today’s Top Values to Last Month’s

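A custom alert condition on diff alerts you when an artist’s rank changes sharply. Because diff is MonthRank-DayRank, a climb in the rankings produces a positive value, so a condition such as where diff > 5 fires when an artist sits more than five places higher today than over the month. Artists outside the monthly top 10 have no MonthRank and therefore no diff; to catch an artist who shoots to number 1 today without having been in the top 10 for the last month, use a condition such as where DayRank=1 AND isnull(MonthRank).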
Variations: Use the same recipe to monitor HTTP status codes and report if a status code (e.g., 404) suddenly becomes significantly more, or less, prevalent than it was over the last month.

Find Metrics That Fell by 10% in an Hour

This recipe is already set up conveniently for an alert. Fire an alert when any events are seen.
Variation: Fire only when more than N declines are seen in a row.

Show a Moving Trendline and Identify Spikes

The variation for this recipe is already set up conveniently for an alert. Fire an alert when any events are seen.
Variations: Fire only when more than N spikes are seen in a time period (e.g., 5 minutes).

You might find it a useful exercise to add alerting to the remaining monitoring recipes.