
Splunk Group By

Identify and Group Events into Transactions

Introduction

There are several ways to group events. The most common approach uses either the transaction or stats commands. But when should you use transactions, and when should you use stats?

The rule of thumb: if you can use stats, use stats. It's faster than transaction, especially in a distributed environment. That speed, however, comes with limitations: you can only group events with stats if they share at least one common field value and you have no other constraints. Typically, the raw event text is discarded.
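For example, a minimal stats-based grouping might look like the following sketch, where session_id and action are hypothetical field names standing in for whatever common field your events share:

… | stats count, values(action) as actions, range(_time) as duration by session_id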

Like stats, the transaction command can group events based on common field values, but it can also use more complex constraints such as the total period of the transaction, delays between events within the transaction, and required beginning and ending events. Unlike stats, transactions retain the raw event text and field values from the original events, but they don't compute any statistics over the grouped events, other than the duration (the delta of the _time field between the oldest and newest events in the transaction) and the event count (the total number of events in the transaction).
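Because transaction adds duration and eventcount to each grouped result, you can filter on them directly. A minimal sketch, again assuming a hypothetical session_id field (the thresholds are arbitrary):

… | transaction session_id
| where duration > 60 AND eventcount > 5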

The transaction command is most useful in two specific cases:

  • When unique field values (also known as identifiers) are not sufficient to discriminate between discrete transactions.
  • When it is desirable to see the raw text of the events rather than an analysis on the constituent fields of the events.

If neither of these cases applies, it is generally better to use the stats command, because its search performance is typically better than that of the transaction command. In many cases a unique identifier is available, and stats is then the more efficient approach.
For example, to compute statistics on the duration of trades identified by the unique identifier trade_id, the following searches yield the same answer:

… | transaction trade_id
| chart count by duration

… | stats range(_time) as duration by trade_id
| chart count by duration

The second search is more efficient.
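Because the stats version already yields a per-trade duration field, you can also summarize it directly. A sketch using standard stats functions:

… | stats range(_time) as duration by trade_id
| stats avg(duration) as avg_duration, max(duration) as max_duration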

However, if trade_id values are reused but the last event of each trade is indicated by the text “END”, the only viable solution is:

… | transaction trade_id endswith=END
| chart count by duration

If instead of an end condition, trade_id values are not reused within 10 minutes, the most viable solution is:

… | transaction trade_id maxpause=10m
| chart count by duration

Finally, a brief word about performance. No matter what search commands you use, it’s imperative for performance that you make the base search as specific as possible. Consider this search:

sourcetype=x | transaction ip maxpause=15s | search ip=1.2.3.4

Here we are retrieving all events of sourcetype=x, building up transactions, and then throwing away any that don't involve ip=1.2.3.4. If all the events in each transaction share the same ip value, this search should instead be:

sourcetype=x ip=1.2.3.4 | transaction ip maxpause=15s

This search retrieves only the events it needs to and is much more efficient.
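Narrowing the time range in the base search helps for the same reason. A sketch, where earliest=-24h is only an illustrative window:

sourcetype=x ip=1.2.3.4 earliest=-24h | transaction ip maxpause=15s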

Recipes

Unifying Field Names

Problem
You need to build transactions from multiple data sources that use different field names for the same identifier.
Solution
Typically, you can join transactions with common fields like:

… | transaction username

But when the username identifier is called different names (login, name, user, owner, and so on) in different data sources, you need to normalize the field names.
If sourcetype A only contains field_A and sourcetype B only contains field_B, create a new field called field_Z which is either field_A or field_B, depending on which is present in an event. You can then build the transaction based on the value of field_Z.

sourcetype=A OR sourcetype=B
| eval field_Z = coalesce(field_A, field_B)
| transaction field_Z
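The same coalesce trick works if you later prefer stats over transaction. A sketch, reusing the hypothetical field names above:

sourcetype=A OR sourcetype=B
| eval field_Z = coalesce(field_A, field_B)
| stats count, range(_time) as duration by field_Z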

Finding Incomplete Transactions

Problem
You need to report on incomplete transactions, such as users who have logged in but not logged out.
Solution
Suppose you are searching for user sessions starting with a login and ending with a logout:

… | transaction userid startswith="login"
endswith="logout"

You would like to build a report that shows incomplete transactions—users who have logged in but not logged out. How can you achieve this?
The transaction command creates an internal boolean field named closed_txn to indicate if a given transaction is complete or not. Normally incomplete transactions are not returned, but you can ask for these “evicted” partial transactions by specifying the parameter keepevicted=true. Evicted transactions are sets of events that do not match all the transaction parameters.

For example, an evicted transaction may have failed to meet the time constraints. Transactions that fulfill all the requirements are marked as complete by having the field closed_txn set to 1 (rather than 0 for incomplete transactions). So the pattern for finding incomplete transactions would generally be:

… | transaction <conditions> keepevicted=true
| search closed_txn=0

In our case, however, there's a wrinkle. An endswith condition not matching will not set closed_txn=0, because events are processed in reverse time order (newest to oldest), so the endswith condition is effectively what opens a transaction. To get around this, we need to filter transactions based on the closed_txn field, as well as make sure that our transactions don't have both a login and a logout:

… | transaction userid startswith="login"
endswith="logout"
keepevicted=true
| search closed_txn=0 NOT (login logout)
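From there, a report of incomplete sessions per user is a small extension. A sketch building on the search above:

… | transaction userid startswith="login" endswith="logout" keepevicted=true
| search closed_txn=0 NOT (login logout)
| stats count as incomplete_sessions by userid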

Calculating Times within Transactions

Problem
You need to find the duration between events in a transaction.

Solution
The basic approach is to use the eval command to mark the points in time needed to measure the different durations, and then, after the transaction command, use eval again to calculate the durations between those points. The example below uses searchmatch to mark the phases, but you have the full range of eval functions available to you for more complex situations.

… | eval p1start = if(searchmatch("phase1"), _time, null())
| eval p2start = if(searchmatch("phase2"), _time, null())

Next, we make the actual transactions:

… | transaction id startswith="start of event"
endswith="end of event"

Finally, we calculate the duration for each transaction, using the values calculated above.

… | eval p1_duration = p2start - p1start
| eval p2_duration = (_time + duration) - p2start

In this example, we calculated the time of the last event by adding duration to _time (the time of the first event). Once we knew the last event's time, we calculated p2_duration as the difference between the last event and the start of phase 2.
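Putting the pieces together, the whole pipeline looks roughly like this (a sketch assembled from the steps above; id, "phase1", "phase2", and the startswith/endswith strings are placeholders for your own data):

… | eval p1start = if(searchmatch("phase1"), _time, null())
| eval p2start = if(searchmatch("phase2"), _time, null())
| transaction id startswith="start of event" endswith="end of event"
| eval p1_duration = p2start - p1start
| eval p2_duration = (_time + duration) - p2start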

Finding the Latest Events

Problem
You need to find the latest event for each unique field value. For example, when was the last time each user logged in?
Solution
At first, you might be tempted to use the transaction or stats command. For example, this search returns, for each unique userid, the first value seen for each field:

… | stats first(*) by userid

Note that this search returns the first value of each field seen for events that have the same userid. It provides a union of all events that have that userid, which is not what we want. What we want is the first event with a unique userid. The proper way to do that is with the dedup command:

… | dedup userid
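Because search results come back in reverse time order, dedup keeps the most recent event for each userid. A short sketch to display each user's last login time (table is only for display):

… | dedup userid
| table userid, _time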

Finding Repeated Events

Problem
You want to group all events with repeated occurrences of a value in order to remove noise from reports and alerts.
Solution
Suppose you have events as follows:
2012-07-22 11:45:23 code=239
2012-07-22 11:45:25 code=773
2012-07-22 11:45:26 code=-1
2012-07-22 11:45:27 code=-1
2012-07-22 11:45:28 code=-1
2012-07-22 11:45:29 code=292
2012-07-22 11:45:30 code=292
2012-07-22 11:45:32 code=-1
2012-07-22 11:45:33 code=444
2012-07-22 11:45:35 code=-1
2012-07-22 11:45:36 code=-1
Your goal is to get 7 events, one for each of the code values in a row: 239, 773, -1, 292, -1, 444, -1. You might be tempted to use the transaction command as follows:

… | transaction code

Using transactions here is a case of applying the wrong tool for the job. As long as we don’t care about the number of repeated runs of duplicates, the more straightforward approach is to use dedup, which removes duplicates. By default, dedup will remove all duplicate events (where an event is a duplicate if it has the same values for the specified fields). But that’s not what we want; we want to remove duplicates that appear in a cluster. To do this, dedup has a consecutive=true option that tells it to remove only consecutive duplicates.

… | dedup code consecutive=true
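Against the sample events above, that search leaves the seven events described earlier. To eyeball the result, you can lay it out as a simple table:

… | dedup code consecutive=true
| table _time, code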

Time Between Transactions

Problem
You want to determine the time between transactions, such as how long it’s been between user visits to your website.
Solution
Suppose we have a basic transaction search that groups all events by a given user (clientip-cookie pair), but splits the transactions when the user is inactive for 10 minutes:

… | transaction clientip, cookie maxpause=10m

Ultimately, our goal is to calculate, for each clientip-cookie pair, the difference in time between the end time of a transaction and the start time of a more recent (i.e., 'previous' in the order of events returned) transaction.
That time difference is the gap between transactions. For example, suppose we had two pseudo transactions, returned from most recent to oldest:

T1: start=10:30 end=10:40 clientip=a cookie=x
T2: start=10:10 end=10:20 clientip=a cookie=x

The gap in time between these two transactions is the difference between the start time of T1 (10:30) and the end time of T2 (10:20), or 10 minutes. The rest of this recipe explains how to calculate these values.
First, we need to calculate the end time of each transaction, keeping in mind that the timestamp of a transaction is the time that the first event occurred and the duration is the number of seconds that elapsed between the first and last event in the transaction:

… | eval end_time = _time + duration

Next, we need to add the start time of the previous (i.e., more recent) transaction to each transaction. That will allow us to calculate the difference between the start time of the previous transaction and our calculated end_time.

To do this, we can use streamstats to calculate the last value of the start time (_time) seen in a sliding window of just one transaction (global=false and window=1) and to ignore the current event in that sliding window (current=false). In effect, we're instructing streamstats to look only at the previous event's value. Finally, note that we're specifying that this window is only applicable to the given user (clientip-cookie pair):

… | streamstats first(_time) as prev_starttime
global=false window=1 current=false by clientip, cookie

At this point, the relevant fields might look something like this:

T1: _time=10:00:06, duration=4, end_time=10:00:10
T2: _time=10:00:01, duration=2, end_time=10:00:03
prev_starttime=10:00:06
T3: _time=10:00:00, duration=0, end_time=10:00:01
prev_starttime=10:00:01

Now, we can finally calculate the difference in time between the previous transaction’s start time (prev_starttime) and the calculated end_time. That difference is the gap between transactions, the amount of time (in seconds) passed between two consecutive transactions from the same user (clientip-cookie pair).

… | eval gap_time = prev_starttime - end_time
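Assembled end to end, and optionally summarized per user, the recipe looks roughly like this (a sketch built from the steps above; the final avg is only one example of what you might report):

… | transaction clientip, cookie maxpause=10m
| eval end_time = _time + duration
| streamstats first(_time) as prev_starttime global=false window=1 current=false by clientip, cookie
| eval gap_time = prev_starttime - end_time
| stats avg(gap_time) as avg_gap by clientip, cookie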


About the Author

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has 4+ years of experience in the Big Data domain and provides consultancy services to several Fortune 500 companies. His expertise includes breaking down highly technical concepts into easy-to-understand content.