
Event correlation and data mining for event logs


Risto Vaarandi
SEB Eesti Ühispank
risto.vaarandi@seb.ee
Outline
Event logging and event log monitoring
Event correlation concept and existing solutions
Simple Event Correlator (SEC)
Frequent itemset mining for event logs
Data clustering for event logs
Discussion
Event logging
Event: a change in the system state, e.g., a disk failure; when a
system component (application, network device, etc.) encounters an
event, it can emit an event message that describes the event.
Event logging: the procedure of storing event messages to a local or
remote (usually flat-file) event log.
Event logs play an important role in modern IT systems:
many system components like applications, servers, and network devices
have built-in support for event logging (with the BSD syslog protocol being
a widely accepted standard),
since in most cases event messages are appended to event logs in real time,
event logs are an excellent source of information for monitoring the
system (a number of tools like Swatch and Logsurfer have been developed
for log monitoring),
information that is stored to event logs can be useful for analysis at a later
time, e.g., for audit procedures.
Centralized logging infrastructure
[Diagram: applications, servers, and network devices send their events over the network to a central log server; a syslog server on the central log server writes them to event logs, and a log file monitor reads the logs and sends notifications to the monitoring console.]
Applications, servers, and network devices use the syslog protocol for logging their
events to the central log server that runs a syslog server. Log monitoring takes place
on the central log server and alerts are sent to the monitoring console.
Event log monitoring
Dec 18 08:47:26 myhost [daemon.info] sshd[15172]: log: Connection
from 10.2.211.19 port 1304
Dec 18 08:47:39 myhost [daemon.info] sshd[15172]: log: Password
authentication for alex accepted.
Dec 18 08:50:09 myhost [kern.crit] vmunix: /var/tmp: file system full
Dec 18 08:50:10 myhost [mail.debug] imapd[15399]: imap service init
from 10.2.213.2
Dec 18 08:50:10 myhost [mail.info] imapd[15399]: Login user=risto
host=risto2 [10.2.213.2]

Commonly used log monitoring tools: Swatch, Logsurfer, etc. (see
http://www.loganalysis.org for more information).
Current log monitoring practice: match logged lines in real time with
regular expressions and perform an action (e.g., send an alert) when a
matching line is observed.
Open issues: in order to write rules (regexp → action) for log
monitoring tools, one must have good knowledge of the IT system
and its log messages; existing tools don't support event correlation well.
Event correlation
Event correlation: a conceptual interpretation procedure where new
meaning is assigned to a set of events that happen within a
predefined time interval [Jakobson and Weissman, 1995]. During the
event correlation process, new events might be inserted into the
event stream and original events might be removed.
Examples:
if 10 login failure events occur for a user within 5 minutes, generate a
security attack event;
if both "device internal temperature too high" and "device not responding"
events have been observed within 5 seconds, replace them with the event
"device down due to overheating".
A number of approaches have been proposed for event correlation
(rule-based, codebook-based, neural-network-based, etc. methods),
and a number of event correlation products are available on the
market (HP ECS, SMARTS, NerveCenter, RuleCore, LOGEC, etc.).
Event correlation approaches
Rule-based (HP ECS, IMPACT, RuleCore, etc.): events are
correlated according to condition → action rules that are
specified by a human analyst.
Codebook-based (SMARTS): if a set of events e1, ..., ek must be
interpreted as event A, then e1, ..., ek are stored in the codebook as a
bit-vector pointing to A. In order to correlate a set of events, look for
the most closely matching vector in the codebook, and report the
interpretation that corresponds to that vector.
Graph-based: find all dependencies between system components
(network devices, hosts, services, etc.) and construct a graph with
each node representing a system component and each edge a
dependency between two components. When a set of fault events
occurs, use the graph to find the possible root cause(s) of the fault
events (e.g., 10 "HTTP server not responding" events were caused
by the failure of a single network link).
Neural-network-based: a neural net is trained for the detection of
anomalies in the event stream, the root cause(s) of fault events, etc.
Motivation for developing SEC
Existing event correlation products have the following drawbacks:
Complex design and resource requirements
they are mostly heavyweight solutions that are difficult to deploy and maintain,
and that require extensive user training;
they are not very suitable for application on nodes with limited computing
resources (e.g., nodes in sensor and ad hoc networks);
many products are based on the client-server model which is inconvenient for
fully distributed event correlation,
Platform and domain dependence
they are usually distributed in a binary form for a limited number of OS platforms;
some products are designed for one system management platform only;
some products have been designed for network fault management and their
application in other domains (including event log monitoring) is cumbersome,
The issue of pricing: they are quite expensive (at the present time, SEC is
the only freely available event correlation engine).
Summary: there is a need for lightweight, platform-independent,
and open-source event correlation solutions, since heavyweight
proprietary systems are infeasible for many tasks and environments.
Key features of SEC
Uses a rule-based approach for event correlation; this approach was
chosen because of its naturalness of knowledge representation and
the transparency of the event correlation process to the end user,
Written in Perl, thus open-source and cross-platform,
Licensed under the terms of GNU GPL,
Easy to install and configure: no need for compiling and linking the
source, no dependencies on other software, and configuration files can
be edited with any text editor like vi,
Small in size and doesn't consume many system resources (CPU
time and memory),
Reads input from log files, named pipes, and standard input
(arbitrary number of input sources can be specified); employs
regular expression patterns, Perl subroutines, substrings, and truth
values for matching input events,
Can be used as a standalone event log monitoring solution, but can also be
integrated with other applications through a file/named-pipe interface.
SEC configuration
SEC event correlation rules are stored in regular text files.
Rules from one configuration file are used sequentially in the same
order as they are given in the file; rule sets from different
configuration files are applied virtually in parallel.
Most rules have the following components:
event matching pattern
optional Boolean expression of contexts
event correlation key
correlation information (e.g., event counting threshold and window)
list of actions
With appropriate patterns, context expressions, and action lists,
several rules can be combined into one event correlation scheme.
When an event matches a rule, the event correlation key is calculated,
and if there exists an event correlation operation with the same key,
the event is correlated by that operation. If there is no such operation
and the rule specifies correlation of events over time, the rule
starts a new operation with the calculated key.
SEC rule types
Single: execute an action list when a matching event is observed.
SingleWithScript: like Single, but also use an external script for
matching.
SingleWithSuppress: like Single, but ignore subsequent matching
events for t seconds.
Pair: execute an action list on event A and ignore the following
instances of A until event B arrives. On the arrival of B, execute
another action list.
PairWithWindow: after observing event A, wait for t seconds for
event B to arrive. If B does not arrive on time, execute an action list,
otherwise execute another action list.
SingleWithThreshold: count matching input events during t
seconds and if a given threshold is exceeded, execute an action list.
SingleWith2Thresholds: like SingleWithThreshold, but with a
second round of counting with a falling threshold.
Suppress: suppress matching input events.
Calendar: execute an action list at specific times.
SEC pattern types
Regular expressions (Perl dialect), Perl subroutines (precompiled),
substrings, and truth values are supported as patterns. Examples:

# Sample matching input line:
# Dec 18 08:50:09 myhost [kern.crit] vmunix: /var/tmp: file system full
# Note that the $1 variable will be set to the file system name (/var/tmp)

pattern=(\S+): [fF]ile system full

# Sample matching input line:
# Oct 13 10:34:51 sensor1 [auth.alert] snort: Portscan detected from
# 213.31.250.205 Talker(fixed: 1 sliding: 1) Scanner(fixed: 15 sliding: 15)
# $1 = scanner IP address, $2 = scanner hostname, $3 = portscan description

pattern=sub { if ($_[0] =~ /Portscan detected from (\S+) (.+)/) { \
my($ip, $text) = ($1, $2); my($name) = $ip; my($rr); \
my($res) = Net::DNS::Resolver->new; my($query) = $res->search($ip, "PTR");\
if ($query) { foreach $rr ($query->answer) { \
if ($rr->type eq "PTR") { $name = $rr->ptrdname; last; } } } \
return ($ip, $name, $text); } else { return 0; } }
SEC contexts and actions
SEC context: a logical entity that can be created or deleted from a
rule (or internally by SEC for tagging input). At creation, the context
lifetime can be set to a certain finite value (e.g., 20 seconds).
The presence or absence of a context can decide whether a rule is
applicable or not (e.g., if a rule definition has A OR B specified for
its context expression, and neither context A nor B exist, the rule will
not be applied).
A context can act as an event store: events can be associated with
a context, and all the collected events can be supplied for external
processing at a later time (e.g., collect events from a suspicious FTP
session, and mail them to the security administrator at the end of the
session; see the sketch below).
SEC actions: invoke external programs (shellcmd, spawn, etc.),
generate synthetic events (event), reset event correlation operations
(reset), perform context operations (create, report, etc.), set user-defined
variables (assign, etc.), write to files or FIFOs (write),
execute Perl programs or precompiled subroutines (eval, call).
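
As an illustration of a context acting as an event store, here is a minimal rule sketch (an assumption for illustration only, not taken from the original slides): it collects the messages of an FTP session into a context and mails them to root when the session ends. The ftpd message format and the regular expressions are hypothetical.

# Create an event store for each FTP session (lifetime capped at 3600 seconds).
type=Single
ptype=RegExp
pattern=ftpd\[(\d+)\]: connection from
continue=TakeNext
desc=FTP session $1 started
action=create FTP_SESSION_$1 3600

# Add every message of the session to its context ($0 is the whole input line).
type=Single
ptype=RegExp
pattern=ftpd\[(\d+)\]:
context=FTP_SESSION_$1
continue=TakeNext
desc=FTP session $1 event
action=add FTP_SESSION_$1 $0

# At the end of the session, mail the collected events and drop the context.
type=Single
ptype=RegExp
pattern=ftpd\[(\d+)\]: connection closed
context=FTP_SESSION_$1
desc=FTP session $1 ended
action=report FTP_SESSION_$1 /usr/bin/mail -s 'FTP session events' root; delete FTP_SESSION_$1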
A sample SEC rule
# This example assumes that SEC has been started with the --intcontexts option
type=SingleWithSuppress
ptype=RegExp
pattern=(\S+) \[kern\.crit\] vmunix: (\S+): [fF]ile system full
context=_FILE_EVENT_/logs/srv1.messages || _FILE_EVENT_/logs/srv2.messages
desc=$1:$2 file system full
action=pipe 'File system $2 at host $1 full' mail -s 'FS full' root
window=900

When a "file system full" message is logged to either /logs/srv1.messages or
/logs/srv2.messages, SEC will send an e-mail alert to the local root user, and
ignore repeated "file system full" messages for the same host and the same file
system for 900 seconds.
The desc parameter defines the event correlation key and the scope of event
correlation: if we replaced the key "$1:$2 file system full" with just "file system full",
we would get just one alert (at 12:30) for the following messages:

Oct 13 12:30:00 srv1 [kern.crit] vmunix: /tmp: file system full
Oct 13 12:35:00 srv2 [kern.crit] vmunix: /home: file system full
A sample SEC rule base for Cisco
router/switch syslog messages
# If a router interface is in down state for less
# than 15 seconds, generate event
# "<router> INTERFACE <interface> SHORT OUTAGE";
# otherwise generate event
# "<router> INTERFACE <interface> DOWN".

type=PairWithWindow
ptype=RegExp
pattern=(\S+) \d+: %LINK-3-UPDOWN: Interface (.+), changed state to down
desc=$1 INTERFACE $2 DOWN
action=event %s
ptype2=RegExp
pattern2=($1) \d+: %LINK-3-UPDOWN: Interface ($2), changed state to up
desc2=$1 INTERFACE $2 SHORT OUTAGE
action2=event %s
window=15

# If "<router> INTERFACE <interface> DOWN" event is
# received from the previous rule, send a notification and
# wait for "interface up" event for the next 24 hours.



type=Pair
ptype=RegExp
pattern=(\S+) INTERFACE (\S+) DOWN
desc=$1 interface $2 is down
action=shellcmd notify.sh "%s"
ptype2=RegExp
pattern2=($1) \d+: %LINK-3-UPDOWN: Interface ($2), changed state to up
desc2=$1 interface $2 is up
action2=shellcmd notify.sh "%s"
window=86400

# If ten "short outage" events from the first rule have been
# observed in the window of 6 hours, send a notification.

type=SingleWithThreshold
ptype=RegExp
pattern=(\S+) INTERFACE (\S+) SHORT OUTAGE
desc=Interface $2 at node $1 is unstable
action=shellcmd notify.sh "%s"
window=21600
thresh=10

The work of the sample rule base
[Diagram: an "Interface A@B down" message enters the PairWithWindow rule (window 15 s). If the matching "Interface A@B up" message arrives within the window, the rule generates the synthetic event "B INTERFACE A SHORT OUTAGE", which is counted by the SingleWithThreshold rule (window 21600 s, threshold 10); once the threshold is reached, notify.sh is run with "Interface A at node B is unstable". If the "up" message does not arrive in time, the rule generates "B INTERFACE A DOWN", which triggers the Pair rule: notify.sh is run with "B interface A is down", and again with "B interface A is up" when the interface comes back up.]
SEC application experience
Employed by many companies and other institutions around the
world: major North American telecom companies with very large
networks, universities, financial institutions, a major European car
manufacturer, a major US manufacturer of medical equipment, etc.
Has been employed for event log monitoring, as an event correlator
for major network and system management platforms (HP
OpenView, CiscoWorks, BMC Patrol, Nagios), event correlator for
Snort IDS and other security systems, event correlator in medical
systems and weather forecast systems, etc.
Has been used on Linux, Solaris, HP-UX, AIX, FreeBSD, Tru64
UNIX, Mac OS X, and Windows (with CygWin Perl).
Part of the Debian and Gentoo Linux distributions.
As of October 2005, the SEC mailing list had more than 260
subscribers and contained more than 1,000 posts.
Papers, book chapters, etc.: Hardening Linux, Chapter 5 (Apress,
2005); the March 2005 issue of Sys Admin Magazine; USENIX
LISA '04 (both technical sessions and tutorials); IEEE IPOM '02.
Data mining for event logs
Data mining for event logs has been identified as an important
system management task: the detected knowledge can be used for
building rules for event correlation systems or event log monitoring
tools, improving the design of web sites, etc.
Recently proposed mining algorithms are mostly based on the Apriori
algorithm for mining frequent itemsets and are designed for mining frequent
patterns of event types. The event log is viewed as a sequence
{E1, ..., En}, where Ei = (ti, ei), ti is the occurrence time of Ei, ei is the type of
Ei, and if i < j then ti ≤ tj. A frequent pattern can be defined in several
ways, the most common definitions being window- and slice-based.
Shortcomings of existing mining approaches:
Apriori is known to be inefficient for mining longer patterns,
infrequent events remain undetected but are often interesting (e.g., fault
events are normally infrequent but highly interesting),
focused on mining event type patterns from preprocessed event logs,
ignoring patterns of other sorts (in particular, line patterns from raw event
logs help one to find event types or write rules for log monitoring tools).
Frequent itemset mining problem
In this talk, an efficient frequent itemset mining algorithm will be
presented that can be employed for mining both line and event type
patterns from event logs.
Let I = {i1, ..., in} be a set of items. If X ⊆ I, X is called an itemset, and
if |X| = k, X is called a k-itemset.
A transaction is a tuple (tid, X), where tid is a transaction identifier
and X is an itemset. A transaction database D is a set of
transactions (with each transaction having a unique id).
The cover of an itemset X is the set of identifiers of transactions that
contain X: cover(X) = {tid | (tid, Y) ∈ D, X ⊆ Y}.
The support of an itemset X is the number of elements in its cover:
supp(X) = |cover(X)|.
The frequent itemset mining problem: given the transaction
database D and the support threshold s, find the set of frequent
itemsets {X ⊆ I | supp(X) ≥ s} and the supports of the frequent
itemsets.
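
To make the definitions concrete, here is a small Perl sketch (an illustration only, not part of the original material) that computes covers and supports by brute force for the toy transaction database that also appears on the Apriori itemset trie slide below.

#!/usr/bin/perl
# Brute-force cover and support computation over a toy transaction database.
use strict;
use warnings;

# transaction database: tid => itemset (each item is a single character)
my %db = (1 => 'abcde', 2 => 'abc', 3 => 'bcd', 4 => 'abc', 5 => 'ab');

# cover(X) = identifiers of the transactions that contain every item of X
sub cover {
    my @items = split //, $_[0];
    return grep { my $t = $db{$_}; !grep { index($t, $_) < 0 } @items } keys %db;
}

# supp(X) = |cover(X)|
sub supp { return scalar cover($_[0]); }

printf "supp(ab)=%d supp(abc)=%d supp(abcd)=%d\n",
       supp('ab'), supp('abc'), supp('abcd');    # prints 4, 3 and 1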
Event logs as transaction
databases
For mining frequent event type patterns, the algorithm uses a slice-based
approach:
Event log L = {E1, ..., En} is divided into non-overlapping slices S1, ..., Sm:
L = ∪i=1..m Si, Si ∩ Sj = ∅ for i ≠ j.
Slices are formed according to some criterion that is specified by the end
user, e.g., each slice contains the events from one session in the Internet bank.
All possible event types act as the set of items I, and each slice can be
considered a transaction with its tid between 1 and m. For the slice Si =
{E1, ..., Ek}, where Ej = (tj, ej) is an event from Si (tj is the occurrence time of Ej
and ej the type of Ej), the transaction corresponding to Si is (i, ∪j=1..k {ej}).
For mining line patterns from raw event logs, the algorithm
views lines and patterns as sets of (word, position) pairs, e.g., the pattern
Router * interface * down corresponds to a set {(Router, 1), (interface, 3),
(down, 5)};
all distinct (word, position) pairs from the event log form the set of items I,
and each event log line forms a transaction (if there are n lines, we use line
numbers from 1 to n for tids).
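
A small Perl sketch (an illustration, not the LogHound implementation) of how raw log lines are turned into transactions of (word, position) items:

#!/usr/bin/perl
# Read raw log lines from stdin and print them as transactions of
# (word, position) items; the line number serves as the transaction id.
use strict;
use warnings;

my $tid = 0;
while (my $line = <STDIN>) {
    chomp $line;
    $tid++;
    my @words = split ' ', $line;
    # encode each (word, position) item as "position:word", positions from 1
    my @items = map { ($_ + 1) . ':' . $words[$_] } 0 .. $#words;
    print "tid=$tid items: @items\n";
}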
Frequent itemset mining algorithms
For the sake of efficient implementation, most algorithms order the
items according to certain criteria. In the rest of this presentation, we
assume that the set of items I is an ordered set and that if X = {x1, ..., xn} is
an itemset, then x1 < ... < xn.
Apriori algorithm (works in a breadth-first manner): discovered
frequent k-itemsets are used to form candidate (k+1)-itemsets, and
frequent (k+1)-itemsets are found from the set of candidates by
making a pass over the database and counting the support of the
candidates.
Candidate generation is based on the following observation: each
subset of a frequent itemset must also be frequent. For each pair of
frequent k-itemsets {x1, ..., xk} and {x1, ..., xk-1, yk}, generate a candidate
by joining them: {x1, ..., xk, yk}. According to the observation, all
frequent (k+1)-itemsets are included in the set of candidates.
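
The join step can be sketched in a few lines of Perl (an illustration under the assumption that itemsets are kept as strings of lexicographically ordered item characters; this is not the actual implementation):

#!/usr/bin/perl
# Apriori candidate generation: join frequent k-itemsets that share the same
# (k-1)-item prefix to obtain candidate (k+1)-itemsets.
use strict;
use warnings;

sub gen_candidates {
    my @frequent = sort @_;              # frequent k-itemsets, items ordered
    my @candidates;
    for my $i (0 .. $#frequent) {
        for my $j ($i + 1 .. $#frequent) {
            my ($x, $y) = @frequent[$i, $j];
            # join only if the two itemsets differ in their last item alone
            next unless substr($x, 0, -1) eq substr($y, 0, -1);
            push @candidates, $x . substr($y, -1);
        }
    }
    return @candidates;
}

# frequent 2-itemsets of the toy database (support threshold 2)
print join(' ', gen_candidates(qw(ab ac bc bd cd))), "\n";   # prints: abc bcd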
Apriori itemset trie
Transaction db:
(1, abcde)
(2, abc)
(3, bcd)
(4, abc)
(5, ab)
supp.thresh.=2
a < b < c < d < e
[Figure: the Apriori itemset trie for this database. The first layer holds the frequent items a, b, c, d with supports 4, 5, 4, 2; deeper layers hold frequent itemsets such as ab (4), ac (3), bc (4), bd (2), cd (2), abc (3), and bcd (2); candidate nodes with insufficient support are removed.]
Each edge in the trie is labeled with the name of a certain item.
Apriori builds the itemset trie layer by layer: when the node
layer at depth k is complete, each node at depth k represents
a frequent k-itemset, where the path to that node identifies the
items in the set and the counter in that node the support of the
set. Candidate generation: for each node N at depth k,
create candidate child nodes by inspecting the sibling nodes of N.
Breadth-first vs. depth-first
Shortcomings of Apriori: when the database contains larger
frequent itemsets (e.g., containing 30-40 items), the trie will become
very large (there will be 2^k - 1 nodes in the trie for each frequent
k-itemset). As a result, the runtime and memory cost of the algorithm
will be prohibitive.
Eclat and FP-growth algorithms: these algorithms first load the
transaction database into main memory. At each step they
search for frequent itemsets {p1, ..., pk-1, x} with a certain prefix P =
{p1, ..., pk-1}, where P is a previously detected frequent itemset. The
in-memory representation of the database allows the algorithm to
search only the transactions that contain P. After the search, the prefix
for the next step is chosen from the detected frequent itemsets
(or found by backtracking).
Drawback: the transaction database must fit into main memory,
which is not always the case (including for many event log data sets!).
Proposed solution: use the breadth-first approach and the itemset
trie data structure, with special techniques for speeding up the
mining process and reducing its memory consumption.
Properties of event log data
For raw event logs, the transaction database can contain a large
number of items (e.g., values exceeding 10 million are common).
However, only a few items are relatively frequent (e.g., occur at least
once per 1,000 transactions). Also, most items appear only a few
times in the data set.
Frequent itemsets may contain many items (e.g., for the support
threshold 0.1% we have found that values 20-30 are quite common).
This means that Apriori is not always adequate for processing event
log data.
There are often strong correlations between frequent items in
transactions, e.g., item A never appears without item B. In the case
of raw event logs, this is often caused by the log message
formatting, e.g.,
syslog(LOG_INFO|LOG_DAEMON, "Connection from %s", ipaddress);

Mining frequent items
[Figure: item occurrence counts A:1, B:3, C:1, D:15, E:1, F:1, G:4, I:2, H:1 are hashed into a summary vector of three counters with values C0 = 5, C1 = 17, C2 = 7; the items D, E, and F fall into the bucket with counter value 17.]
The mining of frequent items (1-itemsets) is the first step of any breadth-first algorithm.
Problem: the number of items in the data set can be very large, and the memory cost of item
counting (a database pass with item counters kept in memory) can be very high.
Solution: make an extra pass over the database before the counting, and hash each item to
an integer value 0..m-1 (m << the number of items). Each time the value i is calculated by the
hashing function, increment the i-th counter in the summary vector. Then make another
database pass and count only those items for which the counter values are not below the
support threshold.
Example: if the support threshold is 10, only the items D, E, and F can possibly be frequent!
Since one of the properties of event log data is that the majority of items are very infrequent,
most counters are likely to remain below the support threshold. Because the vector itself
requires only a small amount of memory, the overall memory savings are usually substantial.
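
A minimal Perl sketch of the summary vector technique (an illustration; the item encoding and the hash function are assumptions, not the LogHound implementation):

#!/usr/bin/perl
# Two passes over a log file: first hash every (word, position) item into a
# small summary vector, then count only the items whose vector counter
# reached the support threshold.
use strict;
use warnings;

my ($file, $thresh, $m) = (shift, shift // 10, shift // 100_000);
die "usage: $0 <logfile> [threshold] [vectorsize]\n" unless defined $file;

sub bucket {                                # simple string hash into 0..m-1
    my $v = 0;
    $v = ($v * 31 + ord $_) % $m for split //, $_[0];
    return $v;
}

# pass 1: build the summary vector
my @summary = (0) x $m;
open my $fh, '<', $file or die "cannot open $file: $!";
while (<$fh>) {
    my @w = split;
    $summary[ bucket("$_:$w[$_]") ]++ for 0 .. $#w;
}

# pass 2: count only potentially frequent items
my %count;
seek $fh, 0, 0;
while (<$fh>) {
    my @w = split;
    for my $i (0 .. $#w) {
        my $item = "$i:$w[$i]";
        $count{$item}++ if $summary[ bucket($item) ] >= $thresh;
    }
}
close $fh;

print "$_ $count{$_}\n" for grep { $count{$_} >= $thresh } sort keys %count;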
Transaction cache
Motivation: keep the most frequently used transaction data in memory
for speeding up the work of the algorithm.
Observation: if F is the set of frequent items and (tid, X) is a
transaction, we only need to consider the items X ∩ F (the frequent items of
the transaction) during the mining process.
Transaction cache: use the summary vector technique for detecting
which sets X ∩ F have a chance to correspond to C or more
transactions, and load them into main memory (identical sets are
stored as a single record with an occurrence counter); write the rest
of the sets to disk. The value of C is given by the user.
Result: the cache is guaranteed to contain the most frequently used
transaction data ({Y ⊆ I | |{(tid, X) ∈ D | X ∩ F = Y}| ≥ C}), and the
amount of data stored in the cache is controlled by the user. There
is no dependency on the amount of main memory as with
depth-first algorithms, while the cache hit ratio is likely to be quite
high.
Reducing the size of the itemset
trie
Motivation: with a smaller trie, less memory and CPU time are consumed.
Observation: when there are many strong correlations between
frequent items in transactions, many parts of the Apriori itemset trie
contain redundant information.
Let F = {f1, ..., fn} be the set of all frequent items. We call the set
dep(fi) = {fj | fi ≠ fj, cover(fi) ⊆ cover(fj)} the dependency set of fi, and
say that fi has m dependencies if |dep(fi)| = m.
Dependency prefix of item fi: pr(fi) = {fj | fj ∈ dep(fi), fj < fi}.
Dependency prefix of itemset {fi1, ..., fik}: pr({fi1, ..., fik}) = ∪j=1..k pr(fij).
The trie reduction technique: if pr(X) ⊄ X, don't create a node for X
in the trie. In order to maximize the efficiency of the technique,
reorder the frequent items so that if fi < fj, then |dep(fi)| ≤ |dep(fj)|.
Note that we can't generate candidates as a separate step like
Apriori does, since some nodes needed for that step could be missing
from the trie due to the trie reduction technique. Thus, we generate
candidates on the fly during support counting.
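
To illustrate these definitions, a small Perl sketch (an assumption for illustration, not the LogHound source) that computes the dependency sets and dependency prefixes of the frequent items of the toy database from the trie slides:

#!/usr/bin/perl
# Compute dep() and pr() for the frequent items a, b, c, d of the toy database.
use strict;
use warnings;

my %db   = (1 => 'abcde', 2 => 'abc', 3 => 'bcd', 4 => 'abc', 5 => 'ab');
my @freq = qw(a b c d);                       # frequent items (threshold 2)

# cover(item) = set of tids whose transaction contains the item
my %cover;
for my $tid (keys %db) {
    $cover{$_}{$tid} = 1 for grep { index($db{$tid}, $_) >= 0 } @freq;
}

# dep(f) = { g != f | cover(f) is a subset of cover(g) }
my %dep;
for my $f (@freq) {
    for my $g (@freq) {
        next if $f eq $g;
        my @missing = grep { !$cover{$g}{$_} } keys %{ $cover{$f} };
        push @{ $dep{$f} }, $g unless @missing;
    }
}

# order items by their number of dependencies; pr(f) = deps that precede f
my @ordered = sort { @{ $dep{$a} || [] } <=> @{ $dep{$b} || [] } } @freq;
my %pos; @pos{@ordered} = 0 .. $#ordered;
for my $f (@ordered) {
    my @pr = grep { $pos{$_} < $pos{$f} } @{ $dep{$f} || [] };
    printf "%s: dep={%s} pr={%s}\n",
           $f, join(',', @{ $dep{$f} || [] }), join(',', @pr);
}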
Building the reduced trie
Transaction db:
(1, abcde)
(2, abc)
(3, bcd)
(4, abc)
(5, ab)
supp.thresh.=2
b < c < a < d
F = {a, b, c, d}, dep(d) = {b, c}, dep(c) = dep(a) = {b}, dep(b) = ∅.
This yields the ordering b < c < a < d, and the dependency prefixes
are pr(b) = ∅, pr(c) = pr(a) = {b}, pr(d) = {b, c}. Trie building:
when the depth of the trie is k, make a data pass and count all
(k+1)-itemsets of frequent items that begin with frequent k-itemsets
present in the trie and that contain their dependency
prefixes. Candidate nodes are generated during the counting.
[Figure: the reduced itemset trie for this database: root node; child b (support 5); children of b: c (4) and a (4); children of bc: a (3, representing itemset abc) and d (2, representing itemset bcd).]
Deriving all itemsets from the trie
and further optimizations
It can be shown that the non-root nodes of the trie represent all frequent
itemsets that contain their dependency prefixes, and that all frequent
itemsets can be derived from the non-root nodes: if a node represents
an itemset X, we can derive the frequent itemsets {X \ Y | Y ⊆ pr(X)}
from X, with all such itemsets having the support supp(X).
Observation: if the trie reduction technique was not applied at node
N for reducing the number of its child nodes, and M is a child of N,
then the sibling nodes of M contain all the nodes necessary for
candidate generation in Apriori fashion.
If we augment the algorithm with such an optimization, the algorithm
becomes a generalization of Apriori:
if at node N the algorithm discovers that the trie reduction technique is no
longer effective, it switches to Apriori for the subtrie that starts from N,
if there are no dependencies between frequent items (i.e., frequent items
are weakly correlated), the algorithm switches to Apriori at the root node,
i.e., it behaves like Apriori from the start.
The summary of the algorithm
1. Make a pass over the database, detect frequent items, and order them in lexicographic order
(if the number of items is very large, the summary vector technique can be used for filtering out
irrelevant items). If no frequent items were found, terminate.
2. Make a pass over the database, in order to calculate dependency sets for frequent items and
to build the transaction summary vector.
3. Reorder frequent items in dependency ascending order and find their dependency prefixes.
4. Make a pass over the database, in order to create the cache tree and the out-of-cache file.
5. Create the root node of the itemset trie and attach nodes for frequent items with empty
dependency prefixes to the root node. If all frequent items have empty dependency prefixes,
set the APR-flag in the root node.
6. Let k := 1.
7. Check all nodes in the trie at depth k. If the parent of a node N has the APR-flag set, generate
candidate child nodes for the node N in Apriori fashion (node counters are set to zero), and set
the APR-flag in the node N.
8. Build the next layer of nodes in the trie using the trie reduction technique, with the following
additional condition: if the APR-flag is set in a node at depth k, don't attach any additional
candidate nodes to that node.
9. Remove the candidate nodes (nodes at depth k+1) with counter values below the support
threshold. If all candidate nodes were removed, output frequent itemsets and terminate.
10. Find the nodes at depth k for which the trie reduction technique was not applied during step 8
for reducing the number of their child nodes, and set the APR-flag in these nodes. Then let k :=
k + 1 and go to step 7.
LogHound and examples of
detected patterns
The frequent itemset mining algorithm for event logs has been
implemented in a tool called LogHound (written in C, distributed
under the terms of GNU GPL).
Includes several features for preprocessing raw event logs (support
for regular expression filters, line conversion templates, etc.)
Sample frequent line patterns detected with LogHound:
Dec 18 * myhost * connect from
Dec 18 * myhost * log: Connection from * port
Dec 18 * myhost * fatal: Did not receive ident string.
Dec 18 * myhost * log: * authentication for * accepted.
Dec 18 * myhost * fatal: Connection closed by remote host.
Sample frequent event type pattern detected with LogHound (the
CodeRed worm footprint from the Snort IDS log):
WEB-IIS CodeRed v2 root.exe access
WEB-IIS cmd.exe access
HTTP Double Decoding Attack
WEB-IIS unicode directory traversal attempt
Data clustering for event logs
The data clustering problem: divide a set of data points into groups
(clusters), where points within each cluster are similar to each other.
Points that do not fit well to any of the detected clusters are
considered to form a special cluster of outliers.
The data clustering algorithm presented in this talk has been
designed for dividing event log lines into clusters, so that:
each regular cluster corresponds to a certain frequently occurring line
pattern (e.g., Interface * down),
the cluster of outliers contains rare lines (which possibly represent fault
conditions or unexpected behavior of the system).
Traditional clustering methods assume that data points belong to
the space R^n, and similarity between data points is measured in terms of
the distance between points; many algorithms use a variant of the Lp norm
as a distance function:
Lp(x, y) = (Σi=1..n |xi - yi|^p)^(1/p)
(L1 is the Manhattan distance, L2 the Euclidean distance)
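
As a tiny illustration (not from the slides), the Lp distance in Perl, with p = 1 giving the Manhattan distance and p = 2 the Euclidean distance:

#!/usr/bin/perl
# Lp distance between two numeric points given as array references.
use strict;
use warnings;

sub lp_distance {
    my ($x, $y, $p) = @_;
    my $sum = 0;
    $sum += abs($x->[$_] - $y->[$_]) ** $p for 0 .. $#$x;
    return $sum ** (1 / $p);
}

printf "L1=%g L2=%g\n",
       lp_distance([1, 2, 3], [4, 6, 3], 1),    # 3 + 4 + 0 = 7
       lp_distance([1, 2, 3], [4, 6, 3], 2);    # sqrt(9 + 16 + 0) = 5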
Traditional clustering methods
Weaknesses of traditional data clustering methods:
- they are unable to handle non-numerical (categorical) data,
- they don't work well in high-dimensional data spaces (n > 10),
- they are unable to detect clusters in subspaces of the original data space.
When we view event log lines as data points with categorical attributes, where the
m-th word of the line is the value for the m-th attribute, e.g.,
(Password, authentication, for, john, accepted),
then all the weaknesses listed above are also relevant for event log data sets.
Recent clustering algorithms that
address the problems
CLIQUE and MAFIA employ an Apriori-like algorithm for detecting
clusters. Instead of measuring the distance between individual
points, the algorithms identify dense regions in the data space, and form
clusters from these regions. First, clusters in 1-dimensional
subspaces are detected. After the clusters C1, ..., Cm in (k-1)-dimensional
subspaces have been detected, cluster candidates for k-dimensional
subspaces are formed from C1, ..., Cm. The algorithm then checks
which candidates are actual clusters, etc. Unfortunately, Apriori's
performance deteriorates as k increases (for detecting a cluster in a
k-dimensional space, 2^k - 2 of its superclusters must be produced first).
CACTUS makes a pass over the data and builds a data summary,
and then makes another pass over the data to find clusters using the
summary. Although fast, CACTUS generates clusters with stretched
shapes, which is undesirable for log file data clustering.
PROCLUS uses the K-medoid method to partition the data space into
K clusters. However, it is not obvious what the right value for K is.

The algorithm features and
definitions
The algorithm views every event log line as a data point with
categorical attributes, where the m-th word of the line is the value for
the m-th attribute.
Makes only a few passes over the data (like the CACTUS algorithm).
Uses a density-based approach for clustering (like the CLIQUE and
MAFIA algorithms): identifies dense regions in the data space and
forms clusters from them.
Region S: a subset of the data space, where certain attributes
i1, ..., ik of all the points that belong to the region S have identical values
v1, ..., vk: ∀x ∈ S: x_i1 = v1, ..., x_ik = vk. (Note that each region corresponds to
a line pattern, e.g., Password authentication for * accepted.)
1-region: a region with one fixed attribute i1 (i.e., k = 1).
Dense region: a region that contains at least N points, where N is
the support threshold value given by the user.
The algorithm basic steps
1. Make a pass over the data, and identify all dense 1-regions (note that
dense 1-regions correspond to frequent words in the data set).
2. Make another pass over the data, and generate cluster candidates. For
every line that contains dense 1-regions, create a cluster candidate by
combining the fixed attributes of these regions. (For example, if the line is
"Password authentication for john accepted", and the words "Password",
"authentication", "for", and "accepted" are frequent, then the candidate is
"Password authentication for * accepted".) If the candidate is not present in
the candidate table, it is put there with a support value of 1; otherwise
its support value is incremented.
3. Optional step: for each candidate C, find all candidates representing
more specific patterns, and add their support values to the value of C.
4. Find which candidates in the candidate table have support values equal
or greater than the support threshold, and output them as clusters.
5. Detect outliers during a separate data pass.
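
A minimal Perl sketch of steps 1, 2, and 4 (an illustration only; SLCT itself is written in C and differs in its details):

#!/usr/bin/perl
# Step 1: find frequent (word, position) pairs; step 2: build a cluster
# candidate from the frequent words of every line; step 4: report candidates
# whose support reaches the threshold as clusters.
use strict;
use warnings;

my ($file, $thresh) = (shift, shift // 2);
die "usage: $0 <logfile> [threshold]\n" unless defined $file;

my %wordcount;
open my $fh, '<', $file or die "cannot open $file: $!";
while (<$fh>) {
    my @w = split;
    $wordcount{"$_:$w[$_]"}++ for 0 .. $#w;
}

my %candidate;
seek $fh, 0, 0;
while (<$fh>) {
    my @w = split;
    my @pattern = map { $wordcount{"$_:$w[$_]"} >= $thresh ? $w[$_] : '*' } 0 .. $#w;
    next unless grep { $_ ne '*' } @pattern;    # skip lines with no frequent words
    $candidate{"@pattern"}++;
}
close $fh;

for my $p (sort keys %candidate) {
    print "$candidate{$p}\t$p\n" if $candidate{$p} >= $thresh;
}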
SLCT - Simple Logfile Clustering Tool
The event log clustering algorithm has been implemented in a tool
called SLCT (written in C, distributed under the terms of GNU GPL).
Supports regular expression filters, so that particular lines (or a
particular cluster) can be inspected more closely.
Supports line conversion templates (i.e., before inspection, certain
parts of a line are assembled into a new line), e.g., with the filter
sshd\[[0-9]+\]: (.+) and the template $1 the line
sshd[2781]: connect from 10.1.1.1 will be converted to
connect from 10.1.1.1
Custom word delimiter can be specified (default is whitespace, but a
custom regular expression can be given).
SLCT can refine variable parts of patterns by looking for constant
heads and tails, e.g., the pattern connect from * is converted to
connect from 192.168.*
After the first run, SLCT can be applied iteratively to the file of
outliers, until the file is relatively small and can be inspected
manually for unknown faults and anomalies.
Examples of patterns detected with
SLCT
Sample clusters detected with SLCT:
sshd[*]: connect from 1*
sshd[*]: log: Connection from 1* port *
sshd[*]: log: * authentication for * accepted.
sshd[*]: log: Closing connection to 1*

Sample outliers detected with SLCT:
sendmail[***]: NOQUEUE: SYSERR(***): can not chdir(/var/spool/mqueue/):
Permission denied
sendmail[***]: ***: SYSERR(root): collect: I/O error on connection from
***, from=<***>
sendmail[***]: ***: SYSERR(root): putbody: write error: Input/output
error
login[***]: FAILED LOGIN 1 FROM (null) FOR root, Authentication failure
sshd[***]: Failed password for root from *** port *** ssh2
imapd[***]: Unable to load certificate from ***, host=*** [***]
imapd[***]: Fatal disk error user=*** host=*** [***] mbx=***: Disk quota
exceeded
References
SEC home page http://simple-evcorr.sourceforge.net
LogHound home page http://kodu.neti.ee/~risto/loghound/
SLCT home page http://kodu.neti.ee/~risto/slct/
My publications see http://kodu.neti.ee/~risto
