Botnet Detection Through Fine Flow Classification

Xiaonan Zang, Athichart Tangpong, George Kesidis and David J. Miller
Departments of CS&E and EE
The Pennsylvania State University
University Park, PA, 16802
CSE Dept Technical Report No. CSE11-001, Jan. 31, 2011
Abstract
Botnets, defined as groups of infected machines, have become the predominant factor behind
Internet attacks such as DDoS, spam, and click fraud. The number of botnets is steadily increasing, and
their characteristic C&C channels have evolved from IRC to HTTP, FTP, DNS, etc., and from a centralized structure to
P2P and Fast Flux Service Networks. To counter this escalation, the Internet security community
has designed many botnet detection and disruption systems, which fall into two categories: Honeynet-based and
Passive Traffic Monitoring. Passive Traffic Monitoring can be further divided into Behavior-based, DNS-based, and
Mining-based techniques. Among these Intrusion Detection System designs, the mining-based method, which operates on flow-level
Internet traffic, has shown promising resilience against botnet evolution. This paper reports a preliminary experiment
that observes the discriminating capabilities of the hierarchical and K-means clustering algorithms and explores an RTT
adjustment procedure for mixing a botnet trace with background Internet traffic.
I. INTRODUCTION
The term Botnet denotes a network of compromised end hosts (bots) under the remote command of a botmaster [37]. Once
a botnet has been constructed, these bots are controlled autonomously and automatically, in some cases to perform some illicit
monetary activities.
A. Botnet Life Cycle
The general life cycle of a botnet, shown in Figure 1, contains four phases: initial infection, secondary injection, maintenance
& update, and malicious activities [13].
[Figure: Botmaster, Command and Control (C&C) Server, Botnet, and Vulnerable End Host, connected by arrows labeled Initial Infection, Secondary Injection, Connection & Update, Maintenance & Update, and Malicious Commands.]
Fig. 1: A General Botnet Life Cycle.

This material is based upon work supported by the National Science Foundation under Grant No. 0915552 and a Cisco Systems URP gift.
1) Initial Infection: A computer can be infected in different ways: inadvertent execution of malicious code, exploitation of
system vulnerabilities, and access through engineered backdoors. Users may accidentally download and execute malicious programs
while viewing a web site, opening an email attachment, or clicking a link in an incoming instant message. Every
patch released for the most popular operating systems, such as Windows XP and Windows 7, is followed by
a flurry of reverse engineering in the hacker community that aims to exploit the problems the patch has fixed,
because millions of users do not update their computers promptly and properly. Also, some ports used for
Remote Access or File Sharing services are under constant scanning by other bots checking for vulnerabilities, for example,
port 135 - Microsoft Remote Procedure Call (RPC) service, and port 139 - NetBIOS File Sharing Service [3].
The term backdoor denotes a port that is forcibly opened by malicious software to allow remote connections,
thereby surrendering administrative control of the compromised computer. In practice, a vulnerable
computer is usually infected by multiple malicious programs. To take advantage of this fact, a single
malicious program routinely examines a list of ports for backdoors left by others, including port 2745 - backdoor of
the Bagle worm, and port 3410 - backdoor of the Optix Pro remote access trojan [3].
2) Secondary Injection: Although a particular botnet may make use of backdoors left by other botnets, this does not
mean that botmasters want a common shared pool of bots. Most communication and command protocols are therefore
designed for a specific botnet, and most of the source code is kept confidential. Although some of the most popular botnets have
their source code publicly available (e.g., Agobot, SDBot, and GT Bot), the complexity and modularity of the code
architecture, along with the constant evolution of the botnets, mean there are no standardized command and control functions [4].
Therefore, after a successful initial infection, the next step is to download and run the botnet code so that the host becomes a bot
under the control of a specific botmaster. This transfer can be carried out using Trivial File Transfer Protocol (TFTP),
File Transfer Protocol (FTP), HyperText Transfer Protocol (HTTP), or CSend [3].
3) Maintenance and Update: The first two stages only involve communication between bots and the targeted computer. After
becoming a bot, the infected machine starts to 1) log into the command and control server and 2) create a protected session
that parses and executes the topics in the channel. These two steps are repeated periodically and require authentication. Before
the botmaster authorizes certain malicious activities, such as Distributed Denial of Service (DDoS), it usually sends an
update command to the C&C server, which in turn contacts the bots to give the botmaster updated status feedback on the
botnet. These Internet flows, especially the periodic log/listen sessions, are of great interest for botnet detection, since
a passive intrusion detection system (IDS) aims to recognize the suspicious patterns and disrupt the botnet before the
actual attacks take place.
4) Malicious Activities: The aforementioned definition of a botnet indicates that botnets are mostly used for criminally
motivated activities, including Distributed Denial of Service, Click Fraudulence, Spamming, and Identity Theft.
a) Distributed Denial of Service (DDoS): The earliest use of botnets was to launch DDoS attacks, which cause
a loss of service to users. Because a botnet often contains thousands of bots, the botmaster can direct all the online bots
to flood a specific web server/system with packets. These packets consume the bandwidth of the victim network, overload
the computational resources of the victim system, or even congest general Internet traffic and cause widespread
damage [3]. Most implementations of DDoS attacks are categorized as TCP SYN or UDP flood attacks. A TCP SYN
flood attack sends an overwhelming number of SYN messages to a receiver. The TCP specification requires the receiver
to allocate a chunk of memory for a certain time to maintain each TCP connection. Gradually, the harmful SYN messages
exhaust the receiver's buffer memory and make TCP-based services, such as Web traffic, FTP, Telnet, and
SMTP, unavailable. A UDP flood attack sends a large number of UDP packets to random ports (chargen, echo, daytime, etc.) of a server
to block other legitimate traffic to it. Other protocols have also been used in DDoS attacks, such as recursive HTTP [3]
and ICMP flood attacks.
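The memory exhaustion behind a TCP SYN flood can be illustrated with a toy model of the receiver's table of half-open connections. This is only a sketch under stated assumptions: the class name SynBacklog, the capacity of 4 entries, the 30 second timeout, and the example addresses are all hypothetical, and real TCP stacks use larger tables and additional defenses.

```python
from collections import OrderedDict


class SynBacklog:
    """Toy model of a receiver's half-open connection table (hypothetical)."""

    def __init__(self, capacity, timeout):
        self.capacity = capacity      # assumed maximum number of half-open entries
        self.timeout = timeout        # assumed seconds an entry waits for the final ACK
        self.pending = OrderedDict()  # (source, port) -> arrival time, oldest first

    def _expire(self, now):
        # Drop entries whose handshake was never completed within the timeout.
        while self.pending:
            _key, arrived = next(iter(self.pending.items()))
            if now - arrived < self.timeout:
                break
            self.pending.popitem(last=False)

    def on_syn(self, source, port, now):
        """Return True if the SYN gets a slot, False if the table is full."""
        self._expire(now)
        key = (source, port)
        if key in self.pending:
            return True
        if len(self.pending) >= self.capacity:
            return False
        self.pending[key] = now
        return True

    def on_ack(self, source, port):
        """Completing the handshake frees the half-open slot."""
        return self.pending.pop((source, port), None) is not None


def simulate_flood():
    backlog = SynBacklog(capacity=4, timeout=30.0)
    # Flooding sources never send the final ACK, so their entries linger.
    for i in range(4):
        backlog.on_syn(f"198.51.100.{i}", 1000 + i, now=0.0)
    refused = not backlog.on_syn("192.0.2.10", 5555, now=1.0)  # legitimate client
    recovered = backlog.on_syn("192.0.2.10", 5555, now=31.0)   # after the timeout
    return refused, recovered
```

In this model the legitimate client is refused while the table is full of entries that will never complete, and it is served again only after those entries time out, which is the loss of service described above.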
b) Click Fraudulence: Instead of attacking a web site simultaneously, bots can also be directed to automatically and
periodically access particular links to artificially increase the number of clicks or manipulate the outcomes of online polls. An
example is the abuse of Google's AdSense program, in which web sites, possibly set up by the botmaster, display
Google advertisements and are paid for the fraudulent clicks on those advertisements [3][46].
c) Spamming: Another nefarious task of the botnets is to spread junk emails, which is called spamming. In general, bots
could possibly open a SOCKS v4/v5 proxy or transmit their junk emails to an email spam proxy to avoid blacklists of static
spammers [34]. Certain botnets, such as Agobot, also harvest email addresses and download lists of email addresses shared
among bots. One special form of spam is phishing, which lures users to some disguised web sites and attempts to steal the
sensitive personal information.
d) Identity Theft: Besides the aforementioned phishing technique, there are other methods to retrieve valuable information
from users, such as traffic sniffing and keylogging. Bots can download small, specialized password grabbers, such as
pwdump2, to collect username and password data from the hosts. Alternatively, a bot can apply tools such as Cain and Abel to
masquerade as the gateway of a subnet and gather passwords from other computers. Interestingly, bots may also
collaborate: for example, one bot can harvest encrypted password data, reformat it into a UNIX-like password file,
and send it to a presumably faster bot to crack the passwords using software like L0phtCrack. In addition to these
brute-force password cracking tools, key logger programs, which capture keystroke sequences near certain keywords such as
paypal.com, can also be used to steal confidential data [3].
Overall, botnets are a hybrid of previous Internet threats with the defining characteristic of C&C channel usage. They can
propagate like worms, hide from detection like many viruses, and use attack methods like those of published toolkits [13].
B. Botnet History and Trends
Internet Relay Chat (IRC) was invented in August of 1988 by Jarkko Oikarinen of the University of Oulu, Finland [35]. This
protocol provides a platform for data dissemination among a large number of end users by supporting multiple forms of
communication (point-to-point, point-to-multipoint, etc.) [37]. As the IRC protocol developed, administering busy channels,
such as handling tedious 24-hours-a-day requests from users, became time consuming. The bot, or robot, was then created as a
benign assistant for IRC channel management. In 1989, Greg Lindahl, an IRC server operator, created the benevolent bot called
GM, which would play a game of Hunt the Wumpus with IRC users. Starting from this simple example, bots evolved
from code that helps a single user to code that manages and runs IRC operations on the local host, as well as code that
provides services for other users. Bots gradually developed into comprehensive tools that operate as IRC
channel operators; for example, Eggdrop was written in 1993 to assist channel operators. In time, IRC bots with more nefarious
purposes emerged when some IRC servers and bots began offering access to an OS shell, which permits
users to run commands on the IRC host. By the late 1990s, large numbers of trojan-infected computers were being grouped
together and remotely controlled by a botmaster connected to an IRC server. Version 2.1 of the SubSeven Trojan, released in
June 1999, included the typical malicious functions (such as stealing passwords, logging keystrokes, and hiding its identity)
and provided a significant new feature that permitted the SubSeven server to be remotely controlled via an IRC channel. This
link between trojan server and IRC channels set the stage for all malicious botnets to come. In 2005, over a four-month span
of botnet research conducted by the Honeynet Project, over a million computers were observed as members of botnets [3].
For over a decade, IRC-based botnets were predominant. However, as botnet detection
has escalated, botnets have also evolved. In terms of protocol, more and more botnets have started to use HTTP and Fast Flux
networks based on DNS servers; in terms of topology, the traditional single-server centralized structure has given way to more
sophisticated structures, such as a group of interlinked IRC servers or a hybrid P2P system.
In 2009, Conficker, arguably the most influential and sophisticated botnet, appeared [8]. Conficker uses
DNS as the C&C protocol and a P2P C&C structure. In the following sections, the predominant IRC botnets are examined
with examples of some of the most popular IRC-based botnets; botnet technologies based on other protocols are mentioned; and finally,
the evolution of the botnet C&C channel structure is presented.
II. IRC-BASED BOTNETS
IRC provides a common protocol that is widely deployed across the Internet for activities among large numbers of
machines, such as remote control and data distribution [35]. A large number of existing IRC networks lack
strong authentication, and a number of tools that provide anonymity on IRC servers are available. Also, IRC has a simple
text-based command syntax, which makes it easy to extend with custom functionality. These features have made IRC the
most suitable choice for a botmaster because it provides a simple, low-latency, widely available, and anonymous command
and control channel for botnet communication [13]. In this section, four of the most commonly used IRC-based botnets are
introduced: Agobot, SDBot, Spybot, and GT Bot.
A. Agobot
Agobot, named after its creator Ago, was first released in C++ in 2002 [3]. Because of its cross-platform capability, modular
functionality, and publicly available source code, there are now thousands of variants of Agobot, and this number
is steadily increasing [4]. The modularity principle is used throughout the design of the bot. Unlike other botnets,
which commonly infect the target machine all at once, Agobot corrupts the target system with three modules, each of which retrieves
the next module after completing its primary tasks. First, Agobot infects the computer with the bot client and opens a backdoor
to allow the attacker to communicate with and control the machine; second, Agobot attempts to shut down processes associated
with antivirus and security systems; finally, Agobot tries to block access from the infected computer to a variety of antivirus and
security-related web sites by altering the DNS entries of these sites to point to the compromised local host. Furthermore, Agobot
includes commands and functions for fortifying the local system against other malicious attacks, such as closing NetBIOS
shares and RPC-DCOM, and for detecting debuggers (e.g., SoftICE, OllyDbg, and procdump) and virtual machines (e.g., VMWare and
Virtual PC). Along with these functions, Agobot also has an elaborate set of malicious attack commands, which
offer multiple types of DDoS attacks (UDP, TCP SYN, HTTP, PHAT SYN, PHAT ICMP, PHATwonk, and targa3 floods),
the ability to steal sensitive information by using libpcap and Perl Compatible Regular Expressions (PCRE) to
sniff and sort traffic, and multiple scanning methods (Bagle, Dcom, MyDoom, Dameware, NetBIOS, Radmin, and MS-SQL
scanners). Most variants of Agobot use standard IRC for C&C channel communications. However, one branch, referred to
as Phatbot, uses a distributed and organized WASTE chat network as the C&C communication protocol. WASTE is a P2P
protocol designed by AOL to use encryption for more secure file transfers via P2P. Using WASTE has the advantage of
avoiding botnet disruption by an IRC channel shutdown, but it also limits the scalability of the bot army because WASTE
can only manage 50 to 100 client nodes at a time. Overall, with its monolithic code architecture, creative modular principles,
and standard data structures and code documentation, Agobot is arguably the most sophisticated and best-written
of all the existing botnet source codes.
B. SDBot
SDBot was originally written in C and released by a Russian programmer known as sd [3]. The standard compact package of
SDBot source code behaves more like a benign tool that provides a utilitarian IRC-based command and control system.
The only malicious capabilities included in the original package are UDP and ICMP DDoS attacks. Public collaboration
and evolution have generated a large number of patches that add specific malicious capabilities such as scanning, DDoS
attacks, sniffers, and information harvesting routines. Similar to Agobot, SDBot includes some typical exploits targeting specific
vulnerabilities. The most active ones are brute-force password guessing attacks at port 139 (NetBIOS sharing service),
port 445 (Crypt32.dll), and port 1433 (MSSQL) [4]. Once the hacker gains complete access to a compromised system, the
Remote Access Trojan (RAT) component of SDBot connects to an IRC server and lies silently waiting for instructions from
the botmaster. This code structure, a standard core package extended with customized patches, has made SDBot
arguably the most active and popular botnet. As of August 2004, SDBot was reported to have more than 4,000 variants.
In June 2006, a Microsoft report about the Malicious Software Removal Tool listed SDBot as having been detected on
678,000 infected machines.
C. Spybot
Spybot, a derivative of SDBot, first emerged in 2003. Like SDBot, the Spybot code is open source and available for the
public to modify and extend with further functionality [3]. The main difference between SDBot and Spybot is that
Spybot was originally designed solely for malicious purposes [4]. First, Spybot adds a number of spyware-like capabilities
such as keystroke logging and email address harvesting. Second, Spybot includes some features to broadcast Spam over Instant
Messaging (SPIM) and to modify the registries to prevent installation of Windows XP SP2 or to disable the Windows XP
Security Center. This difference makes Spybot more efficient in some aspects of its malicious activities than SDBot, and it is
the main reason that Spybot has evolved into another botnet family as influential as SDBot.
D. GT Bot
GT Bot, which can be traced back to as early as 1998, is an abbreviation for Global Threat and is the common name used
for all mIRC-scripted bot code. mIRC is an IRC client software package with two important characteristics for botnet
construction [4]: it can run scripts in response to events on the IRC server, and it supports raw TCP and UDP socket connections
for remote control and access. GT Bot also includes a characteristic HideWindow program which keeps the bot hidden on the
local system [3]. GT Bot can be easily modified to suit a specific malicious purpose. However, the extensibility of this botnet
is quite limited. Based on this fact, it appears that different versions have been generated for specific malicious intent, instead
of a general comprehensive package that provides an elaborate set of malicious capabilities.
III. FURTHER BOTNET DEVELOPMENTS
The proliferation of botnets has drawn more and more attention from the Internet security community. Multiple studies of
the botnet phenomenon have been conducted, and many botnet detection and disruption techniques have been invented. To
keep botnets operating against increasingly capable Internet security tools, botnet technologies have also
advanced. For example, one of the earliest and simplest botnet detection techniques is to set up a signature matching system that
monitors and inspects all the live traffic going through known IRC ports (e.g., TCP port 6667). Once known botnet
commands have been matched in the payload, the operator is able to identify the corresponding IRC channel and shut
it down to disable the whole botnet. To avoid this disruption, botnets have adopted technologies operating on non-standard ports.
More protocols have been tried as remote control mechanisms. For example, File Transfer Protocol (FTP) has been
used as the C&C channel for botnets such as Dumador and Haxdoor, which perform keylogging to steal sensitive information
[28]. These botnets sniff communications of the compromised machine and present the user with fake web sites locally when
the user visits HTTPS (encrypted) web sites, in order to steal the user's credentials. Once the credentials have been retrieved, the
FTP C&C channel (also called the drop zone) feeds them directly to the botmaster. There are also some HTTP-based botnets
in the wild. One example is the spam bot module in the Rustock rootkit, which uses encrypted HTTP
for its C&C mechanism [10]. The use of encrypted HTTP makes detection and deobfuscation more difficult. Aside from
the ordinary spam bot functionality, this spam module can also be extended with other nefarious functionality. BlackEnergy is
another typical HTTP-based bot, which is used solely for DDoS attacks. Finally, a click fraud bot, Clickbot.A, also used
HTTP for its C&C channel [15]. In general, HTTP-based botnets have encrypted C&C channels, which are often obscured with
Base64. Also, it is easier for HTTP-based botnets to pass through firewalls than for IRC-based botnets.
A. P2P Bot
Besides changes in the protocols that carry the C&C mechanism, the structures of C&C channels have also evolved. The
commonly used botnet families in the previous section are not only IRC-based but also share a centralized C&C structure,
which is characterized by a central point that forwards messages among clients [13]. From the perspective of a botmaster, the
centralized C&C structure has become the fundamental weak point [47]. First, shutting down a limited number of C&C servers
could disable the entire botnet. Second, C&C servers can be easily detected based on the incoming traffic from a large
number of bots, or simply by tracing back from a single captured bot. Third, once a C&C server has been captured
or hijacked, the entire botnet is exposed. To overcome these major weaknesses inherent to the centralized
architecture, a peer-to-peer (P2P) framework is a natural improvement. In a P2P architecture, bots communicate with other peer
bots rather than a central server. These peer bots act as both clients and servers such that there is no centralized coordination
point that can be incapacitated. Because there is no central server, the botmaster cannot directly control all the bots.
Instead, a set of commands is defined in the P2P system. When the botmaster attempts to launch an attack, it publishes one of
the predefined commands on the P2P system, and all the bots subscribed to the set are able to execute this command.
In the last several years, botnets such as Slapper [2], Sinit [42], Nugache [43], and Conficker [8] have implemented multiple
forms of P2P control architectures. Along with the inherent structure and process of traditional P2P systems, each botnet has
its own advanced design and weaknesses. Sinit uses public key cryptography for update authentication and random probing for
communication with other Sinit bots. The extensive probing traffic makes the resulting botnets easy to detect and poorly
connected. Slapper builds a list of known bots for each infected computer during propagation, removing the bootstrap
process that defenders can easily exploit to shut down a botnet. However, the lack of encryption and
command authentication makes Slapper vulnerable to hijacking by others. Nugache, on the other hand, has implemented
an encrypted/obfuscated C&C channel. However, its reliance on a seed list of 22 IP addresses during its bootstrap process
has also made Nugache an easy target for detection. Conficker encrypts its C&C channel with the most sophisticated
algorithms, and its list of possible C&C server domain names/IP addresses numbers around 5000 and is updated on a daily basis.
Compared with a centralized system, a P2P communication system is much harder to disrupt. However, P2P systems are
more complicated, and there are typically no guarantees on message delivery or latency.
B. Fast Flux Service Network
A new technology that uses the Domain Name System (DNS) protocol within C&C communications, referred to as the
Fast Flux Service Network (FFSN), has emerged in recent years. In general, two techniques are used with DNS to map
domain names to IP addresses: Round Robin DNS (RRDNS) [9] and content distribution networks (CDN) [16]. In response
to a DNS request, RRDNS returns a list of DNS A records (i.e., hostname to IP address mappings). The DNS server
then cycles through this list and returns the records in a round robin fashion. Every A record also has a Time To Live (TTL) for
the mapping, specifying the number of seconds the response remains valid. The typical TTL for RRDNS has been recommended
to be 1 to 5 days, according to RFC 1912 [5]. Instead of multiple A records, a CDN applies sophisticated techniques based on
network topology and current link characteristics to find the edge server nearest to the requesting client and
returns the IP addresses that belong to the network of this server. The typical TTL of an A record for a CDN is significantly
lower than that for RRDNS, because the CDN needs to react promptly to changes in link characteristics.
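The round robin behavior described above can be sketched in a few lines. This is a simplified illustration rather than an implementation of any real DNS server: the class name RoundRobinZone, the example addresses, and the one-day TTL are assumptions chosen only to stay within the 1 to 5 day range mentioned in the text.

```python
from collections import deque


class RoundRobinZone:
    """Toy model of RRDNS: one hostname mapped to several A records (hypothetical)."""

    def __init__(self, addresses, ttl):
        self.records = deque(addresses)  # current ordering of the A records
        self.ttl = ttl                   # seconds each returned mapping stays valid

    def answer(self):
        """Return all A records as (address, ttl) pairs, then rotate the list."""
        response = [(address, self.ttl) for address in self.records]
        self.records.rotate(-1)  # the next query sees a different first record
        return response


zone = RoundRobinZone(["192.0.2.1", "192.0.2.2", "192.0.2.3"], ttl=86400)
first_choices = [zone.answer()[0][0] for _ in range(4)]
```

Since many clients simply use the first record in a response, rotating the list on every query spreads requests across all the listed hosts.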
With the help of the mapping techniques mentioned above, an FFSN can be constructed as a distributed proxy network - consisting
of compromised machines (flux agents) - which routes traffic to the controlling element (control node/mothership)
with the characteristic short TTLs and multiple A record assignments of RRDNS. An FFSN acts like a super botnet
constructed from multiple sub-botnets. A few examples of FFSNs have already been detected in the wild, such as the spam
email domain thearmynext.info found in July 2007 [24]. Applying the same rotation to the NS records (authoritative name
servers for the domain) gives the FFSN one more layer of protection. This type of FFSN, referred to as a double flux FFSN,
has been implemented as a robust phishing botnet, which created a bogus web site called login.mylspacee.com to harvest
Myspace user authentication credentials [38]. As FFSNs gain popularity, other botnets also take advantage of this technique.
Besides using DNS as the carrier of the C&C mechanism, botmasters also use FFSNs to host malicious content. The P2P bot Storm Worm, one
of the most prevalent botnets, uses fast flux domains to host the actual bot binary [25]. Moreover, beyond regular DNS
services, other services such as HTTP, SMTP, POP, and IMAP can be delivered via an FFSN, because fast flux techniques use
blind TCP and UDP redirects, which are suitable for any directional service protocol with a single port. Finally, even
conventional IRC-based botnets have used Dynamic DNS to frequently alternate between several IP addresses of
IRC servers. Overall, botnets gradually use more protocols for specific malicious attacks and adopt more decentralized
C&C structures. To prepare for the next generation of botnets, a few advanced designs and models
have also been proposed for defensive purposes. An advanced hybrid P2P botnet architecture uses multiple classes
of bots, with a characteristic class of servant bots that behave as both clients and servers. This hybrid P2P architecture
provides robust network connectivity, individualized encryption and control traffic dispersion, limited botnet exposure by each
captured bot, and easy monitoring and recovery by its botmaster [47].
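The fast flux signature described earlier in this section, short TTLs combined with many different A records for one domain, suggests a simple screening heuristic. The sketch below is an assumption-laden illustration, not a published detector: the function name and both thresholds (a 300 second TTL ceiling and 10 distinct addresses) are hypothetical, and because CDNs also use low TTLs, a rule like this would need further features to separate them from FFSNs.

```python
def looks_like_fast_flux(observations, max_ttl=300, min_distinct_ips=10):
    """Screen one domain from repeated DNS lookups.

    observations: list of (ttl_seconds, [ip, ...]) tuples, one per lookup.
    Both thresholds are illustrative assumptions, not values from the text.
    """
    if not observations:
        return False
    distinct_ips = set()
    for _ttl, addresses in observations:
        distinct_ips.update(addresses)
    all_short_lived = all(ttl <= max_ttl for ttl, _addresses in observations)
    return all_short_lived and len(distinct_ips) >= min_distinct_ips
```

A domain whose three lookups each return five previously unseen addresses with a 180 second TTL would be flagged, while a domain that keeps returning the same two addresses with a one-day TTL would not.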
IV. BOTNET DETECTION
Along with the prevalence of botnet-related nefarious activities, an increasing number of botnet detection and tracking
techniques have been developed in recent years. These methods fall into two approaches: honeynet-based
methods and methods based on passive traffic monitoring.
A. Honeynet-based Methods
The general structure of a honeynet-based method consists of a honeypot and a honeywall [3]. A honeypot is an end host
that is very vulnerable to malicious attacks and is often successfully compromised in a very short time span. A honeywall
is software, such as Snort, that is used to monitor, collect, control, and modify the traffic through the honeypot.
The Honeynet Project used unpatched versions of Windows 2000 or Windows XP systems as honeypots, and snort inline as the
honeywall device, to track botnets on a daily basis (i.e., the honeynet was rebuilt every 24 hours). This project
also listed a set of suggestions on how to write a useful botnet tracking IRC client. First, the client should have SOCKS
v4 and multi-server support. Second, some useful packages, such as lbadns, libcurl, and Perl Compatible Regular Expressions
(PCRE), should be included in the client. Finally, modularity and certain design choices, such as no threading, should be
considered throughout the design of the client. A similar honeynet has been constructed [13], which consists of three
vulnerable machines and a transparent proxy device (FreeBSD bridge). This project demonstrates some key features of the
honeywall/proxy device. First, the honeywall element should be able to capture and inspect all traffic payloads to retrieve
botnet information such as the DNS/IP address of the C&C server, the corresponding port number, and the authentication
data needed to join the C&C channel. Second, the honeywall element should be capable of isolating the honeypots from other machines
in the local network by blocking outgoing connections containing suspicious keywords linked to possible malicious activities.
These projects offer only a single vantage point on botnet activities, thus missing a substantial portion
of botnet spreading behaviors. To capture the comprehensive actions of botnets, Rajab et al. [37] constructed
a multifaceted and distributed measurement infrastructure by combining a modified version of the nepenthes platform with
honeynets. Although a honeynet is a powerful tool for understanding botnet technology and characteristics and for tracking botnet
behaviors, it is not very effective in botnet disruption. Also, the increasing use of anti-detection techniques in botnets, along
with propagation techniques that tend towards social engineering, has made it more and more challenging to emulate
a bot and more resource consuming to set up a honeynet system.
B. Passive Traffic Monitoring
Instead of purposefully setting up a honeynet to attract and collect botnet data, another approach is to set up vantage points
that passively monitor real Internet traffic and detect or extract botnet-related packets. Based on different types of
Internet traffic data, such as DNS data, BGP route views, Netflow data, and proprietary enterprise data, and on the complexity
and response time requirements, many Intrusion Detection System (IDS) designs have been proposed. These techniques can
be classified as behavior-based, DNS-based, and data-mining-based, as described and summarized in the following
sections.
1) Behavior-based Detection: Behavior-based detection methods can be further categorized as signature-based and
anomaly-based.
a) Signature-based Detection: Knowledge of useful signatures of existing and captured botnets has provided great
guidance in botnet detection. First, a library of specific botnet commands and function names can be compiled and
included in the proposed IDS. Once the IDS finds matching keywords while inspecting the payload content, it can trigger an
alert and take further action against the botnet. For example, Snort [39] is an open source IDS that monitors network traffic
to find signs of intrusion by searching for matches against a predefined set of rules and signatures. A major weakness of
signature-based detection is that it can detect only known botnets.
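A minimal sketch of payload signature matching, and of its blind spot for unknown commands, is given below. The signature names and patterns are invented for illustration and are not taken from any real botnet or from Snort's rule set; a production IDS would use its own rule language rather than this hypothetical dictionary of regular expressions.

```python
import re

# Hypothetical signature library: the names and patterns are made up for
# illustration and do not correspond to a real botnet command set.
SIGNATURES = {
    "hypothetical-ddos-command": re.compile(rb"\.ddos\.(syn|udp)\b", re.IGNORECASE),
    "hypothetical-scan-command": re.compile(rb"\.scan\.start\b", re.IGNORECASE),
}


def match_signatures(payload: bytes):
    """Return the sorted names of all signatures found in a packet payload."""
    return sorted(
        name for name, pattern in SIGNATURES.items() if pattern.search(payload)
    )


known = match_signatures(b"PRIVMSG #chan :.ddos.syn 192.0.2.1 80 60\r\n")
unknown = match_signatures(b"PRIVMSG #chan :.newflood 192.0.2.1 80 60\r\n")
```

Here the payload carrying a command that is in the library raises an alert, while a command that is not in the library passes unnoticed, which is the weakness noted above.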
b) Anomaly-based Detection: Unlike normal Internet traffic, botnets often generate a high volume of traffic that
may cause high network latency, as well as traffic on unusual ports. These network traffic anomalies, along with other unique botnet
behaviors, have been used for botnet detection. Binkley and Singh [7] proposed an effective TCP-based anomaly detection
technique that uses IRC tokenization and IRC message statistics to detect botnet clients and reveal botnet servers. First, this
anomaly-based system implements an IRC parsing component to collect information on TCP packets and to identify IRC channels.
Next, the traffic of these IRC channels is correlated over a large set of sampled data in search of scanning activity. Finally,
the IRC channels with a high scanning count are marked as possible botnet channels. Akiyama et al. [1] proposed
a measurement based on three metrics to detect abnormal botnet behaviors, under the assumption that bots from the same
botnet show regularities in relationship, response, and synchronization. Gu et al. [20] proposed a botnet detection system
(Bothunter) that recognizes the bot infection phase by running a correlation algorithm with the help of a user-defined bot
infection life cycle model. Although Bothunter is an IDS independent of C&C protocol and structure, its performance depends heavily
on how accurately the predefined infection dialog model is estimated. From the same authors, Botsniffer [21] was
developed as an anomaly-based algorithm designed to detect botnet C&C channels in a local area network using the
observation that bots within the same botnet demonstrate strong synchronization in their responses and activities (e.g.,
sending spam, scanning, and binary downloading). This algorithm does not require prior knowledge of a botnet and has low
false positive and false negative rates.
2) DNS-based Detection: DNS-based detection is a hybrid of behavior-based and data-mining-based techniques performed
on DNS traffic. The significant robustness and dramatic potential threat of FFSN make it necessary to emphasize detection
algorithms that operate on DNS traffic. For a botmaster to maintain and hide its bots, DNS queries are used in multiple
botnet stages, such as the rallying process after infection, malicious attack initiation, and C&C server updates. Two
weaknesses distinguish botnet DNS queries from legitimate DNS queries. The first weakness is that queries to C&C servers,
often in the form of DDNS, come only from botnet members. In 2005, Dagon [14] proposed a mechanism to identify
the domain names of C&C servers by their abnormally high or temporally concentrated DDNS query rates. However, this
technique can be easily evaded by using faked DNS queries, and it generates many false positives by misclassifying
legitimate and popular domain names that use DNS with short TTLs. An improved approach was proposed in 2006
that additionally uses NXDOMAIN reply rates [40]. This algorithm is based on the observation that abnormally
recurring name error (NXDOMAIN) responses to DDNS queries are mostly due to shutdowns of the C&C servers. Compared
with the previous method, this method is more effective in revealing suspicious domain names and generates fewer false positives
because NXDOMAIN replies are more likely to refer to DDNS names than to other names. The second weakness is that bots usually
generate highly correlated DNS queries. In 2007, Choi et al. [11] proposed a botnet detection mechanism that monitors group
activities, which often consist of DNS queries sent simultaneously by a large number of distributed bots. This method is
more robust than the aforementioned two and is independent of botnet type. Furthermore, it can also detect botnets with encrypted
channels since it uses only information in IP headers. The main drawback of this approach is the high processing time required for
detailed monitoring of network traffic at huge scale.
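As a rough illustration of the query-rate and NXDOMAIN-rate idea, the sketch below flags domains that are queried often and mostly answered with name errors. The log format, the function name suspicious_domains, and both thresholds are assumptions made for this example and do not come from [14] or [40].

```python
from collections import Counter


def suspicious_domains(dns_log, min_queries=100, min_nx_ratio=0.5):
    """Flag domains with many queries and a high share of NXDOMAIN replies.

    dns_log is an iterable of (domain, rcode) pairs, where rcode is the DNS
    response code as a string such as "NOERROR" or "NXDOMAIN".
    Both thresholds are illustrative assumptions.
    """
    total = Counter()
    nx = Counter()
    for domain, rcode in dns_log:
        total[domain] += 1
        if rcode == "NXDOMAIN":
            nx[domain] += 1
    return sorted(d for d, n in total.items()
                  if n >= min_queries and nx[d] / n >= min_nx_ratio)
```

A popular legitimate domain passes the query-count threshold but not the NXDOMAIN ratio, which is the intuition behind the lower false positive rate of the second method.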
3) Data-mining based Detection: Although abnormal DNS traffic has been successfully distinguished from legitimate
DNS traffic, recognizing or detecting botnet C&C communication patterns remains one of the most challenging tasks in IDS design. In
fact, since botnets utilize regular protocols for C&C communications, the traffic is similar to regular traffic. Moreover,
C&C traffic is not high volume and does not cause high network latency. With the continuous evolution of botnets, the
previous behavior-based detection algorithms are not useful for identifying C&C traffic. Several data mining techniques, including
data classification and clustering, have been explored to distinguish botnet C&C traffic. Goebel and Holz [18] introduced Rishi, a
mining-based system, in 2007. Rishi constructs its data set by collecting IRC server nicknames and port numbers, and implements an
n-gram analysis and a scoring system to detect bots that use uncommon communication channels and have evaded detection
by other conventional IDSs. However, Rishi can easily be misled by disguised nicknames and cannot detect encrypted
communication or non-IRC botnets. Mazzariello [32] provides another IRC botnet classification algorithm to differentiate
human IRC traffic from automated IRC traffic in IRC log files. This algorithm applies Support Vector Machine (SVM) and
J48 decision tree classifiers to a data set with the features listed in Table I. Although the experimental results indicate
almost perfect separation of botnet C&C traffic from normal traffic, the classification process of this algorithm depends
on predefined IRC models, which limits its effectiveness across different types of botnets.
TABLE I: A List of Features used in the IRC botnet classification [32].

Feature Name                                        Feature Detail
User Number                                         Total number of users in the IRC channel
Average Words Number                                Average number of unique words in a sequence
Average/Variance of Channel Dictionary Cardinality  Mean and variance of the vocabulary's cardinality
Unusual Nickname                                    Nickname rarely seen among all existing nicknames
Equal Answers                                       Number of sentences with a common ordered subset of words
Control Command Number                              Count of control commands issued
Join Number                                         JOIN rate in the IRC channel
SetMode Number                                      SetMode rate in the IRC channel
Nickname Changes                                    Count of nickname changes in an IRC channel
Ping Number                                         Ping rate in the IRC channel
IRC Commands Number                                 Overall IRC command rate
Active User Number                                  Number of users active in the IRC channel
TABLE II: A List of Features used in the IRC botnet classification [30].

Feature Name        Feature Detail
start/end           Flow start/end times
IP-proto            IP protocol of flow
TCP flags           Summary of TCP SYN/FIN/ACK flags
pkts                Total packets exchanged in flow
Bytes               Total bytes exchanged in flow
pushed pkts         Total packets pushed in flow
duration            Flow duration
maxwin              Maximum initial congestion window
role                Whether client or server initiated flow
bpp                 Average bytes per packet for flow
bps                 Average bits per second for flow
pps                 Average packets per second for flow
PctPktsPushed       Percentage of packets pushed in flow
PCTBppHistBin0-7    Percent of packets in one of the eight packet size bins; these variables
                    collectively form a histogram of packet size for flow
varIAT              Variance of packet inter-arrival time for flow
varBpp              Variance of bytes per packet for flow
The Internetwork research department at BBN Technologies has also proposed a machine learning technique for IRC botnet
detection that uses flow-level statistical characteristics. A network flow in the proposed system is defined
as a group of packets with identical IP protocol, IP source and destination addresses, and source and destination port
numbers within a predefined time interval. This system applies multiple classification algorithms (J48 decision tree,
naive Bayes, and Bayesian network) to a data set containing the features listed in Table II. The reported results
show successful classification of IRC-based C&C traffic.
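The flow definition above can be illustrated with a short sketch that groups packets by 5-tuple and time window. The packet tuple layout, the function name group_into_flows, and the use of fixed, non-overlapping windows of 60 seconds are assumptions for illustration, not the BBN implementation.

```python
from collections import defaultdict


def group_into_flows(packets, interval=60.0):
    """Group packets into flows keyed by 5-tuple and time window.

    Each packet is a tuple
    (timestamp, proto, src_ip, src_port, dst_ip, dst_port, size).
    Packets that share a 5-tuple and fall into the same fixed window of
    `interval` seconds belong to the same flow (a simplifying assumption).
    """
    flows = defaultdict(list)
    for ts, proto, src_ip, src_port, dst_ip, dst_port, size in packets:
        window = int(ts // interval)
        flows[(proto, src_ip, src_port, dst_ip, dst_port, window)].append((ts, size))
    return dict(flows)
```

Per-flow statistics such as pkts, Bytes, bpp, and varIAT in Table II could then be computed from the (timestamp, size) list stored for each key.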
Masud et al. [31] proposed a robust and effective flow-based botnet traffic detection method that considers the correlation
between multiple log files. This method does not require access to payload content, does not impose
any restriction on the C&C protocol, and is effective even if C&C channels are encrypted. One common characteristic of all the
aforementioned detection schemes is that they implement classification algorithms, which require a well-defined training
set to achieve good performance. This dependence on the training set restricts these methods to detecting only
captured/known botnets. A recent approach, Botminer [19], addresses this limitation by using an unsupervised X-means
clustering algorithm, which does not require any training data. After the initial clustering, further correlation is
performed to identify botnet C&C traffic. Botminer is an advanced botnet detection system that is independent of botnet
protocol and structure and requires neither botnet signatures nor a training set, so it is able to detect real-world botnets,
including centralized IRC- and HTTP-based botnets and distributed P2P-based botnets, with a very low false positive rate.
Overall, there is no universal botnet detection method that achieves high performance on all evaluation criteria, such as
response time, accuracy, and resource consumption. The previous sections show that botnets are diversifying
in protocol and structure, the number of botnet variants is steadily increasing, and botnet communication
is increasingly encrypted and disguised. Recent detection approaches that apply clustering algorithms to
netflow data without payload content suggest that, in order to counter the escalation of botnet development, botnet
detection techniques need to be independent of protocol, structure, and payload content. Although time-related features
have been used extensively in both classification and clustering mechanisms, the question of how best to mix a botnet
trace with normal traffic in a properly timed manner has never drawn enough attention. Even in the most recent Botminer
system, botnet traces with possibly international routing times, which are usually several times larger than intra-country
routing times, have been directly mixed with campus-wide background traffic. Given the use of features such as the number
of flows per hour and the average bytes per second, this neglect of timing calls into question the performance
claimed for Botminer. In the next section, we describe how simulated botnet traces have been mixed with real-world
Internet traffic using round trip time (RTT) adjustment, and how clustering algorithms have been applied to the resulting
salted trace.
V. EXPERIMENT SETUP AND RESULTS
The previous section stated that clustering-based detection on flow-level Internet traffic, without
packet content inspection, has shown promising resilience to the rapid escalation of botnet development. In this section, some
preliminary experiments, inheriting the traffic filtering ideas and mining-based algorithms from previous approaches, are
conducted with the novel introduction of RTT adjustment. One of the referenced approaches was introduced by Karasaridis et al. in
2007 [27]. It collects a specific type of netflow data, the candidate controller conversation (CCC), which is a conversation
between a suspected bot and a remote host that satisfies certain criteria consistent with control traffic, and applies a
hierarchical scoring system to distinguish CCCs from normal traffic. This approach demonstrated the effectiveness of
the hierarchical algorithm and the discriminating power of the control traffic. A second referenced method is the extension of
the aforementioned data-mining-based method proposed by the Internetwork research department at BBN Technologies [44].
In this approach, a botnet testbed consisting of an IRC server and 10 bots was constructed. Reverse-engineered Kaiten
botnet source code [23], which is also used in our experiment, was used to generate the simulated botnet traffic. After
the botnet traffic is mixed with the normal background trace, and before the classification stage, a filtering stage is
introduced. In this filtering stage, a first filter selects TCP-based flows; a second filter removes port scanning traffic; a third
filter eliminates flows with a high bit rate; a fourth filter excludes flows containing packets larger than 300
bytes; and a fifth filter rejects all short flows (fewer than 2 packets or shorter than 60 seconds). The filtering design,
although time consuming, provides the idea of extracting useful flows with proper configurations. In our experiment, a filtering
stage has also been developed to extract suitable RTT information. Our approach mixes the simulated botnet traces with normal
Internet traffic by unifying the RTT extracted from real candidate traffic after filtering. Hierarchical and K-means clustering
algorithms are then applied to distinguish the botnet C&C traffic.
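The five filtering stages of the referenced approach [44] can be pictured as a chain of predicates over flow records. The following is only a sketch: the record layout, the is_scan flag, and the max_bps threshold are hypothetical, while the 300-byte, 2-packet, and 60-second limits follow the description above.

```python
def keep_flow(flow, max_bps=8000.0, max_pkt_size=300, min_pkts=2, min_duration=60.0):
    """Return True if a flow record survives all five filtering stages.

    `flow` is a dict with keys: proto, is_scan, bps, max_pkt_size, pkts, duration.
    The `is_scan` flag and the `max_bps` value are assumptions made for this sketch.
    """
    if flow["proto"] != "TCP":                # filter 1: keep TCP-based flows only
        return False
    if flow["is_scan"]:                       # filter 2: drop port-scanning traffic
        return False
    if flow["bps"] > max_bps:                 # filter 3: drop high-bit-rate flows
        return False
    if flow["max_pkt_size"] > max_pkt_size:   # filter 4: drop flows with large packets
        return False
    if flow["pkts"] < min_pkts or flow["duration"] < min_duration:
        return False                          # filter 5: drop short flows
    return True
```

Ordering the cheap checks first (protocol, scan flag) keeps the cost per flow low, which matters given that the filtering design is described as time consuming.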
A. Botnet Trace
A testbed consisting of one botmaster and three bots was constructed using VMware by A. Tangpong [45]. The Kaiten botnet
source code was run for one hour to generate the C&C traffic as follows: after a bot starts, it initiates a
connection to the botmaster and sends a NICK IRC message to convey that the client is online with a certain ID. After
receiving the corresponding ACK message from the botmaster, the bots idle for 20 minutes waiting for commands before
they reinitiate connections to the botmaster. The original IP addresses assigned in this botnet are 192.168.158.134 (botmaster),
192.168.158.131, 192.168.158.133, and 192.168.158.135, and the port number used is 6668. Overall, 6 botnet flows were
generated and captured with Wireshark [12].
B. Background Internet Traffic
The background traffic used in our experiment was captured by an internal Lawrence Berkeley National Laboratory (LBNL)
router from 16:43 to 17:43 on December 15, 2004 [36]. It has 6,591,383 packets and 2,662 unique flows (i.e., groups
of packets sharing the same IP addresses, port numbers, and protocol). In preparation for the salting process, a filtering stage
has been designed to extract RTTs from candidate IP addresses. First, an IP fan-out filter selects IP addresses connected to at
least 4 other IP addresses (3 bots and 1 normal host). Second, in order to calculate the RTTs, the IP addresses left from the
previous filter must have bidirectional TCP-based flows. In practice, the filter is written as
tcp.flags == 0x02 || tcp.flags == 0x12 || (tcp.flags == 0x10 && tcp.seq == 1 && tcp.ack == 1),
which selects SYN packets, SYN-ACK packets, and ACK packets whose relative sequence and acknowledgement numbers are both 1.
Third, in order to calculate the RTT as accurately as possible, the
candidate IP addresses must have TCP-based connections with at least 3 IP addresses sharing the same 28-bit prefix (i.e.,
the same subnet). Finally, RTTs are computed as the sum of Δ1 and Δ2, as shown in Figure 2. There are 73 candidate
IP addresses that qualify after the filtering stage. One example candidate IP address, shown in Figure 3, is 148.19.5.188,
which has TCP-based connections with 131.243.92.207, 131.243.92.148, 131.243.94.62, and 131.243.95.50.
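The fan-out and same-subnet conditions can be sketched with the standard ipaddress module. The input format (one pair of IP strings per bidirectional TCP conversation) and the function name candidate_ips are assumptions; the thresholds of 4 peers, 3 peers per subnet, and a 28-bit prefix are the ones stated above.

```python
import ipaddress
from collections import defaultdict


def candidate_ips(connections, min_fanout=4, min_same_subnet=3, prefix_len=28):
    """Select IPs with enough peers overall and enough peers inside one subnet.

    `connections` is an iterable of (ip_a, ip_b) string pairs, one per
    bidirectional TCP conversation (an assumed input format).
    """
    peers = defaultdict(set)
    for a, b in connections:
        peers[a].add(b)
        peers[b].add(a)

    selected = []
    for ip, others in peers.items():
        if len(others) < min_fanout:          # fan-out filter
            continue
        per_subnet = defaultdict(int)
        for other in others:
            # strict=False masks the host bits, giving the enclosing /28 network
            net = ipaddress.ip_network(f"{other}/{prefix_len}", strict=False)
            per_subnet[net] += 1
        if max(per_subnet.values()) >= min_same_subnet:
            selected.append(ip)
    return sorted(selected)
```

The addresses in the test below are made up for illustration and are unrelated to the LBNL trace.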
Fig. 2: RTT Calculation Demonstration.
Source Port  Source IP       Destination IP  Destination Port  RTT         Slave to Master  Master to Slave
49403        131.243.92.207  148.19.5.188    110               0.00349900  0.00302100       0.00047800
49405        131.243.92.207  148.19.5.188    110               0.00347900  0.00299072       0.00048828
48903        131.243.92.148  148.19.5.188    22                0.00329590  0.00305176       0.00024414
49407        131.243.92.207  148.19.5.188    110               0.00366211  0.00305176       0.00061035
55340        131.243.95.50   148.19.5.188    22                0.00378418  0.00329590       0.00048828
4842         131.243.94.62   148.19.5.188    22                0.00561523  0.00329590       0.00231934
49409        131.243.92.207  148.19.5.188    110               0.00354004  0.00305176       0.00048828
4843         131.243.94.62   148.19.5.188    22                0.00451660  0.00292969       0.00158691
4844         131.243.94.62   148.19.5.188    22                0.00476074  0.00292969       0.00183105
49411        131.243.92.207  148.19.5.188    110               0.00341797  0.00292969       0.00048828
49412        131.243.92.207  148.19.5.188    110               0.00366211  0.00317383       0.00048828
49414        131.243.92.207  148.19.5.188    110               0.00341797  0.00292969       0.00048828
Average Values:                                                0.00388757  0.00305428       0.00083329
Variance:                                                      0.00070615  0.00013427       0.00067481
Fig. 3: RTTs of 12 Flows in One Subset.
C. Salting the Background Trace with Botnet Trace

Similar to one of the design ideas used in TCPopera [26], the timestamp of an acknowledgement packet depends
on that of the corresponding data packet, and the timestamp of the next data packet depends on that of the previous
acknowledgement packet. In order to attune the RTT of the botnet trace to the background candidate traffic, the modification
procedure, shown in Figure 4, is:
1. The initial timestamp stays the same, w1 = t1;
2. The timestamp of the first acknowledgement packet, w2, is changed to w2 = w1 + Δ1, where Δ1 has been calculated
following Figure 2;
3. The timestamp of the second data packet, w3, is modified to w3 = w2 + Δ2, where Δ2 is also computed from Figure 2;
4. The rest of the timestamps are calculated correspondingly following the above two cases.
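The four steps above can be sketched as a simple re-timing loop. This is a simplified reading that assumes a strictly alternating data/acknowledgement exchange; the function name retime and the parameter names delta1 and delta2 (the two delay terms computed following Figure 2) are chosen for this example.

```python
def retime(timestamps, delta1, delta2):
    """Rewrite packet timestamps so that consecutive gaps alternate delta1, delta2.

    Assumes packets strictly alternate data, acknowledgement, data, ...
    (a simplification; idle periods in a real trace would need separate handling).
    """
    if not timestamps:
        return []
    new_times = [timestamps[0]]               # step 1: w1 = t1
    for k in range(1, len(timestamps)):
        # odd positions are acknowledgements (step 2), even positions are data (step 3)
        gap = delta1 if k % 2 == 1 else delta2
        new_times.append(new_times[-1] + gap)
    return new_times
```

With this reading, the sum of any two consecutive gaps equals the target RTT extracted from the background candidate traffic.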
Fig. 4: RTT Modification Algorithm.

Finally, the values of the timestamps inside each packet are modified. The timestamps option in the TCP header consists
of two 32-bit fields: the Timestamp Value (TSval) and the Timestamp Echo Reply (TSecr). When the
timestamps option is activated, TSval records the current value of the sender's timestamp clock. The increment of the
timestamp clock is usually proportional to the real time increment. Here, an increment of 1 in the TSval field corresponds to
a 1 millisecond increment in real time.
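To keep the in-packet TSval consistent with a shifted capture timestamp, the shift can be converted into clock ticks. The sketch below assumes the 1 millisecond tick stated above and wraps the result to the 32-bit field width; the function name shift_tsval is hypothetical.

```python
def shift_tsval(tsval, shift_seconds, tick_seconds=0.001):
    """Shift a TCP TSval by a wall-clock offset, wrapping at 32 bits.

    tick_seconds is the real-time duration of one TSval increment
    (1 ms here, following the text; other hosts may use a different rate).
    """
    ticks = round(shift_seconds / tick_seconds)
    return (tsval + ticks) % 2**32
```

The matching TSecr field in the reply packet would need the same adjustment so that the echoed value still agrees with the TSval it acknowledges.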
Fig. 5: One Packet Modification Example.
Fig. 6: One Packet Modification Example.

There are two ways to modify the pcap files. The first method is to convert the pcap file into a text file that includes all the
packet byte details, and then modify the corresponding field values in it. After all the modifications are finished,
text2pcap.exe is used to convert this text document back into a pcap file, as demonstrated in Figure 5 and Figure 6.
The other method is to use the netdude [29] software to modify the pcap file directly. In Figure 7, the timestamps field value
is about to be changed. Under the IPv4 and TCP tabs, the IP addresses and port numbers can be changed correspondingly.
Fig. 7: Netdude Modification Example.
D. Experiment Results
1) Data Set: The data set used in the clustering stage consists of the LBNL trace mentioned above and the six 5-tuple botnet
flows. Since Internet applications are identified by their destination port numbers, 4-tuple flows (with the same source
IP, destination IP, destination port number, and protocol) have been extracted from the LBNL background trace. Overall, there
are 8803 flows in the data set. In Hierarchical and K-means clustering, a total of 16 features, as shown in Table III,
have been used for each flow, which means that the data set is an 8803 × 16 matrix.
TABLE III: A List of Features used in the Experiment.

Feature Name    Feature Detail
AvgSize C       average size of IP payload sent by client
AvgSize S       average size of IP payload sent by server
AvgSize C/S     the ratio of AvgSize C over AvgSize S
VarSize C       variance of size of IP payload sent by client
VarSize S       variance of size of IP payload sent by server
VarSize C/S     the ratio of VarSize C over VarSize S
SizeHomo C/S    the ratio of SizeHomo C over SizeHomo S
SizeHomo S/C    the ratio of SizeHomo S over SizeHomo C
AvgDiffSize C   average of absolute difference in IP payload size of two consecutive packets sent by client
AvgDiffSize S   average of absolute difference in IP payload size of two consecutive packets sent by server
AvgIntv C       average time difference of two consecutive packets sent by client
AvgIntv S       average time difference of two consecutive packets sent by server
VarIntv C       variance of time difference of two consecutive packets sent by client
IntvHomo C      the ratio of MaxIntv C over MinIntv S
IntvHomo S      the ratio of MaxIntv S over MinIntv C
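Several of the per-direction features in Table III can be computed from a time-ordered list of packets. The sketch below is illustrative: the input format, the function name direction_features, and the use of the population variance are assumptions, since the text does not say which variance estimator was used.

```python
from statistics import mean, pvariance


def direction_features(packets):
    """Compute size and timing statistics for packets sent in one direction.

    `packets` is a non-empty, time-ordered list of (timestamp, payload_size)
    pairs for either the client or the server side of a flow.
    """
    sizes = [size for _, size in packets]
    times = [ts for ts, _ in packets]
    size_diffs = [abs(b - a) for a, b in zip(sizes, sizes[1:])]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "AvgSize": mean(sizes),
        "VarSize": pvariance(sizes),          # population variance (assumed)
        "AvgDiffSize": mean(size_diffs) if size_diffs else 0.0,
        "AvgIntv": mean(gaps) if gaps else 0.0,
        "VarIntv": pvariance(gaps) if gaps else 0.0,
    }
```

Calling the function once on the client packets and once on the server packets gives the C and S variants, from which ratio features such as AvgSize C/S follow directly.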
2) Hierarchical Clustering: Hierarchical clustering partitions the data set iteratively using agglomerative or divisive
techniques [33][17]. The agglomerative (bottom-up) technique starts with as many clusters as data points and repeatedly combines
the most similar clusters into a single cluster. The divisive (top-down) technique starts with a single cluster containing all the
data points and splits off the most dissimilar data points as a new cluster in each iteration. The workflow of the Hierarchical
clustering algorithm implemented in this experiment is described below:
1. Initiate n clusters for n data points. Each cluster has its center initialized with the 1 × 16 array of feature values of
the corresponding data point.
2. Compute the Euclidean distance (Eq 1) between all clusters. Each cluster stores the label of its nearest neighbor after this
step.
$$D_E(i) = \sqrt{\sum_{j=1}^{16} (X_{ij} - C_{ij})^2} \qquad (1)$$
3. Merge the two most similar clusters (i.e., those with the overall smallest Euclidean distance) into a new cluster. The center
of the new cluster is the size-weighted average of the centers of the two merged clusters (Eq 2). The Euclidean distances
to all the other clusters are then recomputed to find the updated nearest neighbors.
$$C_{new}(i) = \frac{n_{old1} C_{old1}(i) + n_{old2} C_{old2}(i)}{n_{old1} + n_{old2}}, \quad i = 1, \ldots, 16 \qquad (2)$$
4. Repeat from step 2 until the specified number of clusters is reached.
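The steps above can be sketched as a naive centroid-based agglomerative procedure. This is an illustrative reading rather than the program used in the experiment: it recomputes all pairwise distances in every round instead of caching nearest neighbors, and the function names are chosen for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(points, target_clusters):
    """Merge the two closest clusters until `target_clusters` remain.

    Returns a list of (centroid, member_indices) pairs.
    """
    clusters = [(list(p), [i]) for i, p in enumerate(points)]  # step 1
    while len(clusters) > max(target_clusters, 1):
        best = None
        for a in range(len(clusters)):                         # step 2
            for b in range(a + 1, len(clusters)):
                d = euclidean(clusters[a][0], clusters[b][0])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        (ca, ma), (cb, mb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        # step 3: size-weighted average of the two merged centers
        merged = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        del clusters[b]                                        # b > a, so delete b first
        del clusters[a]
        clusters.append((merged, ma + mb))
    return clusters
```

The exhaustive pair search makes this sketch cubic in the number of flows, so it is only practical for small examples; storing each cluster's nearest neighbor, as in step 2, avoids most of the repeated work.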

Since the botnet flows are almost identical statistically with respect to the selected features, they are clustered
together at an early stage of hierarchical processing. With the final number of clusters set to 420 (the
optimal number of clusters for K-means clustering, which is derived in the following section), 44 other flows are
combined with the 6 botnet flows in cluster number 26. However, out of these 44 flows, only one flow is TCP-based, with a
destination port number of 111 (Open Network Computing Remote Procedure Call [41]). With another boolean feature that
filters out non-TCP flows, this result suggests that Hierarchical clustering should be able to distinguish the botnet traces
from the legitimate background trace. Overall, it takes 1237 seconds to run the program over the flow-level data set.
3) K-means Clustering: K-means is another common and easy-to-implement partitioning algorithm. Its workflow is given as
follows [22]:
1. All data points are randomly divided into a predefined number of clusters. For each cluster, the centroid is computed as
the average of the feature arrays of all the data points in this cluster. The labels of the partitioned data points are also
stored in the lists of the corresponding clusters.
2. Every data point is reassigned to the cluster whose centroid has the smallest Euclidean distance to it.
3. The centroids and member lists of the clusters are updated.
4. Repeat from step 2 until a stopping condition is satisfied.
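A minimal sketch of these four steps is given below. The stopping condition (no label changes, or a maximum number of iterations), the handling of empty clusters, and the optional init_labels argument are assumptions added for this example.

```python
import math
import random


def kmeans(points, k, init_labels=None, max_iter=100, seed=0):
    """Plain K-means starting from a random (or supplied) partition.

    Returns (labels, centroids). Stops when no label changes or after max_iter rounds.
    """
    rng = random.Random(seed)
    # Step 1: initial partition of the data points into k clusters.
    labels = (list(init_labels) if init_labels is not None
              else [rng.randrange(k) for _ in points])
    centroids = [None] * k
    for _ in range(max_iter):
        # Steps 1 and 3: centroid = mean of the member points of each cluster.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
            elif centroids[c] is None:
                centroids[c] = list(rng.choice(points))  # seed an empty cluster
        # Step 2: reassign every point to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:                          # step 4: stop when stable
            break
        labels = new_labels
    return labels, centroids
```

Because the result depends on the initial partition, repeated runs with different seeds may give different clusterings; passing init_labels makes a run reproducible.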
One key issue for K-means clustering is how to find the number of clusters that minimizes a specific objective. In
our experiment, the objective function is the total distortion, which is the sum of squares of the Euclidean distances
between all the data points and the centroids of the clusters they belong to. To find the minimum distortion, the data set has
been randomly split into a training set (7000 flows) and a testing set (1803 flows) 5 times, for each assumed
number of clusters. After the positions of the centroids have been computed using the training set, the corresponding
distortion for the testing set is also calculated. Based on the distortion trend plot in Figure 8, while the distortion of the
training set is monotonically decreasing, that of the testing set seems to reach its optimal point at 420 clusters, which is
the number of clusters chosen for further K-means clustering analysis.
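The distortion objective and the selection of the number of clusters on held-out data can be sketched as follows. The helper names distortion and best_k are hypothetical, and centroids_by_k is assumed to hold centroids already fitted on the training set for each candidate number of clusters.

```python
import math


def distortion(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)


def best_k(test_points, centroids_by_k):
    """Return the candidate k whose centroids give the lowest test-set distortion.

    centroids_by_k maps each candidate k to a list of centroids fitted on the
    training set (an assumed input format).
    """
    return min(centroids_by_k,
               key=lambda k: distortion(test_points, centroids_by_k[k]))
```

In the experiment the test distortion is additionally averaged over 5 random splits; that averaging is omitted here for brevity.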
Among the 420 clusters, all 6 botnet flows have been clustered into one cluster (cluster 212). Similar to the Hierarchical
result, there are 18 other flows in that cluster. Nevertheless, with the boolean feature that eliminates all non-TCP
flows, only two TCP flows running on port 111 are left as false positives. Since the objective function
of the K-means clustering algorithm in our experiment is to minimize the distortion over the entire traffic, other numbers of
clusters have also been tested for botnet cluster purity (i.e., a cluster containing only botnet flows). For example, when
the entire data set is partitioned into 512 clusters, all 6 botnet flows are clustered into cluster 202, which does
not include any other flows. This shows that K-means clustering can separate the botnet flows perfectly in this data set.
Also, the K-means algorithm takes much less time to complete than the Hierarchical
method: it takes 111 seconds to reach the steady state for 420 clusters, and 128 seconds for 512 clusters.
Fig. 8: Distortion vs. Number of clusters.
Furthermore, for feature reduction, a relative score (standard deviation/average value) could be computed within the
clusters. Basically, features with a small intra-cluster relative score (and possibly a large inter-cluster relative score)
should be considered the features with great discriminating power, as shown in Figure 9.
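The relative score (standard deviation divided by average value) can be computed per feature over the flows of one cluster, as in the sketch below. The function name relative_scores, the use of the population standard deviation, and the handling of a zero mean are assumptions.

```python
from statistics import mean, pstdev


def relative_scores(rows):
    """Per-feature relative score (standard deviation / mean) over `rows`.

    `rows` is a list of equal-length feature vectors, e.g. all flows in one cluster.
    A feature with zero mean gets an infinite score (an assumed convention).
    """
    scores = []
    for column in zip(*rows):
        avg = mean(column)
        scores.append(pstdev(column) / abs(avg) if avg != 0 else float("inf"))
    return scores
```

A feature whose score is small inside the botnet cluster but large across cluster centroids would be a candidate to keep when reducing the feature set.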
Overall, this preliminary experiment has shown the capability of Hierarchical and K-means clustering to detect botnet
flows and provides an RTT adjustment method for mixing the botnet trace with normal background Internet traffic.
VI. CONCLUSION
Since 1989, botnets have evolved from a benign assistant tool into the predominant threat on the modern Internet. Although the
number of bots per botnet seems to be decreasing, the monetary damage that botnets can cause is continuously increasing
given the growth of Internet bandwidth. Instead of using a centralized, IRC-based C&C channel to perform multiple
nefarious attacks, botnets have gradually developed into more complicated, stealthy, and modular packages
that perform particular malicious activities with diverse C&C protocols and structures. In order to counter the escalation of
botnet evolution, mining-based detection methods operating on flow-level Internet traffic have demonstrated some
promising performance. However, feature extraction from the raw data, the huge dimensionality of possible features, and the proper
mixing of botnet traces with background Internet traffic make it difficult to use this method as an online IDS. Even for
conventional IRC-based botnets, there are rarely invariant features among all the C&C traffic. Instead of designing a universal
IDS, particular solutions need to be developed for different circumstances such as Internet data type, response time, and
complexity.
Acknowledgements: We wish to thank Berkay Celik for his feedback on this manuscript.
REFERENCES
[1] M. Akiyama, T. Kawamoto, M. Shimamura, T. Yokoyama, Y. Kadobayashi, and S. Yamaguchi. A proposal of metrics for botnet detection based on its
cooperative behavior. In Applications and the Internet Workshops, 2007. SAINT Workshops 2007. International Symposium on, pages 82-82, 2007.
[2] I. Arce and E. Levy. An analysis of the slapper worm. IEEE Security & Privacy, 1(1):82-87, 2003.
[3] P. Bacher, T. Holz, M. Kotter, and G. Wicherski. Know your enemy: Tracking botnets. http://www.honeynet.org/papers/bots, 2005.
[4] P. Barford and V. Yegneswaran. An inside look at botnets. Malware Detection, pages 171-191, 2006.
[5] D. Barr. RFC 1912: Common DNS operational and configuration errors. http://www.ietf.org, Feb. 1996. Obsoletes RFC1537 [6]. Status:
INFORMATIONAL.
[6] P. Beertema. RFC 1537: Common DNS data file configuration errors. http://www.ietf.org, Oct. 1993. Obsoleted by RFC1912 [5]. Status:
INFORMATIONAL.
Fig. 9: SizeHomo C, SizeHomo S, and VarSize C/s are good choices.
[7] J. Binkley and S. Singh. An algorithm for anomaly-based botnet detection. In Proceedings of USENIX Steps to Reducing Unwanted Traffic on the
Internet Workshop (SRUTI), pages 43-48, 2006.
[8] M. Bowden. The enemy within. http://www.theatlantic.com/magazine/archive/2010/06/the-enemy-within/8098/, June 2010.
[9] T. Brisco. RFC 1794: DNS support for load balancing. http://www.ietf.org, Apr. 1995. Status: INFORMATIONAL.
[10] K. Chiang and L. Lloyd. A case study of the rustock rootkit and spam bot. In The First Workshop in Understanding Botnets, 2007.
[11] H. Choi, H. Lee, H. Lee, and H. Kim. Botnet detection by monitoring group activities in DNS traffic. In Proceedings of the 7th IEEE International
Conference on Computer and Information Technology, pages 715-720. IEEE Computer Society, 2007.
[12] G. Combs et al. Wireshark. http://www.wireshark.org, 2007.
[13] E. Cooke, F. Jahanian, and D. McPherson. The zombie roundup: Understanding, detecting, and disrupting botnets. In Proceedings of the USENIX SRUTI
Workshop, pages 39-44, 2005.
[14] D. Dagon. Botnet detection and response. In OARC Workshop, 2005, 2005.
[15] N. Daswani and M. Stoppelman. The anatomy of Clickbot. A. In Proceedings of the first conference on First Workshop on Hot Topics in Understanding
Botnets, page 11. USENIX Association, 2007.
[16] J. Dilley, B. Maggs, J. Parikh, H. Prokop, R. Sitaraman, and B. Weihl. Globally distributed content delivery. IEEE Internet Computing, pages 50-58,
2002.
[17] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience Publication, 2000.
[18] J. Goebel and T. Holz. Rishi: Identify bot contaminated hosts by IRC nickname evaluation. In USENIX Workshop on Hot Topics in Understanding Botnets
(HotBots '07), 2007.
[19] G. Gu, R. Perdisci, J. Zhang, W. Lee, et al. BotMiner: Clustering analysis of network traffic for protocol- and structure-independent botnet detection. In
Proceedings of the 17th USENIX Security Symposium (Security '08), 2008.
[20] G. Gu, P. Porras, V. Yegneswaran, M. Fong, and W. Lee. Bothunter: Detecting malware infection through IDS-driven dialog correlation. In Proceedings
of the 16th USENIX Security Symposium, pages 167-182, 2007.
[21] G. Gu, J. Zhang, and W. Lee. BotSniffer: Detecting botnet command and control channels in network traffic. In Proceedings of the 15th Annual Network
and Distributed System Security Symposium (NDSS '08). Citeseer, 2008.
[22] A. Hinneburg and D. Keim. An efficient approach to clustering in large multimedia databases with noise. Knowledge Discovery and Data Mining,
pages 58-65, 1998.
[23] T. Holz. A short visit to the bot zoo. IEEE Security and Privacy, 3:76-79, 2005.
[24] T. Holz, C. Gorecki, K. Rieck, and F. Freiling. Measuring and detecting fast-flux service networks. In Symposium on Network and Distributed System
Security. Citeseer, 2008.
[25] T. Holz, M. Steiner, F. Dahl, E. Biersack, and F. Freiling. Measurements and mitigation of peer-to-peer-based botnets: a case study on storm worm. In
Proceedings of the 1st USENIX Workshop on Large-Scale Exploits and Emergent Threats, pages 1-9. USENIX Association, 2008.
[26] S. Hong and S. Wu. On interactive Internet traffic replay. In Recent Advances in Intrusion Detection, pages 247-264. Springer, 2005.
[27] A. Karasaridis, B. Rexroad, and D. Hoeflin. Wide-scale botnet detection and characterization. In USENIX Workshop on Hot Topics in Understanding
Botnets (HotBots 07), 2007.
[28] M. Kola. Botnets: Overview and Case Study. PhD thesis, IBM Research, 2008.
[29] C. Kreibich. Design and implementation of netdude, a framework for packet trace manipulation. In Proc. USENIX/FREENIX, 2004.
[30] C. Livadas, R. Walsh, D. Lapsley, and W. Strayer. Using machine learning techniques to identify botnet traffic. In 2nd IEEE LCN Workshop on Network
Security (WoNS2006). Citeseer, 2006.
[31] M. Masud, T. Al-khateeb, L. Khan, B. Thuraisingham, and K. Hamlen. Flow-based identification of botnet traffic by mining multiple log files. In
Distributed Framework and Applications, First International Conference on, pages 200206, 2008.
[32] C. Mazzariello. IRC traffic analysis for botnet detection. In Information Assurance and Security, 2008. ISIAS'08. Fourth International Conference on,
pages 318–323, 2008.
[33] J. Navarro, C. Frenk, and S. White. Hierarchical Clustering. The Astrophysical Journal, 490:493–508, 1997.
[34] J. Nazario. Blackenergy DDoS bot analysis. Arbor, 2007.
[35] J. Oikarinen and D. Reed. RFC 1459: Internet Relay Chat Protocol. http://www.ietf.org, 1993.
[36] R. Pang, M. Allman, V. Paxson, and J. Lee. The devil and packet trace anonymization. ACM SIGCOMM Computer Communication Review, 36(1):38,
2006.
[37] A. Rajab, J. Zarfoss, F. Monrose, and A. Terzis. A multifaceted approach to understanding the botnet phenomenon. In Proceedings of the 6th ACM
SIGCOMM Conference on Internet Measurement, page 52. ACM, 2006.
[38] J. Riden. Know Your Enemy: Fast-flux Service Networks. http://www.honeynet.org/papers/ff, 2008.
[39] M. Roesch. Snort: Lightweight intrusion detection for networks. In Proceedings of the 13th USENIX Conference on System Administration, pages 229–238.
Seattle, Washington, 1999.
[40] A. Schonewille and D. van Helmond. The domain name service as an IDS. Research Project for the Master System and Network Engineering at the
University of Amsterdam, 2006.
[41] R. Srinivasan. RFC 1831: RPC: Remote procedure call protocol specification version 2. www.ietf.org, Aug. 1995. Status: PROPOSED STANDARD.
[42] J. Stewart. Sinit P2P trojan analysis. http://www.secureworks.com/research/threats/sinit, 2003.
[43] S. Stover, D. Dittrich, J. Hernandez, and S. Dietrich. Analysis of the Storm and Nugache Trojans: P2P is here. USENIX ;login:, 32(6):2007–12, 2007.
[44] W. Strayer, D. Lapsley, R. Walsh, and C. Livadas. Botnet detection based on network behavior. Botnet Detection, pages 1–24, 2006.
[45] A. Tangpong and G. Kesidis. A controlled environment for botnet traffic generation. http://www.cse.psu.edu/tangpong/botnet/, April 2009.
[46] R. Vogt, J. Aycock, and M. Jacobson. Army of botnets. In Proceedings of the 2007 Network and Distributed System Security Symposium (NDSS 2007),
pages 111–123. Citeseer, 2007.
[47] P. Wang, S. Sparks, and C. C. Zou. An advanced hybrid peer-to-peer botnet. In USENIX Workshop on Hot Topics in Understanding Botnets (HotBots'07),
2007.

