
2008 International Conference on Computer and Electrical Engineering

Static Detection of API-calling Behavior from Malicious Binary Executables∗

Wen Fu Jianmin Pang Rongcai Zhao Yichi Zhang Bo Wei


National Digital Switching System Engineering & Technology Research Center
Postbox 1001 No. 717, Zhengzhou, Henan 450002, China
rachelfu2008@gmail.com

∗ This work is supported by the National High-Tech Research and Development Plan of China under Grant No. 2006AA01Z408.

Abstract

The broad spread of malware in recent years has presented a serious threat to our world. Because a Windows API-calling sequence usually reflects the malicious behavior of a particular piece of code, more and more AV researchers detect malware through API-calling behavior analysis. However, malware writers use many techniques, such as obfuscation, to evade this type of detection. These techniques make the discovery of API-calling behavior more complex than before.

In this paper, we illustrate some methods that malware writers commonly use to obscure their API-calling behavior when they write malware in assembly language. We then propose a new approach, which is more universal for capturing API-calling behaviors on the Windows platform. This approach involves three databases and some special instruction patterns. Experimental results show that this approach is effective for extracting API-calling behaviors from malicious executables and their variants.

1. Introduction

Current antivirus (AV) tools primarily use signatures to detect known malware. A signature is typically created by disassembling existing malware samples and selecting some pieces of unique code. However, the rising number of variants of an original malware, and their effects, have increasingly shown that current commercial antivirus tools cannot deal with simple and slight modifications to the original malware. As a result, the signature-based detection approach can be easily subverted by changing the code in trivial ways, and it suffers from several drawbacks, including the need to update the signature set continuously and the inability to deal with simple obfuscation techniques [1].

Therefore, more and more researchers make great efforts to find the similarity between an original malware and its variants. One of the most important discoveries in this area is that, for one malware, the functional flows of its different variants are nearly unchanged. Because the original functionality is preserved, researchers can assume that the difference between a malware and its variant is not very large. Based on this assumption, many papers discuss how to detect malware by analyzing API-calling sequences [3, 6, 8], since API-calling behavior reflects most of the functionality of a malware.

To analyze API-calling sequences, the very first step is to extract the API-calling behavior from a suspicious binary executable. In fact, this is more troublesome than it seems at first. Because malware writers like to implement API calls in uncommon and indirect ways, it is difficult for static analysis to find the target API that a CALL instruction indirectly calls. Moreover, an uncommon implementation of API calling sometimes makes the disassembly process fail.

In this paper, we focus on how to statically mine the API-calling behavior from binary executables. Note that our approach is performed directly on Windows Portable Executable binary code. We do not attempt to recover API-calling behaviors from every kind of malicious code. Instead, our approach is effective under the following two assumptions:

Assumption 1. Malware needs to invoke API functions to achieve its malicious intention.

Assumption 2. The functionality of malware and its variants is preserved. That is, if a malware contains an API-calling sequence S, and a variant of this malware contains an API-calling sequence S’, there is only a slight difference between S and S’.

The remainder of this paper is structured as follows. Section Two presents some related work in the field of API-calling behavior analysis. Section Three analyzes some frequently used techniques by which a malware obtains addresses for all the APIs it has to call. Section Four proposes our system and goes into details of our approach
on how to extract API-calling behaviors from binary executables. Section Five presents experimental results and Section Six gives our conclusions and future work.

978-0-7695-3504-3/08 $25.00 © 2008 IEEE
DOI 10.1109/ICCEE.2008.53

2. Related work

The concept of detecting malware or attacks by analyzing sequences of system calls is not new in the field of host-based system security. It has long been used in Intrusion Detection Systems and Intrusion Prevention Systems, since it was first used by Forrest et al. [4, 5]. Nowadays, some researchers make use of API-calling sequences to determine whether a binary executable is malicious, and also to measure the similarity of a malware and its variants [3, 6, 8, 9].

    ...
    push dword ptr [FileHandle + X]
    call CreateFileMappingA
    ...
    Kernel32_API:
    CreateFileMappingA:
        db 0B8 ; mov eax, ?
        dd ?
        jmp eax
        db 'CreateFileMappingA',0
    CreateThread:
        ...

Figure 1. Part of source code for virus BOZANO.
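The stub layout in Figure 1 can be read byte by byte: one opcode byte (0xB8, `mov eax, imm32`), a four-byte address slot, the two-byte encoding of `jmp eax` (0xFF 0xE0), and then the NUL-terminated API name, which therefore starts seven bytes into the stub. The following is a minimal Python sketch of a scanner for that layout, not the implementation used in this paper. The function names `find_stubs` and `make_stub`, the three-entry `API_NAMES` set, and the 64-byte name limit are assumptions made for illustration.

```python
import struct

# Hypothetical, tiny stand-in for a database of API names.
API_NAMES = {"CreateFileMappingA", "CreateThread", "FindFirstFileA"}

MOV_EAX_IMM32 = 0xB8      # opcode byte of `mov eax, imm32`
JMP_EAX = b"\xFF\xE0"     # encoding of `jmp eax`
NAME_OFFSET = 7           # 1 opcode byte + 4 address bytes + 2 jmp bytes


def find_stubs(image, names=API_NAMES, max_name_len=64):
    """Return (offset, address_slot, api_name) for each stub-shaped match."""
    hits = []
    for off in range(len(image) - NAME_OFFSET):
        # The stub must start with mov eax, imm32 followed by jmp eax.
        if image[off] != MOV_EAX_IMM32 or image[off + 5:off + 7] != JMP_EAX:
            continue
        start = off + NAME_OFFSET
        end = image.find(b"\x00", start, start + max_name_len)
        if end == -1:
            continue
        try:
            name = image[start:end].decode("ascii")
        except UnicodeDecodeError:
            continue
        # Accept the match only if the trailing string is a known API name.
        if name in names:
            (slot,) = struct.unpack_from("<I", image, off + 1)
            hits.append((off, slot, name))
    return hits


def make_stub(name, address=0):
    """Build one stub with the Figure 1 layout (helper for experiments)."""
    return (bytes([MOV_EAX_IMM32]) + struct.pack("<I", address)
            + JMP_EAX + name.encode("ascii") + b"\x00")
```

Checking the trailing string against a name list is what keeps the two-instruction pattern from matching unrelated `mov`/`jmp` pairs; a real scanner would also need to restrict the search to plausible procedure starts.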
SAVE [6] is a static analyzer for vicious executables. It makes use of the Import Address Table (IAT) to recover API-calling sequences. However, malware usually avoids using the IAT. Therefore, using the IAT to obtain API-calling sequences is not always effective.

In the thesis [3], the author also studies malware detection through analysis of API-calling behavior under some assumptions. The author extracts API-calling sequences by reading the IAT, which is the same approach that SAVE uses. As explained above, this is not a reliable way.

In [8], the authors use the virtual machine environment VMware to detect and observe the behavior of computer viruses. In addition, they designed a tracing tool, APISPY.EXE, to extract API function calls. This tool can hook all API function calls on the Windows 2000 Server platform. Unfortunately, the tool is not available, so we cannot make a comparison.

In [9], the authors extract patterns for two programs based primarily on their system or library calls, then compare these patterns to determine how similar the programs are. To extract the patterns, they statically analyze the control and data flow of call traces [9]. However, as we now know, some malware implements its calls to APIs in other ways than with the CALL instruction. So tracing only the CALL instructions may not be enough.

3. API-calling methods used by malware

From the discussion above, we can conclude that it is necessary to find other characteristics by which to extract API calls from malicious binary executables. To address this problem, in this section we first introduce three methods that malware writers commonly use to invoke API functions.

3.1. Calling APIs by hard-coded addresses

Boza, known as W95/Boza.A, is the first Windows 95 virus [7]. It uses hard-coded addresses for all the APIs it has to call. In other words, it uses the addresses hard-coded for a particular implementation of KERNEL32.DLL in a beta version of Windows 95. For example, the procedure address of GetCurrentDirectoryA() is 0xBFF77744 in the English release of Windows 95. Boza calls GetCurrentDirectoryA() through this address.

This approach is the easiest but, fortunately, it is not very successful [7]. On a different version of Windows, the virus calls an incorrect address and fails to replicate. Because of that, Boza is incompatible with most Windows 95 releases and cannot be called a real Windows 95-compatible virus. Nevertheless, the method is still used by some kinds of malware to call API functions.

3.2. Defining homonymy functions

This is a common method for a malware to call APIs. Take Win32.Bozano as an example. Figure 1 shows part of the source code for this virus.

Bozano declares a homonymy function for each API it has to call. As can be seen in Figure 1, each homonymy function consists of three data items and one instruction. We can consider each homonymy function as a structure. The last part of this structure defines a string, which is the name of the corresponding API. During infection, Bozano first searches for the base address of KERNEL32.DLL. Then, for each API, Bozano fetches the API name at a fixed offset (seven bytes) from the beginning of the corresponding homonymy function. Next, it scans the memory image to find the relocation address of each API, using the API name and the base address of KERNEL32.DLL. After that, this relocation address is stored back to the
homonymy function, as the second data item of the structure.

When an API is called, Bozano first calls the homonymy function. In this function, two instructions are executed. The first two data items are executed as the single instruction mov eax, ?. The source operand of this instruction has been re-assigned to the relocation address of the API beforehand. Then the instruction jmp eax is executed. In this way, the homonymy function transfers control to the relevant API through an unconditional jump.

To sum up, Bozano calls an API through two instructions: one mov and one jmp. Notably, these two instructions are constructed in an unusual way, which can easily confuse a disassembler and so escape static detection. We consider this an obfuscation method for API-calling behavior.

3.3. Using arrays to achieve API-calling

    virus_start label byte
    aztec:
        ...
        lea edi,[ebp+@@Offsetz]
        lea esi,[ebp+@@Namez]
        ...
        push eax
        call [ebp+_FindFirstFileA]
        ...
    @@Namez label byte
    @FindFirstFileA db "FindFirstFileA",0
    @FindNextFileA db "FindNextFileA",0
        ...
    @@Offsetz label byte
    _FindFirstFileA dd 00000000h
    _FindNextFileA dd 00000000h
        ...

Figure 2. Part of source code for virus AZTEC.
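Figure 2 keeps the API names and the address slots in two parallel tables (@@Namez and @@Offsetz), so the position of a slot in the address table identifies the name at the same position in the name table. The following is a minimal Python sketch of that index correspondence, assuming four-byte slots and NUL-terminated ASCII names. The function names `parse_name_array` and `resolve_call` and the example addresses are hypothetical and are not taken from the paper's system.

```python
from typing import List, Optional


def parse_name_array(blob: bytes) -> List[str]:
    """Split a block of NUL-terminated ASCII strings into a list of names."""
    return [s.decode("ascii") for s in blob.split(b"\x00") if s]


def resolve_call(slot_addr: int, offsets_base: int, names: List[str],
                 slot_size: int = 4) -> Optional[str]:
    """Map the address-table slot used by an indirect CALL to an API name.

    slot_addr    -- address of the slot named in the CALL operand
    offsets_base -- start address of the address table
    """
    delta = slot_addr - offsets_base
    # Reject operands that fall before the table or between two slots.
    if delta < 0 or delta % slot_size:
        return None
    index = delta // slot_size
    # The name with the same index belongs to this slot.
    return names[index] if index < len(names) else None
```

For example, with the two names shown in Figure 2 and an address table assumed to start at 0x2000, the slot at 0x2004 is the second entry and resolves to FindNextFileA.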
This is another method commonly used by malware to call APIs. It is more popular than the methods mentioned above. In this method, arrays are usually used to store the names and relocation addresses of APIs. Win32.Aztec is one such malware. Figure 2 shows part of its source code.

Aztec defines two arrays for the relocation of APIs. One is a string array, which stores the names of the APIs called by Aztec. The other is an address array, which stores the relocation addresses of these APIs. Before relocation, all the items in the address array are zero.

During infection, Aztec first searches for the base address of KERNEL32.DLL. Then it fetches one API name from the string array, computes the relocation address of this API, and stores the address in the address array. This process continues until the addresses of all the APIs have been stored in the address array.

There are other ways for malware to call APIs through arrays, and their implementations of API calling differ from that of Aztec. However, no matter how many arrays are used, or how the calls to APIs are achieved, an array of API names always exists in the binary code. These strings are useful for recovering the target API functions that the malware invokes.

4. API-calling behavior detection

In Section Three, we introduced some methods commonly used by malware writers to call APIs. Many kinds of malware use these methods to implement their API-calling behavior. In order to recognize and analyze the malicious intention of a malware, we design an API-calling analysis system (AAS). In this system, we have to recognize some
special methods of calling APIs that do not use the CALL instruction, and recover the target API function names for API calls that are made in an indirect way.

Figure 3 shows the architecture of our API-calling analysis system. Our approach is structured in three major steps. In the rest of this section, we go into the details of these three steps.

4.1. Loading and disassembling

The first step is to load the PE executable into virtual memory and disassemble it. In this step, AAS recognizes some special methods of calling APIs, other than those that use the CALL instruction.

We set up a database to store some important instruction sequences. These instruction sequences are extracted from a large number of malware samples that use them to call APIs. Currently, if other instruction sequences are found to achieve the same goal, we add them to the instruction sequence database manually.

Take Bozano as an example. We regard the following instruction sequence as a signature of API-calling behavior and add it to the instruction sequence database:

    mov eax, 0;
    jmp eax

In addition, we have prepared in advance a database of name strings for API functions that are frequently used by malware. Once the disassembler has detected two instructions identical to those in this pattern at the very beginning of a procedure, it then examines whether
the next few bytes represent a meaningful string. If so, the disassembler compares this string with the API names in the API name database prepared beforehand. This determines whether the string is the name of an API function. If it is, the disassembler recognizes this sequence as equivalent to a call to the corresponding API. Finally, all the calls to this procedure are substituted with calls to this API function.

[Figure 3 diagram. Components, top to bottom: PE executable, Loader, Disassembler, Binary File Scanner, API-calling Analyzer, API Analysis Report. Databases: Instruction sequence DB, API name DB, API sequence DB.]

Figure 3. Architecture of API-calling analysis system (AAS).

4.2. Binary scanning

The second step is to scan the binary code for possible API names. From the description in Section Three, we can conclude that API name strings usually exist in malicious binary code. These strings may be organized together as an array, stored separately in some special structure, or placed in some related locations. This finding is encouraging for malware analysts, and it is the reason why we include a binary scanner here.

To improve the accuracy of the binary scanner, we prepare an API name database that stores the names of API functions frequently used by malware. The name strings are organized in alphabetical order. This database is referenced by the binary scanner. When a string is recognized, we look it up in the API name database. If the string exists in this database, we consider it to be an API that will be called by this binary executable. If not,
the scanner simply goes on with the next bytes.

4.3. API sequence analysis

So far, we have found some CALL instructions and a list of API name strings. However, we do not know which API a CALL instruction really invokes.

Therefore, the most important thing to do first in this step is to find the relationship between the target operands of CALL instructions and the API name strings found by binary scanning. According to the calling methods introduced in Section Three, we deal with the following two kinds of API calls separately:

For API calls by hard-coded addresses, we need to set up beforehand a two-dimensional table that stores these addresses and the corresponding API name strings. When such a case occurs, we locate the API name string by looking it up in this table. After that, all the CALL instructions whose targets are hard-coded addresses are substituted with instructions that carry the name strings of the APIs.

For malware that uses arrays to call APIs, we need to find the pieces of initialization code that the malware uses to relocate APIs. In fact, some special instructions can be extracted from this part of the code as a signature that indicates the relationship between the arrays.

For example, Win32.Aztec uses two arrays to relocate the API functions it invokes. One array stores the API name strings, and the other is used to store the addresses of these APIs after relocation. Before the relocation, it uses two consecutive LEA instructions to obtain the start addresses of these two arrays. Because malware frequently uses a loop to implement the relocation of APIs, it is easy to see that the items in the string array correspond one to one to the items in the address array.

Therefore, from the target operand of a CALL instruction, we can obtain the index into the address array. This index gives the relative location at which the relocation address of the API invoked by this CALL instruction can be found. Then we fetch the name string with the same index from the API name string array. This is exactly the name string of the API function we want.

For Win32.Aztec, then, the two consecutive LEA instructions should be recognized as the signature that helps to find the target API function that a CALL instruction really calls. In fact, many malicious executables use these instructions to relocate their APIs.

The next part of this step is to compare the API sequences we have obtained with the sequences prepared in the API sequence database. We select some malicious API sequences after analyzing a large number of malware samples and their variants, and put them into the API sequence database. In the implementation, we map each API name
with an assigned id. By using an integer representation, we avoid the costly operations of string comparison.

5. Experimental results

For our experiment, we downloaded some malicious code with indirect API calls from the VX Heavens website [2]. We analyzed these malware samples and their variants in our system.

Firstly, our system AAS can successfully recover the target API functions that the CALL instructions invoke. We compare our system with IDA Pro. None of the API calls was recognized and recovered by IDA Pro. The result is shown in Table 1.

Table 1. Recovery of API calls using IDA Pro and AAS (√ means recovery success, × means recovery failure)

    Malware         IDA Pro   AAS
    Win32.Aztec     ×         √
    Win32.Bozano    ×         √
    W95/Boza.A      ×         √
    Hortiga.4805    ×         √
    Hortiga.4800    ×         √
    Fosforo.a       ×         √
    Fosforo.b       ×         √
    Fosforo.c       ×         √
    Fosforo.d       ×         √
    Doser.4542      ×         √
    Doser.4540      ×         √
    Doser.4539.a    ×         √
    Doser.4539.b    ×         √
    Doser.4535      ×         √
    Doser.4188      ×         √
    Doser.4183      ×         √

Next, our system AAS successfully extracts API-calling behaviors from these binary executables and detects these malware samples and their variants. Table 2 shows the detection results using different AV tools. From these experiments, we can conclude that our approach is sound and efficient in detecting malware and its metamorphic versions.

Table 2. Malware detection using different AV tools (√ means success, × means failure)

    Malware Variants Rising ClamAV AAS
    Hortiga
    Hortiga.4805 × × √
    Hortiga.4800 × ×
    √ √ √
    Fosforo
    Fosforo.a √ √ √
    Fosforo.b √
    Fosforo.c × × √
    Fosforo.d × ×
    √ √
    Doser
    Doser.4542 × √
    Doser.4540 ×
    √ × √
    Doser.4539.a ×
    √ √
    Doser.4539.b ×
    √ √
    Doser.4535 √ × √
    Doser.4188 √ × √
    Doser.4183 ×

6. Conclusions and future work

This paper proposes an approach to extracting and analyzing API-calling behaviors from malicious binary executables for malware detection. This technique has been implemented as part of our system AAS. Experimental results show that our approach is effective at capturing API-calling behavior from malicious code.

However, efficiently comparing the similarity of a sequence obtained from analysis with the sequences stored in the API sequence DB is a part of AAS that is still inadequate. Techniques for this have been proposed by some researchers. We will continue this work in the future, and we believe that it will make our system more efficient and more practical.

References

[1] M. Christodorescu and S. Jha. Testing malware detectors. In Proceedings of the ACM SIGSOFT Symposium on Software Testing and Analysis (ISSTA’04), Boston, Massachusetts, USA, Jul. 2004.
[2] VX Heavens. http://vx.netlux.org.
[3] K. Rozinov. Efficient static analysis of executables for detecting malicious behaviors. Master’s thesis, Polytechnic University, Jun. 2005.
[4] S. Forrest, S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff. A sense of self for Unix processes. In Proceedings of the 1996 IEEE Symposium on Security and Privacy, pages 120–128, Washington, DC, USA, 1996. IEEE Computer Society.
[5] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.
[6] A. Sung, J. Xu, P. Chavez, and S. Mukkamala. Static analyzer of vicious executables (SAVE). In 20th Annual Computer Security Applications Conference, pages 326–334, Dec. 2004.
[7] P. Szor. The Art of Computer Virus Research and Defense. Symantec Press, USA, first edition, 2005.
[8] B. Zhang, J. Yin, J. Hao, D. Zhang, and S. Wang. Using support vector machine to detect unknown computer viruses. International Journal of Computational Intelligence Research, 2(1):100–104, 2006.
[9] Q. Zhang and D. S. Reeves. MetaAware: Identifying metamorphic malware. In Proceedings of ACSAC’07, pages 411–420, Florida, USA, Dec. 2007.
