
An Overview of Steganography

James Madison University Infosec Techreport Department of Computer Science


JMU-INFOSEC-TR-2007-002

Shawn D. Dickman

July 2007

Computer Forensics Term Paper, James Madison University
shawndickman@hotmail.com

ABSTRACT

Steganography is a useful tool that allows covert transmission of information over an overt communications channel. By combining covert channel exploitation with the encryption methods of substitution ciphers and/or one-time pad cryptography, steganography enables the user to transmit information masked inside a file in plain view. The hidden data is difficult to detect and, when combined with known encryption algorithms, equally difficult to decipher. This paper provides a general overview of the following subject areas: historical cases and examples of steganography; how steganography works; what steganography software is commercially available and what data types are supported; what methods and automated tools are available to aid computer forensic investigators and information security professionals in detecting the use of steganography; whether, after detection, the embedded message can be reliably extracted; whether the embedded data can be separated from the carrier to reveal the original file; and finally, what methods exist to defeat the use of steganography even if it cannot be reliably detected.

1. INTRODUCTION

Within the field of Computer Forensics, investigators should be aware that steganography can be an effective means of transferring concealed data inside seemingly innocuous carrier files. Knowing what software applications are commonly available and how they work gives forensic investigators a greater probability of detecting, recovering, and eventually denying access to the data that mischievous individuals and programs are openly concealing. Generally speaking, steganography brings science to the art of hiding information. The purpose of steganography is to convey a message inside a conduit of misrepresentation such that the existence of the message is hidden and the message is difficult to recover even when discovered. The word steganography comes from two roots in the Greek language: Stegos, meaning hidden, covered, or roof, and Graphia, simply meaning writing [15]. Similar in nature to the sleight of hand used in traditional magic, steganography uses the illusion of normality to mask the existence of covert activity. The illusion is manifested in a myriad of forms, including written documents, photographs, paintings, music, sounds, physical items, and even the human body. Two parts of the system are required to accomplish the objective: successfully masking the message, and keeping secret the key to its location and/or deciphering.

When categorized within one of the two fundamental security mechanisms of computer science (cryptographic protocols and maintaining control of the CPU's instruction pointer), steganography clearly fits within cryptography. It closely mirrors common cryptographic protocols in that the embedded information is revealed in much the same manner as with substitution or Bacon cipher mechanisms [22]. This paper will highlight some historical examples, discuss the basic principles of steganography showing how most instances work, identify software that can be used for this purpose, and finally provide an overview of current methods employed to detect and defeat it.

2. HISTORICAL EXAMPLES

Hiding messages by masking their existence is nothing new. Classical examples include a Roman general who shaved the head of a slave and tattooed a message on his scalp. When the slave's hair grew back, the general dispatched the slave to deliver the hidden message to its intended recipient [18]. Ancient Greeks wrote on tablets composed of wooden slabs covered with wax. A layer of melted wax was poured over the wood and allowed to harden. Hidden messages could be carved into the wood before the slab was covered. Once the wax was poured over the slab the message was concealed, and the recipient later revealed it by re-melting the wax and pouring it from the tablet [8].

From the 1st century through World War II, invisible inks were often used to conceal hidden messages. At first, the inks were organic substances that oxidized when heated, and the heat reaction revealed the hidden message. As time passed, compounds and substances were chosen based on desirable chemical reactions. When the recipient mixed the compounds used to write the invisible message with a reactive agent, the resulting chemical reaction revealed the hidden data. Today, some commonly used compounds are visible when placed under an ultraviolet light [19].

In another form, while Paris was under siege in 1870, messages were sent by carrier pigeon. A Parisian photographer used a microfilm technique to enable each pigeon to carry a higher volume of data. The miniaturization of information also served to deter detection and was a precursor to the invention of the microdot. A microdot is a document or photograph reduced in size until it is as small as a pencil dot (about the size of the period at the end of this sentence). Between World War I and II Germany used microdots for steganographic messaging purposes, and later many countries passed these microdot messages through insecure postal channels [24].

With any type of hidden communication, the security of the message often lies in the secrecy of its existence and/or the secrecy of how to decode it. Cryptography typically takes a worst-case approach and assumes that only one of these two conditions holds. Kerckhoffs' principle states that a cryptographic system's security should rely solely on the key material [14]. As an illustration, once knowledge of the reactive agents used to expose invisible inks became widely known after World War I, World War II Special Operations Executive agents were trained never to risk their lives through reliance on insecure inks, because most of them were of World War I vintage [19]. Given the historical examples above, it should be clear that if a steganographic system's key were to be discovered, the security of the system would be irrevocably broken. Simply shaving the head of everyone passing through a checkpoint, or melting the wax off any discovered tablets, reveals not only the existence of a hidden message but the message itself.

3. FUNCTIONAL OVERVIEW

Focusing the discussion on steganographic techniques used in digital media, traditional methods are employed to modify the data that defines the carrier or cover file. Modifications are made to achieve a desired pattern. The pattern used to modify the carrier defines a bit sequence that contains the hidden message or data. The basic principle of steganography is that modifications to the data in the cover file must have insignificant or no impact on the final presentation, meaning changes so minor in nature that the casual observer cannot tell that a hidden message is even present [12].

Every digital file is composed of a sequence of binary digits (0 or 1). It is a relatively simple task to modify the content of a file by changing a single bit in the sequence. Accomplishing the modification without changing the presentation or the final form of the file is a different task altogether. For example, the binary value of the decimal number 13 consists of 4 bits (1101); changing one bit in the sequence changes the decimal value of the number it represents and ultimately changes the meaning of the value (i.e. 1100 is the binary equivalent of the number 12, not 13). What is required for steganography is a data set represented by large numbers of bits per datum.

For illustration, an electric signal conducted on a wire can contain varying voltage levels over time. When using a single bit to sample the voltage level, we can only represent two states for any given time interval (off or on, 0 or 1). We cannot represent a specific value such as +3.3v unless the value happens to be a boundary condition (i.e. the high voltage of this signal is +3.3v). By adding bits to the representation of the measurement we can reproduce measurements between the boundary values. Two bits can define up to four states (0, 1.1, 2.2, and 3.3v for example), three can define eight, four bits define sixteen, and so on.

The level of precision in the measurement increases with the number of bits used in the binary representation of the voltage level: each additional bit doubles the number of levels that can be represented. The downside of using additional bits per datum is the impact on the size of the stored data that represents the measured waveform. When measurements are taken over time intervals, each additional bit per sample adds one bit for every sample recorded. Depending on the level of fidelity needed for the data representation, additional bits can eventually cease to contribute desirable or distinguishing information and capture only round-off-level differences (4.999999999999 vs. 4.999999999998). In essence, additional bits become unnecessary once the required accuracy of the waveform has been achieved. A trade-off between file size and sample accuracy is often made, and the bit depth (number of bits per sample) is chosen as an acceptable compromise. Selecting the number of bits that represents the information adequately in the smallest amount of storage space is a goal for many data formats.

Using the previous example of sampling voltage level over discrete time intervals, it is also possible to represent the waveform graphically in a voltage vs. time plot. With enough bits to provide fidelity to the measurement, a close approximation of the actual signal can be reproduced. In the following example, 8 bits will be chosen to represent a value between -5 and +5 volts, with the most significant bit determining the sign (+/-) of the measurement. The remaining seven bits provide 128 discrete values for the amplitude of the sampled voltage. Thus, each discrete step is approximately 0.04 volts. Voltage samples of the signal taken at 25,000 samples per second produce 25 Kilobytes (200 Kilobits) of data over a one second time interval. A plot of our hypothetical waveform is displayed in Figure 1. Six sample values (each represented by eight binary digits) are included below simply to illustrate that the binary data changes over the time interval.

[Figure 1: plot of Voltage vs. Time; voltage axis from 0 to 5, time axis from 0 to 1 sec]

01001010, 01001011, 01001100, 01001101, 01001110, 01001111

Figure 1 Example Carrier Wave Form and Binary Data Representing Six Distinct Points

By modifying the least significant bit of each sample, it is possible to embed information into the waveform without significant impact on the graphical representation of the data. In the next section the waveform above, and its associated binary data, will become the carrier or cover file for our steganographically embedded covert message.
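The 8-bit sampling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, a sign bit of 0 is assumed to mean a positive voltage, and magnitudes are truncated to the nearest lower step. The 2.9 V input is likewise an arbitrary value chosen for the example, not a measurement from the paper.

```python
FULL_SCALE_VOLTS = 5.0   # magnitude range covered by the 7 magnitude bits
MAGNITUDE_BITS = 7       # 8-bit sample = 1 sign bit + 7 magnitude bits
LEVELS = 2 ** MAGNITUDE_BITS          # 128 discrete magnitudes
STEP = FULL_SCALE_VOLTS / LEVELS      # 0.0390625 V, roughly 0.04 V


def quantize(voltage):
    """Encode a voltage in [-5, +5] as an 8-bit sign-magnitude sample."""
    sign = 1 if voltage < 0 else 0    # assumption: 0 = positive, 1 = negative
    magnitude = min(int(abs(voltage) / STEP), LEVELS - 1)
    return (sign << MAGNITUDE_BITS) | magnitude


def reconstruct(sample):
    """Decode an 8-bit sign-magnitude sample back to a voltage."""
    sign = -1.0 if sample >> MAGNITUDE_BITS else 1.0
    magnitude = sample & (LEVELS - 1)
    return sign * magnitude * STEP


sample = quantize(2.9)
print(format(sample, "08b"), reconstruct(sample))
# Flipping the least significant bit moves the voltage by exactly one step.
print(reconstruct(sample ^ 1) - reconstruct(sample))
```

Flipping the lowest bit of any sample changes the reconstructed voltage by a single step, which is the property the following section relies on.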

3.1 Modifying the Carrier

Using 7 bits to represent 5 volts of amplitude creates a relatively small division between values (about 0.04V). Modifying the least significant bit (LSB) of any datum can change its reproduced value by no more than that amount. Because the change is imperceptible, intentional modifications to the LSB of every sample may go unnoticed and allow data to be embedded into the bit sequence. Using sequential data points to carry our message, we can inject a 25,000 bit message into the LSBs for every second of data we have recorded. When viewing the waveform after modification, the difference in voltage at any given datum is imperceptible to the naked eye. To illustrate, consider the following bit stream:

01001010, 01001011, 01001100, 01001101, 01001110, 01001111, 01010000, 01010001

To inject the 8 bit message (11110000) into the data, we would modify the corresponding LSBs of the above bit stream to match our message. The resulting steganographic data stream would become

01001011, 01001011, 01001101, 01001101, 01001110, 01001110, 01010000, 01010000

where the LSB of each sample now carries one bit of the message (the first, third, sixth, and eighth samples differ from the original). Note that while the carrier data has changed, what is represented or displayed in the final form (i.e. the form delivered to the end user) has been modified only in an imperceptible manner. Figure 2 shows our example waveform embedded with the following ASCII message after conversion to binary: The truth shall set you free. The existence of the embedded message can only be seen in the blow-up of the first few samples of the reproduced waveform.
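The LSB replacement walked through above can be reproduced with a short Python sketch. The helper names embed_lsb and extract_lsb are hypothetical and chosen for this illustration; the carrier samples and the message 11110000 are the ones used in the text.

```python
def embed_lsb(samples, bits):
    """Overwrite the least significant bit of successive samples with message bits."""
    if len(bits) > len(samples):
        raise ValueError("message is longer than the carrier")
    stego = list(samples)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit   # clear the LSB, then set it to the message bit
    return stego


def extract_lsb(samples, n_bits):
    """Read the message back from the LSBs of the first n_bits samples."""
    return [s & 1 for s in samples[:n_bits]]


carrier = [int(b, 2) for b in
           "01001010 01001011 01001100 01001101 "
           "01001110 01001111 01010000 01010001".split()]
message = [int(c) for c in "11110000"]

stego = embed_lsb(carrier, message)
print(", ".join(format(s, "08b") for s in stego))
print(extract_lsb(stego, len(message)) == message)
```

No sample moves by more than one quantization step, and the receiver needs only the message length and the sample order to read the bits back.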

[Figure 2: plot of Voltage vs. Time; voltage axis from 0 to 5, time axis from 0 to 1 sec, with a blow-up of the injected data area]

Figure 2 Example Steganographically Injected Wave Form and Blow-up of the Injected Data Area

3.2 Typical Carrier Data Types

Any file that requires multiple bits to reasonably quantify its message, such that minor changes to the data are imperceptible when the file is presented in final form, is an acceptable candidate for a carrier. Digital data types fitting this description include image, video, and sound files; data can even be embedded in standard TCP/IP packet headers. The most common image formats include BMP, GIF, and JPEG. The majority of software applications designed for steganography utilize the JPEG image file format as the carrier.

3.2.1 Text & Printed Documents

Text documents of all types can contain embedded messages that are difficult if not impossible to locate. This paragraph contains a hidden message that can be decoded using a decoding key provided at the end. In the case of this paragraph, the decoding is performed by referencing a character in each line by its position and using the character's numeric location as the key to the hidden message. This type of data embedding is identical to one time pad cryptography where a key is used to extract the message from a stream of data. Steganography is not the encryption methodology, but rather the means by which to conceal the message.

The message contained in the paragraph above reads secret, and is decoded using the following key (14, 3, 21, 2, 2, 11); the letters in the paragraph above have a blue typeset for ease of location.

In some documents binary information can be stored by shifting the placement of letters slightly to represent a binary value. Although this is usually accomplished with a pictorial representation of the letter or the entire document, it is possible to embed the information in a Microsoft Office Word document such as this. Consider embedding the binary value of the ASCII letter T (01010100) into the word Singular. We can inject the binary string by varying the spacing between the letters to indicate a zero or a one. For comparison, a fixed or naturally spaced version of the word is displayed below the encoded version. Grey lines have been added to more easily identify the characters that have been shifted to represent a binary value of one. In the example below, all non-shifted characters (i.e. normally spaced and not touching the reference line) are assumed to represent a zero.

[Illustration: the word Singular with varied letter spacing against grey reference lines, shown above a normally spaced version of the word]

Note that the "i", the "g", and the "l" are touching grey lines, thus indicating a high state or the binary value one for that position. When pieced back together the values are as follows: S-0, i-1, n-0, g-1, u-0, l-1, a-0, r-0, or 01010100.

Other methods of encoding files include a stepped character approach (where the message is conveyed with embedded characters separated by a fixed number or constant step) and the addition or subtraction of white space and/or carriage returns at the end of every line. The stepped character method is more difficult to accomplish because producing indistinguishable carrier messages that mask the hidden content may require unnatural or awkward language [13]. Consider encoding secret words into a carrier sentence using a seven character stepping algorithm (again the characters are in blue typeset for clarity).

This is much easier curing roses, each petal had_too few dewdrop awards for sicknesses

Not only does the sentence not make sense (yielding context), but it is obvious that the word sequence is not a natural discussion or commonly spoken phrase. There are several bodies of research in the field of linguistic steganography. The primary focus of the field is automating the selection of synonyms for common words to embed data in writing in a way that eliminates the unnatural and/or awkward wording problems [4].
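Of the text methods mentioned above, the end-of-line white space approach is the simplest to sketch. The following Python example is a minimal illustration, not the scheme of any particular tool: it assumes one bit per line, with a trailing space standing for 0 and a trailing tab standing for 1, and the function names are hypothetical.

```python
def embed_trailing_whitespace(lines, bits):
    """Hide one bit per line: trailing space = 0, trailing tab = 1 (assumed convention)."""
    if len(bits) > len(lines):
        raise ValueError("not enough lines to carry the message")
    out = []
    for i, line in enumerate(lines):
        line = line.rstrip()                 # start from a clean line ending
        if i < len(bits):
            line += "\t" if bits[i] else " "
        out.append(line)
    return out


def extract_trailing_whitespace(lines):
    """Recover the bits from the trailing character of each marked line."""
    bits = []
    for line in lines:
        if line.endswith("\t"):
            bits.append(1)
        elif line.endswith(" "):
            bits.append(0)
    return bits


cover = ["Line %d of an ordinary looking note." % n for n in range(1, 9)]
letter_t = [int(c) for c in format(ord("T"), "08b")]   # 01010100

stego = embed_trailing_whitespace(cover, letter_t)
recovered = extract_trailing_whitespace(stego)
print(chr(int("".join(map(str, recovered)), 2)))
```

The visible text is unchanged, which is the attraction of the method, but any editor or mail gateway that strips trailing white space destroys the message.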

3.2.2 Still and Motion Image Files

When considering an image file, a direct correlation to the previous signal sampling example can be made. Images consist of pixels with contributions from primary colors (red, green, and blue) adding to the total color composition of the pixel. Most images are represented as triples (Red contribution, Green contribution, Blue contribution) [6]. Depending on the depth of color desired in the final image, each component is represented by a separate number of bits. In the case of a 24 bit bitmap, each color component has eight bits. Represented as decimal contributions for ease of reading, a value of (255, 0, 0) would describe a 100% red pixel. By mixing the contribution of each component a large palette of colors can be represented. Value mixtures such as (31, 187, 57) can result in a dark green while (255, 255, 0) represents pure yellow. When closely viewing any specific color, single digit modifications to the contribution level are imperceptible to the human eye (i.e. a pixel with a value of (255, 255, 0) is indistinguishable from (254, 255, 0)). Figure 3 illustrates the impact of modifying one bit in the red contribution for two yellow boxes.

[Figure 3: two yellow boxes labeled (255, 255, 0) and (254, 255, 0)]

Figure 3 Red Channel LSB Modification Example Color Change

The JPEG encoding algorithm sections the input image into 8 × 8 grids containing 64 pixels per grid. Within each grid a discrete cosine transform (DCT) coefficient for every color component in the pixel is calculated. Thus, the data in each grid is an 8 × 8 matrix of 64 DCT coefficients for each color. The formula used to calculate the DCT coefficient F(u, v) of an 8 × 8 grid of image pixels f(x, y) is:

$$F(u, v) = \frac{1}{4} C(u) C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x, y) \cos\left[\frac{(2x + 1) u \pi}{16}\right] \cos\left[\frac{(2y + 1) v \pi}{16}\right]$$

where C(x) = 1/√2 when x = 0 and C(x) = 1 at values of x other than 0 [9]. To determine the grid bias or color offset, a follow-on function must be employed. The following formula is used, where Q(u, v) defines the quantization table for the internal elements:

$$F^{Q}(u, v) = \frac{F(u, v)}{Q(u, v)}$$

When 24 bits (8 bits per color) are input into the DCT, the information describing any given pixel can be reduced from 24 bits to as little as 3. Depending on the number of bits used to represent the DCT coefficients, the resulting compression of data describing the pixel can reduce the total size of the file without noticeably altering its final form. JPEG steganography uses the least significant bit of the DCT coefficients to hide the desired message. Since the coefficient represents the relative difference from the grid's quantized value, modifying the LSB of each coefficient changes the value of the entire grid and has an imperceptible impact on the reproduced image. When using the LSB of the discrete coefficients for any given block, a modification to a single coefficient affects the values of each of the 64 discrete pixels. This translates into 64 minor changes to a single block, which results in a smoother color transition between blocks.

Some picture formats (such as GIF) embed information pertaining to the visual representation in the color palette, which affects all of the bit layers in the image. Steganographic systems that use these formats and modify the LSB to embed data often impart noticeable changes to the reproduced image, and that serves as a cue indicating the existence of embedded data. Because JPEG images do not distribute the image information across the entire image, but rather across discrete 8 × 8 pixel blocks, the format is less susceptible to those visual attacks [18]. Figure 4 shows a JPEG photograph of my son and daughter (twins). No obvious difference between the original and the image with a steganographically embedded message is apparent.

[Figure 4: two photographs side by side, the original on the left and the steganographic image on the right]

Figure 4 Original & Steganographic (right) images

3.2.3 Audio Files

The human ear can distinguish frequencies between 20 Hz and 20 kHz [17]. When a stream of data is embedded into an audio signal at frequencies above that range, the effect is inaudible and cannot be detected by the human ear. Not only does the carrier's reproduction of the data sound identical to that of the original, but the only impact the added data has is an increase in the size of the file. A frequency spectrum analyzer, or a calculation of the total amount of data required to produce the same audible spectrum over that time interval, would be able to detect the presence of the additional information. LSB modification of the bit stream can also be used, but it has a noticeable detrimental impact on the carrier file when it is reproduced. Typically the reproduced audio has a higher occurrence and level of what sounds like static or hiss. As displayed in the blow-up of Figure 2, the audio signal can be expected to diverge from the original, which results in an increase in the amount and level of background noise.
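One way to picture the high-frequency approach described above is on-off keying of a tone that sits above the audible range. The Python sketch below is purely illustrative and does not represent any particular product: the 48 kHz sample rate, the 21 kHz carrier, the 10 ms bit period, the amplitudes, and all function names are assumptions made for the example. It also shows why a spectrum analysis exposes the scheme, since the receiver itself recovers the bits by measuring energy at the carrier frequency.

```python
import math

SAMPLE_RATE = 48_000       # samples per second (assumed)
CARRIER_HZ = 21_000        # above the ~20 kHz limit of human hearing (assumed)
SAMPLES_PER_BIT = 480      # 10 ms per hidden bit (assumed)
CARRIER_AMPLITUDE = 0.05   # small compared with the cover signal (assumed)


def embed_tone(cover, bits):
    """Add a burst of the carrier tone to each bit period that encodes a 1."""
    if len(bits) * SAMPLES_PER_BIT > len(cover):
        raise ValueError("cover audio is too short for the message")
    stego = list(cover)
    for i, bit in enumerate(bits):
        if not bit:
            continue
        for n in range(i * SAMPLES_PER_BIT, (i + 1) * SAMPLES_PER_BIT):
            stego[n] += CARRIER_AMPLITUDE * math.sin(
                2 * math.pi * CARRIER_HZ * n / SAMPLE_RATE)
    return stego


def carrier_strength(window, start):
    """Correlate one bit period against the carrier (a single-bin spectrum check)."""
    i_sum = q_sum = 0.0
    for k, x in enumerate(window):
        phase = 2 * math.pi * CARRIER_HZ * (start + k) / SAMPLE_RATE
        i_sum += x * math.cos(phase)
        q_sum += x * math.sin(phase)
    return math.hypot(i_sum, q_sum)


def extract_tone(stego, n_bits):
    """Decide each bit by comparing carrier strength with half the expected value."""
    threshold = CARRIER_AMPLITUDE * SAMPLES_PER_BIT / 4
    bits = []
    for i in range(n_bits):
        start = i * SAMPLES_PER_BIT
        window = stego[start:start + SAMPLES_PER_BIT]
        bits.append(1 if carrier_strength(window, start) > threshold else 0)
    return bits


# Cover signal: a 440 Hz tone standing in for ordinary audible content.
cover = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
         for n in range(8 * SAMPLES_PER_BIT)]
bits = [0, 1, 0, 1, 0, 1, 0, 0]
stego = embed_tone(cover, bits)
recovered = extract_tone(stego, len(bits))
print(recovered == bits)
```

A lossy codec or a low-pass filter that discards content above the audible band would remove the hidden tone, so a practical scheme of this kind depends on the audio being stored and transmitted without such processing.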

3.2.4 TCP/IP Headers

Steganography is also possible through manipulation of existing overt mechanisms to create covert channels. The covert channel concept was first introduced by Lampson as a channel used for information transmission, but not designed nor intended for communication [16]. Steganography practitioners have already shown that the carrier can be any message or information source that contains redundancy or irrelevancy [2]. When detailed understanding of a specific technology is placed in the hands of an individual with ill intent and technical proficiency, covert channels are often created as a result. If the capability to generate and read the contents of well formed TCP/IP packets exists at both ends of a communication channel, it is possible for the two parties to covertly pass hidden data. The exploitation modifies several of the fields in standard packet headers to carry information over a covert channel [7]. Specifically, the flags field of the IPv4 header and the sequence number field of the TCP header are particularly susceptible to manipulations that serve this purpose [21].

The flags field in the IPv4 header contains three bits: a reserved bit, a DF (do not fragment) bit, and an MF (more fragments) bit. Provided that the parties wishing to communicate covertly both know the maximum transmission unit (MTU) of the network, they can manipulate the flags field to carry a message within standard TCP/IP packets that contain innocuous data. As long as the total packet size stays below the MTU, modification of the DF bit has no impact on the transmission of the cover message. By contrast, packets exceeding the MTU with the DF flag set are dropped and reported to the sender as undeliverable [25]. By keeping the size of the cover packets below the MTU for the given network, the DF flag can be arbitrarily assigned, allowing the field to carry binary data covertly.

Conversely, by exceeding the MTU of the network, the sender is able to transmit packets that will arrive out of sequence and thus can convey binary information through the order of packet arrival. If the order of arrival of the cover message packets is unimportant, individuals can take advantage of the packet sequence number field to relay covert information to the receiver. An algorithm that packetizes the cover message and then transmits the packets in an altered sequence can indicate binary data by either transmitting a sequential packet to indicate a zero, or an out of sequence packet to indicate a one. Further, a program that controls the composition of packets can transmit all packets in the correct order but modify the sequence number of each transmitted packet when needed (to indicate a change in state). Both of these approaches can embed binary data inside an overt communications channel [1].

4. COMMERCIALLY AVAILABLE SOFTWARE

Johnson & Johnson Technology Consultants maintains a website that contains a survey of more than 140 software titles that perform steganography using all of the various types of data files discussed earlier [11]. The software includes Freeware, Shareware, and licensed versions for both individual and business users. Various titles on the site will run on Linux variants, Microsoft Windows, and Macintosh computers. Of the more than 140 titles listed on the web site, over half (85) deal with embedding information in still images. Of these, 37 can encode information into BMP files, 20 into GIF, and 15 can embed data in JPEG files. The remaining titles can use any binary input to produce encoded PCX, PICT, and PNG output files.

The second most popular type of steganography software is for plain text and HTML file types. The data is embedded through the use of character spacing, insertion of sequences of tabs and spaces at the end of the lines in the carrier file, or through production of poorly formed English sentences and poetry. Even the web site's descriptions of these types of tools acknowledge poor grammar and awkward word selection, as when one is described as a "substitution cipher that makes text files look like a cross between adlibs and bad poetry" [11]. Software for audio steganography is also widely available. Formats suited for injection include WAV, PCM, AVI, MIDI, MPEG, MP3, RIFF, and VOC.

Finally, data hiding software titles are available to embed information in unused or hidden locations on physical drives. Through manipulation of unused space or hidden directories, data can be stored between files or in any unused area of the file system. The tools take advantage of the slack space often located between the legitimate end of a file and the start of the next cluster. Hidden directories can be created that are not included in the allocation tables of the main operating system. Files are stored in these directories through a ghost or mirror OS directory structure that is managed by the software. By using areas of the drive unlikely to be accessed by the OS, or by marking the sectors as bad or unreadable in the main OS allocation tables, the steganography software is able to reduce or eliminate the likelihood that the hidden data will be overwritten. By encoding or encrypting the data stored in slack space and hidden directories, this software is also able to reduce the chance that simple file scans will detect or indicate the presence of the hidden data.

The vast majority of the steganography titles incorporate cryptographic algorithms such as AES, 3DES, RSA, and Blowfish, either to encrypt the hidden message prior to embedding or to randomize the injection sequence for the data. When the file containing the embedded information is provided to the recipient, only the correct password and decoding algorithm will produce the decoding sequence or decrypt the embedded file.
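The idea of using a password to randomize the injection sequence can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the method of any surveyed title: the password is hashed with SHA-256 to seed Python's general-purpose random generator (which is not cryptographically secure and stands in here for a proper keyed generator), and the function names are hypothetical.

```python
import hashlib
import random


def injection_positions(password, carrier_len, n_bits):
    """Derive a repeatable pseudo-random list of carrier positions from a password."""
    seed = int.from_bytes(hashlib.sha256(password.encode("utf-8")).digest(), "big")
    # Illustration only: random.Random is deterministic but not cryptographically secure.
    return random.Random(seed).sample(range(carrier_len), n_bits)


def embed_keyed(carrier, bits, password):
    """Write message bits into the LSBs at password-derived positions."""
    stego = list(carrier)
    for pos, bit in zip(injection_positions(password, len(stego), len(bits)), bits):
        stego[pos] = (stego[pos] & ~1) | bit
    return stego


def extract_keyed(stego, n_bits, password):
    """Regenerate the same positions from the password and read the LSBs back."""
    return [stego[pos] & 1
            for pos in injection_positions(password, len(stego), n_bits)]


carrier = list(range(100, 164))            # 64 hypothetical 8-bit samples
message = [0, 1, 0, 1, 0, 1, 0, 0]         # ASCII "T"
stego = embed_keyed(carrier, message, "correct horse")
print(extract_keyed(stego, len(message), "correct horse") == message)
```

Because the sender and recipient derive the same positions from the shared password, no position list needs to be transmitted, and an analyst without the password cannot tell which samples to read or in what order.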

5. DETECTION AND RECOVERY METHODS

Steganalysis is the art and science behind the detection of the use of steganography by a third party. The basic function of steganalysis is to first detect or estimate the probability that hidden information is present in any given file. The detection and estimation is based only on the data presented in its observable form (i.e. nothing is known about the file prior to investigation). Because simply detecting the presence of hidden data may not be sufficient, steganalysis also covers the functions of extracting the message, disabling and/or destroying the hidden message so that it cannot be extracted, and finally, altering the hidden message such that misinformation can be sent to the intended recipient instead of the original message [10]. Depending on how much information is known about the embedded image, steganalysis techniques and methods closely mirror traditional cryptanalysis methods [3]. The steganalysis attack methods can be broken into six types:

Steganography-only attack: Only the file with the embedded data is available for analysis.
Known-carrier attack: Both the original carrier file and the final (hidden message embedded) file are available for analysis.
Known-message attack: The original message prior to embedding in the carrier is known.
Chosen-steganography attack: Both the algorithm used to embed the data and the final (hidden message embedded) file are known and available for analysis.
Chosen-message attack: The original message and the algorithm used to embed the message are available, but neither the carrier nor the final (hidden message embedded) file is. This attack is used by the analyst for comparison to future files.
Known-steganography attack: All components of the system (the original message, the carrier message, and the algorithm) are available for analysis.

It follows that the success of any steganalysis technique is tied to the amount of information known about the file prior to investigation. As more information about the file is known prior to investigation, the investigator can move from simply detecting the hidden message to modifying or altering it before sending it on to the intended recipient. In the first category (steganography-only attack), the purpose of analysis is simply to detect the existence of a hidden message. Without prior knowledge of the encoding mechanism, key, or data contained within the message, recovery of the contents using this method, while possible, can take an excessive amount of time. With access to the original carrier and the final file with the embedded content (known-carrier attack), the purpose of analysis can move toward recovering the embedded message by comparing the differences between the two files. If the algorithm is known and the file with the hidden message embedded is also available (chosen-steganography attack), the analyst may have the ability to reverse the embedding to recover the hidden message and can easily alter or destroy the hidden contents. Finally, if the analyst has the algorithm and a message prior to embedding (chosen-message attack), they can move towards identifying possible (hidden message embedded) files in order to attempt to recover the original carrier. If the carrier can be recovered or closely reproduced, it becomes possible to insert alternate messages in lieu of the original message.

The steganography-only attack can be accomplished through the use of statistical analysis performed on the final medium. In the following example, the color contents of JPEG images are examined. A modification to each coefficient's LSB produces variations in the data that result in deviations in the histogram for the given file. If the deviations are large enough to produce noticeable aberrations, the embedded file's histogram can identify the existence of the hidden message. Likewise, LSB modifications to palette-based images (GIF, etc.) cause duplications of the colors in the palette, with identical or nearly identical colors appearing. This duplication of colors can also serve as an indicator pointing to the existence of hidden data.

When examining the grayscale histograms for an original and a steganographically embedded JPEG (such as in Figure 4), slight deviations in the histograms are noticeable. The grayscale histogram provides a cumulative value for all three color channels (red, green, and blue) at each brightness level (0-255). As such, the value displayed in the graph for brightness level 100 would be the total number of pixels in the image with a value of 100 in grayscale brightness. By modifying the original palette LSBs or the LSBs of the DCT coefficients, the histogram values shift to reflect the change in the number of pixels containing that specific value. To demonstrate this phenomenon, Figure 5 compares the same photograph in its original form (containing 42,784 colors) to an embedded version of the file (containing 42,886 colors).
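The growth in the number of distinct colors described above can be illustrated on a toy image. The Python sketch below is a simplified illustration rather than a reproduction of the photograph in Figure 4: the eight-pixel image, the bit pattern, and the function names are all hypothetical, and the embedding is plain sequential LSB replacement on the raw color channels.

```python
from collections import Counter


def embed_lsb_rgb(pixels, bits):
    """Sequentially replace the LSB of each color channel with a message bit."""
    flat = [channel for pixel in pixels for channel in pixel]
    if len(bits) > len(flat):
        raise ValueError("message is longer than the carrier")
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & ~1) | bit
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def count_colors(pixels):
    """Number of distinct (R, G, B) triples in the image."""
    return len(set(pixels))


def channel_histogram(pixels):
    """Cumulative count of every channel value across red, green, and blue."""
    return Counter(channel for pixel in pixels for channel in pixel)


# Toy cover image: four yellow-ish pixels and four green pixels (two colors).
cover = [(254, 254, 0)] * 4 + [(30, 186, 56)] * 4
bits = [1, 0, 0,  0, 1, 0,  0, 0, 1,  0, 0, 0,
        1, 1, 0,  1, 1, 0,  0, 0, 0,  0, 0, 0]
stego = embed_lsb_rgb(cover, bits)

print(count_colors(cover), count_colors(stego))
print(channel_histogram(cover)[254], channel_histogram(stego)[254])
```

The embedded image contains more distinct colors than the cover, each a near duplicate of an original color, and the counts in the channel histogram shift between adjacent values in the same way the text describes for the full-size photograph.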
Figure 5 Original (42,784 colors) & Steganographic (below, 42,886 colors) Histograms

The arrows in the embedded histogram indicate two obvious differences in the waveform (the introduction of pixels near brightness level 64 and the reduction of pixels near level 175). Steganalysis takes this phenomenon one step further by comparing the normalized distribution of colors against a predicted value. For palette-based images, a normal distribution of color frequency is likely. A scalable standard bell curve can be assumed as the comparison benchmark against the suspect file. As seen previously, changes to the LSBs for any given pixel can create duplicate (or near-duplicate) colors in the image's color palette. The duplicate colors increase the frequency for that value and can create a spike in the distribution exceeding the benchmark reference. Any large deviation from the benchmark can be an indicator of anomalies or modifications to the contents of the file. The process for JPEGs can be a bit more complicated. Because the JPEG format does not use a palette-based encoding algorithm, a second step is necessary to compare DCT frequency to a benchmark. Recall that DCT coefficients are reference points based on the quantized value for each color channel in the 8×8 grid. As references, they are small by nature, and plotting the frequency of one grid's coefficient values against another's without compensating for the quantization reference is pointless. Further, the value of any given coefficient only affects a small percentage of the total number of pixels in the image. When tallied individually, the histogram for the DCT coefficients will only tell whether the image contains elements of high contrast or not (e.g., a photo of the blue sky vs. a picture of the international balloon fiesta in Albuquerque, NM). The coefficients for the blue sky should have less variance than the coefficients for the photo of the colorful balloons.
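The growth in the number of distinct colors described above can be reproduced on synthetic data. The following Python sketch is only an illustration under stated assumptions: the four base colors, the pixel count, and the helper names `embed_lsb` and `count_colors` are hypothetical and are not taken from any tool discussed in this paper. It overwrites the LSB of each color channel with message bits and counts the distinct colors before and after.

```python
import random

def embed_lsb(pixels, bits):
    """Overwrite the LSB of successive channel values with message bits."""
    flat = [channel for pixel in pixels for channel in pixel]
    if len(bits) > len(flat):
        raise ValueError("message is too long for this carrier")
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & ~1) | bit  # clear bit 0, then set it to the message bit
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

def count_colors(pixels):
    """Number of distinct (R, G, B) triples in the image."""
    return len(set(pixels))

# Synthetic "image": 1000 pixels drawn from four flat colors (assumed values).
rng = random.Random(2007)
base_colors = [(200, 120, 40), (30, 60, 90), (250, 250, 250), (10, 10, 10)]
cover = [rng.choice(base_colors) for _ in range(1000)]

# 600 random message bits touch the first 200 pixels (3 channels each).
message_bits = [rng.getrandbits(1) for _ in range(600)]
stego = embed_lsb(cover, message_bits)

print("distinct colors in cover:", count_colors(cover))
print("distinct colors in stego:", count_colors(stego))
```

Each modified pixel differs from its source color by at most one level per channel, so the new colors are near-duplicates of existing ones, which is the kind of palette anomaly a statistical detector can look for.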
Algorithms that sequentially modify the DCT coefficients in JPEG files tend to cause distortions in the histogram that flatten out the frequency values of adjacent DCT coefficients [23]. To compensate for this issue, newer algorithms do not embed the data sequentially but rather use a password or key to generate a random order for DCT or LSB modifications. Some readily available software titles for steganography detection include StegDetect, Stego Watch, and Steg Spy. Each of these titles uses some form of statistical analysis on the target image to predict the existence of a hidden message. Westfeld and Pfitzmann used a χ² test to predict the probability that an image contained steganographic content by comparing the expected distribution (the null hypothesis) against the sampled values [23]. The closer the sampled values came to the distribution expected for embedded data, the higher the estimated probability of steganographic content at that point in the file. Because their algorithm ran on sequential bytes with an increasing sample size for each calculation, the point at which the probability dropped often revealed the size of the hidden message as well. Statistical steganalysis has been made more difficult recently because some steganography algorithms specifically take measures to preserve the carrier file's first-order statistics to avoid this type of detection. Further, encrypting the content of the embedded message makes detection even harder because encrypted data generally has a high degree of randomness associated with it [18].
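The idea behind the χ² test can be sketched in a few lines of Python. This is a simplified illustration, not the implementation from [23] or from any detection tool named above: it computes only the test statistic over pairs of values (2k, 2k+1) for 8-bit samples, and it omits both the conversion of the statistic to a probability and the merging of sparsely populated categories. LSB embedding of random-looking data tends to equalize the two counts in each pair, so a small statistic relative to the degrees of freedom is the suspicious case. The function name `pov_chi_square` and the synthetic data are assumptions made for this example.

```python
import random
from collections import Counter

def pov_chi_square(values):
    """Chi-square statistic over pairs of values (2k, 2k+1) for 8-bit samples.

    Returns (statistic, degrees_of_freedom).
    """
    counts = Counter(values)
    statistic = 0.0
    categories = 0
    for k in range(128):
        even, odd = counts[2 * k], counts[2 * k + 1]
        expected = (even + odd) / 2  # count expected if LSBs were fully randomized
        if expected > 0:
            statistic += (even - expected) ** 2 / expected
            categories += 1
    return statistic, categories - 1

# Synthetic cover data in which even values are deliberately more common.
rng = random.Random(23)
raw = [rng.randrange(256) for _ in range(20000)]
cover = [(v & ~1) if rng.random() < 0.6 else v for v in raw]

# Stego data: every LSB replaced by a random message bit.
stego = [(v & ~1) | rng.getrandbits(1) for v in cover]

cover_stat, cover_dof = pov_chi_square(cover)
stego_stat, stego_dof = pov_chi_square(stego)
print("cover statistic:", round(cover_stat, 1), "dof:", cover_dof)
print("stego statistic:", round(stego_stat, 1), "dof:", stego_dof)
```

Running the same calculation on a growing prefix of the samples, rather than on the whole file at once, is what allows the length of a sequentially embedded message to be estimated.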

After detection of hidden content within a carrier file, the next step is recovery of the hidden message itself. For known-carrier and chosen-steganography attacks (where the algorithm used to embed the data is known), some of the same detection tools have been extended to use brute-force message recovery to also break the key used to embed or encrypt the data. With respect to JPEG files, there are several software titles that hide information using variations of LSB insertion. JSteg sequentially embeds the hidden data in least significant bits, JP Hide&Seek uses a random process to select least significant bits, F5 uses a matrix encoding based on a Hamming code, and OutGuess preserves first-order statistics [18]. If the file is intercepted en route, then after the hidden message is recovered (by breaking the encryption and embedding key or otherwise) the same carrier file can be used to embed an alternate message prior to sending it on to its final destination. Because modifications to the data comprising the carrier file are made without incorporating a mapping back to the original values, recovery of the original carrier file is difficult and sometimes impossible. For digital pictures, audio, video, and even file slack space, steganographic modifications to the original contents often destroy the integrity of the carrier file in the process. Should a carrier file need to be reused, a close approximation of the original file can be made using the techniques described in the next section.
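To make the role of the embedding key concrete, the following Python sketch shows a toy scheme in which a key seeds a pseudo-random ordering of the LSB positions, followed by a dictionary attack that tries candidate keys until a recognizable header appears. Everything here is a hypothetical construction for illustration: the `MAGIC` marker, the one-byte length field, and the function names are assumptions, and the scheme does not reproduce the format of JSteg, JP Hide&Seek, F5, OutGuess, or any other real tool.

```python
import random

MAGIC = b"MSG:"  # hypothetical marker this toy embedder prepends to every payload

def _positions(key, n_slots):
    """Key-dependent visiting order for the carrier bytes."""
    order = list(range(n_slots))
    random.Random(key).shuffle(order)
    return order

def _to_bits(data):
    return [(byte >> i) & 1 for byte in data for i in range(8)]

def _from_bits(bits):
    return bytes(
        sum(bit << i for i, bit in enumerate(bits[j:j + 8]))
        for j in range(0, len(bits), 8)
    )

def embed(carrier, message, key):
    """Write MAGIC + length + message into key-selected LSBs of the carrier."""
    if len(message) > 255:
        raise ValueError("toy format stores the length in one byte")
    bits = _to_bits(MAGIC + bytes([len(message)]) + message)
    if len(bits) > len(carrier):
        raise ValueError("message is too long for this carrier")
    out = bytearray(carrier)
    for pos, bit in zip(_positions(key, len(out)), bits):
        out[pos] = (out[pos] & 0xFE) | bit
    return bytes(out)

def extract(stego, key):
    """Return the message if the key yields the MAGIC header, else None."""
    order = _positions(key, len(stego))

    def read(start_byte, count):
        span = order[start_byte * 8:(start_byte + count) * 8]
        return _from_bits([stego[p] & 1 for p in span])

    if read(0, len(MAGIC)) != MAGIC:
        return None
    length = read(len(MAGIC), 1)[0]
    return read(len(MAGIC) + 1, length)

def dictionary_attack(stego, candidates):
    """Try each candidate key; return (key, message) for the first that fits."""
    for key in candidates:
        message = extract(stego, key)
        if message is not None:
            return key, message
    return None

rng = random.Random(18)
carrier = bytes(rng.randrange(256) for _ in range(2000))
stego = embed(carrier, b"meet at noon", "tulip")
print(dictionary_attack(stego, ["rose", "daisy", "tulip"]))
```

The attack only works because the analyst knows the algorithm and can recognize a correct extraction. If the payload were encrypted and carried no recognizable header, each candidate key would yield output that looks equally random, which is why encryption makes recovery so much harder.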

6. DENYING STEGANOGRAPHY

In contrast to the technical challenges facing the detection and recovery of hidden data, altering steganographically embedded information for common carrier types is relatively easy. System owners and administrators seeking to disrupt the communications channel provided by steganography can implement file transformations in the communication channel to accomplish this goal. Recalling that the most common data types are image, video, and sound files, one simple approach is to change the format of (transform) the data by re-encoding it into an alternate format. The use of a guard processor at the entry and exit point(s) of the system's network could accomplish this task. For example, Figure 6 displays a photograph of my oldest son with his twin brother and sister. The photo on the left is the original photo in a bitmap format; the photo on the right has a Microsoft Office Excel spreadsheet (~44 Kb in size) steganographically embedded in it. Again, the two photographs are indistinguishable to the human eye. That the files nevertheless differ can be shown with a cryptographic hashing algorithm.

Figure 6 Original & Steganographic Bitmap
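Why re-encoding disrupts an LSB channel can be illustrated with a small Python sketch. The `lossy_reencode` function below is only a stand-in for a real lossy codec: it snaps each 8-bit sample down to a multiple of four, which is an assumption chosen for simplicity and is not how JPEG compression actually works. The point is that any transformation that does not preserve the low-order bits discards the embedded data while changing each sample by an amount too small to notice.

```python
def embed_lsb(samples, bits):
    """Write message bits into the LSBs of successive samples."""
    out = list(samples)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit
    return out

def extract_lsb(samples, count):
    """Read back the first `count` LSBs."""
    return [s & 1 for s in samples[:count]]

def lossy_reencode(samples, step=4):
    """Stand-in for a lossy format conversion: coarse quantization of each sample."""
    return [(s // step) * step for s in samples]

cover = [(i * 37) % 256 for i in range(64)]   # synthetic 8-bit samples
bits = [1, 0, 1, 1, 0, 0, 1, 0] * 8           # 64 message bits
stego = embed_lsb(cover, bits)
cleaned = lossy_reencode(stego)

print("message intact before conversion:", extract_lsb(stego, len(bits)) == bits)
print("message intact after conversion: ", extract_lsb(cleaned, len(bits)) == bits)
print("largest change to any sample:", max(abs(a - b) for a, b in zip(stego, cleaned)))
```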

An MD5 128-bit hash provides a high degree of confidence that different inputs produce different hash outputs. Thus, differences in MD5 hashes provide a high level of certainty that the given inputs (the binary contents composing the two photos) contain differences [20]. While the properties of the files (image size, number of pixels, etc.) remained identical during the embedding process (see Table 1), as expected the MD5 hashes of the files do not match.

Table 1 File Properties

File Name        Size (bytes)  Pixels     Depth (bit)  MD5 Sum
Kids_orig.bmp    401,910       448 x 299  24           c1b865197b559747be78a86bfa106b16
Kids_steg.bmp    401,910       448 x 299  24           a39fb606650363bd064d5d76b0af3c10
Kids_steg.jpg    29,493        448 x 299  24           a03448ae1050d4bece4be38615253fac
Kids_recov.bmp   401,910       448 x 299  24           1e55e9e65645892af9fe24e195e4dd53
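The hash comparison can be reproduced with Python's standard `hashlib` module. The sketch below is a minimal illustration: the helper names `md5_hex` and `md5_of_file` are introduced here for convenience, and the in-memory byte strings stand in for the photographs, which are not available. It shows that two inputs of identical size that differ in a single least significant bit yield different MD5 sums.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """MD5 digest of an in-memory byte string, as 32 hex characters."""
    return hashlib.md5(data).hexdigest()

def md5_of_file(path, chunk_size=65536):
    """MD5 digest of a file, read in chunks so large files need little memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Stand-in for two image files: same length, one LSB flipped.
original = bytes(range(256)) * 4
altered = bytearray(original)
altered[100] ^= 1
altered = bytes(altered)

print("same size:", len(original) == len(altered))
print("original:", md5_hex(original))
print("altered: ", md5_hex(altered))
```

Given the actual files, a call such as `md5_of_file("Kids_orig.bmp")` would be compared against `md5_of_file("Kids_steg.bmp")` in the same way; the paths are shown only as an example of usage.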

The photograph with the embedded data (Kids_steg.bmp) is the same size, contains the same number of pixels, and has the same depth as the original, but the binary contents of the file are different from those of the original (Kids_orig.bmp). Visually, the two files appear to be identical, but the MD5 sum provides credible evidence that this is not the case. To illustrate how to defeat the steganographic mechanism, the final file (Kids_steg.bmp) was converted into a JPEG by opening it in Microsoft Paint and using the save as feature to save it in the JPEG format. Note that the MD5 sum of Kids_steg.jpg does not match either the original or the embedded version of the photograph. An expected and noticeable reduction in file size is achieved when using JPEG compression. In this case, once the final file is converted into a new format, the embedded message is destroyed and the covert steganographic channel is effectively denied. The final step in proving this was to reconvert the JPEG image back into a bitmap. Again, Microsoft Paint was used to open the JPEG image and the save as feature was used to save it in the bitmap format. Note that the recovered image (Kids_recov.bmp) has identical properties to the original and steganographic files, but a different MD5 sum. The recovered image no longer contains the hidden message, and it is not the same file as the original. The modifications made to the original file when the Microsoft Office Excel spreadsheet was embedded caused irrecoverable changes to the bits defining each pixel. For video and audio files the process outlined above remains the same. Convert the file to another format that requires a conversion, such as a lossy compression or expansion routine, and the embedded data will be destroyed in the process. With the exception of high compression data formats, the resulting cleaned reproduction of the file should show no noticeable deviations from the original.

For text-based denial techniques, the process can be a bit more complicated. Removal and/or addition of carriage returns and white space (such as adding an additional space after every period in the text) can shift the placement of characters, which can break the character mapping that decryption keys rely on. Techniques like this can also alter the spacing of characters in a stepped character approach. Character space shifting approaches often require that the final document, or at a minimum the individual character, be an image instead of text. These steganographic insertions can be defeated using standard optical character recognition (OCR) software to rebuild the original file from the OCR output. Documents that are not image based (such as this report in its PDF format) can have the text copied and pasted into another document. Synonyms can also be used to replace the awkward text often found when words are substituted in stepped character routines. This approach not only denies the steganographic channel, but also leaves the intended message in the carrier intact and can make the document more pleasant to read.

The injection of bits into the headers of TCP/IP packets does not modify the content of the payload in any way. Steganographic covert channels utilizing techniques such as this are easily defeated through the use of monitoring features at the switch or router level. Malformed packets can be screened out or modified to conform to a specific rule set. Consider packets with the Don't Fragment (DF) bit manipulated so that the packets carry a covert message. A history- or state-based rule set could trigger on packets going to the same destination under the same protocol but having inconsistent DF bits. Other network steganography denial techniques could include a security specification stating that the DF bit on every packet leaving the switch/router should have a value of one and that on every packet entering should have a value of zero. At a more rudimentary level (knowing that it could be detrimental to some fragment-sensitive applications), network security could be achieved by forcing the above conditions and modifying the flags.
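The DF-bit normalization rule discussed in this section can be sketched in Python on a raw IPv4 header. This is an illustration under stated assumptions, not a description of any router's actual feature set: the function names `ipv4_checksum` and `force_df` are hypothetical, and a real guard device would apply such a rule in its forwarding path rather than in a script. Because the DF flag lives in the header, the header checksum has to be recomputed after the flag is changed.

```python
import struct

DF_MASK = 0x4000  # Don't Fragment flag inside the 16-bit flags/fragment-offset field

def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of the header's 16-bit words, complemented."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF

def force_df(header: bytes, df_value: int) -> bytes:
    """Return a copy of the IPv4 header with DF forced to 0 or 1 and a fresh checksum."""
    out = bytearray(header)
    (flags_frag,) = struct.unpack_from("!H", out, 6)
    if df_value:
        flags_frag |= DF_MASK
    else:
        flags_frag &= ~DF_MASK & 0xFFFF
    struct.pack_into("!H", out, 6, flags_frag)
    struct.pack_into("!H", out, 10, 0)  # checksum field must be zero while summing
    struct.pack_into("!H", out, 10, ipv4_checksum(bytes(out)))
    return bytes(out)

# Example 20-byte header with DF set and the checksum field zeroed.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
outbound = force_df(header, 1)   # rule for packets leaving the network
inbound = force_df(header, 0)    # rule for packets entering the network
print(outbound.hex())
print(inbound.hex())
```

Since every packet crossing the boundary ends up with the same DF value, the bit can no longer carry information between an inside sender and an outside receiver.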

7. CONCLUSION

Computer forensic professionals need to be aware of the difficulties in identifying the use of steganography in any investigation. As with many digital age technologies, steganography techniques are becoming increasingly sophisticated and difficult to detect reliably. Once use is detected or discovered, recovering the embedded content is becoming difficult as well. Acquiring knowledge of current steganographic techniques, along with their associated data types, can provide a critical advantage to an investigator by adding valuable tools to their forensic toolkit. Finally, because relatively simple techniques are capable of denying the exploitation of a covert steganographic channel, companies may wish to take precautionary measures. By enacting the measures discussed in this paper, they can help ensure that their proprietary and trade secret information is not being shoplifted inside the daily podcast, shared in family photos, or distributed via the latest YouTube video.

REFERENCES
[1] K. Ahsan and D. Kundur, Practical Internet Steganography: Data Hiding in IP, found online at <http://www.ece.tamu.edu/~deepa/pdf/txsecwrksh03.pdf>
[2] R.J. Anderson and F.A.P. Petitcolas, On the Limits of Steganography, J. Selected Areas in Comm., vol. 16, no. 4, 1998, pp. 474–481
[3] K. Curran and K. Bailey, An evaluation of image-based steganography methods, International Journal of Digital Evidence, Fall 2003.
[4] M. Chapman, G. Davida, and M. Rennhard, A Practical and Effective Approach to Large-Scale Automated Linguistic Steganography, found online at <http://www.nicetext.com/doc/isc01.pdf>
[5] Y. Dai, G. Liu, and Z. Wang, Breaking Predictive-Coding-Based Steganography and Modification for Enhanced Security, IJCSNS International Journal of Computer Science and Network Security, vol. 6, no. 3b, March 2006
[6] J. Fridrich, M. Goljan, and R. Du, Reliable Detection of LSB Steganography in Color and Grayscale Images, found online at <http://www.ssie.binghamton.edu/fridrich/Research/acmwrkshp_version.pdf>
[7] T. Handel and M. Sandford, Hiding data in the OSI network model, First International Workshop on Information Hiding, Cambridge, U.K., May-June 1996.
[8] Herodotus, The Histories, Penguin Classics; Reprint edition, September 1, 1996
[9] International Telecommunication Union, "Information Technology - Digital Compression and Coding of Continuous-Tone Still Images - Requirements and Specifications Recommendation T.81", ITU, Sept. 1992
[10] J.T. Jackson, H. Gregg, G.H. Gunsch, R.L. Claypoole, and G.B. Lamont, Blind Steganography detection using a computational immune system: A work in progress, International Journal of Digital Evidence, December 2003.
[11] N. Johnson, Digital Image Steganography and Digital Watermarking Tool Table, found online at <http://www.jjtc.com/Steganography/toolmatrix.htm>
[12] N.F. Johnson and S. Jajodia, Exploring Steganography: Seeing the Unseen, Computer, vol. 31, no. 2, 1998, pp. 26–34.
[13] N.F. Johnson and S. Jajodia, Steganalysis: The Investigation of Hidden Information, found online at <http://www.jjtc.com/pub/it98jjgmu.ps>
[14] A. Kerckhoffs, Military Cryptography, French Journal of Military Science, Feb. 1883.
[15] R. Krenn, Steganography: Implementation & Detection, found online at <http://www.krenn.nl/univ/cry/steg/presentation/2004-0121-presentation-steganography.pdf>
[16] B.W. Lampson, A note on the confinement problem, in Proc. of the Communications of the ACM, October 1973, number 16:10, pp. 613–615.
[17] C. Nave, Sensitivity of the Human Ear, found online at <http://hyperphysics.phy-astr.gsu.edu/hbase/sound/earsens.html>
[18] N. Provos and P. Honeyman, Hide and Seek: An Introduction to Steganography
[19] Public Record Office, SOE Syllabus: Lessons in ungentlemanly warfare, World War II, with an introduction by Denis Rigden, Richmond, 2001
[20] R. Rivest, The MD5 Message-Digest Algorithm, MIT Laboratory for Computer Science and RSA Data Security, Inc., April 1992, found online at <http://www.faqs.org/rfcs/rfc1321.html>
[21] University of Southern California Information Sciences Institute, Internet protocol, DARPA internet program, protocol specification, September 1981, Specification prepared for Defense Advanced Research Projects Agency.
[22] M. Weiss, Principles of Steganography, found online at <http://www.math.ucsd.edu/~crypto/Projects/MaxWeiss/steganography.pdf>
[23] A. Westfeld and A. Pfitzmann, Attacks on Steganographic Systems, Proc. Information Hiding: 3rd Int'l Workshop, Springer Verlag, 1999, pp. 61–76.
[24] W. White, The Microdot: History and Application. Williamstown, NJ: Phillips Publications, 1992.
[25] M. Wolf, Covert channels in LAN protocols, in Proceedings of the Workshop on Local Area Network Security (LANSEC '89), T.A. Berson and T. Beth, Eds., 1989, pp. 91–102
