
Boosting the Performance of LZW Compression
through Block Sorting for
Universal Lossless Data Compression

Sajib Kumar Saha, Md. Masudur Rahaman
Khulna University/CSE, Khulna, Bangladesh, e-mail: to_sajib_cse@yahoo.com / tosajib@gmail.com
Khulna University/CSE, Khulna, Bangladesh, e-mail: masud_cse02@yahoo.com

Abstract— In this paper a new data compression technique is proposed based on Lempel-Ziv-Welch (LZW) coding and block sorting. First, block sorting is performed on the input data to produce permuted data. LZW coding is then applied to that permuted data to produce a more compressed output. Although block sorting takes some extra time (the amount is negligible), it considerably increases the performance of LZW compression. The proposed model is compared against plain LZW compression.

Keywords— Block sorting, cyclic rotation, permuted string, reverse block sorting, LZW.

I. INTRODUCTION

Data compression algorithms are developed to reduce redundancy in data representation and thereby decrease data storage requirements. Data compression is broadly divided into two categories: lossless and lossy. LZW compression is very popular in the field of lossless data compression. In this paper a modification of LZW compression is devised through block sorting [3], a technique for producing permuted data. In the organization of this paper we first describe block sorting and then the LZW coding technique. The proposed model and experimental results follow, demonstrating the superiority of the proposed technique over plain LZW compression.

II. LITERATURE SURVEY

A. Basic Concepts about Block Sorting

Block sorting [3] is a technique for making text files more amenable to compression by permuting the input file so that characters that appear in similar contexts are clustered together in the output.

Figure 1 shows how block sorting is performed on the short text mississippi. First, all the cyclic rotations of the text are produced. Next, the rotations are sorted into the order of their contexts. Finally, the permuted characters are transmitted. For example, the complete permuted text for the string mississippi is pssmipissii (Figure 1(c)).

It is also necessary to transmit the position in the permuted text of the first character of the original text. In the case of the example in Figure 1, we transmit the number four, since the letter m, which is the first character of the input, appears at position four in the permuted string. Therefore, the output of the block sorting for the string mississippi is the pair {pssmipissii, 4}.

Remarkably, it is possible, given only this output, to reconstruct the original text. This is possible because the matrix in Figure 1 is constructed using cyclic rotations; that is, the characters in the last column of Figure 1(b) cyclically precede those in the first column. Figure 1(d) shows this relationship more clearly: the characters in the left-hand column (which are taken from the last column of the sorted matrix) immediately precede those in the right-hand column (the first column of the sorted matrix). If we refer to the first and last columns of the sorted matrix as F and L respectively, then L is the output from the block sorting, and F can be obtained by the decoder simply by sorting the characters of L. We then see that each character in L is followed by the corresponding character in F. In particular, we know that in this case L[4] is the first character of the text; therefore, F[4] (the letter i) must be the second character. We must now consider the problem of locating the next character to decode. We know that each character in F must correspond to a character in L, since the two strings are merely permutations of one another. However, the letter i occurs four times in L. Which one corresponds to the i that has just been decoded (the fourth i in F)? To answer this question, we observe that each group of identical characters in F occurs in the same order in L. For example, the third s in F corresponds to the third s in L, and so on. Since we are dealing in this instance with the fourth i in F, it corresponds to the fourth i in L (L[11]), which, we then discover, is followed by an s, and so forth. Decoding the rest of the string in this manner gives the order 4, 11, 9, 3, 10, 8, 2, 7, 6, 1, 5.
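The forward transform and the L/F reconstruction described above can be expressed as a short Python sketch (illustrative only: it materializes every rotation, which is fine for small strings but not for real files; function names are ours, and we follow the paper's convention of transmitting the 1-based position in L of the first character of the original text):

```python
from collections import defaultdict

def block_sort(text):
    """Forward block sorting: sort all cyclic rotations and output the last
    column L, plus the 1-based position in L of the original first character."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    permuted = "".join(row[-1] for row in rotations)
    # The rotation starting at the second character ends with the first
    # character of the original text.
    start = rotations.index(text[1:] + text[:1]) + 1
    return permuted, start

def reverse_block_sort(permuted, start):
    """Reverse block sorting via the L/F correspondence described above:
    the j-th occurrence of a character in F matches its j-th occurrence in L."""
    F = sorted(permuted)
    positions = defaultdict(list)            # char -> its positions in L
    for pos, ch in enumerate(permuted):
        positions[ch].append(pos)
    seen = defaultdict(int)
    next_pos = [0] * len(permuted)           # F index -> matching L index
    for pos, ch in enumerate(F):
        next_pos[pos] = positions[ch][seen[ch]]
        seen[ch] += 1
    out, i = [], start - 1
    for _ in range(len(permuted)):
        out.append(permuted[i])
        i = next_pos[i]
    return "".join(out)

print(block_sort("mississippi"))             # ('pssmipissii', 4)
print(reverse_block_sort("pssmipissii", 4))  # mississippi
```

The decoding loop visits the L positions 4, 11, 9, 3, 10, 8, 2, 7, 6, 1, 5 (1-based), exactly the order derived in the text.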
(a) Rotations of the string:

     1   mississippi
     2   ississippim
     3   ssissippimi
     4   sissippimis
     5   issippimiss
     6   ssippimissi
     7   sippimissis
     8   ippimississ
     9   ppimississi
    10   pimississip
    11   imississipp

(b) Sorted matrix, with the first column F and last column L set off:

          F              L
     1    i  mississip   p
     2    i  ppimissis   s
     3    i  ssippimis   s
     4    i  ssissippi   m
     5    m  ississipp   i
     6    p  imississi   p
     7    p  pimississ   i
     8    s  ippimissi   s
     9    s  issippimi   s
    10    s  sippimiss   i
    11    s  sissippim   i

(c) Permuted string L (the last column of the sorted matrix):

     1 p   2 s   3 s   4 m*   5 i   6 p   7 i   8 s   9 s   10 i   11 i

(d) Permuted string L and sorted string F:

          L    F
     1    p    i
     2    s    i
     3    s    i
     4    m*   i
     5    i    m
     6    p    p
     7    i    p
     8    s    s
     9    s    s
    10    i    s
    11    i    s

Figure 1: Block sorting of the string mississippi: (a) rotations of the string; (b) sorted matrix; (c) permuted string (last column of the sorted matrix); (d) permuted string and sorted string.

B. Basic Concepts about LZW Compression

In the LZW compression algorithm [1, 2, 4], the input file to be compressed is read character by character, and the characters are combined to form strings. The process continues until the end of file is reached. Every new string is assigned a code and stored in the Code table, so that the code can be used whenever the string is repeated. The codes are assigned from 256 onwards, since the ASCII character set already occupies the 256 codes 0-255. The decompression algorithm expands the compressed file: the file created during compression is read character by character and expanded. This decompression process does not require the Code table built during compression.

First, the 1st and 2nd characters are combined to form a string, which is stored in the Code table; the code 256 (100h) is assigned to this first new string. Then the 2nd and 3rd characters are combined, and if that string is not in the Code table, it is assigned a new code and stored in the Code table. Thus a Code table is built up from every new string. When the same string is read again, the code already stored in the table is used. Compression occurs because a single code is output instead of a set of characters [1, 5].

Extended ASCII holds only 256 (0 to 255) characters [4, 5] and requires just 8 bits to store each character. But for building the Code table, we have to extend the 8 bits by a few more bits to hold codes of 256 (100h) and above. If we extend to 12 bits, we can store up to 4096 elements in the table; each element stored in the table is then converted to a 12-bit number.

For example, when you want to store A (dec 65, hex 41), T (dec 84, hex 54), O (dec 79, hex 4F) and Z (dec 90, hex 5A), you have to pack them into bytes as 04, 10, 54, 04, F0, 5A, because we have allotted only 12 bits for each character. Now consider the string ATOZOFC. It takes 7 x 8 = 56 bits. If a code, say 400 (190h), is assigned to the whole string, it will take only 12 bits instead of 56 bits!

Example

Input string: ATOZOFCATOZOFCATOZOFC

Character   String    Process in Table          In file
read
A           A         Store                     -
T           AT        Store                     Store
O           TO        Store                     Store
Z           OZ        Store                     Store
O           ZO        Store                     Store
F           OF        Store                     Store
C           FC        Store                     Store
A           CA        Store                     -
T           AT        Retrieve relevant code    Store
O           ATO       Store                     -
Z           OZ        Retrieve relevant code    Store
O           OZO       Store                     -
F           OF        Retrieve relevant code    Store
C           OFC       Store                     -
A           CA        Retrieve relevant code    Store
T           CAT       Store                     -
O           TO        Retrieve relevant code    Store
Z           TOZ       Store                     -
O           ZO        Retrieve relevant code    Store
F           ZOF       Store                     -
C           FC        Retrieve relevant code    Store

In this example string, the first character A is read and then the second character T. The two characters are concatenated as AT and a code is assigned to the string; the code is stored in the Code table. Since this is the first string that is new to the table, it is assigned 256 (100h). Then the second and third characters are concatenated to form another new string, TO. This string is also new to the Code table, so the table expands to accommodate it and it is assigned the next code, 257 (101h). Thus, whenever a new string is read after concatenation, it is assigned a relevant code and the Code table is built up. The table expands until the code reaches 4096 (since we have assigned 12 bits) or the end of file is reached. When a set of characters already stored in the table is read again, the code from the Code table is used. The output code is stored according to the number of bits specified by the program; in other words, if we have extended the bits from 8 to 12, then a character stored in 8 bits must be adjusted so as to store it in 12-bit format.
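The compression procedure just described — grow the current string while it is known, output the code for the known prefix when a new string appears, and add the new string to the table — can be sketched as follows (a minimal sketch returning a list of integer codes; the 12-bit bit-packing into bytes is omitted, and the function name is ours):

```python
def lzw_compress(data):
    """LZW compression sketch: single characters occupy codes 0-255; every
    new string gets the next code, starting at 256 and capped at 4095
    (the 12-bit table limit discussed above)."""
    table = {chr(c): c for c in range(256)}
    next_code = 256
    current = ""
    codes = []
    for ch in data:
        if current + ch in table:
            current += ch                 # grow the string while it is known
        else:
            codes.append(table[current])  # output the code for the known prefix
            if next_code < 4096:          # the table stops growing at 12 bits
                table[current + ch] = next_code
                next_code += 1
            current = ch
    if current:
        codes.append(table[current])      # code for the last string
    return codes

print(lzw_compress("ATOZOFCATOZOFCATOZOFC"))
# [65, 84, 79, 90, 79, 70, 67, 256, 258, 260, 262, 257, 259, 261]
```

The 21-character example string compresses to 14 codes, and the table entries created (AT = 256, TO = 257, OZ = 258, ZO = 259, OF = 260, FC = 261, CA = 262, ...) match the Store/Retrieve pattern in the table above.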
C. Basic Concepts about LZW Decompression

The compressed file is read byte by byte [1, 2]. The bytes are concatenated according to the number of bits specified. For example, since we have used 12 bits for storing the elements, we read the first 2 bytes and take the first 12 of those 16 bits. Using these bits, the Code table is built again, without the Code table previously created during compression. The remaining 4 bits of the first 2 bytes, together with the next byte, form the next code in the string table. Thus we can build the Code table and use it for decompression. This decompression algorithm builds its own Code table, which will be the same as the table created during compression. The decompression algorithm refers to this newly created Code table, not the Code table created during compression. This is the main advantage of the algorithm.

Example

Consider the same example given above and perform the decompression. Each byte is read one by one as a hexadecimal code, and the bytes are combined in groups of three so as to convert them from 12-bit format back to 8-bit character (ASCII) format. Thus the bytes 04, 10 & 54 are combined as 041054, and the combined code is split to get A (041) and T (054). The table is also built concurrently as each new string is read. When we read codes such as 100h, 102h, etc., we refer to the relevant entry in the table and output the corresponding string to the file. For example, when we reach the 4th set of characters and read the bytes 04, 31 and 00, they are converted to 12-bit form as 043 and 100, which refer to entries in the table and output the strings C and AT respectively. Thus we can recover all the characters without knowing the previous Code table.

III. A NEW MODEL FOR COMPRESSING DATA

LZW normally works on the input data directly to produce compressed output. In the proposed model, LZW works on block-sorted data to produce a more compressed output. The proposed model is given in Figure 2.

  (a) Compression:    Input data -> Block sorting -> LZW compression -> Compressed data
  (b) Decompression:  Compressed data -> LZW decompression -> Reverse block sorting -> Source data

Figure 2: Proposed model (a) Compression (b) Decompression.

A. Proposed Compression Technique

The proposed compression algorithm consists of two phases:
1. Block sorting.
2. LZW coding.

Block Sorting

Block sorting is performed by adopting the following steps, as proposed in [3]:
i. Write the entire input as the first row of a matrix, one symbol per column.
ii. Form all cyclic permutations of that row and write all the permutations as the other rows of the matrix.
iii. Sort the matrix rows according to the lexicographical order of the elements of the rows.
iv. Take as output the final column of the sorted matrix, together with the number of the row which corresponds to the original input.
LZW Coding [5]

i. Specify the number of bits to which the 8-bit codes are extended.
ii. Read the first character from the file and store it in ch.
iii. Repeat steps (iv) to (vii) till there is no character left in the file.
iv. Read the next character and store it in ch2.
v. If ch+ch2 is in the table, set ch = ch+ch2.
vi. Otherwise, output the code for ch, add ch+ch2 to the table, and set ch = ch2.
vii. Store each output code in the output file in the specified number of bits.
viii. Output the code for the last string ch.
ix. Exit.

B. Proposed Decompression Technique

The proposed decompression algorithm consists of two phases:
1. LZW decoding.
2. Reverse block sorting.

LZW Decoding [5]

i. Read the first code l.
ii. Convert l to its original character form.
iii. Output it.
iv. Repeat steps (v) to (x) till there is no code left in the file.
v. Read the next code z.
vi. If z is in the code table, take entry = table[z]; otherwise take entry = l plus the first character of l.
vii. Output entry in character form.
viii. Add l plus the first character of entry to the code table as a new string.
ix. Set l = entry.
x. Continue with the next code.
xi. Exit.

IV. EXPERIMENTAL RESULTS

We have taken files from the CORPUS for the experiments. The experimental results are given in Table I and Table II.

Table I: Comparison of LZW compression and the proposed model for files from the CORPUS

File Name       Original Size   Using LZW             Using the Proposed
                (bytes)         Compression (bytes)   Model (bytes)
Bib             1,11,261        46,864                36,804
book1           7,68,771        3,54,912              2,94,991
book2           6,10,856        3,04,800              2,34,050
News            3,77,109        1,87,456              1,47,356
paper1          53,161          19,980                17,198
paper2          82,199          32,567                27,467
Alice29.txt     1,55,648        59,020                51,102
plrabn12.txt    4,81,861        2,82,049              1,82,104
Chntxx.seq      1,55,844        63,625                39,125
Chmpxx.seq      11,67,360       3,92,905              2,92,303
Mj              4,48,779        2,45,927              1,45,967
Sc              29,00,352       13,63,336             11,63,361
Table II: Comparison of LZW and the proposed model for compression and decompression time

File Name       Original Size   Compress time (ms)      Decompress time (ms)
                (bytes)         LZW       Proposed      LZW       Proposed
Bib             51,202          874       885           351       363
book1           39,821          547       554           194       201
book2           8,714           102       102           35        35
News            65,178          769       775           263       309
paper1          1,37,566        1912      1954          671       688
paper2          1,64,597        2550      2652          994       1063
Alice29.txt     1,55,648        2344      2420          974       1005
plrabn12.txt    4,81,861        4598      4680          1099      1105
Chntxx.seq      1,55,844        2176      2199          797       823
Chmpxx.seq      11,67,360       19231     20102         7067      7101
Mj              4,48,779        7681      7764          2501      2522
Sc              29,00,352       28678     35304         9952      10211
bible.txt       40,47,392       31453     37003         12601     12995
E.Coli          46,38,690       33211     38906         13210     13403
Total           1,59,99,283     136126    155300        49735     51824

V. CONCLUSION

The proposed model modifies LZW compression [1, 2] by first performing block sorting [3] on the input data, and then performing all the operations as in [2]. The results show that the method achieves a better compression ratio at the cost of an increase in compression and decompression time. This increase in compression and decompression time is negligible considering the gain in compression ratio.

REFERENCES

[1] Welch, T. A., A technique for high-performance data compression, Computer, Vol. 17, pp. 8-19, June 1984.
[2] Jacob Ziv and Abraham Lempel, Compression of Individual Sequences via Variable-Rate Coding, IEEE Transactions on Information Theory, September 1978.
[3] Burrows, M. and Wheeler, D. J. (1994), A Block-sorting Lossless Data Compression Algorithm, SRC Research Report 124, Digital Systems Research Center, Palo Alto, CA; gatekeeper.dec.com, /pub/DEC/SRC/research-reports/SRC-124.ps.Z.
[4] http://marknelson.us/1989/10/01/lzw-data-compression
[5] http://en.wikipedia.org/wiki/LZW#References
