
DATA COMPRESSION

CERTIFICATE
This is to certify that Mr. Kanishka S Joshi has completed the necessary seminar work and prepared the bona fide report on

Data Compression
In a satisfactory manner as partial fulfillment for requirement of the degree of B.E. (Computer) of University of Pune in the Academic year 2002-2003

Date : 5th Aug 2002 Place : Pune

Prof. G.P. Potdar Seminar Coordinator

Prof. Dr. C V K Rao H.O.D., Internal Guide

DEPARTMENT OF COMPUTER ENGINEERING PUNE INSTITUTE OF COMPUTER TECHNOLOGY PUNE - 43

INDEX

Chapter 1  Introduction to Information Theory & Data Compression
  1.1 Introduction
  1.2 Simple methods showing data compression
  1.3 The concept of Information
  1.4 The concept of Entropy
  1.5 Definition of Information Rate
  1.6 Compression Methodology

Chapter 2  Huffman Encoding
  2.1 Good codes
  2.2 The Huffman Coding algorithm (lossless)
  2.3 Design of Algorithm
  2.4 Probing further into Huffman coding
    2.4.1 Minimum variance Huffman codes
    2.4.2 Length of Huffman Codes
    2.4.3 Extended Huffman Codes
    2.4.4 Non-binary Huffman Codes
    2.4.5 Adaptive Huffman Coding
  2.5 Huffman Coding theory applications to lossless text compression

Chapter 3  Arithmetic Coding
  3.1 Motivation
  3.2 Coding a sequence
  3.3 Decoding Scheme
  3.4 Advantages of Arithmetic Coding over Huffman Coding
  3.5 Applications
  3.6 Summary

Chapter 4  Dictionary Techniques
  4.1 Introduction
  4.2 Static Dictionary
  4.3 Adaptive Dictionary
    4.3.1.1 The LZ77 approach
    4.3.1.2 Variations on the LZ77 scheme
    4.3.2 The LZ78 coding scheme
    4.3.3.1 The LZW algorithm
    4.3.3.2 An exception
    4.3.3.3 Applications of LZW
  4.4 Summary

Chapter 5  Lossy Compression
  5.1 Distortion Criterion
  5.2 Scope of Lossy Algorithms
  5.3 Lossy Text Compression
    5.3.1 Lossy Text compression background and motivation
    5.3.2 Word by word semantic Compression
    5.3.3 Generative Compression in the Style of Hemingway
    5.3.4 Synthesizing the semantic and Generative approaches
    5.3.5 Conclusions

References and Further Reading

Chapter 1. Introduction to Information Theory & Data Compression


1.1 Introduction
It is very interesting that data can be compressed, because it implies that the information we normally pass on to each other can be conveyed in fewer information units. Instead of saying "yes", a simple nod can do the same work while transmitting less data. In his 1948 paper, "A Mathematical Theory of Communication", Shannon established that there is a fundamental limit to lossless data compression. This limit, called the entropy rate, is denoted by H. The exact value of H depends on the information source, more specifically on the statistical nature of the source. It is possible to compress the source, in a lossless manner, with a compression rate close to H; it is mathematically impossible to do better than H.

Fig 1.1.1 Claude E Shannon

Shannon also developed the theory of lossy data compression, better known as rate-distortion theory. In lossy data compression, the decompressed data does not have to be exactly the same as the original data; instead, some amount of distortion, D, is tolerated. Shannon showed that, for a given source (with all its statistical properties known) and a given distortion measure, there is a function, R(D), called the rate-distortion function. The theory says that if D is the tolerable amount of distortion, then R(D) is the best possible compression rate. When the compression is lossless (i.e., no distortion, or D = 0), the best possible compression rate is R(0) = H (for a finite alphabet source). In other words, the best possible lossless compression rate is the entropy rate. In this sense, rate-distortion theory is a generalization of lossless data compression theory: we go from no distortion (D = 0) to some distortion (D > 0).

Lossless data compression theory and rate-distortion theory are known collectively as source coding theory. Source coding theory sets fundamental limits on the performance of all data compression algorithms. The theory, in itself, does not specify exactly how to design and implement these algorithms. It does, however, provide some hints and guidelines on how to achieve optimal performance.

1.2 Simple methods showing compression
A. We have already seen that the theory underlying data compression is that of encoding. Let us consider some straightforward methods of achieving compression. Consider a set of numbers that are to be transmitted:

12 13 14 15 16 17 18

Clearly, if these numbers were encoded directly, it would take at least 5 bits to encode each number, so transmitting all seven of them would take at least 35 bits. A better and cleverer idea is to represent the distribution by the generating function

x(n) = n + 12,  for n = 0, 1, ..., 6

Thus, a better idea would be to send the number 12 first (as the base) and then send the numbers as deviations from this base. The base needs 4 bits and each of the seven deviations (0 through 6) needs 3 bits, so the scheme requires 4 + (3 * 7) = 25 bits. Clearly the data has been compressed.
B. Another interesting method is no method at all. If the receiver already knows exactly what data is going to be transmitted, no transmission is required, and a compression of 100% is achieved. This is rather far-fetched and will almost never be the case, except in a world where people keep telling each other things they already know. The principle that emerges from this discussion is: the more we know about the type of data that is going to be transmitted, the more compression we can achieve. In terms of entropy, the entropy (the measure of chaos, or uncertainty) reduces as knowledge increases, a notion that relates to other sciences as well.
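The arithmetic behind this little scheme is easy to check mechanically. The following sketch is not part of the original report; the variable and function names are our own. It compares the raw fixed-length cost with the base-plus-deviation cost:

# Sketch: compare raw fixed-length coding with the base + deviation scheme.
numbers = [12, 13, 14, 15, 16, 17, 18]

def bits_needed(value):
    # Bits of a plain fixed-length binary representation of `value`.
    return max(1, value.bit_length())

raw_cost = len(numbers) * bits_needed(max(numbers))      # 7 * 5 = 35 bits

base = min(numbers)
deviations = [n - base for n in numbers]                 # 0, 1, ..., 6
scheme_cost = bits_needed(base) + len(deviations) * bits_needed(max(deviations))

print(raw_cost, scheme_cost)                             # 35 25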

1.3 The Concept of Information
Information quantifies the content of an event. Intuitively, we understand that an event that occurs with very high probability gives us hardly any new information. For example, someone telling you that he is breathing does not tell you much, and therefore carries very little information; but someone telling you that the earth's moon has exploded carries a lot of information. Why? Because it has a very low probability of occurrence. We can therefore define the information of an event a as:
i(a) = log_x (1 / P(a)) = - log_x P(a)


Note that the logarithmic scale is used for the following reasons [1]:
1. It is practically more useful. Parameters of engineering importance, such as time, bandwidth and number of relays, tend to vary linearly with the logarithm of the number of possibilities. For example, adding one relay to a group doubles the number of possible states of the relays, and adds 1 to the base-2 logarithm of this number. Doubling the time roughly squares the number of possible messages, i.e. doubles the logarithm, and so on.
2. It is nearer to our intuitive feeling as to the proper measure. This is closely related to (1), since we intuitively measure entities by linear comparison with common standards. One feels, for example, that two punched cards should have twice the capacity of one for information storage, and two identical channels twice the capacity of one for transmitting information.
3. It is mathematically more suitable. Many of the limiting operations are simple in terms of the logarithm but would require clumsy restatement in terms of the number of possibilities.

The choice of the logarithmic base x corresponds to the choice of a unit for measuring information. If base 2 is used, the resulting units may be called binary digits, or more briefly bits, a word suggested by J. W. Tukey. A device with two stable positions, such as a relay or a flip-flop circuit, can store one bit of information. N such devices can store N bits, since the total number of possible states is 2^N and log2 2^N = N. If base 10 is used, the units may be called decimal digits. Since log2 M = 3.32 log10 M, a decimal digit is about 3.32 bits. This is why a logarithmic scale is preferred over a simple linear scale.

1.4 The Concept of Entropy (H)
1.4.1 Relating Entropy in Thermodynamics to Entropy in Information Theory
Entropy literally means the measure of chaos in a system. Today the word entropy is as much a part of the language of the physical sciences as it is of the human sciences. Unfortunately, physicists, engineers, and sociologists use indiscriminately a number of terms that they take to be synonymous with entropy, such as disorder, probability, noise, random mixture, heat; or they use terms they consider synonymous with anti-entropy, such as information, negentropy, complexity, organization, order, improbability. There are at least three ways of defining entropy:

1. in terms of thermodynamics (the science of heat), where the names of Mayer, Joule, Carnot, and Clausius (1865) are important;
2. in terms of statistical theory, which establishes the equivalence of entropy and disorder, as a result of the work of Maxwell, Gibbs, and Boltzmann (1875); and
3. in terms of information theory, which demonstrates the equivalence of negentropy (the opposite of entropy) and information, as a result of the work of Szilard, Gabor, Rothstein, and Brillouin (1940-1950).

Negentropy is a non-recommendable but near synonym for information. The term has created considerable confusion by suggesting that information processes negate the second law of thermodynamics (in a closed system the entropy always increases, creating chaos from order) by producing order from chaos. The history of the confusion stems from the merely formal analogy between Boltzmann's thermodynamic expression for entropy, S = k log W, and the Shannon-Wiener expression for information, H = - log_x P(a). The only motivation for the negative sign in the latter is that it yields positive information quantities (the logarithm of a probability is always negative). The probability P(a) of an event a and the thermodynamic quantity W, together with Boltzmann's constant k, measure entirely different phenomena. A meaningful interpretation of negentropy is that it measures the complexity of a physical structure in which quantities of energy are invested, e.g., buildings, technical devices, organisms, but also atomic reactor fuel or the infrastructure of a society. In this sense organisms may be said to become more complex by feeding not on energy but on negentropy (Schroedinger).

1.4.2 Definition of Entropy (H)

Entropy is the average information content of the source S. Let the messages be m1, m2, m3, ..., mM and their probabilities be p1, p2, p3, ..., pM. Suppose a sequence of L messages is transmitted. Then, if L is very large, we may say that:

(p1 * L) messages of m1 have been transferred,
(p2 * L) messages of m2 have been transferred,
(p3 * L) messages of m3 have been transferred,
...
(pM * L) messages of mM have been transferred.

Hence the information due to a single message m1 will be

I1 = log2 (1 / p1)

Since there are p1 * L messages m1, the total information due to all of the m1s will be

I1(total) = p1 * L * log2 (1 / p1)

Similarly, the total information carried by the m2s will be

I2(total) = p2 * L * log2 (1 / p2)

and so on. Thus, the total information carried by the whole sequence of L messages will be

I(total) = p1 * L * log2 (1 / p1) + p2 * L * log2 (1 / p2) + ... + pM * L * log2 (1 / pM)

and so the average information per message in the L transmitted messages, i.e. the entropy (average information) H, is

H = I(total) / L = sum over all i of P(Ai) * i(Ai) = - sum over all i of P(Ai) * log_x P(Ai)

With this revolutionary idea, Shannon showed that there is a limit to how compactly symbols can be encoded. As discussed earlier, knowing something about the data reduces the entropy. Entropy is measured in bits per symbol (when the base of the logarithm is 2, i.e. binary).
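The definition above is easy to compute directly. The following sketch is our own helper (not code from the report), using base-2 logarithms:

import math

def entropy(probabilities, base=2):
    # H = - sum of p * log(p); measured in bits/symbol when base is 2.
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

# The source used later in the Huffman example: P = {0.4, 0.2, 0.2, 0.1, 0.1}
print(round(entropy([0.4, 0.2, 0.2, 0.1, 0.1]), 3))   # 2.122 bits/symbol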

1.5 Definition of Information Rate
The information rate is simply the number of bits generated per second:

R = r * H

that is, information rate = (symbols/second) * (bits/symbol) = bits/second.

1.6 Compression methodology
Data compression is a two-step process:

1. Model the data
2. Code the data

Modeling the data means using probabilistic models that capture as much of its structure as possible, so that the remaining entropy is as low as possible, leading to better and more efficient compression algorithms. Models will be covered later in the discussion.
_______________________________________________________________________

Chapter 2 Huffman encoding.


We will now discuss a very popular method of generating efficient codes. We first present a method to build a Huffman code when the probability model for the source is known, and then discuss a procedure that builds the code when the probability model is unknown.

Huffman encoding is still fairly cheap to decode, cycle-wise, but it requires a table lookup, so it cannot be quite as cheap as RLE. The encoding side of Huffman is fairly expensive, though: the whole data set has to be scanned and a frequency table built up. In some cases a "shortcut" is appropriate with Huffman coding. Standard Huffman coding applies to a particular data set being encoded, with the set-specific symbol table prepended to the output data stream. However, if not just the single data set but the whole type of data being encoded has the same regularities, we can opt for a global Huffman table. If we have such a global Huffman table, we can hard-code the lookups into our executables, which makes both compression and decompression quite a bit cheaper (except for the initial global sampling and hard-coding). For example, if we know our data set will be English-language prose, letter-frequency tables are well known and quite consistent across data sets.

2.1 Good codes
As we already know by now, the higher the frequency (probability) of occurrence of a particular character, the shorter its codeword should be. However, a small average code length is not the only thing that is necessary. For example, consider the following set of letters and probabilities:

P(a1) = 0.5, P(a2) = 0.25, P(a3) = P(a4) = 0.125

H(S) = 1.75 bits/symbol

Letter           Code 1   Code 2   Code 3   Code 4
a1               0        0        0        0
a2               0        1        10       01
a3               1        00       110      011
a4               10       11       111      0111
Average length   1.125    1.25     1.75     1.875

The average length l of each code is defined as:

l = sum from i=1 to 4 of P(ai) * n(ai)

where n(ai) is the number of bits in the codeword for the letter ai. (The formula has this form because the average is simply a probability-weighted sum of the codeword lengths.) Here, code 1 has the least average length; however, for a code to be useful it must also be unambiguous, and clearly code 1 is ambiguous. Code 2 looks unambiguous at first glance, but on closer inspection it is in fact ambiguous as well, because the string 11 can stand either for a4 or for a2 a2 (and 00 either for a3 or for a1 a1). Code 3 is a different kind of code, called a prefix code, because no codeword forms a prefix of any other codeword. As a result, this is exactly the kind of code we are looking for, and the Huffman encoding scheme must generate codes of this kind.
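Both properties discussed here, the average length and the prefix condition, can be checked mechanically. A small sketch (the helper names are ours, not the report's) applied to the four candidate codes from the table:

def average_length(probs, code):
    # Probability-weighted sum of codeword lengths.
    return sum(p * len(w) for p, w in zip(probs, code))

def is_prefix_free(code):
    # No codeword may be a prefix of another codeword.
    return not any(a != b and b.startswith(a) for a in code for b in code)

probs = [0.5, 0.25, 0.125, 0.125]
codes = {"code1": ["0", "0", "1", "10"],
         "code2": ["0", "1", "00", "11"],
         "code3": ["0", "10", "110", "111"],
         "code4": ["0", "01", "011", "0111"]}

for name, code in codes.items():
    print(name, average_length(probs, code), is_prefix_free(code))
# Only code 3 is prefix free; its average length equals the entropy, 1.75 bits/symbol.
_______________________________________________________________________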

2.2 The Huffman Coding Algorithm (Lossless)


This technique was developed by David Huffman as a class assignment; the class was the first ever in the area of information theory, taught by Robert Fano at MIT. The codes generated using this technique are called Huffman codes. They are prefix codes and are optimum for a given probability distribution; the technique guarantees codes whose average length is within 1 bit of the entropy (we state the bounds later). The Huffman procedure is based on the following observations about prefix codes:
1. In an optimum code, symbols that occur more frequently have shorter codewords.
2. In an optimum code, the two symbols that occur least frequently have codewords of the same length.
The first statement is obvious; the second less so. Suppose C is an optimal code in which the two least frequently occurring symbols do not have codewords of the same length, differing by, say, k bits. Because this is a prefix code, the shorter codeword cannot be a prefix of the longer one. Since these codewords correspond to the two least probable symbols, no other codeword can be longer than these (by observation 1), so there is no danger that dropping the last k bits of the longer codeword would make some other codeword a prefix of it. But after dropping these k bits, C has a shorter average length, which violates our assumption that C is optimal. Hence k = 0, and the two least frequently occurring symbols have codewords of the same length.

2.3 Design of the algorithm


Let us design a Huffman code for a source that puts out letters from the alphabet A = {a1, a2, a3, a4, a5}, with P(a1) = P(a3) = 0.2, P(a2) = 0.4 and P(a4) = P(a5) = 0.1. Therefore H(A) = 2.122 bits/symbol.
(Figure: Huffman tree for this source, built by repeatedly merging the two lowest probabilities: 0.1 + 0.1 = 0.2, 0.2 + 0.2 = 0.4, 0.4 + 0.2 = 0.6, 0.6 + 0.4 = 1.0; the two branches of each merge are labelled 0 and 1.)

The method for generating the tree is as follows: at every step the probabilities are sorted in descending order and the two lowest are combined. From the tree diagram above we obtain the following prefix code:

Letter   Probability   Codeword
a2       0.4           1
a1       0.2           01
a3       0.2           000
a4       0.1           0010
a5       0.1           0011

The average length of this code is:

l = 0.4*1 + 0.2*2 + 0.2*3 + 0.1*4 + 0.1*4 = 2.2 bits/symbol

Thus the redundancy of the code is l - H = 0.078 bits/symbol. It can be shown that the redundancy is zero when the probabilities are negative powers of two. There is one other way in which the Huffman code can be generated, but we will not consider it here; the output of both methods is the same.
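A compact way to carry out the procedure above is with a priority queue. The following sketch is not the report's code (the names are ours); it builds a Huffman code for this source. The exact codewords may differ from the table, since ties can be broken either way, but the codeword lengths, and hence the 2.2 bits/symbol average, are the same:

import heapq

def huffman_code(probabilities):
    # probabilities: dict symbol -> probability. Returns dict symbol -> codeword.
    # Each heap item: (probability, tie_breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)   # lowest probability
        p2, _, code2 = heapq.heappop(heap)   # second lowest
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}
code = huffman_code(probs)
avg = sum(probs[s] * len(code[s]) for s in probs)
print(code, avg)                             # average length 2.2 bits/symbol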

2.4 Probing further into Huffman coding


2.4.1 Minimum variance Huffman Codes
As we dig deeper, the subject becomes more and more fascinating. At this point one may ask: is that the best we can do with the Huffman algorithm? Interestingly, the answer is no; with some very small adjustments we can still improve the code in another respect (not its average length, but its variance). It just goes to show how important small things can be! By performing the sorting procedure in a slightly different manner we obtain a different Huffman code: in the first re-sort we put the newly formed probability (0.2 = 0.1 + 0.1 here) as high as possible in the list, instead of as low as possible as in the example above. The Huffman tree then becomes:

(Figure: minimum variance Huffman tree for the same source, obtained by placing the combined node as high as possible in the sorted list at each step; the successive merges give nodes of weight 0.2, 0.4, 0.6 and 1.0.)

Thus the newly created Huffman tree generates the following prefix code:

Letter   Probability   Codeword
a1       0.2           10
a2       0.4           00
a3       0.2           11
a4       0.1           010
a5       0.1           011

The average length of the above code is:

l = 0.4*2 + 0.2*2 + 0.2*2 + 0.1*3 + 0.1*3 = 2.2 bits/symbol

Thus the redundancy is again 0.078 bits/symbol, the same as for the first method. However, the above code has an interesting property: its variance of codeword lengths is minimal. Observe the codes: the spread between the longest and shortest codeword is 3 - 2 = 1 for the second method, whereas it was 4 - 1 = 3 for the earlier method. What does this imply? It means that with the second code the output bit rate fluctuates less, so a smaller transmission buffer suffices and less of it is wasted when short codewords are being sent. It can be proved that, to obtain the Huffman code with minimum variance, we should always put the combined letter as high as possible in the list.

2.4.2 Length of the Huffman Codes
As already stated above, the Huffman coding procedure generates an optimal code, but we have not yet said how long an optimal code is. The optimal (average) code length l of the Huffman code for a source S is bounded as follows:

H(S) <= l < H(S) + 1

Because of lack of space, we shall not consider the proof here; the interested reader can refer to [1].

2.4.3 Extended Huffman Codes
In applications where the alphabet size is large, pmax (the largest symbol probability) is generally quite small and the probability model is not very skewed. However, when pmax is large and the probability model is highly skewed, the Huffman code can become quite inefficient relative to the entropy (although it still obeys the bounds above). In such cases, grouping characters together helps to improve the average length.

Example 2.4.3.1 Consider a source that puts out independent and identically distributed (iid) letters from the alphabet A = {a1, a2, a3} with the probability model P(a1) = 0.8, P(a2) = 0.02, P(a3) = 0.18. The entropy of this source comes out to be 0.816 bits/symbol. Constructing a Huffman code for this source, we get:

Letter   Probability   Codeword
a1       0.8           0
a2       0.02          11
a3       0.18          10

The average length of this code comes out to be 1.2 bits/symbol, so the redundancy is 0.384 bits/symbol, which is about 47% of the entropy. The Huffman code has thus become quite inefficient, and we must look for a better approach. We can sometimes reduce the coding rate by blocking (combining) more than one symbol together. Consider a source S that emits letters from the alphabet A = {a1, a2, ..., am}. The entropy of this source is

H(S) = - sum from i=1 to m of P(ai) * log2 P(ai)

We already know that the bounds for the Huffman code are

H(S) <= R < H(S) + 1    ----------(1)

Suppose now that we encode the sequence by generating one codeword for every n symbols. The total number of possible n-symbol blocks is m * m * ... * m (n times) = m^n, so there will be m^n codewords in the Huffman code. We generate this code by viewing the m^n blocks as the letters of an extended alphabet

A^(n) = { a1 a1 ... a1 (n times), ..., all possible combinations, ..., am am ... am }

which is why this method is called extended Huffman coding. Let us denote the rate for coding n-symbol blocks of the source S by R^(n) (do not read this as R raised to the power n; it is merely notation). Then, from (1) above,

H(S^(n)) <= R^(n) < H(S^(n)) + 1

where R^(n) is the number of bits required to code n symbols. The number of bits required per symbol, R, is therefore R = (1/n) * R^(n), and can be bounded as

H(S^(n)) / n <= R < H(S^(n)) / n + 1/n

To express H(S^(n)) in terms of H(S) we work through the definition of entropy (see [1], pp. 38-39, for a proper proof), which finally gives

H(S^(n)) = n * H(S)
and thus we can write

H(S) <= R < H(S) + 1/n

which tells us that blocking symbols together takes the rate closer to the entropy. For example 2.4.3.1 above, combining two symbols gives us the following:

Letter   Probability            Codeword
a1a1     0.8 * 0.8 = 0.64       0
a1a2     0.8 * 0.02 = 0.016     10101
a1a3     0.144                  11
a2a1     0.016                  101000
a2a2     0.0004                 10100101
a2a3     0.0036                 1010011
a3a1     0.1440                 100
a3a2     0.0036                 10100100
a3a3     0.0324                 1011

The average codeword length for this extended code is 1.7228 bits/symbol; however, each extended symbol corresponds to two of the original symbols, so in terms of the original alphabet the average codeword length is 0.8614 bits/symbol, with a redundancy of about 0.046 bits/symbol, only around 5-6% of the entropy. We thus reach the conclusion that extending the alphabet can be very helpful in reducing the redundancy and producing more efficient codes. However, as the alphabet size increases, the extended alphabet grows exponentially: blocking (combining) z characters gives exactly m^z extended symbols. Under these circumstances the Huffman scheme becomes far too impractical, and other schemes need to be sought; one such scheme is arithmetic coding, to be discussed later.

2.4.4 Nonbinary Huffman Codes
Non-binary Huffman codes are a direct extension of the binary case: instead of combining the lowest two probabilities, the lowest m probabilities are combined at each step and the codes are designed accordingly. We will not consider an example here, as the extension is straightforward.

2.4.5 Adaptive Huffman Coding
In all the above cases we assumed that the probability model is available. In most practical cases it is not. Huffman coding then becomes a two-pass process, with the first pass collecting the statistics and the second pass creating the code. Instead, Faller

and Gallagher independently developed algorithms that do this in a single pass; these algorithms form the adaptive Huffman coding schemes. In this procedure, neither transmitter nor receiver knows anything about the statistics of the sequence before the first symbol is transmitted. Symbols that have not yet been transmitted are represented by a special NYT (not yet transmitted) node, which has a weight of zero. A few other points deserve attention:
1. All external nodes (shown as squares), i.e. leaves, carry a weight equal to the number of times their symbol has appeared.
2. All internal nodes (shown as circles) carry a weight equal to the sum of the weights of their two children.
3. Each node is associated with a weight and a node number.

Fig 2.4.5.1 The Adaptive Huffman tree for the sequence (aardv)

The adaptive process for encoding the symbols aardv is shown in the figure. The figure is more or less self-explanatory, so we will only give the simple procedures needed for the various steps of adaptive Huffman coding.

The update procedure is summarized by the following flowchart (reproduced here in words):

Start. If this is the first appearance of the symbol, the NYT node gives birth to a new NYT node and a new external node for the symbol; the weights of the new external node and of the old NYT node are incremented, and the procedure continues from the old NYT node. Otherwise, the procedure starts from the symbol's external node. At each node on the way up, if the node does not have the maximum node number in its block it is switched with the highest-numbered node in the block; the node weight is then incremented. If the node is the root, the procedure stops; otherwise it moves to the parent node and repeats.

(Update procedure for the adaptive Huffman algorithm.)

Assume that we are encoding the sequence [aardvark], where our alphabet consists of the 26 characters of the English language. We will not go into further detail here; the procedure is exactly as outlined in the above flowchart. Note that an update can, in general, produce more than one valid tree; the adaptive process must choose one of them, for instance according to the minimum variance requirement or some other variation of the Huffman code.

The encoding procedure is summarized by the following flowchart (in words):

Start and read in a symbol. If this is the first appearance of the symbol, send the code for the NYT node followed by the symbol's index in the NYT list; otherwise the code is the path from the root node to the corresponding external node. Call the update procedure. If this was not the last symbol, read the next symbol and repeat; otherwise stop.

(Encoding procedure.)

The decoding procedure is summarized by the following flowchart (in words):

Start at the root of the tree and read bits, moving to the corresponding child node after each bit, until an external node is reached. If the node is not the NYT node, the element corresponding to that node is decoded. If it is the NYT node, e further bits are read to obtain a number p; if p is less than r one more bit is read, otherwise p is adjusted upward, and the (p+1)th element in the NYT list is decoded. The update procedure is then called and, if this was not the last bit, decoding continues from the root.

(Flowchart for the decoding procedure.)

Thus, the above flowchart shows the decoding procedure.

2.5 Huffman Coding theory applied to Lossless Text Compression
Text compression is a very natural application of Huffman encoding: in text we have a discrete alphabet whose probabilities, within a given class of text, are relatively stationary. Using the probability distribution of the characters of the English language, we can therefore generate a Huffman tree and achieve reasonable, lossless compression.

(See the letter-frequency map of the American Constitution, page 55.)

Chapter 3. Arithmetic Coding (Lossless)

3.1 Motivation
This is yet another very popular method of generating variable length codes. The most obvious coding schemes are not always the ones found first, and arithmetic coding is a case in point; it is elegant, and it also reminds us that a scheme that looks good in general, like Huffman coding above, is not necessarily the best choice in every case. Arithmetic coding is especially useful when dealing with small alphabets with highly skewed probabilities. It is also very useful when, for various reasons, the modeling and coding aspects are to be kept separate. We first consider an example where Huffman coding does not yield satisfactory results even after blocking two symbols together. Consider the probability model shown below; the entropy of this source is 0.335 bits/symbol.

Letter   Probability   Codeword
a1       0.95          0
a2       0.02          11
a3       0.03          10

The average length of this code is 1.05 bits/symbol, so the redundancy is 0.715 bits/symbol, which is 213% of the entropy. That means that to code this sequence we would need more than twice the number of bits promised by the entropy. So, we consider blocking two symbols together, as in the previous example, which gives the following:

Letter   Probability              Codeword
a1a1     0.95 * 0.95 = 0.9025     0
a1a2     0.95 * 0.02 = 0.019      111
a1a3     0.0285                   100
a2a1     0.0190                   1101
a2a2     0.0004                   110011
a2a3     0.0006                   110001
a3a1     0.0285                   101
a3a2     0.0006                   110010
a3a3     0.0009                   110000

The average length for the extended alphabet is 1.222 bits/symbol, which in terms of the original alphabet is 0.611 bits/symbol. The redundancy is therefore still about 0.28 bits/symbol, a large fraction of the entropy. It is observed that the redundancy drops to acceptable levels only when blocks of 8 characters are used. As discussed earlier, as the alphabet size increases, the blocked alphabet grows exponentially, making the Huffman coding procedure very inefficient. This is where arithmetic coding comes to our help. In arithmetic coding, a unique identifier or tag is generated for the sequence to be encoded. This tag corresponds to a binary fraction, which becomes the binary code for the sequence. In practice the generation of the tag and the generation of the binary code are part of the

same process. Arithmetic coding is, however, easier to understand if we divide the process into two phases:
1. Phase 1: a unique identifier or tag is generated for the given sequence of symbols.
2. Phase 2: the tag is given a unique binary code.

It is interesting to observe that, unlike Huffman coding, where we needed to find codewords for all the symbols of the alphabet, in arithmetic coding a unique code can be generated for a sequence of length m without generating codewords for every possible sequence.

3.2 Coding a sequence
It has only been in the last ten years that a respectable candidate to replace Huffman coding has been successfully demonstrated: arithmetic coding. Arithmetic coding completely bypasses the idea of replacing an input symbol with a specific code. Instead, it takes a stream of input symbols and replaces it with a single floating-point output number. The longer (and more complex) the message, the more bits are needed in the output number. It was not until recently that practical methods were found to implement this on computers with fixed-size registers.

The output from an arithmetic coding process is a single number less than 1 and greater than or equal to 0. This single number can be uniquely decoded to create the exact stream of symbols that went into its construction. In order to construct the output number, the symbols being encoded have to have a set of probabilities assigned to them. For example, if we were to encode the random message "BILL GATES", we would have a probability distribution that looks like this:

Character   Probability
---------   -----------
SPACE       1/10
A           1/10
B           1/10
E           1/10
G           1/10
I           1/10
L           2/10
S           1/10
T           1/10

Once the character probabilities are known, the individual symbols need to be assigned a range along a "probability line", which is nominally 0 to 1. It doesn't matter which characters are assigned which segment of the range, as long as it is done in the same manner by both the encoder and the decoder. The nine-character symbol set used here would look like this:

Character   Probability   Range
---------   -----------   -----------
SPACE       1/10          0.00 - 0.10
A           1/10          0.10 - 0.20
B           1/10          0.20 - 0.30
E           1/10          0.30 - 0.40
G           1/10          0.40 - 0.50
I           1/10          0.50 - 0.60
L           2/10          0.60 - 0.80
S           1/10          0.80 - 0.90
T           1/10          0.90 - 1.00

Each character is assigned the portion of the 0-1 range that corresponds to its probability of appearance. Note also that the character "owns" everything up to, but not including, the higher number; so the letter 'T' in fact has the range 0.90 - 0.9999...

The most significant portion of an arithmetic coded message belongs to the first symbol to be encoded. When encoding the message "BILL GATES", the first symbol is "B". In order for the first character to be decoded properly, the final coded message has to be a number greater than or equal to 0.20 and less than 0.30. What we do to encode this number is keep track of the range that this number could fall in. So after the first character is encoded, the low end for this range is 0.20 and the high end of the range is 0.30.

After the first character is encoded, we know that our range for our output number is now bounded by the low number and the high number. What happens during the rest of the encoding process is that each new symbol to be encoded will further restrict the possible range of the output number. The next character to be encoded, 'I', owns the range 0.50 through 0.60. If it were the first character in our message, we would set our low and high range values directly to those values. But 'I' is the second character, so what we do instead is say that 'I' owns the range that corresponds to 0.50-0.60 within the new subrange 0.2 - 0.3. This means that the new encoded number will have to fall somewhere in the 50th to 60th percentile of the currently established range. Applying this logic will further restrict our number to the range 0.25 to 0.26.

The algorithm to accomplish this for a message of any length is is shown below:

Set low to 0.0
Set high to 1.0
While there are still input symbols do
    get an input symbol
    range = high - low
    high = low + range * high_range(symbol)
    low = low + range * low_range(symbol)
End of While
output low

Following this process through to its natural conclusion with our chosen message looks like this:

New Character   Low value      High value
-------------   -----------    -----------
                0.0            1.0
B               0.2            0.3
I               0.25           0.26
L               0.256          0.258
L               0.2572         0.2576
SPACE           0.25720        0.25724
G               0.257216       0.257220
A               0.2572164      0.2572168
T               0.25721676     0.2572168
E               0.257216772    0.257216776
S               0.2572167752   0.2572167756

So the final low value, 0.2572167752, will uniquely encode the message "BILL GATES" using our present encoding scheme.

3.3 Decoding Scheme
Given this encoding scheme, it is relatively easy to see how the decoding process will operate. We find the first symbol in the message by seeing which symbol owns the code space that our encoded message falls in. Since the number 0.2572167752 falls between 0.2 and 0.3, we know that the first character must be "B". We then need to remove the "B" from the encoded number. Since we know the low and high ranges of B, we can remove their effects by reversing the process that put them in. First, we subtract the low value of B from the number, giving 0.0572167752. Then we divide by the range of B, which is 0.1. This gives a value of 0.572167752. We can then see which range that value lands in, which is the range of the next letter, "I".

The algorithm for decoding the incoming number looks like this:

get encoded number
Do
    find symbol whose range straddles the encoded number
    output the symbol
    range = symbol high value - symbol low value
    subtract symbol low value from encoded number

    divide encoded number by range
until no more symbols

Note that we have conveniently ignored the problem of how to decide when there are no more symbols left to decode. This can be handled either by encoding a special EOF symbol, or by carrying the stream length along with the encoded message. The decoding algorithm for the "BILL GATES" message will proceed something like this:

Encoded Number   Output Symbol   Low    High   Range
--------------   -------------   ----   ----   -----
0.2572167752     B               0.2    0.3    0.1
0.572167752      I               0.5    0.6    0.1
0.72167752       L               0.6    0.8    0.2
0.6083876        L               0.6    0.8    0.2
0.041938         SPACE           0.0    0.1    0.1
0.41938          G               0.4    0.5    0.1
0.1938           A               0.1    0.2    0.1
0.938            T               0.9    1.0    0.1
0.38             E               0.3    0.4    0.1
0.8              S               0.8    0.9    0.1
0.0

In summary, the encoding process is simply one of narrowing the range of possible numbers with every new symbol. The new range is proportional to the predefined probability attached to that symbol. Decoding is the inverse procedure, in which the range is expanded in proportion to the probability of each symbol as it is extracted. We finally have a floating-point number in the range [0, 1) that uniquely identifies the encoded sequence, which demonstrates neatly how arithmetic coding works.
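The pseudocode above translates almost line for line into the following sketch. It is an illustrative floating-point version only; a practical coder works with fixed-size integer registers, and the helper names are ours:

# Floating-point arithmetic coder for the "BILL GATES" example.
# cum_ranges maps each symbol to its (low_range, high_range) on the 0-1 line.
cum_ranges = {" ": (0.00, 0.10), "A": (0.10, 0.20), "B": (0.20, 0.30),
              "E": (0.30, 0.40), "G": (0.40, 0.50), "I": (0.50, 0.60),
              "L": (0.60, 0.80), "S": (0.80, 0.90), "T": (0.90, 1.00)}

def encode(message):
    low, high = 0.0, 1.0
    for symbol in message:
        rng = high - low
        low_r, high_r = cum_ranges[symbol]
        high = low + rng * high_r
        low = low + rng * low_r
    # The text outputs `low`; any number in [low, high) identifies the message,
    # and the midpoint is numerically safer with floating-point arithmetic.
    return (low + high) / 2

def decode(number, length):
    message = ""
    for _ in range(length):          # stream length carried along, as in the text
        for symbol, (low_r, high_r) in cum_ranges.items():
            if low_r <= number < high_r:
                message += symbol
                number = (number - low_r) / (high_r - low_r)
                break
    return message

tag = encode("BILL GATES")
print(tag)                           # a value inside [0.2572167752, 0.2572167756)
print(decode(tag, len("BILL GATES")))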

3.4 Advantages of Arithmetic Coding over Huffman Coding

1. It is much easier to adapt arithmetic coding to changing input statistics. All we need to do is estimate the probabilities of the input alphabet, which can be done by keeping a count of the letters as they are coded; there is no need to maintain a tree, as with adaptive Huffman coding.
2. No prior generation of a code table is necessary in the arithmetic coding method.

3.5 Applications
Arithmetic coding is used extensively in a variety of lossless text compression as well as image compression schemes, such as bi-level image compression (JBIG, the Joint Bi-level Image Experts Group standard). These applications can be a separate study topic of their own and so will not be considered here.
_______________________________________________________________________

3.6 Summary
Thus, in this chapter we introduced the basic ideas behind arithmetic coding. The arithmetic code is a uniquely decodable code that provides a rate close to the entropy for long stationary sequences.
_______________________________________________________________________

Chapter 4 Dictionary Techniques.


In the two coding techniques above we assumed an independent, identically distributed (iid) input. However, as we saw in the introduction, the more we know about the structure of the data to be transmitted, the more we can compress it. Dictionary techniques are a group of techniques that exploit structure in the data. As most sources are correlated to start with, there is structure to exploit: these techniques, both static and dynamic (adaptive), build a list of commonly occurring patterns and encode these patterns by transmitting their index in the list. The UNIX compress utility uses dictionary techniques.

4.1 Introduction
So, we finally come to a technique that uses the correlations and recurring patterns in the data. A classic example is a text source in which certain patterns or words recur constantly, while other patterns occur very rarely. A very reasonable approach to encoding such sources is to make a dictionary, or list, of the frequently occurring patterns. If a pattern does not appear in the dictionary, it can be encoded using some less efficient method. In effect we are splitting the possible patterns into two classes: frequently occurring and infrequently occurring.

4.1.1 Example
Suppose we have a text that consists of four-character words. Let A = {the 26 English letters + 6 punctuation marks} be the source alphabet, so that it has 32 symbols. If we encode a whole word at a time (as opposed to one symbol at a time as in the earlier methods), the total number of possible four-character patterns is

32 * 32 * 32 * 32 = 32^4 = 2^20, i.e. about 1M four-character patterns.

As the alphabet has 32 symbols, 5 bits are needed to encode each symbol, so a four-character word takes 20 bits. Let us now put the 256 most likely patterns (words) into a dictionary. The transmission scheme works as follows. Whenever we need to send a word from the dictionary, we first send a flag bit (say 0) and then the index of the word in the dictionary (8 bits here), a total of 9 bits. If the word is not in the dictionary, we send a flag bit of 1 followed by the complete 20-bit code (a total of 21 bits). Let p be the probability that a word is found in the dictionary. The average number of bits per pattern (word) is then

R = 9p + 21(1 - p) = 21 - 12p

For our scheme to be useful, R must be less than 20, which occurs when p >= 0.084. If every word were equally probable, the probability that a given word is in the dictionary would be only 256 / 1M = 0.00025. So, we can see from these numbers that our scheme is only of any consequence if the probability of hitting the elements in the dictionary

(the hit rate) is at least 0.084. Thus, putting random words into the dictionary is of no consequence; we must select words whose combined probability of occurrence is at least 0.084. We will now consider the two approaches to dictionary-based algorithms: static and dynamic.

4.2 Static Dictionary
This approach is useful when prior knowledge of the source is available. For example, if we had the task of compressing a student database, we could use this approach, as we already know that some words, such as name, age and sex, are going to appear frequently. However, there are also some general static dictionary methods that are less specific to a single application.

4.2.1 Digram Coding
In this type of coding, the dictionary contains all the letters of the source alphabet, followed by as many pairs of letters, called digrams, as the dictionary can hold. For example, a 256-entry dictionary for ASCII text could contain the 95 printable characters plus the 161 most frequently used pairs.

Example 4.2.1.1 Consider A = {a, b, c, d, r}. Based on prior knowledge we build the dictionary as:

Code   Entry        Code   Entry
000    a            100    r
001    b            101    ab
010    c            110    ac
011    d            111    ad

Suppose we wish to encode the sequence abracadabra. The encoder first reads the two symbols ab and looks them up in the dictionary; since the pair is found, it outputs the code 101 and moves two symbols ahead. If a pair is not found, it encodes the single leading symbol instead and moves one symbol ahead. Continuing in this fashion, the output sequence for abracadabra is

101 100 110 111 101 100 000
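A minimal digram encoder along these lines might look as follows (a sketch using the dictionary of example 4.2.1.1; the function name is ours):

# Digram coding sketch for the dictionary of example 4.2.1.1.
dictionary = {"a": "000", "b": "001", "c": "010", "d": "011",
              "r": "100", "ab": "101", "ac": "110", "ad": "111"}

def digram_encode(text):
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in dictionary:          # try the two-symbol pattern first
            out.append(dictionary[pair])
            i += 2
        else:                           # fall back to the single symbol
            out.append(dictionary[text[i]])
            i += 1
    return "".join(out)

print(digram_encode("abracadabra"))     # 101100110111101100000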

4.3 Adaptive Dictionary Most adaptive dictionary schemes have their roots in two landmark papers buy Jacob Ziv and Abraham Lempel in 1977 and 1978.These two papers provide two different approaches to adaptively building dictionaries, and each approach has given rise to a number of variations. The approaches based on the 1977 paper are called as LZ77 family (also LZ1) and that belonging to the 1978 category are called as LZ78 or LZ2, family. 4.3.1.1 The LZ77 approach

In the LZ77 approach, the dictionary is simply a portion of the previously encoded sequence. The encoder examines part of the previously encoded sequence through a sliding window, as shown below. Suppose the sequence to be transmitted is cabracadabrarrarrad, the length of the window is 13, the size of the lookahead buffer is 6, and the current situation is as follows:
Search buffer: cabraca        Lookahead buffer: dabrar

With dabrar in the lookahead buffer, we look back in the already encoded portion of the window to find a match for d. In general, the encoder searches the search buffer for the longest match. Once the longest match is found, the encoder encodes it with the triplet <o, l, c>, where:

o is the offset, i.e. the distance from the start of the lookahead buffer back to the start of the match;
l is the length of the match, i.e. the number of matching symbols;
c is the codeword corresponding to the symbol in the lookahead buffer that follows the match.

In the above example there is no match for the symbol d, so the triplet <0,0,C(d)> is transmitted. Having dealt with d, we shift the sliding window one position, so as to get:
Search buffer: abracad        Lookahead buffer: abrarr

Now we search for a match for a. We find an a at an offset of 2 (offsets are measured back from the start of the lookahead buffer) with a match length of 1; another at o = 4, l = 1; and yet another at o = 7, with a match length of 4. So we encode the string abra with the triplet <7,4,C(r)> and move the window forward by 5 characters:
Search buffer: adabrar        Lookahead buffer: rarrad

Here the triplet is <3,5,C(d)>; why a match longer than the offset is allowed will become evident from the decoding method. Suppose now that the decoder has already decoded the sequence cabraca and receives the three triplets <0,0,C(d)>, <7,4,C(r)> and <3,5,C(d)>. The first triplet is easy to decode: o = 0 and l = 0 mean that there was no match within the previously decoded string and that the next decoded symbol is d, so the decoded string becomes cabracad. The second triplet tells us to move the pointer 7 steps back and copy 4 characters from that point, and so on. The decoding process works as shown on the next page.

(Enlarged photocopy of page 104, showing the decoding process.)
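A bare-bones LZ77 encoder in the spirit of the description above is sketched below (the function and parameter names are ours, and no attempt is made at efficient searching). Note that it starts from an empty search buffer, so its first few triplets differ from the mid-stream snapshot shown in the text:

def lz77_encode(data, search_size=7, lookahead_size=6):
    # Emits (offset, length, next_symbol) triplets as in the text.
    i, triples = 0, []
    while i < len(data):
        search_start = max(0, i - search_size)
        best_offset, best_length = 0, 0
        for offset in range(1, i - search_start + 1):
            length = 0
            # The match may run past the search buffer into the lookahead buffer.
            while (length < lookahead_size - 1 and i + length < len(data) - 1
                   and data[i - offset + length] == data[i + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = offset, length
        next_symbol = data[i + best_length]
        triples.append((best_offset, best_length, next_symbol))
        i += best_length + 1
    return triples

print(lz77_encode("cabracadabrarrarrad"))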

4.3.1.2 Variations on the LZ77 scheme

The LZ77 scheme as outlined above has, asymptotically, the same performance as an encoder with full statistical knowledge of the source, yet it is a very simple adaptive scheme that requires no prior knowledge of the statistics. There have been several variations on the LZ77 scheme in the literature, including the use of variable-length codes for the triplets (which we have assumed to be of fixed length in the discussion above). Popular compression packages such as PKZip, Zip, LHarc and ARJ all use an LZ77-based algorithm followed by a variable-length coder.

4.3.2 The LZ78 coding scheme
The LZ77 approach implicitly assumes that similar patterns occur close together. It makes use of this structure by using the recent past of the sequence as the dictionary for encoding. However, this means that any pattern that recurs over a period longer than the one covered by the coder window will not be captured. The LZ78 algorithm solves this problem by dropping the reliance on the search buffer and keeping an explicit dictionary.

Example 4.3.2.1 Consider the encoding of the sequence wabba_wabba_wabba_wabba_woo_woo_woo (where _ stands for a blank). Being an adaptive method, the scheme builds the dictionary as it encodes. The inputs are coded as pairs <i, c>, with i being an index into the dictionary (0 if there is no match) and c being the code of the character that follows the matched phrase.

Encoder output   Dictionary index   Dictionary entry
<0, C(w)>        1                  w
<0, C(a)>        2                  a
<0, C(b)>        3                  b
<3, C(a)>        4                  ba
<0, C(_)>        5                  _
<1, C(a)>        6                  wa
<3, C(b)>        7                  bb
<2, C(_)>        8                  a_
<6, C(b)>        9                  wab
<4, C(_)>        10                 ba_
<9, C(b)>        11                 wabb
<8, C(w)>        12                 a_w
<0, C(o)>        13                 o
<13, C(_)>       14                 o_
<1, C(o)>        15                 wo
<14, C(w)>       16                 o_w
<13, C(o)>       17                 oo
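As a check on the table above, here is a minimal LZ78 encoder sketch (the function name is ours); it emits (dictionary index, next character) pairs, with index 0 meaning no match, and builds the same dictionary:

def lz78_encode(text):
    dictionary, output = {}, []          # phrase -> index
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                 # keep extending the known phrase
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                           # trailing phrase with no new character
        output.append((dictionary[phrase], ""))
    return output, dictionary

pairs, entries = lz78_encode("wabba_wabba_wabba_wabba_woo_woo_woo")
print(pairs)     # (0,'w'), (0,'a'), (0,'b'), (3,'a'), (0,'_'), (1,'a'), ...
print(entries)   # {'w': 1, 'a': 2, 'b': 3, 'ba': 4, '_': 5, 'wa': 6, ...}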

Notice that the entries in the dictionary keep getting longer and longer; if the words keep repeating, as in a song, at some point an entire sentence will become a single dictionary entry. While this algorithm has the capacity to capture indefinitely long patterns, it also has a serious drawback: the dictionary grows with every phrase not already in it, and its size can run amok.
_______________________________________________________________________

4.3.3.1 The LZW algorithm
This algorithm is a direct variation of the LZ78 algorithm. It is the best known modification of the LZ algorithms, and it is the one that sparked widespread interest in them. Welch proposed a technique for removing the need to encode the second element of the pair <i, c>; that is, the encoder sends only indices into the dictionary. We will study the algorithm with the help of an example.

Example 4.3.3.1 Encode the sequence wabba_wabba_wabba_wabba_woo_woo_woo (where _ stands for a blank). The alphabet is therefore A = {_, a, b, o, w}, and the dictionary initially contains just these primary symbols:

Index   Entry
1       _
2       a
3       b
4       o
5       w

The encoder first reads the letter w. This letter is in the dictionary, so we concatenate the next letter to it, forming wa. This is not in the dictionary, so we add it as entry 6 and output the index of w. We then start again with a, which is in the dictionary; concatenating the next letter gives ab, which is not in the dictionary, so we add that too, and so on. Notice that the 12th through 19th entries are three or four letters in length. Then we encounter the pattern woo for the first time and drop back to two-letter patterns for three more entries, after which we go back to entries of increasing length. The completed dictionary is:

Index   Entry        Index   Entry
1       _            14      a_w
2       a            15      wabb
3       b            16      ba_
4       o            17      _wa
5       w            18      abb
6       wa           19      ba_w
7       ab           20      wo
8       bb           21      oo
9       ba           22      o_
10      a_           23      _wo
11      _w           24      oo_
12      wab          25      _woo
13      bba

The encoded message is therefore:

5 2 3 3 2 1 6 8 10 12 9 11 7 16 5 4 4 11 21 23 4

It is interesting to see how the adaptive scheme adapts over the sequence, starting from single letters and building up more and more complicated patterns.

Decoding: In the above discussion we saw how LZW encodes the given string. In this section we consider decoding the resulting sequence of indices

5 2 3 3 2 1 6 8 10 12 9 11 7 16 5 4 4 11 21 23 4

The decoder starts with the same initial dictionary that existed on the encoding side before encoding began:

Index   Entry
1       _
2       a
3       b
4       o
5       w

We will now see how the decoder adaptively reconstructs the dictionary from this sequence. The index value 5 corresponds to the letter w, so the first decoded letter is w. At the same time, in order to mimic the dictionary construction procedure of the encoder, we begin constructing the next dictionary entry, starting with the letter w. This pattern already exists in the dictionary, so nothing is added yet. The next decoder input is 2, which corresponds to a. We decode a and concatenate it to the pattern under construction, w, to form wa. As this pattern does not exist in the dictionary, we add it as the sixth entry and start a new pattern beginning with the letter a. The next four inputs, 3, 3, 2 and 1, correspond to the letters b, b, a and _, and generate the dictionary entries ab, bb, ba and a_. The dictionary now looks like the following, in which the 11th entry is under construction:

Index   Entry
1       _
2       a
3       b
4       o
5       w
6       wa
7       ab
8       bb
9       ba
10      a_
11      _ (under construction)

The next input is 6, which corresponds to wa in the dictionary. The pattern under construction is _ (from the previous step); concatenating the first letter of wa to it gives _w, which is not in the dictionary, so it is added as the 11th entry, and the new pattern under construction becomes wa. The next input is 8, which corresponds to bb. Concatenating the first letter of bb to wa gives wab, which is not in the dictionary, so it is added as the 12th entry, and the pattern under construction becomes bb. (Note that the new entry is wab and not abb: the previously decoded string was wa as one unit, not w and a separately, and only its first following letter is appended.) We continue in this fashion until the output string is reproduced exactly.
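The encoding procedure just described is short enough to write out in full. The following sketch (our own function names, with the blank written as '_') reproduces the index sequence of example 4.3.3.1:

def lzw_encode(text, alphabet):
    # Dictionary indices start at 1, as in the example.
    dictionary = {symbol: i + 1 for i, symbol in enumerate(alphabet)}
    pattern, output = "", []
    for symbol in text:
        if pattern + symbol in dictionary:
            pattern += symbol                    # keep growing the match
        else:
            output.append(dictionary[pattern])   # emit longest known pattern
            dictionary[pattern + symbol] = len(dictionary) + 1
            pattern = symbol
    output.append(dictionary[pattern])           # flush the final pattern
    return output

sequence = "wabba_wabba_wabba_wabba_woo_woo_woo"
print(lzw_encode(sequence, ["_", "a", "b", "o", "w"]))
# [5, 2, 3, 3, 2, 1, 6, 8, 10, 12, 9, 11, 7, 16, 5, 4, 4, 11, 21, 23, 4]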

4.3.3.2 An exception
There is one case in which the above method of decoding breaks down: the case in which the index received is not yet present in the decoder's dictionary. Consider A = {a, b} and suppose we encode a sequence beginning with abababab... The encoding process is still the same. We begin with the initial dictionary

Index   Entry
1       a
2       b

and end up with the dictionary

Index   Entry
1       a
2       b
3       ab
4       ba
5       aba
6       abab
7       b (under construction)

The transmitted sequence is therefore 1 2 3 5. Now comes the snag: decoding begins with the same initial dictionary as shown above.
1. The first input is 1. It is in the dictionary, so we decode a and read on.
2. The next input is 2, which decodes to b. The join of the previous string and the new letter, a + b = ab, is not in the dictionary, so we add it as the 3rd entry. The pattern b itself already exists, so we read on.
3. The next input is 3, which decodes to ab. The join of the previous string and the first new letter, b + a = ba, does not exist, so we add it as the 4th entry; ab itself already exists, so we read on.
4. The next input is 5, and here comes the problem: the 5th entry has not been defined yet! But the problem can be solved. The 5th entry must have been created by the encoder from the string just decoded, ab, followed by the first character of the very string that the 5th entry stands for; and since that string itself begins with ab, its first character must be a. The join ab + a = aba does not exist in the dictionary, so we add it as the 5th entry, decode it as aba, and merrily carry on as if nothing had happened.

Thus this special case has to be handled explicitly by the LZW decoding algorithm.

4.3.3.3 Applications of the LZW algorithm
The LZW algorithm is one of the most widely used compression techniques, and appears in many well-known applications:
1. UNIX compress
2. GIF
3. V.42bis (compression over modems)
4. Adobe Acrobat PDF
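The decoding procedure, including the exception just described, can be sketched as follows (function names ours). When an index arrives that is not yet in the dictionary, the missing entry must be the previously decoded string followed by its own first character:

def lzw_decode(indices, alphabet):
    dictionary = {i + 1: symbol for i, symbol in enumerate(alphabet)}
    previous = dictionary[indices[0]]
    output = [previous]
    for index in indices[1:]:
        if index in dictionary:
            entry = dictionary[index]
        else:                                    # special case: index not yet defined
            entry = previous + previous[0]
        output.append(entry)
        # Mimic the encoder: previous pattern plus first letter of the new entry.
        dictionary[len(dictionary) + 1] = previous + entry[0]
        previous = entry
    return "".join(output)

print(lzw_decode([1, 2, 3, 5], ["a", "b"]))      # abababa
print(lzw_decode([5, 2, 3, 3, 2, 1, 6, 8, 10, 12, 9, 11, 7,
                  16, 5, 4, 4, 11, 21, 23, 4], ["_", "a", "b", "o", "w"]))
# wabba_wabba_wabba_wabba_woo_woo_woo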

4.4 Summary
In this chapter, techniques were introduced that keep a dictionary of recurring patterns and transmit the indices of those patterns, instead of the patterns themselves, in order to achieve compression. There are a number of ways in which the dictionary can be created.
1. In applications where certain patterns recur constantly, we can build application-specific static dictionaries. Care should be taken not to use these dictionaries outside their area of application, otherwise we might end up with data expansion instead of data compression.
2. The dictionary can be the source output itself. That is, triplets are transmitted as the data is read, without using an explicit table. This is the approach used by the LZ77 algorithm. When using this algorithm there is an implicit assumption that the recurrence of patterns is local.
3. This assumption is removed by the LZ78 approach, which dynamically constructs an explicit dictionary from patterns observed in the input.

Thus this chapter has shown how dictionary techniques can be used to compress a variety of data.

Chapter 5 Lossy Compression.


Up to now we have considered only lossless schemes, in which no data is lost and we never had to worry about how the reconstructed data differs from the original. The limited amount of compression that can be achieved by lossless techniques (the best possible rate being the entropy rate) is sufficient in some circumstances. In many others, where resources are limited and we do not require absolute integrity, lossy compression can be the better alternative. For lossless compression schemes we used the rate as the essential measure of performance; that is not sufficient for lossy compression, and some additional performance measure is necessary, namely a measure of the difference between the original and the reconstructed data. This measure is referred to as the distortion in the reconstructed data. In the best of all possible worlds we would like to incur the minimum amount of distortion while compressing at the lowest possible rate; obviously there is a trade-off between minimizing the rate and keeping the distortion small. The extreme cases are the one in which no information is transmitted at all (our hypothetical case in the introduction of this report) and the one in which all data is kept and the distortion is zero. The study of the situations between these two extremes is called rate-distortion theory.

Source (X) -> Source Encoder -> compressed representation (Xc) -> Channel -> Source Decoder -> reconstruction (Y) -> User

Fig 5.1 Block diagram of a generic compression scheme

5.1 Distortion Criterion
A natural thing to do when looking at the fidelity of a reconstructed sequence is to look at the differences between the original and the reconstructed values, in other words the distortion introduced by the compression process. Two popular measures of distortion are the squared error measure and the absolute difference measure; these are called difference distortion measures. If {xn} is the input and {yn} is the output, then the squared error measure is

d(x,y) = (x-y)2 and the absolute difference is given as, d(x,y) = |(x-y)| In general it is difficult to measure the distortion on a term by term basis. Several statistical measures exist to aid in this. Mean squared error(MSE) : =
2 1 N N P n=1

(xn yn)2

If we are interested in the size of the error relative to the signal, we can find the ratio of the average squared value of the source output and the MSE called as Signal-noise ratio(SNR).

SNR =
2

2 2

Where x is the average squared value of the source output, or signal and d is the SNR. The SNR is often measured on a log scale in dB so as to give the formula,

SNR(dB) = 10 log10 2
d

Sometimes we are more interested in the size of the error relative to the peak value of the signal xpeak, than with the size of the error relative to the average squared value of the signal. This ratio is called as the peak-signal-to-noise-ratio and is given by;

PSNR(dB) = 10 log10
N P n=1

xpeak d
2

Another difference distortion measure that is used quite often, although not as frequently as the MSE, is the average of the absolute difference, or

    d1 = (1/N) Σ_{n=1}^{N} |x_n - y_n|

This measure is especially useful for evaluating image compression algorithms.
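As a quick illustration, the following sketch computes these measures for a pair of short sequences; it is only a minimal example (the function and variable names are illustrative, and x_peak defaults to the largest magnitude in the source if it is not supplied).

    import math

    # Illustrative computation of the difference distortion measures defined above.
    def distortion_measures(x, y, x_peak=None):
        n = len(x)
        mse = sum((a - b) ** 2 for a, b in zip(x, y)) / n      # mean squared error
        mad = sum(abs(a - b) for a, b in zip(x, y)) / n        # mean absolute difference d1
        sig_power = sum(a ** 2 for a in x) / n                 # average squared signal value
        snr_db = 10 * math.log10(sig_power / mse) if mse > 0 else float("inf")
        if x_peak is None:
            x_peak = max(abs(a) for a in x)                    # assumed peak if not given
        psnr_db = 10 * math.log10(x_peak ** 2 / mse) if mse > 0 else float("inf")
        return mse, mad, snr_db, psnr_db

    original      = [10, 12, 9, 14, 11]
    reconstructed = [10, 11, 9, 15, 11]
    print(distortion_measures(original, reconstructed))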

5.2 Scope of lossy algorithms

It is obvious that schemes for text compression are lossless. It is the nature of the human sense organs (especially the eyes and ears) that allows us to create lossy algorithms. The human eye, for example, is far more sensitive to some features of an image, such as edges, than to others; the mind does not register everything that the eye sees. We can use this knowledge to design compression schemes in which the distortion introduced by the lossy compression is not noticeable.

Similarly, because the human ear cannot hear sounds outside the range of roughly 20 Hz to 20,000 Hz, such frequency components can safely be removed with a band-pass filter; transforming back to the time domain then yields a signal that carries less information to be coded, giving compression of the speech signal. It is interesting to note, therefore, that lossy schemes work on data whose fidelity is ultimately judged by limited human senses; discrete data such as text has traditionally remained in the area of lossless compression. A rough sketch of the frequency-domain idea is given below.
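This is only a sketch of the perceptual idea, assuming NumPy is available; real audio coders use far more refined psychoacoustic models (masking, critical bands) than a simple band-pass.

    import numpy as np

    # Sketch: zero the frequency components outside the nominal audible band
    # (20 Hz - 20 kHz) before transforming back to the time domain.
    def band_limit(signal, sample_rate, low=20.0, high=20000.0):
        spectrum = np.fft.rfft(signal)
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
        spectrum[(freqs < low) | (freqs > high)] = 0.0
        return np.fft.irfft(spectrum, n=len(signal))

    rate = 96000                                   # chosen so a 25 kHz tone is representable
    t = np.arange(rate) / rate                     # one second of samples
    x = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 25000 * t)
    y = band_limit(x, rate)                        # the inaudible 25 kHz component is removed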

5.3 Lossy Text Compression?

We have thus seen an apparent divergence between the research paradigms of text and image compression, despite the fact that both are concerned with compressing information whose subjective quality must be recoverable. Schemes for text compression are invariably reversible, or lossless, whereas, although there certainly exist lossless methods of image compression, much research effort addresses irreversible or lossy techniques such as transform coding, vector quantization, and fractal approximation. The divergence between the text and image paradigms is unfortunate because the opportunity for symbiosis between the two approaches is lost, and advances in one domain have negligible impact on the other. Although there are superficial reasons why one might choose to neglect the topic of lossy text compression, such as the difficulty of evaluating the quality of the regenerated message, it is argued by Witten et al. [3] that a great deal can be gained by taking the idea of approximate compression of text seriously.

In the paper by Witten et al. [3], two techniques for lossy compression of text are described, though not very mathematically. The necessarily very short texts used in the examples exhibited in the paper, along with their statistical limitations, can be no more than indicative of the underlying power of the techniques they describe. The next section develops a semantic approach that uses an auxiliary thesaurus. However, any word-based approach limits the amount of compression that can be achieved if used in isolation. Hence we next consider syntactic, generative techniques, analogous to fractal compression of images, for the generation of approximate text. As when fractals are applied to image coding, extremely high compression factors can be achieved for certain data sets, but the process is time-consuming and it is not clear whether the method can be extended to apply efficiently to all texts. The final section points the way to a synthesis of the two approaches, which together promise to form an extremely powerful and yet general method of lossy text compression.

5.3.1 Lossy text compression: background and motivation

Everyday experience abounds with examples of approximate text compression. The art of précis, for example, is lossy compression par excellence and is widely used for a variety of practical purposes, though in manual rather than automatic implementations. Further examples, at a much higher compression rate, occur in newspaper headlines, the creation of which is an art that blends current affairs with an almost poetic feeling for words and their juxtaposition.

Finally, at the extreme end of the scale is the trash can, surely the epitome of irreversible compression! While the last example may seem flippant (though it is distinguished as the only one that has been machine-implemented to date), the point is that lossy compression can serve a wide range of purposes. The examples also show that, as in image compression, perceived quality is not easily specified.

In synthetic languages the lossy text compression problem is easy to specify: compress the text (making suitable lexical and syntactic transformations) but preserve semantics. In a programming language such as Pascal, simple lossy compression may be achieved by removing superfluous white space. This yields substantial compression using a trivial algorithm. Literate programming's notion of tangling is an example of lossy compression that loses not only layout information but all comments as well. Further compression can be achieved by applying compiler optimisation methods to the text (optimising for source code length, rather than object code length), which may, for example, result in variable names being shortened or even lost, or expressions being rewritten.

In natural language, other lossy techniques include the commonly suggested device of omitting vowels from text. This sacrifices readability for compression and is hardly suitable for practical use, though variations are used in shorthand (both speedwriting and non-roman scripts), Braille and stenography. Thus, Dearborn's Speedwriting (1924) was designed for typewriter use, using standard letters and punctuation. For example, the code "eC" represents the sound "each", the "C" designating the sound "ch". Sixty rules in Speedwriting provide for lossy compression of a vocabulary of around 20,000 words. The Soundex system uses a simpler lossy compression technique to avoid the problem of requiring exact spelling matches for text database searches.

At first sight the historical standing and simplicity of sound-based lossy compression might seem very attractive, and its effectiveness is easily demonstrated. The optimal encoding of "quick brown fox" using an order-0 model of English (Brown Corpus) is 81.3 bits; after making the lossy phonetic transcription x->cs, qu->cw, k->c (obtaining "cwic brown focs", which sounds the same when read aloud) it compresses to 68.6 bits using the appropriately modified order-0 model. Surely a dictionary-based phonetic model would do better? In fact a conventional order-n model can do even better, since it necessarily models text that is pronounced: it is effectively a phonetic model for sounds of at most n+1 characters. Not only can an order-n model compress more effectively, but it can do so losslessly! The compression of the various shorthand techniques results from the poor correspondence of single letters to sounds; moreover, every letter- or phoneme-based approach to lossy text compression assumes an appropriate phonetic model to interpret it. Conventional lossless compression can be far more effective, in terms of compression, but obviously results in unreadable data that requires a computer to decode. Such observations lead us to base the text compression work to be described on semantic units larger than letters (and independent of sound), a tactic corresponding to one that has precedent in lossless compression.
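The order-0 cost calculation behind the "quick brown fox" example can be sketched as follows. The letter probabilities below are an illustrative stand-in, not the Brown Corpus statistics used in the paper, so the totals will differ from 81.3 and 68.6 bits, although the qualitative effect is the same.

    import math

    # Cost in bits of a message under an order-0 letter model: sum of -log2 p(c).
    MODEL = {'q': 0.001, 'u': 0.028, 'i': 0.070, 'c': 0.028, 'k': 0.008,
             'b': 0.015, 'r': 0.060, 'o': 0.075, 'w': 0.024, 'n': 0.067,
             'f': 0.022, 'x': 0.0015, 's': 0.063, ' ': 0.180}

    def order0_bits(text, model):
        return sum(-math.log2(model[ch]) for ch in text)

    print(round(order0_bits("quick brown fox", MODEL), 1))
    print(round(order0_bits("cwic brown focs", MODEL), 1))   # after x->cs, qu->cw, k->c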

5.3.2 Word-by-word semantic compression

Excellent, comprehensive thesauri have recently become available in machine-readable form, and already some compression researchers have begun to take advantage of them (e.g., [8]). Thesaurus compression is a macro word-replacement strategy for lossy compression: replace each word in the text with a shorter equivalent form taken from a (given) thesaurus. Despite its remarkable simplicity, this technique provides worthwhile compression with little semantic loss; sometimes the richness and literary texture of the prose even improves. Here is an example of the first two paragraphs of the original paper [3], so compressed:

We win been struck by the true divergence mid the dig plans of book and bust compression, dig the life that both are scary and jamming lore whose biased top must be recoverable. Cons for tome compression are invariably reversible or lossless, as as there well be lossless uses of idol compression, ton dig go owners irreversible or lossy ways like as vary lawing, vector quantization, and fractal guess. The divergence amid the work and bust plans is sad due to the gap for symbiosis mid the two approaches is gone, and goes in a area win off tap at the some. As there are brief wits why a pep opt to omit the item of lossy work compression like as the fix of trying the top of the regenerated sense in this page we jog that a key buy wc be got by asking sadly the yen of put compression of tome.

This reduces the text from 1031 bytes to only 804, giving a compression figure of 78%. Of course, the word count is unaffected and therefore common operations, including word-counting, still function correctly on the compressed text. A striking advantage of the thesaurus technique, particularly in comparison with lossless methods of compression, is that it can be re-applied to the same text with a further gain in compression performance. The semantic loss tends to increase every time the transformation is reapplied; this is the inevitable price of improved compression. For example, a second iteration of the method on the first two paragraphs of the paper yields:

We buy been struck by the due divergence mid the cut maps of bag and dud compression [1, 2], dig the root that both are dire and baring data whose biased bow must be recoverable. Lies for opus compression are invariably reversible or lossless, as as there fit be lossless uses of god compression, ton cut go heads irreversible or lossy ways will as go lawing, vector quantization, and fractal go. The divergence mid the do and bag aims is sad due to the gap for symbiosis mid the two approaches is lost, and goes in a sod net lax dab on the a. As there are tell gags why a go opt to bar the text of lossy do compression such as the fit of hard the key of the regenerated wit in this hail we jog that a tip win wc be won by begging sadly the yen of put compression of text.

The reduction is a further 3.7%.

Repeated applications tend to converge rather quickly to a fixed point that we call an attractor of the original paragraph (following the terminology of non-linear dynamics). An attractor of the example paragraph is reached after a further 6 iterations, and hardly differs from the second-iteration version above, except under very careful examination; in other words, its deep meaning is preserved. Tests show that the average distance to an attractor is about 7.28 iterations, although this varies with the style of the text and the particular thesaurus. Different replacement strategies can yield different attractors; we define the attractor set of a given text in the obvious way. Analysis of a large number of attractor sets shows that, as one might expect, the members of a set generally bear a strong resemblance to each other, giving a small but useful degree of variation in the compressed text.

We have experimented with improved methods of word-by-word semantic compression. The basic idea is to generalize to an expanded attractor set by progressing up the semantic hierarchy before each iteration. This tends to produce slightly better compression at the expense of semantic accuracy. Unfortunately the improvement is not guaranteed, for we can construct texts on which the generalization produces worse compression than any member of the original attractor set. A possible solution to this degeneration involves a simulated annealing process. The procedure replaces a word by one at a level above it in the hierarchy with a probability that depends on the current temperature value. This probability steadily approaches zero as time progresses. The operation proceeds in cycles: in every cycle each word has an opportunity to move up the hierarchy before being replaced by a shorter equivalent, and at the end of the cycle the temperature decreases according to a predetermined schedule.

All word-by-word compression schemes suffer a common flaw: they can never reduce the number of words in the text. (Human-implemented lossy compression, such as précis, does not necessarily suffer the same disadvantage.) Converse schemes, which locate phrases that occur in a thesaurus and replace them by a single word, were discarded as insufficiently powerful. Another method is required; we turn to one possibility in the next section.
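A minimal sketch of the word-by-word substitution idea, assuming a toy thesaurus in place of a full machine-readable one (and ignoring case, inflection and punctuation), might look like this:

    # Toy word-by-word semantic compression: replace each word with its shortest
    # synonym and iterate until a fixed point (an "attractor") is reached.
    # The thesaurus here is a tiny illustrative stand-in for a real one.
    THESAURUS = {
        "approximately": {"roughly", "about"},
        "roughly": {"about"},
        "enormous": {"huge", "big"},
        "huge": {"big"},
        "difficulty": {"trouble", "fix"},
    }

    def shorten(word):
        candidates = {word} | THESAURUS.get(word, set())
        return min(candidates, key=len)

    def semantic_compress(text):
        prev = None
        while text != prev:                 # iterate until the attractor is reached
            prev = text
            text = " ".join(shorten(w) for w in text.split())
        return text

    print(semantic_compress("approximately enormous difficulty"))   # -> "about big fix"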

5.3.3 Generative compression of text in the style of Hemingway

"The final abstract expression of every art is a number." (Wassily Kandinsky, 1912)

Parallel work on story generation [11] led us to consider the question of generating works in a particular literary style. Hemingway's prose was selected arbitrarily to illustrate the method; it has a strongly characteristic and easily identified style (many genuine examples are available, along with some notable imitations [12]). Linguistic analysis of the available corpus, in terms of both the syntactic constructions commonly adopted and the semantic entities that form the centerpiece of many Hemingway stories, resulted in a program that generates appropriately styled texts in the chosen genre. Here is a brief example of its output:

The old man who had waited for the old beggar from Madrid was certain that the locals had argued with his martini and should have argued with the waiter. Only he had not sat beside the waiter. No one but he had tried to fool the old beggar from Madrid he had heard about and knew that the bullfighter had not argued with the waiter. Only he knew that the parrot had told him about the matador's friend on Kilimanjaro. The old man knew that the locals who had not sat beside the American girl had argued with the old beggar from Madrid in a well lighted room and believed that the locals had joined up with his martini in the café. The old man had argued with his martini while fast asleep and believed that she who had not joined up with the American girl had not waited for the American girl. The old man had not cheated the matador's friend with a certain understanding. The old man was certain that the small dog with three legs who should have joined up with his martini had not waited for the old beggar from Madrid and had not cheated death. The old man had not sat beside death at the corner table.

The individuality and variety in the stories are directly attributable to the use of a pseudo-random number generator, which produces a sequence that depends solely on an initial seed. From a different seed one may grow a different text, within the constraints of the genre. For example, here is a text that begins in the same way:

The old man had waited for his martini. The old man had not tried to fool death he had heard about. He who knew that the small dog with three legs had told him about the matador's friend felt that the bullfighter had not told him about the old beggar from Madrid while fast asleep. He felt that the locals who had not seen the old beggar from Madrid had not waited for death for nothing. The old man thought that the man with the patch over one eye had brought him the waiter in the café. The old man who should have tried to fool the waiter had not brought him the waiter and knew that the bullfighter who had not brought him his martini had not told him about death he had heard about. The old man had not sat beside death at the corner table. No one but he who had sat beside the American girl believed that the bullfighter should have brought him the waiter. He who knew that she had joined up with his martini for nothing was certain that the bullfighter should have tried to fool the matador's friend in a well lighted room and had not cheated the old beggar from Madrid while fast asleep.

The number generator has a single 32-bit seed, hence just 2^32 possible states. Consequently any text so generated can be stored in 32 bits. The seed represents a very substantial compression (indeed, of a magnitude that has never previously been realized in text compression). This technique produces lossless codes for a particular class of texts: namely, those generated by the Hemingway pseudo-text program. The crucial insight is that, with no modification, it can produce lossy codes for a much larger class of texts.

Of course, worthwhile compression with reasonable fidelity can only be expected on stories within the Hemingway genre on which the program is modeled; nevertheless this does comprise a substantial number of samples.

5.3.4 Synthesizing the semantic and generative approaches

The next step is to combine the semantic and generative approaches to provide a more powerful approximate compression technique. They can be combined in two distinct ways. The thesaurus can be used to increase the match between a generated story and the one to be compressed; we call this semantic enhancement. Or it can be used to decrease the size of the generated story through the normal semantic compression procedure: this is lexical contraction. Although lexical contraction does not reduce the bit rate (the story is already represented by 32 bits), controlled experiments with human subjects, who had already been exposed to our earlier compression technique, showed that it increases the verisimilitude of the compressed text: the resulting taut, brusque prose accords better with the reader's idea of how a compressed version should read than the original, more florid, language. A lexical contraction of the first example generated text above is:

The old guy who had bided for the old bum from Madrid was set that the folks had rowed mid his martini and must get bugged too the waiter. One he had not sat at the waiter. No a yet he had sure to ass the old bum from Madrid he had heard re and knew that the bullfighter had not irked mid the waiter. Odd he knew that the parrot had told him re the matador's pal by Kilimanjaro. The old rig knew that the folks who had not sat on the yank kid had rowed mid the old bum from Madrid in a far lit live and bought that the folks had wedded up and his martini in the bar. The old man had irked and his martini as lax idle and bought that she who had not wedded up and the yank kid had not waited for the yank kid. The old arm had not conned the matador's pal mid a set wit. The old arm was set that the off pup mid three arms who must use wedded up mid his martini had not held for the old bum from Madrid and had not fobbed ruin. The old guy had not sat on ruin on the jam list.

Semantic enhancement is clearly the more powerful compression combination. Compared with the rather stilted vocabulary of the raw pseudo-text, semantic substitution offers much richer and more variegated language. For instance, here is one such transformation of the first sample text:

The perennial gear who had procrastinated for the archaic beggar from Madrid was unquestionable that the near-at-hands had battled additionally his martini and must concede haggled among the waiter. Peerless he had not sat around the waiter. No one though he had infallible to inveigle the obsolete drifter from Madrid he had learned about and knew that the bullfighter had not warranted midst the waiter. Solely he knew that the parrot had told him respecting the matador's promoter atop Kilimanjaro. The perennial homo sapiens knew that the folks who had not sat around the yankee adolescent had scrapped in addition the ancient mendicant from Madrid in a ruddy delicated elbowroom and knowed that the verging ons had fused jack up midst his martini in the tearoom. The dead male had irked with his martini whereas precipitous motionless and gathered that she who had not tied boost within the yankee coed had not delayed for the American daughter. The past fortify had not bilked the matador's companion within a factual insight.

The outmoded widower was stated that the limited pup moreover three legs who should have laced acquainted inside his martini had not waited for the grizzled pauper from Madrid and had not robbed passing. The pass fellow had not sat nearby decease atop the bottle up remit.

The much larger space of possible compressed texts that can be created with this method does exacerbate the problem, mentioned above, of finding the best match to a given source text. It may not be apparent how a text that has been generated and then subjected to semantic enhancement can be coded efficiently. Although four bytes suffice to represent the original pseudo-text, it seems to be necessary to specify the enhancement individually for each word, thus negating the compression that the generative method yields. Fortunately, the problem can be solved very simply. Examination of the program reveals that sentence enhancement, like story generation, is fully characterized by a 32-bit random number generator seed. This seed is all that is needed to regenerate the enhanced text without any loss of fidelity. Thus a total of just 8 bytes suffices for lossy compression of a text of any size: 4 for the generator and 4 for the semantic enhancement. Still further gains may be had by deriving one of the seeds from the other via an appropriately parameterized transformation. However, 8 bytes is already a rather efficient representation and the potential for further improvement is small, perhaps insignificant.
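The seed-based coding idea can be sketched as follows; the grammar is a trivial stand-in for the actual Hemingway pseudo-text program, and the brute-force search over a small seed space stands in for the real problem of finding the best-matching seed.

    import random

    # Sketch of seed-based (generative) compression: the codeword for a text is the
    # seed that makes a fixed generator reproduce it.
    SUBJECTS = ["The old man", "The waiter", "The bullfighter"]
    VERBS = ["had waited for", "had argued with", "had not cheated"]
    OBJECTS = ["the old beggar from Madrid", "his martini", "death"]

    def generate(seed, sentences=2):
        rng = random.Random(seed)
        return " ".join(f"{rng.choice(SUBJECTS)} {rng.choice(VERBS)} {rng.choice(OBJECTS)}."
                        for _ in range(sentences))

    def compress(text, seed_bits=16):
        # brute-force search of a small seed space for a seed that regenerates the text;
        # a real system would search for the closest match, not an exact one
        for seed in range(1 << seed_bits):
            if generate(seed) == text:
                return seed
        return None

    story = generate(12345)
    seed = compress(story)
    print(seed, generate(seed) == story)    # the whole story is coded by one small integer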

5.3.5 Conclusions
"Tall oaks from little seeds grow." (David Everett, 1769-1813, adapted)

The above discussion illustrates the benefits that can be obtained by taking the idea of lossy text compression (itself motivated by lossy image compression) seriously and adapting some of the techniques from the image compression world. Thesaurus substitution is a straightforward technique that results in appreciable compression; it has the advantage that, up to a point, it can be applied repeatedly to further reduce the size of the compressed text. However, it suffers from the serious disadvantage that although it reduces the size of each individual word, it can never reduce the number of words in the text. Generative techniques, coding a text in terms of a random number seed, give remarkably effective lossless compression for a restricted class of texts, and can be viewed as a lossy compression method for a more general class, indeed for a whole genre. The verisimilitude of the compressed stories can be increased by thesaurus substitution, to ensure that the reader perceives the result as compressed. Alternatively, accuracy can be increased through semantic enhancement. Although this technique doubles the bit rate of the compressed text, it permits a much more accurate rendering of the original text. One criticism of the scheme is its slow encoding speed; however, this is more than made up for by the very fast decoding that is possible.

We are also investigating other methods of lossy text compression. One technique that shows promise is based on progressive image transmission.

For example, in progressive text transmission of a paper, we first send the title, then the section headings, the abstract, the conclusion, and so on. The more of this representation that is stored or transmitted, the closer the result comes to a lossless representation. In experiments with transmitting papers, it appears that most of the time users cancel the transmission quite early on, achieving significant savings in transmission costs. Undoubtedly the largest problem for lossy text compression is the question of evaluating the texts produced, and of providing satisfactory measures of their subjective quality.
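A toy sketch of progressive text transmission, with illustrative field names, might look like this:

    # Sketch of progressive text transmission: send the most informative parts first,
    # so the receiver can stop at any point with a usable (lossy) version of the paper.
    def progressive_stream(paper):
        order = ["title", "section_headings", "abstract", "conclusion", "body"]
        for part in order:
            if part in paper:
                yield part, paper[part]

    paper = {
        "title": "Semantic and generative models for lossy text compression",
        "section_headings": ["Background", "Word-by-word compression", "Generative compression"],
        "abstract": "Two techniques for lossy compression of text are described ...",
        "conclusion": "Lossy text compression deserves serious attention ...",
        "body": "...",
    }

    for name, content in progressive_stream(paper):
        print(name, "->", content)          # the receiver may cancel at any point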

References and further reading:


[1] Sayood, Khalid, Introduction to Data Compression, Morgan Kaufmann Publishers, 1996.
[2] Nelson, Mark, "Arithmetic Coding + Statistical Modeling = Data Compression", Dr. Dobb's Journal, February 1991.
[3] Witten, I. H., Bell, T. C., Moffat, A., Nevill-Manning, C. G., Smith, T. C. and Thimbleby, H., "Semantic and generative models for lossy text compression", The Computer Journal, 37(2), pp. 83-87, 1994.
