
EC 2214: Coding & Data Compression

Introduction
Ashutosh Marathe
ashutosh.marathe@vit.edu
Vishwakarma Institute of Technology
DEPARTMENT OF ELECTRONICS ENGINEERING
Data Compression: The Complete Reference, David Salomon, 4th Edition, Springer-Verlag
Coding Techniques: An Introduction to Compression and Error Control, Graham Wade, Palgrave Publications
Data Compression, W. J. Lewis, 2nd Edition, Springer-Verlag
Introduction to Data Compression, Khalid Sayood, 3rd Edition, Elsevier India Limited
The Data Compression Book, Mark Nelson and Jean-Loup Gailly, 2nd Edition, BPB Publications


Books
Lossless compression
Intro/math preliminaries (1 wk)
Huffman coding (~1.5 wks)
Arithmetic coding (~1.5 wks)
Dictionary techniques (~1.5 wks)
Context-based compression (~1 wk)
Preliminary List of Topics
Lossy compression
Math preliminaries (1 wk)
Scalar quantization (1 wk)
Vector quantization (1 wk)
Transform Coding (1 wk)
Final on everything, with emphasis on 2nd half
Preliminary List of Topics (2)
A: The art & science of representing data in a compact form.
Physical analogy--suitcase packing:
What is Data Compression?
In the physical world
Not enough space
What about the digital world?
Q: Bandwidth/capacity grows exponentially--why bother?
A: Data is generated even faster!
Examples:
1 sec of CDDA audio:
44,100 samples x 2 channels x 16 bits/sample = 1,411,200 bits
1 sec of HD audio:
192,000 x 2 x 24/32 = 9,216,000/12,288,000 bits
1 sec of CCIR 601 video = 20+ MB
1 XGA (1024x768) frame (RGB):
1024 x 768 x 24 = 18,874,368 bits
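These raw rates are easy to verify; a minimal Python sketch (the helper name raw_bits is just illustrative):

    # Raw (uncompressed) bit rates for the examples above
    def raw_bits(samples_per_sec, channels, bits_per_sample):
        return samples_per_sec * channels * bits_per_sample

    print(raw_bits(44_100, 2, 16))    # CDDA audio: 1,411,200 bits/sec
    print(raw_bits(192_000, 2, 24))   # HD audio: 9,216,000 bits/sec
    print(1024 * 768 * 24)            # one 1024x768 RGB frame: 18,874,368 bits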
Why Compress?
Bottom line--without compression:
Many applications/services would not be feasible
E.g., streaming video
Many others would be much more expensive
E.g., analog vs. digital cell phones
Why not keep data compressed at all times?
It is difficult to live out of a suitcase
Similarly, uncompressed data formats are designed for data acquisition/consumption, not efficient storage
Why Compress? (3)
Lossless compression: x̂ = x
A.k.a. entropy coding, reversible coding
Lossy compression: x̂ ≠ x
A.k.a. irreversible coding
Basic Terminology
original x --> [Encoder] --> compressed y --> [Decoder] --> decompressed x̂
Compression ratio: |x|/|y|
|x| represents the number of bits in x
E.g.: |x| = 65,536, |y| = 16,384, compression = 4:1
Alt., data has been reduced by (|x|-|y|)/|x| = 75%
Other measures of coding performance
Bits per sample
E.g. ASCII: 8 bits/char, RGB: 24/48/72 bits/pixel
Distortion (lossy methods)
The human-perceived/mathematical difference between x and x̂
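A one-line check of the two measures above, using the example sizes:

    x_bits, y_bits = 65_536, 16_384
    print(f"{x_bits / y_bits:.0f}:1")            # compression ratio: 4:1
    print(f"{(x_bits - y_bits) / x_bits:.0%}")   # data reduced by 75%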
Basic Terminology (2)
Developing compression algorithms:
Phase I: Modeling
Develop the means to extract redundancy information
Redundancy => predictability
Phase II: Coding
Binary representation of the difference between the model and the observed data
A.k.a. the residual
Modeling & Coding
Consider the sequence:
S_n = 9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21
Binary encoding requires 5 bits/sample
Consider the model:
ŝ_n = n + 8:
9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
The residual e_n = S_n - ŝ_n:
0, 1, 0, -1, 1, -1, 0, 1, -1, -1, 1, 1
Ex. coding: 00 <=> -1, 01 <=> 0, 10 <=> 1
2 bits/sample
Compression scheme == model + residual
Note that the model also needs to be encoded (usually in the algorithm)
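A minimal sketch of this model-plus-residual scheme (variable names are illustrative; the 2-bit code is the one given above):

    # Model: s_hat_n = n + 8 (n starting at 1); residual coded at 2 bits/sample
    S = [9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21]
    residual = [s - (n + 8) for n, s in enumerate(S, start=1)]
    code = {-1: "00", 0: "01", 1: "10"}
    bitstream = "".join(code[e] for e in residual)
    print(residual)                                # [0, 1, 0, -1, 1, -1, 0, 1, -1, -1, 1, 1]
    print(len(bitstream), "bits vs", 5 * len(S))   # 24 bits vs 60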
Modeling & Coding Example #1
Consider the sequence:
S_n = 27, 28, 29, 30, 30, 32, 31, 31, 29, 28, 27
Consider the model:
ŝ_1 = 0; ŝ_n = S_{n-1} for n > 1:
0, 27, 28, 29, 30, 30, 32, 31, 31, 29, 28
The residual {e_n} = 27, 1, 1, 1, 0, 2, -1, 0, -2, -1, -1
Can be encoded with far fewer bits
Predictive coding
The use of previous samples to predict future samples
Very useful in exploiting temporal redundancies
E.g. in audio & video
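A sketch of this one-step predictor (prediction = previous sample, as defined above):

    # Predictive coding: predict each sample by its predecessor
    S = [27, 28, 29, 30, 30, 32, 31, 31, 29, 28, 27]
    prediction = [0] + S[:-1]          # s_hat_1 = 0, s_hat_n = S_{n-1}
    residual = [s - p for s, p in zip(S, prediction)]
    print(residual)   # [27, 1, 1, 1, 0, 2, -1, 0, -2, -1, -1]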
Modeling & Coding Example #2
Real-World Coding: Braille (1821)
[Slide image: the phrase "Be nice to others" encoded in Braille]
Real-World Coding: Morse (1844)
Note differences in code lengths
Generally, high/low frequency => short/long code
Not 100% consistent (e.g. l vs. m)
This is the rationale behind encoding schemes that exploit statistical redundancy
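A small illustration of the variable-length idea (these are the standard International Morse codes for the letters shown):

    # Frequent letters (e, t) get short codes; rarer ones get longer codes.
    # Note the inconsistency: l is more common than m in English, yet longer.
    morse = {"e": ".", "t": "-", "a": ".-", "m": "--", "l": ".-..", "q": "--.-"}
    for letter, code in morse.items():
        print(letter, "->", code, f"({len(code)} symbols)")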
Morse Code vs. Letter Frequencies
We defined (informally) the notion of data compression:
Lossless
Lossy
Presented the basic approach behind compression algorithms:
Modeling + coding
Presented some early examples of internationally standardized encoding schemes:
Braille, Morse
Summary
EC 2214: Coding & Data Compression

Review of information theory
and some aspects of coding
Ashutosh Marathe
ashutosh.marathe@vit.edu
Vishwakarma Institute of Technology
DEPARTMENT OF ELECTRONICS ENGINEERING
Compression == squeezing out the inefficiencies of the information representation
Note #1: in lossy compression we throw out less important/imperceptible information
Note #2: we must be able to reverse the process to make the data usable again
Q1: What data can be compressed?
Q2: By how much?
Q3: How close are we to optimal compression?
A: Information Theory: a mathematical description of information and its properties
Achieving Data Compression
Analog (continuous) data
Represented by real numbers
Note: cannot be represented exactly by computers
Digital (discrete) data
Given a finite set of symbols {a_1, a_2, …, a_n},
all data is represented as symbol sequences (or strings) over the symbol set
E.g.: {a, b, c, d, r} => abc, car, bar, abracadabra, …
We use digital data to approximate analog data
Representing Data
Roman alphabet plus punctuation
ASCII: 128 symbols (256 in extended ASCII)
Braille, Morse
Binary: {0, 1}
0 and 1 are called bits
All digital data can be represented efficiently in binary
E.g.: {a, b, c, d} fixed-length binary representation (2 bits/symbol):
Common Symbol Sets
Symbol:  a   b   c   d
Binary:  00  01  10  11
First formally developed by Claude Shannon at Bell Labs in the 1940s/50s
Explains limits on coding/communication using probability theory
Self-information
Given event A with probability P(A)
Information
i(A) = log_b(1/P(A)) = -log_b P(A)
Self-Information
Observations
Low P(A) => high i(A)
High P(A) => low i(A)
Rationale: low-probability (surprise) events carry more information; think "man bites dog" vs. "dog bites man"
Suppose A and B are independent; then i(AB) = i(A) + i(B):

i(AB) = log_b(1/P(AB)) = log_b(1/(P(A)P(B))) = log_b(1/P(A)) + log_b(1/P(B)) = i(A) + i(B)
[Plot: self-information i(x) = -log2(x) for 0 < x <= 1]
Fair coin
Let H & T be the outcomes
If P(H) = P(T) = 1/2, then
i(H) = i(T) = -log2(1/2) = 1 bit
Unfair coin
Let P(H) = 1/8, P(T) = 7/8
i(H) = 3 bits
i(T) = 0.193 bits
Note that P(H) + P(T) = 1
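A quick check of these self-information values:

    from math import log2

    def i(p):
        # Self-information, in bits, of an event with probability p
        return -log2(p)

    print(i(1/2))   # fair coin: 1.0 bit
    print(i(1/8))   # unfair heads: 3.0 bits
    print(i(7/8))   # unfair tails: ~0.193 bits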
Coin Flip Example
Let A_1, …, A_n be all the independent possible outcomes of an experiment, with probabilities P(A_1), …, P(A_n)
(First-Order) Entropy

H = Σ_{i=1..n} P(A_i) i(A_i) = -Σ_{i=1..n} P(A_i) log_b P(A_i)
If the experiment generates symbols, then (for b = 2) H is the average number of binary symbols (bits) needed to code them.
Shannon: no lossless compression algorithm can do better.
Note: the general expression for H is more complex, but reduces to the above for iid sources.
Consider the sequence:
1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10
Assume it correctly describes the probabilities generated by the source; then
P(1) = P(6) = P(7) = P(10) = 1/16
P(2) = P(3) = P(4) = P(5) = P(8) = P(9) = 2/16
Assuming the sequence is iid
Entropy Example #1
H = -Σ_i P(i) log2 P(i) = 4 × (1/16) log2 16 + 6 × (2/16) log2 8 = 3.25 bits
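A sketch that reproduces this number (the helper entropy() is illustrative and is reused in the next examples):

    from math import log2

    def entropy(probs):
        # First-order entropy, in bits/symbol, of a probability distribution
        return -sum(p * log2(p) for p in probs)

    probs = [1/16] * 4 + [2/16] * 6    # four symbols w.p. 1/16, six w.p. 2/16
    print(entropy(probs))              # 3.25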
Assume sample-to-sample correlation
Instead of coding samples, code differences:
1 1 1 -1 1 1 1 -1 1 1 1 1 1 -1 1 1
Now P(1) = 13/16, P(-1) = 3/16
H = 0.70 bits (per symbol)
The model also needs to be coded
Knowing something about the source can help us reduce the entropy
Note that we cannot actually reduce the entropy of the source, as long as our coding is lossless
Instead, we are reducing our estimate of the entropy
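Reusing entropy() from the sketch above on the difference distribution:

    print(entropy([13/16, 3/16]))   # ~0.696, i.e. the 0.70 bits/symbol quoted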
Entropy Example #2
Consider the sequence:
1 2 1 2 3 3 3 3 1 2 3 3 3 3 1 2 3 3 1 2
P(1) = P(2) = 1/4, P(3) = 1/2, H = 1.5 bits/symbol
Total bits: 20 x 1.5 = 30
Reconsider the sequence:
(1 2) (1 2) (3 3) (3 3) (1 2) (3 3) (3 3) (1 2) (3 3) (1 2)
P(1 2) = 1/2, P(3 3) = 1/2
H = 1 bit/symbol x 10 symbols = 10 bits
In theory, structure can eventually be extracted by taking larger samples
In reality, we need an accurate model, as it is often impractical to observe a source for long
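A sketch of the blocking trick, again reusing entropy() from above:

    from collections import Counter

    seq = [1,2,1,2,3,3,3,3,1,2,3,3,3,3,1,2,3,3,1,2]
    pairs = list(zip(seq[::2], seq[1::2]))     # block the sequence into pairs
    probs = [c / len(pairs) for c in Counter(pairs).values()]
    print(entropy(probs) * len(pairs))         # 10.0 bits total, vs 30 unblocked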
Entropy Example #3
Physical models
Based on understanding of the process generating
the data
E.g., speech
A good model leads to good compression
Usually impractical
Empirical data instead
Statistical methods can help take a proper sample
Models
Ignorance model
1. Assume each letter is generated independently from the rest
2. Assume all letters are generated with equal probability
Examples?
ASCII, RGB, CDDA, …
Improvement: drop assumption 2:
A = {a_1, a_2, …, a_n}, P = {P(a_1), P(a_2), …, P(a_n)}
Very efficient coding schemes already exist
Note:
If assumption 1 does not hold, a better solution likely exists
Probability Models
Assume that each output symbol depends on the previous k ones. Formally:
Let {x_n} be a sequence of observations
We call {x_n} a k-th order discrete Markov chain (DMC) if
Markov Models
P(x_n | x_{n-1}, …, x_{n-k}) = P(x_n | x_{n-1}, …, x_{n-k}, …)
Usually, we use a first-order DMC:
P(x_n | x_{n-1}) = P(x_n | x_{n-1}, …, x_{n-k}, …)
Knowledge of the past k symbols is equivalent to knowledge of the entire process history
Linear dependency model:
x_n = x_{n-1} + ε_n, where ε_n is white noise
Non-linear Markov Models
Consider a BW image as a string of black & white pixels (e.g. row-by-row)
Define two states, S_b & S_w, for the current pixel
Define probabilities:
P(S_b) = prob. of being in S_b
P(S_w) = prob. of being in S_w
Transition probabilities:
P(b|b), P(b|w), P(w|b), P(w|w)
[State diagram: S_b <-> S_w with transitions P(b|w), P(w|b) and self-loops P(b|b), P(w|w)]
P(b|b) = 1 - P(w|b), P(w|w) = 1 - P(b|w)
H(S_b) = -P(w|b) log2 P(w|b) - P(b|b) log2 P(b|b)
H(S_w) = -P(b|w) log2 P(b|w) - P(w|w) log2 P(w|w)
H = P(S_b) H(S_b) + P(S_w) H(S_w)
Assume
Markov Model (MM) Example
P(S_w) = 30/31, P(S_b) = 1/31
P(w|w) = 0.99, P(b|w) = 0.01, P(b|b) = 0.7, P(w|b) = 0.3
For the iid model:
H_iid = 0.206 bits
For the Markov model:
H(S_b) = -0.3 log2 0.3 - 0.7 log2 0.7 = 0.881
H(S_w) = -0.01 log2 0.01 - 0.99 log2 0.99 = 0.081
H_Markov = (30/31) × 0.081 + (1/31) × 0.881 = 0.107 bits
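A sketch reproducing both estimates (probabilities as assumed above; the helper H() is illustrative):

    from math import log2

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    P_Sw, P_Sb = 30/31, 1/31
    print(H([P_Sw, P_Sb]))               # iid model: ~0.206 bits
    H_Sb = H([0.3, 0.7])                 # from P(w|b), P(b|b)
    H_Sw = H([0.01, 0.99])               # from P(b|w), P(w|w)
    print(P_Sw * H_Sw + P_Sb * H_Sb)     # Markov model: ~0.107 bits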
In written English, the probability of the next letter is heavily influenced by the previous ones
E.g. u after q
Shannon's work (natural alphabet: 26 letters + space):
2nd-order MM: H = 3.1 bits/letter
Word-based model: H = 2.4 bits/letter
Longer context => better prediction
Practical concerns:
Context model storage (e.g. 4th-order w/ 95 chars = 95^4 contexts)
Zero-frequency problem
Markov Models in Text Compression
Different realizations of the source output may vary considerably in terms of repeating patterns
Hence, context-modeling schemes tend to be adaptive:
Probabilities for the different symbols in the different contexts are updated as they are encountered
Thus we will often encounter symbols that have not been seen before in a given context
This is called the zero-frequency problem
The larger the context, the more often this will happen
Zero-frequency problem
Send a code to indicate that the following symbol is being encountered for the first time, followed by a pre-arranged code for the symbol itself (an escape mechanism)
If the situation does not occur too often, the overhead is not significant
But for longer contexts, the problem cannot be ignored
One solution: the prediction with partial match (PPM) technique
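A minimal sketch of the escape mechanism with a one-letter context (the adaptive-count scheme here is illustrative, not full PPM):

    from collections import defaultdict

    counts = defaultdict(lambda: defaultdict(int))   # counts[context][symbol]

    def emit(context, symbol):
        # First occurrence in this context: signal an escape, then use a
        # pre-arranged code; otherwise code from the adaptive distribution.
        if counts[context][symbol] == 0:
            print(f"{context!r}: ESC + pre-arranged code for {symbol!r}")
        else:
            print(f"{context!r}: adaptive code for {symbol!r}")
        counts[context][symbol] += 1                 # update the model as we go

    text = "abracadabra"
    for prev, cur in zip(text, text[1:]):
        emit(prev, cur)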
