
A negative

Main article: Photographic film

A 35 mm filmstrip.

Film for 135 film cameras comes in long narrow strips of chemical-coated plastic or cellulose acetate. After each image is captured by the camera onto the film strip, the film strip is advanced so that the next image is projected onto unexposed film. When the film is developed, it is a long strip of small negative images. This strip is often cut into sections for easier handling. In larger cameras this piece of film may be as large as a full sheet of paper or even larger, with a single image captured onto one piece. Each of these negative images may be referred to as a negative and the entire strip or set of images may be collectively referred to as negatives. These negative images are the master images, from which all other copies will be made, and thus they are treated with care.

Negative image

A positive image is a normal image. A negative image is a total inversion of a positive image, in which light areas appear dark and dark areas appear light. A negative color image is additionally color-reversed, with red areas appearing cyan, greens appearing magenta, and blues appearing yellow; in practice this can also cause greens to appear a reddish brown. Film negatives usually also have much less contrast than the final images. This is compensated for by the higher-contrast reproduction of photographic paper, or by increasing the contrast when scanning and post-processing the scanned images.
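In digital terms, the inversion is just a per-channel subtraction from the maximum value. The following is a minimal sketch, assuming an 8-bit RGB image handled with NumPy and Pillow; the file name photo.jpg is hypothetical.

# A minimal sketch of digital negative inversion for an 8-bit RGB image,
# assuming NumPy and Pillow are available; "photo.jpg" is a hypothetical file.
import numpy as np
from PIL import Image

img = np.asarray(Image.open("photo.jpg").convert("RGB"), dtype=np.uint8)

# Subtract every channel from 255: light becomes dark, red becomes cyan, and so on.
negative = 255 - img

Image.fromarray(negative).save("photo_negative.png")

This only models the tonal and colour reversal described above; it does not reproduce the lower contrast or the orange mask of real colour-negative film.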

Negative film

Many photographic processes create negative images: the chemicals involved react when exposed to light, and during developing these exposed chemicals are retained and become opaque while the unexposed chemicals are washed away. However, when a negative image is created from a negative image (just as multiplying two negative numbers in mathematics gives a positive), a positive image results (see Color print film, C-41 process). This makes most chemical-based photography a two-step process. These are called negative films and processes. Special films and development processes have been devised such that positive images can be created directly from film; these are called positive, or slide, or (perhaps confusingly) reversal films (see Transparency, Black and white reversal film, E-6 process).

Despite the market's evolution away from film, there is still a desire and market for products which allow fine art photographers to produce negatives from digital images for their use in alternative processes such as cyanotypes, gum bichromate, platinum prints, and many others.[1]

Edge detection
From Wikipedia, the free encyclopedia

Edge detection is a fundamental tool in image processing and computer vision, particularly in the areas of feature detection and feature extraction, which aim at identifying points in a digital image at which the image brightness changes sharply or, more formally, has discontinuities. The same problem of finding discontinuities in 1D signals is known as step detection.

Motivations

Canny edge detection applied to a photograph

The purpose of detecting sharp changes in image brightness is to capture important events and changes in properties of the world. It can be shown that under rather general assumptions for an image formation model, discontinuities in image brightness are likely to correspond to:[1][2]

discontinuities in depth,
discontinuities in surface orientation,
changes in material properties, and
variations in scene illumination.

In the ideal case, the result of applying an edge detector to an image may lead to a set of connected curves that indicate the boundaries of objects, the boundaries of surface markings, as well as curves that correspond to discontinuities in surface orientation. Thus, applying an edge detection algorithm to an image may significantly reduce the amount of data to be processed and may therefore filter out information that may be regarded as less relevant, while preserving the important structural properties of an image. If the edge detection step is successful, the subsequent task of interpreting the information contents in the original image may therefore be substantially simplified. However, it is not always possible to obtain such ideal edges from real-life images of moderate complexity. Edges extracted from non-trivial images are often hampered by fragmentation (edge curves that are not connected), missing edge segments, and false edges that do not correspond to interesting phenomena in the image, complicating the subsequent task of interpreting the image data.[3] Edge detection is one of the fundamental steps in image processing, image analysis, image pattern recognition, and computer vision techniques. During recent years, however, substantial (and successful) research has also been done on computer vision methods that do not explicitly rely on edge detection as a pre-processing step.

Edge properties

The edges extracted from a two-dimensional image of a three-dimensional scene can be classified as either viewpoint dependent or viewpoint independent. A viewpoint independent edge typically reflects inherent properties of the three-dimensional objects, such as surface markings and surface shape. A viewpoint dependent edge may change as the viewpoint changes, and typically reflects the geometry of the scene, such as objects occluding one another. A typical edge might for instance be the border between a block of red color and a block of yellow. In contrast, a line (as can be extracted by a ridge detector) can be a small number of pixels of a different color on an otherwise unchanging background. There will therefore usually be one edge on each side of the line.

1. INTRODUCTION
Edge detection refers to the process of identifying and locating sharp discontinuities in an image. The discontinuities are abrupt changes in pixel intensity which characterize boundaries of objects in a scene. Classical methods of edge detection involve convolving the image with an operator (a 2-D filter), which is constructed to be sensitive to large gradients in the image while returning values of zero in uniform regions.

There are an extremely large number of edge detection operators available, each designed to be sensitive to certain types of edges. Variables involved in the selection of an edge detection operator include edge orientation, noise environment, and edge structure. The geometry of the operator determines a characteristic direction in which it is most sensitive to edges. Operators can be optimized to look for horizontal, vertical, or diagonal edges. Edge detection is difficult in noisy images, since both the noise and the edges contain high-frequency content. Attempts to reduce the noise result in blurred and distorted edges. Operators used on noisy images are typically larger in scope, so they can average enough data to discount localized noisy pixels.

Edge detection is a fundamental tool used in most image processing applications to obtain information from the frames as a precursor step to feature extraction and object segmentation. This process detects outlines of an object and boundaries between objects and the background in the image. An edge-detection filter can also be used to improve the appearance of blurred or anti-aliased image streams. The basic edge-detection operator is a matrix area gradient operation that determines the level of variance between different pixels. The edge-detection operator is calculated by forming a matrix centered on a pixel chosen as the center of the matrix area. If the value of this matrix area is above a given threshold, then the middle pixel is classified as an edge. Examples of gradient-based edge detectors are Roberts, Prewitt, and Sobel operators. All the gradient-based algorithms have kernel operators that calculate the strength of the slope in directions, which are orthogonal to each other, commonly vertical and horizontal. Later, the contributions of the different components of the slopes are combined to give the total value of the edge strength. The Prewitt operator measures two components.
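As a concrete illustration of a gradient-based operator, here is a minimal sketch of a Sobel detector using NumPy and SciPy. The kernels are the standard Sobel kernels and the threshold value is an arbitrary illustrative choice, not something prescribed by the text above.

# A minimal sketch of a gradient-based edge detector using the Sobel kernels,
# assuming the input is a greyscale image given as a 2-D NumPy array.
import numpy as np
from scipy.ndimage import convolve

def sobel_edges(gray, threshold=100.0):
    # Two kernels sensitive to orthogonal directions: gx responds to vertical
    # edges, gy to horizontal edges.
    gx_kernel = np.array([[-1, 0, 1],
                          [-2, 0, 2],
                          [-1, 0, 1]], dtype=float)
    gy_kernel = gx_kernel.T
    gray = gray.astype(float)
    gx = convolve(gray, gx_kernel, mode="nearest")
    gy = convolve(gray, gy_kernel, mode="nearest")
    # Combine the two orthogonal slope components into a total edge strength,
    # then threshold to classify each pixel as edge or non-edge.
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold

# Tiny synthetic test: a dark left half against a bright right half.
image = np.zeros((8, 8))
image[:, 4:] = 255.0
print(sobel_edges(image).astype(int))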

An Introduction to Image Compression


Compressing an image is significantly different from compressing raw binary data. Of course, general-purpose compression programs can be used to compress images, but the result is less than optimal. This is because images have certain statistical properties which can be exploited by encoders specifically designed for them. Also, some of the finer details in the image can be sacrificed for the sake of saving a little more bandwidth or storage space, which means that lossy compression techniques can be used in this area. Lossless compression involves compressing data which, when decompressed, will be an exact replica of the original data. This is the case when binary data such as executables and documents are compressed: they need to be exactly reproduced when decompressed. On the other hand, images (and music too) need not be reproduced exactly. An approximation of the original image is enough for most purposes, as long as the error between the original and the compressed image is tolerable.
Error Metrics

Two of the error metrics used to compare the various image compression techniques are the Mean Square Error (MSE) and the Peak Signal to Noise Ratio (PSNR). The MSE is the cumulative squared error between the compressed and the original image, whereas PSNR is a measure of the peak error. The mathematical formulae for the two are

MSE = (1 / (M*N)) * Σx Σy [ I(x,y) − I'(x,y) ]²

PSNR = 20 * log10( 255 / sqrt(MSE) )

where I(x,y) is the original image, I'(x,y) is the approximated version (which is actually the decompressed image) and M, N are the dimensions of the images. A lower value for MSE means less error, and as seen from the inverse relation between MSE and PSNR, this translates to a high value of PSNR. Logically, a higher value of PSNR is good because it means that the ratio of signal to noise is higher. Here, the 'signal' is the original image, and the 'noise' is the error in reconstruction. So, a compression scheme having a lower MSE (and a correspondingly high PSNR) can be recognised as the better one.
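The two metrics are easy to compute directly. The sketch below, assuming NumPy and 8-bit greyscale images of identical dimensions, follows the formulas above with a peak value of 255.

# A small sketch of the MSE and PSNR formulas for 8-bit images, assuming both
# inputs are NumPy arrays with the same M x N shape.
import numpy as np

def mse(original, approx):
    diff = original.astype(float) - approx.astype(float)
    return np.mean(diff ** 2)              # (1 / (M*N)) * sum of squared errors

def psnr(original, approx, peak=255.0):
    err = mse(original, approx)
    if err == 0:
        return float("inf")                # identical images: no reconstruction error
    return 20.0 * np.log10(peak / np.sqrt(err))

# Example: an "approximated" image that differs from the original by small noise.
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
approx = np.clip(original.astype(int) + rng.integers(-5, 6, size=(64, 64)), 0, 255)
print(f"MSE  = {mse(original, approx):.2f}")
print(f"PSNR = {psnr(original, approx):.2f} dB")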
The Outline

We'll take a close look at compressing grey scale images. The algorithms explained can be easily extended to colour images, either by processing each of the colour planes separately, or by transforming the image from RGB representation to other convenient representations like YUV in which the processing is much easier.

The usual steps involved in compressing an image are:
1. Specifying the rate (bits available) and distortion (tolerable error) parameters for the target image.
2. Dividing the image data into various classes, based on their importance.
3. Dividing the available bit budget among these classes, such that the distortion is a minimum.
4. Quantizing each class separately using the bit allocation information derived in step 3.
5. Encoding each class separately using an entropy coder and writing the result to the file.

Remember, this is how 'most' image compression techniques work, but there are exceptions. One example is the fractal image compression technique, where possible self-similarity within the image is identified and used to reduce the amount of data required to reproduce the image. Traditionally these methods have been time consuming, but some recent methods promise to speed up the process. Literature regarding fractal image compression can be found at <findout>.

Reconstructing the image from the compressed data is usually a faster process than compression. The steps involved are:
1. Reading in the quantized data from the file, using an entropy decoder (reverse of step 5).
2. Dequantizing the data (reverse of step 4).
3. Rebuilding the image (reverse of step 2).


Run-length encoding (RLE) is a very simple form of data compression in which runs of data (that is, sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. This is most useful on data that contains many such runs: for example, simple graphic images such as icons, line drawings, and animations. It is not useful with files that don't have many runs, as it could greatly increase the file size. RLE also refers to a little-used image format in Windows 3.x, with the extension rle, which is a Run Length Encoded Bitmap, used to compress the Windows 3.x startup screen.

Example

For example, consider a screen containing plain black text on a solid white background. There will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text. Let us take a hypothetical single scan line, with B representing a black pixel and W representing white:

WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

If we apply the run-length encoding (RLE) data compression algorithm to the above hypothetical scan line, we get the following:

12W1B12W3B24W1B14W

This is to be interpreted as twelve Ws, one B, twelve Ws, three Bs, etc. The run-length code represents the original 67 characters in only 18. Of course, the actual format used for the storage of images is generally binary rather than ASCII characters like this, but the principle remains the same. Even binary data files can be compressed with this method; file format specifications often dictate repeated bytes in files as padding space. However, newer compression methods such as DEFLATE often use LZ77-based algorithms, a generalization of run-length encoding that can take advantage of runs of strings of characters (such as BWWBWWBWWBWW).

Applications

Run-length encoding performs lossless data compression and is well suited to palette-based iconic images. It does not work well at all on continuous-tone images such as photographs, although JPEG uses it quite effectively on the coefficients that remain after transforming and quantizing image blocks.
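The scan-line example can be reproduced in a few lines. The sketch below is plain Python; the count-plus-symbol text format is only for illustration, since, as noted above, real image formats store the runs in binary.

# A minimal run-length encoder/decoder reproducing the scan-line example above.
from itertools import groupby
import re

def rle_encode(data):
    # Collapse each run of identical symbols into "<count><symbol>".
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(data))

def rle_decode(encoded):
    # Expand each "<count><symbol>" pair back into a run.
    return "".join(symbol * int(count)
                   for count, symbol in re.findall(r"(\d+)(\D)", encoded))

line = "W" * 12 + "B" + "W" * 12 + "BBB" + "W" * 24 + "B" + "W" * 14
encoded = rle_encode(line)
print(encoded)                         # 12W1B12W3B24W1B14W
print(len(line), "->", len(encoded))   # 67 -> 18
assert rle_decode(encoded) == line     # lossless round trip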

Common formats for run-length encoded data include Truevision TGA, PackBits, PCX and ILBM. Run-length encoding is used in fax machines (combined with other techniques into Modified Huffman coding). It is relatively efficient because most faxed documents are mostly white space, with occasional interruptions of black.

Entropy encoding
From Wikipedia, the free encyclopedia

In information theory an entropy encoding is a lossless data compression scheme that is independent of the specific characteristics of the medium. One of the main types of entropy coding creates and assigns a unique prefix-free code to each unique symbol that occurs in the input. These entropy encoders then compress data by replacing each fixed-length input symbol by the corresponding variable-length prefix-free output codeword. The length of each codeword is approximately proportional to the negative logarithm of the probability, so the most common symbols use the shortest codes. According to Shannon's source coding theorem, the optimal code length for a symbol is −logb P, where b is the number of symbols used to make output codes and P is the probability of the input symbol. Two of the most common entropy encoding techniques are Huffman coding and arithmetic coding. If the approximate entropy characteristics of a data stream are known in advance (especially for signal compression), a simpler static code may be useful. These static codes include universal codes (such as Elias gamma coding or Fibonacci coding) and Golomb codes (such as unary coding or Rice coding).

Entropy as a measure of similarity

Besides using entropy encoding as a way to compress digital data, an entropy encoder can also be used to measure the amount of similarity between streams of data. This is done by generating an entropy coder/compressor for each class of data; unknown data is then classified by feeding the uncompressed data to each compressor and seeing which compressor yields the highest compression. The coder with the best compression is probably the coder trained on the data that was most similar to the unknown data.
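A rough sketch of this idea, using zlib as an off-the-shelf stand-in for a per-class entropy coder: each class is represented by a sample of its data, and the unknown string is assigned to the class whose sample reduces the cost of compressing it the most. This is only an illustration of the principle, not the trained per-class coder described above.

# A rough sketch of similarity measurement by compression, using zlib as a
# stand-in compressor. The unknown text is appended to each class sample; the
# class whose sample makes the combined data cheapest to compress is chosen.
import zlib

def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def classify(unknown: str, class_samples: dict) -> str:
    extra_cost = {}
    for label, sample in class_samples.items():
        extra_cost[label] = (compressed_size((sample + unknown).encode())
                             - compressed_size(sample.encode()))
    return min(extra_cost, key=extra_cost.get)   # smallest added cost wins

samples = {
    "english": "the quick brown fox jumps over the lazy dog " * 20,
    "digits":  "3141592653589793238462643383279502884197 " * 20,
}
print(classify("the lazy dog jumps over the quick brown fox", samples))   # expected: english
print(classify("2643383279502884197169399375105820974944", samples))      # expected: digits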

Huffman coding
From Wikipedia, the free encyclopedia


Huffman tree generated from the exact frequencies of the text "this is an example of a huffman tree". The frequencies and codes of each character are below. Encoding the sentence with this code requires 135 bits, as opposed to 288 bits if 36 characters of 8 bits were used. (This assumes that the code tree structure is known to the decoder and thus does not need to be counted as part of the transmitted information.)

(Table: Char, Freq and Code columns listing the frequency and Huffman codeword of each character in the example text.)

In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol. It was developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes". Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code (sometimes called "prefix-free codes", that is, the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol) that expresses the most common source symbols using shorter strings of bits than are used for less common source symbols. Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code. A method was later found to design a Huffman code in linear time if input probabilities (also known as weights) are sorted.[citation needed]

For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding, e.g., ASCII coding. Huffman coding is such a widespread method for creating prefix codes that the term "Huffman code" is widely used as a synonym for "prefix code" even when such a code is not produced by Huffman's algorithm. Although Huffman's original algorithm is optimal for a symbol-by-symbol coding (i.e. a stream of unrelated symbols) with a known input probability distribution, it is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown, not identically distributed, or not independent (e.g., "cat" is more common than "cta"). Other methods such as arithmetic coding and LZW coding often have better compression capability: both of these methods can combine an arbitrary number of symbols for more efficient coding, and generally adapt to the actual input statistics, the latter of which is useful when input probabilities are not precisely known or vary significantly within the stream. However, the limitations of Huffman coding should not be overstated; it can be used adaptively, accommodating unknown, changing, or context-dependent probabilities. In the case of known independent and identically-distributed random variables, combining symbols together reduces inefficiency in a way that approaches optimality as the number of symbols combined increases.

History

In 1951, David A. Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree and quickly proved this method the most efficient.[1] In doing so, the student outdid his professor, who had worked with information theory inventor Claude Shannon to develop a similar code. Huffman avoided the major flaw of the suboptimal Shannon-Fano coding by building the tree from the bottom up instead of from the top down.

Problem definition

Informal description

Given: A set of symbols and their weights (usually proportional to probabilities).
Find: A prefix-free binary code (a set of codewords) with minimum expected codeword length (equivalently, a tree with minimum weighted path length from the root).

Formalized description

Input. Alphabet A = (a1, a2, ..., an), which is the symbol alphabet of size n. Set W = (w1, w2, ..., wn), which is the set of the (positive) symbol weights (usually proportional to probabilities), i.e. wi = weight(ai) for 1 ≤ i ≤ n.

Output. Code C(A, W) = (c1, c2, ..., cn), which is the set of (binary) codewords, where ci is the codeword for ai, 1 ≤ i ≤ n.

Goal. Let L(C) = w1·length(c1) + w2·length(c2) + ... + wn·length(cn) be the weighted path length of code C. Condition: L(C) ≤ L(T) for any code T(A, W).

Samples

Input (A, W)
  Symbol (ai)                              a      b      c      d      e      Sum
  Weights (wi)                             0.10   0.15   0.30   0.16   0.29   = 1

Output C
  Codewords (ci)                           010    011    11     00     10
  Codeword length in bits (li)             3      3      2      2      2
  Weighted path length (li · wi)           0.30   0.45   0.60   0.32   0.58   L(C) = 2.25

Optimality
  Probability budget (2^−li)               1/8    1/8    1/4    1/4    1/4    = 1.00
  Information content in bits (−log2 wi)   3.32   2.74   1.74   2.64   1.79
  Entropy (−wi · log2 wi)                  0.332  0.411  0.521  0.423  0.518  H(A) = 2.205

For any code that is biunique, meaning that the code is uniquely decodeable, the sum of the probability budgets across all symbols is always less than or equal to one. In this example, the sum is strictly equal to one; as a result, the code is termed a complete code. If this is not the case, you can always derive an equivalent code by adding extra symbols (with associated null probabilities), to make the code complete while keeping it biunique.

As defined by Shannon (1948), the information content h (in bits) of each symbol ai with non-null probability wi is

h(ai) = log2(1 / wi) = −log2(wi).

The entropy H (in bits) is the weighted sum, across all symbols ai with non-zero probability wi, of the information content of each symbol:

H(A) = Σi wi · h(ai) = −Σi wi · log2(wi).

(Note: a symbol with zero probability has zero contribution to the entropy, since lim(w→0+) w·log2(w) = 0; so for simplicity, symbols with zero probability can be left out of the formula above.)

As a consequence of Shannon's source coding theorem, the entropy is a measure of the smallest codeword length that is theoretically possible for the given alphabet with associated weights. In this example, the weighted average codeword length is 2.25 bits per symbol, only slightly larger than the calculated entropy of 2.205 bits per symbol. So not only is this code optimal in the sense that no other feasible code performs better, but it is very close to the theoretical limit established by Shannon. Note that, in general, a Huffman code need not be unique, but it is always one of the codes minimizing L(C).
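The figures in the table are easy to verify; the following lines recompute the weighted path length L(C), the entropy H(A), and the Kraft sum of the probability budgets for the sample weights and codewords.

# Recomputing the sample table above: weighted path length, entropy and the
# sum of the probability budgets (the Kraft sum).
from math import log2

weights   = {"a": 0.10, "b": 0.15, "c": 0.30, "d": 0.16, "e": 0.29}
codewords = {"a": "010", "b": "011", "c": "11", "d": "00", "e": "10"}

L = sum(w * len(codewords[s]) for s, w in weights.items())
H = -sum(w * log2(w) for w in weights.values())
kraft = sum(2 ** -len(c) for c in codewords.values())

print(f"L(C)      = {L:.3f} bits/symbol")   # 2.250
print(f"H(A)      = {H:.3f} bits/symbol")   # 2.205
print(f"Kraft sum = {kraft:.2f}")           # 1.00, so the code is complete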

Basic technique

Compression

A source generates 4 different symbols {a1, a2, a3, a4} with probabilities {0.4, 0.35, 0.2, 0.05}. A binary tree is generated from left to right taking the two least probable symbols and putting them together to form another equivalent symbol having a probability that equals the sum of the two symbols. The process is repeated until there is just one symbol. The tree can then be read backwards, from right to left, assigning different bits to different branches. The final Huffman code is:

Symbol   Code
a1       0
a2       10
a3       110
a4       111

The standard way to represent a signal made of 4 symbols is by using 2 bits/symbol, but the entropy of the source is 1.74 bits/symbol. If this Huffman code is used to represent the signal, then the average length is lowered to 1.85 bits/symbol; it is still far from the theoretical limit because the probabilities of the symbols are different from negative powers of two.

The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols, n. A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and, optionally, a link to a parent node which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain a symbol weight, links to two child nodes and the optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has up to n leaf nodes and n − 1 internal nodes. A Huffman tree that omits unused symbols produces the optimal code lengths. The process essentially begins with the leaf nodes containing the probabilities of the symbol they represent; then a new node whose children are the 2 nodes with smallest probability is created, such that the new node's probability is equal to the sum of the children's probability. With the previous 2 nodes merged into one node (thus not considering them anymore), and with the new node being now considered, the procedure is repeated until only one node remains: the Huffman tree.

The simplest construction algorithm uses a priority queue where the node with lowest probability is given highest priority:

1. Create a leaf node for each symbol and add it to the priority queue.
2. While there is more than one node in the queue:
   1. Remove the two nodes of highest priority (lowest probability) from the queue.
   2. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
   3. Add the new node to the queue.
3. The remaining node is the root node and the tree is complete.
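The priority-queue construction translates almost directly into code. The sketch below uses Python's heapq module and the four-symbol example from the previous section; the exact bit patterns it prints may differ from the table above (Huffman codes are not unique), but the codeword lengths and the 1.85 bits/symbol average are the same.

# A minimal sketch of the priority-queue construction described above, with
# heapq as the priority queue. An insertion counter breaks ties so that the
# tree structures themselves never have to be compared.
import heapq
from itertools import count

def huffman_codes(probabilities):
    tiebreak = count()
    # Heap entries are (probability, tie-breaker, tree); a tree is either a
    # bare symbol (leaf) or a (left, right) pair (internal node).
    heap = [(p, next(tiebreak), symbol) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)     # the two lowest-probability nodes
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    _, _, root = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):           # internal node: 0 = left, 1 = right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                 # leaf: record the accumulated bits
            codes[node] = prefix or "0"
    walk(root, "")
    return codes

probs = {"a1": 0.4, "a2": 0.35, "a3": 0.2, "a4": 0.05}
codes = huffman_codes(probs)
print(codes)
average = sum(p * len(codes[s]) for s, p in probs.items())
print(f"average length: {average:.2f} bits/symbol")   # 1.85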

Since efficient priority queue data structures require O(log n) time per insertion, and a tree with n leaves has 2n − 1 nodes, this algorithm operates in O(n log n) time, where n is the number of symbols.

If the symbols are sorted by probability, there is a linear-time (O(n)) method to create a Huffman tree using two queues, the first one containing the initial weights (along with pointers to the associated leaves), and combined weights (along with pointers to the trees) being put in the back of the second queue. This assures that the lowest weight is always kept at the front of one of the two queues:

1. Start with as many leaves as there are symbols.
2. Enqueue all leaf nodes into the first queue (by probability in increasing order so that the least likely item is in the head of the queue).
3. While there is more than one node in the queues:
   1. Dequeue the two nodes with the lowest weight by examining the fronts of both queues.
   2. Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
   3. Enqueue the new node into the rear of the second queue.
4. The remaining node is the root node; the tree has now been generated.

Although this algorithm may appear "faster" complexity-wise than the previous algorithm using a priority queue, this is not actually the case, because the symbols need to be sorted by probability beforehand, a process that takes O(n log n) time in itself.

In many cases, time complexity is not very important in the choice of algorithm here, since n here is the number of symbols in the alphabet, which is typically a very small number (compared to the length of the message to be encoded), whereas complexity analysis concerns the behavior when n grows to be very large. It is generally beneficial to minimize the variance of codeword length. For example, a communication buffer receiving Huffman-encoded data may need to be larger to deal with especially long symbols if the tree is especially unbalanced. To minimize variance, simply break ties between queues by choosing the item in the first queue. This modification will retain the mathematical optimality of the Huffman coding while both minimizing variance and minimizing the length of the longest character code. An example of this uses the French subject string "j'aime aller sur le bord de l'eau les jeudis ou les jours impairs".

Arithmetic coding
From Wikipedia, the free encyclopedia

Arithmetic coding is a form of variable-length entropy encoding used in lossless data compression. Normally, a string of characters such as the words "hello there" is represented using a fixed number of bits per character, as in the ASCII code. When a string is converted to arithmetic encoding, frequently used characters will be stored with fewer bits and not-so-frequently occurring characters will be stored with more bits, resulting in fewer bits used in total. Arithmetic coding differs from other forms of entropy encoding such as Huffman coding in that rather than separating the input into component symbols and replacing each with a code, arithmetic coding encodes the entire message into a single number, a fraction n where 0.0 ≤ n < 1.0.

Implementation details and examples

Equal probabilities

In the simplest case, the probability of each symbol occurring is equal. For example, consider a sequence taken from a set of three symbols, A, B, and C, each equally likely to occur. Simple block encoding would use 2 bits per symbol, which is wasteful: one of the bit variations is never used. A more efficient solution is to represent the sequence as a rational number between 0 and 1 in base 3, where each digit represents a symbol. For example, the sequence "ABBCAB" could become 0.011201 in base 3. The next step is to encode this ternary number using a fixed-point binary number of sufficient precision to recover it, such as 0.001011001 in binary; this is only 9 bits, 25% smaller than the naive block encoding. This is feasible for long sequences because there are efficient, in-place algorithms for converting the base of arbitrarily precise numbers. To decode the value, knowing the original string had length 6, one can simply convert back to base 3, round to 6 digits, and recover the string.

Defining a model

In general, arithmetic coders can produce near-optimal output for any given set of symbols and probabilities (the optimal value is −log2 P bits for each symbol of probability P; see the source coding theorem). Compression algorithms that use arithmetic coding start by determining a model of the data: basically, a prediction of what patterns will be found in the symbols of the message. The more accurate this prediction is, the closer to optimal the output will be.

Example: a simple, static model for describing the output of a particular monitoring instrument over time might be:

60% chance of symbol NEUTRAL
20% chance of symbol POSITIVE
10% chance of symbol NEGATIVE
10% chance of symbol END-OF-DATA. (The presence of this symbol means that the stream will be 'internally terminated', as is fairly common in data compression; when this symbol appears in the data stream, the decoder will know that the entire stream has been decoded.)

Models can also handle alphabets other than the simple four-symbol set chosen for this example. More sophisticated models are also possible: higher-order modelling changes its estimation of the current probability of a symbol based on the symbols that precede it (the context), so that in a model for English text, for example, the percentage chance of "u" would be much higher when it follows a "Q" or a "q". Models can even be adaptive, so that they continuously change their prediction of the data based on what the stream actually contains. The decoder must have the same model as the encoder.

Encoding and decoding: overview

In general, each step of the encoding process, except for the very last, is the same; the encoder has basically just three pieces of data to consider:

the next symbol that needs to be encoded;
the current interval (at the very start of the encoding process, the interval is set to [0, 1), but that will change);
the probabilities the model assigns to each of the various symbols that are possible at this stage (as mentioned earlier, higher-order or adaptive models mean that these probabilities are not necessarily the same in each step).

The encoder divides the current interval into sub-intervals, each representing a fraction of the current interval proportional to the probability of that symbol in the current context. Whichever interval corresponds to the actual symbol that is next to be encoded becomes the interval used in the next step. Example: for the four-symbol model above:

the interval for NEUTRAL would be [0, 0.6)
the interval for POSITIVE would be [0.6, 0.8)
the interval for NEGATIVE would be [0.8, 0.9)
the interval for END-OF-DATA would be [0.9, 1).

When all symbols have been encoded, the resulting interval unambiguously identifies the sequence of symbols that produced it. Anyone who has the same final interval and model that is being used can reconstruct the symbol sequence that must have entered the encoder to result in that final interval. It is not necessary to transmit the final interval, however; it is only necessary to transmit one fraction that lies within that interval. In particular, it is only necessary to transmit enough digits (in whatever base) of the fraction so that all fractions that begin with those digits fall into the final interval.
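The interval-narrowing step can be written down directly from this description. The toy encoder below uses exact fractions and the static four-symbol model above; real coders work incrementally with integer arithmetic and renormalization rather than with an ever-shrinking exact interval.

# A toy arithmetic encoder for the static four-symbol model above. Each symbol
# narrows the current interval to the sub-interval assigned to that symbol.
from fractions import Fraction

MODEL = [("NEUTRAL", Fraction(6, 10)), ("POSITIVE", Fraction(2, 10)),
         ("NEGATIVE", Fraction(1, 10)), ("END-OF-DATA", Fraction(1, 10))]

def encode(symbols):
    low, width = Fraction(0), Fraction(1)        # current interval [low, low + width)
    for symbol in symbols:
        cum = Fraction(0)
        for name, p in MODEL:
            if name == symbol:
                low, width = low + cum * width, p * width
                break
            cum += p
    return low, low + width

low, high = encode(["NEUTRAL", "NEGATIVE", "END-OF-DATA"])
print(float(low), float(high))   # final interval [0.534, 0.54): any fraction inside it,
                                 # for example 0.538, identifies the message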

Encoding and decoding: example

A diagram showing decoding of 0.538 (the circular point) in the example model. The region is divided into subregions proportional to symbol frequencies, then the subregion containing the point is successively subdivided in the same way.

Consider the process for decoding a message encoded with the given four-symbol model. The message is encoded in the fraction 0.538 (using decimal for clarity, instead of binary; also assuming that there are only as many digits as needed to decode the message.) The process starts with the same interval used by the encoder: [0,1), and using the same model, dividing it into the same four sub-intervals that the encoder must have. The fraction 0.538 falls into the sub-interval for NEUTRAL, [0, 0.6); this indicates that the first symbol the encoder read must have been NEUTRAL, so this is the first symbol of the message. Next divide the interval [0, 0.6) into sub-intervals:

the interval for NEUTRAL would be [0, 0.36) -- 60% of [0, 0.6)
the interval for POSITIVE would be [0.36, 0.48) -- 20% of [0, 0.6)
the interval for NEGATIVE would be [0.48, 0.54) -- 10% of [0, 0.6)
the interval for END-OF-DATA would be [0.54, 0.6) -- 10% of [0, 0.6)

Since .538 is within the interval [0.48, 0.54), the second symbol of the message must have been NEGATIVE. Again divide our current interval into sub-intervals:

the interval for NEUTRAL would be [0.48, 0.516)
the interval for POSITIVE would be [0.516, 0.528)
the interval for NEGATIVE would be [0.528, 0.534)
the interval for END-OF-DATA would be [0.534, 0.540).

Now .538 falls within the interval of the END-OF-DATA symbol; therefore, this must be the next symbol. Since it is also the internal termination symbol, it means the decoding is complete. If the stream is not internally terminated, there needs to be some other way to indicate where the stream stops. Otherwise, the decoding process could continue forever, mistakenly reading more symbols from the fraction than were in fact encoded into it.
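The decoding walk can likewise be coded directly; the toy decoder below reproduces the example, recovering NEUTRAL, NEGATIVE, END-OF-DATA from the fraction 0.538 with the same static model. Plain floats are precise enough at this scale.

# A toy decoder for the worked example: find which sub-interval of the current
# interval contains the encoded fraction, emit that symbol, descend into the
# sub-interval, and stop when END-OF-DATA is decoded.
MODEL = [("NEUTRAL", 0.6), ("POSITIVE", 0.2), ("NEGATIVE", 0.1), ("END-OF-DATA", 0.1)]

def decode(value, low=0.0, high=1.0):
    symbols = []
    while True:
        cum = low
        for name, p in MODEL:
            width = (high - low) * p
            if cum <= value < cum + width:       # the fraction falls in this sub-interval
                symbols.append(name)
                low, high = cum, cum + width
                break
            cum += width
        if symbols[-1] == "END-OF-DATA":
            return symbols

print(decode(0.538))   # ['NEUTRAL', 'NEGATIVE', 'END-OF-DATA']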

Sources of inefficiency

The message 0.538 in the previous example could have been encoded by the equally short fractions 0.534, 0.535, 0.536, 0.537 or 0.539. This suggests that the use of decimal instead of binary introduced some inefficiency. This is correct; the information content of a three-digit decimal is approximately 9.966 bits; the same message could have been encoded in the binary fraction 0.10001010 (equivalent to 0.5390625 decimal) at a cost of only 8 bits. (The final zero must be specified in the binary fraction, or else the message would be ambiguous without external information such as compressed stream size.) This 8-bit output is larger than the information content, or entropy, of the message, which is 1.57 × 3 or 4.71 bits. The large difference between the example's 8 (or 7 with external compressed data size information) bits of output and the entropy of 4.71 bits is caused by the short example message not being able to exercise the coder effectively. The claimed symbol probabilities were [0.6, 0.2, 0.1, 0.1], but the actual frequencies in this example are [0.33, 0, 0.33, 0.33]. If the intervals are readjusted for these frequencies, the entropy of the message would be 1.58 bits and the same NEUTRAL NEGATIVE END-OF-DATA message could be encoded as intervals [0, 1/3); [1/9, 2/9); [5/27, 6/27); and a binary interval of [0.001011110, 0.001110001). This could yield an output message of 111, or just 3 bits. This is also an example of how statistical coding methods like arithmetic encoding can produce an output message that is larger than the input message, especially if the probability model is off.

Adaptive arithmetic coding

One advantage of arithmetic coding over other similar methods of data compression is the convenience of adaptation. Adaptation is the changing of the frequency (or probability) tables while processing the data. The decoded data matches the original data as long as the frequency table in decoding is replaced in the same way and in the same step as in encoding. The synchronization is usually based on a combination of symbols occurring during the encoding and decoding process. Adaptive arithmetic coding significantly improves the compression ratio compared to static methods; it may be as much as two to three times as effective.

Data compression
From Wikipedia, the free encyclopedia

"Source coding" redirects here. For the term in computer programming, see Source code.

In computer science and information theory, data compression, source coding,[1] or bit-rate reduction involves encoding information using fewer bits than the original representation. Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy; no information is lost in lossless compression. Lossy compression reduces bits by identifying marginally important information and removing it. Compression is useful because it helps reduce the consumption of resources such as data space or transmission capacity. Because compressed data must be decompressed to be used, this extra processing imposes computational or other costs through decompression. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it is being decompressed, and the option to decompress the video in full before watching it may be inconvenient or require additional storage. The design of data compression schemes involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (e.g., when using lossy data compression), and the computational resources required to compress and uncompress the data.
Lossy

Lossy data compression is contrasted with lossless data compression. In lossy schemes, some loss of information is acceptable. Depending upon the application, detail can be dropped from the data to save storage space. Generally, lossy data compression schemes are guided by research on how people perceive the data in question. For example, the human eye is more sensitive to subtle variations in luminance than it is to variations in color. JPEG image compression works in part by "rounding off" less-important visual information, and there is a corresponding trade-off between information lost and the size reduction. A number of popular compression formats exploit these perceptual differences, including those used in music files, images, and video. Lossy image compression is used in digital cameras to increase storage capacity with minimal degradation of picture quality. Similarly, DVDs use the lossy MPEG-2 Video codec for video compression. In lossy audio compression, methods of psychoacoustics are used to remove non-audible (or less audible) components of the signal. Compression of human speech is often performed with even more specialized techniques, so that "speech compression" or "voice coding" is sometimes distinguished as a separate discipline from "audio compression". Different audio and speech compression standards are listed under audio codecs. Voice compression is used in Internet telephony, for example, while audio compression is used for CD ripping and is decoded by audio players.

Lossless

Lossless data compression algorithms usually exploit statistical redundancy to represent data more concisely without losing information. Lossless compression is possible because most real-world data has statistical redundancy. For example, an image may have areas of colour that do not change over several pixels; instead of coding "red pixel, red pixel, ..." the data may be encoded as "279 red pixels". This is a simple example of run-length encoding; there are many schemes to reduce size by eliminating redundancy. The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ which is optimized for decompression speed and compression ratio, but compression can be slow. DEFLATE is used in PKZIP, gzip and PNG. LZW (Lempel-Ziv-Welch) is used in GIF images. Also noteworthy are the LZR (LZ-Renau) methods, which serve as the basis of the Zip method. LZ methods utilize a table-based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g. SHRI, LZX). A current LZ-based coding scheme that performs well is LZX, used in Microsoft's CAB format. The very best modern lossless compressors use probabilistic models, such as prediction by partial matching. The Burrows-Wheeler transform can also be viewed as an indirect form of statistical modelling. The class of grammar-based codes has recently attracted attention because such codes can compress highly repetitive text extremely well, for instance biological data collections of the same or related species, huge versioned document collections, internet archives, and so on. The basic task of grammar-based codes is constructing a context-free grammar deriving a single string. Sequitur and Re-Pair are practical grammar compression algorithms for which public code is available. In a further refinement of these techniques, statistical predictions can be coupled to an algorithm called arithmetic coding. Arithmetic coding, invented by Jorma Rissanen and turned into a practical method by Witten, Neal, and Cleary, achieves superior compression to the better-known Huffman algorithm and lends itself especially well to adaptive data compression tasks where the predictions are strongly context-dependent. Arithmetic coding is used in the bilevel image-compression standard JBIG and the document-compression standard DjVu. The text entry system Dasher is an inverse arithmetic coder.

Theory

The theoretical background of compression is provided by information theory (which is closely related to algorithmic information theory) for lossless compression, and by rate-distortion theory for lossy compression. These fields of study were essentially created by Claude Shannon, who published fundamental papers on the topic in the late 1940s and early 1950s. Coding theory is also related. The idea of data compression is deeply connected with statistical inference.

Machine learning

See also: Machine learning There is a close connection between machine learning and compression: a system that predicts the posterior probabilities of a sequence given its entire history can be used for optimal data compression (by using arithmetic coding on the output distribution), while an optimal compressor can be used for prediction (by finding the symbol that compresses best, given the previous history). This equivalence has been used as justification for data compression as a benchmark for "general intelligence". [2]

Data differencing

Main article: Data differencing

Data compression can be viewed as a special case of data differencing:[3][4] data differencing consists of producing a difference given a source and a target, with patching producing a target given a source and a difference, while data compression consists of producing a compressed file given a target, and decompression consists of producing a target given only a compressed file. Thus, one can consider data compression as data differencing with empty source data, the compressed file corresponding to a "difference from nothing". This is the same as considering absolute entropy (corresponding to data compression) as a special case of relative entropy (corresponding to data differencing) with no initial data. When one wishes to emphasize the connection, one may use the term differential compression to refer to data differencing.

Outlook and currently unused potential

It is estimated that the total amount of information stored on the world's storage devices could be further compressed with existing compression algorithms by a remaining average factor of 4.5:1. It is estimated that the combined technological capacity of the world to store information provided 1,300 exabytes of hardware digits in 2007, but that when the corresponding content is optimally compressed this represents only 295 exabytes of Shannon information.[5]

Uses

Audio
See also: Audio codec

Audio data compression, as distinguished from dynamic range compression, reduces the transmission bandwidth and storage requirements of audio data. Audio compression algorithms are implemented in software as audio codecs. Lossy audio compression algorithms provide higher compression at the cost of fidelity and are used in numerous audio applications. These algorithms almost all rely on psychoacoustics to eliminate less audible or meaningful sounds, thereby reducing the space required to store or transmit them.

In both lossy and lossless compression, information redundancy is reduced, using methods such as coding, pattern recognition and linear prediction to reduce the amount of information used to represent the uncompressed data. The acceptable trade-off between loss of audio quality and transmission or storage size depends upon the application. For example, one 640 MB compact disc (CD) holds approximately one hour of uncompressed high fidelity music, less than 2 hours of music compressed losslessly, or 7 hours of music compressed in the MP3 format at a medium bit rate. A digital sound recorder can typically store around 200 hours of clearly intelligible speech in 640 MB.[6]

Lossless audio compression produces a representation of digital data that decompresses to an exact digital duplicate of the original audio stream, unlike playback from lossy compression techniques such as Vorbis and MP3. Compression ratios are around 50-60% of original size,[7] similar to those for generic lossless data compression. Lossy compression depends upon the quality required, but typically yields files of 5 to 20% of the size of the uncompressed original.[8] Lossless compression is unable to attain high compression ratios due to the complexity of wave forms and the rapid changes in sound forms. Codecs like FLAC, Shorten and TTA use linear prediction to estimate the spectrum of the signal. Many of these algorithms use convolution with the filter [-1 1] to slightly whiten or flatten the spectrum, thereby allowing traditional lossless compression to work more efficiently (a small code sketch of this step appears after the list of formats below). The process is reversed upon decompression.

When audio files are to be processed, either by further compression or for editing, it is desirable to work from an unchanged original (uncompressed or losslessly compressed). Processing of a lossily compressed file for some purpose usually produces a final result inferior to creation of the same compressed file from an uncompressed original. In addition to sound editing or mixing, lossless audio compression is often used for archival storage, or as master copies.

A number of lossless audio compression formats exist. Shorten was an early lossless format. Newer ones include Free Lossless Audio Codec (FLAC), Apple's Apple Lossless, MPEG-4 ALS, Microsoft's Windows Media Audio 9 Lossless (WMA Lossless), Monkey's Audio, and TTA. See list of lossless codecs for a complete list. Some audio formats feature a combination of a lossy format and a lossless correction; this allows stripping the correction to easily obtain a lossy file. Such formats include MPEG-4 SLS (Scalable to Lossless), WavPack, and OptimFROG DualStream. Other formats are associated with a distinct system, such as:

Direct Stream Transfer, used in Super Audio CD
Meridian Lossless Packing, used in DVD-Audio, Dolby TrueHD, Blu-ray and HD DVD
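As a concrete illustration of the first-difference prediction mentioned above, here is a minimal Python sketch (my own, not taken from any codec's source): the encoder stores the [-1 1] residual and the decoder undoes it exactly, so the scheme stays lossless. Real codecs such as FLAC use higher-order, adaptive predictors followed by an entropy coder on the residual.

import numpy as np

def predict_residual(samples):
    """First-order prediction: convolve with [-1, 1], i.e. store sample-to-sample
    differences. The residual usually has a flatter spectrum and smaller values,
    so a generic entropy coder compresses it better than the raw samples."""
    samples = np.asarray(samples, dtype=np.int64)
    residual = np.empty_like(samples)
    residual[0] = samples[0]                 # keep the first sample as-is
    residual[1:] = samples[1:] - samples[:-1]
    return residual

def reconstruct(residual):
    """Exact inverse of predict_residual: a running sum restores the original."""
    return np.cumsum(residual)

if __name__ == "__main__":
    x = np.array([100, 103, 107, 110, 112, 111], dtype=np.int64)
    r = predict_residual(x)
    assert np.array_equal(reconstruct(r), x)  # lossless round trip
    print(r)                                  # differences are small: [100 3 4 3 2 -1]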

[edit]Lossy audio compression

Comparison of acoustic spectrograms of a song in an uncompressed format and various lossy formats. The fact that the lossy spectrograms are different from the uncompressed one indicates that they are in fact lossy, but nothing can be assumed about the effect of the changes on perceived quality.

Lossy audio compression is used in a wide range of applications. In addition to the direct applications (MP3 players or computers), digitally compressed audio streams are used in most video DVDs, digital television, streaming media on the Internet, satellite and cable radio, and increasingly in terrestrial radio broadcasts. Lossy compression typically achieves far greater compression than lossless compression (data of 5 percent to 20 percent of the original stream, rather than 50 percent to 60 percent), by discarding less-critical data.

The innovation of lossy audio compression was to use psychoacoustics to recognize that not all data in an audio stream can be perceived by the human auditory system. Most lossy compression reduces perceptual redundancy by first identifying sounds which are considered perceptually irrelevant, that is, sounds that are very hard to hear. Typical examples include high frequencies, or sounds that occur at the same time as louder sounds. Those sounds are coded with decreased accuracy or not coded at all.

Due to the nature of lossy algorithms, audio quality suffers when a file is decompressed and recompressed (digital generation loss). This makes lossy compression unsuitable for storing the intermediate results in professional audio engineering applications, such as sound editing and multitrack recording. However, lossy formats are very popular with end users (particularly MP3), as a megabyte can store about a minute's worth of music at adequate quality.

[edit]Coding methods
In order to determine what information in an audio signal is perceptually irrelevant, most lossy compression algorithms use transforms such as the modified discrete cosine transform (MDCT) to convert time-domain sampled waveforms into a transform domain. Once transformed, typically into the frequency domain, component frequencies can be allocated bits according to how audible they are. Audibility of spectral components is determined by first calculating a masking threshold, below which it is estimated that sounds will be beyond the limits of human perception. The masking threshold is calculated using the absolute threshold of hearing and the principles of simultaneous masking (the phenomenon wherein a signal is masked by another signal separated from it in frequency) and, in some cases, temporal masking (where a signal is masked by another signal separated from it in time). Equal-loudness contours may also be used to weight the perceptual importance of different components. Models of the human ear-brain combination incorporating such effects are often called psychoacoustic models; a toy sketch of the transform-and-threshold idea appears below.

Other types of lossy compressors, such as the linear predictive coding (LPC) used with speech, are source-based coders. These coders use a model of the sound's generator (such as the human vocal tract with LPC) to whiten the audio signal (i.e., flatten its spectrum) prior to quantization. LPC may also be thought of as a basic perceptual coding technique; reconstruction of an audio signal using a linear predictor shapes the coder's quantization noise into the spectrum of the target signal, partially masking it.
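The following toy Python sketch illustrates the transform-and-threshold idea in the crudest possible way; it is not a real psychoacoustic model. It windows one frame, takes a DCT (standing in for the MDCT), derives a stand-in "masking" curve by smoothing the frame's own spectrum together with a fixed stand-in for the absolute threshold of hearing, and zeroes components below that curve. The frame length, smoothing kernel, offset_db, and absolute_floor_db values are all illustrative choices.

import numpy as np
from scipy.fft import dct, idct

def toy_perceptual_frame_coder(frame, offset_db=20.0, absolute_floor_db=-30.0):
    """Illustrative only: drop spectral components that fall below a crude
    'masking threshold' built from a smoothed copy of the frame's own spectrum
    and a fixed stand-in for the absolute threshold of hearing."""
    windowed = frame * np.hanning(len(frame))
    spectrum = dct(windowed, type=2, norm="ortho")
    magnitude_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)
    smoothed_db = np.convolve(magnitude_db, np.ones(9) / 9.0, mode="same")
    threshold_db = np.maximum(smoothed_db - offset_db, absolute_floor_db)
    keep = magnitude_db > threshold_db
    coded = np.where(keep, spectrum, 0.0)      # "allocate zero bits" below threshold
    return idct(coded, type=2, norm="ortho"), int(keep.sum())

if __name__ == "__main__":
    t = np.arange(1024) / 44100.0
    frame = np.sin(2 * np.pi * 440 * t) + 0.001 * np.random.randn(1024)
    decoded, kept = toy_perceptual_frame_coder(frame)
    print(f"kept {kept} of 1024 spectral components")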

Lossy formats are often used for the distribution of streaming audio, or in interactive applications (such as the coding of speech for digital transmission in cell phone networks). In such applications, the data must be decompressed as the data flows, rather than after the entire data stream has been transmitted. Not all audio codecs can be used for streaming applications, and for such applications a codec designed to stream data effectively will usually be chosen.

Latency results from the methods used to encode and decode the data. Some codecs will analyze a longer segment of the data to optimize efficiency, and then code it in a manner that requires a larger segment of data at one time in order to decode. (Often codecs create segments called a "frame" to create discrete data segments for encoding and decoding.) The inherent latency of the coding algorithm can be critical; for example, when there is two-way transmission of data, such as with a telephone conversation, significant delays may seriously degrade the perceived quality. In contrast to the speed of compression, which is proportional to the number of operations required by the algorithm, here latency refers to the number of samples which must be analysed before a block of audio is processed; a small worked example follows. In the minimum case, latency is zero samples (e.g., if the coder/decoder simply reduces the number of bits used to quantize the signal). Time-domain algorithms such as LPC also often have low latencies, hence their popularity in speech coding for telephony. In algorithms such as MP3, however, a large number of samples have to be analyzed in order to implement a psychoacoustic model in the frequency domain, and latency is on the order of 23 ms (46 ms for two-way communication).
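A small hedged calculation of the frame-buffering part of that latency; the frame size and sample rate below are illustrative inputs, not figures from any particular codec specification, and real codecs add look-ahead and filter-bank delay on top of this.

def frame_latency_ms(frame_samples, sample_rate_hz, two_way=False):
    """Latency contributed by having to buffer one frame before coding it.
    Real codecs add look-ahead and filter-bank delay on top of this figure."""
    one_way = 1000.0 * frame_samples / sample_rate_hz
    return 2 * one_way if two_way else one_way

# e.g. a 1152-sample frame at 48 kHz buffers 24 ms one-way, 48 ms round trip
print(frame_latency_ms(1152, 48000))                  # 24.0
print(frame_latency_ms(1152, 48000, two_way=True))    # 48.0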

[edit]Speech encoding
Speech encoding is an important category of audio data compression. The perceptual models used to estimate what a human ear can hear are generally somewhat different from those used for music. The range of frequencies needed to convey the sounds of a human voice is normally far narrower than that needed for music, and the sound is normally less complex. As a result, speech can be encoded at high quality using a relatively low bit rate. This is accomplished, in general, by some combination of two approaches:

Only encoding sounds that could be made by a single human voice.
Throwing away more of the data in the signal, keeping just enough to reconstruct an "intelligible" voice rather than the full frequency range of human hearing.

Perhaps the earliest algorithms used in speech encoding (and audio data compression in general) were the A-law algorithm and the μ-law algorithm.

[edit]History

Solidyne 922: The world's first commercial audio bit compression card for PC, 1990

A literature compendium for a large variety of audio coding systems was published in the IEEE Journal on Selected Areas in Communications (JSAC), February 1988. While there were some papers from before that time, this collection documented an entire variety of finished, working audio coders, nearly all of them using perceptual (i.e. masking) techniques and some kind of frequency analysis and back-end noiseless coding.[9] Several of these papers remarked on the difficulty of obtaining good, clean digital audio for research purposes. Most, if not all, of the authors in the JSAC edition were also active in the MPEG-1 Audio committee.

The world's first commercial broadcast automation audio compression system was developed by Oscar Bonello, an engineering professor at the University of Buenos Aires.[10] In 1983, using the psychoacoustic principle of the masking of critical bands first published in 1967,[11] he started developing a practical application based on the recently developed IBM PC computer, and the broadcast automation system was launched in 1987 under the name Audicom. Twenty years later, almost all the radio stations in the world were using similar technology, manufactured by a number of companies.

[edit]Video
See also: Video codec

Video compression uses modern coding techniques to reduce redundancy in video data. Most video compression algorithms and codecs combine spatial image compression and temporal motion compensation. Video compression is a practical implementation of source coding in information theory. In practice, most video codecs also use audio compression techniques in parallel to compress the separate, but combined, data streams.

The majority of video compression algorithms use lossy compression. Large amounts of data may be eliminated while remaining perceptually indistinguishable. As in all lossy compression, there is a trade-off between video quality, cost of processing the compression and decompression, and system requirements. Highly compressed video may present visible or distracting artifacts.

Video compression typically operates on square-shaped groups of neighboring pixels, often called macroblocks. These pixel groups or blocks of pixels are compared from one frame to the next, and the video compression codec sends only the differences within those blocks; a toy sketch of this block-by-block comparison follows. In areas of video with more motion, the compression must encode more data to keep up with the larger number of pixels that are changing. Commonly during explosions, flames, flocks of animals, and in some panning shots, the high-frequency detail leads to quality decreases or to increases in the variable bitrate.
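A toy Python sketch of this per-block comparison, under simplifying assumptions (grayscale frames, a fixed 16x16 block size, an arbitrary change threshold, and no motion search, transform, or entropy coding): it merely flags which blocks differ enough from the previous frame to need re-encoding.

import numpy as np

def changed_blocks(prev_frame, curr_frame, block=16, threshold=5.0):
    """Return coordinates of blocks whose mean absolute difference from the
    previous frame exceeds the threshold; only these would be re-coded."""
    h, w = curr_frame.shape
    changed = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            prev_blk = prev_frame[y:y+block, x:x+block].astype(np.float64)
            curr_blk = curr_frame[y:y+block, x:x+block].astype(np.float64)
            if np.mean(np.abs(curr_blk - prev_blk)) > threshold:
                changed.append((y, x))
    return changed

if __name__ == "__main__":
    prev = np.zeros((64, 64), dtype=np.uint8)
    curr = prev.copy()
    curr[16:32, 16:32] = 200            # one block's worth of change
    print(changed_blocks(prev, curr))   # [(16, 16)]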

Video codec
From Wikipedia, the free encyclopedia

A video codec is a device or software that enables video compression and/or decompression for digital video. The compression usually employs lossy data compression.

Historically, video was stored as an analog signal on magnetic tape. Around the time when the compact disc entered the market as a digital-format replacement for analog audio, it became feasible to also begin storing and using video in digital form, and a variety of such technologies began to emerge.

Audio and video call for customized methods of compression. Engineers and mathematicians have tried a number of solutions for tackling this problem. There is a complex balance between the video quality, the quantity of the data needed to represent it (also known as the bit rate), the complexity of the encoding and decoding algorithms, robustness to data losses and errors, ease of editing, random access, the state of the art of compression algorithm design, end-to-end delay, and a number of other factors.

Motion compensation
From Wikipedia, the free encyclopedia


Motion compensation is an algorithmic technique employed in the encoding of video data for video compression, for example in the generation of MPEG-2 files. Motion compensation describes a picture in terms of the transformation of a reference picture to the current picture. The reference picture may be previous in time or even from the future. When images can be accurately synthesised from previously transmitted/stored images, the compression efficiency can be improved.

[edit]Encoding theory
Video data may be represented as a series of still image frames. The sequence of frames contains spatial and temporal redundancy that video compression algorithms attempt to eliminate or code in a smaller size. Similarities can be encoded by only storing differences between frames, or by using perceptual features of human vision. For example, small differences in color are more difficult to perceive than are changes in brightness. Compression algorithms can average a color across these similar areas to reduce space, in a manner similar to those used in JPEG image compression.[12] Some of these methods are inherently lossy while others may preserve all relevant information from the original, uncompressed video.

One of the most powerful techniques for compressing video is interframe compression. Interframe compression uses one or more earlier or later frames in a sequence to compress the current frame, while intraframe compression uses only the current frame, effectively being image compression. The most commonly used method works by comparing each frame in the video with the previous one. If the frame contains areas where nothing has moved, the system simply issues a short command that copies that part of the previous frame, bit-for-bit, into the next one. If sections of the frame move in a simple manner, the compressor emits a (slightly longer) command that tells the decompressor to shift, rotate, lighten, or darken the copy: a longer command, but still much shorter than intraframe compression.

Interframe compression works well for programs that will simply be played back by the viewer, but can cause problems if the video sequence needs to be edited. Because interframe compression copies data from one frame to another, if the original frame is simply cut out (or lost in transmission), the following frames cannot be reconstructed properly. Some video formats, such as DV, compress each frame independently using intraframe compression. Making 'cuts' in intraframe-compressed video is almost as easy as editing uncompressed video: one finds the beginning and ending of each frame, and simply copies bit-for-bit each frame that one wants to keep, and discards the frames one doesn't want. Another difference between intraframe and interframe compression is that with intraframe systems, each frame uses a similar amount of data. In most interframe systems, certain frames (such as "I frames" in MPEG-2) aren't allowed to copy data from other frames, and so require much more data than other frames nearby.

It is possible to build a computer-based video editor that spots problems caused when I frames are edited out while other frames need them. This has allowed newer formats like HDV to be used for editing. However, this process demands a lot more computing power than editing intraframe-compressed video with the same picture quality.

Today, nearly all commonly used video compression methods (e.g., those in standards approved by the ITU-T or ISO) apply a discrete cosine transform (DCT) for spatial redundancy reduction; a minimal sketch of such a DCT step follows. Other methods, such as fractal compression, matching pursuit and the use of a discrete wavelet transform (DWT), have been the subject of some research, but are typically not used in practical products (except for the use of wavelet coding as still-image coders without motion compensation). Interest in fractal compression seems to be waning, due to recent theoretical analysis showing a comparative lack of effectiveness of such methods.[citation needed]
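As a minimal sketch of the DCT step for spatial redundancy reduction, the following Python code transforms and uniformly quantizes a single 8x8 block and then inverts the process; the level shift and the single q_step value are illustrative choices, not any standard's quantization matrix.

import numpy as np
from scipy.fft import dctn, idctn

def code_block(block, q_step=16.0):
    """Forward path of a toy intraframe coder for one 8x8 block:
    2-D DCT, then uniform quantization (this is where information is lost)."""
    coeffs = dctn(block.astype(np.float64) - 128.0, norm="ortho")
    return np.round(coeffs / q_step).astype(np.int32)

def decode_block(quantized, q_step=16.0):
    """Inverse path: dequantize and inverse-transform back to pixel values."""
    coeffs = quantized.astype(np.float64) * q_step
    return np.clip(np.round(idctn(coeffs, norm="ortho") + 128.0), 0, 255).astype(np.uint8)

if __name__ == "__main__":
    # a smooth gradient block compresses well: most quantized coefficients are zero
    block = np.tile(np.arange(0, 64, 8, dtype=np.uint8), (8, 1))
    q = code_block(block)
    print(np.count_nonzero(q), "nonzero coefficients of 64")
    print(np.max(np.abs(decode_block(q).astype(int) - block.astype(int))), "max pixel error")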

[edit]Timeline
The following table is a partial history of international video compression standards.

History of Video Compression Standards

Year  Standard              Publisher          Popular Implementations
1984  H.120                 ITU-T
1990  H.261                 ITU-T              Videoconferencing, Videotelephony
1993  MPEG-1 Part 2         ISO, IEC           Video-CD
1995  H.262/MPEG-2 Part 2   ISO, IEC, ITU-T    DVD Video, Blu-ray, Digital Video Broadcasting, SVCD
1996  H.263                 ITU-T              Videoconferencing, Videotelephony, Video on Mobile Phones (3GP)
1999  MPEG-4 Part 2         ISO, IEC           Video on Internet (DivX, Xvid ...)
2003  H.264/MPEG-4 AVC      ISO, IEC, ITU-T    Blu-ray, Digital Video Broadcasting, iPod Video, HD DVD
2008  VC-2 (Dirac)          ISO, BBC           Video on Internet, HDTV broadcast, UHDTV

Speech processing
From Wikipedia, the free encyclopedia

Speech processing is the study of speech signals and the processing methods of these signals. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing, applied to speech signals.[clarification needed] It is also closely tied to natural language processing (NLP), as its input can come from NLP applications and its output can go to them. For example, text-to-speech synthesis may use a syntactic parser on its input text, and the output of speech recognition may be used by information extraction techniques. Speech processing can be divided into the following categories:

Speech recognition, which deals with analysis of the linguistic content of a speech signal.
Speaker recognition, where the aim is to recognize the identity of the speaker.
Speech coding, a specialized form of data compression, is important in the telecommunication area.
Voice analysis for medical purposes, such as analysis of vocal loading and dysfunction of the vocal cords.
Speech synthesis: the artificial synthesis of speech, which usually means computer-generated speech.
Speech enhancement: enhancing the intelligibility and/or perceptual quality of a speech signal, like audio noise reduction for audio signals.

Speech recognition
From Wikipedia, the free encyclopedia

For the human linguistic concept, see Speech perception.


The display of the Speech Recognition screensaver on a PC, in which the character responds to questions, e.g. "Where are you?" or statements, e.g. "Hello."

Speech recognition (also known as automatic speech recognition, computer speech recognition, speech to text, or just STT) converts spoken words to text. The term "voice recognition" is sometimes used to refer to recognition systems that must be trained to a particular speaker, as is the case for most desktop recognition software. Recognizing the speaker can simplify the task of translating speech. Speech recognition is a broader solution that refers to technology that can recognize speech without being targeted at a single speaker, such as a call system that can recognize arbitrary voices.

Speech recognition applications include voice user interfaces such as voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), domotic appliance control, search (e.g., find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), speech-to-text processing (e.g., word processors or emails), and aircraft (usually termed Direct Voice Input).

Speaker recognition
From Wikipedia, the free encyclopedia



Voice recognition redirects here. For software that converts speech to text, see Speech recognition.

Speaker recognition[1] is the computing task of validating a user's claimed identity using characteristics extracted from their voice. There is a difference between speaker recognition (recognizing who is speaking) and speech recognition (recognizing what is being said). These two terms are frequently confused, as is voice recognition. Voice recognition is a combination of the two: it uses learned aspects of a speaker's voice to determine what is being said; the system cannot recognize speech from random speakers very accurately, but it can reach high accuracy for individual voices for which it has been trained. In addition, there is a difference between the act of authentication (commonly referred to as speaker verification or speaker authentication) and identification. Finally, there is a difference between speaker recognition (recognizing who is speaking) and speaker diarisation (recognizing when the same speaker is speaking).

Speaker recognition has a history dating back some four decades and uses the acoustic features of speech that have been found to differ between individuals. These acoustic patterns reflect both anatomy (e.g., size and shape of the throat and mouth) and learned behavioral patterns (e.g., voice pitch, speaking style). Speaker verification has earned speaker recognition its classification as a "behavioral biometric".

Speech coding
From Wikipedia, the free encyclopedia

Speech coding is the application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation using audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream. The two most important applications of speech coding are mobile telephony and Voice over IP.

The techniques used in speech coding are similar to those in audio data compression and audio coding, where knowledge of psychoacoustics is used to transmit only data that is relevant to the human auditory system. For example, in voiceband speech coding, only information in the frequency band 400 Hz to 3500 Hz is transmitted, but the reconstructed signal is still adequate for intelligibility.

Speech coding differs from other forms of audio coding in that speech is a much simpler signal than most other audio signals, and a lot more statistical information is available about the properties of speech. As a result, some auditory information which is relevant in audio coding can be unnecessary in the speech coding context. In speech coding, the most important criterion is preservation of intelligibility and "pleasantness" of speech, with a constrained amount of transmitted data. The intelligibility of speech includes, besides the actual literal content, also speaker identity, emotions, intonation, timbre, etc., all of which are important for perfect intelligibility. The more abstract concept of pleasantness of degraded speech is a different property than intelligibility, since it is possible that degraded speech is completely intelligible, but subjectively annoying to the listener. In addition, most speech applications require low coding delay, as long coding delays interfere with speech interaction.

[edit]Sample companding viewed as a form of speech coding

From this viewpoint, the A-law and μ-law algorithms (G.711) used in traditional PCM digital telephony can be seen as a very early precursor of speech encoding, requiring only 8 bits per sample but giving effectively 12 bits of resolution; a small sketch of the μ-law curve follows. Although this would generate unacceptable distortion in a music signal, the peaky nature of speech waveforms, combined with the simple frequency structure of speech as a periodic waveform having a single fundamental frequency with occasional added noise bursts, makes these very simple instantaneous compression algorithms acceptable for speech.

A wide variety of other algorithms were tried at the time, mostly variants on delta modulation, but after careful consideration, the A-law/μ-law algorithms were chosen by the designers of the early digital telephony systems. At the time of their design, their 33% bandwidth reduction for a very low complexity made them an excellent engineering compromise. Their audio performance remains acceptable, and there has been no need to replace them in the stationary phone network. In 2008, the G.711.1 codec, which has a scalable structure, was standardized by ITU-T; its input sampling rate is 16 kHz.
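A small Python sketch of the μ-law companding curve (μ = 255) applied to normalized samples; it shows the idea of finer steps for quiet samples, but omits the segment offsets and bit inversions that the bit-exact G.711 tables specify, so it is an approximation rather than a standards-conformant encoder.

import numpy as np

MU = 255.0

def mu_law_compress(x):
    """Map samples in [-1, 1] through the mu-law curve, then quantize to 8 bits.
    Small signals get proportionally finer steps than loud ones."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1.0) / 2.0 * 255.0).astype(np.uint8)   # codes 0..255

def mu_law_expand(code):
    """Approximate inverse: back to values in [-1, 1]."""
    y = code.astype(np.float64) / 255.0 * 2.0 - 1.0
    return np.sign(y) * (np.power(1.0 + MU, np.abs(y)) - 1.0) / MU

if __name__ == "__main__":
    x = np.array([-0.5, -0.01, 0.0, 0.01, 0.5])
    codes = mu_law_compress(x)
    print(codes)                           # quiet samples still get distinct codes
    print(np.round(mu_law_expand(codes), 3))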

Voice analysis
From Wikipedia, the free encyclopedia


Voice analysis is the study of speech sounds for purposes other than linguistic content, such as in speech recognition. Such studies include mostly medical analysis of the voice, i.e. phoniatrics, but also speaker identification. More controversially, some believe that the truthfulness or emotional state of speakers can be determined using Voice Stress Analysis or Layered Voice Analysis.
Contents
1 Typical voice problems
2 Analysis methods
3 External links
4 See also

[edit]Typical voice problems

A medical study of the voice can be, for instance, analysis of the voice of patients who have had a polyp removed from their vocal cords through an operation. In order to objectively evaluate the improvement in voice quality there has to be some measure of voice quality. An experienced voice therapist can quite reliably evaluate the voice, but this requires extensive training and is still always subjective.

Another active research topic in medical voice analysis is vocal loading evaluation. The vocal cords of a person speaking for an extended period of time will suffer from tiring; that is, the process of speaking exerts a load on the vocal cords in which the tissue tires. Among professional voice users (i.e. teachers, salespeople) this tiring can cause voice failures and sick leave. To evaluate these problems, vocal loading needs to be objectively measured.

[edit]Analysis methods

Voice problems that require voice analysis most commonly originate from the vocal folds or the laryngeal musculature that controls them, since the folds are subject to collision forces with each vibratory cycle and to drying from the air being forced through the small gap between them, and the laryngeal musculature is intensely active during speech or singing and is subject to tiring. However, dynamic analysis of the vocal folds and their movement is physically difficult. The location of the vocal folds effectively prohibits direct, invasive measurement of movement. Less invasive imaging methods such as x-rays or ultrasounds do not work because the vocal cords are surrounded by cartilage, which distorts image quality. Movements in the vocal cords are rapid, with fundamental frequencies usually between 80 and 300 Hz, thus preventing usage of ordinary video. Stroboscopic and high-speed videos provide an option, but in order to see the vocal folds, a fiberoptic probe leading to the camera has to be positioned in the throat, which makes speaking difficult. In addition, placing objects in the pharynx usually triggers a gag reflex that stops voicing and closes the larynx. Furthermore, stroboscopic imaging is only useful when the vocal fold vibratory pattern is closely periodic.

The most important indirect methods are currently inverse filtering of either microphone or oral airflow recordings and electroglottography (EGG). In inverse filtering, the speech sound (the radiated acoustic pressure waveform, as obtained from a microphone) or the oral airflow waveform from a circumferentially vented (CV) mask is recorded outside the mouth and then filtered by a mathematical method to remove the effects of the vocal tract. This method produces an estimate of the waveform of the glottal airflow pulses, which in turn reflect the movements of the vocal folds. The other kind of noninvasive indirect indication of vocal fold motion is electroglottography, in which electrodes placed on either side of the subject's throat at the level of the vocal folds record the changes in the conductivity of the throat according to how large a portion of the vocal folds are touching each other. It thus yields one-dimensional information of the contact area. Neither inverse filtering nor EGG is sufficient to completely describe the complex 3-dimensional pattern of vocal fold movement, but they can provide useful indirect evidence of that movement.

Speech synthesis
From Wikipedia, the free encyclopedia

See also: Speech generating device

Stephen Hawking is one of the most famous people using speech synthesis to communicate

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.[1]

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.[2]

The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1980s.
Contents
1 Overview of text processing
2 History
2.1 Electronic devices
3 Synthesizer technologies
3.1 Concatenative synthesis
3.1.1 Unit selection synthesis
3.1.2 Diphone synthesis
3.1.3 Domain-specific synthesis
3.2 Formant synthesis
3.3 Articulatory synthesis
3.4 HMM-based synthesis
3.5 Sinewave synthesis
4 Challenges
4.1 Text normalization challenges
4.2 Text-to-phoneme challenges
4.3 Evaluation challenges
4.4 Prosodics and emotional content
5 Dedicated hardware
6 Computer operating systems or outlets with speech synthesis
6.1 Atari
6.2 Apple
6.3 AmigaOS
6.4 Microsoft Windows
6.5 Android
6.6 Internet
6.7 Others
7 Speech synthesis markup languages
8 Applications
9 See also
10 References
11 External links

[edit]Overview of text processing

Overview of a typical TTS system

Sample of Microsoft Sam: Microsoft Windows XP's default speech synthesizer voice saying "The quick brown fox jumps over the lazy dog 1,234,567,890 times."

A text-to-speech system (or "engine") is composed of two parts:[3] a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization; a toy sketch of this step follows. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations),[4] which is then imposed on the output speech.
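A toy Python illustration of the front-end's text-normalization step; the tiny abbreviation and digit tables are hypothetical, and a real front-end handles far more cases (dates, currencies, context-dependent abbreviations) than this.

import re

# hypothetical, tiny lookup tables for illustration only
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "No.": "Number"}
DIGITS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

def normalize(text):
    """Very small front-end pass: expand known abbreviations and spell out digits."""
    for abbrev, full in ABBREVIATIONS.items():
        text = text.replace(abbrev, full)
    # replace each digit character with its word, digit by digit
    text = re.sub(r"\d", lambda m: " " + DIGITS[int(m.group())] + " ", text)
    return re.sub(r"\s+", " ", text).strip()

if __name__ == "__main__":
    print(normalize("Dr. Smith lives at No. 42 Oak St."))
    # -> Doctor Smith lives at Number four two Oak Street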

Speech enhancement
From Wikipedia, the free encyclopedia

Speech enhancement aims to improve speech quality by using various algorithms. The objective of enhancement is improvement in intelligibility and/or overall perceptual quality of a degraded speech signal using audio signal processing techniques. Enhancement of speech degraded by noise, or noise reduction, is the most important field of speech enhancement, and is used for many applications such as mobile phones, VoIP, teleconferencing systems, speech recognition, and hearing aids.[1] One classical noise-reduction approach, spectral subtraction, is sketched after the contents list below.
Contents
1 Algorithms
2 See also
3 References
4 External links
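The Python sketch below shows spectral subtraction, one classical noise-reduction approach consistent with the description above (not the only one): it estimates the noise magnitude spectrum from a stretch assumed to contain noise only and subtracts it frame by frame, reusing the noisy phase. The frame length, overlap, and spectral floor are illustrative choices, and real systems add smoothing to suppress the "musical noise" this simple version leaves behind.

import numpy as np

def spectral_subtraction(noisy, noise_sample, frame=512, floor=0.05):
    """Per-frame magnitude subtraction with a spectral floor; the noisy phase is
    reused. noise_sample is a stretch of signal assumed to contain noise only."""
    window = np.hanning(frame)
    noise_mag = np.abs(np.fft.rfft(noise_sample[:frame] * window))
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    hop = frame // 2
    for start in range(0, len(noisy) - frame + 1, hop):
        seg = noisy[start:start + frame] * window
        spec = np.fft.rfft(seg)
        mag = np.maximum(np.abs(spec) - noise_mag, floor * np.abs(spec))
        cleaned = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
        out[start:start + frame] += cleaned * window        # overlap-add
        norm[start:start + frame] += window ** 2            # synthesis normalization
    return out / np.maximum(norm, 1e-12)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    t = np.arange(16000) / 8000.0
    clean = np.sin(2 * np.pi * 300 * t)
    noise = 0.3 * rng.standard_normal(t.size)
    enhanced = spectral_subtraction(clean + noise, noise)
    # residual noise in a middle stretch, before and after enhancement
    print(np.round(np.std(noise[4000:12000]), 3),
          np.round(np.std(enhanced[4000:12000] - clean[4000:12000]), 3))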

The Representation of Speech


Historically, the primary use of encryption has been, of course, to protect messages in text form. Advancing technology has allowed images and audio to be stored and communicated in digital form. A particularly effective method of compressing images is the Discrete Cosine Transform, which is used in the JPEG (Joint Photographic Experts Group) file format.

When sound is converted to an analogue electrical signal by an appropriate transducer (a device for converting changing levels of one quantity to changing levels of another) such as a microphone, the resulting electrical signal has a value that changes over time, oscillating between positive and negative. A Compact Disc stores stereo musical recordings in the form of two digital audio channels, each one containing 44,100 16-bit signed integers for every second of sound. This leads to a total data rate of 176,400 bytes per second.

For transmitting a telephone conversation digitally, the same level of fidelity is not required. Only a single audio channel is used, and only frequencies of up to 3000 cycles per second (or 3000 Hertz) are required, which requires (because of a mathematical law called the Nyquist theorem) 6000 samples of the level of the audio signal to be taken each second (after the signal has been bandlimited to the range of frequencies to be reproduced, otherwise aliasing may result). For many communications applications, samples of audio waveforms are one byte in length, and they are represented by a type of floating-point notation to allow one byte to represent an adequate range of levels. Simple floating-point notation, for an eight-bit byte, might look like this:
S  EE  MMMMM     value
0  11  11111     1111.1
0  11  10000     1000.0
0  10  11111     111.11
0  10  10000     100.00
0  01  11111     11.111
0  01  10000     10.000
0  00  11111     1.1111
0  00  10000     1.0000

The sign bit is always shown as 0, which indicates a positive number. Negative numbers are often indicated in floating-point notation by making the sign bit a 1 without changing any other part of the number, although other conventions are used as well. For comparison purposes, the floating-point notations shown have all been scaled so that 1 represents the smallest nonzero number that can be indicated.
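A hedged Python sketch of packing a level into the 1-sign / 2-exponent / 5-mantissa layout of the table above, scaled as in the text so that 1.0000 is the smallest normalized value; the rounding and clamping choices are my own, not part of the format description.

def encode_simple_float(value):
    """Pack a level into the S/EE/MMMMM format from the table above:
    value = (mantissa / 16) * 2**exponent, with the mantissa normalized to 16..31.
    Returns (sign, exponent, mantissa); representable magnitudes run 1.0 to 15.5."""
    sign = 0 if value >= 0 else 1
    magnitude = abs(value)
    exponent = 0
    while magnitude >= 2.0 and exponent < 3:      # normalize the mantissa into [1, 2)
        magnitude /= 2.0
        exponent += 1
    mantissa = min(31, max(16, round(magnitude * 16)))
    return sign, exponent, mantissa

def decode_simple_float(sign, exponent, mantissa):
    value = (mantissa / 16.0) * (2 ** exponent)
    return -value if sign else value

if __name__ == "__main__":
    for v in (1.0, 1.9375, 7.3, 15.5):
        fields = encode_simple_float(v)
        print(v, fields, decode_simple_float(*fields))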

One way the range of values that can be represented can be extended is by allowing gradual underflow, where an unnormalized mantissa is permitted for the smallest exponent value.
S  EE  MMMMM     value
0  11  11111     11111000
0  11  10000     10000000
0  10  11111     1111100
0  10  10000     1000000
0  01  11111     111110
0  01  10000     100000
0  00  11111     11111
0  00  10000     10000
0  00  01111     1111
0  00  01000     1000
0  00  00111     111
0  00  00100     100
0  00  00011     11
0  00  00010     10
0  00  00001     1

Another way of making a floating-point representation more efficient involves noting that, in the first case, the first mantissa bit (the field of a floating-point number that represents the actual number directly is called the mantissa because it would correspond to the fractional part of the number's logarithm to the base used for the exponent) is always one. With gradual underflow, that bit is only allowed to be zero for one exponent value. Instead of using gradual underflow, one could use the basic floating-point representation we started with, but simply omit the bit that is always equal to one. This could produce a result like this:
S  EEE  MMMM     value
0  111  aaaa     1aaaa000
0  110  aaaa     1aaaa00
0  101  aaaa     1aaaa0
0  100  aaaa     1aaaa
0  011  aaaa     1aaa.a
0  010  aaaa     1aa.aa
0  001  aaaa     1a.aaa
0  000  aaaa     1.aaaa

Here, the variable bits of the mantissa are noted by aaaa, instead of being represented as all ones in one line and all zeroes in a following line, for both compactness and clarity. Today's personal computers use a standard floating-point format that combines gradual underflow with suppressing the first one bit in the mantissa. This is achieved by reserving a special exponent value, the lowest one, to behave differently from the others. That exponent value is required to multiply the mantissa by the same amount as the next higher exponent value (instead of a power of the radix that is one less), and the mantissa, for that exponent value, does not have its first one bit suppressed.

Another method of representing floating-point quantities efficiently is something I call extremely gradual underflow. This retains the first one bit in the mantissa, but treats the degree of unnormalization of the mantissa as the most significant part of the exponent field. It works like this (the third column shows an alternate version of this format, to be explained below):
S  EE  MMMMM     value                      alternate form
0  11  1aaaa     1aaaa000000000000000       0 1 11 aaaa
0  10  1aaaa     1aaaa00000000000000        0 1 10 aaaa
0  01  1aaaa     1aaaa0000000000000         0 1 01 aaaa
0  00  1aaaa     1aaaa000000000000          0 1 00 aaaa
0  11  01aaa     1aaa000000000000           0 01 11 aaa
0  10  01aaa     1aaa00000000000            0 01 10 aaa
0  01  01aaa     1aaa0000000000             0 01 01 aaa
0  00  01aaa     1aaa000000000              0 01 00 aaa
0  11  001aa     1aa000000000               0 001 11 aa
0  10  001aa     1aa00000000                0 001 10 aa
0  01  001aa     1aa0000000                 0 001 01 aa
0  00  001aa     1aa000000                  0 001 00 aa
0  11  0001a     1a000000                   0 0001 11 a
0  10  0001a     1a00000                    0 0001 10 a
0  01  0001a     1a0000                     0 0001 01 a
0  00  0001a     1a000                      0 0001 00 a
0  11  00001     1000                       0 00001 11
0  10  00001     100                        0 00001 10
0  01  00001     10                         0 00001 01
0  00  00001     1                          0 00001 00

Although usually a negative number is indicated simply by setting the sign bit to 1, another possibility is to also invert all the other bits in the number. In this way, for some of the simpler floating-point formats, an integer comparison instruction can also be used to test if one floating-point number is larger than another. This definitely will not work for the complicated extremely gradual underflow format as it is shown here. However, that format can be coded so as to allow this to work, as follows: the exponent field can be made movable, and it can be placed after the first 1 bit in the mantissa field. This is the format shown in the third column above. When this is done, for very small numbers the idea of allowing the exponent field to shrink suggests itself. Thus, if the table above is continued, we obtain:
S  EE  MMMMM     value      alternate form
0  11  00001     1000       0 00001 11
0  10  00001     100        0 00001 10
0  01  00001     10         0 00001 01
0  00  00001     1          0 00001 00
N/A              0.1        0 000001 1
N/A              0.01       0 000001 0
N/A              0.001      0 0000001

Something very similar is used to represent sound signals in 8-bit form using the A-law, which is the standard for European microwave telephone transmission, and which is also sometimes used for satellite audio transmissions. However, the convention for representing the sign of numbers is different. Mu-law encoding, used in the United States and Japan (and, I would suspect, Canada as well), instead operates as a conventional floating-point format, with the first bit of the mantissa, which is always a 1 when the exponent is a power of two, suppressed. The following table illustrates these formats, with capital letters indicating bits that are complemented:
[Table: a set of linear sample values shown alongside their Mu-Law encodings, two readings of the A-Law encoding, and the suppressed-bit, gradual-underflow, and extremely-gradual-underflow floating-point encodings discussed above, with capital letters marking complemented bits.]

Most descriptions of A-Law encoding and Mu-Law encoding state that it is Mu-Law encoding that has the greater dynamic range, acting on 14-bit values while A-Law encoding acts on 13-bit values; it appears to me, as shown on the diagram, that Mu-Law encoding acts on 13-bit values, and A-Law encoding acts on 24-bit values. It may be that the floating-point encoding used with Mu-Law encoding is applied not to the input signal value, but to its logarithm, or it may be that my original source for information on A-Law encoding either was not accurate, or I had misconstrued it; this seems likely, as using 24-bit digitization as the first step in digitizing a telephone conversation appears, in comparison to standards for high-quality digital audio, to be bizarre. The second A-Law reading indicates what other sources appear to give for A-Law encoding, and this does cause it to act on 12-bit values (including the sign bit), which is at least one less bit than for Mu-Law encoding, even if there is a one-bit discrepancy in both cases.

Also, if this method, with a two-bit exponent, were used for encoding audio signals with 16 bits per sample, the result, for the loudest signals, would have the same precision as a 14-bit signed integer, 13 bits of mantissa. Many early digital audio systems used 14 bits per sample rather than 16 bits. But the dynamic range, the difference between the softest and loudest signals possible, would be that of a 56-bit integer.

One problem with using floating-point representations of signals for digital high-fidelity audio - although this particular format seems precise enough to largely make that problem minor - is that the human ear can still hear relatively faint sounds while another sound is present, if the two sounds are in different parts of the frequency spectrum. This is why some methods of music compression, such as those used with Sony's MiniDisc format, Philips' DCC (Digital Compact Cassette), and today's popular MP3 audio format, work by dividing the audio spectrum up into "critical bands", which are to some extent processed separately.

Transmitting 6000 bytes per second is an improvement over 176,400 bytes per second, but it is still a fairly high data rate, requiring a transmission rate of 48,000 baud. Other techniques of compressing audio waveforms include delta modulation, where the difference between consecutive samples, rather than the samples themselves, is transmitted. A technique called ADPCM, adaptive differential pulse code modulation, works by such methods as extrapolating the previous two samples in a straight line, and assigning the available codes for levels for the current sample symmetrically around the extrapolated point. (A toy sketch of this kind of predictive coding appears at the end of this section.) The term LPC, which means linear predictive coding, does not, as it might seem, refer to this kind of technique, but instead to a method that can very effectively reduce the amount of data required to transmit a speech signal, because it is based on the way the human vocal tract forms speech sounds. There was a good page about Linear Predictive Coding at

http://asylum.sf.ca.us/pub/u/howitt/lpc.tutorial.html

but that URL is no longer valid.

In the latter part of World War II, the United States developed a highly secure speech scrambling system which used the vocoder principle to convert speech to a digital format. This format was then enciphered by means of a one-time pad, and the result was transmitted using the spread-spectrum technique. The one-time pad was in the form of a phonograph record, containing a signal which had six distinct levels. The records used by the two stations communicating were kept synchronized by the use of quartz crystal oscillators where the quartz crystals were kept at a controlled temperature. The system was called SIGSALY, and an article by David Kahn in the September, 1984 issue of Spectrum described it.

Speech was converted for transmission as follows: The loudness of the portion of the sound in each of ten frequency bands, on average 280 Hz in width (ranging from 150 Hz to 2950 Hz), was determined for periods of one fiftieth of a second. This loudness was represented by one of six levels. The fundamental frequency of the speaking voice was represented by 35 codes; a 36th code indicated that a white noise source should be used instead in reconstructing the voice. This was also sampled fifty times a second. The intensities of sound in the bands indicated both the loudness of the fundamental signal, and the resonance of the vocal tract with respect to those harmonics of the fundamental signal that fell within the band. Either a waveform with the frequency of the fundamental, and a full set of harmonics, or white noise, was used as the source of the reconstructed sound in the receiver, and it was then filtered in the ten bands to match the observed intensities in these bands.

This involved the transmission of twelve base-6 digits, 50 times a second. Since 6 to the 12th power is 2,176,782,336, which is just over 2^31, which is 2,147,483,648, this roughly corresponds to transmitting 200 bytes a second. This uses only two-thirds of the capacity of a 2,400-baud modem, and is quite a moderate data rate. The sound quality this provided, however, was mediocre. A standard for linear predictive coding, known as CELP, comes in two versions which convert the human voice to a 2,400-baud signal or to a 4,800-baud signal.
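As promised above, here is a toy Python sketch of the delta-modulation/ADPCM idea: predict each sample by extrapolating the previous two in a straight line and transmit only a coarsely quantized correction. The four correction levels are an arbitrary illustration, not any standardized ADPCM table, and the encoder tracks the decoder's reconstruction so quantization errors do not accumulate.

import numpy as np

LEVELS = np.array([-6.0, -1.0, 1.0, 6.0])   # hypothetical correction codes (2 bits)

def adpcm_like_encode(samples):
    """Predict each sample as 2*previous - one_before (linear extrapolation),
    then send the index of the quantizer level closest to the prediction error."""
    prev2, prev1 = 0.0, 0.0
    codes = []
    for s in samples:
        prediction = 2.0 * prev1 - prev2
        index = int(np.argmin(np.abs(LEVELS - (s - prediction))))
        codes.append(index)
        reconstructed = prediction + LEVELS[index]   # what the decoder will see
        prev2, prev1 = prev1, reconstructed          # track the decoder's state
    return codes

def adpcm_like_decode(codes):
    prev2, prev1 = 0.0, 0.0
    out = []
    for index in codes:
        value = 2.0 * prev1 - prev2 + LEVELS[index]
        out.append(value)
        prev2, prev1 = prev1, value
    return out

if __name__ == "__main__":
    t = np.arange(40)
    signal = 10.0 * np.sin(2 * np.pi * t / 40.0)
    decoded = adpcm_like_decode(adpcm_like_encode(signal))
    # coarse but bounded tracking error from the 2-bit quantizer
    print(np.round(np.max(np.abs(signal - np.array(decoded))), 2))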

Audio filter
From Wikipedia, the free encyclopedia

Digital domain parametric equalisation

An audio filter is a frequency-dependent amplifier circuit, working in the audio frequency range, 0 Hz to beyond 20 kHz. Many types of filters exist for applications including graphic equalizers, synthesizers, sound effects, CD players and virtual reality systems.

Being a frequency-dependent amplifier, in its most basic form, an audio filter is designed to amplify, pass or attenuate (negatively amplify) some frequency ranges. Common types include the low-pass filter, which passes frequencies below its cutoff frequency and progressively attenuates frequencies above the cutoff frequency. A high-pass filter does the opposite, passing high frequencies above the cutoff frequency and progressively attenuating frequencies below the cutoff frequency. A bandpass filter passes frequencies between its two cutoff frequencies, while attenuating those outside the range. A band-reject filter attenuates frequencies between its two cutoff frequencies, while passing those outside the 'reject' range. An all-pass filter passes all frequencies, but affects the phase of any given sinusoidal component according to its frequency.

In some applications, such as in the design of graphic equalizers or CD players, the filters are designed according to a set of objective criteria such as pass band, pass band attenuation, stop band, and stop band attenuation, where the pass bands are the frequency ranges for which audio is attenuated less than a specified maximum, and the stop bands are the frequency ranges for which the audio must be attenuated by a specified minimum. In more complex cases, an audio filter can provide a feedback loop, which introduces resonance (ringing) alongside attenuation. Audio filters can also be designed to provide gain (boost) as well as attenuation.

In other applications, such as with synthesizers or sound effects, the aesthetic of the filter must be evaluated subjectively. Audio filters can be implemented in analog circuitry as analog filters, or in DSP code or computer software as digital filters; a small sketch of a digital low-pass filter follows. Generically, the term 'audio filter' can be applied to mean anything which changes the timbre or harmonic content of an audio signal.
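A small Python sketch of a digital low-pass audio filter implemented as a one-pole smoother; the cutoff-to-coefficient mapping follows the usual RC analogy, and the tone frequencies in the demo are illustrative, not drawn from any particular application.

import math

def one_pole_lowpass(samples, cutoff_hz, sample_rate_hz):
    """First-order low-pass: each output moves a fraction 'alpha' of the way
    toward the new input, attenuating content above roughly cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = dt / (rc + dt)
    out, y = [], 0.0
    for x in samples:
        y = y + alpha * (x - y)
        out.append(y)
    return out

if __name__ == "__main__":
    sr = 48000
    # 100 Hz tone plus 10 kHz tone; the filter keeps the low one and damps the high one
    mixed = [math.sin(2 * math.pi * 100 * n / sr) + math.sin(2 * math.pi * 10000 * n / sr)
             for n in range(2000)]
    filtered = one_pole_lowpass(mixed, cutoff_hz=500, sample_rate_hz=sr)
    print(max(abs(v) for v in filtered[1000:]))   # close to 1: the 10 kHz part is attenuated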

[edit]Self oscillation

Not to be confused with Self-exciting oscillation. Self oscillation occurs when the resonance or Q factor of the cutoff frequency of the filter is set high enough that the internal feedback causes the filter circuitry to become a sine-wave oscillator.

[edit]See also

Audio crossover
