
VECTOR QUANTIZATION

Robert M. Gray
Information Systems Laboratory

ABSTRACT

During the past ten years Vector Quantization (VQ) has developed from a theoretical possibility promised by Shannon's source coding theorems into a powerful and competitive technique for speech and image coding and compression at medium to low bit rates. In this survey, the basic ideas behind the design of vector quantizers are sketched and some comments made on the state of the art and current research efforts.

INTRODUCTION

VQ can be thought of as the vector extension of Pulse Code Modulation (PCM), wherein real vectors instead of real scalars are converted into digital representations which in turn can be used to produce a reproduction of the original signal. The goal, of course, is to produce a digital representation of the signal which can be communicated on a digital communication channel or stored in a digital medium. Representing analog data digitally introduces distortion, and hence a major design goal is to minimize the distortion given constraints on communication or storage capacity and complexity.

The vectors to be digitized may be a collection of consecutive samples from a continuous waveform, rectangular subblocks of an image intensity or density, three dimensional vectors consisting of, say, two-by-two squares of pixels three frames deep (or twelve pixels in the vector), or they may be feature or parameter vectors extracted from the data which represent its important attributes, such as Fourier transformed vectors or the Linear Predictive Coded (LPC) representation of a speech signal.

Although Shannon's source coding theorems imply that performance improvement can always be obtained by coding vectors instead of scalars (1,3,55,62), the most popular systems for analog-to-digital conversion and data compression perform the quantization operation only on scalars, although they often effectively operate on vectors by embedding the quantizer in a feedback loop (as in predictive quantization) or by first performing a linear transform on the data (as in transform coding). While such systems have the advantage of simplicity, they are necessarily suboptimal in the Shannon sense. Furthermore, the definition of "simplicity" has been enormously extended by modern circuit design and implementation techniques: DSP chips and VLSI have rendered feasible algorithms that were considered absurdly complicated only ten years ago.

In addition to the complexity issue, another impediment to the use of VQ systems has been the lack of design algorithms. Unlike the dual problem of channel coding or error control coding, quantization is fundamentally nonlinear, and the algebraic approaches successful in error correction are of little help in the digitization problem. Since the middle of the last decade, a variety of design techniques and tricks have been developed for VQ, and real time hardware has been designed to perform sophisticated variations of these systems. This paper presents a brief overview of the fundamental design principles of the basic VQ structure and its variations. Deeper discussions of many of the issues and systems may be found in tutorial articles (4,5,6). A thorough development of VQ systems may be found (7).

VQ is a form of "lossy" data compression, in contrast to "lossless" data compression or noiseless coding. Noiseless codes are perfectly invertible and necessarily variable length codes.
The most popular noiseless coding algorithms are Huffman coding, Rice codes, Lempel-Ziv codes, and arithmetic codes (55-61). These codes are in common use, especially for the compression of computer programs and data files which cannot tolerate errors. Noiseless compression is usually required in such situations where the compression system must be designed without any knowledge of the structure or end use of the original digital signal. On the other hand, any system which begins with an analog signal (such as microphone, camera, or analog sensor output) and produces a digital output is necessarily a lossy code, as a continuous voltage cannot be reproduced perfectly from a digital representation.

Given that all such analog-to-digital conversion systems are lossy, the goal of any such system is to provide the best quality (minimum loss) within the constraints of the system. As unpleasant as purposefully introducing distortion into a representation might sound to a user, it is preferable to the potentially large insertion of uncontrolled distortion or the complete loss of the data caused by overwhelming available digital communication or storage capacity. In other words, if you are generating gigabits but the available communication channel only takes megabits, then you either compress to the best acceptable quality or you may get no useful data at all.

VQ is an approach to minimizing the loss for a given communication or storage rate. It is based on the Shannon theory approach of defining and minimizing an objective distortion criterion for a given code rate. The minimization is accomplished using algorithms developed in communications, statistics, and cluster analysis. The next two sections summarize the basic approach.

MATHEMATICAL MODELS

The mathematical model for an analog-to-digital conversion system or for a data compression system is a source code subject to a fidelity criterion.
A source {X(n); n = 1, 2, ...} is a discrete time signal which in general is vector-valued. Let A denote the alphabet or collection of possible values of X(n); for example, A may be k-dimensional Euclidean vector space. For convenience we assume that the signal is a statistically "nice" process (e.g., the law of large numbers holds). The basic results extend to more general processes.

A code in the general sense is a mapping of the input sequence {X(n)} into a binary sequence (the encoder) together with a mapping of the binary sequence into a reproduction sequence {Y(n)} (the decoder). (We assume the encoded sequence is binary for convenience; in general it need only be from a finite alphabet.) The rate (or resolution) R of the code is the number of bits or binary symbols transmitted or stored per source symbol. In general this rate can be fractional, although in some examples it is useful to focus on integral values.

A block source code is a code where each input block or vector (X(lk), X(lk+1), ..., X((l+1)k-1)) is mapped into its binary code word in a way that does not depend on past or future actions of the encoder. The decoder is required to act in a similar fashion independently of past and future vectors. A block source code is also called a (memoryless) block quantizer or vector quantizer. The qualifier "memoryless" reflects the fact that such codes operate on vectors in a memoryless fashion (although they clearly have memory with respect to the individual symbols within the vectors). We will consider codes that have memory, but the focus of Shannon theory is on memoryless codes.

To measure the performance of a code, we assume a non-negative distortion measure d(x,y) which measures the cost of reproducing any x in A by some y in a reproduction alphabet B, which may or may not be the same as A. The performance is measured by an average distortion, where the average may be a long term time average or an ensemble or probabilistic average.
For convenience we represent both by

    A_N = E[ (1/N) Σ_{i=1}^{N} d(X(i), Y(i)) ],

where the expectation E can mean either a probabilistic average or a time average (which can be viewed as a special case of a probabilistic average in which every sample vector in a training sequence of length L has probability 1/L). Ideally a distortion measure should be analytically tractable, computable, and subjectively meaningful. In practice, these attributes must be balanced, and a variety of choices exist.

Shannon's converse coding theorem and its generalizations imply that for any code for which these definitions make sense, A can be no smaller than the distortion-rate function D(R) of the source and distortion measure evaluated at rate R, a function defined by an information theoretic minimization which can be computed analytically or numerically or bounded for many interesting sources. Shannon also showed that performance arbitrarily near to D(R) could be achieved with block source codes, another name for VQ. Unfortunately, however, the proof of this result provided no indication of how to actually design a good code. This we explore shortly.

MEMORYLESS VECTOR QUANTIZATION

As described in the previous section, a memoryless vector quantizer or block quantizer is in the Shannon terminology a length-k block source code subject to a fidelity criterion; that is, it is a pair of mappings: an encoder E which maps k-dimensional input vectors X into binary vectors, which we denote by their equivalent decimal representation j = 1, 2, ..., M, and a decoder D which maps those binary vectors into reproduction vectors. For simplicity we assume that the binary vectors have dimension K and hence that the rate of the code is R = K/k bits per input symbol.
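As a concrete illustration of the sample-average form of A_N (a sketch of my own, not from the paper), the following assumes the squared-error distortion d(x,y) = ||x - y||^2; the function and variable names are invented for the example:

```python
# Sketch (assumed example): sample-average distortion
# A_N = (1/N) * sum_i d(X(i), Y(i)) for the squared-error distortion measure.

def squared_error(x, y):
    """d(x, y) = ||x - y||^2 for k-dimensional vectors given as lists."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def average_distortion(source, reproduction):
    """Time-average distortion over paired source/reproduction sequences,
    i.e., the empirical (training-sequence) version of A_N."""
    n = len(source)
    return sum(squared_error(x, y) for x, y in zip(source, reproduction)) / n

# Example: a length-4 sequence of 2-dimensional vectors.
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
Y = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [2.0, 2.0]]  # last vector misreproduced
print(average_distortion(X, Y))  # → 0.5
```

This is exactly the "probability 1/L per training vector" interpretation of the expectation: a plain arithmetic mean over the sequence.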
To describe the operation of a block code, define the code book C = {y(j) = D(j), j = 1, 2, ..., M} as the collection of all possible reproduction vectors, and the code partition P = {P(j): j = 1, ..., M}, where P(j) is the collection of all input vectors which are encoded into the binary vector j. The quantizer mapping Q(x) is defined as D(E(x)), that is, the overall action of the code. The term VQ is used to refer to the combination of the encoder and decoder or, equivalently, the overall mapping Q.

The average distortion resulting from applying a vector quantizer to a source can be written using conditional averages as

    A = Σ_{j=1}^{M} Pr(X ∈ P(j)) E[ d(X, y(j)) | X ∈ P(j) ].

Recall that the expectation and the probabilities may come either from a probabilistic model or (more commonly) from a training sequence of typical data. A vector quantizer is optimal if it yields the smallest possible A over any quantizer with the same dimension and resolution. The above representation easily yields two necessary conditions for a VQ to be optimal.

The Nearest Neighbor (Minimum Distortion) Condition

A necessary condition for a VQ to be optimal is that the encoder be optimal for the decoder. If the decoder yields a code book C, then the encoder must be a nearest neighbor or minimum distortion mapping. This is equivalent to the following:

    E(x) = j only if d(x, y(j)) ≤ d(x, y(i)), all i ≠ j.

Thus given a decoder or, equivalently, a code book, the optimal encoder is the one which searches the entire code book and selects the binary vector corresponding to the minimum distortion available reproduction vector.

The Centroid Condition

Define the centroid of a set S with respect to a distortion measure d and a probability distribution on X by

    cent(S) = min⁻¹_y E[ d(X, y) | X ∈ S ],

that is, the value of y achieving the minimum conditional expected distortion. A necessary condition for a VQ to be optimal is that the decoder be optimal for the encoder. This is equivalent to the following: if the encoder implies a partition {P(j)}, then the optimal decoder satisfies D(j) = cent(P(j)), all j.
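The two necessary conditions can be sketched in code for the squared-error case, where the centroid of a partition cell reduces to the sample mean of the vectors in that cell. This is an assumed illustration with invented names, not an implementation from the paper:

```python
# Sketch (assumed example): the two optimality conditions for squared error.

def squared_error(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def encode(x, codebook):
    """Nearest neighbor condition: E(x) = j minimizing d(x, y(j))."""
    return min(range(len(codebook)), key=lambda j: squared_error(x, codebook[j]))

def centroid(cell):
    """Centroid condition for squared error: the conditional mean,
    here the sample mean of the training vectors in the cell."""
    k = len(cell[0])
    return [sum(x[i] for x in cell) / len(cell) for i in range(k)]

codebook = [[0.0, 0.0], [4.0, 4.0]]
print(encode([1.0, 1.0], codebook))        # → 0 (closer to [0.0, 0.0])
print(centroid([[0.0, 2.0], [2.0, 0.0]]))  # → [1.0, 1.0]
```

Note that `encode` uses only the code book (a full search), while `centroid` needs the data in the cell, matching the remark that only the decoder condition requires knowledge of the input distribution.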
For example, in the case of mean squared error, the centroid is simply a conditional mean given that the input was mapped into the binary vector j. This condition states that given an encoder, which can be considered as a partition of the input vector space, the optimum decoder is the one which maps each received binary vector into the centroid of the region of the partition which is encoded into that binary vector. Note that unlike the encoder condition, this condition requires knowledge of the input signal distribution, knowledge that can come from a mathematical model or from a training sequence.

MEMORYLESS VQ DESIGN

Because of the nearest neighbor condition, a VQ is completely described by its code book. The two properties together provide a means of improving any given code book C:

The Lloyd Iteration

1. Given a code book C, form a nearest neighbor partition {P(j)}.
2. Given a partition {P(j)}, form a new code book C = {cent(P(j)); j = 1, ..., M}.

It is obvious that the above operation produces a new code book that is better than (at least no worse than) the original code book, since each step can only improve performance. These properties were first observed for the mean squared error and scalar quantizers by Lukaszewicz and Steinhaus (8) and were independently found shortly thereafter by Lloyd ( ), who developed an iterative algorithm for designing scalar quantizers with a mean squared error based on repeated use of the iteration. The basic idea extends immediately to vectors and is called the generalized Lloyd algorithm:

The Lloyd VQ Design Algorithm

0. Given an initial code book C(0) and a threshold ε, set A(0) = huge and k = 1.
1. Use the Lloyd iteration to produce a new code book C(k) from C(k-1).
2. Evaluate the distortion

       A(k) = E[ min_y d(X, y) ],

   the minimum being over code words y in C(k). If (A(k-1) - A(k))/A(k) ≤ ε, quit. Otherwise replace k by k+1 and continue.

In most practical applications, one does not have a probability distribution, but does have a training sequence of data.
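On a training sequence, the design loop above might be sketched as follows. This is an assumed illustration using squared error and the empirical distribution; the names, the empty-cell handling, and the exact stopping tolerance are my own choices:

```python
# Sketch (assumed): the generalized Lloyd algorithm on a training sequence,
# squared-error distortion, empirical (1/L-per-sample) distribution.

def d(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def lloyd(training, codebook, eps=1e-6):
    """Repeat the Lloyd iteration until the relative drop in the
    sample-average distortion is at most eps."""
    prev = float("inf")                      # A(0) = huge
    while True:
        # Step 1: nearest neighbor partition of the training sequence.
        cells = [[] for _ in codebook]
        total = 0.0
        for x in training:
            j = min(range(len(codebook)), key=lambda j: d(x, codebook[j]))
            cells[j].append(x)
            total += d(x, codebook[j])
        avg = total / len(training)          # A(k), a sample average
        # Step 2: replace each code word by the centroid (mean) of its cell;
        # an empty cell keeps its old code word (one common convention).
        codebook = [
            [sum(x[i] for x in c) / len(c) for i in range(len(c[0]))] if c else y
            for c, y in zip(cells, codebook)
        ]
        if avg == 0 or (prev - avg) / avg <= eps:
            return codebook, avg
        prev = avg

training = [[0.0], [0.1], [0.2], [3.8], [3.9], [4.0]]
codebook, dist = lloyd(training, [[0.0], [1.0]])
print(codebook)   # two code words, near 0.1 and 3.9
```

Each pass can only lower (never raise) the average distortion, which is why the stopping test on the relative improvement terminates.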
In this case the algorithm can be run on the empirical distribution which assigns a probability of 1/L to each of the L samples in the training sequence, and the expected distortion is replaced by a sample average. Theorems can be proved to the effect that if the training sequence is long enough, the VQ designed will be close to that which would have been designed had the distribution been known (10,11).

This algorithm was developed in the statistical literature under the name of the k-means algorithm by MacQueen (12) and was first applied to vector quantization in the two dimensional case with a mean squared error by Chen (13). The algorithm was extended to general vector quantization with a variety of distortion measures by Linde, Buzo, and Gray (14), who computed centroids for a variety of distortion measures and applied the algorithm to speech waveform and voice compression.

A remaining issue is how to design the initial code book, which is in itself a code design problem. We next describe several such techniques. In fact, these techniques can be used as an alternative to the Lloyd algorithm for designing a complete code, but such code books can always be improved by subsequent application of the Lloyd algorithm.

Random Coding

Perhaps the simplest conceptual approach towards filling a code book of M code words is to randomly select the code words according to the source distribution, which can be viewed as Monte Carlo code book design. The obvious variation when designing based on a training sequence is to simply select the first M training vectors as code words. If the data is highly correlated, it will likely produce a better code book if one takes, say, every Nth training vector. This technique has often been used in the pattern recognition literature and was used in the original development of the k-means technique by MacQueen (12). One can be somewhat more so