84 IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. COM-28, NO. 1, JANUARY 1980 An Algorithm for Vector Quantizer Design YOSEPH LINDE, MEMBER. IEEE. ANDRES BUZO, MEMBER, EEE, Am ROBERT M. GRAY, SENIOR MEMBER. EEE ’ Abstract-An efficient,and intuitive algorithm is presented for the design of vector quantizers based either on a known prohabitistic model or on a long training sequence of data. The basic properties of the algorithm are discussed mid demonstrated by examples. Quite general distoriion measures and long blocklengths are allowed, as exemplified by the design of parameter vector quantizers of tendiensional vectors arising in Linear Predictive Coded (LE) speech compression with a complicated distortion measure arisiig in LPC analysis that does not depend only on the error vector. INTRODUCTION A- N efficient and intuitive algorithm for the design of good block or vector quantizers with ‘quite general distortion measures is developed for use on either known probabilistic source descriptions or on a long training sequence of data. The algorithm is based on in approach of Lloyd [I] , is not a varia- tional technique, and involves no differentiation; hence it works well even when the distribution has discrete compo- nents, as is the case when a sample distribution obtained from a training sequence is used. As with the common variational techniques, the algorithm produces a quantizer meeting neces- sary but not sufficient conditions for optimality. Usually, however, at least local optimality is assured in both approaches. We here motivate and describe the algorithm and relate it to a number of similar algorithms for special cases that have appeared in both the quantization and cluster analysis litera- ture. The basic operation of the algorithm is simple and intuitive in the general case considered here and it is clear that variational techniques are not required to develop nor to apply the algorithm. Several of the algorithm’s basic properties are developed using heuristic arguments and demonstrated by example. In a companion theoretical paper [2], these properties are given precise mathematical statements and are proved using -argu- ments from optiniization theory and ergodic theory. Those results will occasionally be quoted here to characterize the generality of certain properties. Paper approved by the Editor for Data Communication Systems of the IEEE Communications.Society for publication after presentation in part at the 1979 International Telemetering Conference, Los Angeles, CA, November 1978 and the 1979 International Symposium on hfor- mation Theory, Gringano, Italy; June 1979. Manuscript received May 22, 1978; revised August 21, 1979. This work was supported by Air Force Contract F44620-73-0065, F49620-78C-0087, and F49620-79- C-0058 and by the Joint Services Electronics Program at Stanford University, Stanford, CA. Y. Linde is with the Codex Corporation, Mansfield, MA. A. Buzo was with Stanford University, Stanford, CA and Signal Technology Inc., Santa Barbara, CA. He is now with the Instituto de Ingenieria, National University of Mexico, Mexico City, Mexico. R. M. Gray is with the Information Systems Laboratory, Stanford University, Stanford, CA 94305. In particular, the algorithm’s convergence properties are demonstrated herein by several examples. We consider the usual test cas. for such algorithms-quantizer, design for memoryless Gaussian sources with a mean-squared. error distor- tion measure; but we design and evaluate block quantizers with a rate of one bit per symbol and with blocklengths,of 1 through 6. Comparison with recently developed lower bounds to the optimal distortion of such block quantizers (which provide strict improvement over the traditional bounds of rate-distor- tion theory) indicate that the resulting quantizers are indeed nearly optimal and not simply locally optimal. We also con- sider a scalar case where local optima arise and show how a variation of the algorithm yields a global optimum. . The algorithm is also used to design a quantizer for 1Odi- mensional vectors- arising in speech compression systems. A complicated distortion measure is used that does not simply depend on the error vector. No probabilistic model is assumed, and hence the quantizer must be designed based on a training sequence of real speech. Here the convergence properties for both length of the training sequence and the number of iterations of the algorithm are demonstrated experimentally; No theoretical optimumjs known for this case, but our system was used to compress the output of a traditional 6000 bit/s Linear Predictive Coded (LPC) speech system down to a rate of 1400 bitthat assigns to each input vector, x = (xo, -, X~LI), a reproduc- tion vector, i = q(x), drawn from a finite reproduction alphabet, A = bi; i = 1, -, N}. The quantizer 4 is compl$ely described by the reproduction alphabet (or codebook) A to- gether with the partition, S = {Si; i = 1, *-*, N), of the input vector space into the sets Si = {x: q(x) =vi) of input vectors mapping into the ith reproduction vector (or codeword)., Such quantizers are also called block quantizers, vector quantizers, and block source codes. DISTORTION MEASURES We assume the distortion caused by reproducing an input vector x by a reproduction vector i is giveri by a nonnegative distortion measure d(x, 2). Many such distortion measures have been proposed in the literature. The most common for 0090-6778/80/0100-0084$00.75 0 1980 IEEE LINDE et al. : ALGORITHM FOR VECTOR QUANTIZER DESIGN 85 reasons of mathematical convenience is the squarederror dis- tortion. k-1 i= 0 Other common distortion measures are the l,, or Holder norm, and its vth power, the uth-law distortion: b-1 While both distortion measures (2) and (3) depend on the uth power of the errors in the separate coordinates, the measure of (2) is often more useful since it is a distance or metric and hence satisfies the triangle inequality, d(x, 2) d d(x, y) +

[email protected], .?), for all y. The triangle inequality allows one to bound the overall distortion easily in a multi-step system by the sum of the individual distortions incurred in each step. The usual vth-law distortion of (3) does not have this property. Other distortion measures are the I,, or Minkowski norm, the weighted-squares distortion, k--3 where wi 2 0, i = 0, .-e, k - 1, and the more general quadratic distortion d(x, i) = (x - i)B(x - i)t where B = {Bi,j} is a k X k positive definite symmetric matrix. All of the previously described distortion measures have the property that they depend on the vectors x and P only through the error vector x - i. Such distortion measures having the form d(x, i) = L(x -2) are called difference distortion meas- ures. Distortion measures not having this form but depending on x and 2 in a more complicated fashion have also been pro- posed for data compression systems. Of interest here is a dis- tortion measure of Itakura and Saito [3, 41 and Chaffee [5, 321 which arises in speech compression systems and has the form d(x , i) = (x - i)R @)(x - i)f , (7) where for each x, R(x) is a positive definite k X k symmetric matrix. This distortion resembles the quadratic distortion of (6), but here the weighting matrix depends on the input vector We are here concerned with the particular form and applica- tion of this distortion measure rather than its origins, which are treated in depth in [3-91 and in a paper in preparation. For motivation, however, we briefly describe the context in which this distortion measure is used in speech systems. In the LPC approach to speech compression [IO] , each frame of sampled speech is modeled as the output of a finite-order all-pole filter driven by either white noise (unvoiced sounds) or a periodic pulse train (voiced sounds). LPC analysis has, as input, a frame of speech and produces parameters describing the model. These parameters are then quantized and transmitted. One collection of such parameters consists of a voiced/unvoiced decision together with a pitch estimate for voiced sounds, a gain term u (related to volume), and the sample response of the normalized inverse filter (1, al, a2, .*a, aK), that is, the normalized all-pole model has transfer function or z-transform (Zf=o Q~z-~}-~. Other parameter descriptions such as the reflection coefficients are also possible [ 101 . In traditional LPC systems, the various parameters are quantized separately, but such systems have effectively reached their theoretical performance limits [ 1 I] . Hence it is natural to consider block quantization of these parameters and com- pare the performance with the traditional scalar quantization techniques. Here we consider the case where the pitch and gain are (as usual) quantized separately, but the parameters describ- ing the normalized model are to be quantized together as a vector. Since the lead term is 1, we wish to quantize a vector (al, u2, .e-, uK) k = 0, 1, as-, K - 1 ;j = 0, 1, e-, K - 1) defined by X. described by x when the input has a flat unit amplitude spectrum. Many properties and alternative forms for this particular distortion measure are developed in [3-91, where it is also shown that standard LPC systems implicitly minimize this distortion, which suggests that it is also an appropriate distor- tion measure for subsequent quantization. Here, however, the important fact is that it is not a difference distortion measure; it is one for which the dependence on x and i is quite complicated. We also observe that various functions of the previously defined distortion measures have been proposed in the lit- erature, for example, distortion measures of the forms IIx - illr and p( Ilx - ill), where p is a convex function and the norm is any of the previously defined norms. The techniques to be developed here are applicable to all of these distortion measures. 86 IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. COM-28, NO. 1, JANUARY 1980 PERFORMANCE kt x = (xo, *,Xk-l) be a real random vector described by a cumulative distribution function F(x) = Pr{Xi Qxi;i = 0, 1, a**, k - 1). A measure of the performance of a quantizer 4 applied to the random vector X is given by the expected distortion where E denotes the expectation with respect to the under- lying distribution F. This performance measure is physically meaningful if the quantizer q is to be used to quantize a sequence of vectors X, = (X,K, e-, XnK+K-l) that are sta- tionary and ergodic, since then the time-averaged distortion, converges with probability one to D(4) as n -+ 00 (from the ergodic theorem), that is, D(4) describes the long-run time- averaged distortion. An alternative performance measure is the maximum of d(x, 4(x)) over all x in A, but we use only the expected dis- tortion (9) since, in most problems of interest (to us), it is the average distortion and not the peak distortion that deter- mines subjective quality. In addition, the expected distortion is more easily dealt with mathematically. OPTIMAL QUANTIZATION An N-level quantizer will be said to be optimal (or globally optimal) if it minimizes the expected distortion, that is, 4* is optimal if for all other quantizers 4 having N reproduction vectors D(q*) D(q). A quantizer is said to be locally opti- mum if D(q) is only a local minimum, that is, slight changes in q cause an increase in distortion. The goal of block quan- tizer design is to obtain an optimal quantizer if possible and, if not, to obtain a locally optimal and hopefully “good” quantizer. .Several such algorithms have been proposed in the literature for the computer-aided design of locally optimal quantizers. In a few special cases, it has been possible to demonstrate global optimality either analytically or by ex- hausting alllocal optima. In 1957,in a classic but unfortunately unpublished Bell Laboratories’ paper, S. Lloyd [l] proposed two methods for quantizer design for the scalar case (k = 1) with a squarederror distortion criterion. His “Method II” was a straightforward variational approach wherein he took deriva- tives with respect to the reproduction symbols,yi, and with respect to the boundary points defining the Si and set these derivatives to zero. This in general yields only a “stationary- point” quantizer (a multidimensional zero derivative) that satisfies necessary but not sufficient conditions for optimality. By second derivative arguments, however, it is easy to establish that such stationary-point quantizers are at least locally opti- mum for vth-power law distortion measures. In addition, Lloyd also demonstrated global optimality for certain distri- butions by a technique of exhaustively searching all local optima. Essentially the same technique was also proposed and used in the parallel problem of cluster analysis by Dalenius [12] in 1950, Fisher [13] in 1953, and Cox [14] in 1957. The technique was also independently developed by Max [ 151 in 1960 and the resulting quantizer is commonly known as the Ldoyd-Max quantizer. This approach has proved quite useful for designing scalar quantizers, with power-law distor- tion criteria and with known distributions that were suffi- ciently well behaved to ensure the existence of the derivatives in question. In addition, for this case, Fleischer [ 161 was able to demonstrate analytically that the resulting quantizers were globally optimum for several interesting probability densities. In some situations, however, the direct variational approach has not proved successful. First, if k is not equal to 1 or 2, the computational requirements become too complex. Simple combinations of one-dimensional differentiation will not work because of the possibly complicated surface shapes of the boundaries of the cells of the partition. In fact, the only suc- cessful applications of a direct variational approach to multi- dimensional quantization are for quantizers where the parti- tion cells are required to have a particular simple form such as multidimensional “cubes” or, in two dimensions, “pie slices,” each described only by a radius and two angles. These shapes are amenable to differentiation techniques, but only yield a local optimum within the constrained class. Secondly, if, in addition, more complex distortion measures such as those of (4)-(7) are desired, the required computation asso- ciated with the variational equations can become exorbitant. Thirdly, if the underlying probability distribution has discrete components, then the required derivatives may not exist, causing further computational problems. Lastly, if one lacks a precise probabilistic description of the random vector X and must base the design instead on an observed long training sequence of data, then there is no obvious way to apply the variational approach. If the underlying unknown process is stationary and ergodic, then hopefully a system designed by using a sufficiently long training sequence should also work well on future data. To directly apply the variational tech- nique in this case, one would fir