# 1980TSVQ-Speech coding based upon vector quantization.pdf

562 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-28, NO. 5, OCTOBER 19813 Speech Coding Based Upon Vector Quantization ANDRES BUZO, MEMBER, IEEE, AUGUSTINE H. GRAY, JR., SENIOR MEMBER, IEEE, ROBERT M. GRAY, FELLOW, IEEE, AND JOHN D. MARKEL, SENIOR MEMBER, IEEE Abstract-With rare exception, all presently available narrow-band speech coding systems implement scalar quantization (independent quantization) of the transmission parameters (such as reflection coef- ficients or transformed reflection coefficients in LPC systems). This paper presents a new approach cdled vector quantization. For very low data rates, realistic experiments have shown that vector quantization can achieve a given level of average distortion with 15 to 20 fewer bits/frame than that required for the optimized scalar quantiz- ing approaches presently in use. The vector quantizing approach is shown to be a mathematically and computationally tractable method which builds upon knowledge obtained in linear prediction analysis studies. This paper introduces the theory in a nonrigorous form, along with practical results to date and an extensive list of research topics for this new area of speech coding. T I. INTRODUCTION HE ubiquitous LPC technique of speech coding can be viewed as a two-step process as shown in Fig. 1. The first step is an identification process whereby an all-pole model G,(z) which best matches the input speech frame X(z) (or possibly the preprocessed speech frame) is calculated. The best match is based on some predefined measure of optimality. The model G,(z) implicitly also includes a gain term so that the only remaining transmission parameter is the pitch esti- mate (which includes a voicing decision). The second step is compression or quantization of the parameters from the identification step for efficient transmis- sion or storage. A great deal has been written about the iden- tification step (see [l] , [2] and their bibliographies, for exam- ple), while substantially less has been written about the compression or quantization step [3]-[7]. Traditionally, the parameters from the identification step, such as reflection coefficients, have been individually quan- tized. We refer to such an approach as scalar quantization and it has also been called single symbol quantization. Orthogonal vector transformations have been applied with the idea of Manuscript received October 15, 1979; revised April 10, 1980. This research was supported in part by the Naval Research Laboratory under Contract NRLNO0039-79-0256(S), the U.S. Department of Defense under Contract MDA904-79-C-0402, the U.S. Air Force Office of Scientific Research under Contracts F49620-78-C-0087 and F49620- 79-C-0058, and by the Joint Services Electronics Program, Stanford University, Stanford, CA. A. Buzo is with the University of Mexico, Mexico City, Mexico. A. H. Gray, Jr. and J. D. Markel are with Signal Technology, Inc., R. M. Gray is with the Information Systems Laboratory, Stanford Santa Barbara, CA 93101. University, Stanford, CA 94305. I L _--_-_--__-________ J Fig. 1. Illustration showing process of LPC analysis as a two-step process. eliminating interparameter correlation. However, scalar quan- tization has then been applied to the transformed coefficients Other techniques, such as variable frame rate [9] or inter- frame coding, which can be “added on” to most speech com- pression systems, have been studied as a way to increase com- pression. To achieve highly efficient parameter compression, one must look beyond scalar quantization or the heuristics mentioned above to rate distortion techniques based on an appropriate fidelity criterion. Here, the fidelity criterion will include minimization of a distortion measure, not only in the identification step (as implemented in linear prediction analy- sis), but also in the compression or quantization step. Vector quantization, a design approach to thk problem, has been developed during the past few years. The historical develop- ment of this area is covered in detail elsewhere [ 131 , [ 141 . This paper presents a nonrigorous mathematical introduction to the new area of vector quantization in a deterministic manner along with experimental results. The presentation expands upon previously published results in linear prediction analysis. A more rigorous development of many of the theo- retical concepts is presented elsewhere [lo] , [ 131 , [18] , [21]. In addition to the mathematical development, this paper presents the first experimental comparison between optimized scalar quantization and vector quantization. The results from both objective and subjective evaluation show dramatic bit savings for very low data rate conditions. 11. ITAKURA-SAITO DISTORTION MEASURE [81* For a distortion measure to be of value in vector quantizing it must be analytically tractable, computable from sampled data, and, most important, subjectively meaningful. A distor- tion measure that results in the standard linear prediction 0096-3518/80/1000-0562$00.75 0 1980 IEEE BUZO et al.: SPEECH CODING 563 analysis equations under the assumption of no distortion due to compression is also desirable. The Itakura-Saito distortion meamre, introduced as an error matching function, appears to satisfy all of the above require- ments [ 111 . This section reviews several preliminary mathe- matical results and introduces the Itakura-Saito distortion measure along with several properties relevant to specific vec- tor quantizer implementations. A. Preliminaries Let X(z) represent the z transform of the windowed (and possibly preemphasized) speech data, which are to be modeled by an all-pole filter of the form G (z) P u/A (z) (1) A(z)A(l/z)-ru(n) =xakak+n (7b) k (z GM ( 1 /z - hz (n). (7c) Limits have not been placed on the summations to indicate that they are summations over all values of k. As will be pointed out, however, these will always be finite summations. The integral which defines the residual energy in (4) can be precisely expressed for numerical evaluation as It is well known [l] that the model GM(z) matches the signal X(z) in terms of the 2M + 1 term autocorrelation sequence M A(z) A akz-k, with a, = 1. k=n and therefore rM (n) can be used to replace rx (n) in the sum- (2) mation of (8). .” - In describing the spectral matching effects of linear predic- In linear Predictive analysis the PolYnomial is wed to tion, Itakura and Saito introduced an “error matching func- minimize a residual energy. In Particular, using /XI2 and IA l2 tion” [ll] which is also referred to as the Itakura-Saito dis- to denote energy density spectra tortion measure [ 101 . This measure will be denoted by IXI2 1Gj2),where then the residual energy resulting from passing X(z) through the inverse fdter A(z) is given by ~(IXI~;IGI~)GJ~ [1x/cl2 -ln(lX/G/2)- 11 dB -. (10) -77 2n Many of its properties can be found elsewhere [lo]-[ 121 , (4) [ 181 . One of the properties that interests us here is that it is We denote the polynomial which minimizes the residual energy as AM(z) and the minimum value of the residual as aM so that a nonnegative function of the spectra, whose minimum value for a given ]XI2 and G(z) given by (1) occurs when G(z) = GM(z). Thus, a aM. d(1X12;IG12)~d(lX12;;GM12)=ln(aM/am). (1 1) When IX/G I is near unity, (10) takes on the approximate form The model chosen in linear prediction analysis based upon the identification step in Fig. 1-is then Gw (z) = 6/AM (z- It is known [l] that AM(z) is a phase poly- so that for small distortion the Itakura-Saito distortion mea- nomial, that is, has its roots inside the unit circle so that GM(z) is stable. In addition, the minimum residual energies form a decreasing sequence as M increases, approaching a lower limit sure is approximately one-half the mean-square log spectral deviation. defined here as a,, which is given by B. Usejid Properties This lower limit a, is sometimes called the one-step predic- tion error or gain of [I]. The actual solution process for finding AM(z) is also well known (see [l] and the references contained therein) and is not considered here. Using arrows to denote z transform relationships, we can define the autocorrelation sequences X(z) X( l/z) t--, rx (n) = x (k) x (k + n) (7a) k A set of properties of the Itakura-Saito distortion measure to be used in the next section is now presented. A sketch of the derivation of these results is given in the Appendix. The properties are developed in detail in [lo] and [ 121. First, for purposes of calculation and interpretation, the Itakura-Saito distortion measure can be expressed in the form d[l~1~;I~1~] =a/02 +In(u2)- In(@-)- 1 (1 3) where u, a, and a, are defined by (1)) (4), and (6), respec- tively. Surprisingly, it can be shown to satisfy a form of “triangle equality” d[JX12;IG(21 =d[1X12;IG~121 +~[\GM~~;(G~~I, (14) as shown in the Appendix. Referring to Fig. 1 we see that the total distortion in the analysis can be viewed as precisely the sum of the distortion due to the identification step and the distortion due to the compression or quantization step. In general, one can only hope for a triangle inequality whereby the total distortion is bounded by the sum of the individual distortions. Also, minimizing d[ ]XI2; I G 12] is equivalent to minimizing d[lG~(’; (G(’]], for d[lX(’; IGMI2]] is a fixed property of I X 1’ , as indicated in (1 I), if M is held constant. A second useful gain cascading property is given by d[lX12;u2/IA12] =d[IX12;a/lA12] +d[a;u2], (1 5) which divides the distortion into two parts. The first part is independent of the gain choice o, while the second part is apparently independent of the polynomial A (z), although in fact it does depend upon A(z) through the residual energy a. The first term has also been called the gain optimized Itakura- Saito distortion measure or simply the “Itakura distortion measure” [IO], [12]. We will show later that this form leads to implementation of a suboptimal but storage-efficient and a practical narrow-band speech compression system. 111. VECTOR QUANTIZATION The standard autocorrelation method LPC system first obtains an optimal model GM(z) for a speech frame X(z), and then quantizes its parameters, leading to a quantized version G(z). This system implicitly minimizes the Itakura-Saito dis- tortion measure [l, p. 1351, [lo], [ll] at the first step (identification of the optimal model), but does not use this distortion measure for the second step (compression or quanti- zation of the parameters). It is therefore unknown whether any particular overall distortion from the speech spectrum 1x1’ to the quantized model spectrum (G l2 has in fact been minimized. An alternate approach is to define a reasonable distortion measure and attempt to choose G(z) from a finite collection of vectors to minimize the overall distortion. Thus, the term vector quantization refers to the process of choosing a vector of parameters, e.g., {u,ra(0),ra(l), - - - ,ra(M)}, from a set which minimizes a distortion measure. As described in the previous section, the Itakura-Saito distortion measure is analytically tractable, perceptually meaningful, and readily computable (except for the term am in (13) which will be shown to be unnecessary). One very important and useful property of the distortion measure is the triangle equality of (14). In particular, this property shows that one can minimize the overall distortion d(IX1’; IG 12) directly in one step, or one can first obtain the ideal model GM(z) and then minimize d(]Gn/r 1’; ]G 12), which results in a two-step process as indicated in Fig. 1 [lo]. Aside from computational and storage issues, the vector quantizer being described here results in an elegant structure (shown later), once the nontrivial problem of how to deter- mine a collection of reference or reproduction vectors or code- book has been resolved. The codebook is a finite set of model fdters from which G(z) must be found. Each code word has an index number and a stored set of parameters representing a possible G(z). For example, code word number 01101001 (binary) might represent one frame of the sound /a/ as uttered by a specific test speaker, by way of its filter coefficients, reflection coefficients or autocorrelation coefficients. Then for each frame of speech, G(z) is chosen from the codebook to minimize d(IX1’; IG [’), or, equivalently, to minimize d(l GM[’ ; I G 1’). This process finds the nearest neighbor to X(z) in the codebook. The index to the code word is then transmitted and used at the receiver to retrieve the appropriate synthesis filter parameters. The problems of generating an optimal codebook (in the Itakura-Saito distortion sense) and evaluating the distortion are now analyzed. Then the computational and storage issues which lead to a practical suboptimal vector quantizing system are considered. A. Codebook Generation The generation of a codebook to minimize distortion over a large number of test frames of speech requires an iterative process. As with many minimization problems, the procedure to be described will converge to a local minimum, but not necessarily an absolute or global minimum. The basic ideas of the procedure to be described date to Lloyd [15]. Linde et al. 1131 and Gray et a2. [21 J contain a more detailed and general discussion. Let Xk(z) represent the z transform of the kth frame of speech, where k = 1 , 2, * * - , K, and X is large. These frames represent a test or learning sequence for codebook generation. We wish to model these speech frames with a finite set of models, B = 2b in number, where b is the number of bits in the codebook. Assume at this point that an initial (nonoptimum) choice has been made for a set of B code words. As a first step, each speech frame, represented by Xk(z), is taken individ- ually and assigned to a “cell” represented by a single code word. This is accomplished by finding the particular G(z) out of the collection of B possible models which minimizes d[1X,I2; [GI2]. That is, the objective is to find the particular G(z) in the codebook which is the nearest neighbor to X, (z). The total overall distortion is then the sum of all the individual distortions in the database, with one from each speech frame. The second step attempts to improve the codebook by choosing a better model for each cell. For simplicity, consider a single cell, with the frames of speech within that cell renum- bered as XI (z), X,(z), * * - ,XL (z), so that the total distortion for that cell is given by Dk drJXkJ2;)GJ2]. L k= 1 (16) Then we must find a single model G(z) from an infinite num- ber of possibilities to minimize the distortion in that particular cell by identifying what might be called the centroid of the cell. The solution to this problem is equivalent to a stan- dard linear prediction analysis problem of minimizing d[IXI2; IG1’1,where 1X12isthearithmeticmeanofthe individual cell spectra 1L 1x1’ 2L lXk12. (1 7) k= 1 BUZO et al.: SPEECH CODING