Additive Quantization for Extreme Vector Compression

Artem Babenko
Yandex, Moscow
Moscow Institute of Physics and Technology
[email protected]

Victor Lempitsky
Skolkovo Institute of Science and Technology (Skoltech)
[email protected]

Abstract

We introduce a new compression scheme for high-dimensional vectors that approximates the vectors using sums of M codewords coming from M different codebooks. We show that the proposed scheme permits efficient distance and scalar product computations between compressed and uncompressed vectors. We further suggest vector encoding and codebook learning algorithms that can minimize the coding error within the proposed scheme. In the experiments, we demonstrate that the proposed compression can be used instead of or together with product quantization. Compared to product quantization and its optimized versions, the proposed compression approach leads to lower coding approximation errors, higher accuracy of approximate nearest neighbor search in datasets of visual descriptors, and lower image classification error, whenever the classifiers are learned on or applied to compressed vectors.

1. Introduction

At least since the work [21], there has been growing interest in the computer vision community in the problem of extreme lossy compression of high-dimensional vectors. In this scenario, a large dataset of high-dimensional vectors (corresponding to image or interest point descriptors) is compressed by a large factor, so that only a few bytes are spent to represent each vector. Computer vision applications additionally require that the compressed representation permit efficient evaluation of distances and/or scalar products between a query vector (which can be uncompressed) and the dataset of compressed vectors.

Previously proposed methods fall into two groups. The first, very diverse group consists of binary encoding methods (e.g. [23, 9]) that transform each vector into a short sequence of bits, so that the Hamming distance between a pair of compressed vectors approximates the Euclidean distance between the original vectors. The second group is based on the idea of Product Quantization (PQ) [10].
PQ methods decompose each vector into components corresponding to orthogonal subspaces, and then vector quantize these lower-dimensional components using a separate small codebook for each subspace. For some kinds of data (in particular, histogram-based descriptors such as SIFT), the orthogonal subspaces corresponding to the natural dimension splitting lead to near-optimal performance, and thus each of the PQ subspaces corresponds to a subset of the dimensions of the original space (Figure 1). For other types, a significant gain in PQ performance can be obtained by transforming the vectors via a rotation found by an optimization process [15, 8].

Via a smart use of lookup tables, PQ compression permits very efficient computation of distances and scalar products between an uncompressed query vector and a large set of PQ-compressed vectors (asymmetric distance computation, or ADC, in the terms of [10]). On top of the high efficiency of the ADC, the computed distances approximate the distances between uncompressed vectors rather closely; in particular, much more accurately than the Hamming distance approximation within binary encoding methods at the same compression rate (see e.g. the comparisons in [15, 2]). PQ thus provides a unique combination of low approximation error, fixed size, and random access to data on one hand, and highly efficient evaluation of distances and scalar products with uncompressed vectors on the other hand. At the same time, the decomposition into subspaces employed by PQ makes an underlying assumption that the distributions of vectors within different subspaces are mutually independent. The performance of PQ thus decreases whenever the dependence between the subspace data distributions is strong.

In this paper we propose a new coding method called additive quantization (AQ) that generalizes PQ and further improves over PQ in accuracy, while retaining its computational efficiency to a large degree.
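To make the lookup-table mechanics of PQ's asymmetric distance computation concrete, here is a toy sketch (all names and sizes are illustrative and not taken from [10]): the query is split into M subvectors, a per-query table of squared distances to every subcodeword is built in O(DK), and each compressed vector is then scored with M lookups.

```python
import random

random.seed(0)
D, M, K = 8, 4, 16   # dimensionality, number of subspaces, codebook size
d = D // M           # each PQ codeword lives in a d-dimensional subspace

# Toy PQ codebooks: one K-entry codebook of d-dimensional codewords per subspace
codebooks = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]
             for _ in range(M)]

def pq_adc_tables(q):
    # Per-query tables: squared distance from each query subvector
    # to every codeword of the corresponding subcodebook, O(D*K) total.
    tables = []
    for m in range(M):
        sub_q = q[m * d:(m + 1) * d]
        tables.append([sum((a - b) ** 2 for a, b in zip(sub_q, cw))
                       for cw in codebooks[m]])
    return tables

def pq_adc_distance(tables, code):
    # Distance to one compressed vector: M lookups and M-1 additions
    return sum(tables[m][code[m]] for m in range(M))

# Check against the explicitly decompressed vector
q = [random.gauss(0, 1) for _ in range(D)]
code = [random.randrange(K) for _ in range(M)]
x = [v for m in range(M) for v in codebooks[m][code[m]]]  # concatenation
exact = sum((a - b) ** 2 for a, b in zip(q, x))
tables = pq_adc_tables(q)
assert abs(pq_adc_distance(tables, code) - exact) < 1e-9
```

The decomposition works because the subspaces are disjoint coordinate blocks, so the squared distance is an exact sum of per-subspace squared distances.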
Similarly to PQ, AQ represents each vector as a sum of several components, each coming from a separate codebook. Unlike PQ, AQ does not decompose the data space into orthogonal subspaces and thus does not make any subspace independence assumptions. Thus the codewords within each AQ codebook are of the same length as the input vectors, and are generally not orthogonal to each other. Consequently, unlike in PQ, the codebooks in AQ are learned within a joint optimization process. As we demonstrate in the experiments with strongly compressed visual descriptors, removing the independence assumptions allows AQ to attain better coding accuracy than PQ (Figure 1), which translates into higher accuracy of nearest neighbor search as well as of image classification, whenever the classifiers are learned on or applied to compressed vectors.

Figure 1. Product quantization (PQ) vs. additive quantization (AQ) in the case of M=4 codebooks of size K=4. Both coding methods encode the input vector with M numbers between 1 and K. In the case of PQ, this code corresponds to the concatenation of M codewords of length D/M. In the case of AQ, this code corresponds to the sum of M codewords of length D. Given suitable codebooks, AQ is able to achieve a better approximation to the input vector.

Crucially, we show that, similarly to PQ, AQ permits lookup table-based asymmetric distance and scalar product computations with uncompressed vectors. For the small codebooks typically employed by PQ, the efficiency of scalar product computation with AQ is almost the same as in PQ. Distance computation (ADC) with AQ requires either a small amount of extra time or a small amount of extra memory compared to PQ, which in many applications would be justified by the improved accuracy. The encoding, the codebook learning, the distance computation, and the improvement over PQ are all particularly favourable for short codes (e.g. four or eight bytes), i.e. extreme compression rates.
For longer codes, the two techniques (AQ and PQ) can be seamlessly combined.

Apart from PQ, AQ is connected to several other frameworks popular in computer vision and general pattern analysis. In particular, AQ can be regarded as a very special case of sparse coding with unit component weights. Below, we also show interesting connections between AQ and Markov random field (MRF) optimization. Finally, AQ can be regarded as a generalization of standard vector quantization, and the codebook learning algorithm that we propose for AQ is a generalization of the k-means algorithm.

In the remainder of the paper, we discuss the representation used by AQ, the lookup table-based computation of scalar products and distances, the encoding algorithms for the AQ representation, and the codebook learning. We then perform an extensive experimental comparison between AQ and PQ (including the optimized version [15]) in the context of nearest-neighbor search and image classification.

2. Additive quantization

2.1. Additive quantization representation

We now introduce notation and discuss additive quantization (AQ) in detail. Below, we assume that we deal with D-dimensional vectors. AQ is based around a set of M codebooks, each containing K vectors (codewords). We denote the mth codebook as C_m, and the kth codeword in the mth codebook as c_m(k). This setting is similar to PQ. However, whereas in PQ the codewords have length D/M, the codewords in AQ have length D, i.e. they have the same dimensionality as the vectors that are being encoded.

The AQ model encodes a vector x ∈ R^D as a sum of M codewords (one codeword per codebook). In more detail, a vector is encoded with an M-tuple of codeword IDs [i_1, i_2, ..., i_M], where each ID is between 1 and K. The encoding process (described below) seeks the code that minimizes the distance between x and the sum of the corresponding codewords:

    x ≈ Σ_{m=1}^{M} c_m(i_m),   i_m ∈ 1..K    (1)

If, for example, K = 256 (as in most of our experiments), then the vector is encoded into M bytes, each byte encoding a single codeword ID. The memory footprint of an AQ-encoded vector will thus be the same as that of a PQ-encoded vector (for the same M and K), while AQ can potentially represent the vector more accurately. The codebooks within AQ occupy M times more memory than within PQ; however, this memory increase is typically negligible for datasets that are large enough.

2.2. Fast distance and scalar product computations

The PQ compression permits very efficient distance computation between an uncompressed vector (e.g. a query q in a nearest-neighbor search) and a large number L ≫ K of PQ-compressed vectors, so that each distance evaluation is implemented with M lookups and M−1 additions, plus a small amount of precomputation independent of L. Such asymmetric distance computation (ADC) is arguably where the main power of PQ lies. We will now show that, with some caveats, all this is possible in the case of AQ.

We now assume that we need to compute squared Euclidean distances between the query q and the dataset of L AQ-encoded vectors. We start with the formula:

    ‖q − x‖² = ‖q‖² − 2⟨q, x⟩ + ‖x‖²    (2)

We can precompute ‖q‖² once and reuse it for each of the L dataset vectors, so the two questions that remain are how to evaluate ⟨q, x⟩ and ‖x‖² quickly. We now assume that x in (2) is AQ-compressed, i.e. x = Σ_{m=1}^{M} c_m(i_m).

Evaluating the scalar product. The evaluation of ⟨q, x⟩ can be implemented rather straightforwardly using lookup tables. Indeed,

    ⟨q, x⟩ = Σ_{m=1}^{M} ⟨q, c_m(i_m)⟩ = Σ_{m=1}^{M} T_m(i_m),    (3)

where T_m(i_m) = ⟨q, c_m(i_m)⟩ can be precomputed and stored, given the query q.
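To make the lookup-table computations concrete, here is a toy sketch (illustrative names and sizes, not the paper's implementation): the per-query tables T_m of (3) give ⟨q, x⟩ with M lookups, and a query-independent table of pairwise codeword scalar products supplies ‖x‖², completing the distance of (2).

```python
import random

random.seed(0)
D, M, K = 8, 4, 16
# Toy AQ codebooks: M codebooks of K full-dimensional codewords each
codebooks = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
             for _ in range(M)]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def query_tables(q):
    # T[m][k] = <q, c_m(k)>, computed once per query in O(D*M*K)
    return [[dot(q, cw) for cw in codebooks[m]] for m in range(M)]

# Query-independent pairwise table: P[(m, k, m2, k2)] = <c_m(k), c_m2(k2)>
P = {(m, k, m2, k2): dot(codebooks[m][k], codebooks[m2][k2])
     for m in range(M) for k in range(K)
     for m2 in range(M) for k2 in range(K)}

def sq_norm(code):
    # ||x||^2 via roughly M^2/2 lookups, exploiting symmetry of <.,.>
    total = 0.0
    for m in range(M):
        total += P[(m, code[m], m, code[m])]              # diagonal terms
        for m2 in range(m + 1, M):
            total += 2.0 * P[(m, code[m], m2, code[m2])]  # doubled off-diagonal
    return total

def adc_distance(q_sq_norm, T, code):
    # ||q - x||^2 = ||q||^2 - 2<q, x> + ||x||^2; <q, x> takes M lookups
    ip = sum(T[m][code[m]] for m in range(M))
    return q_sq_norm - 2.0 * ip + sq_norm(code)

# Check against the explicitly decoded vector x = sum_m c_m(i_m)
q = [random.gauss(0, 1) for _ in range(D)]
T = query_tables(q)
code = [random.randrange(K) for _ in range(M)]
x = [sum(codebooks[m][code[m]][i] for m in range(M)) for i in range(D)]
exact = sum((a - b) ** 2 for a, b in zip(q, x))
assert abs(adc_distance(dot(q, q), T, code) - exact) < 1e-9
```

Note that the table P never depends on the query, so it can be built once per codebook set and shared across all queries.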
Given the query and L AQ-encoded vectors, the total complexity involved will be O(DMK) to compute the lookup tables (versus O(DK) in the analogous step of PQ) and O(ML) to compute the actual scalar products. Assuming that L ≫ DK, as is typical in many applications related to nearest neighbor search, the complexity of the scalar product computation within AQ is very similar to the complexity of distance computation in the case of PQ.

Evaluating ‖x‖². Evaluating ‖x‖² can also be facilitated by lookup tables. Indeed:

    ‖x‖² = ‖Σ_{m=1}^{M} c_m(i_m)‖² = Σ_{m=1}^{M} Σ_{m'=1}^{M} ⟨c_m(i_m), c_{m'}(i_{m'})⟩    (4)

Each of the individual terms on the right-hand side can be precomputed and stored in a lookup table (note that the terms are query-independent). The number of operations required to compute ‖x‖² is therefore approximately M²/2 lookups and M²/2 additions (assuming that the symmetry of the scalar product is exploited).

Notably, the cost of computing ‖x‖² is independent of the space dimensionality D. However, it grows quadratically with M and can slow down the computation of the distances considerably. It is possible to get rid of this time overhead at the cost of a small memory overhead, using the fact that the term ‖x‖² does not depend on the query q. For this, we can encode the squared length ‖x‖² using a single byte, by non-uniformly quantizing such scalar values over the encoded (or a hold-out) dataset. At data compression time, we then augment the AQ-code of each x with the corresponding length byte. The computation of ‖x‖² in this case costs one lookup (and one extra byte per vector). In our experiments we found that the effect of length quantization on the accuracy of nearest neighbor search is minimal, which is natural to expect given that we quantize a scalar value.

Finally, we note that for some applications, e.g.
when applying a linear classifier to a large set of compressed vectors, the only required operation is the scalar product computation (between the weight vector and the encoded vector), and therefore the computation of ‖x‖² is unnecessary.

2.3. Vector encoding

We now come to the task of vector encoding, i.e. finding the AQ-representation of a vector x given the codebooks C_1, ..., C_M. We seek the code that minimizes the coding error E:

    E(i_1, i_2, ..., i_M) = ‖x − Σ_{m=1}^{M} c_m(i_m)‖²    (5)

Using (2), the error function can be rewritten as:

    E(i_1, i_2, ..., i_M) = Σ_{m=1}^{M} (−2⟨x, c_m(i_m)⟩ + ‖c_m(i_m)‖²) + 2 Σ_{1≤m<m'≤M} ⟨c_m(i_m), c_{m'}(i_{m'})⟩ + const,    (6)

where const = ‖x‖² does not depend on the chosen codes. The energy (6) is thus a sum of unary terms, each depending on a single code i_m, and pairwise terms, each depending on a pair of codes. Minimizing (6) is therefore equivalent to energy minimization (MAP inference) in a fully-connected discrete Markov random field with M nodes and K labels per node:

    E(i_1, ..., i_M) = Σ_{m=1}^{M} U_m(i_m) + Σ_{1≤m<m'≤M} V_{m,m'}(i_m, i_{m'}) + const,    (7)

where U_m(i_m) = −2⟨x, c_m(i_m)⟩ + ‖c_m(i_m)‖² and V_{m,m'}(i_m, i_{m'}) = 2⟨c_m(i_m), c_{m'}(i_{m'})⟩. Such minimization is NP-hard in general, so we resort to approximate optimization; in our experiments, a variant of Beam Search performed best.

To encode a vector x, Beam Search starts by selecting the N codewords (among all codewords in all M codebooks) that are closest to x, yielding N candidate tuples of length one. At each of the subsequent iterations (m > 1), Beam Search considers each incomplete tuple, containing m−1 codewords from m−1 codebooks. It then considers the remaining M+1−m codebooks and finds the N codewords, among all codewords in those codebooks, that are closest to the residual of x after subtracting the codewords already included in the tuple. Thus, after considering all N candidates, N² tuples of length m are created. We then pick the top N unique tuples (in terms of the approximation error for x) and keep them as candidates for the next iteration. After M such iterations, i.e. when the tuples contain exactly M elements, the tuple with the best approximation error is returned.

The lookup tables are used throughout Beam Search to make all operations (except for the precomputation of these tables) independent of the space dimensionality D. We observe that for reasonable N (e.g. N = 16 or N = 32), Beam Search performs much better than the other MRF-optimization algorithms we tried (given a comparable amount of time). Therefore, we use Beam Search in our experiments. Interestingly, we also tried to formulate the MRF optimization (7) as an integer quadratic program and submitted it to a general-purpose branch-and-cut optimizer [1]. For small codebooks (K = 64, M = 8) and given a very large amount of time, the solver could find global minima with much smaller energies (coding errors) than those found by Beam Search or other non-exhaustive algorithms.
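The beam-search encoder described above can be sketched as follows. This is a simplified toy version, assuming small sizes: at each step it scores all codewords of every unused codebook for every partial tuple, rather than pre-selecting only the N nearest candidates, and it works on raw vectors instead of lookup tables; all names are illustrative.

```python
import random

def sq(v):
    return sum(a * a for a in v)

def beam_search_encode(x, codebooks, beam_width=8):
    # Keep the beam_width best partial tuples; each step extends every
    # partial tuple by one codeword from a not-yet-used codebook.
    M = len(codebooks)
    beam = [(sq(x), x, {})]  # (error, residual, {codebook_index: codeword_id})
    for _ in range(M):
        candidates = []
        for _, residual, used in beam:
            for m in range(M):
                if m in used:
                    continue
                for k, cw in enumerate(codebooks[m]):
                    r = [a - b for a, b in zip(residual, cw)]
                    candidates.append((sq(r), r, used, m, k))
        candidates.sort(key=lambda c: c[0])
        beam, seen = [], set()
        for err, r, used, m, k in candidates:
            key = frozenset(list(used.items()) + [(m, k)])  # dedupe tuples
            if key in seen:
                continue
            seen.add(key)
            new_used = dict(used)
            new_used[m] = k
            beam.append((err, r, new_used))
            if len(beam) == beam_width:
                break
    best_err, _, best_used = beam[0]
    return [best_used[m] for m in range(M)], best_err

# Toy check: the reported error must match the decoded approximation error
random.seed(0)
D, M, K = 8, 2, 8
codebooks = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
             for _ in range(M)]
x = [random.gauss(0, 1) for _ in range(D)]
code, err = beam_search_encode(x, codebooks)
approx = [sum(codebooks[m][code[m]][i] for m in range(M)) for i in range(D)]
exact_err = sum((a - b) ** 2 for a, b in zip(x, approx))
assert abs(err - exact_err) < 1e-9
```

In the paper's lookup-table formulation, the residual is never formed explicitly; this sketch keeps it for readability, which changes the cost but not the set of tuples explored.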
While such a "smart brute-force" approach is infeasible for large datasets and meaningful codebook sizes, this result suggests AQ-coding problems as interesting instances for research into new MRF-optimization algorithms.

2.4. Codebook learning

We finally consider the task of learning the codebooks, i.e. finding a set of M codebooks {C_1, C_2, ..., C_M} that can encode a vector set X = {x_1, ..., x_n} with low coding error. Thus, we seek to minimize:

    min_{C_1,...,C_M ⊂ R^D, |C_m|=K, i_m^j ∈ 1..K}  Σ_{j=1}^{n} ‖x_j − Σ_{m=1}^{M} c_m(i_m^j)‖²    (8)

Similarly to other codebook learning approaches, we perform the minimization using block-coordinate descent, alternating the minimization over the assignment variables (codes) i_m^j and over the codewords c_m(·). The proposed algorithm generalizes the standard k-means algorithm (which corresponds to the case M = 1).

The minimization over the encoding variables given the codebooks requires coding the vectors x_j given the codebooks, which is covered in the previous subsection. Updating the codewords given the assignments is equivalent to the following least-squares problem:

    min_{c_m(k)}  Σ_{j=1}^{n} ‖x_j − Σ_{m=1}^{M} Σ_{k=1}^{K} a_{jkm} c_m(k)‖²,    (9)

    where a_{jkm} = 1 if i_m^j = k, and a_{jkm} = 0 otherwise.

While the least-squares optimization (9) may seem large (given large n and D), one can observe that it decomposes over each of the D dimensions. Thus, instead of solving a single large-scale least-squares problem with KMD variables, it is sufficient to solve D least-squares problems with KM variables each.
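The codeword update of (9) can be sketched as follows: the fixed assignments define a shared n x (MK) 0/1 design matrix, and each of the D dimensions is then solved as a small independent least-squares problem via the normal equations. The tiny ridge term (which resolves the shift degeneracy between codebooks) and all names here are sketch-level assumptions, not part of the paper's method.

```python
import random

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting (small dense system)
    n = len(A)
    aug = [row[:] + [bv] for row, bv in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        pivval = aug[col][col]
        for r in range(n):
            if r != col and pivval != 0.0:
                f = aug[r][col] / pivval
                aug[r] = [a - f * p for a, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def update_codebooks(X, codes, M, K, ridge=1e-8):
    # Least-squares codeword update: with assignments fixed, the problem
    # decomposes into D independent least-squares problems (one per
    # dimension) sharing the same n x (M*K) 0/1 design matrix A.
    n, D = len(X), len(X[0])
    A = [[0.0] * (M * K) for _ in range(n)]
    for j, code in enumerate(codes):
        for m, k in enumerate(code):
            A[j][m * K + k] = 1.0
    # Normal equations (A^T A) c = A^T y; the ridge keeps A^T A nonsingular
    AtA = [[sum(A[j][p] * A[j][q] for j in range(n)) +
            (ridge if p == q else 0.0)
            for q in range(M * K)] for p in range(M * K)]
    new = [[[0.0] * D for _ in range(K)] for _ in range(M)]
    for dim in range(D):
        Aty = [sum(A[j][p] * X[j][dim] for j in range(n))
               for p in range(M * K)]
        c = solve(AtA, Aty)
        for m in range(M):
            for k in range(K):
                new[m][k][dim] = c[m * K + k]
    return new

# Toy check: exactly decodable data should be refit with ~zero error
random.seed(0)
D, M, K, n = 5, 2, 4, 40
true_cb = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
           for _ in range(M)]
codes = [[random.randrange(K) for _ in range(M)] for _ in range(n)]
X = [[sum(true_cb[m][codes[j][m]][i] for m in range(M)) for i in range(D)]
     for j in range(n)]
cb = update_codebooks(X, codes, M, K)
err = sum(sum((X[j][i] - sum(cb[m][codes[j][m]][i] for m in range(M))) ** 2
              for i in range(D)) for j in range(n))
assert err < 1e-4
```

Note that A^T A and its factorization are shared across all D per-dimension problems; only the right-hand side changes, which is what makes the decomposition cheap.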