Optimized Product Quantization

Tiezheng Ge, Kaiming He*, Qifa Ke, and Jian Sun

Abstract—Product quantization (PQ) is an effective vector quantization method used by many applications. A product quantizer can generate an exponentially large codebook at very low memory/time cost. The essence of PQ is to decompose the original high-dimensional vector space into the Cartesian product of subspaces and then quantize these subspaces separately. The optimal space decomposition is important for the performance of PQ, but remains an unaddressed issue. In this paper, we optimize PQ by minimizing quantization distortions w.r.t. the space decomposition and the quantization codebooks. We present two novel solutions to this challenging optimization problem. The first solution iteratively solves two simpler sub-problems. The second solution is based on the assumption that the input data follows a Gaussian distribution; under this assumption we provide theoretical guarantees of optimality. We evaluate our optimized product quantizers in three applications: (i) compact encoding for approximate distance computation and exhaustive ranking [1], (ii) building inverted multi-indexing for non-exhaustive search [2], and (iii) compacting image representations for image retrieval [3]. In all applications our optimized product quantizers outperform existing PQ solutions.

Index Terms—Vector quantization, nearest neighbor search, image retrieval, compact encoding, inverted indexing

1 INTRODUCTION

Approximate nearest neighbor (ANN) search is of great importance for many computer vision problems, such as image/video retrieval [4], image classification [5], [6], and object/scene recognition [7]. Vector Quantization (VQ) [8] is a popular and successful method for ANN search.
The vector quantization method is used in two ways for ANN search: (i) to build inverted indexing [4] for non-exhaustive search, or (ii) to encode vectors into compact codes [1], [9], [10] for exhaustive search. In non-exhaustive search, the quantizers can be k-means [4], hierarchical k-means [11], etc. A query is quantized into a codeword and then compared with a short list of data that have the same or similar codewords. In exhaustive search, the data are quantized into codewords [1], [9], [10]; the distances between vectors are approximated by the distances between their codewords. With a few dozen bits used per vector, the memory footprint is small, and the search time can be as little as tens of milliseconds for scanning one million quantized data points. The compact encoding methods can be combined with inverted indexing for reranking (e.g., as in [1], [2]) to achieve real-time and high-quality search in billions of vectors.

* Corresponding author.
T. Ge is with the University of Science and Technology of China, Hefei, China. E-mail: [email protected]
K. He and J. Sun are with the Visual Computing Group, Microsoft Research Asia, Beijing, China. E-mail: {kahe,jiansun}@microsoft.com
Q. Ke is with Microsoft Bing, Sunnyvale, CA, US. E-mail: [email protected]

In information theory, a product quantizer [8] is a solution to VQ when an exponentially large number of codewords is desired. The key idea is to decompose the original vector space into the Cartesian product of M low-dimensional subspaces and quantize each subspace into k codewords by k-means. The effective number of codewords in the original space is k^M, but the cost of storing them is merely O(Dk) for D-dimensional vectors. Such a Product Quantization (PQ) technique has been used for compact encoding for approximate distance computation in [1], and recently has also been adopted for building inverted indexing in [2].

When used for compact encoding, the time complexity is O(Mk) per distance computation using lookup tables [1]. This makes it possible to search one million documents exhaustively in ~20 milliseconds [1]. When used for building inverted indexing [2], a query can quickly find its nearest codewords (out of k^M codewords) in O(Mk) time. Currently, [1] is among the state-of-the-art compact encoding methods, whereas [2] is among the state-of-the-art inverted indexing methods.

Despite the power of PQ, the optimal decomposition of the vector space remains a largely unaddressed issue. In [1] it has been noticed that prior knowledge about the structures of the input data (SIFT/GIST) is of particular importance, and the search accuracy would become substantially worse if such knowledge were not available. But the strong reliance on prior knowledge largely limits the performance of PQ on general data, e.g., raw image pixels, compressed representations (by PCA, sparse coding, etc.), and mixed representations. The usage of the prior structures also constrains the choice of the subspace number M and thus the codebook size. In [3] it is proposed to optimize a Householder transform such that the transformed vectors have balanced variances in all components. In [3] it is also observed that a random rotation of the vector space leads to a performance similar to using the Householder transform. But the optimality in terms of quantization error remains unclear in both cases.
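To make these sizes concrete, consider illustrative numbers in the spirit of [1] (k = 256 sub-codewords, M = 8 subspaces, D = 128 dimensions; the arithmetic below is our own illustration, not taken from the paper):

```python
# Illustration (our own numbers): effective codebook size vs. storage cost for PQ.
D, M, k = 128, 8, 256                 # e.g., SIFT vectors, 8 subspaces, 256 sub-codewords each

effective_codewords = k ** M          # codewords representable in the original space: k^M
stored_floats = M * k * (D // M)      # only the sub-codebooks are stored: O(Dk) values
code_bits = M * (k.bit_length() - 1)  # M * log2(k) bits per encoded vector

print(effective_codewords)            # 256^8 = 2^64, about 1.8e19 codewords
print(stored_floats)                  # 32768 floats, i.e., D * k
print(code_bits)                      # 64 bits per vector
```

An ordinary k-means codebook with 2^64 codewords would be impossible to store or train; PQ reaches that effective size with only D·k stored values.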
In this paper, we formulate PQ as an optimization problem that minimizes the quantization distortion by seeking optimal codewords and space decompositions. Such an optimization problem is challenging due to the large number of unknown parameters. In this work we present two solutions. In the first solution, we iteratively solve two simpler sub-problems: solving for the space decomposition with the codewords fixed, and vice versa. This solution is non-parametric in that it does not assume any prior information about the data distribution. Our second solution is a parametric one in that it assumes the data follows a Gaussian distribution. Under this assumption, we derive an analytical formulation of the lower bound of the quantization distortion. We then theoretically prove that this lower bound is minimized when (i) the subspaces are mutually independent, and simultaneously (ii) the vectors have balanced variances in all subspaces. Based on these theories, we propose a simple and greedy Eigenvalue Allocation method to effectively optimize the space decomposition.

We demonstrate by experiments the superiority of our methods in three applications:

• When used for compact encoding for exhaustive search, our method outperforms several variants of the non-optimized PQ and other VQ methods [9], [10]. Our method is also substantially better than many state-of-the-art binarization methods ([12], [13], [14], [15], [16]).

• When used for building inverted indexing, our method improves the "inverted multi-index" method [2] by optimizing its codebook. Further, we show that the performance can be improved even more when our optimized PQ is adopted in the combination of inverted multi-index and compact encoding (for reranking candidates in the short lists). This marks the state-of-the-art performance on the one billion SIFT dataset.

•
We further apply our optimized PQ to compact aggregated image representations (VLAD [3] and Fisher Vectors [5]) for image retrieval. Our method provides better retrieval accuracies than the original PQ encoding method.

A preliminary version of this work was published in CVPR 2013 [17]. Concurrent with [17], a very similar work, "Cartesian k-means" [18], was independently developed. We note that our first solution in [17] (the non-parametric one, Sec. 3.1) is equivalent to the Cartesian k-means method. But our second solution (the parametric one, Sec. 3.2) takes a step further than [18], in that it provides more solid theory and understanding of the problem. In particular, it provides theoretical guarantees of global optimality under some practical assumptions. It also provides a near-optimal initialization (superior to a random one) for the non-parametric solution when the data is nearly Gaussian. Thus our work makes significant contributions in both practical and theoretical aspects. We have published the Matlab code^1 of our two solutions for the ease of practical applications and theoretical analysis.

2 QUANTIZATION DISTORTION

In this section, we show that a variety of ANN methods, including k-means [19], Product Quantization [1], and Iterative Quantization [20], [10], can be formulated within a framework of vector quantization [8]. In our formulation, the quantization distortion is a common objective function among the different methods studied. Our formulation treats the specific configuration of each method as constraints when optimizing the common objective function, rather than incorporating the configuration into the objective function. In this way, various methods (including the proposed one) can be studied and discussed in a unified framework.

2.1 Vector Quantization

A vector quantization (VQ) system [8] maps a vector x ∈ R^D to a codeword c in a codebook C = {c(i)}, with i in a finite index set.
The mapping, termed a quantizer, is denoted by x → c(i(x)). In information theory, the function i(·) is called an encoder, and the function c(·) is called a decoder [8].^2 The quantization distortion E is defined as:

E = \frac{1}{n} \sum_x \| x - c(i(x)) \|^2 ,    (1)

where \|·\| denotes the Euclidean distance, n is the total number of data samples, and the summation is over all points in the given sample set. Notice that this definition of distortion applies to any quantizer, no matter what the encoder and decoder are.

Given a codebook C, an encoder that minimizes the distortion E must satisfy the first Lloyd's condition [8]: the encoder i(x) should map any x to its nearest codeword in the codebook C. Notice that this property is valid no matter what the codebook is. Such a quantizer is known as a Voronoi quantizer [8].

2.2 Codebook Generation

We show that a variety of methods minimize the distortion w.r.t. the codebook under different constraints.

1. research.microsoft.com/en-us/um/people/kahe/
2. In this paper, we abuse the notations of i(·) and i to simplify the presentation: by i(·) we mean the mapping of a vector to an index, and by i we mean the index of a certain codeword. Likewise, we abuse the notations of c(·) and c: by c(·) we mean the mapping of an index to a codeword, and by c we mean a certain codeword. We believe these notations are clear given the context of their usage.

2.2.1 K-means

If there is no constraint on the codebook, minimizing the distortion in Eqn. (1) leads to the classical k-means clustering algorithm [19]. With the encoder i(·) fixed, a codeword c(i) is the mean of the vectors that are indexed as i; this is the second Lloyd's condition [8]. The classical k-means algorithm iteratively updates the encoder and the decoder.
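Alternating the two Lloyd's conditions yields the familiar k-means loop. The following is a minimal Python sketch of this alternation (our own illustration; the paper's published code is in Matlab, and all names here are ours):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimize the distortion of Eqn. (1) with an unconstrained codebook."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]   # initial codebook: k data points
    for _ in range(iters):
        # First Lloyd's condition: encode each x by its nearest codeword.
        idx = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        # Second Lloyd's condition: each codeword is the mean of its vectors.
        for i in range(k):
            if (idx == i).any():
                C[i] = X[idx == i].mean(0)
    # Final encoding and distortion E of Eqn. (1).
    idx = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
    distortion = ((X - C[idx]) ** 2).sum(1).mean()
    return C, idx, distortion
```

Each of the two steps can only decrease (or keep) the distortion, which is why the loop converges to a local minimum of Eqn. (1).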
2.2.2 Product Quantization

If any codeword c must be taken from the Cartesian product of a finite number of sub-codebooks, minimizing the distortion in Eqn. (1) leads to the PQ method [1].

Formally, denote any x ∈ R^D as the concatenation of M subvectors: x = [x^1, ..., x^m, ..., x^M]. For simplicity it is assumed that the subvectors have an equal number of dimensions D/M. The Cartesian product C = C^1 × ... × C^M is the set in which a codeword c ∈ C is formed by concatenating the M sub-codewords: c = [c^1, ..., c^m, ..., c^M], with each c^m ∈ C^m. We point out that the objective function for PQ, though not explicitly defined in [1], is essentially:

\min_{C^1,\dots,C^M} \sum_x \| x - c(i(x)) \|^2 ,    (2)
s.t.  c \in C = C^1 \times \dots \times C^M .

It is easy to show that x's nearest codeword c in C is the concatenation of the M nearest sub-codewords, c = [c^1, ..., c^m, ..., c^M], where c^m is the nearest sub-codeword of the subvector x^m. So Eqn. (2) can be split into M separate subproblems, each of which can be solved by k-means in its corresponding subspace. This is exactly what the PQ algorithm [1] does.

The benefit of PQ is that it can easily generate a codebook C with an exponentially large number of codewords. If each sub-codebook has k sub-codewords, then their Cartesian product C has k^M codewords. This is not possible for classical k-means when k^M is large. The cost of storing these codewords is merely O(Mk · D/M) = O(Dk), because only the sub-codebooks are stored.

The PQ developed in [1] is used for compact encoding and approximate distance computation. The cost of storing each encoded vector is M log_2 k bits. The distance between a query and any vector is approximated by the distance between their codewords (known as Symmetric Distance Computation, or SDC), or by the distance between the query and the codeword of the vector (known as Asymmetric Distance Computation, or ADC). Both ways of distance computation are efficient using lookup tables.
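The decomposition of Eqn. (2) into M independent k-means problems, the resulting encoder, and the ADC lookup-table approximation can be sketched as follows (an illustrative Python sketch under the paper's equal-dimension assumption; all function names are ours, and `_kmeans` is a minimal stand-in for any standard k-means routine):

```python
import numpy as np

def _kmeans(X, k, iters=15, seed=0):
    # Minimal stand-in for any standard k-means routine.
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        idx = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        for i in range(k):
            if (idx == i).any():
                C[i] = X[idx == i].mean(0)
    return C

def pq_train(X, M, k):
    """Eqn. (2) splits into M independent k-means problems, one per subspace."""
    d = X.shape[1] // M                       # dimensions per subspace (assumes M | D)
    return [_kmeans(X[:, m * d:(m + 1) * d], k) for m in range(M)]

def pq_encode(X, codebooks):
    """Encode each vector by its M nearest sub-codeword indices (M * log2(k) bits)."""
    d = codebooks[0].shape[1]
    codes = np.empty((len(X), len(codebooks)), dtype=np.int64)
    for m, C in enumerate(codebooks):
        sub = X[:, m * d:(m + 1) * d]
        codes[:, m] = ((sub[:, None] - C[None]) ** 2).sum(-1).argmin(1)
    return codes

def adc_distances(q, codebooks, codes):
    """ADC: one 1-by-k table of squared distances per subspace, then M lookups."""
    d = codebooks[0].shape[1]
    tables = [((q[m * d:(m + 1) * d] - C) ** 2).sum(1)
              for m, C in enumerate(codebooks)]
    # distance(q, x) ~ sum over m of table^m[code^m(x)]
    return sum(t[codes[:, m]] for m, t in enumerate(tables))
```

Note that `adc_distances` returns exactly the squared distance between the query and each vector's codeword: the lookup tables merely cache the per-subspace terms of that sum.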
For SDC, the distances between any two sub-codewords in a subspace are pre-computed and stored in a k-by-k lookup table. For ADC, the distances between the sub-vector of the query and the sub-codewords in a subspace are pre-computed on-line and stored in a 1-by-k lookup table. The distance in the original space is simply the sum of the distances computed from the M subspaces: this operation is fast and makes it possible to exhaustively scan one million vectors in ~20 milliseconds using a single core.

2.2.3 Iterative Quantization

If any codeword c must be taken from "the vertices of a rotating hyper-cube", minimizing the distortion leads to a binary embedding method called Iterative Quantization (ITQ) [10].

The D-dimensional vectors in {-a, a}^D are the vertices of an axis-aligned D-dimensional hyper-cube. Suppose the data has been zero-centered. The objective function in ITQ [10] is essentially:

\min_{R,a} \sum_x \| x - c(i(x)) \|^2 ,    (3)
s.t.  c \in C = \{ c \mid Rc \in \{-a, a\}^D \},  R^T R = I ,

where R is an orthogonal matrix and I is an identity matrix. This gives 2^D codewords in C.

This optimization problem is also solved iteratively in [10], just like k-means: update the codebook (represented by R and a) with the index of each vector fixed, and vice versa. In [10] it has also been shown that the length a in (3) does not impact the resulting partition planes. We keep a here because it matters when we compare the distortion with other quantization methods.

The benefit of using a rotating hyper-cube as the codebook is that the squared Euclidean distance between any two codewords is equivalent to the Hamming distance between their indices. So ITQ can be viewed as belonging to the category of binary hashing methods [12], [13], [14], [15], where one hash function determines one bit. Eqn. (3) also indicates that any orthogonal binary hashing method [20], [10] can be viewed as a vector quantizer.
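The alternation for Eqn. (3) can be sketched as follows (our own Python illustration, not the authors' code; the rotation update uses the standard orthogonal Procrustes/SVD solution as in [10], and the closed-form update of a is the minimizer of the distortion with the sign pattern fixed; we use the row-vector convention, so the codeword for x is a·sign(xR)·R^T):

```python
import numpy as np

def itq(X, iters=50, seed=0):
    """Sketch of Eqn. (3): fit codewords on a rotated hyper-cube {-a, a}^D
    to zero-centered data X (n x D)."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    R = np.linalg.qr(rng.normal(size=(D, D)))[0]    # random orthogonal init
    for _ in range(iters):
        # Encoder step: the nearest hyper-cube vertex is the sign pattern
        # of the rotated coordinates (rotation preserves Euclidean distance).
        B = np.where(X @ R >= 0, 1.0, -1.0)
        # Codebook step: closed-form edge half-length a minimizing ||XR - aB||^2,
        # then the orthogonal Procrustes solution for R via SVD.
        a = (B * (X @ R)).sum() / (n * D)
        U, _, Vt = np.linalg.svd(X.T @ B)
        R = U @ Vt
    # Final encoding and distortion under the last codebook (R, a).
    B = np.where(X @ R >= 0, 1.0, -1.0)
    a = (B * (X @ R)).sum() / (n * D)
    distortion = ((X - a * B @ R.T) ** 2).sum(1).mean()
    return R, a, distortion
```

Each row of B is the binary index of a vector, so the Hamming distance between indices directly reflects the squared Euclidean distance between the corresponding codewords.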
2.3 Distortion as the Objective Function

In our unified framework, the above methods all optimize the same form of quantization distortion, but subject to different constraints. This implies that dis-