Optimized Product Quantization for Approximate Nearest Neighbor Search

Tiezheng Ge 1*  Kaiming He 2  Qifa Ke 3  Jian Sun 2
1 University of Science and Technology of China  2 Microsoft Research Asia  3 Microsoft Research Silicon Valley

Abstract

Product quantization is an effective vector quantization approach to compactly encode high-dimensional vectors for fast approximate nearest neighbor (ANN) search. The essence of product quantization is to decompose the original high-dimensional space into the Cartesian product of a finite number of low-dimensional subspaces that are then quantized separately. Optimal space decomposition is important for the performance of ANN search, but still remains unaddressed. In this paper, we optimize product quantization by minimizing quantization distortions w.r.t. the space decomposition and the quantization codebooks. We present two novel methods for optimization: a non-parametric method that alternatively solves two smaller sub-problems, and a parametric method that is guaranteed to achieve the optimal solution if the input data follows some Gaussian distribution. We show by experiments that our optimized approach substantially improves the accuracy of product quantization for ANN search.

1. Introduction

Approximate nearest neighbor (ANN) search is of great importance for many computer vision problems, such as retrieval [17], classification [2], and recognition [18]. Recent years have witnessed increasing interest (e.g., [18, 20, 3, 10, 6]) in encoding high-dimensional data into distance-preserving compact codes.
With merely tens of bits per data item, compact encoding not only saves the cost of data storage and transmission, but, more importantly, it enables efficient nearest neighbor search on large-scale datasets, taking only a fraction of a second for each nearest neighbor query [18, 10].

Hashing [1, 18, 20, 19, 6, 8] has been a popular approach to compact encoding, where the similarity between two data points is approximated by the Hamming distance of their hashed codes. Recently, product quantization (PQ) [10] was applied to compact encoding, where a data point is vector-quantized to its nearest codeword in a predefined codebook, and the distance between two data points is approximated by the distance between their codewords. PQ achieves a large effective codebook size with the Cartesian product of a set of small sub-codebooks. It has been shown to be more accurate than various hashing-based methods (c.f. [10, 3]), largely due to its lower quantization distortions and more precise distance computation using a set of small lookup tables. Moreover, PQ is computationally efficient and thus attractive for large-scale applications: the Cartesian product enables precomputed distances between codewords to be stored in tables of feasible size, and a query is done merely by table lookups using codeword indices. It takes about 20 milliseconds to query against one million data points for the nearest neighbor by exhaustive search.

To keep the size of the distance lookup table feasible, PQ decomposes the original vector space into the Cartesian product of a finite number of low-dimensional subspaces. It has been noticed [10] that prior knowledge about the structure of the input data is of particular importance, and the accuracy of ANN search becomes substantially worse if such knowledge is ignored.

* This work was done when Tiezheng Ge was an intern at Microsoft Research Asia.
The method in [11] optimizes a Householder transform under the intuition that the data components should have balanced variances. It is also observed that a random rotation achieves similar performance [11]. But the optimality in terms of quantization error is unclear. Thus, optimal space decomposition for PQ remains largely an unaddressed problem.

In this paper, we formulate product quantization as an optimization problem that minimizes the quantization distortion by searching for the optimal codebooks and space decomposition. Such an optimization problem is challenging due to the large number of free parameters. We propose two solutions. In the first solution, we split the problem into two sub-problems, each having a simple solver. The space decomposition and the codebooks are then alternatively optimized, by solving for the space decomposition while fixing the codewords, and vice versa. Such a solution is non-parametric in that it does not assume any prior information about the data distribution. Our second solution is a parametric one in that it assumes the data follows a Gaussian distribution. Under this assumption, we show that the lower bound of the quantization distortion has an analytical formulation, which can be effectively optimized by a simple Eigenvalue Allocation method. Experiments show that our two solutions outperform the original PQ [10] and other alternatives such as transform coding [3] and iterative quantization [6], even when the prior knowledge about the structure of the input data is used by PQ [10].

Concurrent with our work, a very similar idea was independently developed by Norouzi and Fleet [14].

2013 IEEE Conference on Computer Vision and Pattern Recognition. DOI 10.1109/CVPR.2013.379

2.
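The non-parametric alternation described above can be illustrated with a toy numpy sketch. This is not the paper's implementation: the function names, toy scale, and initialization are ours, and we assume the space decomposition is parameterized by an orthogonal rotation whose sub-problem is solved by standard orthogonal Procrustes; whether this matches the paper's exact solver is an assumption here.

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Orthogonal R minimizing ||X R - Y||_F (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def alternating_opq(X, M=2, k=4, iters=10, seed=0):
    """Toy alternation: fix the rotation R and run one Lloyd pass of
    k-means per subspace; then fix the codebooks and update R.
    X is an n x D array; D must be divisible by M.  (Illustrative only.)"""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    d = D // M
    R = np.eye(D)
    # initialize each sub-codebook from random data samples
    codebooks = [X[rng.choice(n, k, replace=False), m * d:(m + 1) * d].copy()
                 for m in range(M)]
    for _ in range(iters):
        Xr = X @ R                      # rotated data (space decomposition)
        Y = np.empty_like(Xr)           # quantized version of Xr
        for m in range(M):
            sub = Xr[:, m * d:(m + 1) * d]
            C = codebooks[m]
            # assign to nearest sub-codeword, then recompute centers
            idx = ((sub[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
            for j in range(k):
                if (idx == j).any():
                    C[j] = sub[idx == j].mean(0)
            Y[:, m * d:(m + 1) * d] = C[idx]
        distortion = ((Xr - Y) ** 2).sum() / n
        R = procrustes_rotation(X, Y)   # re-optimize the decomposition
    return R, codebooks, distortion
```

Each half-step can only lower (or keep) the objective it sees, which is the intuition behind alternating the two smaller sub-problems.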
Quantization Distortion

In this section, we show that a variety of distance approximation methods, including k-means [13], product quantization [10], and orthogonal hashing [19, 6], can be formulated within the framework of vector quantization [7], where quantization distortion is used as the objective function. Quantization distortion is tightly related to the empirical ANN performance, and thus can be used to measure the "optimality" of a quantization algorithm for ANN search.

2.1. Vector Quantization

Vector quantization (VQ) [7] maps a vector x ∈ R^D to a codeword c in a codebook C = {c(i)} with i in a finite index set. The mapping, termed a quantizer, is denoted by x → c(i(x)). In information theory, the function i(·) is called an encoder, and the function c(·) is called a decoder [7]. The quantization distortion E is defined as:

    E = (1/n) Σ_x ‖x − c(i(x))‖²,    (1)

where ‖·‖ denotes the l2-norm, n is the total number of data samples, and the summation is over all the points in the given sample set. Given a codebook C, a quantizer that minimizes the distortion E must satisfy the first Lloyd condition [7]: the encoder i(x) should map any x to its nearest codeword in the codebook C. The distance between two vectors can be approximated by the distance between their codewords, which can be precomputed offline.

2.2. Codebook Generation

We show that a variety of methods minimize the distortion w.r.t. the codebook under different constraints.

K-means

If there is no constraint on the codebook, minimizing the distortion in Eqn. (1) leads to the classical k-means clustering algorithm [13]. With the encoder i(·) fixed, the codeword c of a given x is the center of the cluster that x belongs to; this is the second Lloyd condition [7].
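The two Lloyd conditions above can be sketched directly as code. This is a minimal numpy sketch (the function name and toy parameters are ours): Lloyd iterations that alternate the encoder and decoder updates, then report the distortion E of Eqn. (1).

```python
import numpy as np

def kmeans_distortion(X, k, iters=20, seed=0):
    """Lloyd's algorithm on an n x D array X; returns the codebook C and
    the distortion E of Eqn. (1): E = (1/n) * sum_x ||x - c(i(x))||^2."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        # encoder: map each x to its nearest codeword (first Lloyd condition)
        idx = ((X[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        # decoder: each codeword becomes its cluster mean (second Lloyd condition)
        for j in range(k):
            if (idx == j).any():
                C[j] = X[idx == j].mean(0)
    idx = ((X[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
    E = ((X - C[idx]) ** 2).sum() / len(X)
    return C, E
```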
Product Quantization [10]

If any codeword c must be taken from the Cartesian product of a finite number of sub-codebooks, minimizing the distortion in Eqn. (1) leads to the product quantization method [10].

Formally, denote any x ∈ R^D as the concatenation of M subvectors: x = [x_1, ..., x_m, ..., x_M]. For simplicity it is assumed [10] that the subvectors have a common number of dimensions D/M. The Cartesian product C = C_1 × ... × C_M is the set in which a codeword c ∈ C is formed by concatenating the M sub-codewords: c = [c_1, ..., c_m, ..., c_M], with each c_m ∈ C_m. We point out that the objective function for PQ, though not explicitly defined in [10], is essentially:

    min_{C_1,...,C_M} Σ_x ‖x − c(i(x))‖²,    (2)
    s.t.  c ∈ C = C_1 × ... × C_M.

It is easy to show that x's nearest codeword c in C is the concatenation of the M nearest sub-codewords c = [c_1, ..., c_m, ..., c_M], where c_m is the nearest sub-codeword of the subvector x_m. So Eqn. (2) can be split into M separate subproblems, each of which can be solved by k-means in its corresponding subspace. This is the PQ algorithm.

The benefit of PQ is that it can easily generate a codebook C with a large number of codewords. If each sub-codebook has k sub-codewords, then their Cartesian product C has k^M codewords. This is not possible for classical k-means when k^M is large. PQ also enables fast distance computation: the distances between any two sub-codewords in a subspace are precomputed and stored in a k-by-k lookup table, and the distance between two codewords in C is simply the sum of the distances computed from the M subspaces.

Iterative Quantization [6]

If any codeword c must be taken from "the vertices of a rotating hyper-cube," minimizing the distortion leads to a hashing method called Iterative Quantization (ITQ) [6]. The D-dimensional vectors in {−a, a}^D are the vertices of an axis-aligned D-dimensional hyper-cube. Suppose the data has been zero-centered.
The objective function of ITQ [6] is essentially:

    min_{R,a} Σ_x ‖x − c(i(x))‖²,    (3)
    s.t.  c ∈ C = {c | Rc ∈ {−a, a}^D},  RᵀR = I,

where R is an orthogonal matrix and I is the identity matrix.

The benefit of using a rotating hyper-cube as the codebook is that the squared Euclidean distance between any two codewords is equivalent to the Hamming distance between their indices. So ITQ is in the category of binary hashing methods [1, 20, 19]. Eqn. (3) also indicates that any orthogonal hashing method is equivalent to a vector quantizer. The length a in (3) does not impact the resulting hashing functions, as noticed in [6], but it matters when we compare the distortion with other quantization methods.

2.3. Distortion as the Objective Function

The above methods all optimize the same form of quantization distortion, but subject to different constraints. This
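The PQ construction of Section 2.2 — per-subspace k-means for Eqn. (2), plus the k-by-k lookup tables for distance computation — can be sketched as follows. This is a minimal numpy sketch under our own naming, not the code of [10]:

```python
import numpy as np

def _kmeans(X, k, iters=15, seed=0):
    """Plain Lloyd iterations; returns a k x d sub-codebook."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        idx = ((X[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (idx == j).any():
                C[j] = X[idx == j].mean(0)
    return C

def pq_train(X, M, k):
    """Eqn. (2) splits into M subproblems: one k-means per subspace."""
    d = X.shape[1] // M
    return [_kmeans(X[:, m * d:(m + 1) * d], k) for m in range(M)]

def pq_encode(X, codebooks):
    """The code of x is the index of its nearest sub-codeword per subspace."""
    d = X.shape[1] // len(codebooks)
    return np.stack(
        [((X[:, m * d:(m + 1) * d][:, None, :] - C[None]) ** 2)
         .sum(-1).argmin(1) for m, C in enumerate(codebooks)], axis=1)

def pq_tables(codebooks):
    """One k-by-k table of squared sub-codeword distances per subspace."""
    return [((C[:, None, :] - C[None]) ** 2).sum(-1) for C in codebooks]

def pq_distance(code_a, code_b, tables):
    """Squared distance between two codewords: a sum of M table lookups."""
    return sum(T[i, j] for T, i, j in zip(tables, code_a, code_b))
```

Because codewords are concatenations of sub-codewords, the sum of per-subspace squared distances equals the squared distance between the full codewords, which is why the M small tables suffice.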

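The hyper-cube codebook of Eqn. (3) also admits a closed-form quantizer: the vertex of {−a, a}^D nearest to the rotated point Rx is a·sign(Rx), and the squared distance between two codewords is 4a² times the Hamming distance between their sign vectors. A small numpy sketch under our own naming (sign ties at exactly zero are ignored for simplicity):

```python
import numpy as np

def itq_quantize(X, R, a):
    """Map each row x to its nearest codeword c with R c in {-a, +a}^D.
    In row convention: vertex v = a * sign(x R^T), codeword c = v R."""
    V = a * np.sign(X @ R.T)   # cube vertices; components assumed nonzero
    return V @ R, V            # codewords and their vertices (the bits)

def hamming(v1, v2):
    """Hamming distance between the sign patterns of two vertices."""
    return int((np.sign(v1) != np.sign(v2)).sum())
```

Since R is orthogonal, ‖c1 − c2‖² = ‖v1 − v2‖², and each differing bit contributes (2a)²; this is the sense in which squared Euclidean distance between codewords is equivalent to Hamming distance between indices.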