# Locally Optimized Product Quantization for Approximate Nearest Neighbor Search

Yannis Kalantidis and Yannis Avrithis
National Technical University of Athens
{ykalant, iavr}@image.ntua.gr

## Abstract

We present a simple vector quantizer that combines low distortion with fast search and apply it to approximate nearest neighbor (ANN) search in high-dimensional spaces. Leveraging the very same data structure that is used to provide non-exhaustive search, i.e., inverted lists or a multi-index, the idea is to locally optimize an individual product quantizer (PQ) per cell and use it to encode residuals. Local optimization is over rotation and space decomposition; interestingly, we apply a parametric solution that assumes a normal distribution and is extremely fast to train. With a reasonable space and time overhead that is constant in the data size, we set a new state of the art on several public datasets, including a billion-scale one.

## 1. Introduction

Approximate nearest neighbor (ANN) search in high-dimensional spaces is not only a recurring problem in computer vision, but also one undergoing significant progress. A large body of methods maintain all data points in memory and rely on efficient data structures to compute only a limited number of exact distances, ideally a fixed number [14]. At the other extreme, mapping data points to compact binary codes is not only efficient in space but may achieve fast exhaustive search in Hamming space [10, 16].

Product quantization (PQ) [12] is an alternative compact encoding method that is discrete but not binary and can be used for exhaustive or non-exhaustive search through inverted indexing or multi-indexing [3]. As is true for most hashing methods [11], better fitting to the underlying distribution is critical to search performance, and one such approach for PQ is optimized product quantization (OPQ) [9] and its equivalent, Cartesian k-means [15]. How are such training methods beneficial?
Different criteria are applicable, but the underlying principle is that all bits allocated to data points should be used sparingly. Since search can be made fast, such methods should ultimately be seen as (lossy) data compression targeting minimal distortion, with extreme examples being [1, 5].

*Figure 1. Four quantizers of 64 centroids each, trained on a random set of 2D points following a mixture distribution: (a) k-means, (b) PQ, (c) OPQ, (d) LOPQ. (c) and (d) also reorder dimensions, which is not shown in 2D.*

As such, k-means, depicted in Fig. 1(a), is a vector quantization method where, by specifying $k$ centroids, $\log_2 k$ bits can represent an arbitrary data point in $\mathbb{R}^d$ for any dimension $d$; but naïve search is $O(dk)$, and low distortion means very large $k$. By constraining centroids to an axis-aligned, $m$-dimensional grid, PQ achieves $k^m$ centroids while keeping search at $O(dk)$; but as illustrated in Fig. 1(b), many of these centroids remain without data support, e.g., if the distributions on the $m$ subspaces are not independent.

OPQ allows the grid to undergo arbitrary rotation and reordering of dimensions to better align to the data and balance its variance across subspaces, matching a bit allocation that is also balanced. But as shown in Fig. 1(c), a strongly multimodal distribution may not benefit from such alignment.

Our solution in this work is locally optimized product quantization (LOPQ). Following a quite common search option of [12], a coarse quantizer is used to index data by inverted lists, and residuals between data points and centroids are PQ-encoded. But within-cell distributions are largely unimodal; hence, as in Fig. 1(d), we locally optimize an individual product quantizer per cell. Under no assumptions on the distribution, practically all centroids are supported by data, contributing to a lower distortion.

*Published in the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). DOI 10.1109/CVPR.2014.298.*
LOPQ requires reasonable space and time overhead compared to PQ, both for offline training/indexing and for online queries; but all overhead is constant in the data size. It is embarrassingly simple to apply and boosts performance on several public datasets. A multi-index is essential for large-scale datasets, and combining it with LOPQ is less trivial, but we provide a scalable solution nevertheless.

## 2. Related work and contribution

Focusing on large datasets where index space is the bottleneck, we exclude, e.g., tree-based methods like [14] that require all data uncompressed in memory. Binary encoding is the most compact representation, approaching ANN search via exhaustive search in Hamming space. Methods like spectral hashing [18], ITQ [10] or k-means hashing [11] focus on learning codes optimized for the underlying data distribution. Search in Hamming space is very fast but, despite learning, performance suffers.

Significant benefit is to be gained via multiple quantizers or hash tables as in LSH [6], at the cost of storing each point index multiple times. For instance, [17, 19] gain performance with multiple k-means quantizers via random re-initialization or partitioning of jointly trained centroids. Similarly, multi-index hashing [16] gains speed via multiple hash tables on binary code substrings. We still outperform all such approaches at only a fraction of the index space.

PQ [12] provides efficient vector quantization with less distortion than binary encoding. Transform coding [4] is a special case of scalar quantization that additionally allocates bits according to the variance per dimension. OPQ [9] and Ck-means [15] generalize PQ by jointly optimizing rotation, subspace decomposition and sub-quantizers. Interestingly, the parametric solution of OPQ aims at the exact opposite of [4]: balancing variance given a uniform bit allocation over subspaces.
Although [12] provides the non-exhaustive variant IVFADC based on a coarse quantizer and PQ-encoded residuals, [9, 15] are exhaustive. The inverted multi-index [3] achieves very fine space partitioning via one quantizer per subspace and is compatible with PQ encoding, gaining performance at query times comparable to Hamming-space search. On the other hand, the idea of space decomposition can be applied recursively to provide extremely fast codebook training and vector quantization [2]. The recent extension of OPQ [8] combines optimization with a multi-index and is the current state of the art on a billion-scale dataset, but all optimizations are still global.

We observe that OPQ performs significantly better when the underlying distribution is unimodal, while residuals are much more unimodal than the original data. Hence we independently optimize per cell to distribute centroids mostly over the underlying data, despite the constraints of a product quantizer. In particular, we make the following contributions:

1. Partitioning data in cells, we locally optimize one product quantizer per cell on the residual distribution.
2. We show that training is practical, since local distributions are easier to optimize via a simple OPQ variant.
3. We provide solutions for either a single or a multi-index, fitting naturally to existing search frameworks for state-of-the-art performance with little overhead.

A recent related work is [7], applying a local PCA rotation per centroid prior to VLAD aggregation. However, both our experiments and [9, 8] show that PCA without subspace allocation actually damages ANN performance.

## 3. Background

**Vector quantization.** A *quantizer* is a function $q$ that maps a $d$-dimensional vector $x \in \mathbb{R}^d$ to vector $q(x) \in \mathcal{C}$, where $\mathcal{C}$ is a finite subset of $\mathbb{R}^d$, of cardinality $k$. Each vector $c \in \mathcal{C}$ is called a *centroid*, and $\mathcal{C}$ a *codebook*. Given a finite set $\mathcal{X}$ of data points in $\mathbb{R}^d$, $q$ induces distortion

$$E = \sum_{x \in \mathcal{X}} \|x - q(x)\|^2. \tag{1}$$
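As a concrete illustration (not from the paper), the distortion of a nearest-centroid quantizer can be computed directly from its definition; the function below is a minimal NumPy sketch where `X` and `C` are hypothetical toy data.

```python
import numpy as np

def distortion(X, C):
    """Distortion E of a nearest-centroid quantizer: the sum of squared
    distances from each point in X to its nearest centroid in C (Eq. 1)."""
    # pairwise squared distances, shape (n, k)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    # each point contributes the distance to its *nearest* centroid
    return d2.min(axis=1).sum()

# toy example: 100 random 2D points, 8 random centroids
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
C = rng.normal(size=(8, 2))
E = distortion(X, C)
```

Running k-means on `X` would then amount to choosing `C` so that this quantity is (locally) minimized.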
According to Lloyd's first condition, regardless of the chosen codebook, a quantizer that minimizes distortion should map vector $x$ to its nearest centroid:

$$x \mapsto q(x) = \arg\min_{c \in \mathcal{C}} \|x - c\|, \tag{2}$$

for $x \in \mathbb{R}^d$. Hence, an optimal quantizer should minimize distortion $E$ as a function of codebook $\mathcal{C}$ alone.

**Product quantization.** Assuming that dimension $d$ is a multiple of $m$, write any vector $x \in \mathbb{R}^d$ as a concatenation $(x^1, \dots, x^m)$ of $m$ sub-vectors, each of dimension $d/m$. If $\mathcal{C}^1, \dots, \mathcal{C}^m$ are $m$ sub-codebooks in subspace $\mathbb{R}^{d/m}$, each of $k$ sub-centroids, a *product quantizer* [12] constrains $\mathcal{C}$ to the Cartesian product

$$\mathcal{C} = \mathcal{C}^1 \times \dots \times \mathcal{C}^m, \tag{3}$$

i.e., a codebook of $k^m$ centroids of the form $c = (c^1, \dots, c^m)$, with each sub-centroid $c^j \in \mathcal{C}^j$ for $j \in \mathcal{M} = \{1, \dots, m\}$. An optimal product quantizer $q$ should minimize distortion $E$ (1) as a function of $\mathcal{C}$, subject to $\mathcal{C}$ being of the form (3) [9]. In this case, for each $x \in \mathbb{R}^d$, the nearest centroid in $\mathcal{C}$ is

$$q(x) = (q^1(x^1), \dots, q^m(x^m)), \tag{4}$$

where $q^j(x^j)$ is the nearest sub-centroid of sub-vector $x^j$ in $\mathcal{C}^j$, for $j \in \mathcal{M}$ [9]. Hence an optimal product quantizer $q$ in $d$ dimensions incurs $m$ subproblems of finding $m$ optimal sub-quantizers $q^j$, $j \in \mathcal{M}$, each in $d/m$ dimensions. We write $q = (q^1, \dots, q^m)$ in this case.

**Optimized product quantization** [9, 15] refers to optimizing the subspace decomposition apart from the centroids. Constraint (3) on the codebook is now relaxed to

$$\mathcal{C} = \{R\hat{c} : \hat{c} \in \mathcal{C}^1 \times \dots \times \mathcal{C}^m,\ R^\top R = I\}, \tag{5}$$

where the orthogonal $d \times d$ matrix $R$ allows for arbitrary rotation and permutation of vector components. Hence $E$ should be minimized as a function of $\mathcal{C}$, subject to $\mathcal{C}$ being of the form (5). Optimization with respect to $R$ and $\mathcal{C}^1, \dots, \mathcal{C}^m$ can be either joint, as in Ck-means [15] and in the non-parametric solution OPQ$_{\text{NP}}$ of [9], or decoupled, as in the parametric solution OPQ$_{\text{P}}$ of [9].

**Exhaustive search.**
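The decomposition into $m$ independent sub-quantizers can be sketched in a few lines. The following is an illustrative NumPy implementation (not the paper's code): sub-codebooks are trained by plain Lloyd's k-means per subspace, and a point is encoded as the tuple of nearest sub-centroid indices of Eq. (4). All names and parameter values are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns centroids of shape (k, d)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        a = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)  # assignments
        for j in range(k):
            if np.any(a == j):
                C[j] = X[a == j].mean(0)                     # update step
    return C

def pq_train(X, m, k):
    """Train m sub-codebooks, one per d/m-dimensional subspace (Eq. 3).
    Assumes the dimension of X is a multiple of m."""
    return [kmeans(S, k) for S in np.split(X, m, axis=1)]

def pq_encode(x, codebooks):
    """Encode x as a tuple of m sub-centroid indices (Eq. 4)."""
    parts = np.split(x, len(codebooks))
    return tuple(((C - p) ** 2).sum(1).argmin() for p, C in zip(parts, codebooks))

# toy usage: d=8, m=4 subspaces, k=4 sub-centroids each
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
books = pq_train(X, m=4, k=4)
code = pq_encode(X[0], books)
```

With $m = 4$ and $k = 4$ the code costs $m \log_2 k = 8$ bits per point, while the implicit codebook has $k^m = 256$ centroids.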
Given a product quantizer $q = (q^1, \dots, q^m)$, assume that each data point $x \in \mathcal{X}$ is represented by $q(x)$ and encoded as a tuple $(i^1, \dots, i^m)$ of $m$ sub-centroid indices (4), each in the index set $\mathcal{K} = \{1, \dots, k\}$. This PQ encoding requires $m \log_2 k$ bits per point. Given a new query vector $y$, the (squared) Euclidean distance to every point $x \in \mathcal{X}$ may be approximated by

$$\delta_q(y, x) = \|y - q(x)\|^2 = \sum_{j=1}^{m} \|y^j - q^j(x^j)\|^2, \tag{6}$$

where $q^j(x^j) \in \mathcal{C}^j = \{c^j_1, \dots, c^j_k\}$ for $j \in \mathcal{M}$. Distances $\|y^j - c^j_i\|^2$ are precomputed for $i \in \mathcal{K}$ and $j \in \mathcal{M}$, so (6) amounts to only $O(m)$ lookup and add operations. This is the asymmetric distance computation (ADC) of [12].

**Indexing.** When quantizing point $x \in \mathbb{R}^d$ by quantizer $q$, its *residual* vector is defined as

$$r_q(x) = x - q(x). \tag{7}$$

Non-exhaustive search involves a coarse quantizer $Q$ of $K$ centroids, or *cells*. Each point $x \in \mathcal{X}$ is quantized to $Q(x)$, and its residual vector $r_Q(x)$ is quantized by a product quantizer $q$. For each cell, an inverted list of data points is maintained, along with PQ-encoded residuals. A query point $y$ is first quantized to its $w$ nearest cells, and approximate distances between residuals are then found according to (6) only within the corresponding $w$ inverted lists. This is referred to as IVFADC search in [12].

**Re-ranking.** Second-order residuals may be employed along with ADC or IVFADC, again PQ-encoded by $m'$ sub-quantizers. However, this requires full vector reconstruction, so it is only used for re-ranking [13].

**Multi-indexing** applies the idea of PQ to the coarse quantizer used for indexing. A second-order inverted multi-index [3] comprises two subspace quantizers $Q^1, Q^2$ over $\mathbb{R}^{d/2}$, each of $K$ sub-centroids. A cell is now a pair of sub-centroids. There are $K^2$ cells, which can be structured on a 2-dimensional grid, inducing a fine partition over $\mathbb{R}^d$.
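The ADC of Eq. (6) reduces each approximate distance to $m$ table lookups. A minimal sketch (again illustrative, with hypothetical toy data): per query, one table of squared distances per subspace is precomputed once, and every encoded database point then costs only $m$ lookups and adds.

```python
import numpy as np

def adc_tables(y, codebooks):
    """Precompute ||y^j - c^j_i||^2 for every sub-centroid i and
    subspace j: the lookup tables behind Eq. (6)."""
    parts = np.split(y, len(codebooks))
    return [((C - p) ** 2).sum(axis=1) for p, C in zip(parts, codebooks)]

def adc_distance(tables, code):
    """Approximate squared distance from the query to a PQ-encoded
    point: m lookups and adds, no decompression needed."""
    return sum(t[i] for t, i in zip(tables, code))

# toy setup: d=8, m=4 subspaces, k=16 sub-centroids each
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 2)) for _ in range(4)]
y = rng.normal(size=8)
code = (3, 0, 7, 12)            # a PQ code (i^1, ..., i^m) for some point
tables = adc_tables(y, codebooks)
d2 = adc_distance(tables, code)
```

The distance is "asymmetric" because only the database point is quantized; the query $y$ is used exactly.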
For each point $x = (x^1, x^2) \in \mathcal{X}$, sub-vectors $x^1, x^2 \in \mathbb{R}^{d/2}$ are separately (and exhaustively) quantized to $Q^1(x^1), Q^2(x^2)$, respectively. For each cell, an inverted list of data points is again maintained.

Given a query vector $y = (y^1, y^2)$, the (squared) Euclidean distances of each of the sub-vectors $y^1, y^2$ to all sub-centroids of $Q^1, Q^2$, respectively, are found first. The distance of $y$ to a cell may then be found by a lookup-add operation, similarly to (6) for $m = 2$. Cells are traversed in increasing order of distance to $y$ by the multi-sequence algorithm [3] (a form of distance propagation on the grid) until a target number $T$ of points is collected. Different options exist for encoding residuals and re-ranking.

## 4. Locally optimized product quantization

We investigate two solutions: ordinary inverted lists, and a second-order multi-index. Section 4.1 discusses LOPQ in the former case, which simply allocates data to cells and locally optimizes a product quantizer per cell to encode residuals. Optimization per cell is discussed in Section 4.2, mostly following [9, 15]; the same process is used in Section 4.4, which discusses LOPQ in the multi-index case.

### 4.1. Searching on a single index

Given a set $\mathcal{X} = \{x_1, \dots, x_n\}$ of $n$ data points in $\mathbb{R}^d$, we optimize a coarse quantizer $Q$, with associated codebook $\mathcal{E} = \{e_1, \dots, e_K\}$ of $K$ centroids, or cells. For $i \in \mathcal{K} = \{1, \dots, K\}$, we construct an inverted list $\mathcal{L}_i$ containing the indices of points quantized to cell $e_i$,

$$\mathcal{L}_i = \{j \in \mathcal{N} : Q(x_j) = e_i\}, \tag{8}$$

where $\mathcal{N} = \{1, \dots, n\}$, and collect their residuals in

$$\mathcal{Z}_i = \{x - e_i : x \in \mathcal{X},\ Q(x) = e_i\}. \tag{9}$$

For each cell $i \in \mathcal{K}$, we locally optimize the PQ encoding of the residuals in set $\mathcal{Z}_i$, as discussed in Section 4.2, yielding an orthogonal matrix $R_i$ and a product quantizer $q_i$. Residuals are then locally rotated by $\hat{z} \leftarrow R_i^\top z$ for $z \in \mathcal{Z}_i$ and PQ-encoded as $q_i(\hat{z}) = q_i(R_i^\top z)$.

At query time, the query point $y$ is soft-assigned to its $w$ nearest cells $\mathcal{A}$ in $\mathcal{E}$.
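The multi-sequence traversal mentioned above can be sketched with a priority queue over the $K \times K$ grid: because each axis is sorted by sub-distance, pushing only the right and down neighbors of each popped cell visits cells in nondecreasing total distance. This is an illustrative variant (using a visited set rather than the exact frontier rule of [3]); `counts` is a hypothetical array of inverted-list lengths.

```python
import heapq
import numpy as np

def multi_sequence(d1, d2, T, counts):
    """Traverse cells (i, j) of a second-order multi-index in increasing
    distance d1[i] + d2[j], stopping once T points have been collected.
    counts[i, j] is the length of the inverted list of cell (i, j)."""
    o1, o2 = np.argsort(d1), np.argsort(d2)   # sub-centroids sorted by distance
    heap = [(d1[o1[0]] + d2[o2[0]], 0, 0)]    # start at the nearest cell
    seen = {(0, 0)}
    visited, total = [], 0
    while heap and total < T:
        _, a, b = heapq.heappop(heap)
        i, j = o1[a], o2[b]
        visited.append((i, j))
        total += counts[i, j]
        # propagate to the two grid neighbors with larger sub-distances
        for na, nb in ((a + 1, b), (a, b + 1)):
            if na < len(o1) and nb < len(o2) and (na, nb) not in seen:
                seen.add((na, nb))
                heapq.heappush(heap, (d1[o1[na]] + d2[o2[nb]], na, nb))
    return visited

# toy usage: K=8 sub-centroids per subspace, one point per cell
rng = np.random.default_rng(0)
d1, d2 = rng.random(8), rng.random(8)
counts = np.ones((8, 8), dtype=int)
cells = multi_sequence(d1, d2, T=20, counts=counts)
```

Only a small fraction of the $K^2$ cells is ever touched when $T$ is small, which is what makes the very fine partition affordable at query time.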
For each cell $e_i \in \mathcal{A}$, the residual $y_i = y - e_i$ is individually rotated by $\hat{y}_i \leftarrow R_i^\top y_i$. Asymmetric distances $\delta_{q_i}(\hat{y}_i, \hat{z}_p)$ to residuals $\hat{z}_p$ for $p \in \mathcal{L}_i$ are then computed according to (6), using the underlying local product quantizer $q_i$. The computation is exhaustive within list $\mathcal{L}_i$, but is performed in the compressed domain.

**Analysis.** To illustrate the individual gain from the two optimized quantities, we investigate optimizing rotation alone with fixed sub-quantizers, as well as both rotation and sub-quantizers, referred to as LOR+PQ and LOPQ, respectively. In the latter case, there is an $O(K(d^2 + dk))$ space overhead compared, e.g., to IVFADC [12]. Similarly, local rotation of the query residual imposes an $O(wd^2)$ time overhead.

### 4.2. Local optimization

Let $\mathcal{Z} \in \{\mathcal{Z}_1, \dots, \mathcal{Z}_K\}$ be the set of residuals of the data points quantized to some cell in $\mathcal{E}$. Contrary to [12], we PQ-encode these residuals by locally optimizing both the space decomposition and the sub-quantizers per cell. Given $m$ and $k$ as parameters, this problem is expressed as minimizing distortion as a function of an orthogonal matrix $R \in \mathbb{R}^{d \times d}$ and sub-codebooks $\mathcal{C}^1, \dots, \mathcal{C}^m \subset \mathbb{R}^{d/m}$ per cell:

$$\begin{aligned}
\text{minimize} \quad & \sum_{z \in \mathcal{Z}} \min_{\hat{c} \in \hat{\mathcal{C}}} \|z - R\hat{c}\|^2 \\
\text{subject to} \quad & \hat{\mathcal{C}} = \mathcal{C}^1 \times \dots \times \mathcal{C}^m \\
& R^\top R = I,
\end{aligned} \tag{10}$$

where $|\mathcal{C}^j| = k$ for $j \in \mathcal{M} = \{1, \dots, m\}$. Given the solution $R, \mathcal{C}^1, \dots, \mathcal{C}^m$, codebook $\mathcal{C}$ is found by (5). For $j \in \mathcal{M}$, sub-codebook $\mathcal{C}^j$ determines a sub-quantizer $q^j$ by

$$x \mapsto q^j(x) = \arg\min_{\hat{c}^j \in \mathcal{C}^j} \|x - \hat{c}^j\|.$$
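A problem of the form (10) can be attacked by alternating minimization in the style of the non-parametric OPQ of [9] / Ck-means [15]: fix $R$ and update the sub-codebooks by k-means on the rotated residuals, then fix the codebooks and solve for $R$ as an orthogonal Procrustes problem via SVD. The sketch below is an illustrative variant of that scheme, not the paper's fast parametric solution; all names and parameters are hypothetical.

```python
import numpy as np

def local_opq(Z, m, k, iters=10, seed=0):
    """Alternating optimization of Eq. (10) on one cell's residuals Z (n x d):
    (a) one Lloyd update of each of the m sub-codebooks on the rotated
    residuals, then (b) a Procrustes update of the orthogonal matrix R."""
    n, d = Z.shape
    ds = d // m                              # subspace dimension d/m
    rng = np.random.default_rng(seed)
    R = np.eye(d)
    books = [Z[rng.choice(n, k, replace=False), j * ds:(j + 1) * ds].copy()
             for j in range(m)]
    for _ in range(iters):
        Zr = Z @ R                           # row-vector form of z_hat = R^T z
        recon = np.empty_like(Zr)
        for j in range(m):
            S, C = Zr[:, j * ds:(j + 1) * ds], books[j]
            a = ((S[:, None] - C[None]) ** 2).sum(-1).argmin(1)
            for c in range(k):               # one k-means update step
                if np.any(a == c):
                    C[c] = S[a == c].mean(0)
            recon[:, j * ds:(j + 1) * ds] = C[a]
        # Procrustes: the orthogonal R minimizing ||Z R - recon||_F
        # is U V^T, where U S V^T = svd(Z^T recon)
        U, _, Vt = np.linalg.svd(Z.T @ recon)
        R = U @ Vt
    return R, books

# toy usage: 300 residuals in 8 dimensions, m=2 subspaces, k=8 sub-centroids
rng = np.random.default_rng(2)
Z = rng.normal(size=(300, 8))
R, books = local_opq(Z, m=2, k=8)
```

In LOPQ this optimization runs once per cell, on the cell's residuals only, which is what keeps the per-cell distributions close to unimodal and the training cheap.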