Transform Coding for Fast Approximate Nearest Neighbor Search in High Dimensions

Jonathan Brandt
Adobe Systems, Inc.

[email protected]

Abstract

We examine the problem of large scale nearest neighbor search in high dimensional spaces and propose a new approach based on the close relationship between nearest neighbor search and that of signal representation and quantization. Our contribution is a very simple and efficient quantization technique using transform coding and product quantization. We demonstrate its effectiveness in several settings, including large-scale retrieval, nearest neighbor classification, feature matching, and similarity search based on the bag-of-words representation. Through experiments on standard data sets we show it is competitive with state-of-the-art methods, with greater speed, simplicity, and generality. The resulting compact representation can be the basis for more elaborate hierarchical search structures for sub-linear approximate search. However, we demonstrate that optimized linear search using the quantized representation is extremely fast and trivially parallelizable on modern computer architectures, with further acceleration possible by way of GPU implementation.

1. Introduction

Finding nearby points among a large set in high dimensions is at the heart of many important computer vision problems. Examples include nearest neighbor classification, image similarity search, and feature matching. The long-standing difficulty in finding an exact nearest neighbor in high dimensions has led to the use of approximate algorithms, as well as various domain-specific approaches. Recently, image and video retrieval has been the subject of intense study with numerous practical applications. For retrieval, the number of points to search is usually much larger than can be held in system memory, which has led to the development of compressed representations, typically designed specifically for a given task.
In this paper, we propose a very general quantization-based framework for high dimensional nearest neighbor search, founded on well-established results in signal representation and compression, and demonstrate its effectiveness in a diverse range of computer vision tasks. The approach is inherently lossy, and is therefore an approximate approach to nearest neighbor search. However, the representation degrades gracefully with increased compression, so it is straightforward to select the appropriate tradeoff between compression and accuracy for the task at hand.

2. Related Work

Despite prolonged study, the problem of efficiently finding nearby points in high dimensions remains open [4]. This difficulty has led to the development of approximate solutions, such as ANN [1]. However, beyond a few dozen dimensions, such methods tend to break down. In the past decade, locality-sensitive hashing (LSH) has emerged as an alternative to ANN that has been shown to be effective in higher dimensions. (See [13] for a survey of LSH.) However, due to its randomized nature, LSH does not generally produce compact codes. Code compactness is important in the context of large-scale retrieval, as memory size is often the primary determinant of system performance. To address this problem, Torralba et al. [14] proposed learning compact codes, both by boosting and through restricted Boltzmann machines, and were able to demonstrate significant improvement over LSH. Spectral hashing [16] improved on [14] by posing the binary hash construction problem as one of graph labeling, and employed spectral relaxation as a tractable approximation to an otherwise NP-hard problem. Spectral hashing achieves code efficiency comparable to the aforementioned learning-based approaches with much greater simplicity, and is a significant improvement over LSH when applied to the GIST descriptor under the Euclidean norm.
Although hashing techniques are in general useful for near-duplicate search, they are less effective at range searches, where it is necessary to explore a potentially large neighborhood of a point, and distance estimation in the original space is typically not possible using the hash representation. Following earlier work [5, 6], Jégou et al. [7] approach the Euclidean nearest neighbor problem as one of efficient quantization. They argue that in the context of retrieval, what is needed is to accurately estimate inter-point distances while representing the points with a limited number of bits. That is, we not only want a compact representation, but we additionally want to be able to estimate distances using the compact representation. Their approach is to partition the input vector into a prescribed number of equal-sized sub-vectors that are each quantized independently, using k-means and a constant number of bits per sub-vector, and then form the Cartesian product of the independently quantized components. Inter-point distances are estimated from the quantized values. Among other things, they observe that it is possible to learn the sub-vector quantizers using a relatively small training set, even though there may be a large number of bits in the complete code. This is a crucial point given that it is often impossible to collect and analyze a sufficient number of training samples to span the quantized space, which often comprises hundreds of bits. Similar to the product quantization approach of [7], Tuytelaars et al. [15] proposed a lattice quantizer for SIFT descriptors, allocating a constant number of bits per descriptor component, and showed its effectiveness using uniform quantization and 2 bits per component. Winder et al. [17] explored several compression techniques for feature matching as part of a systematic exploration of the design space for local feature descriptors.
They found that error rates were reduced through the use of PCA, dimensionality reduction, and uniform quantization using a constant number of bits per dimension. In this paper, we propose a new quantization-based approach to nearest neighbor search inspired by well-established results in signal representation and compression. The new contribution, distinct from previous approaches, is that it combines transform coding, data-driven allocation of bits to components, and non-uniform minimum distortion quantization, to obtain a compact representation suitable for nearest neighbor search. Our approach is also distinct in that it is very general: it makes few assumptions about the data, applies to a broad range of metrics, and requires no parameters other than the number of bits. Experiments on real-world data, using a variety of metrics, validate the effectiveness of the approach.

3. The Retrieval Problem

Consider a finite set of points X ⊂ R^n, |X| = N, drawn from the probability distribution p(x) defined over R^n. Point proximity is determined by a metric d(x, x′). The two fundamental proximity queries are Radial, namely to return the set of points within a given radius of the query, and Nearest-k. Hash-based retrieval consists of a quantization step followed by lookup based on the quantized representation. That is, the quantization step identifies the unique partition containing x within a finite partitioning of R^n. The index lookup returns all x′ ∈ X contained within the given partition. Hash-based lookup is effective for near-duplicate search, where we can rely on a hash collision even when the representation consists of a large number of bits. However, for proximity searches, such as Radial and Nearest-k, hash lookup becomes ineffective due to the sparseness of the code space. What is needed for Radial and Nearest-k searches is an efficient quantization scheme that preserves distances.
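To make the hash-based lookup scheme concrete, the sketch below (all names are our own; a toy rounding quantizer stands in for a real hash function) indexes points by their quantized code, so that only points falling in the query's exact cell are retrieved — which is precisely why such lookup fails for Radial and Nearest-k searches:

```python
import numpy as np

def build_hash_index(points, quantize):
    """Index each point's position under its quantized code (illustrative helper)."""
    index = {}
    for i, x in enumerate(points):
        index.setdefault(quantize(x), []).append(i)
    return index

# Toy quantizer: round each coordinate to the nearest integer.
quantize = lambda x: tuple(np.round(x).astype(int))

points = np.array([[0.1, 0.2], [0.9, 1.1], [0.2, 0.1]])
index = build_hash_index(points, quantize)

# Near-duplicate lookup: only points in the query's own cell are returned;
# near neighbors in adjacent cells are invisible to the lookup.
query = np.array([0.15, 0.15])
print(index.get(quantize(query), []))  # -> [0, 2]
```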
In other words, we wish to design a quantizer such that the induced partitioning of the input space is optimal for proximity search.

3.1. Minimum Distortion Quantization

Quantization in general has been the subject of prolonged study and has a rich and extensive literature [3]. Any quantizer, q : R^n → Z, where Z = {0, 1, ..., m−1}, is characterized by the partition it induces on the input space,

Q(z) = {x : q(x) = z},  (1)

for z ∈ Z, and the codebook values associated with each z, c(z) ∈ R^n. The quality of a given quantizer is measured in terms of its average distortion,

D = E[d(x, c(q(x)))],  (2)

where the distortion function d can take on a variety of forms. For retrieval, the appropriate distortion function to be minimized is simply the metric d(x, x′). In fact, application of the triangle inequality yields

E[|d(x, x′) − d(x, c(q(x′)))|] ≤ D.  (3)

Therefore, D is an upper bound on the expected error in estimating inter-point distances when one of the two points is approximated by its quantized codebook value. Consequently, a quantizer that minimizes D for a fixed m is expected to be effective from the standpoint of retrieval. A quantizer that minimizes D subject to the underlying distribution p(x) is characterized by the following properties:

1. Q(z) = {x : d(x, c(z)) ≤ d(x, c(z′)), ∀ z′ ∈ Z}, and
2. c(z) = argmin_x′ E_x[d(x, x′) | x ∈ Q(z)].

That is, the quantization cells consist of points no further from the cell's code vector than from any other code vector, and the code vector for a given cell is its centroid. This classic result, first developed by Lloyd [10] and independently by Max [12] for 1D signals, and later extended to vector quantization by Linde et al. [9], is the basis for the k-means algorithm.

3.2. Product Quantization

In 1D, design of a quantizer to satisfy the minimum distortion criterion is well understood [3]. But in higher dimensions difficulties arise. For instance, it can be challenging to obtain a sufficient sampling to characterize p(x).
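The well-understood 1D design alternates the two optimality conditions of Section 3.1. A minimal sketch under squared-error distortion (the function name, quantile initialization, and iteration count are our own choices, not the paper's implementation):

```python
import numpy as np

def lloyd_max_1d(samples, m, iters=50):
    """1D minimum-distortion quantizer under squared error (Lloyd/Max iteration).

    Alternates the two optimality conditions: (1) assign each sample to its
    nearest codebook value; (2) move each codebook value to the centroid
    (mean) of its cell.
    """
    # Initialize the m codebook values at evenly spaced data quantiles.
    c = np.quantile(samples, (np.arange(m) + 0.5) / m)
    for _ in range(iters):
        z = np.argmin(np.abs(samples[:, None] - c[None, :]), axis=1)  # condition 1
        for k in range(m):
            if np.any(z == k):               # leave empty cells' centroids fixed
                c[k] = samples[z == k].mean()                         # condition 2
    z = np.argmin(np.abs(samples[:, None] - c[None, :]), axis=1)
    return c, z

rng = np.random.default_rng(0)
samples = rng.normal(size=10_000)
codebook, labels = lloyd_max_1d(samples, m=4)
distortion = np.mean((samples - codebook[labels]) ** 2)
```

For a unit-variance Gaussian and m = 4 levels, the distortion settles near the classic Lloyd-Max optimum of about 0.12.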
Additionally, vector quantization is computationally expensive in high dimensions. In the rare case that p(x) is independent in its components, and the metric is of the form d(x, x′) = Σ_i d_i(x_i, x′_i), where x_i are the components of x, we can obtain a minimum distortion quantizer by forming the Cartesian product of the independently quantized components. That is, let

q(x) = (q_1(x_1), q_2(x_2), ..., q_n(x_n)) = z,  (4)

where z = (z_1, z_2, ..., z_n), z_i = q_i(x_i). The quantizer design problem reduces to a set of n independent, readily solved 1D problems: each q_i is designed independently to minimize the expected distortion D_i = E[d_i(x_i, c_i(q_i(x_i)))]. It is straightforward to show that D = Σ_i D_i. Therefore minimizing each D_i independently minimizes D.

3.3. Bit Allocation

The minimum distortion criterion is sufficient to design a product quantizer if the number of distinct quantization levels per component is known. Determining the number of levels per component is known as bit allocation [2, Ch. 8]. Optimal bit allocation amounts to minimizing

D(b) = Σ_i D_i(b_i),  (5)

such that Σ_i b_i = log2 m, and where b = (b_1, b_2, ...) is the vector of per-component bit allocations. Solution to this optimization problem for general distributions and distortion functions requires computationally prohibitive numerical search. A reasonable approximation, which turns out to be effective in practice, is to assume that each component is identically distributed after normalizing the variance, and that the per-component distortion functions are identical. In this case, the optimal allocation is achieved when

b_i ∝ log2 σ_i,  (6)

where σ_i is the standard deviation of the i-th component [2]. Therefore, we can simply estimate σ_i from the training data and allocate bits to each component proportionally. In practice, we prefer each b_i to be integer-valued so that the components q_i(x_i) can be concatenated to encode q(x) as a contiguous bit vector.
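The component-wise encoding of Eq. (4) can be sketched as follows. The uniform 2-bit scalar quantizers and helper names here are purely illustrative — the paper's method uses minimum-distortion 1D quantizers per component:

```python
import numpy as np

def make_uniform_quantizer(lo, hi, bits):
    """A 2^bits-level uniform scalar quantizer on [lo, hi] (illustrative only)."""
    levels = 2 ** bits
    edges = np.linspace(lo, hi, levels + 1)
    centers = (edges[:-1] + edges[1:]) / 2           # codebook values c_i
    q = lambda x: int(np.clip(np.searchsorted(edges, x) - 1, 0, levels - 1))
    return q, centers

# One independent scalar quantizer per component, as in a product quantizer.
quantizers = [make_uniform_quantizer(-1.0, 1.0, bits=2) for _ in range(3)]

def encode(x):
    # q(x) = (q_1(x_1), ..., q_n(x_n)), as in Eq. (4).
    return tuple(q(xi) for (q, _), xi in zip(quantizers, x))

def decode(z):
    # Reconstruct from the per-component codebooks c_i.
    return np.array([c[zi] for (_, c), zi in zip(quantizers, z)])

x = np.array([0.3, -0.7, 0.05])
z = encode(x)          # -> (2, 0, 2); two bits per component, six bits total
print(z, decode(z))
```

Because each component is quantized independently, each code z_i costs a fixed number of bits and the full code is just their concatenation.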
We can ensure the overall bit budget is met while proportionally allocating an integer number of bits to each component through the sequential distribution procedure listed in Algorithm 1.

Algorithm 1 Distribute b bits among quantizer components.
  Initialize H(i) ← log2 σ_i, B(i) ← 0.
  for j = 1 to b do
    i ← argmax_i H(i).
    B(i) ← B(i) + 1.
    H(i) ← H(i) − 1.
  end for

After the bit allocation procedure, it may be that some components are allocated no bits at all. The effect is dimensionality reduction, since such components are omitted from z. Therefore, we have a principled method to select a subset of dimensions, and simultaneously allocate bits to the remaining dimensions, given a fixed overall bit budget, while minimizing D. Note that spectral hashing [16] arrives at a similar bit distribution scheme through a very different approach. The geometric intuition is that the resulting quantization cells should be roughly isotropic with respect to the distortion function. Therefore, components with larger standard deviations require a proportionally larger number of quantization levels.

3.4. Transform Coding

The product quantizer depends critically on the unrealistic assumption that the components of x are statistically independent. In vector quantization, this difficulty is addressed through transform coding, which seeks a (typically linear) transformation to reduce statistical dependence among the components [2]. Classically, this is done through principal component analysis (PCA), although many other alternatives exist. Specifically, we compute the eigenvectors and eigenvalues of the training sample covariance. The matrix of eigenvectors constitutes a unitary transformation U that is applied to all points prior to quantization. Then, the product quantizer is designed for the points y = Ux. Bit allocation and consequent dimensionality reduction, as described in the previous section, is based on the corresponding eigenvalues.
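Algorithm 1 is a few lines of code. The sketch below takes the per-component standard deviations σ_i as input — in the transform-coded setting these are the square roots of the PCA eigenvalues (the function name and example values are ours):

```python
import numpy as np

def allocate_bits(sigma, total_bits):
    """Algorithm 1: greedily distribute total_bits among components.

    Each step gives one bit to the component with the largest current
    priority H(i); giving a bit halves that component's remaining priority
    (H(i) drops by 1 in log2 terms), so bits end up proportional to log2(sigma_i).
    """
    H = np.log2(sigma)
    B = np.zeros(len(sigma), dtype=int)
    for _ in range(total_bits):
        i = np.argmax(H)
        B[i] += 1
        H[i] -= 1
    return B

# Components with larger standard deviation receive more bits; components
# whose priority never wins get zero bits and are simply dropped from z.
sigma = np.array([8.0, 4.0, 2.0, 1.0, 0.1])
print(allocate_bits(sigma, total_bits=8))  # -> [4 3 1 0 0]
```

Note that the last two components receive no bits at all, giving the dimensionality-reduction effect described above.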
In summary, the quantizer design method is as follows: (1) determine the PCA transformation from the training set, including removing the mean; (2) run the bit allocation procedure based on the PCA eigenvalues, and drop any components that receive zero bits; (3) for each remaining component, train a 1D distortion minimizing quantizer to obtain the quantization functions q_i and centroids c_i. The 1D distortion function D_i is the absolute difference in component values. Encoding of a new point entails PCA projection followed by quantization of the components. Reconstruction of a centroid in the input space from a quantized point entails constructing the centroid c = (c_1, c_2, ..., c_r) in the transform space, and then back projecting it with U^T and adding back the mean. The encoding and reconstruction process is depicted graphically in Figure 1.

Figure 1. Encoding and decoding flow for the transform coding system.

Since PCA does not guarantee statistical independence of the transformed components, the product quantizer is not in general optimal with respect to distance distortion. Additionally, with the exception of the squared Euclidean metric, quantizer distortion may not be expressible as the sum of the per-component distortion functions
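Design steps (1)-(3) can be sketched end-to-end as follows. This class is illustrative only: it substitutes uniform per-component quantizers for the 1D minimum-distortion quantizers of step (3), assumes the squared Euclidean metric, and uses a row-vector convention (y = xU here corresponds to y = Ux in the text); all names are our own:

```python
import numpy as np

class TransformCoder:
    """Sketch of the pipeline: PCA transform, greedy bit allocation from the
    eigenvalues, then a uniform scalar quantizer per retained component."""

    def fit(self, X, total_bits):
        # Step (1): PCA transformation, with mean removal.
        self.mean = X.mean(axis=0)
        Y = X - self.mean
        eigvals, eigvecs = np.linalg.eigh(np.cov(Y, rowvar=False))
        order = np.argsort(eigvals)[::-1]            # largest variance first
        self.U = eigvecs[:, order]
        sigma = np.sqrt(np.maximum(eigvals[order], 1e-12))
        # Step (2): greedy bit allocation (Algorithm 1); zero-bit dims dropped.
        H = np.log2(sigma)
        self.bits = np.zeros(len(sigma), dtype=int)
        for _ in range(total_bits):
            i = np.argmax(H); self.bits[i] += 1; H[i] -= 1
        self.keep = np.flatnonzero(self.bits)
        # Step (3), simplified: uniform quantizer over each component's range.
        P = Y @ self.U
        self.lo, self.hi = P.min(axis=0), P.max(axis=0)
        return self

    def encode(self, x):
        y = (x - self.mean) @ self.U                 # PCA projection
        codes = []
        for i in self.keep:
            levels = 2 ** self.bits[i]
            t = (y[i] - self.lo[i]) / (self.hi[i] - self.lo[i])
            codes.append(int(np.clip(np.floor(t * levels), 0, levels - 1)))
        return codes

    def decode(self, codes):
        y = np.zeros(self.U.shape[0])                # dropped dims reconstruct as 0
        for c, i in zip(codes, self.keep):
            levels = 2 ** self.bits[i]
            y[i] = self.lo[i] + (c + 0.5) / levels * (self.hi[i] - self.lo[i])
        return y @ self.U.T + self.mean              # back-project, restore mean

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 8)) * np.array([5, 3, 2, 1, 0.5, 0.3, 0.2, 0.1])
tc = TransformCoder().fit(X, total_bits=16)
x_hat = tc.decode(tc.encode(X[0]))
```

With 16 bits over 8 dimensions, the high-variance components receive most of the bits and the smallest-variance components are dropped, yet the reconstruction error stays small relative to the data's total variance.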