# 2010 VLAD jegou_compactimagerepresentation.pdf

**Aggregating local descriptors into a compact image representation**

Hervé Jégou (INRIA Rennes), Matthijs Douze (INRIA Grenoble), Cordelia Schmid (INRIA Grenoble), Patrick Pérez (Technicolor)

**Abstract.** We address the problem of image search on a very large scale, where three constraints have to be considered jointly: the accuracy of the search, its efficiency, and the memory usage of the representation. We first propose a simple yet efficient way of aggregating local image descriptors into a vector of limited dimension, which can be viewed as a simplification of the Fisher kernel representation. We then show how to jointly optimize the dimension reduction and the indexing algorithm, so that it best preserves the quality of vector comparison. The evaluation shows that our approach significantly outperforms the state of the art: the search accuracy is comparable to the bag-of-features approach for an image representation that fits in 20 bytes. Searching a 10 million image dataset takes about 50 ms.

## 1. Introduction

There are two reasons why the bag-of-features image representation (BOF) is popular for indexation and categorization applications. First, this representation benefits from powerful local descriptors, such as the SIFT descriptor [12] and more recent ones [13, 28, 29]. Second, these vector representations can be compared with standard distances, and subsequently be used by robust classification methods such as support vector machines.

When used for large scale image search, the BOF vector representing the image is advantageously chosen to be high-dimensional [16, 20, 10], up to a million dimensions. In this case, the search efficiency results from the use of inverted lists [23], which speed up the computation of distances between sparse vectors.
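To make the efficiency argument concrete, here is a minimal sketch of an inverted index over sparse BOF vectors. This is a hypothetical illustration of the general data structure, not the implementation used in the paper; the class and method names are our own.

```python
from collections import defaultdict

class InvertedIndex:
    """Hypothetical minimal inverted index: for each visual word, store
    the list of (image_id, weight) pairs, so a query only touches the
    posting lists of its own non-zero words."""

    def __init__(self):
        self.postings = defaultdict(list)  # visual word -> [(image_id, weight)]

    def add(self, image_id, bof):
        # bof: sparse BOF vector given as {visual_word: weight}
        for word, w in bof.items():
            self.postings[word].append((image_id, w))

    def scores(self, query_bof):
        # dot products with all indexed images, visiting only shared words
        scores = defaultdict(float)
        for word, qw in query_bof.items():
            for image_id, w in self.postings[word]:
                scores[image_id] += qw * w
        return scores

index = InvertedIndex()
index.add(0, {3: 1.0, 7: 2.0})
index.add(1, {7: 1.0, 9: 4.0})
result = index.scores({7: 1.0})  # only the posting list of word 7 is scanned
```

With a million-dimensional vocabulary and a few thousand non-zero words per image, scanning only shared posting lists is far cheaper than a dense distance computation.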
However, two factors limit the number of images that can be indexed in practice: the efficiency of the search itself, which becomes prohibitive when considering more than 10 million images, and the memory required to represent an image.

In this paper, we address the problem of searching for the most similar images in a very large image database (ten million images or more). We put an emphasis on the joint optimization of three constraints: the search accuracy, its efficiency and the memory usage. The last two are related [24], as search efficiency can be approximated by the amount of memory to be visited. Most similar to our work is the approach of [9], which proposes an approximate nearest neighbor search for BOF vectors. However, this method is limited to small vocabulary sizes, yielding lower search accuracy compared to large ones. Moreover, an image still requires more than a hundred bytes to reproduce the relatively low search accuracy of a low-dimensional BOF vector.

The efficiency problem was partially addressed by the min-Hash approach [2, 3] and the method of Torresani et al. [25]. However, these techniques still require a significant amount of memory per image, and are more relevant in the context of near-duplicate image detection, as their search accuracy is significantly lower than that of BOF. Some authors address the efficiency and memory constraints by using GIST descriptors [17], and by converting them to compact binary vectors [5, 11, 24, 27]. These approaches are limited by the low degree of invariance of the global GIST descriptor, and none of them jointly fulfills the three aforementioned constraints: [24, 27] still require an exhaustive search, while [11] is memory consuming due to the redundancy of the LSH algorithm. In contrast, our approach obtains a significantly higher accuracy with, typically, a 20-byte representation. This is obtained by optimizing:

1. the representation, i.e., how to aggregate local image descriptors into a vector representation;
2. the dimensionality reduction of these vectors;
3. the indexing algorithm.

These steps are closely related: representing an image by a high-dimensional vector usually provides better exhaustive search results than with a low-dimensional one. However, high-dimensional vectors are more difficult to index efficiently. In contrast, a low-dimensional vector is more easily indexed, but its discriminative power is lower and might not be sufficient for recognizing objects or scenes.

Our first contribution consists in proposing a representation that provides excellent search accuracy with a reasonable vector dimensionality, as we know that the vector will be indexed subsequently. We propose a descriptor, derived from both BOF and the Fisher kernel [18], that aggregates SIFT descriptors and produces a compact representation. It is termed VLAD (vector of locally aggregated descriptors). Experimental results demonstrate that VLAD significantly outperforms BOF for the same size. It is cheaper to compute and its dimensionality can be reduced to a few hundred components by principal component analysis (PCA) without noticeably impacting its accuracy.

As a second contribution, we show the advantage of jointly optimizing the trade-off between the dimensionality reduction and the indexation algorithm. We consider in particular the recent indexing method of [7], as we can directly compare the error induced by PCA with the error resulting from the indexation, due to the approximate reconstruction of the vector from its encoded index.

After presenting two image vector representations that inspired ours, BOF and the Fisher kernel [18], we introduce our descriptor aggregation method in Section 2. The joint optimization of dimensionality reduction and indexing is presented in Section 3.
Experimental results demonstrate the performance of our approach in Section 4: we show that the performance of BOF is attained with an image representation of about 20 bytes. This is a significant improvement over the state of the art [9], in terms of memory usage, search accuracy and efficiency alike.

## 2. Image vector representation

In this section, we briefly review two popular approaches that produce a vector representation of an image from a set of local descriptors. We then propose our method to aggregate local descriptors.

### 2.1. Bag of features

The BOF representation groups local descriptors. It requires the definition of a codebook of k "visual words", usually obtained by k-means clustering. Each local descriptor of dimension d from an image is assigned to the closest centroid. The BOF representation is obtained as the histogram of the assignment of all image descriptors to visual words. Therefore, it produces a k-dimensional vector, which is subsequently normalized. There are several variations on how to normalize the histogram. When seen as an empirical distribution, the BOF vector is normalized using the Manhattan distance. Another common choice consists in using Euclidean normalization. The vector components are then weighted by idf (inverse document frequency) terms. Several weighting schemes have been proposed [16, 23]. In the following, we perform L2 normalization of histograms and use the idf calculation of [23].

Several variations have been proposed to improve the quality of this representation. One of the most popular [21, 26] consists in using soft quantization techniques instead of k-means.

### 2.2. Fisher kernel

The Fisher kernel [6] is a powerful tool to transform an incoming variable-size set of independent samples into a fixed-size vector representation, assuming that the samples follow a parametric generative model estimated on a training set.
This description vector is the gradient of the sample's likelihood with respect to the parameters of this distribution, scaled by the inverse square root of the Fisher information matrix. It gives the direction in parameter space in which the learnt distribution should be modified to better fit the observed data. It has been shown that discriminative classifiers can be learned in this new representation space.

Perronnin et al. [18] applied the Fisher kernel in the context of image classification. They model the visual words with a Gaussian mixture model (GMM), restricted to diagonal variance matrices for each of the k components of the mixture. Deriving a diagonal approximation of the Fisher matrix of a GMM, they obtain a (2d+1) × k − 1 dimensional vector representation of an image feature set, or a d × k dimensional one when considering only the components associated with either the means or the variances of the GMM. In comparison with the BOF representation, fewer visual words are required by this more sophisticated representation.

### 2.3. VLAD: vector of locally aggregated descriptors

We propose a vector representation of an image which aggregates descriptors based on a locality criterion in feature space. It can be seen as a simplification of the Fisher kernel. As for BOF, we first learn a codebook C = {c_1, ..., c_k} of k visual words with k-means. Each local descriptor x is associated to its nearest visual word c_i = NN(x). The idea of the VLAD descriptor is to accumulate, for each visual word c_i, the differences x − c_i of the vectors x assigned to c_i. This characterizes the distribution of the vectors with respect to the center.

Assuming the local descriptor to be d-dimensional, the dimension D of our representation is D = k × d. In the following, we represent the descriptor by v_{i,j}, where the indices i = 1, ..., k and j = 1, ..., d respectively index the visual word and the local descriptor component.
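The accumulation just described can be sketched in a few lines of NumPy. This is a hypothetical minimal version that assumes the k-means codebook is already learned; it also applies the L2 normalization used by the paper.

```python
import numpy as np

def vlad(descriptors, codebook):
    """Minimal VLAD sketch: descriptors is (n, d), codebook is (k, d).

    For each descriptor x, find its nearest centroid c_i and accumulate
    the residual x - c_i in the row of v for that visual word.
    """
    k, d = codebook.shape
    v = np.zeros((k, d))
    # nearest visual word for every descriptor (brute-force squared distances)
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nn = dists.argmin(axis=1)
    for x, i in zip(descriptors, nn):
        v[i] += x - codebook[i]
    v = v.reshape(-1)              # D = k * d components
    return v / np.linalg.norm(v)   # L2 normalization

rng = np.random.default_rng(0)
desc = rng.standard_normal((100, 128))  # e.g. 100 SIFT-like descriptors
cb = rng.standard_normal((16, 128))     # k = 16 centroids
v = vlad(desc, cb)                      # D = 16 * 128 = 2048 components
```

Note that a centroid with no assigned descriptors simply contributes a block of zeros, which is consistent with the sparsity observed in Figure 1.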
Hence, a component of v is obtained as a sum over all the image descriptors:

$$v_{i,j} = \sum_{x \,:\, \mathrm{NN}(x)=c_i} x_j - c_{i,j} \qquad (1)$$

where $x_j$ and $c_{i,j}$ respectively denote the j-th component of the descriptor x considered and of its corresponding visual word c_i. The vector v is subsequently L2-normalized: v := v / ||v||_2.

Figure 1. Images and corresponding VLAD descriptors, for k = 16 centroids (D = 16 × 128). The components of the descriptor are represented like SIFT, with negative components (see Equation 1) in red.

Experimental results show that excellent results can be obtained even with a relatively small number of visual words k: we consider values ranging from k = 16 to k = 256.

Figure 1 depicts the VLAD representations associated with a few images, when aggregating 128-dimensional SIFT descriptors. The components of our descriptor map to components of SIFT descriptors. Therefore we adopt the usual 4×4 spatial grid representation of oriented gradients for each v_i, i = 1, ..., k. We have accumulated the descriptors in 16 such grids, one per visual word. In contrast to SIFT descriptors, a component may be positive or negative, due to the difference in Equation 1.

One can observe that the descriptors are relatively sparse (few values have a significant energy) and very structured: most high descriptor values are located in the same cluster, and the geometrical structure of SIFT descriptors is observable. Intuitively and as shown later, a principal component analysis is likely to capture this structure. For sufficiently similar images, the closeness of the descriptors is obvious.

## 3. From vectors to codes

This section addresses the problem of coding an image vector. Given a D-dimensional input vector, we want to produce a code of B bits encoding the image representation, such that the nearest neighbors of a (non-encoded) query vector can be efficiently searched in a set of n encoded database vectors.
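To fix orders of magnitude, using only the numbers quoted in the abstract (10 million images, 20-byte codes), the resulting index is small enough to reside entirely in main memory:

```python
n = 10_000_000                 # database size quoted in the abstract
code_bytes = 20                # per-image code, i.e. B = 160 bits
index_bytes = n * code_bytes   # 200 million bytes, about 200 MB: fits in RAM
```

This is the regime that makes exhaustive scans over codes feasible, in contrast with uncompressed high-dimensional vectors.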
We handle this problem in two steps, which must be optimized jointly: 1) a projection that reduces the dimensionality of the vector and 2) a quantization used to index the resulting vectors. For this purpose, we consider the recent approximate nearest neighbor search method of [7], which is briefly described in the next section. We will show the importance of the joint optimization by measuring the mean squared Euclidean error generated by each step.

### 3.1. Approximate nearest neighbor

Approximate nearest neighbor search methods [4, 11, 15, 24, 27] are required to handle large databases in computer vision applications [22]. One of the most popular techniques is Euclidean Locality-Sensitive Hashing [4], which has been extended in [11] to arbitrary metrics. However, these approaches and the one of [15] are memory consuming, as several hash tables or trees are required. The method of [27], which embeds the vector into a binary space, better satisfies the memory constraint. It is, however, significantly outperformed in terms of the trade-off between memory and accuracy by the product quantization-based approximate search method of [7]. In the following, we use this method, as it offers better accuracy and because the search algorithm provides an explicit approximation of the indexed vectors. This allows us to compare the vector approximations introduced by the dimensionality reduction and the quantization. We use the asymmetric distance computation (ADC) variant of this approach, which encodes only the vectors of the database, not the query vector. This method is summarized in the following.

**ADC approach.** Let $x \in \mathbb{R}^D$ be a query vector and $Y = \{y_1, \ldots, y_n\}$ a set of vectors in which we want to find the nearest neighbor NN(x) of x. The ADC approach consists in encoding each vector $y_i$ by a quantized version $c_i = q(y_i) \in \mathbb{R}^D$. For a quantizer q(·) with k centroids, the vector is encoded by log2(k) bits, k being a power of 2.
Finding the a nearest neighbors NN_a(x) of x simply consists in computing

$$\mathrm{NN}_a(x) = a\text{-}\operatorname*{argmin}_i \|x - q(y_i)\|^2. \qquad (2)$$

Note that, in contrast with the embedding method of [27], the query x is not converted to a code: there is no approximation error on the query side.

To get a good vector approximation, k should be large ($k = 2^{64}$ for a 64-bit code). For such large values of k, learning a k-means codebook as well as the assignment to the centroids is not tractable. Our solution is to use a product quantization method, which defines the quantizer without explicitly enumerating its centroids. A vector x is first split into m subvectors $x^1, \ldots, x^m$ of equal length D/m. A product quantizer is then defined as a function

$$q(x) = \big(q_1(x^1), \ldots, q_m(x^m)\big), \qquad (3)$$

which maps the input vector x to a tuple of indices by separately quantizing the subvectors. Each individual quantizer $q_j(\cdot)$ has $k_s$ reproduction values, learned by k-means. To limit the assignment complexity $O(m \times k_s)$, $k_s$ is a small value (e.g., $k_s = 256$). However, the set of k centroids induced by the product quantizer q(·) is large: $k = (k_s)^m$.

The squared distances in Equation 2 are computed using the decomposition

$$\|x - q(y_i)\|^2 = \sum_{j=1,\ldots,m} \|x^j - q_j(y_i^j)\|^2, \qquad (4)$$

where $y_i^j$ is the j-th subvector of $y_i$. The squared distances in this summation are read from look-up tables computed, prior to the search, between each subvector $x^j$ and the $k_s$ centroids associated with the corresponding quantizer $q_j$. The generation of the tables is of complexity $O(D \times k_s)$. When $k_s \ll n$, this complexity is negligible compared with the summation cost of $O(D \times n)$ in Equation 2. This quantization method was chosen because of its ex
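Equations 2–4 can be sketched end to end in NumPy. This is a hypothetical minimal illustration with toy parameters (m = 2, k_s = 4, random stand-ins for the learned sub-codebooks), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, ks = 8, 2, 4
sub = D // m
# one (ks, sub) codebook per subquantizer q_j; in practice learned by k-means
codebooks = rng.standard_normal((m, ks, sub))

def encode(y):
    """q(y): tuple of m centroid indices, one per subvector (Equation 3)."""
    codes = []
    for j in range(m):
        yj = y[j * sub:(j + 1) * sub]
        codes.append(((codebooks[j] - yj) ** 2).sum(axis=1).argmin())
    return np.array(codes)

def adc_distances(x, codes):
    """||x - q(y_i)||^2 for all database codes via look-up tables (Equation 4)."""
    # tables[j, c] = squared distance between subvector x^j and centroid c of q_j
    tables = np.stack([
        ((codebooks[j] - x[j * sub:(j + 1) * sub]) ** 2).sum(axis=1)
        for j in range(m)
    ])
    # for each code, sum the m table entries it selects
    return tables[np.arange(m), codes].sum(axis=1)

Y = rng.standard_normal((1000, D))        # database vectors
codes = np.array([encode(y) for y in Y])  # n x m small integers
x = rng.standard_normal(D)                # query: never encoded (asymmetric)
nn = int(adc_distances(x, codes).argmin())
```

The tables cost O(D × k_s) to build, after which each of the n distances is just m table look-ups and additions, matching the complexity argument above.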