# Joint Inverted Indexing

Yan Xia¹\*, Kaiming He², Fang Wen², Jian Sun²

¹University of Science and Technology of China  ²Microsoft Research Asia

## Abstract

Inverted indexing is a popular non-exhaustive solution to large-scale search. An inverted file is built by a quantizer such as k-means or a tree structure. It has been found that multiple inverted files, obtained by multiple independent random quantizers, achieve practically good recall and speed. Instead of computing the multiple quantizers independently, we present a method that creates them jointly. Our method jointly optimizes all codewords in all quantizers and then assigns these codewords to the quantizers. In experiments this method shows significant improvement over various existing methods that use multiple independent quantizers. On the one-billion set of SIFT vectors, our method is faster and more accurate than a recent state-of-the-art inverted indexing method.

## 1. Introduction

Inverted indexing [32] is of central importance in modern search engines for both text retrieval [4] and image/video retrieval [29]. In text search engines like Google [5], a document is represented by a vector of weighted word frequencies. The inverted index is built as a vocabulary of words, and each word has a list of the documents that contain it. Online, the search engine retrieves the lists corresponding to the query words and ranks the returned documents. This is a non-exhaustive solution because only a very small portion of the documents is checked. Inverted indexing provides a practical solution to large-scale problems.

The above approach has been adapted to image/video retrieval by introducing "visual words" [29]. The visual words (codewords) are obtained by quantizing visual vectors, e.g., via k-means [29], k-means trees [24], or other tree structures [11, 10, 20]. The inverted index is built as a codebook of the codewords.
Each codeword has a list of all vectors (or their IDs) belonging to this codeword. Given a query vector, the system finds the codeword of the query and checks the items in the list of this codeword. The checking can be distance computation (using raw vectors, binary embedding [17], or product-quantized codes [19]) or can involve geometric considerations [17, 33].

An effective way to improve the recall of inverted indexing methods is to use multiple quantizers (also known as multiple hash tables [8, 26]). Multiple quantizers introduce redundant coverage among the lists. A query is compared to the items in multiple (overlapping) lists to reduce the chance of missing true neighbors. Locality Sensitive Hashing (LSH) [16, 8, 2] is a widely used multiple-quantizer method¹. In LSH each quantizer (hash table) bins a low-dimensional random subspace. In [28, 23] multiple randomized k-d trees are used and have shown advantages over LSH. In [26] Paulevé et al. compare the performance of lattices, trees, and k-means as the hash tables. They recommend using randomly re-initialized k-means as the multiple quantizers. This method is named "k-means LSH" or KLSH [26].

In the spirit of hashing, all the above multi-quantizer methods create the quantizers independently and randomly. The KLSH method [26] enjoys the advantage that each individual quantizer is (locally) optimal in the sense of minimizing quantization error [14]. But we point out that for KLSH, the codewords from different codebooks tend to be similar. This is because the k-means algorithm tends to put the codewords around densely distributed data even in different random trials; the codewords can differ from each other only because of the local optimality of k-means. Similar codewords reduce the redundant coverage and limit the gain of multiple quantizers.

\* This work was done when Yan Xia was an intern at MSRA.
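As a concrete sketch of the multiple-quantizer scheme described above, the toy code below (the helper names are ours, not the paper's) builds one inverted file per quantizer and answers a query with the union of the L retrieved lists, assuming the codewords of each quantizer are already given (e.g., from k-means):

```python
def nearest(codewords, x):
    """Index of the codeword nearest to x (squared Euclidean distance)."""
    return min(range(len(codewords)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, codewords[k])))

def build_inverted_files(quantizers, data):
    """One inverted file per quantizer: codeword index -> list of item IDs."""
    files = []
    for codewords in quantizers:
        lists = [[] for _ in codewords]
        for i, x in enumerate(data):
            lists[nearest(codewords, x)].append(i)
        files.append(lists)
    return files

def query(quantizers, files, q):
    """Candidate set: the union of the L lists the query falls into."""
    candidates = set()
    for codewords, lists in zip(quantizers, files):
        candidates.update(lists[nearest(codewords, q)])
    return candidates
```

When the quantizers partition the space differently, a query near one quantizer's boundary can still be well inside a cell of another, which is exactly the redundant coverage that multiple lists provide.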
To illustrate, we consider simple 1-d data subject to a two-peak distribution (Fig. 1). Suppose we want two quantizers, each with two codewords. If each quantizer is constructed randomly and independently as in KLSH, the two codewords in a quantizer will always be placed near the two peaks. The space partitionings of the two quantizers are then almost the same (Fig. 1, left), and the gain of having the second quantizer is lost. If a query is located very close to the partitioning boundary, the recall of retrieving the true k-NNs can be low.

¹ LSH has also been used as compact binary encoding for distance ranking [30]. Unless specified, in this paper we refer to LSH as a multiple-quantizer method that uses inverted indexing, as described in [8, 2].

Figure 1. KLSH vs. joint quantizers. We illustrate two quantizers with two codewords in each quantizer. Top: the locations of the four codewords (c₁ to c₄). Middle and Bottom: quantizer 1 and quantizer 2, with the partitioning boundaries. When a query (q) is near any partitioning boundary, the independent quantizers may miss many true neighbors, while the joint quantizers can retrieve more true neighbors because the query is well inside at least one partition.

In the above example, we can instead generate all four codewords jointly via a single pass of k-means clustering. These four codewords are distant from each other (Fig. 1, right). Then we construct each quantizer by mutually exclusively selecting two codewords with some care, and the two resulting quantizers can be substantially different. Any query that lies near the partitioning boundary of one quantizer can still find its k-NNs through the other quantizer, and the recall is improved.

With this motivation, in this paper we propose joint inverted indexing, a novel multiple-quantizer method.
Rather than construct quantizers independently, we create them jointly. We divide this challenging problem into two sub-problems, each of which is formulated as an optimization. We first jointly optimize all the codewords of all quantizers. The optimized codewords are then assigned to the quantizers, with the consideration that the total distortion of all quantizers should be small. Experiments on million/billion-scale datasets show that our method performs significantly better than various multiple-quantizer methods like LSH [2], randomized trees [28, 23], and KLSH [26].

On a one-billion SIFT set, our method is faster and more accurate than a recent state-of-the-art method called "inverted multi-index" [3]. When searching in one billion items for the first nearest neighbor of a query, our method takes less than ten milliseconds using a single thread and achieves over 90% recall in the first one hundred retrieved items (detailed in Sec. 4 and Table 3). The superiority of our method over [3] comes at the cost of more memory consumption. Whether this trade-off is desirable depends on the application, but we believe accuracy and speed are both of central importance in many practical settings, and our method can be favored in these cases.

## 2. Related Work

Recent progress on large-scale search has mainly followed two directions: exhaustive and non-exhaustive. One can efficiently exhaust millions of items using compact encoding like binary embedding (e.g., [30, 31, 13, 15]) or product quantization [19, 12]. But for larger datasets, non-exhaustive search is still highly desired, and compact encoding is often used for re-ranking the retrieved data [17, 19, 3]. In this paper, we focus on non-exhaustive search based on inverted indexing.

An inverted index can be built using a quantizer like k-means [22], a binary tree [10, 7], hierarchical k-means [11, 24], or a lattice [1]. An index represents a cluster, a tree leaf, or a bin.
The k-means quantizer is (locally) optimal in terms of quantization distortion [14].

Using multiple quantizers to improve search accuracy is attractive. The LSH method [16, 6, 8] randomly generates L hash tables, each of which bins a low-dimensional subspace. For each single hash table, an inverted indexing system is built with each bin as an index, and each index has a list containing all the vectors in the bin. This is repeated for each of the L hash tables. Given a query, the algorithm retrieves L lists and checks all the items in them.

Instead of binning, the methods in [28, 23] use randomized trees. Each single tree is generated by random projection. In [28] it is empirically observed that if the trees are independently rotated, the error rate is smaller when visiting shallow nodes in all trees than when deeply backtracking a single one. In [23] this strategy is found superior to LSH.

The KLSH method [26] uses randomly re-initialized k-means to build each quantizer. The experiments in [26] (and also in this paper) show this is superior to various alternatives including lattices and trees. But as discussed in the introduction, the codewords in the multiple quantizers can be similar across random trials.

Most recently, an inverted multi-index method [3] has been proposed. This is actually a single-quantizer method, whose quantizer is the Cartesian product of multiple sub-quantizers. As such, each codeword is indexed by the Cartesian product of multiple sub-indices. The quantization of the space can be much finer (e.g., than classic k-means). The finer quantization, in conjunction with a soft-assignment scheme, has shown great superiority over a single k-means quantizer [3]. This is a methodology different from multiple quantizers. We will compare with this method experimentally.

## 3. Joint Inverted Indexing

We first introduce the formulation of the method, and then present the algorithm.
### 3.1. Formulation

Denote the number of quantizers as $L$ and the number of codewords in each quantizer as $K$. We assume $K$ and $L$ are pre-defined and application-dependent (e.g., in [26] $K = 2^n$ and $L = 16$).

**Background.** It is well known that a k-means quantizer attempts to minimize the quantization distortion [14]:

$$\min_{\{c_k\}} \sum_{k=1}^{K} \sum_{x \in C_k} d(x, c_k), \qquad (1)$$

with $C_k = \{x \mid d(x, c_k) \le d(x, c_{k'}),\ \forall k' \ne k\}$. Here $x$ belongs to the training dataset, $c_k$ is a codeword, $C_k$ is a cluster, and $d(\cdot,\cdot)$ is the squared Euclidean distance between two vectors. The optimization is w.r.t. the set of codewords $\{c_k\}$. The distortion in (1) can be minimized via the classical EM-like algorithm: alternately update the codewords $\{c_k\}$ and assign the data to the clusters $\{C_k\}$.

For $L$ independent k-means quantizers as in KLSH [26], we can formulate them as minimizing the total distortion of all quantizers:

$$\min_{\{c_k^l\}} \sum_{l=1}^{L} \left( \sum_{k=1}^{K} \sum_{x \in C_k^l} d(x, c_k^l) \right), \qquad (2)$$

with $C_k^l = \{x \mid d(x, c_k^l) \le d(x, c_{k'}^l),\ \forall k' \ne k\}$. Here we use $C_k^l$ and $c_k^l$ to denote the $l$-th quantizer's clusters and codewords. If all the codewords $\{c_k^l \mid 1 \le k \le K,\ 1 \le l \le L\}$ are initialized randomly, minimizing (2) is equivalent to independently minimizing $L$ separate problems as in (1). This is the way of KLSH. If it were not for the local optimality of k-means clustering, all the quantizers would be identical.

**Joint Codewords Optimization.** In (2) the codewords in one quantizer are unaware of those in the other quantizers, so the independent optimization of any two quantizers may push the codewords towards similar positions due to the nature of k-means. To overcome this issue, we propose to optimize all the codewords jointly. We notice that with multiple inverted files, it is sufficient for a true datum to be correctly retrieved if it is covered by one of the $L$ lists.
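Objectives (1) and (2) are straightforward to evaluate. As a minimal illustration (function names are ours, not the paper's), the distortion of one quantizer sums each point's squared distance to its nearest codeword, and the KLSH objective (2) simply adds these per-quantizer distortions:

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def distortion(codewords, data):
    """Quantization distortion of one quantizer, as in Eq. (1)."""
    return sum(min(sq_dist(x, c) for c in codewords) for x in data)

def total_distortion(quantizers, data):
    """Sum of per-quantizer distortions over L quantizers, as in Eq. (2)."""
    return sum(distortion(codewords, data) for codewords in quantizers)
```

Note that in (2) every datum contributes to every quantizer's term, which is why KLSH optimizes each quantizer independently.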
Motivated by this, it is reasonable to consider the smallest distortion of a datum $x$ among all $L$ quantizers. Formally, we consider:

$$\min_l \left( \min_k d(x, c_k^l) \right), \qquad (3)$$

or equivalently $\min_{(l,k)} d(x, c_k^l)$, as the smallest distortion of $x$. To jointly optimize all codewords of all quantizers, we propose to minimize the sum of the smallest distortions over all data: $\sum_x \left( \min_{(l,k)} d(x, c_k^l) \right)$. It is easy to show that this problem is equivalent to optimizing:

$$\min_{\{c_k^l\}} \sum_{l=1}^{L} \sum_{k=1}^{K} \left( \sum_{x \in S_{(l,k)}} d(x, c_k^l) \right), \qquad (4)$$

with $S_{(l,k)} = \{x \mid d(x, c_k^l) \le d(x, c_{k'}^{l'}),\ \forall (l', k') \ne (l, k)\}$. Unlike in (2), where each $x$ is counted $L$ times, in (4) each $x$ is counted only once. Replacing $(l,k)$ by a one-to-one mapping to an integer $j$, $M(l,k) \mapsto j \in [1, L \times K]$, we can rewrite (4) as this optimization problem:

$$\min_{\{c_j\}} \sum_{j=1}^{L \times K} \left( \sum_{x \in S_j} d(x, c_j) \right), \qquad (5)$$

with $S_j = \{x \mid d(x, c_j) \le d(x, c_{j'}),\ \forall j' \ne j\}$. Comparing (5) to (1), one can see that problem (5) is a single pass of k-means with $L \times K$ clusters, so it can be easily optimized. We denote the optimized codewords as $\mathcal{C} = \{c_j \mid 1 \le j \le L \times K\}$.

The above joint codewords optimization has not yet considered the assignment of the codewords to the quantizers. A naive way is to randomly assign the codewords in $\mathcal{C}$ without replacement² to the $L$ quantizers. We call this method "Joint (rand-assign)". In experiments, this simple method already shows better search performance than KLSH (detailed in Sec. 4, Fig. 6). This implies that the joint codewords optimization is an effective method for multiple quantizers.
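"Joint (rand-assign)" can be sketched in a few lines under these definitions (plain Lloyd's k-means on tuples; all names are ours and illustrative only): solve (5) with a single k-means of L×K clusters, then deal the resulting codewords to the L quantizers without replacement:

```python
import random

def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def kmeans(data, n_clusters, n_iters=20, seed=0):
    """Plain Lloyd's k-means on tuples; returns the final centers."""
    rng = random.Random(seed)
    centers = rng.sample(data, n_clusters)
    for _ in range(n_iters):
        clusters = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))
            clusters[j].append(x)
        new_centers = []
        for j, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(tuple(sum(dim) / len(cluster)
                                         for dim in zip(*cluster)))
            else:
                new_centers.append(centers[j])  # keep a center with no points
        centers = new_centers
    return centers

def rand_assign(centers, L, K, seed=0):
    """Deal the L*K joint centers to L quantizers without replacement,
    so each codeword is used by exactly one quantizer."""
    rng = random.Random(seed)
    pool = list(centers)
    rng.shuffle(pool)
    return [pool[l * K:(l + 1) * K] for l in range(L)]
```

Because the L×K centers come from one clustering pass, they repel each other during training, so the L codebooks are guaranteed to be disjoint and mutually different, unlike independently trained k-means codebooks.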
This is due to two nice properties of joint codewords optimization: (i) each $x$ is expected to be accurately quantized at least once, in the sense of optimizing (4); (ii) the codewords among the quantizers are guaranteed to be sufficiently different from each other.

**Joint Codewords Assignment.** Suppose the set $\mathcal{C}$ has been fixed, i.e., the problem in (4) (or equivalently (5)) has been optimized. It is still possible to improve on the performance of "Joint (rand-assign)" with a better assignment. Notice that the quantization distortion of any individual quantizer has not been considered in (4). For a better assignment, we consider the total distortion of all quantizers subject to a constraint. Formally, we consider:

$$\min \sum_{l=1}^{L} \left( \sum_{k=1}^{K} \sum_{x \in C_k^l} d(x, c_k^l) \right), \qquad (6)$$

with $C_k^l = \{x \mid d(x, c_k^l) \le d(x, c_{k'}^l),\ \forall k' \ne k\}$,

$$\text{s.t. } c_k^l \in \mathcal{C}, \text{ and } c_k^l \ne c_{k'}^{l'}\ \forall (l', k') \ne (l, k). \qquad (7)$$

² The term "without replacement" indicates that each sample can only be selected once.

Figure 2. Joint Inverted Indexing. We illustrate L = 3 quantizers with K = 8 codewords in each quantizer. Each dot represents a data point. (a) Codewords Generation: all L×K codewords in all quantizers are generated simultaneously by k-means clustering. (b) Codewords Grouping. (c) Codewords Assignment.
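The objective in (6) scores a candidate assignment of the pool into L disjoint codebooks. Below is a sketch of that evaluation (names are ours; this is not the paper's assignment algorithm, only the cost it minimizes): on the two-peak toy of Fig. 1, the assignment giving every quantizer one codeword from each peak scores far lower than the one where each quantizer covers a single peak.

```python
def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def assignment_cost(quantizers, data):
    """Total distortion of Eq. (6): each quantizer quantizes every datum
    with its own K codewords; sum the per-quantizer distortions."""
    return sum(
        sum(min(sq_dist(x, c) for c in codebook) for x in data)
        for codebook in quantizers)
```

The constraint (7) is implicit in the input: the codebooks must partition the fixed pool C, with no codeword repeated across quantizers.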