IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Optimized Cartesian K-Means

Jianfeng Wang, Jingdong Wang, Jingkuan Song, Xin-Shun Xu, Heng Tao Shen, Shipeng Li

Abstract—Product quantization-based approaches are effective at encoding high-dimensional data points for approximate nearest neighbor search. The space is decomposed into a Cartesian product of low-dimensional subspaces, each of which generates a sub codebook. Data points are encoded as compact binary codes using these sub codebooks, and the distance between two data points can be approximated efficiently from their codes using precomputed lookup tables. Traditionally, to encode a subvector of a data point in a subspace, only one sub codeword in the corresponding sub codebook is selected, which may impose strict restrictions on the search accuracy. In this paper, we propose a novel approach, named Optimized Cartesian K-Means (OCKM), to better encode the data points for more accurate approximate nearest neighbor search. In OCKM, multiple sub codewords are used to encode the subvector of a data point in a subspace. Each sub codeword stems from a different sub codebook in each subspace, and the sub codebooks are optimally generated with regard to minimizing the distortion error. The high-dimensional data point is then encoded as the concatenation of the indices of multiple sub codewords from all the subspaces. This provides more flexibility and lower distortion errors than traditional methods. Experimental results on standard real-life datasets demonstrate the superiority over state-of-the-art approaches for approximate nearest neighbor search.

Index Terms—Clustering, Cartesian product, Nearest neighbor search

1 INTRODUCTION

Nearest neighbor (NN) search in large data sets has wide applications in information retrieval, computer vision, machine learning, pattern recognition, recommendation systems, etc.
However, exact NN search is often intractable because of the large scale of the database and the curse of high dimensionality. Instead, approximate nearest neighbor (ANN) search is more practical and can achieve orders-of-magnitude speed-ups over exact NN search with near-optimal accuracy [29]. There has been a lot of research interest in designing effective data structures, such as the k-d tree [4], randomized k-d forests [30], FLANN [22], the trinary-projection tree [11], [39], and neighborhood graph search [1], [35], [37], [38].

Hashing algorithms have attracted a large amount of attention recently, as their storage cost is small and their distance computation is efficient. Such approaches map data points to compact binary codes through a hash function, which can be generally expressed as

$b = h(x) \in \{0, 1\}^L$,

where $x$ is a $P$-dimensional real-valued point, $h(\cdot)$ is the hash function, and $b$ is a binary vector with $L$ entries. For convenience of description, we will use "a vector" and "a code" to refer to $b$ interchangeably.

Jianfeng Wang is with University of Science and Technology of China. Email: [email protected]. Jingdong Wang and Shipeng Li are with Microsoft Research, Beijing, P.R. China. Emails: [email protected]. Xin-Shun Xu is with Shandong University. Email: [email protected]. Jingkuan Song and Heng Tao Shen are with School of Information Technology and Electrical Engineering, The University of Queensland, Australia. Email: [email protected].

The pioneering hashing work, locality sensitive hashing (LSH) [3], [8], adopts random linear projections, and similarity preservation is probabilistically guaranteed. Other approaches based on random functions include kernelized LSH [14], non-metric LSH [21], LSH from shift-invariant kernels [25], and super-bit LSH [10]. To preserve some notion of similarity, numerous efforts have been devoted to finding a good hash function by exploring the distribution of the specific data set. Typical approaches are unsupervised hashing [5], [12], [13], [33], [36], [40], [41], [42] and supervised hashing [16], [23], with kernelized versions [7], [17] and extensions to multi-modality [31], [32], [43], etc. Those algorithms usually use the Hamming distance, which can only produce a few distinct distances, resulting in limited ability and flexibility of distance approximation.

Quantization-based algorithms have been shown to achieve superior performance [9], [24]. Representative algorithms include product quantization (PQ) [9] and Cartesian K-means (CKM) [24], which are modified versions of the conventional K-means algorithm [19]. The quantization approaches typically learn a codebook $\{d_1, \dots, d_K\}$, where each codeword $d_k$ is a $P$-dimensional vector. The data point $x$ is encoded as

$k^* = \arg\min_{k \in \{1, 2, \dots, K\}} \|x - d_k\|_2^2$,   (1)

where $\|\cdot\|_2$ denotes the $l_2$ norm. The index $k^*$ indicates which codeword is the closest to $x$ and can be represented as a binary code of length $\lceil \log_2(K) \rceil$.¹

¹ In the following, we omit the $\lceil \cdot \rceil$ operator without affecting the understanding.

The crucial problem for quantization algorithms is how to learn the codebook. In traditional K-means, the codebook is composed of the cluster centers with a minimal squared distortion error. The drawbacks when applying K-means to ANN search include that the size of the codebook
is quite limited and that computing the distances between the query and the codewords is expensive. PQ [9] addresses this problem by splitting the $P$-dimensional space into multiple disjoint subspaces and forming the codebook as the Cartesian product of sub codebooks, each of which is learned on one subspace using the conventional K-means algorithm. The compact code is formed by concatenating the indices of the selected sub codeword within each sub codebook. CKM [24] improves PQ by optimally rotating the $P$-dimensional space to achieve a lower distortion error.

In PQ and CKM, only one sub codeword on each subvector is used to quantize the data points, which results in a limited capability of reducing the distortion error and thus limited search accuracy. In this paper, we first present a simple algorithm, extended Cartesian K-means (ECKM), which extends CKM by using multiple (e.g., $C$) sub codewords for a data point from the sub codebook in each subspace. Then, we propose the optimized Cartesian K-means (OCKM) algorithm, which learns $C$ sub codebooks in each subspace instead of a single sub codebook like ECKM, and selects $C$ sub codewords, each chosen from a different sub codebook. We show that both PQ and CKM are constrained versions of our OCKM under the same code length, which suggests that our OCKM can lead to a lower quantization error and thus a higher search accuracy. Experimental results also validate that our OCKM achieves superior performance.

The remainder of this paper is organized as follows. Related work is first reviewed in Sec. 2. The proposed ECKM is introduced in Sec. 3, followed by OCKM in Sec. 4. Discussions and experimental results are given in Sec. 5 and 6, respectively. Finally, a conclusion is drawn in Sec. 7.
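The PQ encoding described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function name `pq_encode` and the toy codebooks are our own choices. It splits a point into $M$ subvectors and, in each subspace, records the index of the nearest sub codeword.

```python
import numpy as np

def pq_encode(x, sub_codebooks):
    """Encode x as M sub codeword indices, one per subspace.

    sub_codebooks: list of M arrays, each of shape (K, S); the full
    codebook is implicitly their Cartesian product (K^M clusters).
    """
    M = len(sub_codebooks)
    S = len(x) // M
    code = []
    for m, Dm in enumerate(sub_codebooks):
        sub = x[m * S:(m + 1) * S]                # m-th subvector of x
        dists = np.sum((Dm - sub) ** 2, axis=1)   # distances to the K sub codewords
        code.append(int(np.argmin(dists)))        # index of the nearest one
    return code

# Toy example: P = 4, M = 2, S = 2, K = 2 sub codewords per subspace.
books = [np.array([[0.0, 0.0], [1.0, 1.0]]),
         np.array([[0.0, 1.0], [1.0, 0.0]])]
code = pq_encode(np.array([0.9, 1.1, 0.1, 0.8]), books)  # -> [1, 0]
```

The concatenated indices form the compact code; with $K = 256$, each index occupies one byte.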
2 RELATED WORK

Hashing is an emerging technique to represent high-dimensional vectors as binary codes for ANN search, and has achieved a lot of success in multimedia applications, e.g., image search [6], [15], video retrieval [2], [31], event detection [26], and document retrieval [27]. According to the form of the hash function, we roughly categorize the binary encoding approaches into those based on Hamming embedding and those based on quantization. Roughly, the former adopts the Hamming distance as the dissimilarity between the codes, while the latter does not.

Table 1 lists part of the notations used in the paper. Generally, we use uppercase unbolded symbols for constants, lowercase unbolded symbols for indices, uppercase bolded symbols for matrices, and lowercase bolded symbols for vectors.

TABLE 1
Notations and descriptions.

Symbol   | Description
N        | number of training points
P        | dimension of training points
M        | number of subvectors
S        | number of dimensions of each subvector
K        | number of (sub) codewords
m        | index of the subvector
i        | index of the training point
R        | rotation matrix
D^m      | codebook on the m-th subvector
b_i^m    | 1-of-K encoding vector on the m-th subvector

2.1 Hamming embedding

Linear mapping is one of the typical hash functions. Each bit is calculated by

$h_i(x) = \mathrm{sign}(w_i^T x + u_i)$,   (2)

where $w_i$ is the projection vector, $u_i$ is the offset, and $\mathrm{sign}(z)$ is the sign function, which is 1 if $z \ge 0$ and 0 otherwise. Such approaches include [3], [5], [12]. The differences mainly reside in how the parameters of the hash function are obtained. For example, LSH [3] adopts random parameters, and similarity is preserved probabilistically. Iterative quantization hashing [5] constructs hash functions by rotating the axes so that the difference between the binary codes and the projected data is minimized. Another widely-used approach is the kernel-based hash function [7], [13], [14], [17], i.e.,
$h_i(x) = \mathrm{sign}\big(\textstyle\sum_j w_{ij}\, \kappa(x, z_j)\big)$,   (3)

where $z_j$ is a vector in the same space as $x$, and $\kappa(\cdot, \cdot)$ is the kernel function. The cosine function can also be used to generate the binary codes, such as in [40].

2.2 Quantization

In the quantization-based encoding methods, different constraints on the codewords lead to different approaches, i.e., K-means [18], [19], product quantization (PQ) [9], and Cartesian K-means (CKM) [24].

2.2.1 K-Means

Given $N$ $P$-dimensional points $X = \{x_1, \dots, x_N\} \subset \mathbb{R}^P$, the K-means algorithm partitions the database into $K$ clusters, each of which is associated with one codeword $d_i \in \mathbb{R}^P$. Let $D = [d_1, \dots, d_K] \in \mathbb{R}^{P \times K}$ be the corresponding codebook. The codebook is learned by minimizing the within-cluster distortion, i.e.,

$\min \sum_{i=1}^N \|x_i - D b_i\|_2^2$
$\mathrm{s.t.}\ b_i \in \{0, 1\}^K,\ \|b_i\|_1 = 1,\ i \in \{1, \dots, N\}$,

where $b_i$ is a 1-of-$K$ encoding vector ($K$ dimensions with one 1 and $K-1$ 0s) indicating which codeword is used to quantize $x_i$, and $\|\cdot\|_1$ is the $l_1$ norm. The problem can be solved by iteratively alternating optimization with respect to $D$ and $\{b_i\}_{i=1}^N$ [18].

2.2.2 Product Quantization

One issue of K-means is that the size of the codebook is quite limited due to the storage and computational cost. To address the problem, PQ [9] splits each $x_i$ into $M$ disjoint subvectors. Assume the $m$-th subvector contains $S_m$ dimensions; then $\sum_{m=1}^M S_m = P$. Without loss of generality, $S_m$ is set to $S \triangleq P/M$, and $P$ is assumed to be divisible by $M$. On the $m$-th subvector, K-means is performed to obtain $K$ sub codewords. By this method, PQ generates $K^M$ clusters with only $O(KP)$ storage, while K-means requires $O(K^M P)$ storage for the same number of clusters. Meanwhile, the computational complexity of encoding one data point is reduced from $O(K^M P)$ to $O(KP)$.

Let $D^m \in \mathbb{R}^{S \times K}$ be the matrix of the $m$-th sub codebook, each column of which is an $S$-dimensional sub codeword. PQ can be taken as optimizing the following problem with respect to $\{D^m\}_{m=1}^M$ and $\{b_i^m\}_{i=1, m=1}^{N, M}$:
$\min f_{pq,M,K} = \sum_{i=1}^N \left\| x_i - \begin{bmatrix} D^1 b_i^1 \\ \vdots \\ D^M b_i^M \end{bmatrix} \right\|_2^2$
$\mathrm{s.t.}\ b_i^m \in \{0, 1\}^K,\ \|b_i^m\|_1 = 1,$
$\quad i \in \{1, \dots, N\},\ m \in \{1, \dots, M\}$,   (4)

where $b_i^m$ is also the 1-of-$K$ encoding vector on the $m$-th subvector, and the index of the 1 indicates which sub codeword is used to encode $x_i$.

2.2.3 Cartesian K-Means

CKM [24] optimally rotates the original space and formulates the problem as

$\min f_{ck,M,K} = \sum_{i=1}^N \left\| x_i - R \begin{bmatrix} D^1 b_i^1 \\ \vdots \\ D^M b_i^M \end{bmatrix} \right\|_2^2$
$\mathrm{s.t.}\ R^T R = I,\ b_i^m \in \{0, 1\}^K,\ \|b_i^m\|_1 = 1,$
$\quad i \in \{1, \dots, N\},\ m \in \{1, \dots, M\}$.   (5)

The rotation matrix $R$ is optimally learned by minimizing the distortion. If $R$ is constrained to be the identity matrix $I$, Eqn. 5 reduces to Eqn. 4. Thus, we can assert that under the optimal solutions, $f^*_{ck,M,K} \le f^*_{pq,M,K}$, where the asterisk superscript indicates the objective function with the optimal parameters.

3 EXTENDED CARTESIAN K-MEANS

In both PQ and CKM, only one sub codeword is used to encode each subvector. To make the representation more flexible, we propose the extended Cartesian K-means (ECKM), where multiple sub codewords can be used in each subspace.

Mathematically, we allow the $l_1$ norm of $b_i^m$ to be a preset number $C$ ($C \ge 1$), instead of limiting it to be exactly 1. Meanwhile, each entry of $b_i^m$ is relaxed to a non-negative integer instead of a binary value. The formulation is

$\min f_{eck,M,K,C} = \sum_{i=1}^N \left\| x_i - R \begin{bmatrix} D^1 b_i^1 \\ \vdots \\ D^M b_i^M \end{bmatrix} \right\|_2^2$
$\mathrm{s.t.}\ R^T R = I,\ b_i^m \in \mathbb{Z}_+^K,\ \|b_i^m\|_1 = C$,   (6)

where $\mathbb{Z}_+$ denotes the set of non-negative integers. The constraint applies to all the points $i \in \{1, \dots, N\}$ and all the subspaces $m \in \{1, \dots, M\}$. In the following, we omit the range of $i, m$ without confusion.

For the $m$-th sub codebook $D^m \in \mathbb{R}^{S \times K}$, traditionally only one sub codeword can be selected, so there are only $K$ choices to encode the $m$-th subvector of $R^T x_i$. In the extended version, any feasible $b_i^m$ satisfying $b_i^m \in \mathbb{Z}_+^K$ and $\|b_i^m\|_1 = C$ constructs a quantizer, i.e., $D^m b_i^m$. Thus, the total number of choices is $\binom{K+C-1}{K-1} \ge K$. For example, with $K = 256$ and $C = 2$, we have $\binom{K+C-1}{K-1} = 32896 \gg K = 256$.
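The count of feasible encodings above is the number of multisets of size $C$ drawn from $K$ sub codewords, which is easy to check numerically. The following small script is our own illustration (the function name `num_choices` is not from the paper):

```python
from math import comb

def num_choices(K, C):
    """Number of feasible b with non-negative integer entries and
    l1 norm C: the multiset coefficient C(K + C - 1, K - 1)."""
    return comb(K + C - 1, K - 1)

print(num_choices(256, 1))  # 256 choices per subspace, as in PQ/CKM
print(num_choices(256, 2))  # 32896 choices per subspace in ECKM with C = 2
```

The gap between the two counts grows quickly with $C$, which is the source of ECKM's extra flexibility per subspace.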
With a more powerful representation, the distortion errors can potentially be reduced. In theory, $\log_2 \binom{K+C-1}{K-1}$ bits suffice to encode one $b_i^m$, and the code length is $M \log_2 \binom{K+C-1}{K-1}$. Practically, we use $\log_2(K)$ bits to encode the position of each 1. Since the $l_1$ norm of $b_i^m$ is $C$, which can be interpreted as $b_i^m$ containing $C$ 1s, $MC \log_2(K)$ bits are allocated to encode one data point.

3.1 Learning

Similar to [24], we present an iterative coordinate descent algorithm to solve the problem in Eqn. 6. There are three kinds of unknown variables: $R$, $D^m$, and $b_i^m$. In each iteration, two of them are fixed and the remaining one is optimized.

3.1.1 Solve R with b_i^m and D^m fixed

With

$X \triangleq [x_1, \dots, x_N]$, $D \triangleq \mathrm{diag}(D^1, \dots, D^M)$, $B \triangleq [b_1, \dots, b_N]$, $b_i \triangleq [b_i^{1T}, \dots, b_i^{MT}]^T$,

we rewrite the objective function of Eqn. 6 in matrix form as $\|X - RDB\|_F^2$, where $\|\cdot\|_F$ is the Frobenius norm. The problem of solving $R$ is the classic Orthogonal Procrustes problem [28], and the solution can be obtained as follows: if the SVD of $X(DB)^T$ is $X(DB)^T = U \Sigma V^T$, the optimal $R$ is $UV^T$.

Algorithm 1 Code Generation for ECKM
Input: $z_i^m$, $D^m \in \mathbb{R}^{S \times K}$, $C$
Output: $b_i^m$
1: $b_i^m = \mathrm{zeros}(K, 1)$
2: $r = z_i^m$
3: for $c = 1 : C$ do
4:   $k^* = \arg\min_k \|r - d_k^m\|_2^2$
5:   $r = r - d_{k^*}^m$
6:   $b_i^m(k^*) = b_i^m(k^*) + 1$
7: end for

3.1.2 Solve D^m with b_i^m and R fixed

Let $z_i \triangleq R^T x_i$ and the $m$-th subvector of $z_i$ be $z_i^m$. The objective function of Eqn. 6 can also be written as

$\sum_{i=1}^N \sum_{m=1}^M \|z_i^m - D^m b_i^m\|_2^2 = \sum_{m=1}^M \|Z^m - D^m B^m\|_F^2$,   (7)

where $Z^m \triangleq [z_1^m, \dots, z_N^m]$ and $B^m \triangleq [b_1^m, \dots, b_N^m]$. Each $D^m$ can be individually optimized as $(Z^m B^{mT})(B^m B^{mT})^+$, where $(\cdot)^+$ denotes the matrix (pseudo)inverse.

3.1.3 Solve b_i^m with D^m and R fixed

From Eqn. 6 and Eqn. 7, $b_i^m$ can be solved by optimizing

$\min g_{eck}(b_i^m) = \|z_i^m - D^m b_i^m\|_2^2$
$\mathrm{s.t.}\ b_i^m \in \mathbb{Z}_+^K,\ \|b_i^m\|_1 = C$.

This is an integer quadratic program and challenging to solve. Here, we present a simple but practically efficient algorithm, based on matching pursuit [20] and illustrated in Alg. 1.
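The greedy loop of Alg. 1 can be sketched directly in NumPy. This is our own transcription of the pseudocode, with an illustrative one-dimensional sub codebook; the function name `eckm_encode` is an assumption, not from the paper:

```python
import numpy as np

def eckm_encode(z, Dm, C):
    """Greedy code generation for ECKM (Alg. 1), a matching-pursuit
    style approximation to the integer quadratic program.

    z:  (S,) subvector z_i^m to encode.
    Dm: (S, K) sub codebook; column k is sub codeword d_k^m.
    C:  number of sub codewords to pick (repetition allowed).
    Returns b with non-negative integer entries and ||b||_1 = C.
    """
    K = Dm.shape[1]
    b = np.zeros(K, dtype=int)
    r = z.astype(float).copy()                   # residual, initialized to z
    for _ in range(C):
        dists = np.sum((Dm.T - r) ** 2, axis=1)  # residual vs. every sub codeword
        k = int(np.argmin(dists))                # closest sub codeword
        r -= Dm[:, k]                            # peel it off the residual
        b[k] += 1
    return b

# Toy example: S = 1, K = 2 sub codewords {1, 2}; z = 3 is greedily
# approximated as 2 + 1, so both sub codewords are selected once.
b = eckm_encode(np.array([3.0]), np.array([[1.0, 2.0]]), C=2)  # -> [1 1]
```

Each greedy step costs $O(SK)$, so encoding one subvector costs $O(CSK)$, in contrast to enumerating all $\binom{K+C-1}{K-1}$ feasible codes.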
In each iteration, we maintain a residual variable $r$, initialized to $z_i^m$ (Line 2 in Alg. 1). Let $d_k^m$ be the