Cartesian k-means

Mohammad Norouzi    David J. Fleet
Department of Computer Science, University of Toronto
{norouzi,fleet}@cs.toronto.edu

Abstract

A fundamental limitation of quantization techniques like the k-means clustering algorithm is the storage and run-time cost associated with the large numbers of clusters required to keep quantization errors small and model fidelity high. We develop new models with a compositional parameterization of cluster centers, so representational capacity increases super-linearly in the number of parameters. This allows one to effectively quantize data using billions or trillions of centers. We formulate two such models, Orthogonal k-means and Cartesian k-means. They are closely related to one another, to k-means, to methods for binary hash function optimization like ITQ [5], and to Product Quantization for vector quantization [7]. The models are tested on large-scale ANN retrieval tasks (1M GIST, 1B SIFT features), and on codebook learning for object recognition (CIFAR-10).

1. Introduction and background

Techniques for vector quantization, like the well-known k-means algorithm, are used widely in vision and learning. Common applications include codebook learning for object recognition [16], and approximate nearest neighbor (ANN) search for information retrieval [12, 14]. In general terms, such techniques involve partitioning an input vector space into multiple regions {S_i}_{i=1}^k, mapping points x in each region into region-specific representatives {c_i}_{i=1}^k, known as centers. As such, a quantizer, q(x), can be expressed as

    q(x) = \sum_{i=1}^{k} 1(x \in S_i) \, c_i ,                                  (1)

where 1(·) is the usual indicator function.

The quality of a quantizer is expressed in terms of expected distortion, a common measure of which is squared error ||x − q(x)||_2^2. In this case, given centers {c_i}, the region to which a point is assigned with minimal distortion is obtained by Euclidean nearest neighbor (NN) search.
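To make the notation concrete, here is a minimal NumPy sketch of Eq. (1): a quantizer that maps a point to its nearest center under Euclidean distance. The function name and toy centers are illustrative, not from the paper.

```python
import numpy as np

def quantize(x, C):
    """Map x to its nearest column of C, as in Eq. (1):
    region S_i is the set of points whose nearest center is c_i."""
    i = np.argmin(np.sum((C - x[:, None]) ** 2, axis=0))  # NN center index
    return C[:, i]

# Toy example with two 2D centers, (0,0) and (10,10), stored as columns.
C = np.array([[0.0, 10.0],
              [0.0, 10.0]])
x = np.array([1.0, 2.0])
print(quantize(x, C))  # nearest center is (0, 0)
```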
The k-means algorithm can be used to learn centers from data. To reduce expected distortion, crucial for many applications, one can shrink region volumes by increasing k, the number of regions. In practice, however, increasing k results in prohibitive storage and run-time costs. Even if one resorts to ANN search with approximate k-means [14] or hierarchical k-means [12], it is hard to scale to large k (e.g., k = 2^64), as storing the centers is untenable.

This paper concerns methods for scalable quantization with tractable storage and run-time costs. Inspired by Product Quantization (PQ), a state-of-the-art algorithm for ANN search with high-dimensional data (e.g., [7]), compositionality is one key. By expressing data in terms of recurring, reusable parts, the representational capacity of compositional models grows exponentially in the number of parameters. Compression techniques like JPEG accomplish this by encoding images as disjoint rectangular patches. PQ divides the feature space into disjoint subspaces that are quantized independently. Other examples include part-based recognition models (e.g., [18]), and tensor-based models for style-content separation (e.g., [17]). Here, with a compositional parameterization of region centers, we find a family of models that reduce the encoding cost of k centers down from k to between log_2 k and sqrt(k). A model parameter controls the trade-off between model fidelity and compactness.

We formulate two related algorithms, Orthogonal k-means (ok-means) and Cartesian k-means (ck-means). They are natural extensions of k-means, and are closely related to other hashing and quantization methods. The ok-means algorithm is a generalization of the Iterative Quantization (ITQ) algorithm for finding locality-sensitive binary codes [5]. The ck-means model is an extension of ok-means, and can be viewed as a generalization of PQ.^1 We then report empirical results on large-scale ANN search tasks, and on codebook learning for recognition.
For ANN search we use datasets of 1M GIST and 1B SIFT features, with both symmetric and asymmetric distance measures on the latent codes. We consider codebook learning for a generic form of recognition on CIFAR-10.

^1 A very similar generalization of PQ was developed concurrently by Ge, He, Ke and Sun, and also appears in CVPR 2013 [4].

2. k-means

Given a dataset of n p-dim points, D ≡ {x_j}_{j=1}^n, the k-means algorithm partitions the n points into k clusters, and represents each cluster by a center point. Let C ∈ R^{p×k} be a matrix whose columns comprise the k cluster centers, i.e., C = [c_1, c_2, ..., c_k]. The k-means objective is to minimize the within-cluster squared distances:

    \ell_{k\text{-means}}(C) = \sum_{x \in D} \min_i \|x - c_i\|_2^2
                             = \sum_{x \in D} \min_{b \in H_{1/k}} \|x - Cb\|_2^2        (2)

where H_{1/k} ≡ {b | b ∈ {0,1}^k and ||b|| = 1}, i.e., b is a binary vector comprising a 1-of-k encoding. Lloyd's k-means algorithm [11] finds a local minimum of (2) by iterative, alternating optimization with respect to C and the b's.

The k-means model is simple and intuitive, using NN search to assign points to centers. The assignment of points to centers can be represented with a log k-bit index per data point. The cost of storing the centers grows linearly with k.

3. Orthogonal k-means with 2^m centers

With a compositional model one can represent cluster centers more efficiently. One such approach is to reconstruct each input with an additive combination of the columns of C. To this end, instead of the 1-of-k encoding in (2), we let b be a general m-bit vector, b ∈ H_m ≡ {0,1}^m, and C ∈ R^{p×m}. As such, each cluster center is the sum of a subset of the columns of C. There are 2^m possible subsets, and therefore k = 2^m centers. While the number of parameters in the quantizer is linear in m, the number of centers increases exponentially.
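The claim that m columns yield 2^m centers can be checked directly by enumerating subset sums of the columns of C. A toy sketch; the matrix values are illustrative, not from the paper.

```python
import numpy as np
from itertools import product

# With m columns, additive codes b in {0,1}^m index up to k = 2^m centers Cb.
C = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 2.0]])                      # p = 2, m = 3
centers = np.array([C @ np.array(b) for b in product([0, 1], repeat=3)])
print(len(centers))                                  # 2^3 = 8 centers
print(len(np.unique(centers, axis=0)))               # all 8 are distinct here
```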
While efficient in representing cluster centers, the approach is problematic, because solving

    \hat{b} = \arg\min_{b \in H_m} \|x - Cb\|_2^2 ,                              (3)

is intractable; i.e., the discrete optimization is not submodular. Obviously, for small 2^m one could generate all possible centers and then perform NN search to find the optimal solution, but this would not scale well to large values of m.

One key observation is that if the columns of C are orthogonal, then optimization (3) becomes tractable. To explain this, without loss of generality, assume the bits belong to {−1, 1} instead of {0, 1}, i.e., b' ∈ H'_m ≡ {−1, 1}^m. Then,

    \|x - Cb'\|_2^2 = x^T x + b'^T C^T C \, b' - 2 x^T C b' .                    (4)

For diagonal C^T C, the middle quadratic term on the RHS becomes trace(C^T C), independent of b'. As a consequence, when C has orthogonal columns,

    \arg\min_{b' \in H'_m} \|x - Cb'\|_2^2 = \mathrm{sgn}(C^T x) ,               (5)

where sgn(·) is the element-wise sign function.

Figure 1. Two quantizations of 2D data (red ×'s) by ok-means. Cluster centers are depicted by dots, and cluster boundaries by dashed lines. (Left) Clusters formed by a 2-cube with no rotation, scaling or translation; centers = {b' | b' ∈ H'_2}. (Right) Rotation, scaling and translation are used to reduce distances between data points and cluster centers; centers = {μ + RDb' | b' ∈ H'_2}.

To reduce quantization error further we can also introduce an offset, denoted μ, to translate x. Taken together with (5), this leads to the loss function for the model we call orthogonal k-means (ok-means):^2

    \ell_{ok\text{-means}}(C, \mu) = \sum_{x \in D} \min_{b' \in H'_m} \|x - \mu - Cb'\|_2^2 .   (6)

^2 While closely related to ITQ, we use the term ok-means to emphasize the relationship to k-means.

Clearly, with a change of variables, b' = 2b − 1, we can define new versions of μ and C, with identical loss, for which the unknowns are binary, but in {0,1}^m.

The ok-means quantizer encodes each data point as a vertex of a transformed m-dimensional unit hypercube. The transform, via C and μ, maps the hypercube vertices onto the input feature space, ideally as close as possible to the data points. The matrix C has orthogonal columns and can therefore be expressed in terms of rotation and scaling; i.e., C ≡ RD, where R ∈ R^{p×m} has orthonormal columns (R^T R = I_m), and D is diagonal and positive definite. The goal of learning is to find the parameters, R, D, and μ, which minimize quantization error.

Fig. 1 depicts a small set of 2D data points (red ×'s) and two possible quantizations. Fig. 1 (left) depicts the vertices of a 2-cube with C = I_2 and zero translation. The cluster regions are simply the four quadrants of the 2D space. The distances between data points and cluster centers, i.e., the quantization errors, are relatively large. By comparison, Fig. 1 (right) shows how a transformed 2-cube, the full model, can greatly reduce quantization errors.

3.1. Learning ok-means

To derive the learning algorithm for ok-means we first rewrite the objective in matrix terms. Given n data points, let X = [x_1, x_2, ..., x_n] ∈ R^{p×n}. Let B' ∈ {−1,1}^{m×n} denote the corresponding cluster assignment coefficients. Our goal is to find the assignment coefficients B' and the transformation parameters, namely, the rotation R, scaling D, and translation μ, that minimize

    \ell_{ok\text{-means}}(B', R, D, \mu) = \|X - \mu 1^T - RDB'\|_F^2           (7)
                                          = \|X' - RDB'\|_F^2                    (8)

where ||·||_F denotes the Frobenius norm, X' ≡ X − μ1^T, R is constrained to have orthonormal columns (R^T R = I_m), and D is a positive diagonal matrix.

Like k-means, coordinate descent is effective for optimizing (8).
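Eq. (5) can be verified numerically: for C with orthogonal columns, the closed-form code sgn(C^T x) matches exhaustive search over all 2^m sign vectors. A sketch with randomly generated data; all names are illustrative.

```python
import numpy as np
from itertools import product

def encode_ok(x, C):
    """Optimal b' in {-1,+1}^m for C with orthogonal columns, per Eq. (5)."""
    b = np.sign(C.T @ x)
    b[b == 0] = 1  # arbitrary tie-break for zero projections
    return b

rng = np.random.default_rng(0)
p, m = 5, 3
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
C = Q[:, :m] * np.array([1.0, 2.0, 0.5])  # orthogonal columns, unequal norms
x = rng.standard_normal(p)

# Brute force over all 2^m hypercube vertices agrees with the sign rule.
best = min((np.array(b) for b in product([-1, 1], repeat=m)),
           key=lambda b: np.sum((x - C @ b) ** 2))
assert np.array_equal(encode_ok(x, C), best)
```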
We first initialize μ and R, and then iteratively optimize ℓ_ok-means with respect to B', D, R, and μ:

• Optimize B' and D: With straightforward algebraic manipulation, one can show that

    \ell_{ok\text{-means}} = \|R^T X' - DB'\|_F^2 + \|R_\perp^T X'\|_F^2 ,       (9)

where columns of R_⊥ span the orthogonal complement of the column-space of R (i.e., the block matrix [R R_⊥] is orthogonal). It follows that, given X' and R, we can optimize the first term in (9) to solve for B' and D. Here, DB' is the least-squares approximation to R^T X', where R^T X' and DB' are m×n. Further, the ith row of DB' can only contain elements from {−d_i, +d_i} where d_i = D_ii. Wherever the corresponding element of R^T X' is positive (negative) we should put a positive (negative) value in DB'. The optimal d_i is determined by the mean absolute value of the elements in the ith row of R^T X':

    B' ← sgn(R^T X')                                                             (10)
    D ← Diag(mean_row(abs(R^T X')))                                              (11)

• Optimize R: Using the original objective (8), find R that minimizes ||X' − RA||_F^2, subject to R^T R = I_m, and A ≡ DB'. This is equivalent to an Orthogonal Procrustes problem [15], and can be solved exactly using SVD. In particular, by adding p−m rows of zeros to the bottom of D, DB' becomes p×n. Then R is square and orthogonal and can be estimated with SVD. But since DB' is degenerate we are only interested in the first m columns of R; the remaining columns are unique only up to rotation of the null-space.

• Optimize μ: Given R, B' and D, the optimal μ is given by the column average of X − RDB':

    μ ← mean_column(X − RDB')                                                    (12)

3.2. Approximate nearest neighbor search

One application of scalable quantization is ANN search. Before introducing more advanced quantization techniques, we describe some experimental results with ok-means on Euclidean ANN search. The benefits of ok-means for ANN search are two-fold. First, storage costs for the centers are reduced to O(log k), from O(k) with k-means.
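The three update steps above can be sketched as a single coordinate-descent pass. This is an illustrative NumPy sketch, not the authors' code; it uses a thin-SVD Procrustes update in place of the zero-padded square formulation described in the text.

```python
import numpy as np

def okmeans_iteration(X, R, D, mu):
    """One ok-means coordinate-descent pass over Eqs. (10)-(12).
    X: p x n data; R: p x m, orthonormal columns; D: length-m scales; mu: offset."""
    Xc = X - mu[:, None]                # X' = X - mu 1^T
    P = R.T @ Xc                        # m x n projections
    B = np.sign(P)
    B[B == 0] = 1                       # Eq. (10): B' = sgn(R^T X')
    D = np.abs(P).mean(axis=1)          # Eq. (11): row-wise mean of |R^T X'|
    A = D[:, None] * B                  # A = D B'
    U, _, Vt = np.linalg.svd(Xc @ A.T, full_matrices=False)
    R = U @ Vt                          # Procrustes: argmin ||X' - RA||_F, R^T R = I
    mu = (X - R @ A).mean(axis=1)       # Eq. (12): column average of X - RDB'
    return R, D, mu
```

Each step is optimal given the others, so the objective (8) is non-increasing across passes.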
Second, substantial speedups are possible by exploiting fast methods for NN search on binary codes in Hamming space (e.g., [13]).

Generally, in terms of a generic quantizer q(·), there are two natural ways to estimate the distance between two vectors, v and u [7]. Using the Symmetric quantizer distance (SQD), ||v − u|| is approximated by ||q(v) − q(u)||. Using the Asymmetric quantizer distance (AQD), only one of the two vectors is quantized, and ||v − u|| is estimated as ||v − q(u)||. While SQD might be slightly faster to compute, AQD incurs lower quantization errors, and hence is more accurate.

For ANN search, in a pre-processing stage, each database entry, u, is encoded into a binary vector corresponding to the cluster center index to which u is assigned. At test time, the queries may or may not be encoded into indices, depending on whether one uses SQD or AQD. In the ok-means model, the quantization of an input x is straightforwardly shown to be

    q_{ok}(x) = \mu + RD \, \mathrm{sgn}(R^T (x - \mu)) .                        (13)

The corresponding m-bit cluster index is sgn(R^T(x − μ)). Given two indices, namely b'_1, b'_2 ∈ {−1,+1}^m, the symmetric ok-means quantizer distance is

    SQD_{ok}(b'_1, b'_2) = \|\mu + RDb'_1 - \mu - RDb'_2\|_2^2
                         = \|D(b'_1 - b'_2)\|_2^2 .                              (14)

In effect, SQD_ok is a weighted Hamming distance. It is the sum of the squared diagonal entries of D corresponding to bits where b'_1 and b'_2 differ. Interestingly, in our experiments with ok-means, Hamming and weighted Hamming distances yield similar results. Thus, in ok-means experiments we simply report results for Hamming distance, to facilitate comparisons with other techniques. When the scaling in ok-means is constrained to be isotropic (i.e., D = αI_m for α ∈ R_+), then SQD_ok becomes a constant multiple of the usual Hamming distance. As discussed in Sec. 5, this isotropic ok-means is closely related to ITQ [5].
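The weighted-Hamming identity in Eq. (14) is easy to check numerically: ||D(b'_1 − b'_2)||_2^2 accumulates 4·d_i^2 for each bit where the codes differ. The toy values below are illustrative.

```python
import numpy as np

def sqd_ok(b1, b2, D):
    """Symmetric ok-means distance, Eq. (14): ||D (b1 - b2)||_2^2."""
    return np.sum((D * (b1 - b2)) ** 2)

D = np.array([1.0, 2.0, 0.5])
b1 = np.array([+1, -1, +1])
b2 = np.array([+1, +1, -1])
# Codes differ in bits 1 and 2, so the distance is 4 * (2.0**2 + 0.5**2) = 17.
print(sqd_ok(b1, b2, D))  # 17.0
```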
The ok-means AQD between a feature vector x and a cluster index b' is defined as

    AQD_{ok}(x, b') = \|x - \mu - RDb'\|_2^2
                    = \|R^T x' - Db'\|_2^2 + \|R_\perp^T x'\|_2^2 ,              (15)

where x' = x − μ. For ANN search, in comparing distances from query x to a dataset of binary indices, the second term on the RHS of (15) is irrelevant, since it does not depend on b'. Without this term, AQD_ok becomes a form of asymmetric Hamming (AH) distance between R^T x' and b'. While previous work [6] discussed various ad hoc AH distance measures for binary hashing, interestingly, the optimal AH distance for ok-means is derived directly in our model.
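The decomposition in Eq. (15) implies that dropping the b'-independent term shifts every distance by the same constant, so rankings of database codes are preserved. A quick numerical check on random data; all names are illustrative.

```python
import numpy as np

def aqd_ok(x, b, R, D, mu):
    """Asymmetric ok-means distance, Eq. (15), without the constant
    ||R_perp^T x'||^2 term, which is irrelevant when ranking codes."""
    return np.sum((R.T @ (x - mu) - D * b) ** 2)

rng = np.random.default_rng(1)
p, m = 4, 2
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
R, D, mu = Q[:, :m], np.array([1.5, 0.7]), rng.standard_normal(p)
x = rng.standard_normal(p)

codes = [np.array(b) for b in [(-1, -1), (-1, 1), (1, -1), (1, 1)]]
full = np.array([np.sum((x - mu - R @ (D * b)) ** 2) for b in codes])
ah = np.array([aqd_ok(x, b, R, D, mu) for b in codes])
diff = full - ah
assert np.allclose(diff, diff[0])          # same constant offset for every code
assert full.argmin() == ah.argmin()        # identical ranking
```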

[Figure 2 plot: Recall@R (y-axis, 0 to 1) vs. R (x-axis, 1 to 10K); 1M SIFT, 64-bit encoding (k = 2^64); curves for PQ (AD), ok-means (AH), ITQ (AH), ok-means (H), ITQ (H).]
Figure 2. Euclidean ANN retrieval results for different methods and distance functions on the 1M SIFT dataset.

3.3. Experiments with ok-means

Following [7], we report ANN search results on 1M SIFT, a corpus of 128D SIFT descriptors with disjoint sets of 100K training, 1M base, and 10K test vectors. The training set is used to train models. The base set is the database, and the test points are queries. The number of bits m is typically less than p, but no pre-processing is needed for dimensionality reduction. Rather, to initialize learning, R is a random rotation of the first m principal directions of the training data, and μ is the mean of the data. For each query we find R nearest neighbors, and compute

Recall@R, the fraction of queries for which the ground-truth Euclidean NN is found in th