Spectral Hashing

Yair Weiss
School of Computer Science, Hebrew University, 91904, Jerusalem, Israel
[email protected]

Antonio Torralba
CSAIL, MIT, 32 Vassar St., Cambridge, MA 02139
[email protected]

Rob Fergus
Courant Institute, NYU, 715 Broadway, New York, NY 10003
[email protected]

Abstract

Semantic hashing [1] seeks compact binary codes of data-points so that the Hamming distance between codewords correlates with semantic similarity. In this paper, we show that the problem of finding the best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes outperform the state of the art.

1 Introduction

With the advent of the Internet, it is now possible to use huge training sets to address challenging tasks in machine learning. As a motivating example, consider the recent work of Torralba et al., who collected a dataset of 80 million images from the Internet [2, 3]. They then used this weakly labeled dataset to perform scene categorization. To categorize a novel image, they simply searched for similar images in the dataset and used the labels of these retrieved images to predict the label of the novel image. A similar approach was used in [4] for scene completion. Although conceptually simple, actually carrying out such methods requires highly efficient ways of (1) storing millions of images in memory and (2) quickly finding similar images to a target image. Semantic hashing, introduced by Salakhutdinov and Hinton [5], is a clever way of addressing both of these challenges. In semantic hashing, each item in the database is represented by a compact binary code.
The code is constructed so that similar items have similar binary codewords, and there is a simple feedforward network that can calculate the binary code for a novel input. Retrieving similar neighbors is then done simply by retrieving all items with codes within a small Hamming distance of the code for the query. This kind of retrieval can be amazingly fast: millions of queries per second on standard computers. The key for this method to work is to learn a good code for the dataset. We need a code that (1) is easily computed for a novel input, (2) requires a small number of bits to code the full dataset, and (3) maps similar items to similar binary codewords. To simplify the problem, we will assume that the items have already been embedded in a Euclidean space, say $R^d$, in which Euclidean distance correlates with the desired similarity. The problem of finding such a Euclidean embedding has been addressed in a large number of machine learning algorithms (e.g. [6, 7]). In some cases, domain knowledge can be used to define a good embedding. For example, Torralba et al. [3] found that a 512-dimensional descriptor, known as the GIST descriptor, gives an embedding where Euclidean distance induces a reasonable similarity function on the items. But simply having a Euclidean embedding does not give us a fast retrieval mechanism. If we forget about the requirement of having a small number of bits in the codewords, then it is easy to design a binary code so that items that are close in Euclidean space will map to similar binary codewords. This is the basis of the popular locality sensitive hashing method E2LSH [8]. As shown in [8], if every bit in the code is calculated by a random linear projection followed by a random threshold, then the Hamming distance between codewords will asymptotically approach the Euclidean distance between the items. But in practice this method can lead to very inefficient codes.
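For concreteness, this random-projection scheme can be sketched as follows (a minimal illustration; the function and variable names are ours, not from [8]):

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_encode(X, n_bits, rng):
    """Encode rows of X with random linear projections followed by
    random thresholds, in the spirit of E2LSH (illustrative sketch)."""
    d = X.shape[1]
    directions = rng.normal(size=(d, n_bits))   # random projection directions
    projections = X @ directions                # shape (n, n_bits)
    # random threshold per bit, drawn from the range of that projection
    thresholds = rng.uniform(projections.min(axis=0),
                             projections.max(axis=0))
    return (projections > thresholds).astype(np.uint8)

# toy data: points sampled uniformly in a 2D rectangle, as in Figure 1
X = rng.uniform([0, 0], [2, 1], size=(1000, 2))
codes = lsh_encode(X, n_bits=8, rng=rng)
```

Each bit partitions the space by a random hyperplane, which is exactly why many bits are needed before Hamming distance tracks Euclidean distance well.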
Figure 1 illustrates the problem on a toy dataset of points uniformly sampled in a two-dimensional rectangle. The figure plots the average precision at Hamming distance 1 using an E2LSH encoding. As the number of bits increases the precision improves (and approaches one with many bits), but the rate of convergence can be very slow. Rather than using random projections to define the bits in a code, several authors have pursued machine learning approaches. In [5] the authors used an autoencoder with several hidden layers. The architecture can be thought of as a restricted Boltzmann machine (RBM) in which there are only connections between layers and not within layers. In order to learn 32 bits, the middle layer of the autoencoder has 32 hidden units, and noise was injected during training to encourage these bits to be as binary as possible. This method indeed gives codes that are much more compact than the E2LSH codes. In [9], multiple stacked RBMs were used to learn a non-linear mapping between input vector and code bits. Backpropagation using a Neighborhood Components Analysis (NCA) objective function was used to refine the weights in the network to preserve the neighborhood structure of the input space. Figure 1 shows that the RBM gives much better performance compared to random bits. A simpler machine learning algorithm, Boosting SSC, was pursued in [10], which used AdaBoost to classify pairs of input items as similar or nonsimilar. Each weak learner was a decision stump, and the outputs of all the weak learners on a given input form a binary code. Figure 1 shows that this boosting procedure also works much better than E2LSH codes, although slightly worse than the RBMs.¹ The success of machine learning approaches over LSH is not limited to synthetic data. In [5], RBMs gave several orders of magnitude improvement over LSH in document retrieval tasks.
In [3] both RBMs and Boosting were used to learn binary codes for a database of millions of images and were found to outperform LSH. Also, the retrieval speed using these short binary codes was found to be significantly faster than LSH (which was faster than other methods such as KD trees). The success of machine learning methods leads us to ask: what is the best code for performing semantic hashing for a given dataset? We formalize the requirements for a good code and show that these are equivalent to a particular form of graph partitioning. This shows that even for a single bit, the problem of finding optimal codes is NP hard. On the other hand, the analogy to graph partitioning suggests a relaxed version of the problem that leads to very efficient eigenvector solutions. These eigenvectors are exactly the eigenvectors used in many spectral algorithms including spectral clustering and Laplacian eigenmaps [6, 11]. This leads to a new algorithm, which we call "spectral hashing", where the bits are calculated by thresholding a subset of eigenvectors of the Laplacian of the similarity graph. By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes outperform the state of the art.

¹All methods here use the same retrieval algorithm, i.e. semantic hashing. In many applications of LSH and Boosting SSC, a different retrieval algorithm is used whereby the binary code only creates a shortlist and exhaustive search is performed on the shortlist. Such an algorithm is impractical for the scale of data we are considering.
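The semantic-hashing retrieval step itself, i.e. returning all items within a small Hamming ball of the query code, can be sketched as follows (a brute-force illustration with names of our choosing; a production system would index the codes rather than scan them):

```python
import numpy as np

def hamming_ball_retrieve(query_code, db_codes, radius=1):
    """Return indices of database items whose binary code lies within
    the given Hamming distance of the query code (brute-force sketch)."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.flatnonzero(dists <= radius)

db = np.array([[0, 0, 1, 1],
               [0, 0, 1, 0],
               [1, 1, 0, 0]], dtype=np.uint8)
query = np.array([0, 0, 1, 1], dtype=np.uint8)
print(hamming_ball_retrieve(query, db, radius=1))   # → [0 1]
```

Because each comparison is a bitwise XOR plus a popcount on machine words in practice, this is the operation that makes millions of queries per second feasible.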
Figure 1: Building hash codes to find neighbors. Neighbors are defined as pairs of points in 2D whose Euclidean distance is less than $\epsilon$. The toy dataset is formed by uniformly sampling points in a two-dimensional rectangle. The figure plots the average precision (number of neighbors in the original space divided by number of neighbors in a Hamming ball using the hash codes) at Hamming distance ≤ 1, as a function of the number of bits, for three methods: LSH, stumps boosting SSC, and an RBM with two hidden layers. The plots on the left show how each method partitions the space to compute the bits to represent each sample. Despite the simplicity of this toy data, the methods still require many bits in order to get good performance.

2 Analysis: what makes a good code

As mentioned earlier, we seek a code that (1) is easily computed for a novel input, (2) requires a small number of bits to code the full dataset, and (3) maps similar items to similar binary codewords. Let us first ignore the first requirement, that codewords be easily computed for a novel input, and search only for a code that is efficient (i.e. requires a small number of bits) and similarity preserving (i.e. maps similar items to similar codewords). For a code to be efficient, we require that each bit has a 50% chance of being one or zero, and that different bits are independent of each other. Among all codes that have this property, we will seek the ones where the average Hamming distance between similar points is minimal. Let $\{y_i\}_{i=1}^n$ be the list of codewords (binary vectors of length $k$) for $n$ datapoints and $W_{n \times n}$ be the affinity matrix. Since we are assuming the inputs are embedded in $R^d$ so that Euclidean distance correlates with similarity, we will use $W(i,j) = \exp(-\|x_i - x_j\|^2 / \epsilon^2)$. Thus the parameter $\epsilon$ defines the distance in $R^d$ which corresponds to similar items.
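The affinity matrix and the resulting similarity-weighted Hamming cost might look like the following sketch (function names are ours; the dense $O(n^2)$ pairwise computation is only for illustration):

```python
import numpy as np

def affinity(X, eps):
    """Gaussian affinity W(i,j) = exp(-||x_i - x_j||^2 / eps^2)
    between the rows of X."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / eps ** 2)

def weighted_hamming_cost(W, Y):
    """sum_ij W_ij ||y_i - y_j||^2 for codeword rows Y in {-1, 1}^k:
    the quantity the code should make small for similar pairs."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return float(np.sum(W * sq))
```

A code that assigns identical codewords to all points drives this cost to zero, which is exactly why the balance and uncorrelatedness constraints below are needed to rule out such degenerate solutions.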
Using this notation, the average Hamming distance between similar neighbors can be written as $\sum_{ij} W_{ij} \|y_i - y_j\|^2$. If we relax the independence assumption and require only that the bits be uncorrelated, we obtain the following problem:

minimize: $\sum_{ij} W_{ij} \|y_i - y_j\|^2$   (1)
subject to: $y_i \in \{-1, 1\}^k$
$\sum_i y_i = 0$
$\frac{1}{n} \sum_i y_i y_i^T = I$

where the constraint $\sum_i y_i = 0$ requires each bit to fire 50% of the time, and the constraint $\frac{1}{n} \sum_i y_i y_i^T = I$ requires the bits to be uncorrelated.

Observation: For a single bit, solving problem 1 is equivalent to balanced graph partitioning and is NP hard.

Proof: Consider an undirected graph whose vertices are the datapoints and where the weight between items $i$ and $j$ is given by $W(i,j)$. Consider a code with a single bit. The bit partitions the graph into two equal parts $(A, B)$: vertices where the bit is on and vertices where the bit is off. For a single bit, $\sum_{ij} W_{ij} \|y_i - y_j\|^2$ is simply proportional to the weight of the edges cut by the partition: $\mathrm{cut}(A,B) = \sum_{i \in A, j \in B} W(i,j)$. Thus problem 1 is equivalent to minimizing $\mathrm{cut}(A,B)$ with the requirement that $|A| = |B|$, which is known to be NP hard [12]. For $k$ bits the problem can be thought of as trying to find $k$ independent balanced partitions, each of which should have as low a cut as possible.

2.1 Spectral Relaxation

By introducing an $n \times k$ matrix $Y$ whose $j$th row is $y_j^T$ and a diagonal $n \times n$ matrix $D(i,i) = \sum_j W(i,j)$, we can rewrite the problem as:

minimize: $\mathrm{trace}(Y^T (D - W) Y)$   (2)
subject to: $Y(i,j) \in \{-1, 1\}$
$Y^T 1 = 0$
$Y^T Y = I$

This is of course still a hard problem, but by removing the constraint that $Y(i,j) \in \{-1, 1\}$ we obtain an easy problem whose solutions are simply the $k$ eigenvectors of $D - W$ with minimal eigenvalue (after excluding the trivial eigenvector $1$, which has eigenvalue 0).

2.2 Out of Sample Extension

The fact that the solutions to the relaxed problem are the $k$ eigenvectors of $D -$
$W$ with minimal eigenvalue would suggest simply thresholding these eigenvectors to obtain a binary code. But this would only tell us how to compute the code representation of items in the training set. This is the problem of out-of-sample extension of spectral methods, which is often solved using the Nystrom method [13, 14]. But note that the cost of calculating the Nystrom extension of a new datapoint is linear in the size of the dataset. In our setting, where there can be millions of items in the dataset, this is impractical. In fact, calculating the Nystrom extension is as expensive as doing exhaustive nearest neighbor search.

In order to enable efficient out-of-sample extension, we assume the datapoints $x_i \in R^d$ are samples from a probability distribution $p(x)$. The equations in problem 1 are now seen to be sample averages, which we replace with their expectations:

minimize: $\int \|y(x_1) - y(x_2)\|^2 W(x_1, x_2) p(x_1) p(x_2) \, dx_1 \, dx_2$   (3)
subject to: $y(x) \in \{-1, 1\}^k$
$\int y(x) p(x) \, dx = 0$
$\int y(x) y(x)^T p(x) \, dx = I$

with $W(x_1, x_2) = e^{-\|x_1 - x_2\|^2 / \epsilon^2}$. Relaxing the constraint that $y(x) \in \{-1, 1\}^k$ now gives a spectral problem whose solutions are eigenfunctions of the weighted Laplace-Beltrami operators defined on manifolds [15, 16, 13, 17]. More explicitly, define the weighted Laplacian $L_p$ as an operator that maps a function $f$ to $g = L_p f$ by $g(x) p(x) = D(x) f(x) p(x) - \int_s W(s,x) f(s) p(s) \, ds$ with $D(x) = \int_s W(x,s) \, ds$. The solutions to the relaxation of problem 3 are functions that satisfy $L_p f = \lambda f$ with minimal eigenvalue (ignoring the trivial solution $f(x) = 1$, which has eigenvalue 0). As discussed in [16, 15, 13], with proper normalization, the eigenvectors of the discrete Laplacian defined by $n$ points sampled from $p(x)$ converge to the eigenfunctions of $L_p$ as $n \to \infty$.

What do the eigenfunctions of $L_p$ look like? One important special case is when $p(x)$ is a separable distribution.
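Before turning to the analytical eigenfunctions, the discrete relaxation of problem 2, thresholding the low eigenvectors of $D - W$, can be sketched for a training set as follows (an illustrative implementation with names of our choosing; it codes only training items, which is exactly the limitation that motivates this section):

```python
import numpy as np

def spectral_bits(W, k):
    """Threshold the k eigenvectors of the graph Laplacian D - W with
    smallest nonzero eigenvalue (sketch of the relaxed problem (2))."""
    D = np.diag(W.sum(axis=1))
    vals, vecs = np.linalg.eigh(D - W)   # eigenvalues in ascending order
    Y = vecs[:, 1:k + 1]                 # drop the trivial constant eigenvector
    return (Y > 0).astype(np.uint8)

# two well-separated clusters on a line; one bit should split them
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
W = np.exp(-(x[:, None] - x[None, :]) ** 2 / 1.0)   # eps = 1
bits = spectral_bits(W, k=1)
```

On this toy input the single bit reproduces the balanced two-cluster cut, which is the graph-partitioning view of problem 1.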
A simple case of a separable distribution is a multidimensional uniform distribution $\Pr(x) = \prod_i u_i(x_i)$, where $u_i$ is a uniform distribution in the range $[a_i, b_i]$. Another example is a multidimensional Gaussian, which is separable once the space has been rotated so that the Gaussian is axis aligned.

Observation [17]: If $p(x)$ is separable, and similarity between datapoints is defined as $e^{-\|x_i - x_j\|^2 / \epsilon^2}$, then the eigenfunctions of the continuous weighted Laplacian $L_p$ have an outer product form. That is, if $\Phi_i(x)$ is an eigenfunction of the weighted Laplacian defined on $R^1$ with eigenvalue $\lambda_i$, then $\Phi_i(x_1) \Phi_j(x_2) \cdots \Phi_d(x_d)$ is an eigenfunction of the $d$-dimensional problem with eigenvalue $\lambda_i \lambda_j \cdots \lambda_d$.

Specifically, for the case of a uniform distribution on $[a, b]$, the eigenfunctions of the one-dimensional Laplacian $L_p$ are extremely well studied objects in mathematics. They correspond to the fundamental modes of vibration of a metallic plate. The eigenfunctions $\Phi_k(x)$ and eigenvalues $\lambda_k$ are:

$\Phi_k(x) = \sin\left(\frac{\pi}{2} + \frac{k\pi}{b-a} x\right)$   (4)

$\lambda_k = 1 - e^{-\frac{\epsilon^2}{2} \left|\frac{k\pi}{b-a}\right|^2}$   (5)

A similar equation is also available for the one-dimensional Gaussian. In this case the eigenfunctions of the one-dimensional Laplacian $L_p$ are (in the limit of small $\epsilon$) solutions to the Schrodinger equations and a
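Equations 4 and 5 are closed-form and can be evaluated at any point without touching the training set, which is what makes the out-of-sample extension cheap. A minimal sketch (function names are ours):

```python
import numpy as np

def eigfun(x, k, a, b):
    """Phi_k(x) = sin(pi/2 + k*pi/(b-a) * x): analytical eigenfunction of
    the 1D weighted Laplacian for a uniform distribution on [a, b] (Eq. 4)."""
    return np.sin(np.pi / 2 + k * np.pi / (b - a) * x)

def eigval(k, a, b, eps):
    """lambda_k = 1 - exp(-(eps^2/2) * |k*pi/(b-a)|^2)  (Eq. 5)."""
    return 1 - np.exp(-(eps ** 2 / 2) * (k * np.pi / (b - a)) ** 2)
```

In the spirit of the text, a novel point would be coded by evaluating, at that point, the eigenfunctions whose eigenvalues are among the $k$ smallest nontrivial ones across dimensions, and thresholding the results at zero.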