Semantic Hashing

Ruslan Salakhutdinov
Department of Computer Science, University of Toronto
Toronto, Ontario M5S
[email protected]

Geoffrey Hinton
Department of Computer Science, University of Toronto
Toronto, Ontario M5S
[email protected]

ABSTRACT

We show how to learn a deep graphical model of the word-count vectors obtained from a large set of documents. The values of the latent variables in the deepest layer are easy to infer and give a much better representation of each document than Latent Semantic Analysis. When the deepest layer is forced to use a small number of binary variables (e.g. 32), the graphical model performs "semantic hashing": documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document. This way of extending the efficiency of hash-coding to approximate matching is much faster than locality sensitive hashing, which is the fastest current method. By using semantic hashing to filter the documents given to TF-IDF, we achieve higher accuracy than applying TF-IDF to the entire document set.

1. INTRODUCTION

One of the most popular and widely used algorithms for retrieving documents that are similar to a query document is TF-IDF [15, 14], which measures the similarity between documents by comparing their word-count vectors. The similarity metric weights each word by both its frequency in the query document (Term Frequency) and the logarithm of the reciprocal of its frequency in the whole set of documents (Inverse Document Frequency). TF-IDF has several major drawbacks:

- It computes document similarity directly in the word-count space, which can be slow for large vocabularies.
- It assumes that the counts of different words provide independent evidence of similarity.
- It makes no use of semantic similarities between words, so it cannot see the similarity between "Wolfowitz resigns" and "Gonzales quits".
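The TF-IDF weighting just described can be sketched in a few lines. The tokenization, the exact IDF formula, and all names below are illustrative choices of ours, not the paper's:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, doc_freq, n_docs):
    """Weight each term by its count times log(1 / document frequency)."""
    counts = Counter(doc_tokens)
    return {t: c * math.log(n_docs / doc_freq[t])
            for t, c in counts.items() if t in doc_freq}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["wolfowitz", "resigns"],
        ["gonzales", "quits"],
        ["wolfowitz", "resigns", "today"]]
# Document frequency: number of documents containing each term.
df = Counter(t for d in docs for t in set(d))
vecs = [tfidf_vector(d, df, len(docs)) for d in docs]

sim = cosine(vecs[0], vecs[1])  # 0.0 -- no shared terms, so TF-IDF sees no similarity
```

The last line illustrates the third drawback: "Wolfowitz resigns" and "Gonzales quits" share no terms, so their TF-IDF similarity is exactly zero despite their semantic closeness.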
Copyright 2007 ACM SIGIR.

[Figure 1: A schematic representation of semantic hashing. A semantic hashing function maps semantically similar documents to nearby locations in the document address space.]

To remedy these drawbacks, numerous models for capturing low-dimensional, latent representations have been proposed and successfully applied in the domain of information retrieval. A simple and widely used method is Latent Semantic Analysis (LSA) [5], which extracts low-dimensional semantic structure by using an SVD decomposition to obtain a low-rank approximation of the word-document co-occurrence matrix. This allows document retrieval to be based on "semantic" content rather than just on individually weighted words. LSA, however, is very restricted in the types of semantic content it can capture because it is a linear method, so it can only capture pairwise correlations between words. A probabilistic version of LSA (pLSA) was introduced by [11], using the assumption that each word is modeled as a sample from a document-specific multinomial mixture of word distributions. A proper generative model at the level of documents, Latent Dirichlet Allocation, was introduced by [2], improving upon [11]. These probabilistic models can be viewed as graphical models in which hidden topic variables have directed connections to variables that represent word counts. Their major drawback is that exact inference is intractable due to explaining away, so they have to resort to slow or inaccurate approximations to compute the posterior distribution over topics.
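As a toy illustration of the LSA step described above, one can take a truncated SVD of a small word-document count matrix. The matrix and the rank below are made up for the example; they are not from the paper:

```python
import numpy as np

# Toy word-document count matrix: rows = words, columns = documents.
X = np.array([[2, 0, 1, 0],
              [1, 0, 2, 0],
              [0, 3, 0, 1],
              [0, 1, 0, 2]], dtype=float)

# Low-rank approximation via truncated SVD: X ~ U_k S_k V_k^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_codes = (np.diag(s[:k]) @ Vt[:k]).T  # each row: one document in k-dim latent space

def latent_cosine(a, b):
    """Compare documents in the latent space instead of the word space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Documents 0 and 2 share words and end up with nearly identical latent codes, while documents 0 and 1 share none and are nearly orthogonal, so retrieval in the latent space reflects "semantic" overlap rather than raw term matches.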
This makes it difficult to fit the models to data. Also, as Welling et al. [16] point out, fast inference is important for information retrieval. To achieve this, [16] introduce a class of two-layer undirected graphical models that generalize Restricted Boltzmann Machines (RBM's) [7] to exponential family distributions. This allows them to model non-binary data and to use non-binary hidden (i.e. latent) variables. Maximum likelihood learning is intractable in these models, but learning can be performed efficiently by following an approximation to the gradient of a different objective function called "contrastive divergence" [7]. Several further developments of these undirected models [6, 17] show that they are competitive in terms of retrieval accuracy with their directed counterparts.

[Figure 2: Left panel: The deep generative model. Middle panel: Pretraining consists of learning a stack of RBM's in which the feature activations of one RBM are treated as data by the next RBM. Right panel: After pretraining, the RBM's are "unrolled" to create a multi-layer autoencoder that is fine-tuned by backpropagation.]

All of the above models, however, have important limitations. First, there are limitations on the types of structure that can be represented efficiently by a single layer of hidden variables. We will show that a network with multiple hidden layers and with millions of parameters can discover latent representations that work much better for information retrieval. Second, all of these text retrieval algorithms are based on computing a similarity measure between a query document and other documents in the collection.
The similarity is computed either directly in the word space or in a low-dimensional latent space. If this is done naively, the retrieval time complexity of these models is O(NV), where N is the size of the document corpus and V is the size of the vocabulary or the dimensionality of the latent space. By using an inverted index, the time complexity for TF-IDF can be improved to O(BV), where B is the average, over all terms in the query document, of the number of other documents in which the term appears. For LSA, the time complexity can be improved to O(V log N) by using special data structures such as KD-trees, provided the intrinsic dimensionality of the representations is low enough for KD-trees to be effective. For all of these models, however, the larger the document collection, the longer it will take to search for relevant documents.

In this paper we describe a new retrieval method called "semantic hashing" that produces a shortlist of similar documents in a time that is independent of the size of the document collection and linear in the size of the shortlist. Moreover, only a few machine instructions are required per document in the shortlist. Our method must store additional information about every document in the collection, but this additional information is only about one word of memory per document. Our method depends on a new way of training deep graphical models one layer at a time, so we start by describing the type of graphical model we use and how we train it.

The lowest layer in our graphical model represents the word-count vector of a document and the highest (i.e. deepest) layer represents a learned binary code for that document. The top two layers of the generative model form an undirected bipartite graph and the remaining layers form a belief net with directed, top-down connections (see fig. 2).
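The retrieval scheme sketched above — treat each binary code as a memory address and probe every address within a small Hamming ball of the query's address — can be written down directly. The table layout and function names are our own illustration, not the paper's implementation:

```python
from itertools import combinations

def hamming_ball(address, n_bits, radius):
    """Yield every n_bits-wide address within the given Hamming distance of `address`."""
    yield address
    for r in range(1, radius + 1):
        for bits in combinations(range(n_bits), r):
            flipped = address
            for b in bits:
                flipped ^= 1 << b  # flip one chosen bit
            yield flipped

def shortlist(query_code, table, n_bits=32, radius=2):
    """Collect documents stored at addresses differing from the query in <= radius bits."""
    out = []
    for addr in hamming_ball(query_code, n_bits, radius):
        out.extend(table.get(addr, []))
    return out

# Toy "memory": a dict from code to the documents hashed to that address.
table = {0b1010: ["doc_a"], 0b1011: ["doc_b"], 0b0101: ["doc_c"]}
near = shortlist(0b1010, table)  # doc_c is 4 bits away and is excluded
```

For a 32-bit code and radius 2 this probes 1 + 32 + 496 = 529 addresses, a cost that is independent of the number of documents in the collection.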
The model can be trained efficiently by using a Restricted Boltzmann Machine (RBM) to learn one layer of hidden variables at a time [8]. After learning is complete, the mapping from a word-count vector to the states of the top-level variables is fast, requiring only a matrix multiplication followed by a componentwise non-linearity for each hidden layer.

After the greedy, layer-by-layer training, the generative model is not significantly better than a model with only one hidden layer. To take full advantage of the multiple hidden layers, the layer-by-layer learning must be treated as a "pretraining" stage that finds a good region of the parameter space. Starting in this region, a gradient search can then fine-tune the model parameters to produce a much better model [10].

In the next section we introduce the "Constrained Poisson Model" that is used for modeling word-count vectors. This model can be viewed as a variant of the Rate Adaptive Poisson model [6] that is easier to train and has a better way of dealing with documents of different lengths. In section 3, we describe both the layer-by-layer pretraining and the fine-tuning of the deep multi-layer model. We also show how "deterministic noise" can be used to force the fine-tuning to discover binary codes in the top layer. In section 4, we describe two different ways of using binary codes for retrieval. For relatively small codes we use semantic hashing; for larger binary codes we simply compare the code of the query document to the codes of all candidate documents. This is still very fast because it can use bit operations. We present experimental results showing that both methods work very well on a collection of about a million documents as well as on a smaller collection.

2. THE CONSTRAINED POISSON MODEL

We use a conditional "constrained" Poisson distribution for modeling the observed "visible" word-count data v and a conditional Bernoulli distribution for modeling the "hidden" topic features h:
$$p(v_i = n \mid \mathbf{h}) = \mathrm{Ps}\!\left(n,\ \frac{\exp\!\big(\lambda_i + \sum_j h_j w_{ij}\big)}{\sum_k \exp\!\big(\lambda_k + \sum_j h_j w_{kj}\big)}\, N\right) \qquad (1)$$

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_i w_{ij} v_i\Big) \qquad (2)$$

where $\mathrm{Ps}(n, \lambda) = e^{-\lambda}\lambda^{n}/n!$, $\sigma(x) = 1/(1 + e^{-x})$, $w_{ij}$ is a symmetric interaction term between word $i$ and feature $j$, $N = \sum_i v_i$ is the total length of the document, $\lambda_i$ is the bias of the conditional Poisson model for word $i$, and $b_j$ is the bias of feature $j$. The Poisson rate, whose log is shifted by the weighted combination of the feature activations, is normalized and scaled up by $N$. We call this the "Constrained Poisson Model" (see fig. 3) since it ensures that the mean Poisson rates across all words sum up to the length of the document. This normalization is significant because it makes learning stable and it deals appropriately with documents of different lengths.

[Figure 3: The left panel shows the Markov random field of the constrained Poisson model. The top layer represents a vector, h, of stochastic, binary, latent, topic features and the bottom layer represents a Poisson visible vector v. The right panel shows a different interpretation of the constrained Poisson model in which the visible activities have all been divided by the number of words in the document so that they represent a probability distribution. The factor of N that multiplies the up-going weights is a result of having N i.i.d. observations from the observed distribution.]

The marginal distribution over visible count vectors $\mathbf{v}$ is:

$$p(\mathbf{v}) = \sum_{\mathbf{h}} \frac{\exp(-E(\mathbf{v}, \mathbf{h}))}{\sum_{\mathbf{u}, \mathbf{g}} \exp(-E(\mathbf{u}, \mathbf{g}))} \qquad (3)$$

with an "energy" term (i.e. the negative log probability plus an unknown constant offset) given by:

$$E(\mathbf{v}, \mathbf{h}) = -\sum_i \lambda_i v_i + \sum_i \log(v_i!) - \sum_j b_j h_j - \sum_{i,j} v_i h_j w_{ij} \qquad (4)$$

The parameter updates required to perform gradient ascent in the log-likelihood can be obtained from Eq. 3:

$$\Delta w_{ij} = \epsilon\, \frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \epsilon\left(\langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}}\right)$$

where $\epsilon$ is the learning rate, $\langle \cdot \rangle_{\mathrm{data}}$ denotes an expectation with respect to the data distribution, and $\langle \cdot \rangle_{\mathrm{model}}$ is an expectation with respect to the distribution defined by the model. To avoid computing $\langle \cdot \rangle_{\mathrm{model}}$, we use 1-step Contrastive Divergence [7]:

$$\Delta w_{ij} = \epsilon\left(\langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{recon}}\right) \qquad (5)$$

The expectation $\langle v_i h_j \rangle_{\mathrm{data}}$ defines the frequency with which word $i$ and feature $j$ are on together when the features are being driven by the observed count data from the training set using Eq. 2. After stochastically activating the features, Eq. 1 is used to "reconstruct" the Poisson rates for each word. Then Eq. 2 is used again to activate the features, and $\langle v_i h_j \rangle_{\mathrm{recon}}$ is the corresponding expectation when the features are being driven by the reconstructed counts. The learning rule for the biases is just a simplified version of Eq. 5.

3. PRETRAINING AND FINE-TUNING A DEEP GENERATIVE MODEL

A single layer of binary features is not the best way to capture the structure in the count data. We now describe an efficient way to learn additional layers of binary features.

After learning the first layer of hidden features we have an undirected model that defines p(v, h) via the energy function in Eq. 4. We can also think of the model as defining p(v, h) through a consistent pair of conditional probabilities, p(h|v) and p(v|h), which can be used to sample from the model distribution. A different way to express what has been learned is p(v|h) and p(h). Unlike a standard directed model, this p(h) does not have its own separate parameters. It is a complicated, non-factorial prior on h that is defined implicitly by the weights. This peculiar decomposition into p(h) and p(v|h) suggests a recursive algorithm: keep the learned p(v|h) but replace p(h) by a better prior over h, i.e. a prior that is closer to the average, over all the data vectors, of the conditional posterior over h. We can sample from this average conditional posterior by simply applying p(h|v) to the training data.
The sampled h vectors are then the "data" that is used for training a higher-level RBM that learns the next layer of features. We could initialize the higher-level RBM by using the same parameters as the lower-level RBM but with the roles of the hidden and visible units reversed. This ensures that p(v) for the higher-level RBM starts out being exactly the same as p(h) for the lower-level one. Provided the number of features per layer does not decrease, [8] show that each extra layer increases a variational lower bound on the log probability of the data. This bound is described in more detail in the appendix.

The directed connections from the first hidden layer to the visible units in the final, composite graphical model (see fig. 2) are a consequence of the fact that we keep the p(v|h) but throw away the p(h) defined by the first-level RBM. In the final composite model, the only undirected connections are between the top two layers, because we do not throw away the p(h) for the highest-level RBM.

The first layer of hidden features is learned using a constrained Poisson RBM in which the visible units represent word counts and the hidden units are binary. All the higher-level RBM's use binary units for both their hidden and their visible layers.
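One full CD-1 step for the first-layer constrained Poisson RBM, putting Eqs. 1, 2 and 5 together, might look like the sketch below. The toy sizes, the use of the mean reconstructed rates for the negative statistics, and all names are our assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v, W, b, lam, eps=0.01):
    """One step of contrastive divergence (Eq. 5) on a single document:
    Delta w_ij = eps * (<v_i h_j>_data - <v_i h_j>_recon)."""
    N = v.sum()
    # Positive phase: drive the features from the observed counts (Eq. 2).
    ph = sigmoid(b + v @ W)
    h = (rng.random(ph.shape) < ph).astype(float)  # stochastic binary features
    # Reconstruct the constrained Poisson mean rates (Eq. 1); they sum to N.
    a = lam + W @ h
    a = a - a.max()                                # stabilise the softmax
    v_recon = N * np.exp(a) / np.exp(a).sum()
    # Negative phase: drive the features from the reconstructed counts.
    ph_recon = sigmoid(b + v_recon @ W)
    return W + eps * (np.outer(v, ph) - np.outer(v_recon, ph_recon))

# Toy example: 5-word vocabulary, 3 hidden topic features.
V, H = 5, 3
W = 0.01 * rng.standard_normal((V, H))
b, lam = np.zeros(H), np.zeros(V)
v = np.array([2.0, 0.0, 1.0, 0.0, 3.0])           # a word-count vector
W_new = cd1_update(v, W, b, lam)
```

Because the reconstructed rates are constrained to sum to the document length, the reconstruction stays on the same scale as the data, which is what makes this variant stable across documents of different lengths.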