Semantic hashing

Ruslan Salakhutdinov*, Geoffrey Hinton

Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, Ontario, Canada M5S 3G4

International Journal of Approximate Reasoning 50 (2009) 969–978

Article history: Received 11 January 2008; Received in revised form 15 November 2008; Accepted 19 November 2008; Available online 10 December 2008

Keywords: Information retrieval; Graphical models; Unsupervised learning

*Corresponding author. E-mail addresses: [email protected] (R. Salakhutdinov), [email protected] (G. Hinton)

Abstract

We show how to learn a deep graphical model of the word-count vectors obtained from a large set of documents. The values of the latent variables in the deepest layer are easy to infer and give a much better representation of each document than Latent Semantic Analysis. When the deepest layer is forced to use a small number of binary variables (e.g. 32), the graphical model performs "semantic hashing": documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document. This way of extending the efficiency of hash-coding to approximate matching is much faster than locality sensitive hashing, which is the fastest current method. By using semantic hashing to filter the documents given to TF-IDF, we achieve higher accuracy than applying TF-IDF to the entire document set.

1. Introduction

One of the most popular and widely-used algorithms for retrieving documents that are similar to a query document is TF-IDF [19,18], which measures the similarity between documents by comparing their word-count vectors. The similarity metric weights each word by both its frequency in the query document (Term Frequency) and the logarithm of the reciprocal of its frequency in the whole set of documents (Inverse Document Frequency). TF-IDF has several major limitations:

– It computes document similarity directly in the word-count space, which could be slow for large vocabularies.
– It assumes that the counts of different words provide independent evidence of similarity.
– It makes no use of semantic similarities between words.

To remedy these drawbacks, numerous models for capturing low-dimensional, latent representations have been proposed and successfully applied in the domain of information retrieval. A simple and widely-used method is Latent Semantic Analysis (LSA) [6], which extracts low-dimensional semantic structure using an SVD decomposition to get a low-rank approximation of the word-document co-occurrence matrix. This allows document retrieval to be based on "semantic" content rather than just on individually weighted words. LSA, however, is very restricted in the types of semantic content it can capture because it is a linear method, so it can only capture pairwise correlations between words. A probabilistic version of LSA (pLSA) was introduced by [13], using the assumption that each word is modeled as a sample from a document-specific multinomial mixture of word distributions. A proper generative model at the level of documents, Latent Dirichlet Allocation, was introduced by [3], improving upon [13].

These recently introduced probabilistic models can be viewed as graphical models in which hidden topic variables have directed connections to variables that represent word-counts. Their major drawback is that exact inference is intractable due to explaining away, so they have to resort to slow or inaccurate approximations to compute the posterior distribution over topics. This makes it difficult to fit the models to data. Also, as Welling et al. [21] point out, fast inference is important for information retrieval.
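As a concrete reference point, the TF-IDF scheme discussed in the introduction can be sketched in a few lines. This is a generic illustration with our own function names, not code from the paper:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a sparse TF-IDF vector (a word -> weight dict) per document.

    TF is the raw count of a word in the document; IDF is the log of the
    reciprocal of the fraction of documents containing the word.
    """
    n_docs = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: c * math.log(n_docs / df[w]) for w, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Note that every query must be compared against the whole collection, which is exactly the scaling problem the paper addresses.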
To achieve this, [21] introduce a class of two-layer undirected graphical models that generalize Restricted Boltzmann Machines (RBM's) [9] to exponential family distributions. This allows them to model non-binary data and to use non-binary hidden (i.e. latent) variables. Maximum likelihood learning is intractable in these models, but learning can be performed efficiently by following an approximation to the gradient of a different objective function called "contrastive divergence" [9]. Several further developments of these undirected models [8,22] show that they are competitive in terms of retrieval accuracy with their directed counterparts.

All of the above models, however, have important limitations. First, there are limitations on the types of structure that can be represented efficiently by a single layer of hidden variables [2]. We will show that a network with multiple hidden layers and with millions of parameters can discover latent representations that work much better for information retrieval. Second, all of these text retrieval algorithms are based on computing a similarity measure between a query document and other documents in the collection. The similarity is computed either directly in the word space or in a low-dimensional latent space. If this is done naively, the retrieval time complexity of these models is O(NV), where N is the size of the document corpus and V is the size of the vocabulary or the dimensionality of the latent space. By using an inverted index, the time complexity for TF-IDF can be improved to O(BV), where B is the average, over all terms in the query document, of the number of other documents in which the term appears. For LSA, the time complexity can be improved to O(V log N) by using special data structures such as KD-trees [7,15], provided the intrinsic dimensionality of the representations is low enough for KD-trees to be effective.
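The inverted-index speed-up works because only documents sharing at least one term with the query are ever scored. A minimal sketch with illustrative names of our own, not from the paper:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for term in set(doc):
            index[term].add(doc_id)
    return index

def candidate_documents(index, query_terms):
    """Union of the posting lists of the query terms.

    Only these candidates need scoring, so the cost depends on the
    posting-list lengths (the B in O(BV)), not on the corpus size N.
    """
    candidates = set()
    for term in set(query_terms):
        candidates |= index.get(term, set())
    return candidates
```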
For all of these models, however, the larger the size of the document collection, the longer it will take to search for relevant documents. In this paper, we describe a new retrieval method called "semantic hashing" that produces a shortlist of similar documents in a time that is independent of the size of the document collection and linear in the size of the shortlist. Moreover, only a few machine instructions are required per document in the shortlist. Our method must store additional information about every document in the collection, but this additional information is only about one word of memory per document.

Our method depends on a new way of training deep graphical models one layer at a time, so we start by describing the type of graphical model we use and how we train it. The lowest layer in our graphical model represents the word-count vector of a document and the highest (i.e. deepest) layer represents a learned binary code for that document. The top two layers of the generative model form an undirected bipartite graph and the remaining layers form a belief net with directed, top–down connections (see Fig. 2). The model can be trained efficiently by using a Restricted Boltzmann Machine (RBM) to learn one layer of hidden variables at a time [10]. After learning is complete, the mapping from a word-count vector to the states of the top-level variables is fast, requiring only a matrix multiplication followed by a componentwise non-linearity for each hidden layer.

After the greedy, layer-by-layer training, the generative model is not significantly better than a model with only one hidden layer. To take full advantage of the multiple hidden layers, the layer-by-layer learning must be treated as a "pretraining" stage that finds a good region of the parameter space. Starting in this region, a gradient search can then fine-tune the model parameters to produce a much better model [12].
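The probing scheme described above — access every memory address within a few bit-flips of the query's code — can be sketched as follows. The function names are our own, and the 32-bit code size follows the paper; everything else is an illustrative assumption:

```python
from itertools import combinations

def hamming_ball(address, n_bits=32, radius=2):
    """Yield every address that differs from `address` in at most `radius` bits."""
    yield address
    for r in range(1, radius + 1):
        for bits in combinations(range(n_bits), r):
            flipped = address
            for b in bits:
                flipped ^= 1 << b       # flip bit b
            yield flipped

def shortlist(query_code, hash_table, radius=2):
    """`hash_table` maps a binary code to the list of document ids stored there.

    The cost is the size of the Hamming ball (e.g. 1 + 32 + 496 = 529
    probes for radius 2), independent of the size of the collection.
    """
    docs = []
    for addr in hamming_ball(query_code, radius=radius):
        docs.extend(hash_table.get(addr, []))
    return docs
```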
In the next section, we introduce the "Constrained Poisson Model" that is used for modeling word-count vectors. This model can be viewed as a variant of the Rate Adaptive Poisson model [8] that is easier to train and has a better way of dealing with documents of different lengths. In Section 3, we describe both the layer-by-layer pretraining and the fine-tuning of the deep multi-layer model. We also show how "deterministic noise" can be used to force the fine-tuning to discover binary codes in the top layer. In Section 4, we describe two different ways of using binary codes for retrieval. For relatively small codes we use semantic hashing, and for larger binary codes we simply compare the code of the query document to the codes of all candidate documents. This is still very fast because it can use bit operations. We present experimental results showing that both methods work very well on a collection of about a million documents as well as on a smaller collection.

Fig. 1. A schematic representation of semantic hashing: the semantic hashing function maps semantically similar documents to nearby locations in the document address space.

Fig. 2. Left panel: The deep generative model (layer sizes 2000, 500, 500 and a 32-unit code layer; Gaussian noise is injected into the code layer). Middle panel: Recursive pretraining consists of learning a stack of RBM's in which the feature activations of one RBM are treated as data by the next RBM. Right panel: After pretraining, the RBM's are "unrolled" to create a multi-layer autoencoder that is fine-tuned by backpropagation.

2. The constrained Poisson model

We use a conditional "constrained" Poisson distribution for modeling observed "visible" word-count data $\mathbf{v}$ and a conditional Bernoulli distribution for modeling "hidden" topic features $\mathbf{h}$:

$$p(v_i = n \mid \mathbf{h}) = \mathrm{Ps}\left(n,\ \frac{\exp\bigl(\lambda_i + \sum_j h_j w_{ij}\bigr)}{\sum_k \exp\bigl(\lambda_k + \sum_j h_j w_{kj}\bigr)}\, N\right), \quad (1)$$

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\Bigl(b_j + \sum_i w_{ij} v_i\Bigr), \quad (2)$$

where $\mathrm{Ps}(n,\lambda) = e^{-\lambda}\lambda^{n}/n!$, $\sigma(x) = 1/(1+e^{-x})$, $w_{ij}$ is a symmetric interaction term between word $i$ and feature $j$, $N = \sum_i v_i$ is the total length of the document, $\lambda_i$ is the bias of the conditional Poisson model for word $i$, and $b_j$ is the bias of feature $j$. The Poisson rate, whose log is shifted by the weighted combination of the feature activations, is normalized and scaled up by $N$. We call this the "Constrained Poisson Model" (see Fig. 3) since it ensures that the mean Poisson rates across all words sum up to the length of the document. This normalization is significant because it makes learning stable and it deals appropriately with documents of different lengths.

The marginal distribution over visible count vectors $\mathbf{v}$ is:

$$p(\mathbf{v}) = \sum_{\mathbf{h}} \frac{\exp(-E(\mathbf{v},\mathbf{h}))}{\sum_{\mathbf{u},\mathbf{g}} \exp(-E(\mathbf{u},\mathbf{g}))}, \quad (3)$$

with an "energy" term (i.e. the negative log probability + unknown constant offset) given by:

$$E(\mathbf{v},\mathbf{h}) = -\sum_i \lambda_i v_i + \sum_i \log(v_i!) - \sum_j b_j h_j - \sum_{i,j} v_i h_j w_{ij}. \quad (4)$$

Fig. 3. The left panel shows the Markov random field of the constrained Poisson model. The top layer represents a vector, h, of stochastic, binary, latent topic features, and the bottom layer represents a Poisson visible vector v. The right panel shows a different interpretation of the constrained Poisson model in which the visible activities have all been divided by the number of words in the document so that they represent a probability distribution. The factor of N that multiplies the upgoing weights is a result of having N i.i.d. observations from the observed distribution.
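Eqs. (1) and (2) translate directly into NumPy. The following is a toy sketch with assumed shapes (a vocabulary of V words and J binary features), not the authors' code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def poisson_rates(h, W, lam, N):
    """Eq. (1): mean rates of the constrained Poisson model.

    The softmax over (lam + W h) is scaled by the document length N,
    so the mean rates always sum to N, whatever the document length.
    """
    logits = lam + W @ h                 # shape (V,)
    p = np.exp(logits - logits.max())    # subtract max for numerical stability
    return N * p / p.sum()

def hidden_probabilities(v, W, b):
    """Eq. (2): p(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij v_i)."""
    return sigmoid(b + W.T @ v)          # shape (J,)
```

The scaling by N is the "constraint" that gives the model its name.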
Learning is done by following an approximation to the gradient of a different objective function, the "Contrastive Divergence" [9]:

$$\Delta w_{ij} = \epsilon\bigl(\langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\text{recon}}\bigr), \quad (5)$$

where $\epsilon$ is the learning rate. The expectation $\langle v_i h_j\rangle_{\text{data}}$ defines the frequency with which word $i$ and feature $j$ are on together when the features are being driven by the observed count data from the training set using Eq. (2). After stochastically activating the features, Eq. (1) is used to "reconstruct" the Poisson rates for each word. Then Eq. (2) is used again to activate the features, and $\langle v_i h_j\rangle_{\text{recon}}$ is the corresponding expectation when the features are being driven by the reconstructed counts. The learning rule for the biases is just a simplified version of Eq. (5).

3. Pretraining and fine-tuning a deep generative model

A single layer of binary features may not be the best way to capture the structure in the count data. We now describe an efficient way to learn additional layers of binary features. After learning the first layer of hidden features, we have an undirected model that defines p(v, h) via the energy function in Eq. (4). We can also think of the model as defining p(v, h) by defining a consistent pair of conditional probabilities, p(h|v) and p(v|h), which can be used to sample from the model distribution.

The first layer of hidden features is learned using a constrained Poisson RBM in which the visible units represent word-counts and the hidden units are binary. All the higher-level RBM's use binary units for both their hidden and their visible layers. The update rules for each layer are:

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\Bigl(b_j + \sum_i w_{ij} v_i\Bigr), \qquad p(v_i = 1 \mid \mathbf{h}) = \sigma\Bigl(b_i + \sum_j w_{ij} h_j\Bigr). \quad (6)$$

The learning rule provided in the previous section remains the same [9]. This greedy, layer-by-layer training can be repeated several times to learn a deep, hierarchical model in which each layer of features captures strong high-order correlations between the activities of features in the layer below. To suppress noise in the learning signal, we use the real-valued activation probabilities for the visible units of all the higher-level RBM's, but to prevent hidden units from transmitting more than one bit of information from the data to its reconstruction, the pretraining always uses stochastic binary values for the hidden units.

Each time an extra layer is learned in this way, a variational lower bound on the log probability of the data is improved [10]. The variational bound does not apply if the layers get smaller, as they do in an autoencoder, but, as we shall see, the pretraining algorithm still works very well as a way to initialize a subsequent stage of fine-tuning. The pretraining finds a point that lies in a good region of parameter space, and the myopic fine-tuning then performs a local gradient search that finds a point that is considerably better.

The directed connections from the first hidden layer to the visible units in the final, composite graphical model (see Fig. 2) are a consequence of the fact that we keep the p(v|h) but throw away the p(h) defined by the first-level RBM. In the final composite model, the only undirected connections are between the top two layers, because we do not throw away the p(h) for the highest-level RBM.
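A single contrastive-divergence update of the constrained Poisson RBM (Eqs. (1), (2) and (5)) might look as follows. This is our own sketch, not the authors' code; in particular, using the mean Poisson rates as the reconstruction (rather than sampled counts) is a simplifying assumption:

```python
import numpy as np

def cd1_step(v, W, b, lam, rng, lr=0.01):
    """One CD-1 weight update, Eq. (5): dW = lr * (<v h>_data - <v h>_recon)."""
    N = v.sum()
    # Drive the features from the data (Eq. (2)) and sample binary states.
    p_h = 1.0 / (1.0 + np.exp(-(b + W.T @ v)))
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # Reconstruct the expected word counts via the constrained Poisson
    # rates (Eq. (1)); the softmax is scaled by the document length N.
    logits = lam + W @ h
    p = np.exp(logits - logits.max())
    v_recon = N * p / p.sum()
    # Drive the features again, this time from the reconstruction.
    p_h_recon = 1.0 / (1.0 + np.exp(-(b + W.T @ v_recon)))
    # Eq. (5): difference of pairwise data and reconstruction statistics.
    return W + lr * (np.outer(v, p_h) - np.outer(v_recon, p_h_recon))
```

For the higher-level binary RBM's, the same update applies with the Poisson reconstruction replaced by the logistic visible rule of Eq. (6).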