IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 15, NO. 1, JANUARY 2013

Spectral Hashing With Semantically Consistent Graph for Image Indexing

Peng Li, Meng Wang, Member, IEEE, Jian Cheng, Member, IEEE, Changsheng Xu, Senior Member, IEEE, and Hanqing Lu, Senior Member, IEEE

Abstract—The ability of fast similarity search in a large-scale dataset is of great importance to many multimedia applications. Semantic hashing is a promising way to accelerate similarity search, which designs compact binary codes for a large number of images so that semantically similar images are mapped to close codes. Retrieving similar neighbors is then simply accomplished by retrieving images that have codes within a small Hamming distance of the code of the query. Among various hashing approaches, spectral hashing (SH) has shown promising performance by learning the binary codes with a spectral graph partitioning method. However, the Euclidean distance is usually used to construct the graph Laplacian in SH, which may not reflect the inherent distribution of the data. Therefore, in this paper, we propose a method to directly optimize the graph Laplacian. The learned graph, which can better represent the similarity between samples, is then applied to SH for effective binary code learning. Meanwhile, our approach, unlike metric learning, can automatically determine the scale factor during the optimization. Extensive experiments are conducted on publicly available datasets, and the comparison results demonstrate the effectiveness of our approach.

Index Terms—Graph Laplacian, metric learning, similarity search, spectral hashing.

I. INTRODUCTION

SIMILARITY search (also known as nearest neighbor search) is one of the most fundamental problems in the information retrieval, database, and machine learning research communities. It is defined as the task of finding close samples for a given query.
It is of great importance to many multimedia applications, such as content-based multimedia retrieval [21], [24], classification [41], and annotation [25]–[27]. Recently, with the rapid evolution of the Internet and the explosive growth of visual content on the Web, large-scale image search has attracted considerable attention. How to perform a fast similarity search at large scale has become an urgent research issue.

Manuscript received July 14, 2011; revised October 12, 2011 and March 08, 2012; accepted May 10, 2012. Date of publication May 17, 2012; date of current version December 12, 2012. This work was supported in part by the 973 Program under Project 2010CB327905 and by the National Natural Science Foundation of China under Grants 61170127, 60975010, 60833006, and 61070104. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Xian-Sheng Hua. P. Li, J. Cheng, C. Xu, and H. Lu are with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). M. Wang is with the School of Computer and Information, Hefei University of Technology, Hefei 230009, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2012.2199970

Exhaustively comparing the query image with each sample in the database is infeasible because the linear complexity is not scalable in practical situations. For example, the photo sharing website Flickr has over 4 billion images, and the visual content sharing website YouTube receives more than 20 hours of uploaded video per minute. Besides, most large-scale content-based image retrieval applications suffer from the curse of dimensionality, since visual descriptors usually have hundreds or even thousands of dimensions. Therefore, beyond the infeasibility of exhaustive search, storage of the original data also becomes a big problem.

Over the past decade, several approximate nearest neighbor (ANN) search techniques have been developed for large-scale applications. Although there exist many tree-based methods [2], [9], [35], for applications with memory constraints, hashing-based ANN techniques have attracted more attention. Hashing-based methods are promising in accelerating similarity search for their capability of generating compact binary codes for a large number of images in the dataset so that similar images will have close binary codes. Retrieving similar neighbors is then accomplished simply by finding the images that have codes within a small Hamming distance of the code of the query. It is extremely fast to perform similarity search over such binary codes [36], because 1) the encoded data are highly compressed and thus can be loaded into the main memory; and 2) the Hamming distance between two binary codes can be computed efficiently by using the bit XOR operation and counting the number of set bits (an ordinary PC today can do millions of Hamming distance computations in just a few milliseconds).
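As a concrete illustration of point 2) above, the XOR-and-popcount computation can be sketched as follows (a minimal example; the function and variable names are ours, not from the paper):

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """XOR the two binary codes, then count the set bits (popcount)."""
    return bin(code_a ^ code_b).count("1")

# Two 8-bit codes that differ in exactly two bit positions.
a = 0b10110100
b = 0b10010110
print(hamming_distance(a, b))  # -> 2
```

In practice the codes are stored as packed machine words, so the XOR and the population count each map to a single hardware instruction, which is what makes millions of comparisons per second feasible.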
Many hashing algorithms have been developed in recent years, and they can mainly be divided into two categories: unsupervised methods (e.g., locality sensitive hashing (LSH) [1] and spectral hashing (SH) [46]) and supervised methods (e.g., boosting similarity sensitive coding (BoostingSSC) [34] and restricted Boltzmann machines (RBMs) [15]). Such hashing-based methods for fast similarity search can be considered as a means of embedding high-dimensional feature vectors into a low-dimensional Hamming space, while retaining as much as possible the semantic similarity structure of the data. Unsupervised methods use just the unlabeled data to generate binary codes for given points, while supervised methods, which incorporate the label information, are able to preserve the semantic similarity and thus facilitate semantic retrieval and classification.

Fig. 1. Schematic illustration of the proposed semantically consistent graph learning approach for spectral hashing.

Although these hashing methods have shown success in large-scale image search, there is a problem that is seldom addressed: the existing hashing approaches are defined only for the standard Euclidean distance. For example, in SH, the Euclidean distance is used to measure the image similarity and construct the graph Laplacian. However, the Euclidean distance may not accurately reflect the true underlying relationships between images, and the best distance metric should often be specialized for the task at hand [10], [48]. One approach to addressing the problem is to learn a distance metric based on certain supervision information and then apply it to different hashing algorithms, such as the method in [16]. There already exist many distance metric learning methods [12], [30], [45] that can be adopted. But such approaches have several weaknesses.
First, the distance metric learning and the hashing can be regarded as two isolated steps in this approach, and the objective optimized in distance metric learning may not be appropriate for hashing. Second, many hashing algorithms, such as SH, rely on similarity rather than distance, so a further step is needed to convert distance into similarity, and the involved radius parameter is usually sensitive.

In order to overcome the above problems, we propose a novel spectral hashing with semantically consistent graph approach in this paper. It improves spectral hashing, a state-of-the-art hashing algorithm that has shown superior performance compared with many other methods, by optimizing the graph Laplacian, which is built from the pairwise similarities of images, in the hash function learning process. The framework of our approach is illustrated in Fig. 1. First, the graph Laplacian is learned through pairwise similarities based on image labels or tags. The optimized graph Laplacian, which can better represent the similarity between images, is then incorporated into spectral hashing to obtain better binary codes. Our approach has the following advantages:
1) Our method directly optimizes the graph Laplacian in spectral hashing instead of first learning a distance metric and then constructing the graph Laplacian based on it.
2) Most of the existing distance metric learning methods also need to further convert the distance between samples into similarity via a radius parameter, whereas our approach learns the scale of the distance metric automatically and does not need this parameter.
3) The method can be applied not only to images with single class labels but also to images with multiple tags.

The rest of the paper is organized as follows. Section II reviews related work. Section III gives a brief introduction to the conventional spectral hashing method. Section IV describes our learning-based spectral hashing approach.
In Section V, we introduce the experiments on two publicly available datasets. Finally, conclusions and future work are given in Section VI.

II. RELATED WORK

A. Fast Similarity Search

Given a set of data points, the objective in nearest neighbor search is to find a subset that is most similar to the query. Exhaustively comparing the query with each sample in the database is computationally infeasible because linear complexity is not acceptable for large-scale databases. To avoid excessive computational and memory costs, one would like instead to perform an ANN search with sublinear query complexity [33].

There has been extensive research on fast similarity search due to its central importance in many applications. For a low-dimensional feature space, similarity search can be carried out efficiently by tree-based methods (e.g., the KD-tree, M-tree, cover tree, and metric tree). These methods usually partition the data space recursively to implement an exact similarity search in the low-dimensional feature space. For example, the KD-tree method pre-builds space-partitioning index structures, and the R-tree method pre-defines data-partitioning index structures [6]. However, the tree-based methods can degenerate to a linear scan in the worst case while attempting to speed up the similarity search. Moreover, in terms of achieving exact results, the tree-based similarity search methods do not perform better than the naive method, that is, a linear scan of the entire dataset, when the number of dimensions of the feature space is slightly high [44]. Thus, they encounter difficulties in practical applications, where the number of dimensions can be hundreds or even thousands.
Nevertheless, if complete exactness of the results is not really necessary, similarity search in a high-dimensional space can be dramatically sped up by using hashing-based methods, which are purposefully designed to approximately answer queries in virtually constant time [36]. In addition, the storage is also substantially reduced, as they usually store only a compact binary code for each data point.

B. Hashing-Based Image Indexing

Many hashing algorithms have been developed in recent years. One of the most popular methods is LSH [1]. Given a similarity metric in a feature space, the LSH algorithm typically guarantees that the probability of any two samples x and y falling into the same bucket equals their similarity sim(x, y), known as the "locality sensitive" property. One popular method in LSH is to generate a random vector from a particular probability distribution, e.g., the p-stable distribution [7] for the l_p-metric space. However, since the random vector is data-independent, LSH may lead to quite inefficient codes in practice [31], [46], as it requires multiple tables with long codes [11].

In order to overcome this problem, several recently proposed hashing techniques attempt to apply machine learning approaches rather than random projections to find good data-aware hash functions. In [31], Salakhutdinov et al. show that stacked RBMs [14], [15] are able to generate much more compact binary codes. The RBMs model is trained in two stages: an unsupervised pre-training stage and a supervised fine-tuning stage. The greedy pre-training is executed progressively layer by layer from input to output. After the parameters of a layer converge via contrastive divergence, the derived activation probabilities are fixed and treated as the input for training the next layer. In the fine-tuning stage, the labeled data are used to refine the trained network: the network weights are adjusted through back-propagation and gradient descent to maximize the supervised objective.
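Returning to LSH, the p-stable projection scheme cited earlier in this subsection can be sketched as follows. This is only an illustrative sketch using Gaussian (2-stable) projections of the form floor((w . x + b) / r); the parameters r and n_bits and all names are our own, not taken from [7]:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits, r = 64, 8, 4.0
W = rng.standard_normal((n_bits, dim))  # one Gaussian (2-stable) direction per bit
b = rng.uniform(0.0, r, size=n_bits)    # random offsets drawn uniformly from [0, r)

def lsh_key(x: np.ndarray) -> tuple:
    """Quantized projections form the bucket key; nearby points collide with high probability."""
    return tuple(np.floor((W @ x + b) / r).astype(int))

x = rng.standard_normal(dim)
print(len(lsh_key(x)))  # -> 8
```

Because the projections are drawn independently of the data, several such tables are typically maintained and a query is compared against every point that shares a bucket in any table, which is exactly the long-code, multi-table inefficiency noted above.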
RBM-based binary encoding involves the estimation of a large number of weights. For example, the RBMs architecture used in [38] has four layers of hidden units, with sizes 512-512-256-32, which requires a total of 663,552 weights to learn. This not only involves an extremely costly training procedure but also demands sufficient training data for fine-tuning. Researchers have also tried the boosting approach, such as the similarity sensitive coding (SSC) algorithm [34]. They first train AdaBoost [32] classifiers with similar pairs of data points as positive examples (and also dissimilar pairs as negative examples in SSC), and then take the outputs of all weak learners on a given data point as its binary code. In [38], both RBMs and boosting SSC are found to work significantly better than LSH when applied to a database containing tens of millions of images.

In [46], a new technique called SH is proposed based on spectral graph partitioning [5]. SH calculates the bits by thresholding a subset of eigenvectors of the Laplacian of the similarity graph, and it has demonstrated significant improvements over many other methods in terms of the number of bits required to find good similar neighbors. As an extension of SH, self-taught hashing (STH) is proposed in [49], which learns hash functions via SVMs to handle unseen data points. Liu et al. [23] propose a scalable graph-based unsupervised hashing approach, in which the Anchor Graph is used to overcome the computationally prohibitive step of building the graph Laplacian. A semi-supervised hashing (SSH) method is proposed in [39], which learns hash functions that minimize the error on the labeled training data while maximally satisfying the desirable properties of hashing. The authors then further introduce a method to learn the hash functions sequentially [40]. Mu et al. [29] propose a hashing algorithm named LAMP with kernel tricks and weak supervision, which is formulated within a regularized maximum margin framework.
In [17], the authors use query-adaptive ranking to address the issue that a large number of images share equal Hamming distances to a query. Reconfigurable hashing is proposed in [28], which constructs a large hash pool by one-off data indexing and then selects the most effective hashing-bit combination at runtime.
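The core SH computation described above, thresholding a subset of eigenvectors of the Laplacian of the similarity graph, can be sketched as follows. This is a naive toy illustration rather than the actual algorithm of [46], which avoids the dense eigendecomposition shown here; the Gaussian affinity, the sigma value, and all names are our own choices:

```python
import numpy as np

def spectral_codes(X: np.ndarray, n_bits: int, sigma: float = 1.0) -> np.ndarray:
    """Toy SH-style codes: threshold low-frequency Laplacian eigenvectors at zero."""
    # Pairwise squared Euclidean distances between all samples.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))  # similarity graph (Gaussian affinity)
    D = np.diag(W.sum(axis=1))            # degree matrix
    L = D - W                             # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)        # eigenvectors, ascending eigenvalues
    # Skip the trivial constant eigenvector; threshold the next n_bits at zero.
    return (eigvecs[:, 1:n_bits + 1] > 0).astype(np.uint8)

# Two well-separated 2-D clusters of five points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.standard_normal((5, 2)),
               rng.standard_normal((5, 2)) + 10.0])
codes = spectral_codes(X, n_bits=2, sigma=5.0)
```

On well-separated data like this toy set, the first bit (the thresholded Fiedler vector) splits the points along the cluster boundary, which is the graph-partitioning intuition behind SH.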