Locality-Sensitive Hashing Scheme Based on p-Stable Distributions

Mayur Datar, Department of Computer Science, Stanford University ([email protected])
Nicole Immorlica, Laboratory for Computer Science, MIT ([email protected])
Piotr Indyk*, Laboratory for Computer Science, MIT ([email protected])
Vahab S. Mirrokni, Laboratory for Computer Science, MIT ([email protected])

ABSTRACT
We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the ℓp norm, based on p-stable distributions. Our scheme improves the running time of the earlier algorithm for the case of the ℓ2 norm. It also yields the first known provably efficient approximate NN algorithm for the case p < 1. We also show that the algorithm finds the exact near neighbor in O(log n) time for data satisfying a certain "bounded growth" condition. Unlike earlier schemes, our LSH scheme works directly on points in the Euclidean space without embeddings. Consequently, the resulting query time bound is free of large factors, and the scheme is simple and easy to implement. Our experiments (on synthetic data sets) show that our data structure is up to 40 times faster than the kd-tree.

Categories and Subject Descriptors: E.1 [Data]: Data Structures; F.0 [Theory of Computation]: General
General Terms: Algorithms, Experimentation, Design, Performance, Theory
Keywords: Sublinear Algorithm, Approximate Nearest Neighbor, Locally Sensitive Hashing, p-Stable Distributions

* This material is based upon work supported by the NSF CAREER grant CCR-0133849.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SCG'04, June 9–11, 2004, Brooklyn, New York, USA.

1. INTRODUCTION
A similarity search problem involves a collection of objects (documents, images, etc.) that are characterized by a collection of relevant features and represented as points in a high-dimensional attribute space; given queries in the form of points in this space, we are required to find the nearest (most similar) object to the query.
Copyright 2004 ACM 1-58113-885-7/04/0006 ...$5.00.

A particularly interesting and well-studied instance is d-dimensional Euclidean space. This problem is of major importance to a variety of applications; some examples are: data compression, databases and data mining, information retrieval, image and video databases, machine learning, pattern recognition, statistics and data analysis. Typically, the features of the objects of interest (documents, images, etc.) are represented as points in ℝ^d, and a distance metric is used to measure similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands.
The low-dimensional case (say, for dimensionality d equal to 2 or 3) is well-solved, so the main issue is that of dealing with a large number of dimensions, the so-called "curse of dimensionality". Despite decades of intensive effort, the current solutions are not entirely satisfactory; in fact, for large enough d, in theory or in practice, they often provide little improvement over a linear algorithm that compares a query to each point from the database. In particular, it was shown in [28] (both empirically and theoretically) that all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions.
In recent years, several researchers have proposed to avoid the running time bottleneck by using approximation (e.g., [3, 22, 19, 24, 15], see also [12]). This is due to the fact that, in many cases, an approximate nearest neighbor is almost as good as the exact one; in particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter.
In fact, in situations where the quality of the approximate nearest neighbor is much worse than the quality of the actual nearest neighbor, the nearest neighbor problem is unstable, and it is not clear if solving it is at all meaningful [4, 17].
In [19, 14], the authors introduced an approximate high-dimensional similarity search scheme with provably sublinear dependence on the data size. Instead of using tree-like space partitioning, it relied on a new method called locality-sensitive hashing (LSH). The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. In [19, 14] the authors provided such locality-sensitive hash functions for the case when the points live in the binary Hamming space {0, 1}^d. They showed experimentally that the data structure achieves large speedup over several tree-based data structures when the data is stored on disk. In addition, since LSH is a hashing-based scheme, it can be naturally extended to the dynamic setting, i.e., when insertion and deletion operations also need to be supported. This avoids the complexity of dealing with tree structures when the data is dynamic.
The LSH algorithm has since been used in numerous applied settings, e.g., see [14, 10, 16, 27, 5, 7, 29, 6, 26, 13]. However, it suffers from a fundamental drawback: it is fast and simple only when the input points live in the Hamming space (indeed, almost all of the above applications involved binary data). As mentioned in [19, 14], it is possible to extend the algorithm to the ℓ2 norm, by embedding ℓ2 space into ℓ1 space, and then ℓ1 space into Hamming space.
However, this increases the query time and/or error by a large factor and complicates the algorithm.
In this paper we present a novel version of the LSH algorithm. As with the previous schemes, it works for the (R, c)-Near Neighbor (NN) problem, where the goal is to report a point within distance cR from a query q, if there is a point in the data set P within distance R from q. Unlike the earlier algorithm, our algorithm works directly on points in Euclidean space without embeddings. As a consequence, it has the following advantages over the previous algorithm:
• For the ℓ2 norm, its query time is O(dn^ρ(c) log n), where ρ(c) < 1/c for c ∈ (1, 10] (the inequality is strict, see Figure 1(b)). Thus, for a large range of values of c, the query time exponent is better than the one in [19, 14].
• It is simple and quite easy to implement.
• It works for any ℓp norm, as long as p ∈ (0, 2]. Specifically, we show that for any p ∈ (0, 2] and ε > 0 there exists an algorithm for (R, c)-NN under ℓ_p^d which uses O(dn + n^{1+ρ}) space, with query time O(n^ρ log_{1/ε} n), where ρ ≤ (1 + ε) · max(1/c^p, 1/c). To our knowledge, this is the only known provable algorithm for the high-dimensional nearest neighbor problem for the case p < 1. Similarity search under such fractional norms has recently attracted interest [1, 11].
Our algorithm also inherits two very convenient properties of LSH schemes. The first one is that it works well on data that is extremely high-dimensional but sparse. Specifically, the running time bound remains unchanged if d denotes the maximum number of non-zero elements in vectors. To our knowledge, this property is not shared by other known spatial data structures. Thanks to this property, we were able to use our new LSH scheme (specifically, the ℓ1 norm version) for fast color-based image similarity search [20].
In that context, each image was represented by a point in a roughly 100^3-dimensional space, but only about 100 dimensions were non-zero per point. The use of our LSH scheme enabled achieving order(s) of magnitude speed-up over the linear scan.
The second property is that our algorithm provably reports the exact near neighbor very quickly, if the data satisfies a certain bounded growth property. Specifically, for a query point q and c ≥ 1, let N(q, c) be the number of c-approximate nearest neighbors of q in P. If N(q, c) grows "sub-exponentially" as a function of c, then the LSH algorithm reports p, the nearest neighbor, with constant probability within time O(d log n), assuming it is given a constant factor approximation to the distance from q to its nearest neighbor. In particular, we show that if N(q, c) = O(c^b), then the running time is O(log n + 2^{O(b)}). Efficient nearest neighbor algorithms for data sets with polynomial growth properties in general metrics have recently been a focus of several papers [9, 21, 23]. LSH solves an easier problem (near neighbor under the ℓ2 norm), while working under weaker assumptions about the growth function. It is also somewhat faster, due to the fact that the log n factor in the query time of the earlier schemes is multiplied by a function of b, while in our case this factor is additive.
We complement our theoretical analysis with an experimental evaluation of the algorithm on data with a wide range of parameters. In particular, we compare our algorithm to an approximate version of the kd-tree algorithm [2]. We performed the experiments on synthetic data sets containing a "planted" near neighbor (see Section 5 for more details); a similar model was used earlier in [30]. Our experiments indicate that the new LSH scheme achieves query times up to 40 times better than those of the kd-tree algorithm.
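The bucket-probing search underlying LSH, described earlier in this introduction (hash each point with several functions; at query time, examine only the contents of the query's buckets), can be sketched as follows. This is our own minimal illustration using the bit-sampling hash for Hamming space; the class name, parameter values, and all identifiers are ours, not the paper's:

```python
import random
from collections import defaultdict

class HammingLSH:
    """Toy LSH index over bit vectors: L hash tables, each keyed by k sampled bits."""

    def __init__(self, dim, k=6, L=12, seed=0):
        rng = random.Random(seed)
        # Each table's key function concatenates k randomly sampled coordinates.
        self.projections = [rng.sample(range(dim), k) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, j, v):
        return tuple(v[i] for i in self.projections[j])

    def insert(self, v):
        # Store v in its bucket in every table (hashing also supports deletion).
        for j, table in enumerate(self.tables):
            table[self._key(j, v)].append(v)

    def candidates(self, q):
        """Union of the query's buckets across all L tables."""
        seen = set()
        for j, table in enumerate(self.tables):
            for v in table.get(self._key(j, q), []):
                seen.add(tuple(v))
        return [list(t) for t in seen]

random.seed(2)
dim = 64
points = [[random.randint(0, 1) for _ in range(dim)] for _ in range(500)]
index = HammingLSH(dim)
for v in points:
    index.insert(v)

q = list(points[0])                  # query: a stored point with 2 bits flipped
for i in random.sample(range(dim), 2):
    q[i] ^= 1
cands = index.candidates(q)
# The planted near neighbor shares a bucket with q in at least one table with
# overwhelming probability, while most far points never become candidates.
assert points[0] in cands
```

Because each table keys on only a few coordinates, close points (which agree on most coordinates) collide often, while far points rarely do; the query then touches a small candidate set instead of the whole data set.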
1.1 Notations and problem definitions
We use ℓ_p^d to denote the space ℝ^d under the ℓp norm. For any point v ∈ ℝ^d, we denote by ||v||_p the ℓp norm of the vector v. Let M = (X, d) be any metric space, and v ∈ X. The ball of radius r centered at v is defined as B(v, r) = {q ∈ X | d(v, q) ≤ r}. Let c = 1 + ε.
In this paper we focus on the (R, c)-NN problem. Observe that (R, c)-NN is simply a decision version of the Approximate Nearest Neighbor problem. Although in many applications solving the decision version is good enough, one can also reduce the c-approximate NN problem to the (R, c)-NN problem via a binary-search-like approach. In particular, it is known [19, 15] that the c-approximate NN problem reduces to O(log(n/ε)) instances of (R, c)-NN. Then, the complexity of c-approximate NN is the same (within a log factor) as that of the (R, c)-NN problem.

2. LOCALITY-SENSITIVE HASHING
An important technique from [19] for solving the (R, c)-NN problem is locality-sensitive hashing, or LSH. For a domain S of the point set with distance measure D, an LSH family is defined as:

DEFINITION 1. A family H = {h : S → U} is called (r1, r2, p1, p2)-sensitive for D if for any v, q ∈ S
• if v ∈ B(q, r1) then Pr_H[h(q) = h(v)] ≥ p1,
• if v ∉ B(q, r2) then Pr_H[h(q) = h(v)] ≤ p2.

In order for a locality-sensitive hash (LSH) family to be useful, it has to satisfy the inequalities p1 > p2 and r1 < r2.
We briefly describe, following [19], how an LSH family can be used to solve the (R, c)-NN problem: We choose r1 = R and r2 = c · R. Given a family H of hash functions with parameters (r1, r2, p1, p2) as in Definition 1, we amplify the gap between the "high" probability p1 and the "low" probability p2 by concatenating several functions.
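The effect of concatenation is easy to quantify: if a single function from a family as in Definition 1 collides with probability at least p1 for near pairs and at most p2 for far pairs, then a k-wise concatenation collides with probability at least p1^k versus at most p2^k, so the ratio between the two bounds grows exponentially in k. A quick numeric check (the values of p1, p2, and k below are our own illustrative choices, not taken from the paper):

```python
# Gap amplification by concatenating k hash functions: a near pair must
# collide on all k coordinates, so its collision probability bound is p1**k;
# a far pair's bound is p2**k. Both shrink, but their ratio (p1/p2)**k grows.
p1, p2, k = 0.9, 0.5, 10

near_bound = p1 ** k   # collision bound for points within distance r1
far_bound = p2 ** k    # collision bound for points beyond distance r2

assert near_bound > 0.3        # near pairs still collide reasonably often
assert far_bound < 0.001       # far pairs almost never collide
assert near_bound / far_bound == (p1 / p2) ** k
```

Concatenation alone drives both probabilities down; the scheme below compensates by repeating the construction over several independent hash tables.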
In particular, for k specified later, define a function family G = {g : S → U^k} such that g(v) = (h1(v), ..., hk(v)), where hi ∈ H. For an integer L we choose L functions g1, ..., gL from G, independently and uniformly at random. During preprocessing, we store each v ∈ P (the input point set) in the bucket g_j(v), for j = 1, ..., L. Since the total number of buckets may be large, we retain only the non-empty buckets by resorting to hashing. To process a query q, we search all buckets g1(q), ..., gL(q); as it is possible (though unlikely) that the total number of points stored in those buckets is large, we interrupt the search after finding the first 3L points (including duplicates). Let v1, ..., vt be the points encountered therein. For each vj, if vj ∈ B(q, r2) then we return YES and vj, else we return NO.
The parameters k and L are chosen so as to ensure that with a constant probability the following two properties hold:
1. If there exists v* ∈ B(q, r1) then g_j(v*) = g_j(q) for some j = 1, ..., L, and
2. The total number of collisions of q with points from P − B(q, r2) is less than 3L, i.e.,
   Σ_{j=1}^{L} |(P − B(q, r2)) ∩ g_j^{−1}(g_j(q))| < 3L.
Observe that if (1) and (2) hold, then the algorithm is correct. It follows (see [19], Theorem 5, for details) that if we set k = log_{1/p2} n and L = n^ρ, where ρ = ln(1/p1) / ln(1/p2), then (1) and (2) hold with a constant probability. Thus, we get the following theorem (a slightly different version of Theorem 5 in [19]), which relates the efficiency of solving the (R, c)-NN problem to the sensitivity parameters of the LSH family.

THEOREM 1. Suppose there is a (R, cR, p1, p2)-sensitive family H for a distance measure D.
Then there exists an algorithm for (R, c)-NN under measure D which uses O(dn + n^{1+ρ}) space, with query time dominated by O(n^ρ) distance computations and O(n^ρ log_{1/p2} n) evaluations of hash functions from H, where ρ = ln(1/p1) / ln(1/p2).

3. OUR LSH SCHEME
In this section, we present an LSH family based on p-stable distributions that works for all p ∈ (0, 2]. Since we consider points in ℓ_p^d, without loss of generality we can take R = 1, which we assume from now on.

3.1 p-stable distributions
Stable distributions [31] are defined as limits of normalized sums of independent identically distributed variables (an alternate definition follows). The most well-known example of a stable distribution is the Gaussian (or normal) distribution. However, the class is much wider; for example, it includes heavy-tailed distributions.
Stable Distribution: A distribution D over ℝ is called p-stable if there exists p ≥ 0 such that for any n real numbers v1, ..., vn and i.i.d. variables X1, ...
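Although the definition is cut off above, the standard stability property for the Gaussian case (p = 2) is that Σ_i v_i X_i is distributed as ||v||_2 · X, where the X_i and X are i.i.d. standard Gaussians. A small simulation of ours checks this (a sketch of the standard fact, not code from the paper):

```python
import math
import random
import statistics

random.seed(1)
v = [3.0, -4.0]                 # a fixed real vector with ||v||_2 = 5
norm = math.hypot(*v)           # Euclidean norm of v

# Draw many samples of sum_i v_i * X_i with X_i i.i.d. N(0, 1).
samples = [sum(vi * random.gauss(0.0, 1.0) for vi in v)
           for _ in range(200_000)]

# 2-stability predicts the sum is N(0, ||v||_2^2): mean 0, std ||v||_2 = 5.
m = statistics.fmean(samples)
s = statistics.pstdev(samples)
assert abs(m) < 0.1 and abs(s - norm) < 0.1
print(s)   # sample std of the sum; 2-stability predicts ||v||_2 = 5
```

It is this property, applied to random projections a · v, that the scheme exploits: the projected distance between two points is governed by their ℓp distance.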