# Multidimensional Binary Search Trees Used for Associative Searching

1975 ACM Student Award Paper: Second Place

**Multidimensional Binary Search Trees Used for Associative Searching**

Jon Louis Bentley, Stanford University

This paper develops the multidimensional binary search tree (or k-d tree, where k is the dimensionality of the search space) as a data structure for storage of information to be retrieved by associative searches. The k-d tree is defined and examples are given. It is shown to be quite efficient in its storage requirements. A significant advantage of this structure is that a single data structure can handle many types of queries very efficiently. Various utility algorithms are developed; their proven average running times in an n record file are: insertion, O(log n); deletion of the root, O(n^((k−1)/k)); deletion of a random node, O(log n); and optimization (guarantees logarithmic performance of searches), O(n log n). Search algorithms are given for partial match queries with t keys specified [proven maximum running time of O(n^((k−t)/k))] and for nearest neighbor queries [empirically observed average running time of O(log n)]. These performances far surpass the best currently known algorithms for these tasks. An algorithm is presented to handle any general intersection query. The main focus of this paper is theoretical. It is felt, however, that k-d trees could be quite useful in many applications, and examples of potential uses are given.

Key Words and Phrases: associative retrieval, binary search trees, key, attribute, information retrieval system, nearest neighbor queries, partial match queries, intersection queries, binary tree insertion

CR Categories: 3.63, 3.70, 3.74, 4.49

Copyright © 1975, Association for Computing Machinery, Inc.
General permission to republish, but not for profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery. This paper was awarded Second Place in ACM's 1975 George E. Forsythe Student Paper Competition. Author's present address: Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27514.

## 1. Introduction

The problem of associative retrieval (often referred to as retrieval by secondary key) centers around a file F, which is a collection of records. Each record R of F is an ordered k-tuple (v_0, v_1, …, v_{k−1}) of values, which are the keys, or attributes, of the record. A retrieval of records from F is initiated at the request of the user, which could be either mechanical or human. A retrieval request is called a query of the file, and specifies certain conditions to be satisfied by the keys of the records it requests to be retrieved from F. An information retrieval system must be capable of initiating an appropriate search upon the arrival of a query to that system. If a query is allowed to specify conditions dealing with a multiplicity of the keys, the searches performed by the system are considered associative. If the user of the system is restricted to specifying conditions for only one of the keys, the resulting search is not considered to be associative (in that case only one of the keys is considered to be "the key" and the remaining attributes are referred to as "data").

Numerous methods exist for building an information retrieval system capable of handling such associative queries. Among these are inverted files, methods of compounding attributes, superimposed coding systems, and combinatorial hashing. Knuth [5] discusses these ideas in detail.
McCreight [6] has proposed that the keys be "shuffled" together bitwise; then unidimensional retrieval algorithms could be used to answer certain queries. Rivest investigated in his thesis [7] the use of binary search tries (see [5] for a detailed discussion of tries) to store records when they are composed of binary keys. Finkel and Bentley [3] discuss a data structure, quad trees, that stores the records of the file in a tree whose nodes have out-degree 2^k; theirs was the first general approach to use a tree structure. None of the above approaches seems to provide a "perfect" environment for associative retrieval. Each of them falls short in some very important way, either in having only a small class of queries easily performed, large running time, large space requirements, horrible worst cases, or some other adverse properties.

This paper presents a new type of data structure for associative searching, called the multidimensional binary search tree or k-d tree, which is defined in Section 2. In Section 3 an efficient insertion algorithm is given and analyzed. Many types of associative searches are discussed in Section 4, and the k-d tree is shown to perform quite well in all of them. Its worst-case performance in partial match searches is analyzed and is shown to equal the best previously attained average performance. A search algorithm is presented that can answer any intersection query. An algorithm given seems to solve the nearest neighbor problem in logarithmic time; its running time is much less than that of any other known method. In Section 5 deletion is shown to be possible in k-d trees, and it is analyzed. Section 6 discusses an optimization algorithm that efficiently transforms any collection of records into a k-d tree with optimal properties. With this background, a discussion of potential applications of k-d trees is presented in Section 7.

Communications of the ACM, September 1975, Volume 18, Number 9
Areas for further work are discussed in Section 8, and conclusions are drawn in Section 9.

## 2. Definitions and Notations

If a file is represented as a k-d tree, then each record in the file is stored as a node in the tree. In addition to the k keys which comprise the record, each node contains two pointers, which are either null or point to another node in the k-d tree. (Note that each pointer can be considered as specifying a subtree.) Associated with each node, though not necessarily stored as a field, is a discriminator, which is an integer between 0 and k − 1, inclusive. Let us define a notation for these items: the k keys of node P will be called K_0(P), …, K_{k−1}(P), the pointers will be LOSON(P) and HISON(P), and the discriminator will be DISC(P).

The defining order imposed by a k-d tree is this: for any node P in a k-d tree, let j be DISC(P); then for any node Q in LOSON(P), it is true that K_j(Q) < K_j(P), and for any node Q in HISON(P), it is true that K_j(Q) > K_j(P). (This statement does not take into account the possibility of equality of keys, which will be discussed shortly.) All nodes on any given level of the tree have the same discriminator. The root node has discriminator 0, its two sons have discriminator 1, and so on to the k-th level, on which the discriminator is k − 1; the (k + 1)-th level has discriminator 0, and the cycle repeats. In general, NEXTDISC is a function defined as NEXTDISC(i) = (i + 1) mod k, and NEXTDISC(DISC(P)) = DISC(LOSON(P)), and likewise for HISON(P) (if they are non-null). Figure 1 gives an example of records in 2-space stored as nodes in a 2-d tree.

The problem of equality of keys mentioned above arises in the definition of a function telling which son of P to visit next while searching for node Q. This function is written SUCCESSOR(P, Q) and returns either LOSON or HISON. Let j be DISC(P); if K_j(P) ≠ K_j(Q), then it is clear by the defining property of k-d trees which son SUCCESSOR should return. If the two K_j's are equal, the decision must be based on the remaining keys.
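The node layout and the cyclic-discriminator rule just defined can be sketched in Python; this is a minimal illustration, not the paper's notation, and the names `KdNode` and `insert` are ours. The discriminator is derived from depth via NEXTDISC(i) = (i + 1) mod k instead of being stored, and ties on the discriminator key are simply sent toward HISON here; the paper resolves ties more carefully with the superkey rule discussed next.

```python
class KdNode:
    """One record of the file: k keys plus the LOSON/HISON pointers."""
    def __init__(self, keys):
        self.keys = tuple(keys)   # K_0(P), ..., K_{k-1}(P)
        self.loson = None
        self.hison = None

def insert(root, keys):
    """Insert a record, cycling the discriminator as depth increases."""
    if root is None:
        return KdNode(keys)
    k = len(keys)
    node, disc = root, 0
    while True:
        if keys[disc] < node.keys[disc]:   # j-less: descend LOSON
            if node.loson is None:
                node.loson = KdNode(keys)
                break
            node = node.loson
        else:                              # j-greater (and ties): descend HISON
            if node.hison is None:
                node.hison = KdNode(keys)
                break
            node = node.hison
        disc = (disc + 1) % k              # NEXTDISC
    return root
```

Inserting the records A(50,50), B(10,70), C(80,85), D(25,20), E(40,85), F(70,85) in that order puts A at the root with B and C as its sons; note that F ties with C on the second key and therefore lands in C's HISON under this simplified rule, whereas the superkey rule would compare the remaining keys before deciding.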
The choice of decision function is arbitrary, but for most applications the following method works nicely: define a superkey of P by S_j(P) = K_j(P) K_{j+1}(P) … K_{k−1}(P) K_0(P) … K_{j−1}(P), the cyclical concatenation of all keys starting with K_j. If S_j(Q) < S_j(P), then SUCCESSOR returns LOSON; if S_j(Q) > S_j(P), it returns HISON. If S_j(Q) = S_j(P), then all k keys are equal, and SUCCESSOR returns a special value to indicate that.

Fig. 1. Records in 2-space stored as nodes in a 2-d tree. [The figure is not reproducible in this copy. Its upper half plots the records A(50,50), B(10,70), C(80,85), D(25,20), E(40,85), and F(70,85) in the square with corners (0,0) and (100,100), with boxes representing the range of each subtree; its lower half is a planar graph representation of the same 2-d tree, in which LOSONs are drawn as left branches, HISONs as right branches, and null sons as boxes, with the discriminator labeled at each level starting from 0.]

Let us define node Q to be j-greater than node P if and only if SUCCESSOR(P, Q) is HISON, and to be j-less than node R if and only if SUCCESSOR(R, Q) is LOSON. It is obvious how these definitions can be used to define the j-maximum and j-minimum elements of a collection of nodes. The j-distance between nodes P and Q is |K_j(P) − K_j(Q)|, and the j-nearest node of a set S to node P is the node Q in S with minimum j-distance between P and Q. If a multiplicity of the nodes have minimum j-distance from P, then the j-nearest node is defined as the j-minimum element in the subset of closest nodes which are j-greater than P (or, similarly, the j-maximum element of the nodes which are j-less than P).

We will typically use n for the number of records in the file which is stored as a k-d tree, and therefore the number of nodes in the tree. We have already used k as the dimensionality of the records. The keys of all nodes in the subtree of any node, say P, in a k-d tree are known by P's position in the tree to be bounded by certain values. For instance, if P is in the HISON subtree of Q, and DISC(Q) is j, then all nodes in P's subtree are j-greater than Q; so for any R in that subtree, K_j(R) ≥ K_j(Q).
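The superkey tie-breaking rule lends itself to a short sketch (Python; the function names are ours). Rather than literally concatenating key values, the cyclic rotation is kept as a tuple and compared lexicographically, which examines K_j first, then K_{j+1}, and so on around the cycle, exactly the order S_j prescribes.

```python
def superkey(keys, j):
    """S_j: the key tuple rotated to start at position j."""
    return keys[j:] + keys[:j]

def successor(p_keys, q_keys, j):
    """Which son of P to visit while searching for Q, where j = DISC(P).
    Returns 'LOSON', 'HISON', or 'EQUAL' (all k keys identical)."""
    if q_keys[j] < p_keys[j]:
        return 'LOSON'
    if q_keys[j] > p_keys[j]:
        return 'HISON'
    # The K_j values tie: fall back to comparing the full superkeys.
    sq, sp = superkey(q_keys, j), superkey(p_keys, j)
    if sq < sp:
        return 'LOSON'
    if sq > sp:
        return 'HISON'
    return 'EQUAL'
```

For example, with P = (80, 85) discriminating on key 1, the record Q = (70, 85) ties on key 1 but has the smaller superkey, (85, 70) < (85, 80), so SUCCESSOR sends the search to LOSON.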
To employ this knowledge in our algorithms, we will define a bounds array to hold the information. If B is a bounds array associated with node P, then B has 2k entries, B(0), …, B(2k − 1). If Q is a descendant of P, then it is true for all integers j ∈ [0, k − 1] that B(2j) ≤ K_j(Q) ≤ B(2j + 1).

[Several pages of this copy are missing here, covering the remainder of Section 2, the insertion algorithm of Section 3, and the opening of the partial match discussion of Section 4. The text resumes mid-sentence in the description of the partial match search, in which t of the k keys are specified, the indices of the specified keys are the s_i, and J denotes the discriminator of the current node P:] … if the specified value is greater than K_J(P) we go down HISON(P), and if the values are equal we go down one of the subtrees depending on the definition of SUCCESSOR. If J ∉ {s_i} then we must continue the search down both of P's subtrees. (Implicit in the algorithm is the fact that the search continues down a subtree only if that subtree is non-null, i.e. P's son field is not Λ.)

Let us now analyze the number of nodes visited by the algorithm in doing a partial match search with t keys specified in an "ideal" k-d tree of n nodes: a tree in which n = 2^(kh) − 1 and all leaves appear on level kh. (It is shown in Section 6 that such a tree is attainable for any collection of 2^(kh) − 1 nodes by using algorithm OPTIMIZE.) The search algorithm starts by visiting one node, the root, and the number of nodes visited on the next level grows by a factor of one (if J ∈ {s_i}, we go down only one subtree of each node, hence visiting the same number on the next level) or two (if J ∉ {s_i}, we have to visit both subtrees of the node, thereby visiting twice as many nodes on the next level). Since the growth rate at any level is a function of the discriminator of that level, and the discriminators are cyclic with cycle k, the growth rate will be cyclic as well. The pessimal arrangement of keys is to have the unspecified keys be the first k − t keys in the cycle, postponing any pruning until after a small geometric explosion. The maximum number of nodes visited during the whole search will therefore be the sum of the following series, in which each line corresponds to one of the h cycles of k levels (the first k − t terms of a line come from the doubling levels, the last t terms from the constant levels):

1 + 2 + ⋯ + 2^(k−t−1) + 2^(k−t−1) + ⋯ + 2^(k−t−1)
+ 2^(k−t) + ⋯ + 2^(2(k−t)−1) + 2^(2(k−t)−1) + ⋯ + 2^(2(k−t)−1)
+ ⋯
+ 2^((h−1)(k−t)) + ⋯ + 2^(h(k−t)−1) + 2^(h(k−t)−1) + ⋯ + 2^(h(k−t)−1)
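The partial match search described above can be sketched as a short recursive generator. This is an illustrative reconstruction, not the paper's algorithm text: a node is a plain tuple (keys, loson, hison), `spec` maps the specified key positions to their required values, and for simplicity a tie on a specified key follows HISON (matching a tie-toward-HISON insertion rule rather than the paper's superkey).

```python
def partial_match(node, spec, disc=0, k=2):
    """Yield every record whose keys agree with spec at the specified
    positions; an unspecified discriminator forces both subtrees."""
    if node is None:
        return
    keys, loson, hison = node
    if all(keys[j] == v for j, v in spec.items()):
        yield keys
    nxt = (disc + 1) % k                      # NEXTDISC
    if disc in spec:
        # Discriminator is specified: at most one subtree can match.
        child = loson if spec[disc] < keys[disc] else hison
        yield from partial_match(child, spec, nxt, k)
    else:
        # Discriminator unspecified: the visit count can double here.
        yield from partial_match(loson, spec, nxt, k)
        yield from partial_match(hison, spec, nxt, k)
```

On a 2-d tree a query such as {1: 85} specifies t = 1 of k = 2 keys; levels discriminating on key 0 double the number of nodes visited while levels on key 1 prune, which is exactly the alternating growth pattern the series counts.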
We can now sum this series to calculate V(n, t), the maximum number of nodes visited during a partial match search with t keys specified in an ideal tree of n nodes. (For brevity, we will here define m = k − t.)

V(n, t) = Σ_{i=0}^{h−1} 2^(mi) · [(Σ_{j=0}^{m−1} 2^j) + 2^(m−1) · t]

Each bracketed factor is the cost of one cycle of k levels: the inner sum counts the m doubling levels and the term 2^(m−1) · t counts the t constant levels. Summing the outer geometric series and substituting 2^(kh) = n + 1 shows that V(n, t) = O(n^((k−t)/k)).

[Several pages of this copy are missing here; the text resumes inside the ALGOL procedures used by the region search, in which the query hyper-rectangle is described by the array RECDEF, stored in the same layout as a bounds array. The first procedure, a point-in-region test, survives only in its closing lines.]

```
    … > RECDEF(2·I + 1) then return false
  end;
  return true
end;

boolean procedure BOUNDS_INTERSECT_REGION(array B);
begin
  comment returns true iff the hyper-rectangle defined by bounds array B
    intersects the hyper-rectangle defined by RECDEF;
  for I := 0 step 2 until 2·(k − 1) do
  begin
    if B(I) > RECDEF(I + 1) then return false;
    if B(I + 1) < RECDEF(I) then return false
  end;
  return true
end;
```

[The text resumes in Section 7, in the middle of a discussion of the requirements an application must meet:] … 2^(2k). Any application in which there is a multiplicity of keys, with no key inherently primary, and which fits the two basic requirements mentioned above, is a potential application for k-d trees. We will investigate some specifics of two such examples.

### 7.1 Applications in Information Retrieval Systems

Consider a terminal-oriented information retrieval system involving a file whose records are cities on a map, say of the continental United States. The cities could be stored as nodes in a k-d tree with latitude and longitude serving as the keys. Queries could take on many forms. An exact match query might be "What is the city at latitude 43°3′ N and longitude 88° W?" One could ask the partial match query "What are all cities with latitude 39°43′ N?" to find all cities on the Mason-Dixon line. To find all cities in the Oklahoma Panhandle one could pose a region query defining a rectangle bounded by latitudes in the range 36°30′ to 37° and longitudes in the range 100° to 103°.
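The two region-search predicates can be rendered in Python as a sketch, assuming RECDEF and each bounds array share the flat layout lower_0, upper_0, lower_1, upper_1, …; the function names mirror the ALGOL procedures, but the rendering is ours.

```python
def in_region(keys, recdef):
    """True iff the point lies inside the query hyper-rectangle."""
    return all(recdef[2 * j] <= keys[j] <= recdef[2 * j + 1]
               for j in range(len(keys)))

def bounds_intersect_region(b, recdef):
    """True iff a subtree's bounding box overlaps the query region;
    when False, the whole subtree can be pruned from the search."""
    for i in range(0, len(b), 2):
        if b[i] > recdef[i + 1]:       # box lies wholly above the region
            return False
        if b[i + 1] < recdef[i]:       # box lies wholly below the region
            return False
    return True
```

For the Oklahoma Panhandle query, with latitude and longitude as the two keys, RECDEF would be (36.5, 37.0, 100.0, 103.0), writing 36°30′ as 36.5°.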
The nearest neighbor algorithm would be able to answer the query "Which is the closest city to Durham, North Carolina?" It might be the case that not all cities were to be stored in the file, but only those in which there was some scarce commodity (for instance, an automobile rental agency might wish to ask "What is the closest city to Los Angeles in which there is an available automobile?"). In such a dynamic file, cities could be inserted in the tree as the commodity became available and deleted as it became unavailable; the file could then be optimized if, after much activity, it became too unbalanced. The optimization algorithm could also be used, for example, if the nodes were sorted by state at the time of file creation. Using random insertion in this case would probably produce a terribly unbalanced tree.

### 7.2 Applications in Speech Recognition

Most speech recognition systems being built now have fairly small vocabularies, on the order of 100 words. As the size of vocabularies increases, k-d trees could play an important role in identifying spoken "unknown utterances" as words in the vocabulary. When an utterance of the speaker enters the system, it is decomposed into a fixed number of "features." As an example, the speech might be passed through a bank of bandpass filters, and the amplitude variations as functions of time of each filter's output would together comprise the features. Each word (or "utterance class") in the vocabulary is represented by a "template" which consists of a description of its features. A recognizer must find which template most closely matches the unknown utterance, and report that as the most probable word spoken by the speaker. If the te[… the text of this copy breaks off here.]