rk Arenber accepted 15 yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors puter vision applications. Image registration, camera cali- bration, object recognition, and image retrieval are just a represented by a feature vector. This descriptor has to be distinctive and at the same time robust to noise, detection tance. The dimension of the descriptor has a direct impact on the time this takes, and less dimensions are desirable for keeping it su?ciently distinctive. A wide variety of detectors and descriptors have already been proposed in the literature (e.g. [21,24,27,39,25]). Also, detailed comparisons and evaluations on benchmarking datasets have been performed [28,30,31]. Our fast detector and descriptor, called SURF (Speeded-Up Robust * Corresponding author. E-mail address:

[email protected] (A. Ess). Available online at www.sciencedirect.com Computer Vision and Image Understand few. The search for discrete image point correspondences can be divided into three main steps. First, ‘interest points’ are selected at distinctive locations in the image, such as cor- ners, blobs, and T-junctions. The most valuable property of an interest point detector is its repeatability. The repeat- ability expresses the reliability of a detector for finding the same physical interest points under di?erent viewing condi- tions. Next, the neighbourhood of every interest point is fast interest point matching. However, lower dimensional feature vectors are in general less distinctive than their high-dimensional counterparts. It has been our goal to develop both a detector and descriptor that, in comparison to the state-of-the-art, are fast to compute while not sacrificing performance. In order to succeed, one has to strike a balance between the above requirements like simplifying the detection scheme while keeping it accurate, and reducing the descriptor’s size while and descriptors (specifically, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by sim- plifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paperencompasses a detailed description of thedetectorand descriptorand thenexplores thee?ects of the most important param- eters.WeconcludethearticlewithSURF’sapplicationtotwochallenging,yetconversegoals:cameracalibrationasaspecialcaseofimage registration, and object recognition. Our experiments underline SURF’s usefulness in a broad range of topics in computer vision. C211 2007 Elsevier Inc. All rights reserved. Keywords: Interest points; Local features; Feature description; Camera calibration; Object recognition 1. Introduction The task of finding point correspondences between two images of the same scene or object is part of many com- displacements and geometric and photometric deforma- tions. Finally, the descriptor vectors are matched between di?erent images. The matching is based on a distance between the vectors, e.g. the Mahalanobis or Euclidean dis- Speeded-Up Robust Herbert Bay a , Andreas Ess a, * , Tinne a ETH Zurich, BIWI, Sternwartstrasse b K.U. Leuven, ESAT-PSI, Kasteelpa Received 31 October 2006; Available online Abstract This article presents a novel scale- and rotation-invariant detector SURF approximates or even outperforms previously proposed schemes 1077-3142/$ - see front matter C211 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.cviu.2007.09.014 Features (SURF) Tuytelaars b , Luc Van Gool a,b 7, CH-8092 Zurich, Switzerland g 10, B-3001 Leuven, Belgium 5 September 2007 December 2007 and descriptor, coined SURF (Speeded-Up Robust Features). with respect to repeatability, distinctiveness, and robustness, www.elsevier.com/locate/cviu ing 110 (2008) 346–359 Features), was introduced in [4]. It is built on the insights gained from this previous work. In our experiments on these benchmarking datasets, SURF’s detector and descriptor are not only faster, but the former is also more repeatable and the latter more distinctive. We focus on scale and in-plane rotation-invariant detec- torsanddescriptors.Theseseemtoo?eragoodcompromise between feature complexity and robustness to commonly occurring photometric deformations. Skew, anisotropic scaling, and perspective e?ects are assumed to be second order e?ects, that are covered to some degree by the overall robustness of the descriptor. Note that the descriptor can be extended towards a?ne-invariant regions using a?ne normalisation ofthe ellipse (cf.[31]), although this will have animpacton thecomputation time. Extendingthedetector, on the other hand, is less straightforward. Concerning the photometricdeformations,weassumeasimplelinearmodel with a bias (o?set) and contrast change (scale factor). Nei- ther detector nor descriptor use colour information. The article is structured as follows. In Section 2, we give a review over previous work in interest point detection and description. In Section 3, we describe the strategy applied for fast and robust interest point detection. The input image is analysed at di?erent scales in order to guarantee invariance to scale changes. The detected interest points are provided with a rotation and scale-invariant descriptor in Section 4. Furthermore, a simple and e?cient first-line indexing technique, based on the contrast of the interest point with its surrounding, is proposed. In Section 5, some of the available parameters and their e?ects are discussed, including the benefits of an upright version (not invariant to image rotation). We also investi- gate SURF’s performance in two important application scenarios. First, we consider a special case of image regis- tration, namely the problem of camera calibration for 3D reconstruction. Second, we will explore SURF’s applica- tion to an object recognition experiment. Both applications highlight SURF’s benefits in terms of speed and robustness as opposed to other strategies. The article is concluded in Section 6. 2. Related work 2.1. Interest point detection The most widely used detector is probably the Harris corner detector [15], proposed back in 1988. It is based on the eigenvalues of the second moment matrix. However, Harris corners are not scale invariant. Lindeberg [21] intro- duced the concept of automatic scale selection. This allows to detect interest points in an image, each with their own characteristic scale. He experimented with both the deter- minant of the Hessian matrix as well as the Laplacian (which corresponds to the trace of the Hessian matrix) to detect blob-like structures. Mikolajczyk and Schmid [26] H. Bay et al. / Computer Vision and Image refined this method, creating robust and scale-invariant feature detectors with high repeatability, which they coined Harris-Laplace and Hessian-Laplace. They used a (scale- adapted) Harris measure or the determinant of the Hessian matrix to select the location, and the Laplacian to select the scale. Focusing on speed, Lowe [23] proposed to approxi- mate the Laplacian of Gaussians (LoG) by a Di?erence of Gaussians (DoG) filter. Several other scale-invariant interest point detectors have been proposed. Examples are the salient region detec- tor, proposed by Kadir and Brady [17], which maximises the entropy within the region, and the edge-based region detector proposed by Jurie and Schmid [16]. They seem less amenable to acceleration though. Also several a?ne-invari- ant feature detectors have been proposed that can cope with wider viewpoint changes. However, these fall outside the scope of this article. From studying the existing detectors and from published comparisons [29,30], we can conclude that Hessian-based detectors are more stable and repeatable than their Harris- based counterparts. Moreover, using the determinant of the Hessian matrix rather than its trace (the Laplacian) seemsadvantageous,asitfireslessonelongated,ill-localised structures. We also observed that approximations like the DoG can bring speed ata low cost intermsoflost accuracy. 2.2. Interest point description An even larger variety of feature descriptors has been proposed, like Gaussian derivatives [11], moment invari- ants [32], complex features [1], steerable filters [12], phase-based local features [6], and descriptors representing the distribution of smaller-scale features within the interest point neighbourhood. The latter, introduced by Lowe [24], have been shown to outperform the others [28]. This can be explained by the fact that they capture a substantial amount of information about the spatial intensity patterns, while at the same time being robust to small deformations or localisation errors. The descriptor in [24], called SIFT for short, computes a histogram of local oriented gradients around the interest point and stores the bins in a 128D vec- tor (8 orientation bins for each of 4C24 location bins). Various refinements on this basic scheme have been pro- posed. Ke and Sukthankar [18] applied PCA on the gradi- ent image around the detected interest point. This PCA- SIFT yields a 36D descriptor which is fast for matching, but proved to be less distinctive than SIFT in a second comparative study by Mikolajczyk and Schmid [30]; and applying PCA slows down feature computation. In the same paper [30], the authors proposed a variant of SIFT, called GLOH, which proved to be even more distinctive with the same number of dimensions. However, GLOH is computationally more expensive as it uses again PCA for data compression. The SIFT descriptor still seems the most appealing descriptor for practical uses, and hence also the most widely used nowadays. It is distinctive and relatively fast, Understanding 110 (2008) 346–359 347 which is crucial for on-line applications. Recently, Se et al. [37] implemented SIFT on a Field Programmable decrease in performance does not outweigh the advan- tage of fast convolutions brought by the discretisation and cropping. As real filters are non-ideal in any case, and given Lowe’s success with his LoG approximations, we push the approximation for the Hessian matrix even further with box filters (in the right half of Fig. 2). These approximate second order Gaussian derivatives and can be evaluated at a very low computational cost Fig. 2. Left to right: The (discretised and cropped) Gaussian second order partial derivative in y-(L yy ) and xy-direction (L xy ), respectively; our Understanding 110 (2008) 346–359 Gate Array (FPGA) and improved its speed by an order of magnitude. Meanwhile, Grabner et al. [14] also used inte- gral images to approximate SIFT. Their detection step is based on di?erence-of-mean (without interpolation), their description step on integral histograms. They achieve about the same speed as we do (though the description step is constant in speed), but at the cost of reduced quality compared to SIFT. Generally, the high dimensionality of the descriptor is a drawback of SIFT at the matching step. For on-line applications relying only on a regular PC, each one of the three steps (detection, description, matching) has to be fast. An entire body of work is available on speeding up the matching step. All of them come at the expense of getting an approximative matching. Methods include the best- bin-first proposed by Lowe [24], balltrees [35], vocabulary trees [34], locality sensitive hashing [9], or redundant bit vectors [13]. Complementary to this, we suggest the use of the Hessian matrix’s trace to significantly increase the matching speed. Together with the descriptor’s low dimen- sionality, any matching algorithm is bound to perform faster. 3. Interest point detection Our approach for interest point detection uses a very basic Hessian matrix approximation. This lends itself to the use of integral images as made popular by Viola and Jones [41], which reduces the computation time drastically. Integral images fit in the more general framework of box- lets, as proposed by Simard et al. [38]. 3.1. Integral images In order to make the article more self-contained, we briefly discuss the concept of integral images. They allow for fast computation of box type convolution filters. The entry of an integral image I R exT at a location x ?ex;yT T represents the sum of all pixels in the input image I within a rectangular region formed by the origin and x. I R exT? X i6x i?0 X j6y j?0 Iei;jTe1T Once the integral image has been computed, it takes three additions to calculate the sum of the intensities over any upright, rectangular area (see Fig. 1). Hence, the calcu- lation time is independent of its size. This is important in our approach, as we use big filter sizes. 3.2. Hessian matrix-based interest points We base our detector on the Hessian matrix because of its good performance in accuracy. More precisely, we detect blob-like structures at locations where the determi- 348 H. Bay et al. / Computer Vision and Image nant is maximum. In contrast to the Hessian-Laplace detector by Mikolajczyk and Schmid [26], we rely on the determinant of the Hessian also for the scale selection, as done by Lindeberg [21]. Given a point x ?ex;yT in an image I, the Hessian matrix Hex;rT in x at scale r is defined as follows Hex;rT? L xx ex;rT L xy ex;rT L xy ex;rT L yy ex;rT C20C21 ; e2T where L xx ex;rT is the convolution of the Gaussian second order derivative o 2 ox 2 gerT with the image I in point x,and similarly for L xy ex;rTandL yy ex;rT. Gaussians are optimal for scale-space analysis [19,20], but in practice they have to be discretised and cropped (Fig. 2, left half). This leads to a loss in repeatability under image rotations around odd multiples of p 4 . This weakness holds for Hessian-based detectors in general. Fig. 3 shows the repeatability rate of two detectors based on the Hessian matrix for pure image rotation. The repeatability attains a maximum around multiples of p 2 . This is due to the square shape of the filter. Nev- ertheless, the detectors still perform well, and the slight Fig. 1. Using integral images, it takes only three additions and four memory accesses to calculate the sum of intensities inside a rectangular region of any size. approximation for the second order Gaussian partial derivative in y-(D yy ) and xy-direction (D xy ). The grey regions are equal to zero. 3.3. Scale space representation Interest points need to be found at di?erent scales, not least because the search of correspondences often requires their comparison in images where they are seen at di?erent scales. Scale spaces are usually implemented as an image pyramid. The images are repeatedly smoothed with a Gaussian and then sub-sampled in order to achieve a higher level of the pyramid. Lowe [24] subtracts these pyr- amid layers in order to get the DoG (Di?erence of Gaussi- Understanding 110 (2008) 346–359 349 using integral images. The calculation time therefore is independent of the filter s