A performance evaluation of local descriptors

Krystian Mikolajczyk, Cordelia Schmid

To cite this version: Krystian Mikolajczyk, Cordelia Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers (IEEE), 2005, 27 (10), pp. 1615-1630.

HAL Id: inria-00548529
https://hal.inria.fr/inria-00548529
Submitted on 20 Dec 2010

Krystian Mikolajczyk, Dept. of Engineering Science, University of Oxford, Oxford, OX1 3PJ, United Kingdom
Cordelia Schmid, INRIA Rhône-Alpes, 655, av. de l'Europe, 38330 Montbonnot, France

Abstract

In this paper we compare the performance of descriptors computed for local interest regions, as for example extracted by the Harris-Affine detector [32]. Many different descriptors have been proposed in the literature. However, it is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [3], steerable filters [12], PCA-SIFT [19], differential invariants [20], spin images [21], SIFT [26], complex filters [37], moment invariants [43], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor, and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low-dimensional descriptors.

Index Terms: Local descriptors, interest points, interest regions, invariance, matching, recognition.

Corresponding author is K. Mikolajczyk. Draft of February 23, 2005.

I. INTRODUCTION

Local photometric descriptors computed for interest regions have proved to be very successful in applications such as wide baseline matching [37, 42], object recognition [10, 25], texture recognition [21], image retrieval [29, 38], robot localization [40], video data mining [41], building panoramas [4], and recognition of object categories [8, 9, 22, 35]. They are distinctive, robust to occlusion and do not require segmentation. Recent work has concentrated on making these descriptors invariant to image transformations. The idea is to detect image regions covariant to a class of transformations, which are then used as support regions to compute invariant descriptors.

Given invariant region detectors, the remaining questions are which is the most appropriate descriptor to characterize the regions, and does the choice of the descriptor depend on the region detector. There is a large number of possible descriptors and associated distance measures which emphasize different image properties like pixel intensities, color, texture, edges etc. In this work we focus on descriptors computed on gray-value images.

The evaluation of the descriptors is performed in the context of matching and recognition of the same scene or object observed under different viewing conditions. We have selected a number of descriptors which have previously shown a good performance in such a context, and we compare them using the same evaluation scenario and the same test data. The evaluation criterion is recall-precision, i.e. the number of correct and false matches between two images. Another possible evaluation criterion is the ROC (Receiver Operating Characteristics) in the context of image retrieval from databases [6, 31]. The detection rate is equivalent to recall but the false positive rate is computed for a database of images instead of a single image pair. It is therefore difficult to predict the actual number of false matches for a pair of similar images.

Local features were also successfully used for object category recognition and classification.
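The recall-precision criterion introduced above can be illustrated with a minimal sketch. It assumes the usual definitions, which this introduction does not spell out: recall is the number of correct matches divided by the number of ground-truth correspondences between the two images, and 1-precision is the number of false matches divided by the total number of matches. The function name and the example counts are hypothetical.

```python
def recall_and_one_minus_precision(n_correct, n_false, n_correspondences):
    """Return (recall, 1 - precision) for one image pair.

    Assumed definitions (not stated in this section):
      recall        = correct matches / ground-truth correspondences
      1 - precision = false matches / all matches returned
    """
    n_matches = n_correct + n_false
    recall = n_correct / n_correspondences if n_correspondences else 0.0
    one_minus_precision = n_false / n_matches if n_matches else 0.0
    return recall, one_minus_precision


# Hypothetical counts for one image pair at one matching threshold.
recall, omp = recall_and_one_minus_precision(
    n_correct=30, n_false=10, n_correspondences=60)
print(f"recall={recall:.2f}, 1-precision={omp:.2f}")
```

Sweeping the matching threshold and collecting these pairs would trace a recall versus 1-precision curve for one descriptor on one image pair.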
The comparison of descriptors for object category recognition requires a different evaluation setup. However, it is unclear how to select a representative set of images for an object category and how to prepare the ground truth, since there is no linear transformation relating images within a category. A possible solution is to manually select a few corresponding points and apply loose constraints to verify correct matches, as proposed in [18].

In this paper the comparison is carried out for different descriptors, different interest regions and for different matching approaches. Compared to our previous work [31], this paper performs a more exhaustive evaluation and introduces a new descriptor. Several descriptors and detectors have been added to the comparison and the data set contains a larger variety of scene types and transformations. We have modified the evaluation criterion and now use recall-precision for image pairs. The ranking of the top descriptors is the same as in the ROC-based evaluation [31]. Furthermore, our new descriptor, gradient location and orientation histogram (GLOH), which is an extension of the SIFT descriptor, is shown to outperform SIFT as well as the other descriptors.

A. Related work

Performance evaluation has gained more and more importance in computer vision [7]. In the context of matching and recognition several authors have evaluated interest point detectors [14, 30, 33, 39]. The performance is measured by the repeatability rate, that is, the percentage of points simultaneously present in two images. The higher the repeatability rate between two images, the more points can potentially be matched and the better the matching and recognition results.

Very little work has been done on the evaluation of local descriptors in the context of matching and recognition.
Carneiro and Jepson [6] evaluate the performance of point descriptors using ROC (Receiver Operating Characteristics). They show that their phase-based descriptor performs better than differential invariants. In their comparison interest points are detected by the Harris detector and the image transformations are generated artificially. Recently, Ke and Sukthankar [19] have developed a descriptor similar to the SIFT descriptor. It applies Principal Components Analysis (PCA) to the normalized image gradient patch and performs better than the SIFT descriptor on artificially generated data. The criterion recall-precision and image pairs were used to compare the descriptors.

Local descriptors (also called filters) have also been evaluated in the context of texture classification. Randen and Husoy [36] compare different filters for one texture classification algorithm. The filters evaluated in this paper are Laws masks, Gabor filters, wavelet transforms, DCT, eigenfilters, linear predictors and optimized finite impulse response filters. No single approach is identified as best. The classification error depends on the texture type and the dimensionality of the descriptors. Gabor filters were in most cases outperformed by the other filters. Varma and Zisserman [44] also compared different filters for texture classification and showed that MRF perform better than Gaussian based filter banks. Lazebnik et al. [21] propose a new invariant descriptor called spin image and compare it with Gabor filters in the context of texture classification. They show that the region-based spin image outperforms the point-based Gabor filter. However, the texture descriptors and the results for texture classification cannot be directly transposed to region descriptors.
The regions often contain a single structure without repeated patterns, and the statistical dependency frequently explored in texture descriptors cannot be used in this context.

B. Overview

In section II we present a state of the art on local descriptors. Section III describes the implementation details for the detectors and descriptors used in our comparison as well as our evaluation criterion and the data set. In section IV we present the experimental results. Finally, we discuss the results.

II. DESCRIPTORS

Many different techniques for describing local image regions have been developed. The simplest descriptor is a vector of image pixels. Cross-correlation can then be used to compute a similarity score between two descriptors. However, the high dimensionality of such a description results in a high computational complexity for recognition. Therefore, this technique is mainly used for finding correspondences between two images. Note that the region can be sub-sampled to reduce the dimension. Recently, Ke and Sukthankar [19] proposed to use the image gradient patch and to apply PCA to reduce the size of the descriptor.

Distribution based descriptors. These techniques use histograms to represent different characteristics of appearance or shape. A simple descriptor is the distribution of the pixel intensities represented by a histogram. A more expressive representation was introduced by Johnson and Hebert [17] for 3D object recognition in the context of range data. Their representation (spin image) is a histogram of the relative positions in the neighborhood of a 3D interest point. This descriptor was recently adapted to images [21]. The two dimensions of the histogram are distance from the center point and the intensity value. Zabih and Woodfill [45] have developed an approach robust to illumination changes.
It relies on histograms of ordering and reciprocal relations between pixel intensities which are more robust than raw pixel intensities. The binary relations between intensities of several neighboring pixels are encoded by binary strings and a distribution of all possible combinations is represented by histograms. This descriptor is suitable for texture representation but a large number of dimensions is required to build a reliable descriptor [34].

Lowe [25] proposed a scale invariant feature transform (SIFT), which combines a scale invariant region detector and a descriptor based on the gradient distribution in the detected regions. The descriptor is represented by a 3D histogram of gradient locations and orientations, see figure 1 for illustration. The contribution to the location and orientation bins is weighted by the gradient magnitude. The quantization of gradient locations and orientations makes the descriptor robust to small geometric distortions and small errors in the region detection. Geometric histogram [1] and shape context [3] implement the same idea and are very similar to the SIFT descriptor. Both methods compute a 3D histogram of location and orientation for edge points where all the edge points have equal contribution in the histogram. These descriptors were successfully used, for example, for shape recognition of drawings for which edges are reliable features.

Spatial-frequency techniques. Many techniques describe the frequency content of an image. The Fourier transform decomposes the image content into the basis functions. However, in this representation the spatial relations between points are not explicit and the basis functions are infinite, therefore difficult to adapt to a local approach. The Gabor transform [13] overcomes these problems, but a large number of Gabor filters is required to capture small changes in frequency and orientation.
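To make the remark about the number of Gabor filters concrete, the following is a minimal sketch of a filter bank, not an implementation taken from [13]. It assumes the common real-valued form of a Gabor kernel (an isotropic Gaussian envelope multiplied by a cosine wave) and an odd kernel size; the frequencies, the number of orientations, and the function names are hypothetical choices.

```python
import numpy as np


def gabor_kernel(size, frequency, theta, sigma):
    """Real (cosine) Gabor kernel: Gaussian envelope times a sinusoid.

    frequency is in cycles per pixel, theta is the orientation in radians.
    size is assumed to be odd so that the kernel has a center pixel.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Project coordinates onto direction theta, along which the wave varies.
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * frequency * x_rot)


def gabor_bank(size, frequencies, n_orientations, sigma):
    """Stack one kernel per (frequency, orientation) pair."""
    thetas = np.arange(n_orientations) * np.pi / n_orientations
    return np.stack([gabor_kernel(size, f, t, sigma)
                     for f in frequencies for t in thetas])


# Hypothetical sampling: 4 frequencies x 8 orientations gives 32 filters.
bank = gabor_bank(size=15, frequencies=[0.05, 0.1, 0.2, 0.4],
                  n_orientations=8, sigma=3.0)
patch = np.random.default_rng(0).random((15, 15))
# One response per filter: inner product of each kernel with the patch.
responses = np.tensordot(bank, patch, axes=([1, 2], [0, 1]))
print(bank.shape, responses.shape)
```

The descriptor length equals the number of frequencies times the number of orientations, so a finer sampling of either quantity directly increases the dimension.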
Gabor filters and wavelets [27] are frequently explored in the context of texture classification.

Differential descriptors. A set of image derivatives computed up to a given order approximates a point neighborhood. The properties of local derivatives (local jet) were investigated by Koenderink [20]. Florack et al. [11] derived differential invariants, which combine components of the local jet to obtain rotation invariance. Freeman and Adelson [12] developed steerable filters, which steer derivatives in a particular direction given the components of the local jet. Steering derivatives in the direction of the gradient makes them invariant to rotation. A stable estimation of the derivatives is obtained by convolution with Gaussian derivatives. Figure 2(a) shows Gaussian derivatives up to order 4. Baumberg [2] and Schaffalitzky and Zisserman [37] proposed to use complex filters derived from the family $K(x, y, \theta) = f(x, y) \exp(i\theta)$, where $\theta$ is the orientation. For the function $f(x, y)$ Baumberg uses Gaussian derivatives and Schaffalitzky and Zisserman apply a polynomial (cf. section III-B and figure 2(b)). These filters differ from the Gaussian derivatives by a linear coordinate change in filter response space.

Other techniques. Generalized moment invariants have been introduced by Van Gool et al. [43] to describe the multi-spectral nature of the image data. The invariants combine central moments defined by $M^{a}_{pq} = \iint_{\Omega} x^{p} y^{q} [I(x, y)]^{a} \, dx \, dy$ with order $p+q$ and degree $a$. The moments characterize shape and intensity distribution in a region $\Omega$. They are independent and can be easily computed for any order and degree.
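The moment definition above can be sketched in discrete form. This is an illustration under stated assumptions rather than the implementation evaluated in this paper: the integral over the region is replaced by a sum over the pixels of a rectangular patch, and coordinates are measured from the patch center. The function name and the example patch are hypothetical.

```python
import numpy as np


def generalized_moment(patch, p, q, a):
    """Discrete approximation of M^a_pq = sum of x^p * y^q * I(x, y)^a.

    Assumption: x and y are measured from the patch center, and the
    region is the whole rectangular patch.
    """
    h, w = patch.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    x -= (w - 1) / 2.0
    y -= (h - 1) / 2.0
    return float(np.sum(x ** p * y ** q * patch.astype(float) ** a))


# Hypothetical 5x5 patch with intensities 0..24.
patch = np.arange(25, dtype=float).reshape(5, 5)

# All moments with order p + q in {1, 2} and degree a in {1, 2}.
moments = {(p, q, a): generalized_moment(patch, p, q, a)
           for a in (1, 2)
           for p in range(3)
           for q in range(3)
           if 1 <= p + q <= 2}
print(len(moments), moments[(1, 0, 1)])
```

Because the exponents p, q and a are free parameters, the same function covers any order and degree, which is the property noted in the text.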
However, the moments of high order and degree are sensitive to small geometric and photometric distortions. Computing the invariants reduces the number of dimensions. These descriptors are therefore more suitable for color images where the invariants can be computed for each color channel and between the channels.

III. EXPERIMENTAL SETUP

In the following we first describe the region detectors used in our comparison and the region normalization necessary for computing the descriptors. We then give implementation details for the evaluated descriptors. Finally, we discuss the evaluation criterion and the image data used in the tests.

A. Supp