PCA-SIFT: A More Distinctive Representation for Local Image Descriptors Yan Ke, Rahul Sukthankar Email:

[email protected],

[email protected] IRP-TR-03-15 November 2003 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL? PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. Copyright ? Intel Corporation 2003 * Other names and brands may be claimed as the property of others. PCA-SIFT: A More Distinctive Representation for Local Image Descriptors Yan Ke1, Rahul Sukthankar1;2 {yke,rahuls}@cs.cmu.edu 1 School of Computer Science, Carnegie Mellon University; 2 Intel Research Pittsburgh http://www.cs.cmu.edu/?yke/pcasift/ Abstract Stable local feature detection and representation is a fun- damental component of many image registration and object recognition algorithms. Mikolajczyk and Schmid [14] re- cently evaluated a variety of approaches and identified the SIFT [11] algorithm as being the most resistant to common image deformations. This paper examines (and improves upon) the local image descriptor used by SIFT. Like SIFT, our descriptors encode the salient aspects of the image gra- dient in the feature point’s neighborhood; however, instead of using SIFT’s smoothed weighted histograms, we apply Principal Components Analysis (PCA) to the normalized gradient patch. Our experiments demonstrate that the PCA- based local descriptors are more distinctive, more robust to image deformations, and more compact than the standard SIFT representation. We also present results showing that using these descriptors in an image retrieval application re- sults in increased accuracy and faster matching. 1. Introduction Local descriptors [6, 12, 18] are commonly employed in a number of real-world applications such as object recogni- tion [3, 11] and image retrieval [13] because they can be computed efficiently, are resistant to partial occlusion, and are relatively insensitive to changes in viewpoint. There are two considerations to using local descriptors in these appli- cations. First, we must localize the interest point in position and scale. Typically, interest points are placed at local peaks in a scale-space search, and filtered to preserve only those that are likely to remain stable over transformations. Sec- ond, we must build a description of the interest point; ide- ally, this description should be distinctive (reliably differen- tiating one interest point from others), concise, and invari- ant over transformations caused by changes in camera pose and lighting. While the localization and description aspects of interest point algorithms are often designed together, the solutions to these two problems are independent [14]. This paper focuses on approaches to the second aspect – the con- struction and evaluation of local descriptor representations. Mikolajczyk and Schmid [14] presented a comparative study of several local descriptors including steerable fil- ters [4], differential invariants [9], moment invariants [18], complex filters [16], SIFT [11], and cross-correlation of dif- ferent types of interest points [6, 13]. Their experiments showed that the ranking of accuracy for the different algo- rithms was relatively insensitive to the method employed to find interest points in the image but was dependent on the representation used to model the image patch around the interest point. Since their best matching results were ob- tained using the SIFT descriptor, this paper focuses on that algorithm and explores alternatives to its local descriptor representation. The remainder of this paper is organized as follows. Sec- tion 2 reviews the relevant aspects of the SIFT algorithm. Section 3 details our PCA-based representation for local features (PCA-SIFT). Section 4 presents our evaluation methodology and performance metrics. Section 5 provides detailed experimental results comparing PCA-SIFT to stan- dard SIFT on feature-matching experiments and also in the context of an image retrieval application. Section 6 exam- ines the reasons behind PCA-SIFT’s accuracy by exploring the role of different components in the representation. Fi- nally, Section 7 summarizes the contributions of this paper and concludes with some ideas for future research in this area. 2. Review of the SIFT Algorithm SIFT, as described in [12], consists of four major stages: (1) scale-space peak selection; (2) keypoint localization; (3) orientation assignment; (4) keypoint descriptor. In the first stage, potential interest points are identified by scan- ning the image over location and scale. This is imple- mented efficiently by constructing a Gaussian pyramid and searching for local peaks (termed keypoints) in a series of difference-of-Gaussian (DoG) images. In the second stage, candidate keypoints are localized to sub-pixel accuracy and eliminated if found to be unstable. The third identifies the dominant orientations for each keypoint based on its local image patch. The assigned orientation(s), scale and loca- 1 tion for each keypoint enables SIFT to construct a canoni- cal view for the keypoint that is invariant to similarity trans- forms. The final stage builds a local image descriptor for each keypoint, based upon the image gradients in its local neighborhood (discussed below in greater detail). The first three stages will not be discussed further in this paper since our work makes no contributions to those areas. The final (keypoint descriptor) stage of the SIFT algo- rithm builds a representation for each keypoint based on a patch of pixels in its local neighborhood. Note that the patch has been previously centered about the keypoint’s lo- cation, rotated on the basis of its dominant orientation and scaled to the appropriate size. The goal is to create a de- scriptor for the patch that is compact, highly distinctive (i.e., patches from different keypoints map to different represen- tations) and yet robust to changes in illumination and cam- era viewpoint (i.e., the same keypoint in different images maps to similar representations). As discussed in [12], ob- vious approaches such as normalized correlation between image patches do not work since they are overly sensitive to registration errors and non-rigid deformations. The stan- dard keypoint descriptor used by SIFT is created by sam- pling the magnitudes and orientations of the image gradient in the patch around the keypoint, and building smoothed orientation histograms to capture the important aspects of the patch. A 4 4 array of histograms, each with 8 orienta- tion bins, captures the rough spatial structure of the patch. This 128-element vector is then normalized to unit length and thresholded to remove elements with small values. The standard SIFT keypoint descriptor representation is noteworthy in several respects: (1) the representation is carefully designed to avoid problems due to boundary ef- fects — smooth changes in location, orientation and scale do not cause radical changes in the feature vector; (2) it is fairly compact, expressing the patch of pixels using a 128 element vector; (3) while not explicitly invariant to affine transformations, the representation is surprisingly re- silient to deformations such as those caused by perspec- tive effects. These characteristics are evidenced in excellent matching performance against competing algorithms [14]. On the other hand, the construction of the standard SIFT feature vector is complicated and the choices behind its spe- cific design (as given in [12]) are not clear. Our initial goal was to explore simpler alternatives and to empirically evalu- ate the tradeoffs. However, as discussed in the remainder of this paper, we discovered that our alternate representation was theoretically simpler, more compact, faster and more accurate than the standard SIFT descriptor. To ensure that our results are an accurate reflection of reality, we use the original SIFT source code and restrict our changes to the fourth stage.1 1The SIFT website reports that a bug was recently fixed in the code, sig- nificantly improving SIFT’s matching performance. The results reported 3. PCA-based SIFT descriptors Our algorithm for local descriptors (termed PCA-SIFT) ac- cepts the same input as the standard SIFT descriptor: the sub-pixel location, scale, and dominant orientations of the keypoint. We extract a 41 41 patch at the given scale, centered over the keypoint, and rotated to align its domi- nant orientation to a canonical direction.2 PCA-SIFT can be summarized in the following steps: (1) pre-compute an eigenspace to express the gradient images of local patches; (2) given a patch, compute its local image gradient; (3) project the gradient image vector using the eigenspace to derive a compact feature vector. This feature vector is sig- nificantly smaller than the standard SIFT feature vector, and can be used with the same matching algorithms. The Euclidean distance between two feature vectors is used to determine whether the two vectors correspond to the same keypoint in different images. Principal Component Analysis (PCA) [7] is a stan- dard technique for dimensionality reduction and has been applied to a broad class of computer vision problems, including feature selection (e.g., [5]), object recognition (e.g., [15]) and face recognition (e.g., [17]). While PCA suffers from a number of shortcomings [8, 10], such as its implicit assumption of Gaussian distributions and its restric- tion to orthogonal linear combinations, it remains popular due to its simplicity. The idea of applying PCA to image patches is not novel (e.g., [3]). Our contribution lies in rig- orously demonstrating that PCA is well-suited to represent- ing keypoint patches (once they have been transformed into a canonical scale, position and orientation), and that this representation significantly improves SIFT’s matching per- formance. PCA-SIFT is detailed in the following subsec- tions. 3.1. Offline computation of patch eigenspace PCA enables us to linearly-project high-dimensional sam- ples onto a low-dimensional feature space. For our applica- tion, this projection (encoded by the patch eigenspace) can be pre-computed once and stored. As discussed above, the input vector is created by con- catenating the horizontal and vertical gradient maps for the 41 41 patch centered at the keypoint. Thus, the input vec- tor has 2 39 39=3042 elements. We then normalize this vector to unit magnitude to minimize the impact of varia- tions in illumination. It is important to note that the 41 41 patch does not span the entire space of pixel values, nor the smaller manifold of patches drawn from natural images; it consists of the highly-restricted set of patches that passed through the first three stages of SIFT. More precisely, each here use the correct (September 2003) version. 2For keypoints with multiple dominant orientations, we build a repre- sentation for each orientation, in the same manner as SIFT. 2 of the patches satisfies the following properties: (1) it is centered on a local maximum in scale-space; (2) it has been rotated so that (one of its) dominant gradient orientations is aligned to be vertical; (3) it only contains information for the scale appropriate to this keypoint – i.e., the 41 41 patch may have been created from a much larger region from the original image. The remaining variations in the input vec- tor are mainly due to the “identity” of the keypoint (i.e., the 3-D scene corresponding to this location) or to unmodeled distortions (such as perspective effects caused by changing camera viewpoint). It is not unreasonable to believe that these remaining variations can be reasonably modeled by low-dimensional Gaussian distributions, enabling PCA to accurately represent them with a compact feature represen- tation. More importantly, projecting the gradient patch onto the low-dimensional space appears to retain the identity- related variation while discarding the distortions induced by other effects. This hypothesis is supported by the ex- perimental evidence discussed in Sections 4 and 6. To build our eigenspace, we ran the first three stages of the SIFT algorithm on a diverse collection of images and collected 21,000 patches. Each was processed as described above to create a 3042-element vector, and PCA was ap- plied to the covariance matrix of these vectors. The matrix consisting of the top n eigenvectors was stored on disk and used as the projection matrix for PCA-SIFT. The images used in building the eigenspace were discarded and not used in any of the matching experiments. 3.2. Feature representation To find the feature vector for a given image patch, we sim- ply create its 3042-element normalized image gradient vec- tor and project it into our feature space using the stored eigenspace. We empirically determined good values for the dimensionality of the feature space, n; most of the results described in this paper use n = 20 (Section 6 discusses the effects of n on performance). The standard SIFT rep- resentation employs 128-element vectors; using PCA-SIFT results in significant space benefits. As discussed above, we use the Euclidean distance be- tween two feature vectors to determine whether the two vectors belong to the same keypoint in different images. Thresholding this distance generates a binary decision, and adjusting this threshold enables one to select the appropriate trade-off between false positives and false negatives. 4. Evaluation First, we discuss the evaluation metrics used to quantify our results. We then outline our experimental setup and discuss the issue of generating ground-truth data. Results are pre- sented in Section 5. 4.1. Evaluation metrics Receiver Operating Characteristics (ROC) and Recall- Precision are both popular metrics in the literature, and are sometimes used interchangeably. Both capture the fact that we want to increase the number of correct positives while minimizing the number of false positives; however, the sub- tle differences between the metrics should dictate the choice of one over the other for specific scenarios. As observed in [2], the former is well-suited for evaluating classifiers, since the false detection rate is well-defined; the latter is better-suited for evaluating detectors, since the number of false detections relative to the total number of detections is correctly expr