# Locality-constrained Linear Coding for Image Classification

Jinjun Wang†, Jianchao Yang‡, Kai Yu§, Fengjun Lv§, Thomas Huang‡, and Yihong Gong†

†Akiira Media System, Palo Alto, California
‡Beckman Institute, University of Illinois at Urbana-Champaign
§NEC Laboratories America, Inc., Cupertino, California

†{jjwang, ygong}@akiira.com, ‡{jyang29, huang}@ifp.uiuc.edu, §{kyu, flv}@sv.nec-labs.com

## Abstract

The traditional SPM approach based on bag-of-features (BoF) requires nonlinear classifiers to achieve good image classification performance. This paper presents a simple but effective coding scheme called Locality-constrained Linear Coding (LLC) in place of the VQ coding in traditional SPM. LLC utilizes the locality constraints to project each descriptor into its local-coordinate system, and the projected coordinates are integrated by max pooling to generate the final representation. With a linear classifier, the proposed approach performs remarkably better than the traditional nonlinear SPM, achieving state-of-the-art performance on several benchmarks.

Compared with the sparse coding strategy [22], the objective function used by LLC has an analytical solution. In addition, the paper proposes a fast approximated LLC method by first performing a K-nearest-neighbor search and then solving a constrained least square fitting problem, bearing computational complexity of $O(M + K^2)$. Hence even with very large codebooks, our system can still process multiple frames per second. This efficiency significantly adds to the practical value of LLC for real applications.

## 1. Introduction

The recent state-of-the-art image classification systems consist of two major parts: bag-of-features (BoF) [19, 4] and spatial pyramid matching (SPM) [15]. The BoF method represents an image as a histogram of its local features. It is especially robust against spatial translations of features, and demonstrates decent performance in whole-image categorization tasks.
However, the BoF method disregards the information about the spatial layout of features, and hence is incapable of capturing shapes or locating an object. Among the many extensions of the BoF method, including generative part models [7, 3, 2], geometric correspondence search [1, 14] and discriminative codebook learning [13, 17, 23], the most successful results were reported by using SPM [15].

*Figure 1. Left: flowchart of the spatial pyramid structure for pooling features for image classification (feature extraction, coding, pooling, SPM concatenation). Right: the proposed LLC coding process for an input $x_i$ and codebook $B = \{b_j\}_{j=1,\dots,M}$. Step 1: find the K nearest neighbors of $x_i$, denoted $B_i$. Step 2: reconstruct $x_i$ using $B_i$ via $c^* = \arg\min_c \|x_i - c^\top B_i\|^2$ s.t. $\sum_j c_j = 1$. Step 3: $c_i$ is an $M \times 1$ vector with $K$ non-zero elements whose values are the corresponding $c^*$ of Step 2.*

Motivated by [10], the SPM method partitions the image into increasingly finer spatial sub-regions and computes histograms of local features from each sub-region. Typically, $2^l \times 2^l$ sub-regions, $l = 0, 1, 2$, are used. Other partitions such as $3 \times 1$ have also been attempted to incorporate domain knowledge for images with "sky" on top and/or "ground" on bottom. The resulting "spatial pyramid" is a computationally efficient extension of the orderless BoF representation, and has shown very promising performance on many image classification tasks.

A typical flowchart of the SPM approach based on BoF is illustrated on the left of Figure 1. First, feature points are detected or densely located on the input image, and descriptors such as "SIFT" or "color moment" are extracted from each feature point (highlighted in the blue circle in Figure 1). This obtains the "Descriptor" layer.
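As an illustration of this "Descriptor" layer, the following is a minimal sketch of dense patch sampling: the function name, the grid step, and the toy gradient-orientation descriptor are our own illustrative choices, not the exact SIFT/HOG pipeline used in the paper.

```python
import numpy as np

# Sketch of the "Descriptor" layer: densely sample patch centers on a grid
# and compute a toy gradient-orientation histogram per patch. The step size,
# patch size, and descriptor are illustrative assumptions.

def dense_descriptors(img, step=8, patch=16, bins=8):
    H, W = img.shape
    gy, gx = np.gradient(img.astype(float))   # image gradients
    mag = np.hypot(gx, gy)                    # gradient magnitude
    ori = np.arctan2(gy, gx)                  # orientation in (-pi, pi]
    feats, locs = [], []
    half = patch // 2
    for y in range(half, H - half, step):
        for x in range(half, W - half, step):
            m = mag[y - half:y + half, x - half:x + half].ravel()
            o = ori[y - half:y + half, x - half:x + half].ravel()
            # magnitude-weighted orientation histogram, sum-normalized
            hist, _ = np.histogram(o, bins=bins, range=(-np.pi, np.pi),
                                   weights=m)
            feats.append(hist / (hist.sum() + 1e-12))
            locs.append((x, y))
    return np.array(locs), np.array(feats)

img = np.random.default_rng(0).random((64, 64))
locs, feats = dense_descriptors(img)  # one descriptor per grid point
```

Each grid point yields one fixed-length descriptor vector, which the subsequent coding step then quantizes against the codebook.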
Then, a codebook with M entries is applied to quantize each descriptor and generate the "Code" layer, where each descriptor is converted into an $\mathbb{R}^M$ code (highlighted in the green circle). If hard vector quantization (VQ) is used, each code has only one non-zero element, while for soft-VQ, a small group of elements can be non-zero. Next, in the "SPM" layer, multiple codes from inside each sub-region are pooled together by averaging and normalizing into a histogram. Finally, the histograms from all sub-regions are concatenated together to generate the final representation of the image for classification.

Although the traditional SPM approach works well for image classification, it has been empirically found that, to achieve good performance, traditional SPM has to use classifiers with nonlinear Mercer kernels, e.g., the Chi-square kernel. Accordingly, the nonlinear classifier incurs additional computational cost: $O(n^3)$ in training and $O(n)$ in testing for the SVM, where $n$ is the number of support vectors. This implies poor scalability of the SPM approach for real applications.

To improve the scalability, researchers aim at obtaining nonlinear feature representations that work better with linear classifiers, e.g. [26, 22]. In particular, Yang et al. [22] proposed the ScSPM method, where sparse coding (SC) was used instead of VQ to obtain nonlinear codes. The method achieved state-of-the-art performance on several benchmarks. Yu et al. [24] empirically observed that SC results tend to be local: nonzero coefficients are often assigned to bases near the encoded data. They suggested a modification to SC, called Local Coordinate Coding (LCC), which explicitly encourages the coding to be local, and theoretically pointed out that, under certain assumptions, locality is more essential than sparsity for successful nonlinear function learning using the obtained codes.
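The traditional VQ "Code" layer and histogram pooling described earlier can be sketched as follows; the function names are ours, for illustration, and this is the baseline hard-VQ pipeline, not LLC.

```python
import numpy as np

# Sketch of hard VQ coding plus average pooling into a normalized histogram,
# i.e. the traditional BoF "Code" + pooling steps. Names are illustrative.

def vq_encode(X, B):
    """X: N x D descriptors, B: M x D codebook; returns N x M one-hot codes."""
    d2 = ((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)  # N x M sq. dists
    codes = np.zeros((X.shape[0], B.shape[0]))
    codes[np.arange(X.shape[0]), d2.argmin(axis=1)] = 1.0    # nearest base
    return codes

def histogram_pool(codes):
    """Average the one-hot codes over a (sub-)region and sum-normalize."""
    h = codes.sum(axis=0)
    return h / h.sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))  # 200 descriptors, 16-D
B = rng.normal(size=(32, 16))   # codebook with M = 32 entries
codes = vq_encode(X, B)
h = histogram_pool(codes)       # one histogram per sub-region
```

In the full SPM pipeline, one such histogram is produced per spatial sub-region and the results are concatenated.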
Similar to SC, LCC requires solving an $\ell_1$-norm optimization problem, which is computationally expensive.

In this paper, we present a novel and practical coding scheme called Locality-constrained Linear Coding (LLC), which can be seen as a fast implementation of LCC that utilizes the locality constraint to project each descriptor into its local-coordinate system. Experimental results show that the final representation (Figure 1) generated by using the LLC code can achieve an impressive image classification accuracy even with a linear SVM classifier. In addition, the optimization problem used by LLC has an analytical solution, where the computational complexity is only $O(M + M^2)$ for each descriptor. We further propose an approximated LLC method by performing a K-nearest-neighbor (K-NN) search and then solving a constrained least square fitting problem. This further reduces the computational complexity to $O(M + K^2)$, where $K$ is the number of nearest neighbors, and usually $K \ll M$.

In the ScSPM approach [22], each descriptor $x_i \in \mathbb{R}^D$ is instead encoded by sparse coding over the codebook $B \in \mathbb{R}^{D \times M}$:

$$\min_C \sum_{i=1}^{N} \|x_i - B c_i\|^2 + \lambda \|c_i\|_1. \qquad (2)$$

Sparse coding has several advantages over VQ: first, the codebook $B$ is usually over-complete, i.e., $M > D$, and hence $\ell_1$ regularization is necessary to ensure that the under-determined system has a unique solution; second, the sparsity prior allows the learned representation to capture salient patterns of local descriptors; third, sparse coding can achieve much lower quantization error than VQ. Accordingly, even with a linear SVM classifier, ScSPM can outperform the nonlinear SPM approach by a large margin on benchmarks like Caltech-101 [22].

## 2.3. Coding descriptors in LLC

In this paper, we present a new coding algorithm called Locality-constrained Linear Coding (LLC). As suggested by LCC [24], locality is more essential than sparsity, as locality must lead to sparsity but not necessarily vice versa. LLC incorporates a locality constraint instead of the sparsity constraint in Eq. (2), which leads to several favorable properties, as explained in Subsection 2.4.
Specifically, the LLC code uses the following criterion:

$$\min_C \sum_{i=1}^{N} \|x_i - B c_i\|^2 + \lambda \|d_i \odot c_i\|^2 \qquad (3)$$
$$\text{s.t.}\quad \mathbf{1}^\top c_i = 1,\ \forall i,$$

where $\odot$ denotes element-wise multiplication, and $d_i \in \mathbb{R}^M$ is the locality adaptor that gives a different freedom to each basis vector proportional to its similarity to the input descriptor $x_i$. Specifically,

$$d_i = \exp\left(\frac{\mathrm{dist}(x_i, B)}{\sigma}\right), \qquad (4)$$

where $\mathrm{dist}(x_i, B) = [\mathrm{dist}(x_i, b_1), \dots, \mathrm{dist}(x_i, b_M)]^\top$, and $\mathrm{dist}(x_i, b_j)$ is the Euclidean distance between $x_i$ and $b_j$. $\sigma$ is used for adjusting the weight decay speed of the locality adaptor. Usually we further normalize $d_i$ to be between $(0, 1]$ by subtracting $\max\big(\mathrm{dist}(x_i, B)\big)$ from $\mathrm{dist}(x_i, B)$. The constraint $\mathbf{1}^\top c_i = 1$ follows the shift-invariance requirements of the LLC code. Note that the LLC code in Eq. (3) is not sparse in the sense of the $\ell_0$ norm, but is sparse in the sense that the solution has only a few significant values. In practice, we simply threshold the small coefficients to zero.

## 2.4. Properties of LLC

To achieve good classification performance, the coding scheme should generate similar codes for similar descriptors. Following this requirement, the locality regularization term $\|d_i \odot c_i\|^2$ in Eq. (3) presents several attractive properties:

*Figure 2. Comparison among VQ, SC and LLC when encoding an input $x_i$ with codebook $B = \{b_j\}_{j=1,\dots,M}$. The selected bases for representation are highlighted in red.*

1. **Better reconstruction.** In VQ, each descriptor is represented by a single basis in the codebook, as illustrated in Fig. 2.a. Due to the large quantization errors, the VQ codes for similar descriptors might be very different. Besides, the VQ process ignores the relationships between different bases. Hence nonlinear kernel projection is required to make up for such information loss.
   In contrast, as shown in Fig. 2.c, in LLC each descriptor is more accurately represented by multiple bases, and LLC codes capture the correlations between similar descriptors by sharing bases.

2. **Local smooth sparsity.** Similar to LLC, SC also achieves less reconstruction error by using multiple bases. Nevertheless, the $\ell_1$ regularization term in SC is not smooth. As shown in Fig. 2.b, due to the over-completeness of the codebook, the SC process might select quite different bases for similar patches to favor sparsity, thus losing correlations between codes. In contrast, the explicit locality adaptor in LLC ensures that similar patches will have similar codes.

3. **Analytical solution.** Solving SC usually requires computationally demanding optimization procedures. For instance, the Feature-Sign algorithm utilized by Yang et al. [22] has a computational complexity of $O(M \times K)$ in the optimal case [16], where $K$ denotes the number of non-zero elements. Unlike SC, the solution of LLC can be derived analytically by

$$\tilde{c}_i = \big(C_i + \lambda\,\mathrm{diag}(d)\big) \backslash \mathbf{1}, \qquad (5)$$
$$c_i = \tilde{c}_i / (\mathbf{1}^\top \tilde{c}_i), \qquad (6)$$

where $C_i = (B - \mathbf{1} x_i^\top)(B - \mathbf{1} x_i^\top)^\top$ denotes the data covariance matrix. As seen in Section 3, LLC can be performed very fast in practice.

## 3. Approximated LLC for Fast Encoding

The LLC solution has only a few significant values, or equivalently, solving Eq. (3) actually performs feature selection: it selects the local bases for each descriptor to form a local coordinate system. This suggests that we can develop an even faster approximation of LLC to speed up the encoding process. Instead of solving Eq. (3), we can simply use the $K$ ($K < M$) nearest neighbors of $x_i$ as the local bases $B_i$, and solve a much smaller constrained least square fitting problem to get the codes.

The later steps of the incremental codebook update are:

     9: id ← {j : |c̃(j)| > 0.01},  B_i ← B(:, id)
    10: c̃_i ← argmin_c ‖x_i − B_i c‖²  s.t. Σ_j c(j) = 1   {update basis}
    11: ΔB_i ← −2 c̃_i (x_i − B_i c̃_i),  μ ← √(1/i)
    12: B_i ← B_i − μ ΔB_i / ‖c̃_i‖²
    13: B(:, id) ← proj(B_i)
    14: end for
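A minimal sketch of the approximated LLC encoder described above (K-NN search followed by a small constrained least-squares solve) is given below; the small ridge term `beta` is our own numerical-stability assumption, not something specified in the text.

```python
import numpy as np

# Sketch of approximated LLC encoding for one descriptor: find the K nearest
# codebook bases, solve the small constrained least-squares problem on them,
# and scatter the weights into an M-dimensional code. The ridge term `beta`
# is an illustrative stability assumption.

def llc_approx(x, B, K=5, beta=1e-4):
    dist = np.linalg.norm(B - x, axis=1)             # distance to every base
    idx = np.argsort(dist)[:K]                       # K nearest bases
    z = B[idx] - x                                   # shift bases to origin
    C = z @ z.T                                      # K x K local covariance
    C += np.eye(K) * beta * max(np.trace(C), 1e-12)  # regularize C
    w = np.linalg.solve(C, np.ones(K))               # C w = 1
    w /= w.sum()                                     # enforce 1'c = 1
    code = np.zeros(B.shape[0])
    code[idx] = w                                    # K non-zero entries
    return code

rng = np.random.default_rng(2)
B = rng.normal(size=(64, 8))   # M = 64 bases, 8-D descriptors
x = rng.normal(size=8)
c = llc_approx(x, B)
```

Because only a K x K system is solved per descriptor, the cost beyond the distance computation is independent of the codebook size M.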
## 5. Experimental Results

In this section, we report results on three widely used datasets: Caltech-101 [7], Caltech-256 [11], and Pascal VOC 2007 [6]. We used only a single descriptor, the Histogram of Oriented Gradients (HOG) [5], throughout the experiments. In our setup, the HOG features were extracted from patches densely sampled every 8 pixels on the image, at three scales, 16 × 16, 25 × 25 and 31 × 31, respectively. The dimension of each HOG descriptor is 128. During LLC processing, only the approximated LLC was used, and the number of neighbors was set to 5 (Section 3) with the shift-invariant constraint.

In the "SPM" layer, for each spatial sub-region, the codes of the descriptors (e.g., VQ codes, SC codes or LLC codes) are pooled together to get the corresponding pooled feature. These pooled features from each sub-region are concatenated and normalized to form the final image feature representation. Specifically, two pooling methods have been used:

- sum pooling [15]: $c_{\text{out}} = c_{\text{in}_1} + \dots + c_{\text{in}_2}$
- max pooling [22]: $c_{\text{out}} = \max(c_{\text{in}_1}, \dots, c_{\text{in}_2})$

where "max" functions in a row-wise manner, returning a vector of the same size as $c_{\text{in}_1}$. These pooled features can then be normalized by

- sum normalization: $c_{\text{out}} = c_{\text{in}} / \sum_j c_{\text{in}}(j)$
- $\ell_2$ normalization: $c_{\text{out}} = c_{\text{in}} / \|c_{\text{in}}\|_2$

Different combinations can be explored; e.g., "sum pooling" followed by "sum normalization" with VQ codes produces the histogram. In our LLC framework, we used "max pooling" combined with "$\ell_2$ normalization", as in [22].

All the experiments were conducted on a Dell PowerEdge 1950 server with 16GB memory and a 2.5GHz Quad-Core CPU.

## 5.1. Caltech-101

The Caltech-101 dataset [7] contains 9144 images in 101 classes, including animals, vehicles, flowers, etc., with significant variance in shape. The number of images per category varies from 31 to 800.
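The "max pooling" plus "$\ell_2$ normalization" combination adopted earlier in this section can be sketched as follows (the function name is ours, for illustration):

```python
import numpy as np

# Sketch of the pooling/normalization used in the LLC framework: element-wise
# (row-wise) max over all codes in a sub-region, then l2 normalization.

def max_pool_l2(codes):
    """codes: N x M array of codes from one sub-region; returns an M-vector."""
    pooled = codes.max(axis=0)                        # element-wise max
    return pooled / max(np.linalg.norm(pooled), 1e-12)  # l2 normalize

codes = np.array([[0.2, 0.0, 0.8],
                  [0.5, 0.1, 0.0]])
f = max_pool_l2(codes)   # max gives [0.5, 0.1, 0.8], then l2-normalized
```

The per-sub-region vectors produced this way are concatenated across all SPM sub-regions to form the final image representation.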
As suggested by the original dataset [7] and also by many other researchers [25, 11], we partitioned the whole dataset into 5, 10, ..., 30 training images per class and no more than 50 testing images per class, and measured the performance using average accuracy over 102 classes (i.e., 101 classes and a "background" class). We trained a codebook with 2048 bases, and used 4 × 4, 2 × 2 and 1 × 1 sub-regions for SPM. In the experiment, the images were resized to be no larger than 300 × 300 pixels with preserved aspect ratio.

In our evaluation, a total of 13 classes achieve 100% classification accuracy with 30 training images per class. Figure 5.1 illustrates five of these 13 classes that have more than 10 testing images. We compared our result with several existing approaches. Detailed results are shown in Table 1; in most cases, our proposed LLC method leads the performance. In addition, the average processing time of our method for generating the final representation from a raw image input is only 0.24 seconds.

Table 1. Image classification results on the Caltech-101 dataset

| training images | 5 | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|
| Zhang [25] | 46.6 | 55.8 | 59.1 | 62.0 | - | 66.20 |
| Lazebnik [15] | - | - | 56.40 | - | - | 64.60 |
| Griffin [11] | 44.2 | 54.5 | 59.0 | 63.3 | 65.8 | 67.60 |
| Boiman [2] | - | - | 65.00 | - | - | 70.40 |
| Jain [12] | - | - | 61.00 | - | - | 69.10 |
| Gemert [8] | - | - | - | - | - | 64.16 |
| Yang [22] | - | - | 67.00 | - | - | 73.20 |
| Ours | 51.15 | 59.77 | 65.43 | 67.74 | 70.16 | 73.44 |

*(Figure: example classes with 100% accuracy: accordion, car, leopards, rooster, chair.)*

## 5.2. Caltech-256

The Cal