# Do Deep Convolutional Nets Really Need to Be Deep and Convolutional?

Under review as a conference paper at ICLR 2017

Gregor Urban¹, Krzysztof J. Geras², Samira Ebrahimi Kahou³, Ozlem Aslan⁴, Shengjie Wang⁵, Abdelrahman Mohamed⁶, Matthai Philipose⁶, Matt Richardson⁶, Rich Caruana⁶

¹UC Irvine, USA; ²University of Edinburgh, UK; ³École Polytechnique de Montréal, CA; ⁴University of Alberta, CA; ⁵University of Washington, USA; ⁶Microsoft Research, USA

## Abstract

Yes, they do. This paper provides the first empirical demonstration that deep convolutional models really need to be both deep and convolutional, even when trained with methods such as distillation that allow small or shallow models of high accuracy to be trained. Although previous research showed that shallow feed-forward nets sometimes can learn the complex functions previously learned by deep nets while using the same number of parameters as the deep models they mimic, in this paper we demonstrate that the same methods cannot be used to train accurate models on CIFAR-10 unless the student models contain multiple layers of convolution. Although the student models do not have to be as deep as the teacher model they mimic, the students need multiple convolutional layers to learn functions of comparable accuracy to the deep convolutional teacher.

## 1 Introduction

Cybenko (1989) proved that a network with a large enough single hidden layer of sigmoid units can approximate any decision boundary. Empirical work, however, suggests that it can be difficult to train shallow nets to be as accurate as deep nets. Dauphin and Bengio (2013) trained shallow nets on SIFT features to classify the large-scale ImageNet dataset and found that it was difficult to train large, high-accuracy, shallow nets. A body of work on deep convolutional nets suggests that, for vision tasks, deeper models are preferred under a fixed parameter budget (e.g. Eigen et al. (2014); He et al. (2015); Simonyan and Zisserman (2014); Srivastava et al. (2015)).
Similarly, Seide et al. (2011) and Geras et al. (2015) show that deeper models are more accurate than shallow models in speech acoustic modeling. More recently, Romero et al. (2015) showed that it is possible to increase the accuracy of models with few parameters by training deeper, thinner nets (FitNets) to mimic much wider nets. Cohen and Shashua (2016) and Liang and Srikant (2016) suggest that the representational efficiency of deep networks scales exponentially with depth, but it is unclear whether this applies only to pathological problems or is also encountered in practice on data sets such as TIMIT and CIFAR.

Ba and Caruana (2014), however, demonstrated that shallow nets sometimes can learn the functions learned by deep nets, even when restricted to the same number of parameters as the deep nets. They did this by first training state-of-the-art deep models, and then training shallow models to mimic the deep models. Surprisingly, and for reasons that are not well understood, the shallow models learned more accurate functions when trained to mimic the deep models than when trained on the original data used to train the deep models. In some cases shallow models trained this way were as accurate as state-of-the-art deep models. But this demonstration was made on the TIMIT speech recognition benchmark. Although their deep teacher models used a convolutional layer, convolution is less important for TIMIT than it is for other domains such as image classification. Ba and Caruana (2014) also presented results on CIFAR-10 showing that a shallow model could learn functions almost as accurate as deep convolutional nets. Unfortunately, the results on CIFAR-10 are less convincing than those for TIMIT.
To train accurate shallow models on CIFAR-10, they had to include at least one convolutional layer in the shallow model, and to increase the number of parameters in the shallow model until it was 30 times larger than the deep teacher model. Despite this, the shallow convolutional student model was several points less accurate than a teacher model that was itself several points less accurate than state-of-the-art models on CIFAR-10.

In this paper we show that the methods Ba and Caruana used to train shallow students to mimic deep teacher models on TIMIT do not work as well on problems such as CIFAR-10, where multiple layers of convolution are required to train accurate teacher models. If the student models have a similar number of parameters as the deep teacher models, high accuracy cannot be achieved without multiple layers of convolution, even when the student models are trained via distillation. To ensure that the shallow student models are trained as accurately as possible, we use Bayesian optimization to thoroughly explore the space of architectures and learning hyperparameters. Although this combination of distillation and hyperparameter optimization allows us to train the most accurate shallow models ever trained on CIFAR-10, the shallow models still are not as accurate as deep models. Our results clearly suggest that deep convolutional nets do, in fact, need to be both deep and convolutional, even when trained to mimic very accurate models via distillation (Hinton et al., 2015).

## 2 Training Shallow Nets to Mimic Deeper Convolutional Nets

In this paper, we revisit the CIFAR-10 experiments of Ba and Caruana (2014). Unlike in that work, here we compare shallow models to state-of-the-art deep convolutional models, and restrict the number of parameters in the shallow student models to be comparable to the number of parameters in the deep convolutional teacher models.
Because we anticipated that our results might be different, we follow their approach closely to eliminate the possibility that the results differ merely because of changes in methodology. Note that the goal of this paper is not to train models that are small or fast, as in Bucila et al. (2006), Hinton et al. (2015), and Romero et al. (2015), but to examine whether shallow models can be as accurate as deep convolutional models given the same parameter budget.

There are many steps required to train shallow student models to be as accurate as possible: train state-of-the-art deep convolutional teacher models, form an ensemble of the best deep models, collect and combine their predictions on a large transfer set, and then train carefully optimized shallow student models to mimic the teacher ensemble. For negative results to be informative, it is important that each of these steps be performed as well as possible. In this section we describe the experimental methodology in detail. Readers familiar with distillation (model compression), training deep models on CIFAR-10, data augmentation, and Bayesian hyperparameter optimization may wish to skip to the empirical results in Section 3.

## 2.1 Model Compression and Distillation

The key idea behind model compression is to train a compact model to approximate the function learned by another larger, more complex model. Bucila et al. (2006) showed how a single neural net of modest size could be trained to mimic a much larger ensemble. Although the small neural nets contained 1000× fewer parameters, often they were as accurate as the large ensembles they were trained to mimic. Model compression works by passing unlabeled data through the large, accurate teacher model to collect the real-valued scores it predicts, and then training a student model to mimic these scores. Hinton et al. (2015) generalized the methods of Bucila et al.
(2006) and Ba and Caruana (2014) by incorporating a parameter to control the relative importance of the soft targets provided by the teacher model compared to the hard targets in the original training data, as well as a temperature parameter that regularizes learning by pushing targets towards the uniform distribution. Hinton et al. (2015) also demonstrated that much of the knowledge passed from the teacher to the student is conveyed as dark knowledge contained in the relative scores (probabilities) of the outputs corresponding to other classes, as opposed to the score given to just the output for the one correct class.

Surprisingly, distillation often allows smaller and/or shallower models to be trained that are nearly as accurate as the larger, deeper models they are trained to mimic, yet these same small models are not as accurate when trained on the 1-hot hard targets in the original training set. The reason for this is not yet well understood. Similar compression and distillation methods have also successfully been used in speech recognition (e.g. Chan et al. (2015); Geras et al. (2015); Li et al. (2014)) and reinforcement learning (e.g. Parisotto et al. (2016); Rusu et al. (2016)). Romero et al. (2015) showed that distillation methods can be used to train small students that are more accurate than the teacher models by making the student models deeper, but thinner, than the teacher model.

## 2.2 Mimic Learning via L2 Regression on Logits

We train shallow mimic nets using data labeled by an ensemble of deep teacher nets trained on the original 1-hot CIFAR-10 training data. The deep teacher models are trained in the usual way using softmax outputs and a cross-entropy cost function.
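The effect of the temperature parameter of Hinton et al. (2015) discussed above — pushing the softmax targets towards the uniform distribution — can be seen in a short sketch. This is our own minimal NumPy illustration, not the authors' code; the logits are made up for the example.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Convert logits to probabilities; larger T flattens the
    distribution towards uniform (Hinton et al., 2015)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([10.0, 5.0, 1.0])  # hypothetical teacher logits
p_hard = softmax_with_temperature(logits, T=1.0)
p_soft = softmax_with_temperature(logits, T=5.0)
# At T=5 the non-target classes receive visibly larger probabilities
# than at T=1, exposing the relative scores ("dark knowledge")
# while the argmax (the predicted class) is unchanged.
```

Raising T increases the entropy of the target distribution without changing the ranking of the classes, which is why the soft targets carry more information about inter-class similarity than 1-hot labels.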
Following Ba and Caruana (2014), the student mimic models are not trained with cross-entropy on the ten values $p_k = e^{z_k} / \sum_j e^{z_j}$ output by the softmax layer of the deep teacher model, but instead are trained on the un-normalized log probability values $z$ (the logits) before the softmax activation. Training on the logarithms of predicted probabilities (logits) helps provide the dark knowledge that regularizes students by placing emphasis on the relationships learned by the teacher model across all of the outputs. As in Ba and Caruana (2014), the student is trained as a regression problem given training data $\{(x^{(1)}, z^{(1)}), \ldots, (x^{(T)}, z^{(T)})\}$:

$$\mathcal{L}(W) = \frac{1}{T} \sum_t \lVert g(x^{(t)}; W) - z^{(t)} \rVert_2^2, \tag{1}$$

where $W$ represents all of the weights in the network, and $g(x^{(t)}; W)$ is the model prediction on the $t$-th training sample.

## 2.3 Using a Linear Bottleneck to Speed Up Training

A shallow net has to have more hidden units in each layer to match the number of parameters in a deep net. Ba and Caruana (2014) found that training these wide, shallow mimic models with backpropagation was slow, and introduced a linear bottleneck layer between the input and non-linear layers to speed learning. The bottleneck layer speeds learning by reducing the number of parameters that must be learned, but does not make the model deeper, because the linear terms can be absorbed back into the non-linear weight matrix after learning. See Ba and Caruana (2014) for details. To match their experiments we use linear bottlenecks when training student models with 0 or 1 convolutional layers, but did not find the linear bottlenecks necessary when training student models with more than 1 convolutional layer.

## 2.4 Bayesian Hyperparameter Optimization

The goal of this work is to determine empirically whether shallow nets can be trained to be as accurate as deep convolutional models using a similar number of parameters in the deep and shallow models.
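The claim in Section 2.3 that the bottleneck's linear terms can be absorbed after training is easy to verify numerically. The sketch below is our own illustration (the dimensions are made up, much smaller than a real student net): a wide weight matrix is factored into a linear bottleneck $U$ followed by $V$, and the composed map $x \mapsto (xU)V$ is shown to be identical to a single absorbed matrix $W = UV$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, k, h = 300, 16, 400   # input dim, bottleneck width, hidden width

# Factored parameterization: linear bottleneck U, then weights V into
# the non-linear layer.  Trainable parameters: d_in*k + k*h, far fewer
# than the d_in*h of an unfactored layer.
U = rng.standard_normal((d_in, k)) * 0.01
V = rng.standard_normal((k, h)) * 0.01

x = rng.standard_normal((5, d_in))   # a small batch of inputs

pre_activation_factored = (x @ U) @ V  # what the net computes in training

# After training: absorb the bottleneck into one matrix, so the deployed
# model is no deeper than an ordinary shallow net.
W = U @ V
pre_activation_absorbed = x @ W

assert np.allclose(pre_activation_factored, pre_activation_absorbed)
```

With these toy sizes the factored form has 300·16 + 16·400 = 11,200 parameters versus 300·400 = 120,000 for the absorbed matrix, which is the source of the training speedup.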
If we succeed in training a shallow model to be as accurate as a deep convolutional model, this provides an existence proof that shallow models can represent and learn the complex functions learned by deep convolutional models. If, however, we are unable to train shallow models to be as accurate as deep convolutional nets, we might fail only because we did not train the shallow nets well enough. In all our experiments we employ Bayesian hyperparameter optimization using Gaussian process regression to ensure that we thoroughly and objectively explore the hyperparameters that govern learning. The implementation we use is Spearmint (Snoek et al., 2012). The hyperparameters we optimize with Bayesian optimization include the initial learning rate, momentum, scaling of the initial random weights, scaling of the inputs, and terms that determine the width of each of the network's layers (i.e. the number of convolutional filters and neurons). More details of the hyperparameter optimization can be found in Sections 2.5, 2.7, 2.8 and in the Appendix.

## 2.5 Training Data and Data Augmentation

The CIFAR-10 (Krizhevsky, 2009) data set consists of natural images from 10 object classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. The dataset is a labeled subset of the 80 million tiny images dataset (Torralba et al., 2008) and is divided into 50,000 train and 10,000 test images. Each image is 32×32 pixels in 3 color channels, yielding input vectors with 3072 dimensions. We prepared the data by subtracting the mean and dividing by the standard deviation of each image vector. We train all models on a subset of 40,000 images and use the remaining 10,000 images as the validation set for the Bayesian optimization. The final trained models thus used only 80% of the theoretically available training data (as opposed to retraining on all of the data after hyperparameter optimization).
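The per-image preprocessing described above (subtracting the mean and dividing by the standard deviation of each flattened image vector) can be sketched in a few lines of NumPy. This is our own illustration, not the authors' code; the `eps` guard against division by zero is an assumption for constant images.

```python
import numpy as np

def standardize_image(img_vec, eps=1e-8):
    """Subtract the mean and divide by the standard deviation of a
    single flattened image vector (e.g. 32*32*3 = 3072 values)."""
    img_vec = np.asarray(img_vec, dtype=float)
    return (img_vec - img_vec.mean()) / (img_vec.std() + eps)

x = np.arange(3072, dtype=float)   # stand-in for one CIFAR-10 image
z = standardize_image(x)
assert abs(z.mean()) < 1e-6 and abs(z.std() - 1.0) < 1e-3
```

Note this is per-image standardization (each vector normalized by its own statistics), not the per-pixel dataset-wide normalization sometimes used for CIFAR-10.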
We employ the HSV-data augmentation technique as described by Snoek et al. (2015). Thus we shift hue, saturation and value by uniform random values: $h \sim U(-D_h, D_h)$, $s \sim U(-D_s, D_s)$, $v \sim U(-D_v, D_v)$. Saturation and value are additionally scaled globally: $a_s \sim U(\tfrac{1}{1+A_s}, 1+A_s)$, $a_v \sim U(\tfrac{1}{1+A_v}, 1+A_v)$. The five constants $D_h, D_s, D_v, A_s, A_v$ are treated as additional hyperparameters in the Bayesian hyperparameter optimization. All training images are mirrored left-right randomly with a probability of 0.5. The input images are further scaled and jittered randomly by cropping windows of size 24×24 up to 32×32 at random locations and then scaling them back to 32×32. The procedure is as follows: we sample an integer value $S \sim U(24, 32)$ and then a pair of integers $x, y \sim U(0, 32-S)$. The resulting transformed image is $R = f_{\mathrm{spline},3}(I[x:x+S,\, y:y+S])$, with $I$ denoting the original image and $f_{\mathrm{spline},3}$ denoting the third-order spline interpolation function that maps the cropped 2D array back to 32×32 (applied to the three color channels separately).

All data augmentations for the teacher models are computed on the fly using different random seeds. For student models trained to mimic the ensemble (see Section 2.7 for details of the ensemble teacher model), we pre-generated 160 epochs worth of randomly augmented training data, evaluated the ensemble's predictions (logits) on these samples, and saved all data and predictions to disk. All student models thus see the same training data in the same order. The parameters for HSV-augmentation in this case had to be selected beforehand; we chose to use the settings found with the best single model ($D_h = 0.06$, $D_s = 0.26$, $D_v = 0.20$, $A_s = 0.21$, $A_v = 0.13$). Pre-saving the logits and augmented data is important to reduce the computational cost at training time, and to ensure that all student models see the same training data.
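The crop-and-rescale jitter above can be sketched as follows. This is our own simplified illustration, not the authors' code: where the paper uses third-order spline interpolation to map the crop back to 32×32, we substitute nearest-neighbor resampling to keep the sketch dependency-free, and we state that substitution plainly.

```python
import numpy as np

def random_crop_rescale(img, out=32, s_min=24, rng=None):
    """Crop a random S x S window (S ~ U(24, 32)) from a 32x32x3 image
    and rescale it back to 32x32.  The paper applies 3rd-order spline
    interpolation per color channel; nearest-neighbor resampling is
    used here as a simplified stand-in."""
    rng = rng or np.random.default_rng()
    S = int(rng.integers(s_min, out + 1))        # S in {24, ..., 32}
    x, y = rng.integers(0, out - S + 1, size=2)  # random top-left corner
    crop = img[x:x + S, y:y + S]                 # shape (S, S, 3)
    idx = np.arange(out) * S // out              # nearest-neighbor grid
    return crop[idx][:, idx]                     # back to (32, 32, 3)

img = np.random.rand(32, 32, 3)                  # stand-in CIFAR-10 image
aug = random_crop_rescale(img, rng=np.random.default_rng(0))
assert aug.shape == (32, 32, 3)
```

In practice the spline interpolation matters for image quality (e.g. `scipy.ndimage.zoom` with `order=3`); the sketch only shows the sampling of $S$, $x$, and $y$ and the crop geometry.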