# Designing Neural Network Architectures Using Reinforcement Learning

Under review as a conference paper at ICLR 2017

**Designing Neural Network Architectures Using Reinforcement Learning**
Bowen Baker, Otkrist Gupta, Nikhil Naik

Despite prior work (Bergstra et al., 2013; Domhan et al., 2015) on automated or computer-aided neural network design, new CNN architectures or network design elements are still primarily developed by researchers using new theoretical insights or intuition gained from experimentation.

In this paper, we seek to automate the process of CNN architecture selection through a meta-modeling procedure based on reinforcement learning. We construct a novel Q-learning agent whose goal is to discover CNN architectures that perform well on a given machine learning task with no human intervention. The learning agent is given the task of sequentially picking layers of a CNN model. By discretizing and limiting the layer parameters to choose from, the agent is left with a finite but large space of model architectures to search. The agent learns through random exploration and slowly begins to exploit its findings to select higher-performing models using the ε-greedy strategy (Mnih et al., 2015). The agent receives the validation accuracy on the given machine learning task as the reward for selecting an architecture. We expedite the learning process through repeated memory sampling using experience replay (Lin, 1993). We refer to this Q-learning based meta-modeling method as MetaQNN, which is summarized in Figure 1. We conduct experiments with a space of model architectures consisting of only standard convolution, pooling, and fully connected layers, using three standard image classification datasets: CIFAR-10, SVHN, and MNIST.
The learning agent discovers CNN architectures that beat all existing networks designed only with the same layer types (e.g., Springenberg et al. (2014); Srivastava et al. (2015)). In addition, their performance is competitive against network designs that include complex layer types and training procedures (e.g., Clevert et al. (2015); Lee et al. (2016)). Finally, the MetaQNN-selected models comfortably outperform previous automated network design methods (Stanley & Miikkulainen, 2002; Bergstra et al., 2013). The top network designs discovered by the agent on one dataset are also competitive when trained on other datasets, indicating that they are well suited for transfer learning tasks. Moreover, we can generate not just one, but several varied, well-performing network designs, which can be ensembled to further boost prediction performance.

Figure 1: Designing CNN Architectures with Q-learning: The agent begins by sampling a Convolutional Neural Network (CNN) topology conditioned on a predefined behavior distribution and the agent's prior experience (left block). That CNN topology is then trained on a specific task; the topology description and performance, e.g., validation accuracy, are then stored in the agent's memory (middle block). Finally, the agent uses its memories to learn about the space of CNN topologies through Q-learning (right block). [Figure shows an example sampled topology, C(64,5,1) → C(128,3,1) → P(2,2) → SM(10), with performance 93.3%.]

## 2 RELATED WORK

**Designing neural network architectures:** Research on automating neural network design goes back to the 1980s, when genetic algorithm-based approaches were proposed to find both architectures and weights (Schaffer et al., 1992).
However, to the best of our knowledge, networks designed with genetic algorithms, such as those generated with the NEAT algorithm (Stanley & Miikkulainen, 2002), have been unable to match the performance of hand-crafted networks. Motivated by screening methods in genetics, Pinto et al. (2009) proposed a high-throughput network selection approach in which they randomly sample thousands of architectures and choose promising ones for further training. In recent work, Saxena & Verbeek (2016) sidestep explicit architecture selection with densely connected convolutional neural fabrics. Bayesian optimization has also been used for the automatic selection of network architectures (Bergstra et al., 2013; Domhan et al., 2015) and hyperparameters (Snoek et al., 2012; Swersky et al., 2013). Notably, Bergstra et al. (2013) proposed a meta-modeling approach based on Tree of Parzen Estimators (TPE) (Bergstra et al., 2011) to choose both the type of layers and hyperparameters of feed-forward networks; however, they fail to match the performance of hand-crafted networks.

**Reinforcement Learning:** Recently there has been much work at the intersection of reinforcement learning and deep learning. For instance, methods using CNNs to approximate the Q-learning utility function (Watkins, 1989) have been successful in game-playing agents (Mnih et al., 2015; Silver et al., 2016) and robotic control (Lillicrap et al., 2015; Levine et al., 2016). These methods rely on phases of exploration, in which the agent tries to learn about its environment through sampling, and exploitation, in which the agent uses what it has learned to find better paths. In traditional reinforcement learning settings, over-exploration can lead to slow convergence times, while over-exploitation can lead to convergence to local minima (Kaelbling et al., 1996). However, in the case of large or continuous state spaces, the ε-greedy strategy of learning has been empirically shown to converge (Vermorel & Mohri, 2005; Mnih et al., 2015). We incorporate these techniques (Q-learning, the ε-greedy strategy, and experience replay) in our algorithm design.

## 3 BACKGROUND

Our method relies on Q-learning, a type of reinforcement learning. We now summarize the theoretical formulation of Q-learning, as adapted to our problem.
Consider the task of teaching an agent to find optimal paths as a Markov Decision Process (MDP) in a finite-horizon environment. Constraining the environment to be finite-horizon ensures that the agent will deterministically terminate in a finite number of time steps. In addition, we restrict the environment to have a discrete and finite state space $\mathcal{S}$ as well as action space $\mathcal{U}$. For any state $s_i \in \mathcal{S}$, there is a finite set of actions, $\mathcal{U}(s_i) \subseteq \mathcal{U}$, that the agent can choose from. In an environment with stochastic transitions, an agent in state $s_i$ taking some action $u \in \mathcal{U}(s_i)$ will transition to state $s_j$ with probability $p_{s'|s,u}(s_j \mid s_i, u)$, which may be unknown to the agent. At each time step $t$, the agent is given a reward $r_t$, dependent on the transition from state $s$ to $s'$ and action $u$; $r_t$ may also be stochastic, according to a distribution $p_{r|s',s,u}$. The agent's goal is to maximize the total expected reward over all possible trajectories, i.e., $\max_{\mathcal{T}_i \in \mathcal{T}} R_{\mathcal{T}_i}$, where the total expected reward for a trajectory $\mathcal{T}_i$ is

$$R_{\mathcal{T}_i} = \sum_{(s,u,s') \in \mathcal{T}_i} \mathbb{E}_{r|s,u,s'}\left[r \mid s, u, s'\right]. \tag{1}$$

Though we limit the agent to a finite state and action space, there are still a combinatorially large number of trajectories, which motivates the use of reinforcement learning. We define the maximization problem recursively in terms of subproblems as follows. For any state $s_i \in \mathcal{S}$ and subsequent action $u \in \mathcal{U}(s_i)$, we define the maximum total expected reward to be $Q^*(s_i, u)$. $Q^*(\cdot)$ is known as the action-value function, and individual $Q^*(s_i, u)$ are known as Q-values. The recursive maximization equation, which is known as Bellman's Equation, can be written as

$$Q^*(s_i, u) = \mathbb{E}_{s_j|s_i,u}\!\left[\,\mathbb{E}_{r|s_i,u,s_j}\left[r \mid s_i, u, s_j\right] + \gamma \max_{u' \in \mathcal{U}(s_j)} Q^*(s_j, u')\right]. \tag{2}$$

In many cases, it is impossible to analytically solve Bellman's Equation (Bertsekas, 2015), but it can be formulated as an iterative update:

$$Q_{t+1}(s_i, u) = (1-\alpha)\,Q_t(s_i, u) + \alpha\left[r_t + \gamma \max_{u' \in \mathcal{U}(s_j)} Q_t(s_j, u')\right]. \tag{3}$$

Equation 3 is the simplest form of Q-learning, proposed by Watkins (1989).
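As a concrete illustration, the iterative update in Equation 3 can be written as a small tabular routine. This is a minimal sketch; the function and parameter names are our own, not from the paper.

```python
from collections import defaultdict

def q_update(Q, s_i, u, r_t, s_j, next_actions, alpha=0.1, gamma=1.0):
    """One application of Equation 3: blend the old Q-value with the
    observed reward plus the discounted best Q-value of the next state."""
    best_next = max((Q[(s_j, u2)] for u2 in next_actions), default=0.0)
    Q[(s_i, u)] = (1 - alpha) * Q[(s_i, u)] + alpha * (r_t + gamma * best_next)
    return Q

# With alpha = 0.5 and all Q-values initialized to zero, a reward of 1.0
# moves Q[("s0", "a")] halfway toward the target, i.e. to 0.5.
Q = defaultdict(float)
q_update(Q, "s0", "a", 1.0, "s1", ["a", "b"], alpha=0.5)
```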
For well-formulated problems, $\lim_{t \to \infty} Q_t(s, u) = Q^*(s, u)$, as long as each transition is sampled infinitely many times (Bertsekas, 2015). The update equation has two parameters: (i) $\alpha$ is the Q-learning rate, which determines the weight given to new information over old information, and (ii) $\gamma$ is the discount factor, which determines the weight given to short-term rewards over future rewards. The Q-learning algorithm is model-free, in that the learning agent can solve the task without ever explicitly constructing an estimate of the environmental dynamics. In addition, Q-learning is off-policy, meaning it can learn about optimal policies while exploring via a non-optimal behavioral distribution, i.e., the distribution by which the agent explores its environment.

We choose the behavior distribution using an ε-greedy strategy (Mnih et al., 2015). With this strategy, a random action is taken with probability ε and the greedy action, $\arg\max_{u \in \mathcal{U}(s_i)} Q_t(s_i, u)$, is chosen with probability $1 - \varepsilon$. We anneal ε from 1 to 0 such that the agent begins in an exploration phase and slowly starts moving towards the exploitation phase. In addition, when the exploration cost is large (which is true for our problem setting), it is beneficial to use the experience replay technique for faster convergence (Lin, 1993). In experience replay, the learning agent is provided with a memory of its past explored paths and rewards. At a given interval, the agent samples from the memory and updates its Q-values via Equation 3.

## 4 DESIGNING NEURAL NETWORK ARCHITECTURES WITH Q-LEARNING

We consider the task of training a learning agent to sequentially choose neural network layers. Figure 2 shows feasible state and action spaces (a) and a potential trajectory the agent may take, along with the CNN architecture defined by this trajectory (b). We model the layer selection process as a Markov Decision Process, with the assumption that a well-performing layer in one network should also perform well in another network.
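The ε-greedy action selection from Section 3 can be sketched as follows; this is a hypothetical helper of our own, not code from the paper.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore (uniform random action);
    otherwise exploit the current Q-value estimates."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda u: Q.get((state, u), 0.0))

# With epsilon = 0 the agent is fully greedy and picks the
# highest-valued action for the state:
Q = {("s0", "conv"): 0.9, ("s0", "pool"): 0.4}
epsilon_greedy(Q, "s0", ["conv", "pool"], epsilon=0.0)  # returns "conv"
```

Annealing ε from 1 to 0 then amounts to calling this helper with a slowly decreasing `epsilon` argument.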
We make this assumption based on the hierarchical nature of the feature representations learned by neural networks with many hidden layers (LeCun et al., 2015). The agent sequentially selects layers via the ε-greedy strategy until it reaches a termination state. The CNN architecture defined by the agent's path is trained on the chosen learning problem, and the agent is given a reward equal to the validation accuracy. The validation accuracy and architecture description are stored in a replay memory. Experiences are sampled periodically from the replay memory to update Q-values via Equation 3. The agent follows an ε schedule which determines its shift from exploration to exploitation.

Figure 2: Markov Decision Process for CNN Architecture Generation: Figure 2(a) shows the full state and action space. In this illustration, actions are shown to be deterministic for clarity, but they are stochastic in experiments. $C(n, f, l)$ denotes a convolutional layer with $n$ filters, receptive field size $f$, and stride $l$. $P(f, l)$ denotes a pooling layer with receptive field size $f$ and stride $l$. $G$ denotes a termination state (Softmax/Global Average Pooling). Figure 2(b) shows a path the agent may choose, highlighted in green, and the corresponding CNN topology.
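The loop described above (sample an architecture, train it, store the result, replay memories) can be sketched at a high level. Every name here is a hypothetical stand-in of ours; in particular, `train_and_eval` abstracts away an entire CNN training run.

```python
import random

def metaqnn_loop(sample_path, train_and_eval, update_q, epsilon_schedule,
                 replay_batch=100):
    """High-level sketch of the MetaQNN loop: sample a layer path
    epsilon-greedily, train the resulting CNN, store (path, accuracy)
    in replay memory, then replay memories to update Q-values."""
    memory = []
    for epsilon in epsilon_schedule:
        S, U = sample_path(epsilon)       # layer-by-layer path sampling
        accuracy = train_and_eval(S)      # reward = validation accuracy
        memory.append((S, U, accuracy))   # store in replay memory
        batch = random.sample(memory, min(len(memory), replay_batch))
        for S_m, U_m, acc_m in batch:     # experience replay
            update_q(S_m, U_m, acc_m)     # Equation 3 updates
    return memory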
Our method requires three main design choices: (i) reducing CNN layer definitions to simple state tuples, (ii) defining a set of actions the agent may take, i.e., the set of layers the agent may pick next given its current state, and (iii) balancing the size of the state-action space (and, correspondingly, the model capacity) with the amount of exploration needed by the agent to converge. We now describe the design choices and the learning process in detail.

### 4.1 THE STATE SPACE

Each state is defined as a tuple of all relevant layer parameters. We allow five different types of layers: convolution (C), pooling (P), fully connected (FC), global average pooling (GAP), and softmax (SM), though the general method is not limited to this set. Table 1 shows the relevant parameters for each layer type and the discretization we chose for each parameter. Each layer has a parameter layer depth (shown as Layer 1, 2, ... in Figure 2). Adding layer depth to the state space allows us to constrict the action space such that the state-action graph is directed and acyclic (a DAG), and also allows us to specify a maximum number of layers the agent may select before terminating.

Each layer type also has a parameter called representation size (R-size). Convolutional nets progressively compress the representation of the original signal through pooling and convolution. The presence of these layers in our state space may lead the agent on a trajectory where the intermediate signal representation is reduced to a size that is too small for further processing. For example, five 2×2 pooling layers, each with stride 2, will reduce an image of initial size 32×32 to size 1×1. At this stage, further pooling, or convolution with receptive field size greater than 1, would be meaningless and degenerate.
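A state tuple of this kind might be encoded as below. The field names and layout are our own illustration; the exact per-layer-type tuples follow the paper's Table 1.

```python
from collections import namedtuple

# One possible encoding of a layer state: layer type, layer depth,
# discretized layer parameters, and the binned representation size.
LayerState = namedtuple(
    "LayerState",
    ["layer_type", "layer_depth", "n_filters", "field_size",
     "stride", "r_size_bin"],
)

conv = LayerState("C", 1, 64, 3, 1, 1)   # C(64,3,1) at depth 1
pool = LayerState("P", 2, 0, 2, 2, 1)    # P(2,2) at depth 2 (no filters)

# States are hashable, so (state, action) pairs can index a tabular
# Q-function directly:
Q = {(conv, pool): 0.0}
```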
To avoid such scenarios, we add the R-size parameter to the state tuple $s$, which allows us to restrict actions from states with R-size $n$ to those that have a receptive field size less than or equal to $n$. To further constrict the state space, we chose to bin the representation sizes into three discrete buckets. However, binning adds uncertainty to the state transitions: depending on the true underlying representation size, a pooling layer may or may not change the R-size bin. As a result, the action of pooling can lead to two different states, which we model as stochasticity in state transitions. Please see Figure A1 in the appendix for an illustrated example.

### 4.2 THE ACTION SPACE

We restrict the agent from taking certain actions to both limit the state-action space and make learning tractable. First, we allow the agent to terminate a path at any point, i.e., it may choose a termination state from any layer.

Table 1 lists, for each layer type, the relevant state parameters and the discrete values each may take. For a convolutional layer (C), the parameters are: $i$, the layer depth; $f$, the receptive field size; $\ell$, the stride; $d$, the number of receptive fields; and $n$, the representation size.

The ε-greedy path-sampling step selects each action as follows:

```
    if α > ε then
        u = argmax_{u' ∈ U(S[-1])} Q[S[-1], u']
        s' = TRANSITION(S[-1], u)
    else
        u ~ Uniform{U(S[-1])}
        s' = TRANSITION(S[-1], u)
    end if
    U.append(u)
    if u ≠ terminate then
        S.append(s')
    end if
end while
return S, U
```

Algorithm 3 UPDATE_Q_VALUES(Q, S, U, accuracy):

```
Q[S[-1], U[-1]] = (1 - α)·Q[S[-1], U[-1]] + α·accuracy
for i = length(S) - 2 to 0 do
    Q[S[i], U[i]] = (1 - α)·Q[S[i], U[i]] + α·γ·max_{u ∈ U(S[i+1])} Q[S[i+1], u]
end for
return Q
```

## B REPRESENTATION SIZE BINNING

As mentioned in Section 4.1 of the main text, we introduce a parameter called representation size to prohibit the agent from taking actions that can reduce the intermediate signal representation to a size that is too small for further processing. However, this process leads to uncertainty in state transitions, as illustrated in Figure A1, which is handled by the standard Q-learning formulation.
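The binning-induced stochasticity can be made concrete with a small sketch. The three-bucket boundaries below are hypothetical illustrations of ours (the paper's exact cut points are not shown here), and the pooling-size formula is the standard valid-pooling output size.

```python
def r_size_bin(size, boundaries=(8, 4)):
    """Map a true representation size to a discrete bin.
    The boundaries are illustrative assumptions, not the paper's values."""
    if size >= boundaries[0]:
        return 1
    if size >= boundaries[1]:
        return 2
    return 3

def pool(true_r_size, field=2, stride=2):
    """Apply one pooling layer; return (new size, old bin, new bin).
    Two states in the same bin can land in different bins afterwards,
    which the agent sees as a stochastic transition."""
    new_size = (true_r_size - field) // stride + 1
    return new_size, r_size_bin(true_r_size), r_size_bin(new_size)

pool(18)  # -> (9, 1, 1): stays in bin 1
pool(9)   # -> (4, 1, 2): same starting bin, different successor bin
```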
[Figure A1: Illustration of stochastic state transitions induced by representation-size binning. Panels (a) and (b) show the pooling action P(2,2) applied from states with different true R-sizes within the same R-size bin; because sizes are binned, the same binned state-action pair can lead to different successor bins, with transition probabilities p and 1 − p.]