Under review as a conference paper at ICLR 2017

TOWARDS PRINCIPLED METHODS FOR TRAINING GENERATIVE ADVERSARIAL NETWORKS

Martin Arjovsky, Courant Institute of Mathematical Sciences
Léon Bottou, Facebook AI Research

[email protected]  [email protected]

ABSTRACT

The goal of this paper is not to introduce a single algorithm or method, but to make theoretical steps towards fully understanding the training dynamics of generative adversarial networks. In order to substantiate our theoretical analysis, we perform targeted experiments to verify our assumptions, illustrate our claims, and quantify the phenomena. This paper is divided into three sections. The first section introduces the problem at hand. The second section is dedicated to studying and rigorously proving the problems, including instability and saturation, that arise when training generative adversarial networks. The third section examines a practical and theoretically grounded direction towards solving these problems, while introducing new tools to study them.

1 INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014a) have achieved great success at generating realistic and sharp looking images. However, they are widely general methods, now starting to be applied to several other important problems, such as semi-supervised learning, stabilizing sequence learning methods for speech and language, and 3D modelling (Denton et al., 2015; Radford et al., 2015; Salimans et al., 2016; Lamb et al., 2016; Wu et al., 2016). However, they still remain remarkably difficult to train, with most current papers dedicated to heuristically finding stable architectures (Radford et al., 2015; Salimans et al., 2016).

Despite their success, there is little to no theory explaining the unstable behaviour of GAN training. Furthermore, approaches to attacking this problem still rely on heuristics that are extremely sensitive to modifications. This makes it extremely hard to experiment with new variants, or to use them in new domains, which limits their applicability drastically.
This paper aims to change that, by providing a solid understanding of these issues and creating principled research directions towards addressing them.

It is interesting to note that the architecture of the generator used by GANs doesn't differ significantly from other approaches like variational autoencoders (Kingma & Welling, 2013). However, the problem is far from closed.

Generative adversarial networks are formulated in two steps. We first train a discriminator $D$ to maximize

$$L(D, g_\theta) = \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{x \sim P_g}[\log(1 - D(x))] \quad (1)$$

One can easily show that the optimal discriminator has the shape

$$D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)} \quad (2)$$

and that $L(D^*, g_\theta) = 2\,\mathrm{JSD}(P_r \| P_g) - 2\log 2$, so minimizing equation (1) as a function of $\theta$ amounts to minimizing the Jensen-Shannon divergence when the discriminator is optimal. In theory, one would therefore expect that we would first train the discriminator as close as we can to optimality (so the cost function on $\theta$ better approximates the JSD), and then do gradient steps on $\theta$, alternating these two things. However, this doesn't work. In practice, as the discriminator gets better, the updates to the generator get consistently worse. The original GAN paper argued that this issue arose from saturation, and switched to another similar cost function that doesn't have this problem. However, even with this new cost function, updates tend to get worse and optimization becomes massively unstable.

Therefore, several questions arise:

- Why do updates get worse as the discriminator gets better, both with the original and the new cost function?
- Why is GAN training massively unstable?
- Is the new cost function following a similar divergence to the JSD? If so, what are its properties?
- Is there a way to avoid some of these issues?

The fundamental contributions of this paper are the answers to all these questions, and perhaps more importantly, the introduction of the tools needed to analyze them properly.
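The identity $L(D^*, g_\theta) = 2\,\mathrm{JSD}(P_r \| P_g) - 2\log 2$ above can be checked numerically. The following sketch is our own illustration, not from the paper; the two Gaussian densities and the integration grid are arbitrary choices. It plugs the optimal discriminator (2) into the cost (1) and compares the result against the Jensen-Shannon divergence computed directly from its definition.

```python
import numpy as np

# Toy 1-D check that L(D*, g_theta) = 2 * JSD(P_r || P_g) - 2 log 2,
# with D*(x) = P_r(x) / (P_r(x) + P_g(x)) as in equation (2).
xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_r = gauss(xs, 0.0, 1.0)   # stand-in "real" density (assumption)
p_g = gauss(xs, 2.0, 1.0)   # stand-in "generated" density (assumption)

d_star = p_r / (p_r + p_g)  # optimal discriminator, equation (2)

# Cost (1) evaluated at D*: E_{P_r}[log D*] + E_{P_g}[log(1 - D*)]
cost = np.sum(p_r * np.log(d_star) * dx) + np.sum(p_g * np.log(1.0 - d_star) * dx)

# Jensen-Shannon divergence from its definition via two KL terms
p_m = 0.5 * (p_r + p_g)
kl = lambda p, q: np.sum(p * np.log(p / q) * dx)
jsd = 0.5 * kl(p_r, p_m) + 0.5 * kl(p_g, p_m)
```

The two quantities agree up to numerical integration error, which is why training the discriminator to optimality turns the cost (1) into an estimate of the JSD.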
We provide a new direction designed to avoid the instability issues in GANs, and examine in depth the theory behind it. Finally, we state a series of open questions and problems that determine several new directions of research beginning with our methods.

2 SOURCES OF INSTABILITY

The theory tells us that the trained discriminator will have cost at most $2\log 2 - 2\,\mathrm{JSD}(P_r \| P_g)$. However, in practice, if we just train $D$ till convergence, its error will go to 0, as observed in Figure 1, pointing to the fact that the JSD between them is maxed out. The only way this can happen is if the distributions are not continuous, or they have disjoint supports.

One possible cause for the distributions not to be continuous is if their supports lie on low dimensional manifolds. There is strong empirical and theoretical evidence to believe that $P_r$ is indeed extremely concentrated on a low dimensional manifold (Narayanan & Mitter, 2010). We say that a discriminator $D : \mathcal{X} \to [0,1]$ has accuracy 1 if it takes the value 1 on a set that contains the support of $P_r$ and the value 0 on a set that contains the support of $P_g$. Namely, $\mathbb{P}_{x \sim P_r}[D(x) = 1] = 1$ and $\mathbb{P}_{x \sim P_g}[D(x) = 0] = 1$.

Theorem 2.1. If two distributions $P_r$ and $P_g$ have support contained in two disjoint compact subsets $\mathcal{M}$ and $\mathcal{P}$ respectively, then there is a smooth optimal discriminator $D^* : \mathcal{X} \to [0,1]$ that has accuracy 1 and $\nabla_x D^*(x) = 0$ for all $x \in \mathcal{M} \cup \mathcal{P}$.

Proof. The discriminator is trained to maximize

$$\mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{x \sim P_g}[\log(1 - D(x))]$$

Since $\mathcal{M}$ and $\mathcal{P}$ are compact and disjoint, the distance between the two sets is strictly positive.

We have $b \ge a$ if and only if $P_{r+\epsilon} \ge P_{g+\epsilon}$, in which case the second term will have the strength to lower the probability of these too-likely samples. Finally, if there is an area around $x$ that has the same probability of coming from $P_g$ as from $P_r$, the gradient contributions of the two terms will cancel, thereby stabilizing the gradient when $P_r$ is similar to $P_g$.
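The mechanism behind Theorem 2.1 can be illustrated with a toy numeric sketch of our own (the uniform supports, the sigmoid discriminator and the sharpness values are assumptions made purely for illustration): as the discriminator sharpens, its accuracy saturates at 1 while the gradient it provides with respect to its input collapses towards 0 on both supports.

```python
import numpy as np

# Toy illustration of Theorem 2.1: with disjoint supports, a discriminator
# can reach accuracy 1 while its input gradient vanishes on both supports,
# leaving the generator with no usable signal.
rng = np.random.default_rng(0)
x_r = rng.uniform(-2.0, -1.0, size=1000)   # support of P_r: [-2, -1] (assumption)
x_g = rng.uniform(1.0, 2.0, size=1000)     # support of P_g: [1, 2]  (assumption)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def accuracy_and_grad(k):
    # D(x) = sigmoid(-k * x): tends to 1 on P_r's support, 0 on P_g's.
    d_r, d_g = sigmoid(-k * x_r), sigmoid(-k * x_g)
    acc = 0.5 * np.mean(d_r > 0.5) + 0.5 * np.mean(d_g < 0.5)
    d_all = np.concatenate([d_r, d_g])
    grad = np.max(k * d_all * (1.0 - d_all))   # |dD/dx| = k * D * (1 - D)
    return acc, grad

acc_small, grad_small = accuracy_and_grad(5.0)    # mildly trained discriminator
acc_big, grad_big = accuracy_and_grad(50.0)       # near-optimal discriminator
```

Both discriminators reach accuracy 1, but the sharper one supplies gradients that are many orders of magnitude smaller: exactly the regime in which generator updates stop carrying information.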
There is one important problem with taking gradient steps exactly of the form (4): in that case, $D$ will disregard errors that lie exactly on $g_\theta(Z)$, since this is a set of measure 0. However, $g_\theta$ will be optimizing its cost only on that space. This will make the discriminator extremely susceptible to adversarial examples, and will render low cost on the generator without high cost on the discriminator, and lousy, meaningless samples. This is easily seen when we realize that the term inside the expectation of equation (4) will be a positive scalar times $\nabla_x \log(1 - D^*(x)) \, \nabla_\theta g_\theta(z)$, which is the directional derivative towards the exact adversarial term of Goodfellow et al. (2014b). Because of this, it is important to backprop through noisy samples in the generator as well. This will yield a crucial benefit: the generator's backprop term will be through samples on a set of positive measure that the discriminator will care about. Formalizing this notion, the actual gradient through the generator will now be proportional to $\nabla_\theta \mathrm{JSD}(P_{r+\epsilon} \| P_{g+\epsilon})$, which will make the two noisy distributions match. As we anneal the noise, this will make $P_r$ and $P_g$ match as well. For completeness, we show the smooth gradient we get in this case. The proof is identical to the one of Theorem 3.2, so we leave it to the reader.

Corollary 3.2. Let $\epsilon, \epsilon' \sim \mathcal{N}(0, \sigma^2 I)$ and $\tilde{g}_\theta(z) = g_\theta(z) + \epsilon'$. Then

$$\mathbb{E}_{z \sim p(z), \epsilon'}\left[\nabla_\theta \log(1 - D^*(\tilde{g}_\theta(z)))\right] \quad (5)$$
$$= \mathbb{E}_{z \sim p(z), \epsilon'}\left[ a(z) \int_{\mathcal{M}} P_\epsilon(\tilde{g}_\theta(z) - y) \, \nabla_\theta \|\tilde{g}_\theta(z) - y\|^2 \, dP_r(y) - b(z) \int_{\mathcal{P}} P_\epsilon(\tilde{g}_\theta(z) - y) \, \nabla_\theta \|\tilde{g}_\theta(z) - y\|^2 \, dP_g(y) \right]$$
$$= 2 \, \nabla_\theta \mathrm{JSD}(P_{r+\epsilon} \| P_{g+\epsilon})$$

As with Theorem 3.2, $a$ and $b$ will have the same properties. The main difference is that we will be moving all our noisy samples towards the data manifold, which can be thought of as moving a small neighbourhood of samples towards it. This will protect the discriminator against measure 0 adversarial examples.

Proof of Theorem 3.2.
Since the discriminator is assumed fixed when backpropping to the generator, the only thing that depends on $\theta$ is $g_\theta(z)$ for every $z$. Taking derivatives on our cost function,

$$\mathbb{E}_{z \sim p(z)}\left[\nabla_\theta \log(1 - D^*(g_\theta(z)))\right]$$
$$= \mathbb{E}_{z \sim p(z)}\left[ \nabla_\theta \log \frac{P_{g+\epsilon}(g_\theta(z))}{P_{r+\epsilon}(g_\theta(z)) + P_{g+\epsilon}(g_\theta(z))} \right]$$
$$= \mathbb{E}_{z \sim p(z)}\left[ \nabla_\theta \log P_{g+\epsilon}(g_\theta(z)) - \nabla_\theta \log\left( P_{g+\epsilon}(g_\theta(z)) + P_{r+\epsilon}(g_\theta(z)) \right) \right]$$
$$= \mathbb{E}_{z \sim p(z)}\left[ \frac{\nabla_\theta P_{g+\epsilon}(g_\theta(z))}{P_{g+\epsilon}(g_\theta(z))} - \frac{\nabla_\theta P_{g+\epsilon}(g_\theta(z)) + \nabla_\theta P_{r+\epsilon}(g_\theta(z))}{P_{g+\epsilon}(g_\theta(z)) + P_{r+\epsilon}(g_\theta(z))} \right]$$
$$= \mathbb{E}_{z \sim p(z)}\left[ -\frac{1}{P_{g+\epsilon}(g_\theta(z)) + P_{r+\epsilon}(g_\theta(z))} \nabla_\theta\left[ P_{r+\epsilon}(g_\theta(z)) \right] + \frac{1}{P_{g+\epsilon}(g_\theta(z)) + P_{r+\epsilon}(g_\theta(z))} \frac{P_{r+\epsilon}(g_\theta(z))}{P_{g+\epsilon}(g_\theta(z))} \nabla_\theta\left[ P_{g+\epsilon}(g_\theta(z)) \right] \right]$$

Let the density of $\epsilon$ be $P_\epsilon(x) = \frac{1}{Z} e^{-\frac{\|x\|^2}{2\sigma^2}}$, so that $\nabla_\theta P_{r+\epsilon}(g_\theta(z)) = -\frac{1}{2\sigma^2} \int_{\mathcal{M}} P_\epsilon(g_\theta(z) - y) \, \nabla_\theta \|g_\theta(z) - y\|^2 \, dP_r(y)$, and analogously for $P_{g+\epsilon}$. We now define

$$a(z) = \frac{1}{2\sigma^2} \, \frac{1}{P_{g+\epsilon}(g_\theta(z)) + P_{r+\epsilon}(g_\theta(z))}$$
$$b(z) = \frac{1}{2\sigma^2} \, \frac{1}{P_{g+\epsilon}(g_\theta(z)) + P_{r+\epsilon}(g_\theta(z))} \, \frac{P_{r+\epsilon}(g_\theta(z))}{P_{g+\epsilon}(g_\theta(z))}$$

Trivially, $a$ and $b$ are positive functions. Since $b = a \, \frac{P_{r+\epsilon}}{P_{g+\epsilon}}$, we know that $b \ge a$ if and only if $P_{r+\epsilon} \ge P_{g+\epsilon}$, and $b \le a$ if and only if $P_{r+\epsilon} \le P_{g+\epsilon}$.

Let $\pi_i(x) = x_i$ be the projection onto the $i$-th coordinate. If $x$ is a critical point of $g$, since the coordinates of $g$ are independent, then $x$ has to be a critical point of $\pi_i \circ g$. By a consequence of the Morse Lemma, the critical points of $\pi_i \circ g$ are isolated, and therefore so are the ones of $g$, meaning that there is at most a countable number of them. Since $g$ maps the non-critical points onto a $d$-dimensional manifold (because it acts as an embedding) and the countable number of critical points into a countable number of points (or 0-dimensional manifolds), the proof is finished.

Proof of Lemma 2. For now we assume that $\mathcal{M}$ and $\mathcal{P}$ are without boundary. If $\dim \mathcal{M} + \dim \mathcal{P} \ge d$, it is known that under arbitrarily small perturbations of the kind defined in the statement of this Lemma, the two manifolds will intersect only transversally with probability 1, by the General Position Lemma. If $\dim \mathcal{M} + \dim \mathcal{P} < d$, we will show that with probability 1, $\tilde{\mathcal{M}}$ and $\tilde{\mathcal{P}}$ will not intersect, thereby getting our desired result.

Let us then assume $\dim \mathcal{M} + \dim \mathcal{P} < d$. Note that $\tilde{\mathcal{M}} \cap \tilde{\mathcal{P}} \ne \emptyset$ if and only if there are $x \in \mathcal{M}$, $y \in \mathcal{P}$ such that $x + \epsilon = y + \epsilon'$, or equivalently $x - y = \epsilon' - \epsilon$. Therefore, $\tilde{\mathcal{M}}$ and $\tilde{\mathcal{P}}$ intersect if and only if $\epsilon' - \epsilon \in \mathcal{M} - \mathcal{P}$.
Since $\epsilon, \epsilon'$ are independent continuous random variables, the difference $\epsilon' - \epsilon$ is also continuous. If $\mathcal{M} - \mathcal{P}$ has measure 0 in $\mathbb{R}^d$, then $P(\epsilon' - \epsilon \in \mathcal{M} - \mathcal{P}) = 0$, concluding the proof. We will therefore show that $\mathcal{M} - \mathcal{P}$ has measure 0.

Let $f : \mathcal{M} \times \mathcal{P} \to \mathbb{R}^d$ be $f(x, y) = x - y$. If $m$ and $p$ are the dimensions of $\mathcal{M}$ and $\mathcal{P}$, then $f$ is a smooth function between an $(m+p)$-dimensional manifold and a $d$-dimensional one. Clearly, the image of $f$ is $\mathcal{M} - \mathcal{P}$. Therefore,

$$\mathcal{M} - \mathcal{P} = f(\{z \in \mathcal{M} \times \mathcal{P} \mid \mathrm{rank}(d_z f) < m+p\}) \cup f(\{z \in \mathcal{M} \times \mathcal{P} \mid \mathrm{rank}(d_z f) = m+p\})$$

The first set is the image of the critical points, namely the critical values. By Sard's Lemma, this set has measure 0. Let us call $A = \{z \in \mathcal{M} \times \mathcal{P} \mid \mathrm{rank}(d_z f) = m+p\}$, and let $z$ be an element of $A$. By the inverse function theorem, there is a neighbourhood $U_z \subseteq \mathcal{M} \times \mathcal{P}$ of $z$ such that $f|_{U_z}$ is an embedding. Since every manifold has a countable topological basis, we can cover $A$ by countably many such sets $U_{z_n}$, $n \in \mathbb{N}$, which we denote by $U_n$. Since $f|_{U_n}$ is an embedding, $f(U_n)$ is an $(m+p)$-dimensional manifold, and since $m + p < d$, this set has measure 0 in $\mathbb{R}^d$. Now, $f(A) = \bigcup_{n \in \mathbb{N}} f(U_n)$, which therefore has measure 0 in $\mathbb{R}^d$, finishing the proof of the boundary free case.

Now we consider the case where $\mathcal{M}$ and $\mathcal{P}$ are manifolds with boundary. By a simple union bound,

$$P_{\epsilon, \epsilon'}(\tilde{\mathcal{M}} \text{ perfectly aligns with } \tilde{\mathcal{P}}) \le P_{\epsilon, \epsilon'}(\mathrm{Int}\,\tilde{\mathcal{M}} \text{ perfectly aligns with } \mathrm{Int}\,\tilde{\mathcal{P}}) + P_{\epsilon, \epsilon'}(\mathrm{Int}\,\tilde{\mathcal{M}} \text{ perfectly aligns with } \partial\tilde{\mathcal{P}}) + P_{\epsilon, \epsilon'}(\partial\tilde{\mathcal{M}} \text{ perfectly aligns with } \mathrm{Int}\,\tilde{\mathcal{P}}) + P_{\epsilon, \epsilon'}(\partial\tilde{\mathcal{M}} \text{ perfectly aligns with } \partial\tilde{\mathcal{P}}) = 0$$

where the last equality arises from combining the facts that $\widetilde{\mathrm{Int}\,\mathcal{M}} = \epsilon + \mathrm{Int}\,\mathcal{M} = \mathrm{Int}(\epsilon + \mathcal{M}) = \mathrm{Int}\,\tilde{\mathcal{M}}$ (and analogously for the boundary and for $\mathcal{P}$), that the boundaries and interiors of $\mathcal{M}$ and $\mathcal{P}$ are boundary free regular submanifolds of $\mathbb{R}^d$ without full dimension, and then applying the boundary free case of the proof.

Proof of Lemma 3. Let $m = \dim \mathcal{M}$ and $p = \dim \mathcal{P}$. We again consider first the case where $\mathcal{M}$ and $\mathcal{P}$ are manifolds without boundary. If $m + p < d$, then $\mathcal{L} = \emptyset$, so the statement is trivially true. If $m + p \ge d$, then $\mathcal{M}$ and $\mathcal{P}$ intersect transversally.
This implies that $\mathcal{L}$ is a manifold of dimension $m + p - d$, which is strictly smaller than both $m$ and $p$. Since $\mathcal{L}$ is a submanifold of both $\mathcal{M}$ and $\mathcal{P}$ of strictly lower dimension, it has measure 0 on both of them.

We now tackle the case where $\mathcal{M}$ and $\mathcal{P}$ have boundaries. Let us remember that $\mathcal{M} = \mathrm{Int}\,\mathcal{M} \cup \partial\mathcal{M}$, and the union is disjoint (and analogously for $\mathcal{P}$). By using elementary properties of sets, we can trivially see that

$$\mathcal{L} = \mathcal{M} \cap \mathcal{P} = (\mathrm{Int}\,\mathcal{M} \cap \mathrm{Int}\,\mathcal{P}) \cup (\mathrm{Int}\,\mathcal{M} \cap \partial\mathcal{P}) \cup (\partial\mathcal{M} \cap \mathrm{Int}\,\mathcal{P}) \cup (\partial\mathcal{M} \cap \partial\mathcal{P})$$

where the unions are disjoint. By the first part of the proof, this is a disjoint union of four strictly lower dimensional manifolds. Since each one of these intersections has measure 0 on either the interior or the boundary of $\mathcal{M}$ (again, by the first part of the proof), and interior and boundary are contained in $\mathcal{M}$, each of the four intersections has measure 0 in $\mathcal{M}$. Analogously, they have measure 0 in $\mathcal{P}$, and by a simple union bound we see that $\mathcal{L}$ has measure 0 in both $\mathcal{M}$ and $\mathcal{P}$, finishing the remaining case of the proof.

Proof of Theorem 3.1. We first need to show that $P_{X+\epsilon}$ is absolutely continuous. Let $A$ be a Borel set with Lebesgue measure 0. Then, by the independence of $\epsilon$ and $X$, we know by Fubini that

$$P_{X+\epsilon}(A) = \int_{\mathbb{R}^d} P_\epsilon(A - x) \, dP_X(x) = \int_{\mathbb{R}^d} 0 \, dP_X(x) = 0$$

where we used the fact that if $A$ has Lebesgue measure zero, then so does $A - x$, and since $P_\epsilon$ is absolutely continuous, $P_\epsilon(A - x) = 0$.

Now we calculate the density of $P_{X+\epsilon}$. Again using the independence of $X$ and $\epsilon$, for any Borel set $B$ we know that

$$P_{X+\epsilon}(B) = \int_{\mathbb{R}^d} P_\epsilon(B - y) \, dP_X(y) = \mathbb{E}_{y \sim P_X}[P_\epsilon(B - y)] = \mathbb{E}_{y \sim P_X}\left[ \int_{B-y} P_\epsilon(x) \, dx \right] = \mathbb{E}_{y \sim P_X}\left[ \int_B P_\epsilon(x - y) \, dx \right] = \int_B \mathbb{E}_{y \sim P_X}[P_\epsilon(x - y)] \, dx$$

Therefore, $P_{X+\epsilon}(B) = \int_B P_{X+\epsilon}(x) \, dx$ for our proposed density $P_{X+\epsilon}(x) = \mathbb{E}_{y \sim P_X}[P_\epsilon(x - y)]$ and all Borel sets $B$. By the uniqueness part of the Radon-Nikodym theorem, this implies the proposed function is the density of $P_{X+\epsilon}$. The equivalent formula replacing the expectation by $\int_{\mathcal{M}} P_\epsilon(x - y) \, dP_X(y)$ is trivial by the definition of expectation and the fact that the support of $P_X$ lies on $\mathcal{M}$.
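The density formula of Theorem 3.1 can be sanity checked by Monte Carlo. The sketch below is our own toy example (the discrete support standing in for $\mathcal{M}$, the noise level, and the evaluation point are all arbitrary): it evaluates $\mathbb{E}_{y \sim P_X}[P_\epsilon(x - y)]$ at a point and compares it with a histogram estimate built from samples of $X + \epsilon$.

```python
import numpy as np

# Monte Carlo check of P_{X+eps}(x) = E_{y ~ P_X}[P_eps(x - y)] in 1-D,
# with X uniform on a finite set (toy stand-in for a low dimensional
# support) and eps ~ N(0, sigma^2).
rng = np.random.default_rng(2)
sigma = 0.5
support = np.array([-1.0, 0.0, 2.0])          # toy support of P_X (assumption)

def p_eps(x):
    # Gaussian density of the noise eps
    return np.exp(-x ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def density(x):
    # E_{y ~ P_X}[P_eps(x - y)], the closed form from Theorem 3.1
    return np.mean(p_eps(x - support))

# Empirical estimate of the same density from samples of X + eps
samples = rng.choice(support, size=200_000) + sigma * rng.normal(size=200_000)
x0, h = 0.3, 0.05
hist_est = np.mean(np.abs(samples - x0) < h) / (2 * h)
```

The histogram estimate and the closed-form convolution agree up to sampling and binning error, which is the content of the theorem: smoothing by $\epsilon$ gives $X + \epsilon$ a genuine density even though $X$ itself has none.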
B FURTHER CLARIFICATIONS

In this appendix we further explain some of the terms and ideas mentioned in the paper which, due to space constraints and to keep the flow of the paper, could not be fully developed in the main text. Some of these have to do with notation, others with technical elements of the proofs. In the latter case, we try to convey more intuition than we previously could. We present these clarifications informally in the following list.

- There are two different but closely related properties a random variable can have. A random variable $X$ is said to be continuous if $P(X = x) = 0$ for every single point $x \in \mathcal{X}$. Note that a random variable concentrated on a low dimensional manifold, such as a plane, can have this property. However, an absolutely continuous random variable has the following property: if a set $A$ has Lebesgue measure 0, then $P(X \in A) = 0$. Since single points have Lebesgue measure 0, absolute continuity implies continuity. A random variable supported on a low dimensional manifold therefore cannot be absolutely continuous: let $\mathcal{M}$, a low dimensional manifold, be the support of $X$. Since a low dimensional manifold has Lebesgue measure 0, absolute continuity would imply $P(X \in \mathcal{M}) = 0$, which is absurd since $\mathcal{M}$ is the support of $X$. The property of $X$ being absolutely continuous can be shown to be equivalent to $X$ having a density: the existence of a function f : X
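A small numeric illustration of this distinction (our own toy example, not from the paper): a random variable uniform on a segment inside $\mathbb{R}^2$ assigns zero mass to every individual point, so it is continuous, yet it assigns full mass to the line containing it, a Lebesgue-null subset of $\mathbb{R}^2$, so it cannot be absolutely continuous and has no density on $\mathbb{R}^2$.

```python
import numpy as np

# X uniform on the segment {(t, 0) : t in [0, 1]}, a 1-dimensional
# manifold embedded in R^2.
rng = np.random.default_rng(3)
n = 100_000
samples = np.stack([rng.uniform(0.0, 1.0, n), np.zeros(n)], axis=1)

# Continuity: the empirical mass of any single point x0 is (essentially) 0.
x0 = np.array([0.5, 0.0])
mass_at_point = np.mean(np.all(samples == x0, axis=1))

# Not absolutely continuous: the line {y = 0} has Lebesgue measure 0 in
# R^2, yet it carries all of the probability mass.
mass_on_line = np.mean(samples[:, 1] == 0.0)
```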