Technical Report
Department of Computer Science
and Engineering
University of Minnesota
4-192 Keller Hall
200 Union Street SE
Minneapolis, MN 55455-0159 USA
TR 18-012
 
  On Applications of GANs and Their Latent Representations 
Cameron Fabbri, Junaed Sattar
July 9, 2018
On Applications of GANs and Their Latent
Representations
Cameron Fabbri
Computer Science and Engineering
University of Minnesota
Minneapolis, MN 55455
fabbr013@umn.edu
Junaed Sattar
Computer Science and Engineering
University of Minnesota
Minneapolis, MN 55455
junaed@umn.edu
Abstract
This report describes various applications of Generative Adversarial Networks
(GANs) for image generation, image-to-image translation, and vehicle control.
With this, we also investigate the role played by the computed latent space, and
show various ways of exploiting this space for controlled image generation and
exploration. We show one pure generative method which we call AstroGAN that is
able to generate realistic images of galaxies from a set of galaxy morphologies. Two
image-to-image translation methods are also displayed: StereoGAN, which is able
to generate a pair of stereo images given a single image; Underwater GAN, which
is able to restore distorted imagery exhibited in underwater environments. Lastly,
we show a generative model for generating actions in a simulated self-driving car
environment.
1 Introduction
Vision is a very powerful tool used across many different academic and industrial communities
including healthcare, natural sciences, entertainment, and commerce. Images are able to capture
various levels of abstraction that are easy for a human to decipher, but may be more challenging for a
machine. For example, given an image of some scene, a human may be able to classify the scene as
“dangerous” or “not dangerous”, despite the many abstract details in the image that lead them to
that decision. Classically, for a machine to perform the same task, it must be able to extract explicit
details from the image that would cause it to decide on one or the other.
With the ever-growing amount of data, it is important to enable automated approaches for analyzing
and extracting useful information from these images. Recent advances in deep learning [14] have
enabled a variety of successful approaches towards doing just this. Given a sufficiently large amount
of data, machines are now able to reason about and extract meaningful details and abstract concepts
from images, in some areas (such as image classification [13]) with accuracy close to or better than
that of humans.
While this may appear to be a purely discriminative task, we present three reasons for which a
generative model can be of use. Some generative processes are mostly methods to be used early in
the automation pipeline, with the goal of enhancing the discriminative methods being used to extract
information. For each case we point to a specific application able to overcome and address the issue
at hand.
Missing Data Although we live in the age of (seemingly endless) information, there are still
environments and situations in which certain data is underrepresented. Thinking about image datasets
in a probabilistic sense as samples coming from some unknown distribution, few if any are uniform.
In other words, many times we have outliers in datasets, and these outliers in fact may be the most
interesting data points. There are many environments, especially in the natural sciences, in which it is
very difficult to obtain a sufficient amount of data across all types of subdomains. These difficulties
may arise from simply the lack of data naturally available, or the high cost that comes with collecting
it.
The field of astronomy is very interested in capturing images of galaxies, as these images capture
detailed morphologies able to give clues about the evolution of such galaxies [16, 17]. Furthermore,
due to the finite speed of light, images of galaxies at different distances are able to capture a younger
universe. While many of these morphological features are due to the properties exhibited by the
galaxy (e.g., the presence of spiral arms), others may be due to the line of sight. This line of sight
also plays a major role in missing data: because we cannot simply travel to another location of the
universe to capture a different view, we are stuck with a single view for each galaxy. If we are to
capture a galaxy with extremely rare morphological properties, then we unfortunately have to deal
with the fact that this may be the only viewpoint available. Due to many machine learning algorithms
requiring a vast amount of data to learn from, this makes it difficult to train these algorithms to
automatically extract information from these sorts of outliers. In Section 6 we display a method to
generate images of galaxies given a set of morphological features.
A very different form of missing data may come from sensor failure or sensor deprivation. In the
field of robotics, stereo vision is a commonly used form of perception. Given a pair of stereo images,
as well as the camera intrinsics and extrinsics, one is able to effectively calculate the distance to an
object of interest in the frame. However, given a robot with only a single camera, or a case in which
one of these cameras fails, this calculation cannot be performed. Section 4 shows a method
able to create a pair of stereo images given only a single image. Given the image captured by the
left camera, we are able to generate what would be seen by the right camera, and vice versa. This
application is general, and can be applied to systems with only a single camera, allowing such systems
to hallucinate a second camera.
Corrupted Data Many environments are naturally noisy in the visual domain. This noise may
come from the environment itself, or manmade alterations aimed at improving vision (such as night
vision or infrared). In either case, this noise causes many vision related tasks such as classification,
tracking, or segmentation to suffer. For this reason, we are interested in using generative modeling
to enhance these images such that they can be used further down the autonomy pipeline. Section 5
focuses on the underwater domain, for which distortion is especially prevalent.
In the underwater domain, light refraction, absorption, and scattering from suspended particles can
greatly affect optics. Due to the absorption of red wavelengths by the water, images tend to have
a blue or green hue. This effect worsens as one goes deeper and more red light is absorbed. This
distortion is extremely non-linear in nature, and is affected by a large number of factors, such as the
amount of light present, the amount of particles in the water, the time of day, and the camera being
used. Not only may this distortion cause difficulty in vision-based tasks, but visual inspection by
humans also becomes more difficult as the quality of the images degrades.
Correct Answers Tasks such as classification or regression naturally have one correct answer or
label. However, in some applications, there may be more than one viable choice. Consider playing a
video game in which you are driving a car around in a city. The field of Reinforcement Learning is
interested in finding an optimal policy in which to do so, in order to maximize some reward. However,
we argue that depending on the task, there may not be an optimal policy, and therefore no optimal
action to take at a certain timestep.
We look into the task of automatically driving a car in a simulation with absolutely no goal, other
than to mimic a human player. This leaves us with a situation in which, for a single input (e.g., an
intersection), more than one choice may be correct (e.g., turning left or right). Rather than
constrain the solution space by using regression, we formulate the problem using GANs in order to
generate a realistic action given a stack of n frames as input.
Section 2 gives a brief overview of the core design behind each method. Section 3 gives an overview
of our network architecture used in StereoGAN and UGAN. Section 4 presents StereoGAN, a method
to generate a stereo pair from a single image. Section 5 presents Underwater GAN, a method to
enhance and restore distorted underwater images. Section 6 presents AstroGAN, which is able to
generate galaxies conditioned on a set of morphological features. Section 7 presents Driving GAN
(DGAN), which aims to produce actions conditioned on frames. Finally, we conclude in Section 8.
2 Background
Here we give a very brief background on Generative Adversarial Networks (GANs) and some of their
variants. We do not make any theoretical claims or improvements; rather, we explore applications in
which GANs prove to be very useful, and in some cases necessary over previous generative models.
Readers are directed towards [8, 1, 20, 2, 22, 9] for further details in the theoretical domain.
Generative Adversarial Networks Generative Adversarial Networks (GANs) [8] represent a class
of generative models that are based on game theory. Given two differentiable functions modeled as
neural networks, a generator G attempts to generate data points given input noise that are able to fool
a discriminator D. The discriminator is given either real data points or data points produced by G,
and aims to classify them as real or fake. This is formally defined as the following minimax function:
$$\min_G \max_D \; \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))] \tag{1}$$
where Pz is a prior on the input noise (e.g., z ∼ N(0, 1)) and Pr is the true data distribution.
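As a rough illustration of Equation 1, the value of the game can be estimated from a batch of discriminator outputs. The sketch below is not part of any training code described in this report; the function name `gan_value` and the sample values are made up for illustration, and the discriminator outputs are assumed to lie strictly between 0 and 1.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the value function in Equation 1.

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives 2 * log(0.5) = -log(4).
undecided = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident, correct discriminator pushes the value toward 0 from below.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

D tries to increase this value while G tries to decrease it, which is why a discriminator that is fooled everywhere sits at the lower value.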
Conditional GANs Conditional GANs (cGANs) [18] introduce a simple method to give varying
amounts of control to the image generation process by some extra information y (e.g., a class label).
This is done by simply feeding y to the generator and discriminator networks, along with z and x,
respectively. The new objective function then becomes:
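$$\min_G \max_D \; \mathbb{E}_{x, y \sim P_r}[\log D(x, y)] + \mathbb{E}_{z \sim P_z,\, y \sim P_y}[\log(1 - D(G(z, y)))] \tag{2}$$
where $P_y$ denotes the distribution of the conditioning information y.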
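One common way to feed y to the generator is to concatenate it with the noise vector. The sketch below assumes y is a class label encoded as a one-hot vector; the helper names, the noise size of 8, and the four classes are illustrative choices, not values taken from this report.

```python
import random

def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def generator_input(z, y):
    """Concatenate noise z and condition y into a single generator input."""
    return list(z) + list(y)

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]  # z ~ N(0, 1); size 8 is arbitrary
y = one_hot(2, num_classes=4)                # hypothetical 4-class condition
g_in = generator_input(z, y)
```

The discriminator can be conditioned in the same way by appending y to (a flattened form of) x.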
Wasserstein GAN It has been shown that even in very simple scenarios the original GAN formulation
is very unstable. The discriminator as a classifier may not supply useful gradients for
the generator [2], and the vanishing gradient problem stops G from learning. On the other hand,
the Earth Mover (EM) distance does not suffer from these problems of vanishing gradients. The EM
distance, or Wasserstein-1, is defined as:
$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right] \tag{3}$$
where $\Pi(P_r, P_g)$ denotes the set of joint distributions $\gamma(x, y)$ whose marginals are $P_r$ and $P_g$,
respectively [2]. Because the infimum is intractable to compute directly, [2] instead proposes to
approximate W given a set of K-Lipschitz functions f by solving the following:
$$\max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{z \sim P_z}[f_w(G_\theta(z))] \tag{4}$$
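For intuition about Equation 3, the EM distance has a simple closed form in one special case: for two one-dimensional samples of equal size, the optimal coupling matches the sorted values. The sketch below covers only that special case, which is an assumption made for illustration; images require the approximation in Equation 4.

```python
def wasserstein1_1d(xs, ys):
    """EM (Wasserstein-1) distance between two equal-size 1-D samples.

    In one dimension the optimal transport plan pairs the i-th smallest
    point of xs with the i-th smallest point of ys.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

real = [0.0, 1.0, 3.0]
shifted = [8.0, 5.0, 6.0]  # the same points moved by 5, listed in another order
distance = wasserstein1_1d(real, shifted)
```

Unlike a classifier's output, this distance keeps changing smoothly as one sample is moved away from the other, which is the property the Wasserstein GAN relies on.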
Improved Wasserstein GANs In order to enforce the Lipschitz constraint without clipping the
weights of the discriminator, [9] instead penalizes the gradient, leading to a WGAN with gradient
penalty (WGAN-GP). More formally, a differentiable function is 1-Lipschitz if and only if its
gradients have norm at most 1 everywhere. In order to enforce this, WGAN-GP directly constrains the
norm of the gradient of the discriminator’s output with respect to its input. This leads to the new
objective function:
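$$\min_{w \in \mathcal{W}} \; \mathbb{E}_{z \sim P_z}[f_w(G_\theta(z))] - \mathbb{E}_{x \sim P_r}[f_w(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[(\lVert \nabla_{\hat{x}} f_w(\hat{x}) \rVert_2 - 1)^2\right] \tag{5}$$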
where $P_{\hat{x}}$ is defined as sampling uniformly along straight lines between pairs of points sampled from
the true data distribution, Pr, and the distribution assumed by the generator, Pg = Gθ(z). This is
motivated by the intractability of enforcing the unit gradient norm constraint everywhere. Because
the optimal discriminator has unit gradient norm along straight lines connecting coupled points from
the two distributions (see [9] for more details), the constraint is enforced uniformly along these
lines. As with the original WGAN, the discriminator is updated n times for every update of the
generator.
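The two ingredients of the gradient penalty, sampling a point on the line between a real and a generated sample and penalizing the deviation of the gradient norm from 1, can be sketched without an automatic differentiation library by using a linear critic, whose gradient is known in closed form. The linear critic and the value λ = 10 are assumptions made for this illustration only.

```python
import math
import random

def interpolate(x_real, x_fake, eps):
    """Point on the straight line between a real and a generated sample."""
    return [eps * r + (1.0 - eps) * f for r, f in zip(x_real, x_fake)]

def gradient_penalty_linear(w, lam=10.0):
    """Gradient penalty for a linear critic f_w(x) = w . x.

    The gradient of a linear critic with respect to x equals w at every
    point, so the expectation over interpolated points is a single term.
    """
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    return lam * (grad_norm - 1.0) ** 2

rng = random.Random(0)
x_hat = interpolate([1.0, 2.0], [3.0, 6.0], eps=rng.random())  # eps ~ U[0, 1]

unit_penalty = gradient_penalty_linear([0.6, 0.8])   # ||w|| = 1, no penalty
steep_penalty = gradient_penalty_linear([3.0, 4.0])  # ||w|| = 5, large penalty
```

With a neural network critic, the gradient at each interpolated point would instead be obtained by backpropagation, and the penalty averaged over the batch.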
Before turning to our applications, we go over the architecture of the generator used in StereoGAN
(Section 4) and UGAN (Section 5). The discriminator used in all three applications is the same, and
we utilize the Improved Wasserstein GAN formulation.
3 Network Details
Generator At their core, StereoGAN and UGAN are image-to-image translation methods. As such,
their main differences lie within the task at hand, leaving the design of the architecture to remain
the same. Our generator network is a fully convolutional encoder-decoder, similar to the work of
[11], which is designed as a “U-Net” [21] due to the structural similarity between input and output.
Encoder-decoder networks downsample (encode) the input via convolutions to a lower dimensional
embedding, which is then upsampled (decoded) via transpose convolutions to reconstruct an image.
The advantage of using a “U-Net” comes from explicitly preserving spatial dependencies produced
by the encoder, as opposed to relying on the embedding to contain all of the information. This is done
by the addition of “skip connections”, which concatenate the activations produced from a convolution
layer i in the encoder to the input of a transpose convolution layer n − i + 1 in the decoder, where n
is the total number of layers in the network. Each convolutional layer in our generator uses kernel
size 4 × 4 with stride 2. Convolutions in the encoder portion of the network are followed by batch
normalization [10] and a leaky ReLU activation with slope 0.2, while transpose convolutions in the
decoder are followed by a ReLU activation [19] (no batch norm in the decoder). Exempt from this
is the last layer of the decoder, which uses a tanh nonlinearity so that the output matches the input
range of [−1, 1]. Recent work has proposed Instance Normalization [27] to improve quality in
image-to-image translation tasks; however, we observed no added benefit.
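The effect of stacking 4 × 4, stride-2 convolutions and the pairing of skip connections can be checked with a few lines of arithmetic. The sketch below assumes a padding of 1, a 256 × 256 input, and 8 encoder layers; none of these values are stated in this section, so they should be read as hypothetical.

```python
def conv_out_size(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_sizes(input_size, num_layers):
    """Spatial size after each successive encoder convolution."""
    sizes = []
    size = input_size
    for _ in range(num_layers):
        size = conv_out_size(size)
        sizes.append(size)
    return sizes

def skip_pairs(n):
    """(encoder layer i, decoder layer n - i + 1) pairs, 1-indexed.

    The innermost encoder layer is left out because its output already is
    the input of the first decoder layer; this is an assumption.
    """
    return [(i, n - i + 1) for i in range(1, n // 2)]

sizes = encoder_sizes(256, 8)  # hypothetical 256 x 256 input, 8 encoder layers
pairs = skip_pairs(16)         # 8 encoder + 8 decoder layers
```

Under these assumptions each layer halves the spatial resolution until a 1 × 1 embedding is reached, and the outermost encoder layer is paired with the last decoder layer.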
Discriminator Our fully convolutional discriminator is modeled after that of [20], except no batch
normalization is used. This is due to the fact that WGAN-GP penalizes the norm of the discriminator’s
gradient with respect to each input individually, which batch normalization would invalidate. Our
discriminator is modeled as a PatchGAN [11, 15], which discriminates at the level of image patches.
As opposed to a regular discriminator, which outputs a scalar value corresponding to real or fake, our
PatchGAN discriminator outputs a 32 × 32 × 1 feature matrix, which provides a metric for
high-frequency structure.
4 StereoGAN
The use of stereo vision is very common in many computer vision applications, especially in the
field of robotics. By utilizing knowledge about the camera parameters, one is able to extract 3D
information for a given scene from a pair of images. Stereo matching, also known as disparity
mapping, aims to recover the depth of each pixel in an image. Classically, this is done using a
feature matching algorithm such as SIFT or ORB.
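For a rectified stereo pair, the relation between disparity and depth is the standard triangulation formula Z = f · B / d, where f is the focal length in pixels, B is the baseline between the cameras, and d is the disparity in pixels. The sketch below uses made-up camera values purely for illustration.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth Z = f * B / d for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Hypothetical camera: 700 px focal length, 12 cm baseline.
depth_m = depth_from_disparity(focal_px=700.0, baseline_m=0.12, disparity_px=14.0)
farther_m = depth_from_disparity(focal_px=700.0, baseline_m=0.12, disparity_px=7.0)
```

Halving the disparity doubles the estimated depth, which is why small matching errors matter most for distant objects.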
Recently there have been machine learning approaches to the problem of stereo matching [31, 12].
Often dealing with hardware may be an issue, whether it be from cost, maintenance, or design.
Towards this end, many techniques have been geared towards depth map prediction from a single
image [5, 6]. Most similar to our approach is the work of [7], who proposed a depth estimation
method by generating a disparity image. Given a pair of ground truth stereo images they aim to
generate the right image given the left and introduce a novel loss in order to enforce consistency
between disparities. The work of [29] has a similar approach, but for the task of creating stereo pairs
for 3D movies.
These methods tailored their loss functions towards the task of depth estimation. We take a different
approach. Our goal is to be able to generate a stereo pair given a single image for use in any task
involving stereo vision. Different from past methods, we train a deep convolutional neural network to
generate the left image given the right as well as generate the right image given the left. As discussed
in our experiments, being able to generate both allows the model to capture interesting 3D properties
that can be exploited.
Approach As the name suggests, we use a Generative Adversarial Network as our generative
model. GANs are able to produce sharp images, a quality often lacking when training with a pixel-wise
metric such as L2 alone. Past methods such as [7] that used a loss tailored to disparity were not
interested in the visual output: as long as the generated image improved the disparity calculation, it
was good enough. We are interested in this as a general approach for computing stereo images, so we
do care about the image quality.
Figure 1: Samples generated by StereoGAN compared with ground truth. The first column displays
IL, and the second column displays IR. Top: ground truth. Bottom: generated.
Generator Our architecture is similar to siamese networks [4], except we do not use shared weights
in the generators. Each generator is modeled from the pix2pix architecture [11]. We use two
generators: $G_1$, which takes as input a right image and outputs a left, and $G_2$, which takes as input a
left image and outputs a right. Let $I^L$ be an image seen from the left camera, and $I^R$ be an image
seen from the right camera. Our two generators are then defined as:
$$I^{L'} = G_1(I^R; \theta_{G_1}) \tag{6}$$
$$I^{R'} = G_2(I^L; \theta_{G_2}), \tag{7}$$
where $I^{L'}$ and $I^{R'}$ are the generated left and right images respectively, and $\theta_{G_1}$, $\theta_{G_2}$ represent the
weight parameters for each generator.
Discriminator There are multiple ways we could design our discriminator. Unlike a nonconditional
GAN in which the discriminator simply takes as input a single image, we want to condition our
generated image on a second image, namely either the left or right in the stereo pair (a true image).
We decided on simply stacking the channels, left image on top of the right. We need to provide D
with real samples as well as fake samples coming from G. Because we have two generators, we need
to send D fake samples from both. This is shown formally in Equation 8.
We use the Improved Wasserstein method as discussed in Section 2. For the sake of space and
simplicity in notation, we define the gradient penalty as
$$L_{GP} = \lambda_{GP}\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\right],$$
where $P_{\hat{x}}$ is defined as sampling along straight lines between points from the data distribution $P_r$
and the generator distribution $P_g$. We also consider the L1 loss, defined as
$$L_{L1} = \frac{1}{2}\left(\lVert I^L - I^{L'} \rVert_1 + \lVert I^R - I^{R'} \rVert_1\right).$$
Our final objective function is then:
$$L_{SGAN}(G, D) = \mathbb{E}[D(I^L \oplus I^R)] - \frac{1}{2}\, \mathbb{E}[D(I^L \oplus I^{R'}) + D(I^{L'} \oplus I^R)] + L_{GP} + L_{L1} \tag{8}$$
where $\oplus$ is concatenation along the image channels, and $I^{L'}$, $I^{R'}$ are defined in Equations 6 and 7,
respectively. We use a factor of 1/2 in Equation 8 because we are receiving two losses from D on our
generated data, so we want to average the loss between the two.
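The pieces of Equation 8 other than the gradient penalty can be written out on toy data. In the sketch below, images are lists of pixels and each pixel is a tuple of channel values; the discriminator scores are invented numbers standing in for the outputs of D, and all helper names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)

def concat_channels(left, right):
    """Channel concatenation: join the channels of two images pixel by pixel."""
    return [pl + pr for pl, pr in zip(left, right)]

def l1_norm_diff(a, b):
    """L1 norm of the difference over all pixels and channels."""
    return sum(abs(x - y) for pa, pb in zip(a, b) for x, y in zip(pa, pb))

def stereo_l1(left, left_gen, right, right_gen):
    """L1 term: average of the left and right reconstruction errors."""
    return 0.5 * (l1_norm_diff(left, left_gen) + l1_norm_diff(right, right_gen))

def critic_terms(d_real, d_fake_right, d_fake_left):
    """Score on real pairs minus the averaged score on the two fake pairs."""
    return mean(d_real) - 0.5 * (mean(d_fake_right) + mean(d_fake_left))

# Two-pixel RGB toy images.
left = [(0.0, 0.5, 1.0), (0.25, 0.25, 0.25)]
right = [(1.0, 0.5, 0.0), (0.75, 0.75, 0.75)]
left_gen = [(0.0, 0.5, 0.5), (0.25, 0.25, 0.25)]
right_gen = [(1.0, 0.5, 0.0), (0.5, 0.75, 0.75)]

pair = concat_channels(left, right)  # what D would receive for a real pair
l1 = stereo_l1(left, left_gen, right, right_gen)
gap = critic_terms([1.0, 3.0], [0.0, 1.0], [-1.0, 0.0])
```

Stacking two RGB images this way gives D a six-channel input, so the same discriminator can score real pairs and both kinds of partially generated pairs.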
Figure 2: Feature matching using an ORB detector. Top Row: Left and right real images. Bottom
Row: Left and right generated images.
Experiments We display the results of the generated images on a held out test set. We use the New
College Dataset [25], which consists of stereo images taken at New College, Oxford. We show a
qualitative comparison of our generated images with the ground truth, as well as a comparison of
running a stereo match algorithm. Figure 1 shows a comparison of our generated images with ground
truth. Whereas a simple homography would be able to rotate the image, it is not able to capture
translation. Our method, however, is able to capture translation, as well as to inpaint along the
edges of the images.
Because our method is able to generate both left and right images, an interesting experiment we can
run is the case of already having a stereo pair, and generating images outside of the camera’s field of
view. Consider Figure 3. The green lines can be thought of as the true stereo images, coming from
for example a mobile robot surveying a scene. If we are interested in something like structure from
motion, then having more than just two images is desirable. Instead of using G1 to generate IL from
IR, we can instead use IL as an input, and compute a new IL (and vice versa with G2). This is
shown explicitly in Figure 4.
A more comprehensive metric is to compare to past methods for computing disparity, but we leave
that for our future work.
Figure 3: Diagram of generating images outside of the camera’s field of view. Green lines represent
the true viewpoints of the two cameras, and the blue lines are the synthesized images:
$I^{L'}_2 = G_1(I^L_1)$ and $I^{R'}_2 = G_2(I^R_1)$.
Figure 4: Images generated outside of the camera’s field of view. The images correspond to those
in Figure 3. The two center images are the true stereo pair, and the two images on the ends are
generated. Our method is able to inpaint outside of the scene. For example, the wall in front of the
cart in the far left image was generated even though it cannot be seen in the two original images. This
transformation is easier to see in video format: https://i.imgur.com/QOS4au4.gif
5 Underwater GAN
Underwater images distorted by lighting or other circumstances lack ground truth, which is a necessity
for previous colorization approaches [30]. Furthermore, the distortion present in an underwater image
is highly nonlinear; simple methods such as adding a hue to an image do not capture all of the
dependencies. Physics-based models have been designed in order to capture this distortion, but
are unable to generalize given the extreme differences two separate environments may exhibit [23].
To capture these extreme nonlinear distortions without designing environment-specific
models, we use CycleGAN [32] as a distortion model in order to generate paired images for
training. CycleGAN is an unsupervised image-to-image style translation method (i.e., it does not need
paired samples). Given two domains X and Y, CycleGAN learns a mapping G : X → Y such that
images sampled from G(X) appear to have come from Y, as well as a mapping F : Y → X. In order
to ensure the translated image still contains properties from the original image, the model is
constrained by a cycle consistency loss, which encourages F(G(X)) ≈ X (and vice versa). Given a domain of underwater
images with no distortion, and a domain of underwater images with distortion, CycleGAN is able
to perform style transfer. Given an undistorted image, CycleGAN distorts it such that it appears to
have come from the domain of distorted images. These pairs are then used in our algorithm for image
reconstruction and enhancement.
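The cycle consistency constraint can be illustrated with stand-in mappings in place of trained networks. In the sketch below, the two toy functions that add and remove a color cast are hypothetical, and the use of a mean L1 distance is an assumption about how the constraint is measured.

```python
def cycle_consistency_loss(xs, g, f):
    """Mean L1 distance between x and F(G(x)) over a batch of samples."""
    total = 0.0
    for x in xs:
        recon = f(g(x))
        total += sum(abs(a - b) for a, b in zip(x, recon)) / len(x)
    return total / len(xs)

# Stand-in mappings: G halves the red channel, F doubles it back.
def add_cast(pixel):
    r, g, b = pixel
    return (r * 0.5, g, b)

def remove_cast(pixel):
    r, g, b = pixel
    return (r * 2.0, g, b)

batch = [(0.8, 0.4, 0.2), (0.2, 0.6, 0.9)]
perfect = cycle_consistency_loss(batch, add_cast, remove_cast)
lossy = cycle_consistency_loss(batch, add_cast, lambda p: p)  # F does nothing
```

When F exactly undoes G the loss is zero; any information G destroys that F cannot restore shows up as a positive loss.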
Dataset Generation Depth, lighting conditions, camera model, and physical location in the under-
water environment are all factors that affect the amount of distortion an image will be subjected to.
Under certain conditions, it is possible that an underwater image may have very little distortion, or
none at all. We let $I^C$ be an underwater image with no distortion, and $I^D$ be the same image with
distortion. Our goal is to learn the function $f : I^D \to I^C$. Because of the difficulty of collecting
underwater data, more often than not only $I^D$ or $I^C$ exist, but not both.
To circumvent the problem of insufficient image pairs, we use CycleGAN to generate $I^D$ from
$I^C$, which gives us a paired dataset of images. Given two datasets X and Y, where $I^C \in X$ and
$I^D \in Y$, CycleGAN learns a mapping F : X → Y. Figure 5 shows paired samples generated from
CycleGAN. From this paired dataset, we train a generator G to learn the function $f : I^D \to I^C$. It
should be noted that during the training process of CycleGAN, it simultaneously learns a mapping
G : Y → X, which is similar to f. In Section 5, we compare images generated by CycleGAN with
images generated through our approach. Because paired image-to-image translation is a simpler
problem, our method is able to outperform CycleGAN for this task.
Figure 5: Paired samples of ground truth and distorted images generated by CycleGAN. Top row:
Ground truth. Bottom row: Generated samples.
Methodology We use the Improved Training of WGAN as described in Section 2. Our network
architecture is described in Section 3. Conditioned on a distorted image ID, the generator is trained
to produce an image to try and fool the discriminator, which is trained to distinguish between the true
non-distorted underwater images and the supposed non-distorted images produced by the generator.
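Given $I^C \in X$, $I^D \in Y$, and G and D both deep neural networks, our objective then becomes
$$L_{WGAN}(G, D) = \mathbb{E}[D(I^C)] - \mathbb{E}[D(G(I^D))] + \lambda_{GP}\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\right], \tag{9}$$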
Figure 6: Samples from our ImageNet testing set (columns A through G; rows: Original and UGAN).
The network can both recover color and also correct color if a small amount is present.
where $P_{\hat{x}}$ is defined as samples along straight lines between pairs of points coming from the true data
distribution and the generator distribution, and $\lambda_{GP}$ is a weighting factor. In order to give G some
sense of ground truth, as well as capture low-level frequencies in the image, we also consider the L1
loss
$$L_{L1} = \mathbb{E}[\lVert I^C - G(I^D) \rVert_1]. \tag{10}$$
Combining these, we get our final objective function for our network, which we call Underwater
GAN (UGAN),
$$L^*_{UGAN} = \min_G \max_D \; L_{WGAN}(G, D) + \lambda_1 L_{L1}(G). \tag{11}$$
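From the generator's side of Equation 11, only two terms depend on G: the discriminator's score on restored images, which G wants to raise, and the L1 distance to the clean image, which G wants to lower. The sketch below combines them on invented numbers; the weight λ1 = 100 is a hypothetical value, not one reported here.

```python
def mean(values):
    return sum(values) / len(values)

def generator_loss(d_restored, l1, lambda_1=100.0):
    """Quantity the generator minimizes: -E[D(G(I^D))] + lambda_1 * L_L1."""
    return -mean(d_restored) + lambda_1 * l1

# Invented discriminator scores on two restored images and an L1 value.
loss = generator_loss(d_restored=[0.5, 1.5], l1=0.02)

# A restoration that is closer to the clean image lowers the loss.
better = generator_loss(d_restored=[0.5, 1.5], l1=0.01)
```

The relative size of λ1 decides how strongly the generator is tied to the paired target compared with simply satisfying the discriminator.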
Experiments We experimented on distorted underwater images taken from a test set. Given that
these images do not have ground truth, quantitative results may be difficult to acquire. Figure 6
displays qualitative results. Not only is UGAN able to restore completely distorted images, it is also
able to preserve correct color and restore only parts of the image that have been distorted (Column
G).
For a quantitative result, we look toward local image patch statistics, specifically the mean and
standard deviation of a patch. The standard deviation gives us a sense of blurriness because it
measures how far the pixel values deviate from the mean. In the case of images, a low standard
deviation suggests a blurring effect, since the values are clustered around a single pixel value.
Table 1 shows the mean and standard deviations of the RGB values for the local image patches seen
in Figure 8. Despite qualitative evaluation showing our method is much sharper, quantitatively it
shows only slight improvement over CycleGAN. Other metrics such as entropy are left as future work.
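The patch statistics themselves are straightforward to compute. The sketch below uses two made-up four-pixel patches with the same mean to show how blurring lowers the standard deviation; the use of the population standard deviation is an assumption, since the text does not say which variant was used.

```python
import statistics

def patch_stats(patch):
    """Mean and population standard deviation of a flattened patch."""
    return statistics.fmean(patch), statistics.pstdev(patch)

sharp_patch = [0.0, 1.0, 0.0, 1.0]    # strong local contrast
blurred_patch = [0.5, 0.5, 0.5, 0.5]  # same mean, contrast averaged away

sharp_mean, sharp_std = patch_stats(sharp_patch)
blur_mean, blur_std = patch_stats(blurred_patch)
```

Both patches have the same mean, so the mean alone cannot separate them; only the standard deviation reflects the loss of contrast.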
Latent Exploration Here, we discuss our insights into the inner workings of the model. With a
normal autoencoder, the latent embedding would contain all of the information about the image. One
can perform certain operations on this embedding, and see a change in the pixel space, as seen in
Section 6. However, UGAN makes use of skip connections due to the spatial similarity between
input and output. What this means is the latent embedding is no longer forced to contain every bit of
information about the input.
Intuitively, one would expect the skip connections to contain information dealing with the image
structure, and the embedding to contain color content. However, we concluded this is not the case.
Figure 7: Interpolation in the image space. The far left is the original image, and the far right is
corrected by UGAN. These intermediate images may be used as more training samples.
Figure 8: Local image patches extracted for qualitative and quantitative comparisons, shown in Table
1 (columns: Original, CycleGAN, UGAN). Each patch was resized to 64 × 64, but is shown enlarged
for visibility (best seen in PDF).
Our experiment was set up as follows. We took a clean underwater image $I^C$, and distorted it using
CycleGAN to create $I^D$. Then we used these as input to the generator network, and saved
their latent embeddings. Formally, the embeddings for $I^C$ and $I^D$ are $e^C \in \mathbb{R}^{512}$ and
$e^D \in \mathbb{R}^{512}$, respectively. Linear interpolation was used to interpolate between $e^C$ and $e^D$. These
values were sent through the decoder, along with the respective skip connections. However, there was
no change in the output image. We then sampled randomly from a Gaussian distribution in a range
[−1, 1] to use as the embeddings, and reached the same conclusion. This shows that the embedding
plays a minimal role, and most information is contained elsewhere in the model. This insight suggests
a way to reduce the number of model parameters without losing accuracy.
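The interpolation step in this experiment is plain linear interpolation between two vectors. The sketch below uses three-dimensional toy vectors as stand-ins for the 512-dimensional embeddings; the function names and the choice of five steps are illustrative.

```python
def lerp(a, b, t):
    """Linear interpolation (1 - t) * a + t * b between two vectors."""
    return [(1.0 - t) * ai + t * bi for ai, bi in zip(a, b)]

def interpolation_path(a, b, steps):
    """`steps` evenly spaced points from a to b, endpoints included."""
    return [lerp(a, b, i / (steps - 1)) for i in range(steps)]

e_clean = [0.0, 2.0, -1.0]     # toy stand-in for the embedding of the clean image
e_distorted = [1.0, 0.0, 1.0]  # toy stand-in for the embedding of the distorted image
path = interpolation_path(e_clean, e_distorted, steps=5)
```

Each point on the path would be passed through the decoder in place of the original embedding; the same function applies unchanged to flattened images when interpolating in pixel space.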
Table 1: Mean and Standard Deviation Metrics
Patch     Original       CycleGAN       UGAN
Red       0.43 ± 0.23    0.42 ± 0.22    0.44 ± 0.23
Blue      0.51 ± 0.18    0.57 ± 0.17    0.57 ± 0.17
Green     0.36 ± 0.17    0.36 ± 0.14    0.37 ± 0.17
Orange    0.3 ± 0.15     0.25 ± 0.12    0.26 ± 0.13
Interestingly, we can linearly interpolate in the image space. Figure 7 shows interpolation between
two images. While not particularly useful, it is still visually interesting. An idea for future work is to
use these interpolants as training points in order to capture a wider variety of distortions.
6 AstroGAN
This section presents AstroGAN, a generative model able to generate new, unseen images of galaxies
given a morphological description. This method allows astronomers to generate a galaxy with a
known set of morphological features in different viewpoints. Unlike many other natural images, the
field of astronomy contains images of galaxies in which only a single view is available due to the
extreme distance between us and the observed galaxy. Two different galaxies may contain the same
morphology, but appear to be visually different depending on the line of sight. These differences
can cause learning algorithms to suffer, as many state of the art machine learning techniques now
require vast amounts of data. AstroGAN provides a step towards artificially generating galaxies with
a specific morphology in order to improve or verify machine classifiers. A natural extension to our
model is the introduction of an encoder network, in order to allow one to perform modifications to a
specific galaxy of interest. We display our methods on two different datasets.
Figure 9: Interpolation along the redshift attribute using the EFIGI dataset. Each row contains the same z
and (first four) y values, and linearly interpolates along the redshift attribute. Despite the model never seeing a
redshift value outside of [2.635e-5, 0.08245], it is able to smoothly interpolate between [0.005, 0.1].
EFIGI The first dataset we use to evaluate our model is the EFIGI (“Extraction de Formes Idealisées
de Galaxies en Imagerie”) catalogue [3], which contains 4458 galaxies with detailed morphologies
classified by 10 expert astronomers. Each morphology contains 16 shape attributes, plus one attribute
accounting for redshift. From these, including the redshift attribute, we choose those with the most
visual impact, leaving us with 5 attributes. Further details on the attribute values can be found in [3].
We use 4058 of these in our training set, and hold out 400 for our test set.
Arm Strength The arm strength attribute measures the relative strength of the spiral arms in terms
of the flux fraction relative to the entire galaxy. This value ranges from 0 to 1 in increments of 0.25,
with 0 having very weak or no spiral arms, and 1 having the highest contribution of spiral arms.
Arm Curvature The arm curvature attribute measures the average intrinsic curvature of the spiral
arms. This value ranges from 0 to 1 in increments of 0.25, with 0 having wide open spiral arms, and
1 having tightly wound spiral arms.
Visible Dust The visible dust attribute measures the strength of the features revealing the presence
of dust, including obscuration and diffusion of star light. This value ranges from 0 to 1 in increments
of 0.25, with 0 having no dust, and 1 having a high amount of dust.
Multiplicity The multiplicity attribute quantifies the abundance of galaxies in the neighborhood
of the central galaxy. This value ranges from 0 to 1 in increments of 0.25, with 0 having no nearby
galaxies, and 1 having four or more nearby galaxies.
Redshift The redshift attribute measures the cosmological redshift exhibited by the galaxy. This is
a continuous value with a range of [2.635e-5, 0.08245].
While the first four of these attributes take discrete values, redshift is inherently continuous.
This combination of discrete and continuous attributes demonstrates the flexibility of our model in
capturing these properties. An important factor in the detection of morphological features is the effect
redshift can have on a galaxy. Human lifespans prevent viewing the same galaxy at vastly different
redshift values, so determining which morphological features remain visible as a galaxy
evolves is not straightforward. Figure 9 displays linear interpolation along the redshift attribute, while
the rest of the galaxy structure is kept the same. Visually, one can see the change in morphological
features, as the galaxy begins to dim at higher redshift values. Although the model is never given the
same galaxy at different redshift values (because of the impossibility of acquiring that data), there is
enough information contained in the individual data points for the model to learn the effect of redshift
on the visibility of morphological features. Surprisingly, the model was also able to extrapolate this
effect when the attribute value was varied outside the range present in the training data (e.g., Fig. 9).
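The interpolation described above can be sketched as follows: a single attribute vector is copied several times, and only its redshift entry is varied linearly. The index of redshift within y, the number of steps, and the function name are assumptions made for this example; the endpoints 0.005 and 0.1 are the interpolation range reported for Figure 9.

```python
import numpy as np

REDSHIFT_INDEX = 4  # assumed position of redshift within y (hypothetical)


def interpolate_attribute(y, index, start, stop, steps):
    """Return `steps` copies of y with entry `index` varied linearly from start to stop."""
    ys = np.tile(np.asarray(y, dtype=np.float32), (steps, 1))
    ys[:, index] = np.linspace(start, stop, steps)
    return ys


y = np.array([0.5, 0.75, 0.0, 0.25, 0.03], dtype=np.float32)
# The upper endpoint, 0.1, lies beyond the largest redshift in the training data (0.08245).
ys = interpolate_attribute(y, REDSHIFT_INDEX, 0.005, 0.1, steps=8)
print(ys[:, REDSHIFT_INDEX])
```

Each row of `ys` would then be passed to the generator together with one fixed noise vector z, so that only the redshift changes across the resulting images.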
Figure 10: Two sets of samples from the EFIGI dataset and generated samples using our GAN. Samples in the
left columns are real images from the true dataset. For the generated samples, each row uses the same y value
(attribute vector) as the true image in the left column, and each column uses a different z value. The attributes y
used for generation were not seen during training; they were held out in a separate test set.
Figure 10 displays novel galaxies generated using attributes from the test set. We compare these
generated galaxies with true images from the test set that have the same attributes. Qualitatively,
the generated galaxies all exhibit similar morphological features, while also displaying a
fair amount of diversity (i.e., the generated galaxies are not copies of one another). Additionally, we
show interpolation along individual attributes contained in y in Figure 12. The smooth interpolation
along the manifold allows new data not observed in the training or test set to be generated.
Galaxy Zoo The Galaxy Zoo project [17] is an ensemble of citizen science projects that aim to
use the crowd to assist in the morphological classification of galaxies. In a typical project, each
galaxy is inspected by multiple volunteers, who answer a number of questions pertaining to the object’s
appearance. The questions are organized in a tree-based structure, with the first questions dividing the
galaxies into broad morphological classes before subsequent questions look at increasingly detailed
aspects of a galaxy’s appearance. An example of a Galaxy Zoo decision tree is shown in visual form
in Figure 1 of Simmons et al. (2017). As a result of the full set of questions, a continuous attribute
y ∈ R^37 is associated with each galaxy, and the degree of consistency between the classifications
provided by volunteers provides a measure of the precision of their aggregate classification (see [28]
for more information). Of the 61,578 images, we use 60,000 for training and 1,578 for testing. Figure
11 displays instances generated from attributes contained in our test set. The network architecture is
identical to that used on the EFIGI dataset.
Figure 11: Two sets of samples from the Galaxy Zoo dataset and generated samples using our GAN. Samples
in the left columns are real images from the true dataset. For the generated samples, each row uses the same
y value (attribute vector) as the true image in the left column, and each column uses a different z value. The
attributes y used for generation were not seen during training; they were held out in a separate test set.
Table 2: RMSE on the Galaxy Zoo dataset
Network              Dataset     Training data  RMSE
Inception Resnet v2  Galaxy Zoo  Real           0.2126
Inception Resnet v2  Galaxy Zoo  Generated      0.2200
Inception Resnet v2  Galaxy Zoo  Both           0.2152
Alexnet              Galaxy Zoo  Real           0.1697
Alexnet              Galaxy Zoo  Generated      0.1775
Alexnet              Galaxy Zoo  Both           0.1750
Image Quality Assessment Determining the quality of images generated by a GAN is difficult
due to the lack of an explicit objective function. We explore various ways of quantitatively and
qualitatively analyzing our generated images. We perform a qualitative comparison by generating
samples from the Galaxy Zoo and EFIGI datasets and conditioning on attributes from our test set.
Figure 11 shows that conditioned on a real attribute, our model is able to generate different novel
galaxies which all exhibit the same visual morphology related to that attribute.
Predicting Morphologies We trained two popular networks to predict the galaxy morphologies
given an image to assess how accurately our generator was capturing the attributes. For both datasets,
we trained Alexnet [13] and Inception Resnet [26] on real data, generated data, and a combination of
the two. To evaluate, we calculate the root mean squared error over our holdout test sets. For both
datasets, we found Alexnet to outperform Inception Resnet.
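For reference, the evaluation metric can be sketched as below. This is a minimal illustration that averages the squared error over all samples and all attributes before taking the square root; whether the error was pooled in exactly this way is an assumption, and the arrays are toy values rather than data from our experiments.

```python
import numpy as np


def rmse(predicted, target):
    """Root mean squared error pooled over all samples and all attributes."""
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.sqrt(np.mean((predicted - target) ** 2)))


# Toy attribute vectors (not real data): every prediction is off by exactly 0.1.
target = np.array([[0.5, 0.75, 0.0, 0.25, 0.03],
                   [1.0, 0.25, 0.5, 0.0, 0.01]])
predicted = target + 0.1
print(rmse(predicted, target))
```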
Table 3 shows the RMSE results for the EFIGI dataset. Training on the generated data performs
slightly worse on the test set than training on the real data, but only by a very small amount.
Interestingly, for Alexnet, training on both the generated and real data outperforms training on only
the real data (0.3681 vs. 0.3703), although this does not hold for Inception Resnet (0.4072 vs. 0.4042).
Table 2 shows the RMSE results for the Galaxy Zoo dataset. Again, training on the real data outperforms
training on the generated data, but only by a small margin, indicating that our generated galaxies
exhibit accurate morphological features.
Table 3: RMSE on the EFIGI dataset
Network              Dataset  Training data  RMSE
Inception Resnet v2  EFIGI    Real           0.4042
Inception Resnet v2  EFIGI    Generated      0.4065
Inception Resnet v2  EFIGI    Both           0.4072
Alexnet              EFIGI    Real           0.3703
Alexnet              EFIGI    Generated      0.3875
Alexnet              EFIGI    Both           0.3681
Figure 12: Interpolation samples from the EFIGI dataset. Each row (labeled z1, z2, z3) uses a constant z value,
but a different y value. First row: interpolation along the arm curvature attribute. Second row: interpolation
along the arm strength attribute. Third row: interpolation along the dust attribute.
Future work includes training on a larger dataset, allowing the model to be more flexible in the
morphologies that it captures. Furthermore, as these images are quite small (64 × 64), we aim to
increase the resolution for a more practical impact.
7 DGAN
This section presents some initial findings from our in-progress work on self-driving cars using GANs.
Rather than take a reinforcement learning approach, we structure the problem without a specific goal
in mind: train an agent to drive a car in simulation such that it is indistinguishable from a human
player. This leads away from a discriminative approach and towards a generative model able to
capture a distribution of realistic actions.
Simulation We use the video game Grand Theft Auto V (GTAV) for our simulation environment.
With realistic graphics, modifications (mods) able to change the weather, hundreds of different types
of cars and motorcycles, and a multitude of landscapes including desert, city, and rural, it is a
well-suited sandbox environment for teaching an agent. Although GTAV is only a simulation, the work
of [24] showed that the realism of simulated images can be improved by introducing a “refiner” network.
That approach could be applied here to bring the already very realistic graphics even closer to
real-world data for training.
Capturing Data We capture data by recording a human playing the game, driving around the map
with no end location or goal in mind. For each frame, we record the keys pressed during that frame.
The possible actions are W, A, S, D, WA, WD, DS, DA, and NO_KEY, giving a one-hot action
vector y ∈ R^9 (WASD correspond to the arrow keys on a keyboard, with W=up, S=down, etc.). An
in-game day takes 48 minutes in real time, meaning that by having a person play the game for an hour
we can obtain data across varying lighting conditions. Changing the weather to rainy or cloudy, as
well as driving the many different types of cars available, can expand our dataset even further.¹
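The action encoding described above can be sketched as a simple lookup. The action names are those listed in the text; the order in which they are assigned to positions in y, and the function name, are assumptions made for this example.

```python
import numpy as np

# Action labels as listed in the text; their ordering within y is an assumption.
ACTIONS = ("W", "A", "S", "D", "WA", "WD", "DS", "DA", "NO_KEY")


def one_hot_action(action):
    """Encode a recorded key combination as a one-hot vector y in R^9."""
    y = np.zeros(len(ACTIONS), dtype=np.float32)
    y[ACTIONS.index(action)] = 1.0  # raises ValueError for an unknown action
    return y


print(one_hot_action("WA"))
print(one_hot_action("NO_KEY"))
```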
Approach Although this project is at an early stage of implementation, we have been training a vanilla
conditional GAN (not WGAN like the rest of this report) to generate an action given a series of n = 4 frames.
The viewpoint given to the network is a third-person view of the vehicle, as seen in Figure 13.
Figure 13: Two screenshots from GTAV showing very different environment, weather, and lighting
conditions. The networks are trained using this viewpoint as input.
While this viewpoint can be changed, it is more natural for a human player and therefore a good first
step. The network architectures for the generator and discriminator are the same, and both follow the
design of the discriminator in [20]. We currently do not have any preliminary results.
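As an illustration of the input described in this section, the sketch below keeps a rolling window of the n = 4 most recent frames and concatenates them along the channel axis. The class name, the 64 × 64 × 3 frame size, and the choice of channel-wise concatenation are all assumptions made for this example and do not describe the actual implementation.

```python
from collections import deque

import numpy as np

N_FRAMES = 4  # window length n from the text


class FrameStack:
    """Rolling buffer holding the most recent `n_frames` frames (hypothetical helper)."""

    def __init__(self, n_frames=N_FRAMES):
        self.frames = deque(maxlen=n_frames)

    def push(self, frame):
        # Once the buffer is full, appending drops the oldest frame automatically.
        self.frames.append(np.asarray(frame, dtype=np.float32))

    def ready(self):
        return len(self.frames) == self.frames.maxlen

    def stacked(self):
        """Concatenate the buffered frames along the channel axis: (H, W, C * n)."""
        if not self.ready():
            raise ValueError("not enough frames buffered yet")
        return np.concatenate(list(self.frames), axis=-1)


stack = FrameStack()
for t in range(5):
    # Dummy frames filled with the time step t; the frame size is an assumption.
    stack.push(np.full((64, 64, 3), t))
x = stack.stacked()
print(x.shape)
```

The stacked array would be the conditioning input, paired with the one-hot action recorded for the most recent frame.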
8 Conclusion
This report presented three applications of Generative Adversarial Networks for both image generation
and image-to-image translation. Both qualitative and quantitative results were shown. Our current
and future work explores the following:
• UGAN: Extend it in order to be end-to-end trainable.
• UGAN: Improve the distorter network to capture a wider variety of underwater scenes.
• StereoGAN: Compute disparity using 2+ images to compare with state of the art.
• AstroGAN: Improve the sharpness and quality, as well as extend to further datasets.
• DGAN: Incorporate goals such as a destination to test the generalization of the network.
¹ A full list of vehicles, planes, bikes, and boats that are available in-game (without mods) can be found here:
http://grandtheftauto.net/gta5/vehicles
References
[1] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adver-
sarial networks. arXiv preprint arXiv:1701.04862, 2017.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint
arXiv:1701.07875, 2017.
[3] Anthony Baillard, Emmanuel Bertin, Valérie De Lapparent, Pascal Fouqué, Stéphane Arnouts,
Yannick Mellier, Roser Pelló, J-F Leborgne, Philippe Prugniel, Dmitry Makarov, et al. The
efigi catalogue of 4458 nearby galaxies with detailed morphology. Astronomy & Astrophysics,
532:A74, 2011.
[4] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-
convolutional siamese networks for object tracking. In European conference on computer vision,
pages 850–865. Springer, 2016.
[5] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image
using a multi-scale deep network. In Advances in neural information processing systems, pages
2366–2374, 2014.
[6] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single
view depth estimation: Geometry to the rescue. In European Conference on Computer Vision,
pages 740–756. Springer, 2016.
[7] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth
estimation with left-right consistency. In CVPR, volume 2, page 7, 2017.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural
information processing systems, pages 2672–2680, 2014.
[9] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville.
Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028, 2017.
[10] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the
32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine
Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.
[11] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with
conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
[12] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham
Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo
regression. CoRR, vol. abs/1703.04309, 2017.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems, pages
1097–1105, 2012.
[14] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436,
2015.
[15] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with markovian gen-
erative adversarial networks. In European Conference on Computer Vision, pages 702–716.
Springer, 2016.
[16] Simon Lilly, David Schade, Richard Ellis, Olivier Le Fevre, Jarle Brinchmann, Laurence Tresse,
Roberto Abraham, Francois Hammer, David Crampton, Matthew Colless, et al. Hubble space
telescope imaging of the cfrs and ldss redshift surveys. ii. structural parameters and the evolution
of disk galaxies to z ∼ 1. The Astrophysical Journal, 500(1):75, 1998.
[17] Chris J Lintott, Kevin Schawinski, Anže Slosar, Kate Land, Steven Bamford, Daniel Thomas,
M Jordan Raddick, Robert C Nichol, Alex Szalay, Dan Andreescu, et al. Galaxy zoo: mor-
phologies derived from visual inspection of galaxies from the sloan digital sky survey. Monthly
Notices of the Royal Astronomical Society, 389(3):1179–1189, 2008.
[18] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint
arXiv:1411.1784, 2014.
[19] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines.
In Proceedings of the 27th international conference on machine learning (ICML-10), pages
807–814, 2010.
[20] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with
deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[21] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for
biomedical image segmentation. In International Conference on Medical Image Computing
and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[22] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.
Improved techniques for training gans. In Advances in Neural Information Processing Systems,
pages 2234–2242, 2016.
[23] Yoav Y Schechner and Nir Karpel. Clear underwater vision. In Computer Vision and Pattern
Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference
on, volume 1, pages I–I. IEEE, 2003.
[24] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb.
Learning from simulated and unsupervised images through adversarial training. In The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, page 6, 2017.
[25] Mike Smith, Ian Baldwin, Winston Churchill, Rohan Paul, and Paul Newman. The new college
vision and laser data set. The International Journal of Robotics Research, 28(5):595–599, 2009.
[26] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4,
inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12,
2017.
[27] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing
ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[28] Kyle W Willett, Chris J Lintott, Steven P Bamford, Karen L Masters, Brooke D Simmons,
Kevin RV Casteels, Edward M Edmondson, Lucy F Fortson, Sugata Kaviraj, William C Keel,
et al. Galaxy zoo 2: detailed morphological classifications for 304 122 galaxies from the sloan
digital sky survey. Monthly Notices of the Royal Astronomical Society, 435(4):2835–2860,
2013.
[29] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3d: Fully automatic 2d-to-3d video conver-
sion with deep convolutional neural networks. In European Conference on Computer Vision,
pages 842–857. Springer, 2016.
[30] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European
Conference on Computer Vision, pages 649–666. Springer, 2016.
[31] Yiran Zhong, Yuchao Dai, and Hongdong Li. Self-supervised learning for stereo matching with
self-improving ability. arXiv preprint arXiv:1709.00930, 2017.
[32] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image
translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.