PermNet: Permuted Convolutional Neural Network

A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Rishabh Mehta IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

Advisors: Ju Sun, Zhi-Li Zhang

May, 2021

© Rishabh Mehta 2021 ALL RIGHTS RESERVED

Acknowledgements

I would like to thank my advisors, Professor Ju Sun and Professor Zhi-Li Zhang, for their continued technical guidance and for keeping me motivated throughout the project. I would also like to thank my committee member, Professor Changhyun Choi, for his help in the thesis review process. A special shoutout to Professor Ju Sun for uttering the word "permutation" during a technical discussion, which sparked the idea of permuted convolutions, and for helping me arrive at the exact formulation. I would also like to thank all my friends in my academic life for bearing with my constant barrage of presentations and technical discussions and for giving me helpful advice.

Dedication

To my parents, my sister, my friends and the wonderful advisors at UMN, for motivating me to stay on top of a hectic academic life and for supporting my attempt to pursue a risky yet rewarding thesis project.

Abstract

Convolution filters in CNNs extract patterns from the input by aggregating information across the height, width and channel dimensions. Aggregation across the height and width dimensions, performed via depthwise convolution, identifies neighborhood patterns and is therefore very intuitive. However, the way information is aggregated across the channel dimension, by plain channel summation, appears mathematically simplistic and chosen out of convenience. In this project we attempt to improve the channel aggregation operation. The first approach introduces weighted channel summation in convolutions.
The second approach introduces permuted convolutions, which attempt to perform pseudo-width scaling by generating new constrained filters from existing filters. Implementing permuted convolutions comes with several challenges, such as permutation explosion, stochasticity, and higher memory and computation requirements. To resolve these issues, we develop multiple variants of permuted convolutions and present their advantages and disadvantages. Lastly, we provide empirical results showcasing the performance of weighted channel summation networks and permuted convolution networks, and present our findings and recommendations for future work.

Contents

Acknowledgements i
Dedication ii
Abstract iii
List of Tables vi
List of Figures vii
1 Introduction 1
  1.0.1 Introduction to network scaling 2
2 Convolution Background 5
3 Related Work 12
4 Problem Statement and Motivation 13
5 Methodology 16
  5.0.1 WeightedNet 16
  5.0.2 PermNet 18
6 Experiments and Analysis 28
  6.1 Dataset 28
  6.2 Training & System Details 28
  6.3 Baseline Model 29
    6.3.1 ResNet-18 29
    6.3.2 Reduced ResNet-18 (RResNet-18) 29
    6.3.3 LeNet-5 30
    6.3.4 SmallCNN 30
  6.4 PermNet implementation & running time 30
  6.5 Results 31
7 Findings and Recommendations 33
8 Future Work 35
  8.1 PermImitatorNet 35
9 Conclusion 38
References 39

List of Tables

6.1 Performance of different models on the CIFAR-100 dataset. The metric used for comparison is the top-1 class prediction accuracy. 32

List of Figures

1.1 Model scaling. (a) The baseline network to be scaled. (b-d) Networks with width, depth and input image resolution scaled respectively. Source: EfficientNet [1] 3
2.1 Typical convolution in CNN. Source: EfficientNet [1] 6
2.2 Depthwise convolution 7
2.3 Relationship between normal convolution and depthwise convolution 7
2.4 Depthwise separable convolution 9
2.5 Grouped convolution 10
2.6 Channel shuffling approach to counter drawbacks of grouped convolution. Source: ShuffleNet [2] 11
5.1 An example of weighted convolution in WeightedNet. 18
5.2 PermIterWeightedNet: permuted convolution visualized. The convolution filters are depthwise convolved with the input. The output channels of the depthwise convolution are then randomly shuffled under constraints. The figure showcases one such shuffling scenario with shuffled channel numbers. The final output is obtained by performing 1x1 grouped convolution with the number of groups equal to the number of convolution filters used during depthwise convolution. 21
5.3 PermIterNet: permuted convolution visualized. The convolution filters are depthwise convolved with the input. The output channels of the depthwise convolution are then randomly shuffled under constraints. The figure showcases one such shuffling scenario with shuffled channel numbers. The final output is obtained by summing up the channels for each filter in the layer. 22
5.4 PermShuffleNet: shuffled convolution visualized. The filters are convolved with the input in typical convolution fashion. The convolved output channels are then randomly shuffled to obtain the final output of the shuffled convolution layer. The figure showcases one such shuffling scenario with shuffled channel numbers. 23
5.5 PermDeterWeightedNet: deterministic immutable permuted convolutions, with weighted summation across filter channels. 24
5.6 PermDeterNet: deterministic immutable permuted convolutions, with typical summation across filter channels. 25
5.7 PermAutoMultiNet: deterministic permuted convolution with learnable permutations. The sparsity constraint is imposed on multiple 1x1 convolutions with the help of an l1 loss. 26
5.8 PermAutoNet: deterministic permuted convolution with learnable permutations. The sparsity constraint is imposed on the grouped 1x1 convolutions with the help of an l1 loss. 27
8.1 PermImitatorNet: the PermDeterNet attempts to imitate the expert network. This is achieved with the help of a loss function, added to the network, that computes the filter weight differences between the additional constrained filters in PermNet and the extra filters in the pre-trained baseline network. 36

Chapter 1
Introduction

The field of Computer Vision has taken a huge leap since the AlexNet [3] architecture was proposed for the ImageNet 2012 challenge [4]. By achieving state-of-the-art performance, the AlexNet authors showcased the advantages of convolution-based neural architectures and brought Convolutional Neural Networks (CNNs) into the limelight. Since 2012, every winning architecture in the ImageNet challenge [4] has been based on a CNN.
Researchers then began experimenting with deeper, higher-capacity CNNs, such as the VGG [5] and Inception [6] architectures, and many ideas have since been proposed to improve the learning power of deep CNNs. In 2016, the winning ResNet [7] architecture introduced residual connections, which provide highways for gradients during backpropagation and control the complexity of the network by allowing it to learn identity mappings. The DenseNet [8, 9] architecture later built on ResNet [7] by extending the residual connection highways from any layer to all of its succeeding layers inside a DenseBlock. The EfficientNet [1] architectures proposed by Google in 2019 provided a deeper understanding of scaling the different dimensions of a CNN and achieved state-of-the-art results on the by-then discontinued ImageNet challenge. As the power of deep CNNs became apparent, researchers started creating scaled-up versions of CNNs for various vision tasks, which tend to outperform their shallow counterparts. On the other hand, some researchers have been moving in the opposite direction, aiming to create lower-complexity CNNs that can run on resource-constrained devices such as edge devices and mobile phones. The field of network compression, which creates models with low memory and computation requirements without a significant drop in model accuracy, has taken off in recent years as tech industry giants attempt to bring the power of deep neural networks to mobile phones and edge devices. Seminal architectures proposed in the field, such as MobileNet [10, 11] and SqueezeNet [12], have brought about a revolution in edge computing. Today, various network compression techniques such as model pruning [13, 14], quantization [15, 16, 13] and parallelism [15] are used to create very compact CNNs with good accuracy and performance on the task at hand.
This progress can be seen across all kinds of visual prediction tasks, such as super resolution, object detection, object classification and object tracking. Taking single image super resolution as an example, the state-of-the-art networks in the field are tens of millions of parameters large [17, 18, 19]. However, recent progress in network compression has led to very compact CNNs with a few hundred thousand parameters that perform almost as well as the state-of-the-art architectures [20].

1.0.1 Introduction to network scaling

Creating higher- or lower-complexity CNNs by changing the dimensions of a baseline network is called network scaling. As the EfficientNet [1] paper points out, there are three dimensions along which networks can be scaled up or down: network depth, network width and input resolution. This dimension scaling can be visualized in Figure 1.1, where a baseline network is scaled up along each of the three dimensions. Network depth is scaled by increasing the number of layers of the CNN. Network width is scaled by increasing the number of convolution filters in each convolution layer. Input resolution is scaled by using higher-resolution images as input. All three upscaling dimensions increase network complexity. This increase can be understood in two terms: the increase in the number of learnable parameters of the network, and the increase in computation, measured by the number of floating point operations (FLOPs) of the network. The number of parameters of a fully convolutional network scales linearly with network depth and quadratically with network width. The parameter count is not affected by the input resolution; however, the FLOPs grow quadratically with the input resolution scale.
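As a concrete illustration, the scaling relationships above can be checked with a few lines of Python. This is our own sketch; the helper names are illustrative and not part of any framework. Parameters of a conv layer are input channels x filter height x filter width x output channels (biases ignored).

```python
def conv_params(c_in, c_out, k):
    """Parameter count of one conv layer (bias ignored): c_in * k * k * c_out."""
    return c_in * k * k * c_out

def network_params(widths, k=3):
    """Total parameters of a plain CNN given its per-layer channel widths."""
    return sum(conv_params(widths[i], widths[i + 1], k)
               for i in range(len(widths) - 1))

base = network_params([3, 16, 32, 64])        # a small baseline
deeper = network_params([3, 16, 32, 64, 64])  # depth +1: adds one layer's worth
wider = network_params([3, 32, 64, 128])      # width x2: roughly 4x the parameters
print(base, deeper, wider)                    # 23472 60336 93024
```

Doubling every width multiplies each layer's parameter count by about four (both its input and output channel counts double), while adding a layer only adds that layer's parameters, matching the linear-in-depth, quadratic-in-width behavior described above.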
However, in most computer vision tasks higher-resolution input images are not available, and hence network depth and width are the two most important model scaling dimensions.

Figure 1.1: Model scaling. (a) The baseline network to be scaled. (b-d) Networks with width, depth and input image resolution scaled respectively. Source: EfficientNet [1]

The important thing to note here is that the number of parameters is most affected by an increase in network width, due to the quadratic proportionality. Hence width scaling helps improve network capacity, but at the cost of a parameter increase. With permuted convolutions, it is possible to generate additional filters that share weights with the base filters in the network, at no additional parameter cost. Since we only simulate the additional filters without adding any new weights to the network, we call this process pseudo-width scaling. Pseudo-width scaling might help increase network capacity passively; however, this needs to be verified, and we perform empirical comparisons to test whether it holds. In this paper we compare PermNets with their corresponding CNN counterparts and study the differences in their learning ability. We focus on the image classification problem for the scope of this paper and perform all experiments on the popular CIFAR-10 [21] dataset. The main contributions of this project are as follows:

• Proposing a change to the convolution layer: weighted summation of filter channels instead of the direct summation performed in typical convolution.
• Proposing permuted convolutions to simulate the effect of width scaling, along with a discussion of various architectures of networks employing permuted convolutions, called PermNets.

The remaining chapters are organized as follows:

• Chapter 2 introduces the convolution types that will be used to build up the permuted convolution formulation.
• Chapter 3 covers the related work done in this field.
In this chapter we discuss various architectures proposed for making networks more compact and suited for mobile devices, as well as an architecture that is closest to PermNet.
• Chapter 4 details the problem statement and the motivation behind it.
• Chapter 5 discusses in detail the architectures of the weighted channel summation network (WeightedNet) and the PermNet variants.
• Chapter 6 describes the empirical results of the experiments performed to understand the performance of PermNet and WeightedNet.
• Chapter 7 presents our findings and recommendations about training PermNet.
• Chapter 8 discusses the future work we consider important to make PermNet practical.
• Chapter 9 presents our conclusions about the project.

Chapter 2
Convolution Background

The purpose of convolution is to extract useful features from the input. In a Convolutional Neural Network, different features are extracted through convolution using filters whose weights are learned automatically during training. Many variants of convolution have been proposed in the image processing community; we discuss the ones relevant to understanding the permuted convolution architecture.

Normal convolution

The typical convolution in a CNN extracts features from the height, width and channel dimensions of the input. This process can be divided into two parts: pattern extraction across the height and width dimensions, and channel summation across the depth dimension. During the convolution process, rigorously speaking, cross-correlation is performed between each channel of the input and the corresponding channel of the convolution filter. The cross-correlation output contains the features extracted from the height and width dimensions of the input. The per-channel cross-correlation outputs are then summed across the channel dimension to obtain the final output of the convolution filter.
Through this channel summation phase, the patterns extracted by cross-correlation are aggregated, yielding the final pattern extracted by the convolution filter across the height, width and channel dimensions. Figure 2.1 highlights the typical convolution process. The number of parameters of a typical convolution layer equals the product of the number of input channels, filter height, filter width and the number of filters in the layer.

Figure 2.1: Typical convolution in CNN. Source: EfficientNet [1]

Depthwise convolution

Depthwise convolution is a sub-part of typical convolution. In depthwise convolution, each channel of the input is convolved with the corresponding channel of the filter, so each 2D slice from both input and filter is convolved separately. The output of depthwise convolution is the concatenation of all convolved channels. For each filter, the number of output channels of depthwise convolution is the same as the number of input channels. Depthwise convolution only extracts patterns from the height and width dimensions; it does not aggregate these patterns through channel summation. Figure 2.2 represents the depthwise convolution operation. The number of parameters of a depthwise convolution is the same as that of a normal convolution with the same filter bank.

Figure 2.2: Depthwise convolution

Figure 2.3: Relationship between normal convolution and depthwise convolution

An intuitive relationship between depthwise and normal convolution is that normal convolution first performs depthwise convolution and then performs channel summation. This is evident from Figure 2.3, where summing up the channels output by depthwise convolution yields the result of normal convolution. Hence depthwise convolution is, simply put, normal convolution without the channel summation step.

Depthwise separable convolution

Depthwise separable convolutions were popularized by the Xception [22] and MobileNet [10, 11] papers.
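The equivalence between normal convolution and depthwise convolution followed by channel summation can be verified numerically. The following is a minimal numpy sketch of our own (valid-mode cross-correlation, single filter, no stride or padding), not the implementation used in our experiments:

```python
import numpy as np

def cross_correlate_2d(x, w):
    """Valid-mode 2D cross-correlation of one input channel with one filter channel."""
    H, W = x.shape
    f, g = w.shape
    out = np.zeros((H - f + 1, W - g + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + f, j:j + g] * w)
    return out

def depthwise(x, w):
    """Depthwise convolution: channel c of the input with channel c of the filter."""
    return np.stack([cross_correlate_2d(x[c], w[c]) for c in range(x.shape[0])])

def normal_conv(x, w):
    """Normal convolution computed directly over the full 3D window."""
    C, H, W = x.shape
    f = w.shape[1]
    out = np.zeros((H - f + 1, W - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[:, i:i + f, j:j + f] * w)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))   # a 3-channel (e.g. RGB) input
w = rng.standard_normal((3, 3, 3))   # one 3x3 filter with 3 channels

# Normal convolution == depthwise convolution followed by channel summation.
assert np.allclose(normal_conv(x, w), depthwise(x, w).sum(axis=0))
```

The final assertion is exactly the relationship shown in Figure 2.3: summing the depthwise output across the channel axis reproduces the normal convolution output.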
Depthwise separable convolution works in two stages. In the first stage, depthwise convolution is performed with a single 3D filter. In the second stage, the channels of the depthwise-convolved output are aggregated using pointwise convolutions, also called separable convolutions. Pointwise convolutions use filters of height and width 1. Intuitively, a pointwise convolution performs a weighted summation across the channel dimension. In its typical formulation, depthwise separable convolution uses only one FxF filter, where F is greater than 1; this FxF filter is used during the depthwise convolution. Multiple pointwise convolution filters are then used to generate the final output channels. The advantage of depthwise separable convolution is that it can reduce the parameter count of a convolution layer significantly. The reduction occurs because the output channel generation task is offloaded onto the pointwise convolutions instead of the FxF convolution. A pointwise convolution has filter size 1, and hence its parameter count is much smaller than that of a normal convolution, all else being equal. Due to this lower parameter count, depthwise separable convolutions have become very popular in the domain of network compression and for building networks that run on resource-constrained devices.

Figure 2.4: Depthwise separable convolution

Grouped convolution

First proposed by the authors of AlexNet [3] to allow the network to be trained on two separate GPUs in parallel, grouped convolution provides parallel paths for convolution layer computation. In grouped convolution, both the input and the filters are divided into groups that are convolved separately. The outputs of the groups are then concatenated to obtain the final output. Grouped convolution can reduce the parameters of the convolution layer: the parameter count of the layer is inversely proportional to the number of groups.
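The parameter-count claims in this chapter can be compared side by side with a short sketch. The helper names below are our own illustration; the formulas follow directly from the definitions above:

```python
def normal_params(c_in, c_out, f):
    """Normal convolution: c_out filters, each c_in * f * f weights."""
    return c_in * f * f * c_out

def depthwise_separable_params(c_in, c_out, f):
    """One FxF depthwise filter bank plus c_out pointwise (1x1) filters."""
    return c_in * f * f + c_in * c_out

def grouped_params(c_in, c_out, f, groups):
    """Each group maps c_in/groups channels to c_out/groups channels."""
    return (c_in // groups) * f * f * (c_out // groups) * groups

print(normal_params(64, 64, 3))               # 36864
print(depthwise_separable_params(64, 64, 3))  # 4672
print(grouped_params(64, 64, 3, 4))           # 9216  (= 36864 / 4)
```

The grouped-convolution count is exactly the normal count divided by the number of groups, matching the inverse proportionality stated above, while the depthwise separable layer is cheaper still because its FxF weights are not multiplied by the output channel count.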
The reduction in parameters comes from the fact that the input and output channels of each group are reduced by a factor of the number of groups. The other advantage of grouped convolution is efficient learning: since the convolutions are divided into several paths, each path can be handled by a different GPU in parallel. However, one big disadvantage of grouped convolution is that information across the channel dimension is not aggregated across separate groups. Each filter group only handles information passed down from a fixed portion of the previous layer. For example, in Figure 2.6, on the left, the first filter group (red) only processes information passed down from the first third of the input channels; the same holds for the green and blue groups. As such, each filter group is limited to learning a few specific features. This property blocks information flow between channel groups and weakens representations during training. To overcome this problem, channel shuffling is applied.

Figure 2.5: Grouped convolution

Shuffled grouped convolution

The idea of channel shuffle is to improve on grouped convolution by mixing up the information from different filter groups. In Figure 2.6, on the right, a feature map is obtained after applying the first grouped convolution GConv1 with 3 filter groups. Before feeding this feature map into the second grouped convolution, the channels in each group are divided into several subgroups, and these subgroups are then mixed up. Through this channel shuffling operation, information can flow between channel groups, strengthening the representation of the input during training. The ShuffleNet [2] authors adopted shuffled grouped convolution for their architecture and have popularized the method since. We take inspiration from the idea of channel shuffling to arrive at the permuted convolution formulation.
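The channel shuffle just described is commonly implemented as a reshape-transpose-reshape, as in ShuffleNet. The following is a minimal numpy sketch of that trick (our own illustration on a (C, H, W) tensor, without the batch dimension):

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet-style channel shuffle on a (C, H, W) tensor.

    Channels are viewed as (groups, channels_per_group), transposed, and
    flattened back, so each group in the next layer receives channels
    drawn from every group of the previous layer."""
    c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(groups, c // groups, h, w)
    x = x.transpose(1, 0, 2, 3)
    return x.reshape(c, h, w)

x = np.arange(6).reshape(6, 1, 1)    # channels labeled 0..5, i.e. 2 groups of 3
shuffled = channel_shuffle(x, groups=2)
print(shuffled.ravel())              # [0 3 1 4 2 5]
```

With two groups {0, 1, 2} and {3, 4, 5}, the output interleaves them as 0, 3, 1, 4, 2, 5, so any subsequent grouped convolution sees a mixture of both original groups. Note that the operation is deterministic; no randomness is involved.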
Figure 2.6: Channel shuffling approach to counter drawbacks of grouped convolution. Source: ShuffleNet [2]

Chapter 3
Related Work

Since the AlexNet paper popularized CNNs, convolution layers have become the primary modules for extracting patterns from visual data. As network sizes kept increasing, the field of network compression formed to allow state-of-the-art CNNs to run on resource-constrained devices. Many approaches have been published for building compact models; the most popular and novel methods were proposed in the MobileNet [10, 11], Xception [22] and SqueezeNet [12] papers. The Xception [22] paper popularized the concept of depthwise separable convolution, which can reduce convolution parameters. As discussed before, depthwise separable convolution reduces network parameters by offloading more computation to cheaper 1x1 convolutions. The SqueezeNet [12] paper makes heavy use of 1x1 filters to reduce channels in the network, keeping the parameter count low. The MobileNet [10, 11] papers build on this idea by performing depthwise separable convolution in all of the residual blocks in the network. Another related approach that has motivated this work is introduced in the ShuffleNet [2] paper, where the outputs of filter channels are shuffled in order to alleviate the drawbacks of grouped convolution. The shuffled grouped convolution operation used by ShuffleNet can be seen visually in Figure 2.6. It is important to note that although the channel shuffling operation in ShuffleNet seems stochastic, it is actually deterministic; hence the shuffling operation introduces no stochasticity into the network. The current project builds on the ideas of depthwise convolution and channel shuffling introduced in the MobileNet [10, 11] and ShuffleNet [2] papers, respectively.
Chapter 4
Problem Statement and Motivation

The primary objective of this project is to expand the capacity of a CNN by performing pseudo-width scaling without introducing any new parameters into the network. We also want to change the convolution operation in CNNs, bringing in mathematically sounder operations, especially in the channel information aggregation step. The exact task we tackle in this paper is to attempt to improve upon the accuracy of state-of-the-art image classification CNNs by permuting convolution filter channels to generate additional constrained filters. The intuition is that permuted convolutions can extract more complex patterns from input images than typical convolution operations, by permitting sharing of extracted information across filters. Recent state-of-the-art CNNs have large parameter counts, mostly due to a large number of convolution filters. There has been an ever-increasing interest in making neural models work on mobile phones and edge devices. Since these devices are resource-constrained, in terms of both memory and computation, neural model compression is of fundamental importance for making large CNNs work on such devices. If we can scale up a small network using permuted convolutions, it might achieve accuracy similar to that of a larger network; in that case, we have achieved model compression of the larger network with the help of permuted convolutions. Hence, model compression served as a guiding motive when we came up with the permuted convolution formulation. The other motive behind permuted convolutions is more technical. As we can see in Figure 2.3, depthwise convolution followed by channel summation is equivalent to performing normal convolution.
If we leave the channel summation aside for now and focus only on the depthwise convolution output, we see that depthwise convolution has extracted patterns from the input only across the height and width dimensions. Consider an RGB input to the convolution layer. After depthwise convolution with multiple filters, we have extracted patterns from the R, G and B channels separately with each filter. The next step is to aggregate the information across the channel dimension, and here some interesting opportunities come up. Considering that each channel of each filter was convolved depthwise with the input separately, all of the channels in the depthwise convolution output are independent. The question then arises: which convolved R, G and B channels should we aggregate? The simplistic answer is to aggregate the convolved R, G and B channels of each filter. Upon careful thought, however, it becomes clear that we can aggregate any convolved R, G and B channels without requiring that they come from the same filter. Effectively, we should be able to aggregate convolved R, G and B channels from separate filters, since, as established before, all of the channels in the depthwise convolution output are independent. Hence, there need not be any constraint that we only aggregate the convolved channels of each individual filter; we can aggregate convolved channels across different filters. This deduction serves as the intuition behind permuted convolutions and PermNet. The motive for WeightedNet is to counter the mathematical simplicity of the channel aggregation phase of convolution. As is evident from Figure 2.1 and Figure 2.3, normal convolution is equivalent to performing depthwise convolution followed by channel summation over the depthwise-convolved output. While the intuition behind depthwise convolution is to extract patterns across the height and width dimensions, the exact intuition behind channel summation is unclear.
While we would like to aggregate patterns across the channel dimension in order to account for correlations in patterns among channels, the summation operation seems to be applied for its simplicity. By summing information across channels, we introduce the bias that the pattern extracted from each channel is equally important. This assumption, however, is not supported by any observation or evidence. When training neural networks, it is better to let the network figure out such operations through backpropagation rather than imposing strict rules on it. This serves as our intuition for replacing the channel summation operation with learnable weights, creating weighted convolutions. While weighted convolution may sound similar to depthwise separable convolution, there are key differences between them that we discuss in later sections.

Chapter 5
Methodology

As discussed in the motivation section, there is an opportunity to modify the channel summation procedure in normal convolution. We propose multiple networks to achieve this. The WeightedNet network introduces weighted summation across channels in convolution. Then, building on the idea of permuted convolutions, we propose PermNet, along with several PermNet variants that attempt to address certain disadvantages of the base PermNet architecture.

5.0.1 WeightedNet

The normal convolution operation can be divided into two steps. The first step is the extraction of patterns across the height and width dimensions. The second step is channel summation across the depth dimension. WeightedNet replaces this second step with a weighted summation across the channel dimension instead of direct summation. As discussed in the motivation section, direct summation of filter channels is used in normal convolution for simplicity and faster computation.
However, the direct summation operation introduces our own bias that each convolved output channel of a filter is equally important for extracting useful patterns, and the basis for this assumption is mathematical simplicity; it has not been shown to be true. We believe that instead of introducing such bias ourselves, we can let the network figure out what the weight of each channel should be. Hence, taking inspiration from depthwise separable convolution, we introduce weighted convolutions.

We implement weighted convolution using depthwise convolution followed by a grouped 1x1 convolution, with the number of groups equal to the number of FxF (F > 1) filters present in the layer. The task of the grouped 1x1 convolution is to perform weighted summation across the channel dimension for each filter separately and concatenate the results. The grouped 1x1 convolution has learnable parameters, and the motivation behind using it is purely ease of implementation and better performance with machine learning libraries, such as PyTorch, compared to other possible approaches. While the inspiration for weighted convolutions comes from depthwise separable convolution, the two differ quite a bit in their actual formulation. In its most common form, depthwise separable convolution uses only one FxF filter, with F > 1, and most of the responsibility for generating output channels falls on multiple 1x1 separable convolutions. In weighted convolution, there can be multiple FxF filters, with F > 1, and most of the computational responsibility still lies with the FxF filters. Also, instead of using multiple 1x1 filters, we use only one 1x1 convolution that is divided into groups, with the number of groups equal to the number of FxF filters in the layer. In its most general form, however, depthwise separable convolution can have multiple FxF filters, with F > 1, in which case multiple 1x1 separable filters are used to generate the final output.
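The weighted summation itself can be sketched compactly in numpy. This is an illustrative sketch of ours, not the PyTorch grouped-1x1 implementation used in our experiments: each of the N filters gets one learnable weight per channel, and setting all weights to one recovers normal convolution.

```python
import numpy as np

def depthwise_all(x, filters):
    """Depthwise-convolve input (C, H, W) with every filter in (N, C, f, f).
    Returns (N, C, H', W'): the per-channel cross-correlations, not yet summed."""
    N, C, f, _ = filters.shape
    _, H, W = x.shape
    out = np.zeros((N, C, H - f + 1, W - f + 1))
    for n in range(N):
        for c in range(C):
            for i in range(out.shape[2]):
                for j in range(out.shape[3]):
                    out[n, c, i, j] = np.sum(x[c, i:i + f, j:j + f] * filters[n, c])
    return out

def weighted_conv(x, filters, channel_weights):
    """Weighted convolution sketch: replace each filter's plain channel
    summation with a weighted sum. channel_weights has shape (N, C)."""
    dw = depthwise_all(x, filters)                     # (N, C, H', W')
    return np.einsum('nchw,nc->nhw', dw, channel_weights)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 6, 6))
filters = rng.standard_normal((4, 3, 3, 3))
ones = np.ones((4, 3))
normal = weighted_conv(x, filters, ones)   # all-ones weights == normal convolution
```

In the network these per-channel weights are learned by backpropagation rather than fixed at one, which is exactly the bias-removal argument made above.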
While many similarities exist between such a depthwise separable convolution and weighted convolution, the primary difference remains: a weighted convolution network uses a grouped 1x1 convolution to generate the output channels, while depthwise separable convolution uses normal 1x1 convolution filters. The intuitive difference between the two is that depthwise separable convolution tries to generate additional linear filters from a base filter, whereas weighted convolution tries to perform a weighted summation across the channel dimension. Hence the primary aim of depthwise separable convolution is to reduce the number of parameters of the network, while for weighted convolution the aim is to perform better pattern aggregation across the channel dimension.

Figure 5.1: An example of weighted convolution in WeightedNet.

5.0.2 PermNet

We propose the PermNet architecture, which uses permuted convolution layers in place of typical convolution layers. We discuss the idea behind permuted convolutions as well as the various architectures that implement it.

Permuted convolution

Consider an input image with R, G and B channels, passed to a convolution layer with multiple FxF (F > 1) filters. Let us break down the convolution operation here. The first phase is to extract patterns across the height and width dimensions of the image, which we can achieve using depthwise convolution. We now have convolved R, G and B channels coming from each of the filters in the layer, and the question arises whether we can aggregate these channels in a novel manner. We know that each channel of each filter was convolved independently during the depthwise convolution. We can therefore aggregate the R channel output of one filter with the B and G channel outputs of other filters, since all of these channels were convolved separately.
Hence, there is no reason for us to only aggregate the convolved R, G and B channels of each filter with themselves; we can mix-and-match the convolved R, G and B channels across filters. When we aggregate outputs of different channels of different filters, we generate a permutation of aggregated channels, and many such permutations are possible in a convolution layer. Each new permutation that we generate effectively simulates a new filter whose channels come from existing filters. We call this new filter a constrained filter, since it is constrained so that all of its channels come from existing filters in the convolution layer, which we call base filters. Hence constrained filters do not have their own weights and do not increase the parameter count. However, since we created new filters out of the base filters and can utilize them in the network, we have effectively scaled the network width. Since this is a simulated width scaling, we call it pseudo-width scaling. Stochastic PermNet Stochastic PermNets are the variants of PermNet that have stochasticity in their architecture. We use random permutations with certain constraints to implement permuted convolutions in these variants. The problem with such architectures is that at inference time we would sample a permutation again, which may not be reflective of all the permutations learned by the network; the network learned during training is effectively the average of all the permutations explored. This inference-time permutation selection problem introduced by stochastic PermNet is the primary reason why we search for deterministic architectures next. The stochastic variants attempt to generate new permutations iteratively, and the final network is the aggregate of all the permutations explored. We will now discuss the three variants of stochastic PermNet.
• PermIterWeightedNet: The PermIterWeightedNet generates a newly permuted constrained filter at every iteration during training. Similarly, during the evaluation phase, it generates a constrained filter for each testing batch. A permuted constrained filter is generated in three steps. The first step is to perform depthwise convolution on input x with all n filters used in the convolution layer. Let its output be depthwise(x); this extracts information across the height and width dimensions. The second step is to perform a random permutation on the depthwise convolution output depthwise(x). This random permutation follows some constraints that are discussed in the subsection below. Let the permuted output be permuted(x). The third step is to combine channels from permuted(x) using 1x1 filters, hence extracting information along the channel dimension. The 1x1 convolutions here are grouped; grouped convolutions were first introduced in the AlexNet paper [3]. The number of groups is equal to the number of channels of input x. The output of the 1x1 convolution is the output of the permuted convolution layer. – Constraints for permuted convolutions: The constraint imposed on random permutations is necessary to make sure the order of channels in a generated constrained filter matches that of an original convolution filter. More precisely, the ith channel of a generated constrained filter should be obtained from the ith channel of one of the original convolution filters. – Permutation explosion: The total number of permutations possible in a convolution layer with n convolution filters, each of depth d, equals n + k·n(n−1)/2, where k is the number of channels permuted. As one can see, this figure is of the order O(n²), compared to O(n) for a typical CNN convolution layer. As PermIterWeightedNet becomes deeper, the number of total permutations possible in the network grows rapidly.
This problem arises because the total number of permutations of PermIterWeightedNet is the product of the number of permutations of each convolution layer in its architecture. In a typical ResNet-18 block, where there are two convolution layers with 64 filters each, if we permute all channels of each filter the number of permutations of that residual block crosses 16 million. And since there are four such residual blocks in ResNet-18, the total number of permutations possible in ResNet-18 is over 65 billion. For ResNet-50, the total number of permutations is unimaginably high. There is no way we could train such large networks while exhausting every possible permutation. – BoundedPermNet: Bounded permuted convolutions. To counter the permutation explosion problem, an upper bound needs to be set on the total number of permutations performed in PermIterWeightedNet. There are multiple ways to achieve this, discussed below. – Bound on channels permuted: In this approach only a specific number of channels are permuted. This approach consumes no additional memory; however, the channels that are permuted might be treated by the network differently, as either more important or less important to learn. Understanding the effects of such bounded permutation is part of the future work. – Bound on total permutations: In this approach, all the channels of the convolution filters are permuted, but an upper limit is set on the total number of permutations allowed for each layer. Upon crossing the limit, the network regenerates permutations that have already been explored. Additional memory is required to store the permutations explored before hitting the upper limit; the memory required is directly proportional to the sum of the number of permutations allowed in each layer. Implementing this approach is part of the future work. Figure 5.2: PermIterWeightedNet: Permuted convolution visualized. The convolution filters are depthwise convolved with the input.
The output channels of depthwise convolution are then randomly shuffled with constraints. The figure showcases one such shuffling scenario with shuffled channel numbers. The final output is obtained by performing 1x1 group convolution with the number of groups equal to the number of convolution filters used during depthwise convolution. • PermIterNet: The PermIterNet has an architecture similar to PermIterWeightedNet, except for one minor change: PermIterNet uses direct summation across filter channels instead of the weighted summation via 1x1 convolution used in PermIterWeightedNet. The direct summation approach helps reduce parameters further while reducing the complexity of the network. The architecture of PermIterNet is shown in the figure below. Figure 5.3: PermIterNet: Permuted convolution visualized. The convolution filters are depthwise convolved with the input. The output channels of depthwise convolution are then randomly shuffled with constraints. The figure showcases one such shuffling scenario with shuffled channel numbers. The final output is obtained by summing up the channels for each filter in the layer. • PermShuffleNet: This approach is a simple variation of PermNet. In this approach no additional constrained filters are generated; instead of permuting individual channels of filters, entire filters are permuted. Hence, at every iteration, a different arrangement of the original filters is obtained. This approach is different from ShuffleNet [2], since the ShuffleNet architecture uses group convolutions and rearranges specific channels across groups after the convolution operation, while PermShuffleNet shuffles all channels randomly without any restrictions and is not restricted to the context of group convolutions. The BoundedPermShuffleNet architecture follows similar logic to BoundedPermNet and shuffles only a specific number of filters to generate a new arrangement of filters.
Figure 5.4: PermShuffleNet: Shuffled convolution visualized. The filters are convolved with the input in typical convolution fashion. The convolved output channels are then randomly shuffled to obtain the final output of the shuffled convolution layer. The figure showcases one such shuffling scenario with shuffled channel numbers. Deterministic PermNet As seen with the stochastic PermNet architectures, performing permutations with the help of a random variable can lead to difficult inference-time permutation choices. The stochasticity was introduced so that we could generate new permutations iteratively. However, if we can find ways to generate permutations without introducing stochasticity into the network, we can create deterministic networks that still benefit from permuted convolutions. We will now discuss three different deterministic architectures that we experimented with for implementing permuted convolutions. • PermDeterWeightedNet: PermDeterWeightedNet is a deterministic PermNet architecture. For this architecture, we generate some random permutations initially but then fix them. To improve the learning capability of the network, we still keep our base filters, but simply add additional constrained filters to the network. These additional filters are kept fixed throughout the training of the network. The number of constrained filters to generate is a hyperparameter; for our experiments, we keep the number of constrained filters equal to the number of base filters. Hence, we have increased the width to 2x the initial width with the help of fixed permutations. The next step after generating the immutable permutations is to aggregate the channels of each filter. For this architecture, we go with the weighted summation approach for filter channels using 1x1 grouped convolution, taking inspiration from WeightedNet.
However, since we have increased the width of our network, the number of output channels after the 1x1 grouped convolution would be more than the number of output channels obtained with the baseline network. To keep the comparison fair, we should have the same number of output channels in both the baseline and our competing network. To reduce the number of output channels, pointwise convolutions are used, a popular way in the community to achieve channel reduction. The convolution layer of PermDeterWeightedNet can be visualized in Figure 5.5. Figure 5.5: PermDeterWeightedNet: Deterministic immutable permuted convolutions, with weighted summation across filter channels. • PermDeterNet: PermDeterNet is another variant of PermDeterWeightedNet. The only difference is that we replace the weighted summation of filter channels in PermDeterWeightedNet with direct summation. PermDeterNet is just an extreme case of PermDeterWeightedNet, where all of the weights of the 1x1 group convolution are equal to 1. The other modules of the architecture are the same as in PermDeterWeightedNet. Figure 5.6: PermDeterNet: Deterministic immutable permuted convolutions, with typical summation across filter channels. • PermAutoMultiNet: The deterministic architectures discussed so far fix the permutations during training, with the permutations chosen randomly at initialization. However, choosing random permutations introduces the bias on our part that the chosen permutations will perform well. In PermAutoMultiNet, we let the network figure out which permutations it wants to use, and we make this process completely deterministic with the help of constraints imposed on the network. We first perform depthwise convolution with multiple filters and extract patterns. Next, we use 1x1 convolutions with a sparsity constraint. Intuitively, each 1x1 convolution we use selects a subset of channels from the depthwise convolved channels; these selected channels generate a permutation.
The important thing to ensure here is that the 1x1 convolution weights are sparse. We impose the sparsity constraint with the help of an l1 loss applied to the weights of the 1x1 convolutions. Hence, with the help of learnable sparse 1x1 convolutions, we can let the network learn what permutations it wants to use. A visual representation of generating such deterministic permutations is shown in Figure 5.7. Figure 5.7: PermAutoMultiNet: Deterministic permuted convolution with learnable permutations. The sparsity constraint is imposed on multiple 1x1 convolutions with the help of the l1 loss. • PermAutoNet: This architecture is another variant of PermAutoMultiNet. A problem with PermAutoMultiNet is that, because the 1x1 convolutions are applied to a large depthwise convolved output tensor, the number of parameters increases drastically. For context, the reduced ResNet-18 architecture we discuss in the experiments has 714,260 parameters, while PermAutoMulti ResNet-18 has a parameter count of 9,102,644. This is a more than 12x increase in parameter count simply due to multiple 1x1 convolutions being applied to the depthwise convolved output. We can try to reduce this parameter count by reducing the number of 1x1 convolutions convolved with the depthwise convolution output; however, then we do not get the exact number of output channels we desire. To counter this, we came up with using only one 1x1 convolution, dividing it into a number of groups equal to the number of output channels desired. In all of the networks we experiment with, the desired number of output channels after the 1x1 convolutions equals the number of FxF (F > 1) filters used. This works to our advantage, since now we are simply performing weighted summation across filter channels for each filter, albeit with sparseness.
While we cannot make a theoretical case for such a convolution, we would still like to experiment with it, since it reduces the parameter count of PermAutoMultiNet significantly. Figure 5.8: PermAutoNet: Deterministic permuted convolution with learnable permutations. The sparsity constraint is imposed on the grouped 1x1 convolutions with the help of the l1 loss. Chapter 6 Experiments and Analysis In this chapter we provide the comparison between different PermNets and baseline models. 6.1 Dataset We focus on the task of image classification and choose CIFAR-100 [21] as our primary dataset for experimentation. CIFAR-100 has 50k training and 10k testing images, distributed among 100 common object classes. 6.2 Training & System Details We train multiple baseline networks such as ResNet-18, Reduced ResNet-18, SmallCNN and LeNet as we explore the effect of permuted convolutions on different architectures. We implemented each model using PyTorch and trained them on a GPU cluster. We train our models for 2000 epochs, unless they converge faster, in which case we use early stopping. The learning rates vary depending on the model; the most common learning rate is 1e-3. We use the cross entropy loss for calculating classification error. We use the SGD optimizer with momentum 0.9 and weight decay of 5e-4. For some experiments, we also use the Adam optimizer without any weight decay. We also use a learning rate scheduler that reduces the learning rate by a factor of 0.1 when it identifies a plateau in the loss within 10 iterations. We use a batch size of 250 for the experiments. For PermAutoNet, the loss includes a regularization constant of 5e-3 multiplied by the l1 loss calculated on the weights of the pointwise convolutions. The GPU cluster we train on contains 4 Nvidia Quadro RTX 6000 GPUs, each with 24 GB of memory; however, each model was trained on a single GPU in the cluster.
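The training loss just described, including the l1 sparsity penalty on pointwise convolution weights used for PermAutoNet, can be sketched as follows. The model and data loader are placeholders, and applying the penalty to every 1x1 convolution is our simplifying assumption:

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, l1_weight=5e-3):
    """One training epoch: cross entropy loss plus an l1 sparsity penalty
    on the weights of every 1x1 (pointwise) convolution in the model."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        # l1 penalty encourages sparse, permutation-selecting 1x1 filters.
        for m in model.modules():
            if isinstance(m, nn.Conv2d) and m.kernel_size == (1, 1):
                loss = loss + l1_weight * m.weight.abs().sum()
        loss.backward()
        optimizer.step()

# Optimizer and scheduler as described in the text: SGD with momentum 0.9
# and weight decay 5e-4, with ReduceLROnPlateau dropping the LR by 0.1.
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
#                             momentum=0.9, weight_decay=5e-4)
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
#     optimizer, factor=0.1, patience=10)
```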
We make use of CometML as our experiment logging and tracking tool. 6.3 Baseline Model 6.3.1 ResNet-18 We make use of the ResNet-18 architecture. The model contains an initial convolution, 4 layers and a linear layer at the end for classification. Each layer contains two residual blocks, and each block consists of two convolution layers, both of which are followed by batch normalization layers. The residual skip connection originates at the beginning of the block and ends at the end of the block. The number of channels in the first layer is 32, and the number of channels doubles in each consecutive layer; hence, the fourth layer has 256 channels in its blocks. 6.3.2 Reduced ResNet-18 (RResNet-18) We reduce the width of the ResNet-18 architecture for faster training. Similar to ResNet-18, the model contains an initial convolution, 4 layers and a linear layer at the end for classification. Each layer contains two residual blocks, and each block consists of two convolution layers, both followed by batch normalization layers. The residual skip connection originates at the beginning of the block and ends at the end of the block. The number of channels in the first layer is 16, and the number of channels doubles in each consecutive layer; hence, the fourth layer has 128 channels in its blocks. We treat Reduced ResNet-18 as the baseline and, for simplicity, will call it RResNet-18 throughout the paper. 6.3.3 LeNet-5 The architecture of LeNet-5 is the same as the one proposed in the original LeNet [23] paper, with the only difference being that we change the final linear layer to output predictions for 100 classes, since we are using the CIFAR-100 dataset. 6.3.4 SmallCNN SmallCNN is a simple CNN made of convolution, maxpooling and linear layers. The architecture starts with two pairs of convolution and pooling layers. Finally, we have a linear layer for prediction of class probabilities.
The first and second convolution layers have 5 and 10 filters respectively, with kernel size 3, stride 1 and same padding. 6.4 PermNet implementation & running time The PermNets have been implemented in PyTorch. We made a few changes to depthwise convolution with multiple filters once we figured out that PyTorch performs channel expansion during depthwise convolution. We want the convolved output channels from each filter to be in the same order as the filters themselves, and the channel ordering produced by PyTorch goes against what we are trying to achieve in code. Hence, to fix that issue, we perform a rearrangement of channels. This additional operation increases the training time of the network, especially if the number of channels to be rearranged is large. PermNets that use weighted channel summation do not use a bias for their FxF (F > 1) convolutions; however, PermNets using typical channel summation do use a bias for all convolutions. PermNets generally take more time to train than baseline networks due to the additional operations introduced. The non-deterministic PermNets with weighted summation take almost 3-5x the training time of the baseline networks. The deterministic PermNets are a bit faster and take about 2x the training time of the baseline. 6.5 Results In this section, we provide the quantitative comparison between various PermNet variants and baseline architectures. We list the fields that identify the differences between architectures and their training procedures: learning rate (LR), optimizer used, learning rate scheduler (LRS) usage, batch-normalization usage, permutation type and top-1 accuracy achieved. For simplicity, in the permutation field, we use square brackets to describe the number of channels we permute in each master layer, with the master layers separated by commas.
On the other hand, for PermDeterNet and PermDeterWeightedNet, we provide the number of additional constrained filters we generate for each master layer in curly brackets, with the master layers separated by commas. Inside a master layer, the permutation is applied in the same manner for all of the convolution layers inside it. Table 6.1 showcases our empirical results.

Model | LR | Optimizer | LRS | Batchnorm | Permutation | Accuracy
RResNet-18 | 0.001 | SGD | Y | Y | None | 64.38
Weighted RResNet-18 | 0.001 | SGD | Y | Y | None | 63.7
PermIterWeighted RResNet-18 | 0.001 | SGD | Y | Y | [0,1,1,1] | 58.79
PermIterWeighted RResNet-18 | 0.001 | SGD | Y | Y | [0,2,2,2] | 60.36
PermIterWeighted RResNet-18 | 0.001 | SGD | Y | Y | [0,5,5,5] | 59
PermIterWeighted RResNet-18 | 0.001 | SGD | Y | Y | [0,10,10,10] | 59.62
PermDeter RResNet-18 | 0.001 | SGD | Y | Y | {16,32,64,128} | 56.38
PermDeter RResNet-18 | 0.0001 | SGD | Y | Y | {16,32,64,128} | 46.16
PermDeter RResNet-18 | 0.001 | SGD | Y | N | {16,32,64,128} | 54.31
PermDeter RResNet-18 | 0.001 | Adam | Y | Y | {16,32,64,128} | 61.43
PermDeterWeighted RResNet-18 | 0.001 | SGD | Y | Y | {16,32,64,128} | 57.41
PermAuto RResNet-18 | 0.001 | SGD | Y | Y | None | 57.14
PermAutoMulti RResNet-18 | 0.001 | SGD | Y | Y | None | 66.01
LeNet-5 | 0.001 | SGD | Y | Y | None | 41.7
LeNet-5 | 0.001 | SGD | N | Y | None | 38.85
Weighted LeNet-5 | 0.001 | SGD | Y | Y | None | 41.82
PermDeter LeNet-5 | 0.001 | SGD | Y | Y | {6,16} | 42.86
PermAuto LeNet-5 | 0.001 | SGD | Y | Y | None | 41.71
SmallCNN | 0.001 | SGD | Y | Y | None | 36.99
Weighted SmallCNN | 0.001 | SGD | Y | Y | None | 34.25
PermDeter SmallCNN | 0.001 | SGD | Y | Y | {5,10} | 35.56
PermAuto SmallCNN | 0.001 | SGD | Y | Y | None | 34.14
ResNet-18 | 0.001 | SGD | Y | Y | None | 94.02
PermShuffle ResNet-18 | 0.001 | SGD | Y | Y | [0,64,64,0] | 94.09
PermShuffle ResNet-18 | 0.001 | SGD | Y | Y | [64,64,64,64] | 89.88
PermShuffle ResNet-18 | 0.001 | SGD | Y | Y | [5,5,5,5] | 93.94
PermShuffle ResNet-18 | 0.001 | SGD | Y | Y | [0,100,100,0] | 92.24
PermShuffle ResNet-18 (transfer) | 0.001 | SGD | Y | Y | [5,5,5,5] | 93.5

Table 6.1: Performance of different models on the CIFAR-100 dataset. The metric used for comparison is top-1 class prediction accuracy.
Chapter 7 Findings and Recommendations Upon analysis of the results, we can try to infer some general patterns in permuted convolutions. The intuitive expectation that having more permutations and training longer helps does not necessarily hold in the experiments. While the reason behind this is unclear, there are some general trends and findings in the experiments that can lead to a better training paradigm. • Enabling permutations or shuffling in the initial convolution layers can have an adverse effect on training. More experimentation is needed to visualize what features a PermNet learns in the initial layers when permutations are turned on. • Intuitively, batch normalization [24] can have an adverse effect on PermNet training; however, the results on PermDeter RResNet-18 indicate otherwise. • PermNets and WeightedNets generally take longer to train than their CNN counterparts due to the additional constrained permutation or tensor rearrangement operations. The memory footprint is also larger due to the creation of an intermediate depthwise convolution output in each layer. PermShuffleNet is the only exception in this bunch in regard to memory usage; however, its training time is slightly longer than that of the baseline network due to the shuffling operation. • Permuting more filters in the PermIterNet and PermShuffleNet variants does not improve performance. In fact, there is no general trend that could help us understand how the number of permutations considered affects the performance of the network. • Some PermNets, such as PermAutoMulti RResNet-18, achieve better accuracy than the baseline architectures. However, due to the tremendous increase in the number of network parameters from the multiple 1x1 convolutions applied to the depthwise convolved output, the gains in accuracy are deceiving and such networks are not practical.
• PermShuffleNet achieves a very slight improvement over the baseline networks; however, the accuracy gain is insignificant compared to the additional training time the network requires. There is also no generalized pattern regarding the number of filters we permute. Hence, identifying a network that actually performs better than the baseline requires large hyperparameter search grids, which are impractical. • PermAutoNet achieves accuracy equivalent to the baseline for some of the networks, but not for others. This is a surprising result that warrants further study; more hyperparameter tuning is needed. • From the results, we conclude that none of the PermNets we explore are significantly better than the baseline. Hence we must try to find better PermNet architectures or utilize permuted convolutions in a fundamentally different manner. We discuss one such recommendation in the future work chapter. Chapter 8 Future Work In this chapter, we go over a few suggestions that we have for future work on permuted convolutions. We inferred from the results that the PermNets we have explored do not give a satisfactory increase in accuracy compared to the baseline. One of the reasons can be that the additional generated filters fail to deliver on the promise of pseudo-width scaling. However, the idea of using permuted convolutions to generate additional filters may still be useful if we decide to flip the problem. With the current architectures, the additional filters are unable to help the network learn better features. To alleviate this problem, we can take help from an expert network. With this intuition, we propose the following architecture as future work. 8.1 PermImitatorNet Consider a network with width w for its convolution layers. This network has been trained optimally on the task at hand and hence is the expert network. Now consider a pre-trained version of such a network that will be utilized to come up with an optimal PermNet.
The task here is to perform model compression using width reduction. The PermNet we consider has smaller width than the expert network. We know that most CNNs are overparameterized, so there is good scope for us to come up with ways to counter that inefficiency. In this case, consider a PermNet that has half the width (w/2) of the expert network. Let's call this new PermNet PermImitatorNet, since its task is to imitate the expert network. PermImitatorNet is simply a PermDeterNet variant trained with an additional loss function. This new loss is computed between the weights of the additional constrained filters in PermImitatorNet and the extra w/2 filters in the expert network. By providing such an additional loss to the network, we force the additional constrained filters of the PermNet to learn patterns similar to those of the extra filters in the expert network. Hence, we attempt to learn similar patterns to the expert network, but with half the number of parameters, thanks to the use of permuted convolutions. Figure 8.1: PermImitatorNet: The PermDeterNet attempts to imitate the expert network. This is achieved with the help of a loss function added to the network that computes the weight differences between the additional constrained filters in the PermNet and the extra filters in the pre-trained baseline network. PermImitatorNet would require extensive hyperparameter tuning and an appropriate number of constrained filters for it to perform much better than similarly parameterized networks. Since such PermImitatorNets could possibly lead to extreme network compression, this area deserves more research effort. Chapter 9 Conclusion In this project, we explored how the channel summation process in typical convolution introduces an implicit bias that seems counterintuitive.
We tried to counter this bias with the help of the weighted summation process introduced in the WeightedNet architecture. We then explored how the channel summation process in typical convolution can be generalized to permuted convolutions, which attempt to facilitate information exchange among filters. We discussed how PermNets need to overcome both theoretical and implementation challenges in order to be practical; to this end, we discussed both stochastic and deterministic variants of PermNet. We performed extensive experimentation to compare PermNets with their typical CNN counterparts and presented our findings regarding their performance. We discussed how PermNets struggle to beat baseline architectures, and even when they do beat the baselines, the parameter cost or increased training time makes them less viable for real-world usage in their current form. We then proposed flipping the problem upside down by focusing on achieving network compression with PermNets instead of attempting to achieve accuracy gains with the variants. To this end, we suggested the extensive study of a new variant of PermNet called PermImitatorNet. References [1] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. PMLR, 09–15 Jun 2019. [2] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices, 2017, arXiv:1707.01083. [3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.
[4] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. [5] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. [6] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015. [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015, arXiv:1512.03385. [8] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. [9] Gao Huang, Zhuang Liu, Geoff Pleiss, Laurens Van Der Maaten, and Kilian Weinberger. Convolutional networks with dense connectivity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. [10] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications, 2017, arXiv:1704.04861. [11] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks, 2019, arXiv:1801.04381. [12] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, 2016, arXiv:1602.07360. [13] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network.
In Advances in Neural Information Processing Systems (NIPS), pages 1135–1143, 2015. [14] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representations (ICLR), 2016. [15] Frederick Tung and Greg Mori. CLIP-Q: Deep network compression learning by in-parallel pruning-quantization. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7873–7882, 2018. [16] Tianzhe Wang, Kuan Wang, Han Cai, Ji Lin, Zhijian Liu, Hanrui Wang, Yujun Lin, and Song Han. APQ: Joint search for network architecture, pruning and quantization policy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. [17] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution, 2017, arXiv:1707.02921. [18] Hao Dong, Akara Supratak, Luo Mai, Fangde Liu, Axel Oehmichen, Simiao Yu, and Yike Guo. TensorLayer: A versatile library for efficient deep learning development. ACM Multimedia, 2017. [19] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In The European Conference on Computer Vision Workshops (ECCVW), September 2018. [20] W. Yifan, F. Perazzi, B. McWilliams, A. Sorkine-Hornung, O. Sorkine-Hornung, and C. Schroers. A fully progressive approach to single-image super-resolution. In CVPR Workshops, June 2018. [21] Alex Krizhevsky. Learning multiple layers of features from tiny images. University of Toronto, 2012. [22] François Chollet. Xception: Deep learning with depthwise separable convolutions, 2017, arXiv:1610.02357. [23] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
[24] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.