MobileNets: Efficient Convolutional Neural Networks for Mobile Vision
Applications
Andrew G. Howard Menglong Zhu Bo Chen Dmitry Kalenichenko
Weijun Wang Tobias Weyand Marco Andreetto Hartwig Adam
Google Inc.
{howarda,menglong,bochen,dkalenichenko,weijunw,weyand,anm,hadam}@google.com
Abstract
We present a class of efficient models called MobileNets
for mobile and embedded vision applications. MobileNets
are based on a streamlined architecture that uses depth-
wise separable convolutions to build lightweight deep
neural networks. We introduce two simple global hyper-
parameters that efficiently trade off between latency and
accuracy. These hyper-parameters allow the model builder
to choose the right sized model for their application based
on the constraints of the problem. We present extensive
experiments on resource and accuracy tradeoffs and show
strong performance compared to other popular models on
ImageNet classification. We then demonstrate the effective-
ness of MobileNets across a wide range of applications and
use cases including object detection, finegrain classifica-
tion, face attributes and large scale geo-localization.
1. Introduction
Convolutional neural networks have become ubiquitous
in computer vision ever since AlexNet [19] popularized
deep convolutional neural networks by winning the Ima-
geNet Challenge: ILSVRC 2012 [24]. The general trend
has been to make deeper and more complicated networks
in order to achieve higher accuracy [27, 31, 29, 8]. How-
ever, these advances to improve accuracy are not necessar-
ily making networks more efficient with respect to size and
speed. In many real world applications such as robotics,
self-driving cars and augmented reality, the recognition tasks
need to be carried out in a timely fashion on a computation-
ally limited platform.
This paper describes an efficient network architecture
and a set of two hyper-parameters in order to build very
small, low latency models that can be easily matched to the
design requirements for mobile and embedded vision ap-
plications. Section 2 reviews prior work in building small
models. Section 3 describes the MobileNet architecture and
two hyper-parameters width multiplier and resolution mul-
tiplier to define smaller and more efficient MobileNets. Sec-
tion 4 describes experiments on ImageNet as well as a variety
of different applications and use cases. Section 5 closes
with a summary and conclusion.
2. Prior Work
There has been rising interest in building small and effi-
cient neural networks in the recent literature, e.g. [16, 34,
12, 36, 22]. Many different approaches can be generally
categorized into either compressing pretrained networks or
training small networks directly. This paper proposes a
class of network architectures that allows a model devel-
oper to specifically choose a small network that matches
the resource restrictions (latency, size) for their application.
MobileNets primarily focus on optimizing for latency but
also yield small networks. Many papers on small networks
focus only on size but do not consider speed.
MobileNets are built primarily from depthwise separable
convolutions initially introduced in [26] and subsequently
used in Inception models [13] to reduce the computation in
the first few layers. Flattened networks [16] build a network
out of fully factorized convolutions and showed the poten-
tial of extremely factorized networks. Independent of this
current paper, Factorized Networks [34] introduces a similar
factorized convolution as well as the use of topological con-
nections. Subsequently, the Xception network [3] demon-
strated how to scale up depthwise separable filters to outperform Inception V3 networks. Another small network is
Squeezenet [12] which uses a bottleneck approach to design
a very small network. Other reduced computation networks
include structured transform networks [28] and deep fried
convnets [37].
A different approach for obtaining small networks is
shrinking, factorizing or compressing pretrained networks.
Compression based on product quantization [36], hashing
arXiv:1704.04861v1 [cs.CV] 17 Apr 2017
Figure 1. MobileNet models can be applied to various recognition tasks for efficient on-device intelligence: object detection, finegrain classification, face attributes and landmark recognition. (Photos by Sharon VanderKaay, Juanedc and HarshLight, CC BY 2.0; Google Doodle by Sarah Harrison.)
[2], and pruning, vector quantization and Huffman coding
[5] have been proposed in the literature. Additionally var-
ious factorizations have been proposed to speed up pre-
trained networks [14, 20]. Another method for training
small networks is distillation [9] which uses a larger net-
work to teach a smaller network. It is complementary to
our approach and is covered in some of our use cases in
section 4. Another emerging approach is low bit networks
[4, 22, 11].
3. MobileNet Architecture
In this section we first describe the core layers that Mo-
bileNet is built on which are depthwise separable filters.
We then describe the MobileNet network structure and con-
clude with descriptions of the two model shrinking hyper-
parameters width multiplier and resolution multiplier.
3.1. Depthwise Separable Convolution
The MobileNet model is based on depthwise separable convolutions, a form of factorized convolution that factorizes a standard convolution into a depthwise convolution and a 1×1 convolution called a pointwise convolution. For MobileNets the depthwise convolution applies a single filter to each input channel. The pointwise convolution then applies a 1×1 convolution to combine the outputs of the depthwise convolution. A standard convolution both filters and combines inputs into a new set of outputs in one step. The depthwise separable convolution splits this into two layers, a separate layer for filtering and a separate layer for combining. This factorization has the effect of drastically reducing computation and model size. Figure 2 shows how a standard convolution 2(a) is factorized into a depthwise convolution 2(b) and a 1×1 pointwise convolution 2(c).
A standard convolutional layer takes as input a D_F × D_F × M feature map F and produces a D_G × D_G × N feature map G, where D_F is the spatial width and height of a square input feature map¹, M is the number of input channels (input depth), D_G is the spatial width and height of a square output feature map and N is the number of output channels (output depth).
The standard convolutional layer is parameterized by a convolution kernel K of size D_K × D_K × M × N, where D_K is the spatial dimension of the kernel, assumed to be square, M is the number of input channels and N is the number of output channels as defined previously.
The output feature map for standard convolution, assuming stride one and padding, is computed as:

G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\,l+j-1,\,m}    (1)
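For concreteness, a minimal NumPy sketch of Eq. (1) is given below (illustrative only, using zero-based indexing and zero padding; this is not the implementation used for the experiments):

import numpy as np

def standard_conv(F, K):
    # F: input feature map of shape (D_F, D_F, M); K: kernel of shape (D_K, D_K, M, N).
    # Stride one with zero padding so the output keeps the input's spatial size, as in Eq. (1).
    D_F, _, M = F.shape
    D_K, _, _, N = K.shape
    pad = D_K // 2
    F_padded = np.pad(F, ((pad, pad), (pad, pad), (0, 0)))
    G = np.zeros((D_F, D_F, N))
    for k in range(D_F):
        for l in range(D_F):
            patch = F_padded[k:k + D_K, l:l + D_K, :]        # D_K x D_K x M window
            # Sum over i, j, m: each output channel n both filters and combines all input channels.
            G[k, l, :] = np.tensordot(patch, K, axes=([0, 1, 2], [0, 1, 2]))
    return G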
Standard convolutions have the computational cost of:

D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F    (2)

where the computational cost depends multiplicatively on the number of input channels M, the number of output channels N, the kernel size D_K × D_K and the feature map size D_F × D_F. MobileNet models address each of these terms and their interactions. First, they use depthwise separable convolutions to break the interaction between the number of output channels and the size of the kernel.
The standard convolution operation has the effect of filtering features based on the convolutional kernels and combining features in order to produce a new representation. The filtering and combination steps can be split into two steps via the use of factorized convolutions called depthwise separable convolutions for a substantial reduction in computational cost.
¹We assume that the output feature map has the same spatial dimensions as the input and both feature maps are square. Our model shrinking results generalize to feature maps with arbitrary sizes and aspect ratios.
Depthwise separable convolutions are made up of two layers: depthwise convolutions and pointwise convolutions. We use depthwise convolutions to apply a single filter per input channel (input depth). Pointwise convolution, a simple 1×1 convolution, is then used to create a linear combination of the outputs of the depthwise layer. MobileNets use both batchnorm and ReLU nonlinearities for both layers.
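As a concrete sketch of this layer structure, the block below uses tf.keras layers (our choice for illustration; this is not the paper's own training code) to chain a 3×3 depthwise convolution and a 1×1 pointwise convolution, each followed by batchnorm and ReLU:

import tensorflow as tf
from tensorflow.keras import layers

def depthwise_separable_block(x, out_channels, stride=1):
    # Depthwise: a single 3x3 filter per input channel, followed by batchnorm and ReLU.
    x = layers.DepthwiseConv2D(3, strides=stride, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Pointwise: a 1x1 convolution that linearly combines the depthwise outputs.
    x = layers.Conv2D(out_channels, 1, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return x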
Depthwise convolution with one filter per input channel (input depth) can be written as:

\hat{G}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{k+i-1,\,l+j-1,\,m}    (3)

where K̂ is the depthwise convolutional kernel of size D_K × D_K × M and the mth filter in K̂ is applied to the mth channel in F to produce the mth channel of the filtered output feature map Ĝ.
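The depthwise step of Eq. (3) can be sketched in the same illustrative NumPy style as before; note that each channel is filtered independently and no cross-channel mixing occurs:

import numpy as np

def depthwise_conv(F, K_hat):
    # F: (D_F, D_F, M) input; K_hat: (D_K, D_K, M) depthwise kernel, one filter per channel.
    D_F, _, M = F.shape
    D_K = K_hat.shape[0]
    pad = D_K // 2
    F_padded = np.pad(F, ((pad, pad), (pad, pad), (0, 0)))
    G_hat = np.zeros((D_F, D_F, M))
    for k in range(D_F):
        for l in range(D_F):
            patch = F_padded[k:k + D_K, l:l + D_K, :]              # D_K x D_K x M window
            G_hat[k, l, :] = np.sum(patch * K_hat, axis=(0, 1))    # per-channel filtering only
    return G_hat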
Depthwise convolution has a computational cost of:

D_K \cdot D_K \cdot M \cdot D_F \cdot D_F    (4)
Depthwise convolution is extremely efficient relative to standard convolution. However, it only filters input channels; it does not combine them to create new features. So an additional layer that computes a linear combination of the outputs of the depthwise convolution via 1×1 convolution is needed in order to generate these new features.
The combination of depthwise convolution and 1×1 (pointwise) convolution is called depthwise separable convolution, which was originally introduced in [26].
Depthwise separable convolutions cost:

D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F    (5)

which is the sum of the depthwise and 1×1 pointwise convolutions.
By expressing convolution as a two step process of filtering and combining we get a reduction in computation of:

\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}

MobileNet uses 3×3 depthwise separable convolutions, which use between 8 and 9 times less computation than standard convolutions at only a small reduction in accuracy, as seen in Section 4.
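Plugging representative numbers into Eqs. (2) and (5) makes the saving concrete; a quick illustrative check (values chosen to match the internal MobileNet layer later used in Table 3):

# Mult-add counts for a typical internal layer: D_K = 3, M = 512, N = 512, D_F = 14.
D_K, M, N, D_F = 3, 512, 512, 14
standard = D_K * D_K * M * N * D_F * D_F                       # Eq. (2): ~462 million
separable = D_K * D_K * M * D_F * D_F + M * N * D_F * D_F      # Eq. (5): ~52 million
print(standard / separable)           # ~8.8x fewer mult-adds
print(1 / (1 / N + 1 / D_K ** 2))     # the same ratio expressed as 1/N + 1/D_K^2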
Additional factorization in the spatial dimensions, such as in [16, 31], does not save much additional computation as very little computation is spent in depthwise convolutions.
(a) Standard Convolution Filters
(b) Depthwise Convolutional Filters
(c) 1×1 Convolutional Filters, called Pointwise Convolution in the context of Depthwise Separable Convolution
Figure 2. The standard convolutional filters in (a) are replaced by
two layers: depthwise convolution in (b) and pointwise convolu-
tion in (c) to build a depthwise separable filter.
3.2. Network Structure and Training
The MobileNet structure is built on depthwise separable
convolutions as mentioned in the previous section except for
the first layer which is a full convolution. By defining the
network in such simple terms we are able to easily explore
network topologies to find a good network. The MobileNet
architecture is defined in Table 1. All layers are followed by
a batchnorm [13] and ReLU nonlinearity with the exception
of the final fully connected layer which has no nonlinearity
and feeds into a softmax layer for classification. Figure 3
contrasts a layer with regular convolutions, batchnorm and
ReLU nonlinearity to the factorized layer with depthwise
convolution, 1 × 1 pointwise convolution as well as batch-
norm and ReLU after each convolutional layer. Down sam-
pling is handled with strided convolution in the depthwise
convolutions as well as in the first layer. A final average
pooling reduces the spatial resolution to 1 before the fully
connected layer. Counting depthwise and pointwise convo-
lutions as separate layers, MobileNet has 28 layers.
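For readers who prefer code to tables, the sketch below writes out the body of Table 1 with tf.keras layers (an illustrative reconstruction under our assumptions, not the released model). One judgment call: Table 1 lists the final depthwise layer as stride 2 while its output stays 7 × 7, so the sketch uses stride 1 there to reproduce the table's spatial sizes.

import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel, stride):
    x = layers.Conv2D(filters, kernel, strides=stride, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def dw_separable(x, filters, stride):
    # Depthwise 3x3 followed by pointwise 1x1, each with batchnorm and ReLU (Figure 3, right).
    x = layers.DepthwiseConv2D(3, strides=stride, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return conv_bn_relu(x, filters, 1, 1)

def mobilenet(num_classes=1000):
    inputs = layers.Input(shape=(224, 224, 3))
    x = conv_bn_relu(inputs, 32, 3, 2)        # Conv / s2, 224 -> 112
    x = dw_separable(x, 64, 1)                # 112 x 112
    x = dw_separable(x, 128, 2)               # -> 56
    x = dw_separable(x, 128, 1)
    x = dw_separable(x, 256, 2)               # -> 28
    x = dw_separable(x, 256, 1)
    x = dw_separable(x, 512, 2)               # -> 14
    for _ in range(5):                        # the 5x repeated 14 x 14 x 512 blocks
        x = dw_separable(x, 512, 1)
    x = dw_separable(x, 1024, 2)              # -> 7
    x = dw_separable(x, 1024, 1)              # listed as s2 in Table 1; see note above
    x = layers.GlobalAveragePooling2D()(x)    # Avg Pool 7 x 7
    outputs = layers.Dense(num_classes, activation='softmax')(x)  # FC + Softmax
    return tf.keras.Model(inputs, outputs)

Counting the depthwise and pointwise convolutions separately, the 13 blocks plus the first convolution and the fully connected layer give the 28 layers mentioned above.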
Figure 3. Left: Standard convolutional layer with batchnorm and ReLU (3×3 Conv, BN, ReLU). Right: Depthwise separable convolution with depthwise and pointwise layers, each followed by batchnorm and ReLU (3×3 Depthwise Conv, BN, ReLU, 1×1 Conv, BN, ReLU).
It is not enough to simply define networks in terms of a small number of Mult-Adds. It is also important to make sure these operations can be efficiently implemented.
For instance, unstructured sparse matrix operations are not typically faster than dense matrix operations until a very high level of sparsity. Our model structure puts nearly all of the computation into dense 1×1 convolutions. This can be implemented with highly optimized general matrix multiply (GEMM) functions. Often convolutions are implemented by a GEMM but require an initial reordering in memory called im2col in order to map them to a GEMM. For instance, this approach is used in the popular Caffe package [15]. 1×1 convolutions do not require this reordering in memory and can be implemented directly with GEMM, which is one of the most optimized numerical linear algebra algorithms. MobileNet spends 95% of its computation time in 1×1 convolutions, which also contain 75% of the parameters, as can be seen in Table 2. Nearly all of the additional parameters are in the fully connected layer.
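The claim that a 1×1 convolution maps directly onto a GEMM, with no im2col reordering, is easy to verify numerically; a small illustrative NumPy check (shapes are arbitrary examples):

import numpy as np

# A 1x1 convolution over a D_F x D_F x M map with an M x N kernel is exactly
# a (D_F*D_F, M) x (M, N) matrix multiply: no im2col reordering is needed.
D_F, M, N = 14, 512, 256
F = np.random.randn(D_F, D_F, M)
W = np.random.randn(M, N)                                 # 1x1 kernel with spatial dims squeezed out
gemm_out = (F.reshape(-1, M) @ W).reshape(D_F, D_F, N)    # a single GEMM call
conv_out = np.einsum('hwm,mn->hwn', F, W)                 # the same op written as a convolution
print(np.allclose(gemm_out, conv_out))                    # True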
MobileNet models were trained in TensorFlow [1] using RMSprop [33] with asynchronous gradient descent similar to Inception V3 [31]. However, contrary to training large models, we use less regularization and data augmentation because small models have less trouble with overfitting. When training MobileNets we do not use side heads or label smoothing, and we additionally reduce the amount of image distortions by limiting the size of small crops that are used in large Inception training [31]. Additionally, we found that it was important to put very little or no weight decay (L2 regularization) on the depthwise filters since there are so few parameters in them. For the ImageNet benchmarks in the next section all models were trained with the same training parameters regardless of the size of the model.
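One way to realize this weight-decay policy is sketched below with tf.keras (our reading of the recipe, not the original asynchronous RMSprop setup; the decay value is an illustrative placeholder): L2 regularization is attached to the standard and pointwise convolution kernels but not to the depthwise filters.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

WEIGHT_DECAY = 4e-5   # illustrative placeholder, not a value taken from the paper

def regularized_pointwise(x, filters):
    # Pointwise/standard conv kernels carry most of the parameters, so they get L2 weight decay.
    return layers.Conv2D(filters, 1, use_bias=False,
                         kernel_regularizer=regularizers.l2(WEIGHT_DECAY))(x)

def unregularized_depthwise(x, stride=1):
    # Depthwise filters hold very few parameters, so little or no weight decay is applied.
    return layers.DepthwiseConv2D(3, strides=stride, padding='same',
                                  use_bias=False, depthwise_regularizer=None)(x)

optimizer = tf.keras.optimizers.RMSprop()   # RMSprop as in the paper; hyper-parameters omitted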
3.3. Width Multiplier: Thinner Models
Although the base MobileNet architecture is already
small and low latency, many times a specific use case or
application may require the model to be smaller and faster.
In order to construct these smaller and less computationally
expensive models we introduce a very simple parameter α called the width multiplier. The role of the width multiplier α is to thin a network uniformly at each layer.
Table 1. MobileNet Body Architecture
Type / Stride        Filter Shape           Input Size
Conv / s2            3 × 3 × 3 × 32         224 × 224 × 3
Conv dw / s1         3 × 3 × 32 dw          112 × 112 × 32
Conv / s1            1 × 1 × 32 × 64        112 × 112 × 32
Conv dw / s2         3 × 3 × 64 dw          112 × 112 × 64
Conv / s1            1 × 1 × 64 × 128       56 × 56 × 64
Conv dw / s1         3 × 3 × 128 dw         56 × 56 × 128
Conv / s1            1 × 1 × 128 × 128      56 × 56 × 128
Conv dw / s2         3 × 3 × 128 dw         56 × 56 × 128
Conv / s1            1 × 1 × 128 × 256      28 × 28 × 128
Conv dw / s1         3 × 3 × 256 dw         28 × 28 × 256
Conv / s1            1 × 1 × 256 × 256      28 × 28 × 256
Conv dw / s2         3 × 3 × 256 dw         28 × 28 × 256
Conv / s1            1 × 1 × 256 × 512      14 × 14 × 256
5× Conv dw / s1      3 × 3 × 512 dw         14 × 14 × 512
   Conv / s1         1 × 1 × 512 × 512      14 × 14 × 512
Conv dw / s2         3 × 3 × 512 dw         14 × 14 × 512
Conv / s1            1 × 1 × 512 × 1024     7 × 7 × 512
Conv dw / s2         3 × 3 × 1024 dw        7 × 7 × 1024
Conv / s1            1 × 1 × 1024 × 1024    7 × 7 × 1024
Avg Pool / s1        Pool 7 × 7             7 × 7 × 1024
FC / s1              1024 × 1000            1 × 1 × 1024
Softmax / s1         Classifier             1 × 1 × 1000
Table 2. Resource Per Layer Type
Type               Mult-Adds    Parameters
Conv 1 × 1         94.86%       74.59%
Conv DW 3 × 3      3.06%        1.06%
Conv 3 × 3         1.19%        0.02%
Fully Connected    0.18%        24.33%
For a given layer and width multiplier α, the number of input channels M becomes αM and the number of output channels N becomes αN.
The computational cost of a depthwise separable convolution with width multiplier α is:

D_K \cdot D_K \cdot \alpha M \cdot D_F \cdot D_F + \alpha M \cdot \alpha N \cdot D_F \cdot D_F    (6)

where α ∈ (0, 1] with typical settings of 1, 0.75, 0.5 and 0.25. α = 1 is the baseline MobileNet and α < 1 are reduced MobileNets. The width multiplier has the effect of reducing computational cost and the number of parameters quadratically by roughly α². The width multiplier can be applied to any model structure to define a new smaller model with a reasonable accuracy, latency and size trade off. It is used to define a new reduced structure that needs to be trained from scratch.
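Eq. (6) is simple enough to evaluate directly; a small illustrative helper (channel counts are rounded to integers, a practical detail the formula glosses over):

def separable_cost(D_K, M, N, D_F, alpha=1.0):
    # Mult-adds of one depthwise separable layer under width multiplier alpha, per Eq. (6).
    M_a, N_a = round(alpha * M), round(alpha * N)
    return D_K * D_K * M_a * D_F * D_F + M_a * N_a * D_F * D_F

# Roughly alpha**2 scaling: for the 14 x 14 x 512 layer of Table 3,
# alpha = 0.75 cuts ~52.3M mult-adds to ~29.6M.
print(separable_cost(3, 512, 512, 14), separable_cost(3, 512, 512, 14, alpha=0.75))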
3.4. Resolution Multiplier: Reduced Representation
The second hyper-parameter to reduce the computational cost of a neural network is a resolution multiplier ρ. We apply this to the input image and the internal representation of every layer is subsequently reduced by the same multiplier. In practice we implicitly set ρ by setting the input resolution.

Table 3. Resource usage for modifications to standard convolution. Note that each row is a cumulative effect adding on top of the previous row. This example is for an internal MobileNet layer with D_K = 3, M = 512, N = 512, D_F = 14.
Layer/Modification          Million Mult-Adds    Million Parameters
Convolution                 462                  2.36
Depthwise Separable Conv    52.3                 0.27
α = 0.75                    29.6                 0.15
ρ = 0.714                   15.1                 0.15
We can now express the computational cost for the core layers of our network as depthwise separable convolutions with width multiplier α and resolution multiplier ρ:

D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F    (7)

where ρ ∈ (0, 1], which is typically set implicitly so that the input resolution of the network is 224, 192, 160 or 128. ρ = 1 is the baseline MobileNet and ρ < 1 are reduced computation MobileNets. The resolution multiplier has the effect of reducing computational cost by ρ².
As an example we can look at a typical layer in MobileNet and see how depthwise separable convolutions, width multiplier and resolution multiplier reduce the cost and parameters. Table 3 shows the computation and number of parameters for a layer as the architecture shrinking methods are sequentially applied to it. The first row shows the Mult-Adds and parameters for a full convolutional layer with an input feature map of size 14 × 14 × 512 and a kernel K of size 3 × 3 × 512 × 512. We will look in detail in the next section at the trade offs between resources and accuracy.
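The rows of Table 3 follow directly from Eqs. (2) and (5)-(7); the short script below reproduces them (an illustrative check; parameter counts ignore biases and batchnorm, and reduced sizes are rounded to integers):

D_K, M, N, D_F = 3, 512, 512, 14

def full_conv(alpha=1.0, rho=1.0):
    m, n, d = round(alpha * M), round(alpha * N), round(rho * D_F)
    return D_K * D_K * m * n * d * d, D_K * D_K * m * n                    # mult-adds, parameters

def separable(alpha=1.0, rho=1.0):
    m, n, d = round(alpha * M), round(alpha * N), round(rho * D_F)
    return D_K * D_K * m * d * d + m * n * d * d, D_K * D_K * m + m * n    # Eq. (7), parameters

print(full_conv())                          # ~462M mult-adds, ~2.36M parameters
print(separable())                          # ~52.3M mult-adds, ~0.27M parameters
print(separable(alpha=0.75))                # ~29.6M mult-adds, ~0.15M parameters
print(separable(alpha=0.75, rho=0.714))     # ~15.1M mult-adds; parameters unchanged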
4. Experiments
In this section we first investigate the effects of depth-
wise convolutions as well as the choice of shrinking by re-
ducing the width of the network rather than the number of
layers. We then show the trade offs of reducing the net-
work based on the two hyper-parameters: width multiplier
and resolution multiplier and compare results to a number
of popular models. We then investigate MobileNets applied
to a number of different applications.
4.1. Model Choices
First we show results for MobileNet with depthwise sep-
arable convolutions compared to a model built with full con-
volutions. In Table 4 we see that using depthwise separa-
ble convolutions compared to full convolutions only reduces
Table 4. Depthwise Separable vs Full Convolution MobileNet
Model              ImageNet Accuracy    Million Mult-Adds    Million Parameters
Conv MobileNet     71.7%                4866                 29.3
MobileNet          70.6%                569                  4.2

Table 5. Narrow vs Shallow MobileNet
Model                ImageNet Accuracy    Million Mult-Adds    Million Parameters
0.75 MobileNet       68.4%                325                  2.6
Shallow MobileNet    65.3%                307                  2.9

Table 6. MobileNet Width Multiplier
Width Multiplier      ImageNet Accuracy    Million Mult-Adds    Million Parameters
1.0 MobileNet-224     70.6%                569                  4.2
0.75 MobileNet-224    68.4%                325                  2.6
0.5 MobileNet-224     63.7%                149                  1.3
0.25 MobileNet-224    50.6%                41                   0.5

Table 7. MobileNet Resolution
Resolution           ImageNet Accuracy    Million Mult-Adds    Million Parameters
1.0 MobileNet-224    70.6%                569                  4.2
1.0 MobileNet-192    69.1%                418                  4.2
1.0 MobileNet-160    67.2%                290                  4.2
1.0 MobileNet-128    64.4%                186                  4.2
accuracy by 1% on ImageNet while saving tremendously on mult-adds and parameters.
We next show results comparing thinner models with width multiplier to shallower models using fewer layers. To make MobileNet shallower, the 5 layers of separable filters with feature size 14 × 14 × 512 in Table 1 are removed. Table 5 shows that at similar computation and number of parameters, making MobileNets thinner is 3% better than making them shallower.
4.2. Model Shrinking Hyperparameters
Table 6 shows the accuracy, computation and size trade
offs of shrinking the MobileNet architecture with the width
multiplier α. Accuracy drops off smoothly until the architecture is made too small at α = 0.25.
Table 7 shows the accuracy, computation and size trade
offs for different resolution multipliers by training Mo-
bileNets with reduced input resolutions. Accuracy drops
off smoothly across resolution.
Figure 4 shows the trade off between ImageNet Accu-
racy and computation for the 16 models made from the
cross product of width multiplier α ∈ {1, 0.75, 0.5, 0.25} and resolutions {224, 192, 160, 128}. Results are log linear with a jump when models get very small at α = 0.25.
Figure 4. This figure shows the trade off between computation
(Mult-Adds) and accuracy on the ImageNet benchmark. Note the
log linear dependence between accuracy and computation.
Figure 5. This figure shows the trade off between the number of
parameters and accuracy on the ImageNet benchmark. The colors
encode input resolutions. The number of parameters does not vary
based on the input resolution.
Figure 5 shows the trade off between ImageNet Ac-
curacy and number of parameters for the 16 models
made from the cross product of width multiplier α ∈ {1, 0.75, 0.5, 0.25} and resolutions {224, 192, 160, 128}.
Table 8 compares full MobileNet to the original
GoogleNet [30] and VGG16 [27]. MobileNet is nearly
as accurate as VGG16 while being 32 times smaller and
27 times less compute intensive. It is more accurate than GoogleNet while being smaller and requiring more than 2.5 times less computation.
Table 9 compares a reduced MobileNet with width multiplier α = 0.5 and reduced resolution 160 × 160 to other popular models. The reduced MobileNet is 4% better than AlexNet [19] while being 45× smaller and using 9.4× less computation than AlexNet. It is also 4% better than Squeezenet [12] at about the same size and with 22× less computation.
Table 8. MobileNet Comparison to Popular Models
Model                ImageNet Accuracy    Million Mult-Adds    Million Parameters
1.0 MobileNet-224    70.6%                569                  4.2
GoogleNet            69.8%                1550                 6.8
VGG 16               71.5%                15300                138

Table 9. Smaller MobileNet Comparison to Popular Models
Model                 ImageNet Accuracy    Million Mult-Adds    Million Parameters
0.50 MobileNet-160    60.2%                76                   1.32
Squeezenet            57.5%                1700                 1.25
AlexNet               57.2%                720                  60
Table 10. MobileNet for Stanford Dogs
Model                 Top-1 Accuracy    Million Mult-Adds    Million Parameters
Inception V3 [18]     84%               5000                 23.2
1.0 MobileNet-224     83.3%             569                  3.3
0.75 MobileNet-224    81.9%             325                  1.9
1.0 MobileNet-192     81.9%             418                  3.3
0.75 MobileNet-192    80.5%             239                  1.9
Table 11. Performance of PlaNet using the MobileNet architec-
ture. Percentages are the fraction of the Im2GPS test dataset that
were localized within a certain distance from the ground truth. The
numbers for the original PlaNet model are based on an updated
version that has an improved architecture and training dataset.
Scale                  Im2GPS [7]    PlaNet [35]    PlaNet MobileNet
Continent (2500 km)    51.9%         77.6%          79.3%
Country (750 km)       35.4%         64.0%          60.3%
Region (200 km)        32.1%         51.1%          45.2%
City (25 km)           21.9%         31.7%          31.7%
Street (1 km)          2.5%          11.0%          11.4%
4.3. Fine Grained Recognition
We train MobileNet for fine grained recognition on the
Stanford Dogs dataset [17]. We extend the approach of [18]
and collect an even larger but noisy training set than [18]
from the web. We use the noisy web data to pretrain a fine
grained dog recognition model and then fine tune the model
on the Stanford Dogs training set. Results on Stanford Dogs
test set are in Table 10. MobileNet can almost achieve the
state of the art results from [18] at greatly reduced compu-
tation and size.
4.4. Large Scale Geolocalization
PlaNet [35] casts the task of determining where on earth
a photo was taken as a classification problem. The approach
divides the earth into a grid of geographic cells that serve as
the target classes and trains a convolutional neural network
on millions of geo-tagged photos. PlaNet has been shown
to successfully localize a large variety of photos and to out-
perform Im2GPS [6, 7] that addresses the same task.
We re-train PlaNet using the MobileNet architecture on
the same data. While the full PlaNet model based on the Inception V3 architecture [31] has 52 million parameters and 5.74 billion mult-adds, the MobileNet model has only 13 million parameters (the usual 3 million for the body and 10 million for the final layer) and 0.58 billion mult-adds.
As shown in Tab. 11, the MobileNet version delivers only
slightly decreased performance compared to PlaNet despite
being much more compact. Moreover, it still outperforms
Im2GPS by a large margin.
4.5. Face Attributes
Another use-case for MobileNet is compressing large
systems with unknown or esoteric training procedures. In
a face attribute classification task, we demonstrate a syner-
gistic relationship between MobileNet and distillation [9],
a knowledge transfer technique for deep networks. We
seek to reduce a large face attribute classifier with 75
million parameters and 1600 million Mult-Adds. The
classifier is trained on a multi-attribute dataset similar to
YFCC100M [32].
We distill a face attribute classifier using the MobileNet
architecture. Distillation [9] works by training the classi-
fier to emulate the outputs of a larger model2 instead of the
ground-truth labels, hence enabling training from large (and
potentially infinite) unlabeled datasets. Marrying the scal-
ability of distillation training and the parsimonious param-
eterization of MobileNet, the end system not only requires
no regularization (e.g. weight-decay and early-stopping),
but also demonstrates enhanced performance. It is evident from Tab. 12 that the MobileNet-based classifier is resilient to aggressive model shrinking: it achieves a similar mean average precision across attributes (mean AP) as the in-house baseline while consuming only 1% of the Mult-Adds.
4.6. Object Detection
MobileNet can also be deployed as an effective base net-
work in modern object detection systems. We report results
for MobileNet trained for object detection on COCO data
based on the recent work that won the 2016 COCO chal-
lenge [10]. In Table 13, MobileNet is compared to VGG
and Inception V2 [13] under both the Faster-RCNN [23] and
SSD [21] frameworks. In our experiments, SSD is evaluated
with 300 input resolution (SSD 300) and Faster-RCNN is
compared with both 300 and 600 input resolution (Faster-
RCNN 300, Faster-RCNN 600). The Faster-RCNN model
evaluates 300 RPN proposal boxes per image. The models
are trained on COCO train+val excluding 8k minival images
2The emulation quality is measured by averaging the per-attribute
cross-entropy over all attributes.
Table 12. Face attribute classification using the MobileNet architecture. Each row corresponds to a different hyper-parameter setting (width multiplier α and image resolution).
Width Multiplier / Resolution    Mean AP    Million Mult-Adds    Million Parameters
1.0 MobileNet-224                88.7%      568                  3.2
0.5 MobileNet-224                88.1%      149                  0.8
0.25 MobileNet-224               87.2%      45                   0.2
1.0 MobileNet-128                88.1%      185                  3.2
0.5 MobileNet-128                87.7%      48                   0.8
0.25 MobileNet-128               86.4%      15                   0.2
Baseline                         86.9%      1600                 7.5
Table 13. COCO object detection results comparison using different frameworks and network architectures. mAP is reported with the COCO primary challenge metric (AP at IoU=0.50:0.05:0.95).
Framework / Resolution    Model           mAP      Billion Mult-Adds    Million Parameters
SSD 300                   deeplab-VGG     21.1%    34.9                 33.1
SSD 300                   Inception V2    22.0%    3.8                  13.7
SSD 300                   MobileNet       19.3%    1.2                  6.8
Faster-RCNN 300           VGG             22.9%    64.3                 138.5
Faster-RCNN 300           Inception V2    15.4%    118.2                13.3
Faster-RCNN 300           MobileNet       16.4%    25.2                 6.1
Faster-RCNN 600           VGG             25.7%    149.6                138.5
Faster-RCNN 600           Inception V2    21.9%    129.6                13.3
Faster-RCNN 600           MobileNet       19.8%    30.5                 6.1
Figure 6. Example object detection results using MobileNet SSD.
and evaluated on minival. For both frameworks, MobileNet
achieves comparable results to other networks with only a
fraction of computational complexity and model size.
4.7. Face Embeddings
The FaceNet model is a state of the art face recognition
model [25]. It builds face embeddings based on the triplet
loss. To build a mobile FaceNet model we use distillation
to train by minimizing the squared differences of the output
Table 14. MobileNet Distilled from FaceNet
Model                 1e-4 Accuracy    Million Mult-Adds    Million Parameters
FaceNet [25]          83%              1600                 7.5
1.0 MobileNet-160     79.4%            286                  4.9
1.0 MobileNet-128     78.3%            185                  5.5
0.75 MobileNet-128    75.2%            166                  3.4
0.75 MobileNet-128    72.5%            108                  3.8
of FaceNet and MobileNet on the training data. Results for very small MobileNet models can be found in Table 14.
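A minimal sketch of this distillation objective, under our assumptions about the setup (the FaceNet teacher and the exact training loop are not specified here): the MobileNet student regresses the frozen teacher's embeddings with a squared-error loss, so no identity labels are required.

import tensorflow as tf

def embedding_distillation_loss(teacher_embeddings, student_embeddings):
    # Squared difference between teacher (FaceNet) and student (MobileNet) outputs,
    # summed over the embedding dimension and averaged over the batch.
    return tf.reduce_mean(
        tf.reduce_sum(tf.square(teacher_embeddings - student_embeddings), axis=-1))

def distillation_step(images, teacher, student, optimizer):
    # The teacher is frozen; only the student MobileNet is updated.
    teacher_out = tf.stop_gradient(teacher(images, training=False))
    with tf.GradientTape() as tape:
        loss = embedding_distillation_loss(teacher_out, student(images, training=True))
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss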
5. Conclusion
We proposed a new model architecture called Mo-
bileNets based on depthwise separable convolutions. We
investigated some of the important design decisions leading
to an efficient model. We then demonstrated how to build
smaller and faster MobileNets using width multiplier and
resolution multiplier by trading off a reasonable amount of
accuracy to reduce size and latency. We then compared dif-
ferent MobileNets to popular models demonstrating supe-
rior size, speed and accuracy characteristics. We concluded
by demonstrating MobileNet’s effectiveness when applied
to a wide variety of tasks. As a next step to help adoption
and exploration of MobileNets, we plan on releasing models in TensorFlow.
References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen,
C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al.
Tensorflow: Large-scale machine learning on heterogeneous
systems, 2015. Software available from tensorflow.org, 1,
2015. 4
[2] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and
Y. Chen. Compressing neural networks with the hashing
trick. CoRR, abs/1504.04788, 2015. 2
[3] F. Chollet. Xception: Deep learning with depthwise separa-
ble convolutions. arXiv preprint arXiv:1610.02357v2, 2016.
1
[4] M. Courbariaux, J.-P. David, and Y. Bengio. Training deep
neural networks with low precision multiplications. arXiv
preprint arXiv:1412.7024, 2014. 2
[5] S. Han, H. Mao, and W. J. Dally. Deep compression: Com-
pressing deep neural network with pruning, trained quantiza-
tion and huffman coding. CoRR, abs/1510.00149, 2, 2015.
2
[6] J. Hays and A. Efros. IM2GPS: estimating geographic in-
formation from a single image. In Proceedings of the IEEE
International Conference on Computer Vision and Pattern
Recognition, 2008. 7
[7] J. Hays and A. Efros. Large-Scale Image Geolocalization.
In J. Choi and G. Friedland, editors, Multimodal Location
Estimation of Videos and Images. Springer, 2014. 6, 7
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn-
ing for image recognition. arXiv preprint arXiv:1512.03385,
2015. 1
[9] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge
in a neural network. arXiv preprint arXiv:1503.02531, 2015.
2, 7
[10] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara,
A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al.
Speed/accuracy trade-offs for modern convolutional object
detectors. arXiv preprint arXiv:1611.10012, 2016. 7
[11] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and
Y. Bengio. Quantized neural networks: Training neural net-
works with low precision weights and activations. arXiv
preprint arXiv:1609.07061, 2016. 2
[12] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J.
Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy
with 50x fewer parameters and <1MB model size. arXiv
preprint arXiv:1602.07360, 2016. 1, 6
[13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167, 2015. 1, 3, 7
[14] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up
convolutional neural networks with low rank expansions.
arXiv preprint arXiv:1405.3866, 2014. 2
[15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir-
shick, S. Guadarrama, and T. Darrell. Caffe: Convolu-
tional architecture for fast feature embedding. arXiv preprint
arXiv:1408.5093, 2014. 4
[16] J. Jin, A. Dundar, and E. Culurciello. Flattened convolutional
neural networks for feedforward acceleration. arXiv preprint
arXiv:1412.5474, 2014. 1, 3
[17] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei.
Novel dataset for fine-grained image categorization. In First
Workshop on Fine-Grained Visual Categorization, IEEE
Conference on Computer Vision and Pattern Recognition,
Colorado Springs, CO, June 2011. 6
[18] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev,
T. Duerig, J. Philbin, and L. Fei-Fei. The unreasonable ef-
fectiveness of noisy data for fine-grained recognition. arXiv
preprint arXiv:1511.06789, 2015. 6
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
classification with deep convolutional neural networks. In
Advances in neural information processing systems, pages
1097–1105, 2012. 1, 6
[20] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and
V. Lempitsky. Speeding-up convolutional neural net-
works using fine-tuned cp-decomposition. arXiv preprint
arXiv:1412.6553, 2014. 2
[21] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed.
Ssd: Single shot multibox detector. arXiv preprint
arXiv:1512.02325, 2015. 7
[22] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-
net: Imagenet classification using binary convolutional neu-
ral networks. arXiv preprint arXiv:1603.05279, 2016. 1, 2
[23] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards
real-time object detection with region proposal networks. In
Advances in neural information processing systems, pages
91–99, 2015. 7
[24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,
et al. Imagenet large scale visual recognition challenge.
International Journal of Computer Vision, 115(3):211–252,
2015. 1
[25] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A uni-
fied embedding for face recognition and clustering. In Pro-
ceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 815–823, 2015. 8
L. Sifre. Rigid-motion scattering for image classification.
Ph.D. thesis, 2014. 1, 3
[27] K. Simonyan and A. Zisserman. Very deep convolutional
networks for large-scale image recognition. arXiv preprint
arXiv:1409.1556, 2014. 1, 6
[28] V. Sindhwani, T. Sainath, and S. Kumar. Structured trans-
forms for small-footprint deep learning. In Advances in
Neural Information Processing Systems, pages 3088–3096,
2015. 1
[29] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4,
inception-resnet and the impact of residual connections on
learning. arXiv preprint arXiv:1602.07261, 2016. 1
[30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.
Going deeper with convolutions. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
pages 1–9, 2015. 6
[31] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna.
Rethinking the inception architecture for computer vision.
arXiv preprint arXiv:1512.00567, 2015. 1, 3, 4, 7
[32] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni,
D. Poland, D. Borth, and L.-J. Li. Yfcc100m: The new
data in multimedia research. Communications of the ACM,
59(2):64–73, 2016. 7
[33] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide
the gradient by a running average of its recent magnitude.
COURSERA: Neural Networks for Machine Learning, 4(2),
2012. 4
[34] M. Wang, B. Liu, and H. Foroosh. Factorized convolutional
neural networks. arXiv preprint arXiv:1608.04337, 2016. 1
[35] T. Weyand, I. Kostrikov, and J. Philbin. PlaNet - Photo Ge-
olocation with Convolutional Neural Networks. In European
Conference on Computer Vision (ECCV), 2016. 6, 7
[36] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized
convolutional neural networks for mobile devices. arXiv
preprint arXiv:1512.06473, 2015. 1
[37] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola,
L. Song, and Z. Wang. Deep fried convnets. In Proceedings
of the IEEE International Conference on Computer Vision,
pages 1476–1483, 2015. 1