
Approximate Computing for Deep Learning in TensorFlow
Abstract
Nowadays, many machine learning techniques are applied on smart phones to perform tasks such as image classification, audio recognition and object detection, making smart phones even smarter. Since deep learning has achieved the best results in many fields, more and more people want to run deep neural network models on smart phones. However, deep neural network models can be large and require a large amount of computation, which takes too much time and power. Several approximate computing methods have been proposed in recent years to address this problem. The method I use in this paper is the MobileNet model in TensorFlow, which was published by Google this year. I will conduct experiments to show whether MobileNet can decrease model size and increase speed while keeping decent accuracy. I will compare metrics of MobileNet with traditional models such as VGG. I will also examine how the width multiplier and resolution multiplier parameters affect the trade-off between model size, speed and accuracy.
Introduction
Motivation
In recent years, machine learning techniques, especially deep learning, which uses multiple layers of artificial neural networks, have achieved remarkable breakthroughs in many fields. From image classification to game AI (AlphaGo), deep learning has the best performance.
At the same time, more and more people use smart phones. Without doubt, AI techniques such as deep learning will make smart phones even smarter. Functions such as face recognition, audio recognition and image classification will be added to many mobile apps.
The training part of a deep learning model can be done offline on server clusters. For the inference part, we could send the data over the network to a server, which makes the prediction and replies with the result. In some cases, however, the data is sensitive and the client may not wish to send it to servers. One example is a bank card number recognition application. Even without security concerns, network traffic can be slow and expensive, and building reliable servers increases the operation cost.
If we can do prediction on the smart phone itself, then there is no data security concern, no network traffic delay or cost, and no need to maintain a potentially expensive server cluster. But this approach also has drawbacks: it needs to store the model in the smart phone's limited storage, and inference on the mobile device can be slow and drain battery power.
A deep neural network typically has many layers with a large number of parameters. It needs large storage and a large number of math operations. For example, the traditional image classification model VGG has about 100 million parameters, needs hundreds of megabytes to store the model and takes more than 10,000 million multiply-add operations. It is therefore not suitable for a mobile phone.
To use deep learning models in a mobile phone, we must find a way to significantly decrease the model size and the number of computing operations, so that the model file is reasonably small and the computation is fast and consumes less power. In the meantime, we do not want the accuracy to drop too much. We need to find a suitable trade-off between them.
Achieved results
Dissertation outline
Background
Relevant work
SqueezeNet
It decreases the filter size from 3×3 to 1×1 and also uses fewer input channels to the filters to decrease the number of parameters. The SqueezeNet model achieves the same level of accuracy as AlexNet while using 50 times fewer parameters.
Deep Compression
It first prunes the model connections and keeps only the most important ones, reducing the number of parameters by 9-13 times. Then it quantizes the weights so that a weight can be represented with only 5 bits instead of 32 bits. Finally it uses Huffman coding to compress the model further. This method compresses the AlexNet model by 35 times, from 240MB to 6.9MB, increases the speed by 3-4 times and uses 3-7 times less energy.
Flattened Convolutional Neural Networks
It replaces the 3D filters in conventional convolutional networks with consecutive sequences of 1-D filters in all 3 dimensions to reduce the number of parameters. The feedforward computation is about 2 times faster with this approach.
Factorized Convolutional Neural Networks
It factorizes the convolution operation by unravelling the convolutional layer into a sequence of combinations of single-channel convolution and linear channel projection. It achieves similar accuracy with much less computation compared to traditional deep convolutional neural network models.
Quantization
To reduce model size and increase inference speed, we can quantize the weights so that instead of 32-bit floating point numbers we use fewer bits, such as 8.
During the training phase, in each step the parameters of the neural network are adjusted slightly using back propagation and the gradient descent algorithm, which requires a high-precision number format such as 32-bit floating point. So we cannot directly use a compact format for training the model.
But we can quantize an already trained model for inference. Quantization typically does not decrease the inference accuracy of deep networks, because deep networks are often very robust and good at ignoring noise, including the precision error introduced by quantization.
One direct benefit of quantization is a much smaller model file. Because the model file mainly stores the weight parameters, using 8 bits instead of 32 bits requires only about 25% of the storage previously needed. Another benefit is faster inference with less power, because using fewer bits saves memory bandwidth and RAM access time, and more operations can be done per cycle with SIMD instructions.
One simple way to quantize is to store the minimum and maximum values of the set of floating point numbers, and then use an integer to represent each floating point number. For example, suppose we use 8 bits to represent floating point numbers in the range [-20.0, 50.0]. Then 0 represents -20.0, 255 represents 50.0, 128 represents roughly the midpoint 15.1, and so on.
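To make this concrete, here is a minimal NumPy sketch of such a min/max linear quantization scheme (the helper names and the 8-bit choice are illustrative, not TensorFlow's actual quantization tooling):

import numpy as np

def quantize(weights, num_bits=8):
    # Linearly map float weights onto integers in [0, 2**num_bits - 1].
    w_min, w_max = float(weights.min()), float(weights.max())
    levels = 2 ** num_bits - 1
    scale = (w_max - w_min) / levels
    q = np.round((weights - w_min) / scale).astype(np.uint8)
    return q, w_min, scale

def dequantize(q, w_min, scale):
    # Recover approximate float weights from the quantized representation.
    return q.astype(np.float32) * scale + w_min

weights = np.random.uniform(-20.0, 50.0, size=(4, 4)).astype(np.float32)
q, w_min, scale = quantize(weights)
recovered = dequantize(q, w_min, scale)
print(np.max(np.abs(weights - recovered)))  # error bounded by about scale / 2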
Deep learning
Tensorflow
Methods
Depthwise Separable Convolution
A depthwise separable convolution factorizes the conventional convolution into a depthwise convolution followed by a pointwise convolution.
The depthwise convolution is done independently for each channel of the input, where a single filter is applied per channel. The pointwise convolution is the same as a conventional convolution but with kernel size 1×1, which is why it is called pointwise. It linearly combines the features from the depthwise convolution to create new features.
Thus the depthwise separable convolution has the effect of filtering the input channels through the depthwise convolution and then combining features to create new ones through the pointwise convolution. The effect is the same as that of a conventional convolution. The difference is that a conventional convolution achieves this in a single step, whereas the depthwise separable convolution uses two separate steps.
Through the separation of feature filtering and feature combining, depthwise separable convolution reduces the amount of computation tremendously.
Assume the input $F$ has size $W \times H \times M$, where $W$ is the input width, $H$ is the height and $M$ is the number of input channels. Each filter has size $w \times h \times M$ and the number of filters is $N$. With stride 1 and zero padding, the output $G$ of the conventional convolution has size $W \times H \times N$. The elements of $G$ are computed as follows:
$$ G_{x,y,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{x+i-1,\,y+j-1,\,m} $$
where $K$ holds the filter weights. It takes $O(WHMNwh)$ operations.
For the depthwise convolution, we use one filter for each input channel. Each filter has size $w \times h$, and the output of the depthwise convolution has size $W \times H \times M$. It is computed as follows:
$$ \hat{G}_{x,y,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{x+i-1,\,y+j-1,\,m} $$
It takes $O(WHMwh)$ operations.
Then the pointwise convolution uses $N$ filters of size $1 \times 1 \times M$; it takes the output of the depthwise convolution and generates an output of size $W \times H \times N$. It takes $O(WHMN)$ operations.
In total, the depthwise separable convolution takes $O(WHMwh + WHMN)$ operations.
The ratio of computation between the depthwise separable convolution and the conventional convolution is $\frac{WHMwh + WHMN}{WHMNwh} = \frac{1}{N} + \frac{1}{wh}$. For a typical convolution, where $w = h = 3$ and $N$ is large, we get about 8 to 9 times speedup.
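As an illustration, the two steps can be expressed with TensorFlow 1.x ops roughly as follows (the shapes, channel counts and variable names are made up for the example; MobileNet additionally applies batch normalization and ReLU after each step):

import tensorflow as tf

# Input: a batch of 32x32 feature maps with M = 16 channels.
x = tf.placeholder(tf.float32, [None, 32, 32, 16])

# Depthwise step: one 3x3 filter per input channel (channel multiplier 1).
depthwise_filter = tf.Variable(tf.truncated_normal([3, 3, 16, 1], stddev=0.1))
dw = tf.nn.depthwise_conv2d(x, depthwise_filter,
                            strides=[1, 1, 1, 1], padding='SAME')

# Pointwise step: N = 32 filters of size 1x1xM combine the filtered channels.
pointwise_filter = tf.Variable(tf.truncated_normal([1, 1, 16, 32], stddev=0.1))
out = tf.nn.conv2d(dw, pointwise_filter,
                   strides=[1, 1, 1, 1], padding='SAME')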
Softmax classifier
Suppose we have $K$ classes. For sample $x_i$, we have computed a score vector $s$ of $K$ elements, where $s_j$ is the class score of sample $x_i$ for class $j$. A larger score indicates that $x_i$ is more likely to belong to that class.
We can convert class scores to class probabilities with the following formula:
$$ p_j = \frac{e^{s_j}}{\sum_{k} e^{s_k}} $$
That is, for each score, take its exponential and then divide by the sum of exponentials to normalize the values to the range 0-1. We want the loss to be small when the predicted probability of the correct class is large, so we take the negative log of $p_{y_i}$, where $y_i$ is the correct class of $x_i$. The loss for sample $x_i$ is as follows:
$$ L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_{k} e^{s_k}}\right) $$
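A small NumPy illustration of the softmax probabilities and loss for one sample (the scores are made-up numbers; subtracting the maximum score is a common trick for numerical stability):

import numpy as np

scores = np.array([2.0, 1.0, -1.0])   # class scores s_j for one sample
correct_class = 0                      # y_i

shifted = scores - np.max(scores)      # improves numerical stability
probs = np.exp(shifted) / np.sum(np.exp(shifted))
loss = -np.log(probs[correct_class])
print(probs, loss)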
Optimization
The training process uses an optimization algorithm to update the parameters so that the total loss is minimized.
The most commonly used optimization algorithm for neural networks is gradient descent.
$$ \theta \leftarrow \theta - \eta \nabla_\theta L $$
Here $\theta$ is the parameter vector, $\nabla_\theta L$ is the gradient and $\eta$ is the learning rate. That is, each parameter is updated in the negative direction of its gradient, and the step size is controlled by the learning rate.
When the training data is huge, for example ImageNet has over 10 million images, computing the gradient of the total loss over the entire data set is costly. In this situation we use mini-batch gradient descent: we sample a small batch of data and use it to compute the gradient and update the parameters. The update rule is the same; the only difference is that each step uses only a small batch of samples, so the computation cost is much smaller, which can lead to much quicker convergence. The batch size is usually less than 1000 and can even be 1.
The learning rate in the gradient descent algorithm is very important. When the learning rate is very small, although the loss is guaranteed to decrease, convergence may be too slow. We can increase the learning rate to speed up learning, but this may lead to overstepping, which makes the loss increase. It is very difficult to set a suitable learning rate: different datasets or network architectures may need different learning rates, and we may need different learning rates for different parameters and different training phases.
There are many algorithms that try to solve this problem and make improvements over the basic gradient descent algorithm.
Overall, there are two approaches: one uses momentum and the other sets the learning rate adaptively for each parameter. Adagrad is such an adaptive algorithm.
$$ C \leftarrow C + (\nabla_\theta L)^2, \qquad \theta \leftarrow \theta - \frac{\eta}{\sqrt{C} + \epsilon} \nabla_\theta L $$
$\epsilon$ is used to avoid division by zero and is set to a very small value such as $10^{-8}$.
The above operations are element-wise for each parameter, so each parameter has its own effective learning rate. Adagrad keeps track of the sum of squared gradients and uses it to adjust the learning rate.
One problem with Adagrad is that the effective learning rate is always decreasing; when it gets close to 0, the algorithm stops learning.
Another algorithm called RMSProp tries to solve this problem.
$$ C \leftarrow \rho C + (1 - \rho)(\nabla_\theta L)^2, \qquad \theta \leftarrow \theta - \frac{\eta}{\sqrt{C} + \epsilon} \nabla_\theta L $$
$\rho$ is the decay rate. RMSProp makes a simple change: $C$ becomes a moving average of the squared gradients instead of the accumulated sum used in Adagrad. Now the effective learning rate is no longer always decreasing.
Momentum
$$ v \leftarrow \mu v - \eta \nabla_\theta L, \qquad \theta \leftarrow \theta + v $$
$\mu$ is another hyperparameter called momentum and $v$ is the velocity. We integrate the previous velocity with the gradient to get the current velocity, and then use the velocity to update $\theta$, which differs from basic gradient descent where the parameters are updated directly using the gradient. This algorithm helps reduce oscillation and speeds up convergence.
Nesterov Momentum
The Nesterov momentum uses the gradient at the look-ahead position instead of the current position and achieves better results than standard momentum.
$$ v \leftarrow \mu v - \eta \nabla_\theta L(\theta + \mu v), \qquad \theta \leftarrow \theta + v $$
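The update rules above can be summarized in a few lines of NumPy (theta and grad_fn stand for the parameter vector and its gradient; the hyperparameter values are only illustrative):

import numpy as np

eta, rho, mu, eps = 0.01, 0.9, 0.9, 1e-8

theta = np.zeros(10)          # parameters
v = np.zeros_like(theta)      # velocity (momentum methods)
cache = np.zeros_like(theta)  # accumulated squared gradients (Adagrad / RMSProp)

def grad_fn(theta):
    # Placeholder gradient of the loss w.r.t. theta; a real model would backpropagate.
    return 2 * theta - 1

g = grad_fn(theta)

# Vanilla gradient descent
theta_gd = theta - eta * g

# Momentum
v = mu * v - eta * g
theta_momentum = theta + v

# Adagrad: per-parameter effective learning rate, cache only grows
cache_adagrad = cache + g ** 2
theta_adagrad = theta - eta * g / (np.sqrt(cache_adagrad) + eps)

# RMSProp: moving average of squared gradients instead of a running sum
cache_rmsprop = rho * cache + (1 - rho) * g ** 2
theta_rmsprop = theta - eta * g / (np.sqrt(cache_rmsprop) + eps)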
Annealing the learning rate
At the start of training we may want a relatively large learning rate so that the loss decreases quickly. In later stages, as the improvement per step gets smaller, we may want to decay the learning rate to avoid overstepping and to fine-tune the parameters. We can decay the learning rate according to a rule, for example multiplying it by 0.9 every epoch, or set the decay manually, for example halving the learning rate when the training loss stops decreasing.
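For example, a simple step-decay schedule that multiplies the rate by 0.9 after every epoch could be written as the following hypothetical helper:

def decayed_learning_rate(initial_rate, epoch, decay=0.9):
    # Multiply the learning rate by `decay` once per completed epoch.
    return initial_rate * (decay ** epoch)

# Starting from 0.1: epoch 0 -> 0.1, epoch 1 -> 0.09, epoch 10 -> about 0.035.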
Network Architecture
The neural network consists of layers of neurons. The first layer is called the input layer and the last layer is called the output layer. The layers between them are called hidden layers.
Each neuron is a computing unit that applies a linear transformation to its inputs followed by an activation function.
The computation can be written as $a = f(w^T x + b)$, where $x$ is the input, $w$ is the weights, $b$ is the bias and $f$ is the activation function.
We typically use a non-linear function as the activation function, because if the activation function were linear, it could be incorporated into the previous linear transformation. There are many different activation functions; the most commonly used are sigmoid, tanh and the rectified linear unit (ReLU).
$$ \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, \qquad \mathrm{ReLU}(x) = \max(0, x) $$
In deep neural network, ReLU is found to have better results than sigmoid and tanh.
Forward Propagation and Backpropagation
Let $a_l$ represent the activation values of layer $l$. For the input layer, the values come directly from the input $x$, so we have $a_1 = x$. We can compute all neurons' values layer by layer, from the input layer to the output layer.
$$ a_{l+1} = f(W_l^T a_l + b_l) $$
From the output layer's values, we can compute the loss, which measures the error between the model's predicted value and the actual target value.
In the training process, we use the gradient descent algorithm to update the parameters to reduce the loss. Backpropagation uses the chain rule to compute the gradients of the loss with respect to all parameters efficiently. It is applied on the computation graph from the final output node backward to all other nodes. At each node, for each of its inputs, the local gradient of the node's output with respect to that input is multiplied by the gradient of the final output with respect to the node's output (received from the later node), and the process then continues at each input node.
Chain Rule
The chain rule is used to compute the derivative of composite functions. For example, if variable $z$ is a function of $y$, which in turn is a function of $x$, then according to the chain rule:
$$ \frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} $$
Example
The following illustrates the forward propagation and backpropagation process of feeding one sample to a neural network that has one hidden layer with ReLU activation and uses the cross-entropy loss.
$W_1, b_1$ and $W_2, b_2$ are the weights and biases of the hidden layer and the output layer respectively. $X$ and $y$ are the sample data and its class label.
Forward propagation:
Compute the affine transform for the hidden layer.
$$ Z = W_1^T X + b_1 $$
Compute the ReLU activation for the hidden layer.
$$ H = \max(0, Z) $$
Compute the affine transform for the output layer, which gives the class scores.
$$ S = W_2^T H + b_2 $$
Convert class scores to probabilities using the softmax function.
$$ P_k = \frac{e^{S_k}}{\sum_j e^{S_j}} $$
Compute the loss.
$$ L = -\log(P_y) $$
Backpropagation:
Compute the gradient of the class scores.
$$ dS_k = \frac{\partial L}{\partial S_k} = P_k - \mathbb{1}(k = y) $$
Compute the gradient of weight $W_2$.
$$ dW_2 = H \, dS^T $$
Compute the gradient of bias $b_2$.
$$ db_2 = dS $$
Backpropagate to the hidden layer.
$$ dH = W_2 \, dS $$
Set the gradient to 0 where $Z \le 0$, because $\frac{d}{dx}\max(0, x) = 1$ if $x > 0$ and $0$ if $x \le 0$.
$$ dZ = dH \odot \mathbb{1}(Z > 0) $$
Compute the gradient of weight $W_1$.
$$ dW_1 = X \, dZ^T $$
Compute the gradient of bias $b_1$.
$$ db_1 = dZ $$
From the above we can see that backpropagation reuses many intermediate results computed during forward propagation. We therefore often save the needed intermediate values during forward propagation to avoid duplicate computation in backpropagation.
Although the above example is for a simple neural network, it easily extends to more complex networks. During forward propagation and backpropagation the computation is local to each layer: each layer only needs to receive the values propagated to it, compute its own values and propagate them onward; it does not need to care how other layers do their computation. Thus different layers and operations can be used as components and combined in many different ways to construct deep and very complex neural networks.
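The whole example can be written out in NumPy for a single sample as a sketch (the layer sizes and the input are made up; the gradient with respect to X is omitted because the input needs no update):

import numpy as np

rng = np.random.RandomState(0)
D, H_dim, K = 4, 5, 3                     # input, hidden and class dimensions
X = rng.randn(D)                          # one sample
y = 1                                     # its class label
W1, b1 = rng.randn(D, H_dim) * 0.01, np.zeros(H_dim)
W2, b2 = rng.randn(H_dim, K) * 0.01, np.zeros(K)

# Forward propagation
Z = W1.T.dot(X) + b1                      # hidden affine transform
H = np.maximum(0, Z)                      # ReLU
S = W2.T.dot(H) + b2                      # class scores
P = np.exp(S - S.max()); P /= P.sum()     # softmax probabilities
loss = -np.log(P[y])                      # cross-entropy loss

# Backpropagation (reusing Z, H, P saved during the forward pass)
dS = P.copy(); dS[y] -= 1                 # gradient of loss w.r.t. scores
dW2 = np.outer(H, dS)
db2 = dS
dH = W2.dot(dS)                           # backpropagate to hidden layer
dZ = dH * (Z > 0)                         # ReLU gate: zero where Z <= 0
dW1 = np.outer(X, dZ)
db1 = dZ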
Fully Connected Layer
In a fully connected layer, every neuron is connected to every neuron in the previous layer. This is the traditional layer type.
Convolutional Layer
For images and other high-dimensional data, a convolutional layer is often preferable to a fully connected layer, because a fully connected layer creates too many connections and thus has far more parameters, which makes it slow to train and easy to overfit. For example, if the input image is 30x30x3, each neuron in the first fully connected hidden layer connects to 30x30x3 = 2700 neurons in the input layer. For such a small image this may not be a problem, but for a larger image such as 300x300x3 there are 270,000 connections for a single neuron, which is difficult to handle. Another problem is that high-dimensional data such as images often has inherent spatial structure, but for the fully connected layer the input is just a vector of pixel values; the relative positions of the pixels have no effect, so the spatial structure information is lost.
To address these problems, the convolutional layer was invented. To suit image data, the neurons in a convolutional layer are laid out in 3 dimensions instead of 1 dimension as in the fully connected layer. The 3 dimensions are called width, height and depth. Each neuron in the convolutional layer connects only to a small region of neurons in the previous layer. The region is small in width and height but spans the full depth. The width and height of the region is called the receptive field or filter size, so the receptive field controls how large the connected region is. In this way the number of connections is reduced dramatically. For example, if the receptive field is 3×3 and the input volume is 300x300x3, then one neuron connects to 3x3x3 = 27 neurons of the previous layer instead of 270,000 in the fully connected layer. Apart from reducing the number of connections, this also helps to learn local features of the image.
To reduce the number of parameters further, the convolutional layer lets the neurons in the same depth slice share the same weights, which form a filter. So for different positions in the image the filter uses the same weights to extract features, which makes the feature extraction translation invariant.
During the forward propagation phase, we slide a window of the size defined by the receptive field over the input volume and compute the dot product of the filter weights and the pixel values in the window to get a single number in the output volume. The dot products at all positions constitute an activation map, and the activation maps of all filters are stacked along the depth dimension to form the full output volume.
In summary, by arranging the neurons in 3D, constraining connections to a local area and sharing weights, the convolutional layer makes better use of spatial information with far fewer parameters.
Pooling layer
The pooling layer can be used to reduce the spatial size of the representation and the number of parameters. It works by sliding a small window over the input volume and using a non-linear function to compute a single number from the values inside the window. The computation is done for each input depth slice independently. The most commonly used function is the max function; others such as average and L2-norm are also used. By reducing multiple values in a local region to one number, the pooling layer extracts more abstract features, helps the model generalize and reduces overfitting. The pooling layer introduces no additional parameters, and it reduces the width and height by a factor of 2 or more while keeping the depth unchanged, so the number of parameters in later layers is reduced. The most often used filter size is 2×2, which results in an output volume 1/4 the size of the input volume. Larger filter sizes are rarely used because they discard too much information and often result in worse performance.
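A minimal NumPy sketch of 2×2 max pooling with stride 2 on a single-channel map (illustrative only, with no padding handling):

import numpy as np

def max_pool_2x2(x):
    # 2x2 max pooling with stride 2 on a single-channel map of even size.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(x))  # 2x2 output, each entry the maximum of a 2x2 window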
Dropout Layer
Dropout is a method to reduce overfitting. In the training stage, we randomly drop neurons and their associated connections, keeping each neuron with probability $p$. This has the effect of sampling from a large number of sub-networks. In the testing stage we do not drop neurons; instead we use the full network but weight each neuron's output by $p$. In this way we approximately compute the average output of all the sub-networks.
By randomly dropping neurons, the dropout technique trains an exponentially large number of sub-networks and uses their average prediction, which is a kind of ensemble learning. It reduces overfitting and also increases the speed of training.
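A sketch of this classic dropout scheme in NumPy (in practice the "inverted dropout" variant, which rescales at training time instead, is also widely used):

import numpy as np

p = 0.5  # probability of keeping a neuron

def dropout_train(h, rng):
    # Training: randomly zero activations; each unit is kept with probability p.
    mask = rng.binomial(1, p, size=h.shape)
    return h * mask

def dropout_test(h):
    # Testing: use all units but scale outputs by p to match the expected value.
    return h * p

rng = np.random.RandomState(0)
h = np.ones(6)
print(dropout_train(h, rng), dropout_test(h))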
Batch Normalization
During neural network training, a change in the parameters of one layer changes the distribution of the inputs to the layers after it. This phenomenon is called internal covariate shift. It is especially pronounced for deep neural networks, where the impact is amplified by the many layers. Adapting to the changing input distribution usually requires a small learning rate, which makes training slow.
To solve this problem, we can transform the inputs of a layer to have mean 0 and variance 1. This transformation is called whitening. To make the computation fast and differentiable (as required by back propagation), we whiten each dimension of the input independently.
$$ \hat{x} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}} $$
Here $x$ is one dimension of the input, which is a scalar.
To avoid changing the layer's representation, we add a linear transformation after the whitening transformation.
$$ y = \gamma \hat{x} + \beta $$
The two transformations together are called batch normalization.
During training, the mean and variance of $x$ are estimated from the mini-batch samples. The population mean and variance are also estimated by taking a moving average of the mini-batch statistics during training. During inference, the fixed population mean and variance are used, so that the output is determined only by the input.
For a layer in the original network:
$$ z = g(Wu + b) $$
We can apply batch normalization in this way:
$$ z = g(\mathrm{BN}(Wu)) $$
The reason to remove the bias $b$ is that its effect is canceled by the $\beta$ parameter in the batch normalization.
In the convolutional layer, an activation map is obtained by applying the same filter to different locations of the previous layer. When we use batch normalization for the convolutional layer, we normalize all activations in an activation map together across the mini-batch. So if the activation map has size $p \times q$ and the batch size is $m$, the normalization is applied over $m \cdot p \cdot q$ values. Just as the activation map shares the same filter weights, we use the same parameters $\gamma$ and $\beta$ for each activation map.
Batch normalization reduces the change in layer input distributions and makes the gradients less sensitive to parameter scales, so a higher learning rate can be used to speed up training.
During training the batch normalization output depends on the whole mini-batch, so the output for one training sample is no longer deterministic. In this way batch normalization has a regularization effect and can reduce the need for other regularization methods such as dropout.
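A NumPy sketch of the training-time batch normalization transform for a fully connected layer's pre-activations (x has shape [batch, features]; the epsilon value is illustrative and the moving averages used at inference time are omitted):

import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch, then scale and shift.
    mean = x.mean(axis=0)                    # per-dimension mini-batch mean
    var = x.var(axis=0)                      # per-dimension mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # whitening transform
    return gamma * x_hat + beta              # learned linear transform

x = np.random.randn(32, 8) * 3.0 + 5.0       # a mini-batch of 32 samples, 8 features
gamma, beta = np.ones(8), np.zeros(8)
y = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # roughly 0 and 1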
Preprocessing
During training, an image is randomly transformed before being fed to the neural network. In this way the network trains on multiple versions of the same image and the effective training set is much larger than the original data set. This helps the model generalize better and reduces overfitting.
Randomly Shift the Image
First pad the image, and then randomly crop the image. In this way, the image will randomly shift in the 4 directions.
Randomly Flip the Image
The image is flipped left to right with probability 0.5.
Randomly adjust the image brightness
This randomly adds a value between -63 and 63 to all RGB components of every pixel.
Randomly change the image contrast
Randomly choose a contrast factor $0.2 \le f \le 1.8$. For each RGB channel, compute the mean and update the corresponding component of each pixel with:
$$ x' = (x - \mathrm{mean}) \cdot f + \mathrm{mean} $$
After the above random transformations, we finally normalize the image data so that it has zero mean and unit variance.
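With TensorFlow 1.x image ops, the preprocessing pipeline described above looks roughly like this (the padding amount and crop size are illustrative and assume a small input image; the other parameter values follow the text):

import tensorflow as tf

def preprocess_for_training(image):
    # image: a [height, width, 3] tensor for one training example.
    image = tf.cast(image, tf.float32)
    # Randomly shift: pad by 4 pixels on each side, then crop back to 32x32.
    image = tf.image.resize_image_with_crop_or_pad(image, 40, 40)
    image = tf.random_crop(image, [32, 32, 3])
    # Randomly flip left to right with probability 0.5.
    image = tf.image.random_flip_left_right(image)
    # Randomly adjust brightness and contrast.
    image = tf.image.random_brightness(image, max_delta=63)
    image = tf.image.random_contrast(image, lower=0.2, upper=1.8)
    # Normalize to zero mean and unit variance.
    return tf.image.per_image_standardization(image)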
Resource and tools
The model training and evaluation is implemented in Python with the TensorFlow 1.0 framework on an Ubuntu Linux system. I use an Amazon Elastic Compute Cloud (EC2) G2 instance, which has NVIDIA GRID K520 GPUs, for model training.
The image classification app on the mobile device is implemented in Android Java with the TensorFlow Mobile library. Currently the TensorFlow Mobile library supports 3 platforms: Android, iOS and Raspberry Pi. The library provides APIs that let a mobile app easily load a pre-trained model and do inference with it.
The Android image classification app is developed with Android Studio, which is the official IDE for Android.
Checkpoint File
During training, we can use the TensorFlow API to periodically save the learned model parameters to binary checkpoint files. In this way the model parameters are backed up, and they can later be restored by loading the checkpoint file.
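For example, with the TensorFlow 1.x Saver API (the paths and save interval here are illustrative):

import tensorflow as tf

w = tf.Variable(tf.zeros([10]), name='w')  # stands in for the model parameters
saver = tf.train.Saver()                   # saves all variables by default

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(10000):
        # ... run one training step here ...
        if step % 1000 == 0:
            saver.save(sess, 'checkpoints/model.ckpt', global_step=step)

# Later, restore the learned parameters instead of re-initializing them.
with tf.Session() as sess:
    saver.restore(sess, 'checkpoints/model.ckpt-9000')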
Model File
The model file is in Protocol Buffers format, which can be saved and loaded in many different languages. So we can save the model file using Python and load it using Java in the Android app.
The Graph object contains all the information about the model graph. The graph consists of nodes; each node stores various information including the node name, the operation (such as "Add" or "Conv2D"), the input nodes and other attributes such as the filter size for "Conv2D".
To make the model suitable for deployment, we can use the freeze_graph.py tool from TensorFlow to combine the graph definition file and the checkpoint file containing the learned parameters into a single model file. The tool achieves this by replacing Variable nodes with Const nodes that contain the parameters; it also removes nodes unnecessary for inference to simplify the graph and decrease the file size.
The resulting model file can then be shipped with the Android app. In the Android app, on startup we first load the model file using the TensorFlow Mobile Java API, and then we can run inference with the loaded model.
Regularization
We often use regularization to reduce overfitting. One way is to add a weight penalty to the loss. The new loss is the sum of the original data loss and the added regularization loss. The regularization parameter $\lambda$ controls the regularization strength: a large $\lambda$ puts more weight on the regularization loss and thus gives stronger regularization, while a small $\lambda$ puts more weight on the data loss and gives weaker regularization. Different datasets or network architectures may require very different values of $\lambda$; there is no simple way to decide a suitable value, and it is usually set through cross validation. By adding a regularization loss that penalizes large weights, we obtain networks with smaller weights.
Small weights mean that a small change of the inputs will not change the output of the network too much, and a few outliers will not matter much, which makes the regularized network less sensitive to noise in the data. On the other hand, a small change of some inputs may cause the output of a network with large weights to change a lot, so large weights let the model easily adapt to all the training data, including the noise.
In summary, regularized networks with small weights tend to be simpler, robust to noise, less likely to overfit and better at generalizing. Unregularized networks with large weights tend to be more complex, learn the noise easily and overfit more readily.
L2 regularization
$$L = \underbrace{ \frac{1}{N} \sum_i L_i }_\text{data loss} + \underbrace{ \frac{1}{2} \lambda \sum_k\sum_l W_{k,l}^2 }_\text{regularization loss}$$
L1 regularization
$$L = \underbrace{ \frac{1}{N} \sum_i L_i }_\text{data loss} + \underbrace{ \lambda \sum_k\sum_l |W_{k,l}| }_\text{regularization loss}$$
The L2 regularization and L1 regularization are similar; both penalize large weights. But they give different weight update terms in the gradient descent algorithm. For L2 regularization, the additional update of each weight $W$ due to the regularization loss is
$$ \lambda W $$
For L1 regularization it is
$$ \lambda \, \mathrm{sign}(W) $$
From the above we can see that the update amount is constant for L1 regularization and proportional to $W$ for L2 regularization. Thus the penalty is much larger for L2 regularization when $|W|$ is large, and much larger for L1 regularization when $|W|$ is small. The effect is that L1 weights are sparse, with a small number of relatively large weights and the others driven to zero, whereas L2 weights are more diffuse. The sparsity of L1 regularization makes it a better choice for feature selection; in other situations L2 regularization is usually found to be better than L1 regularization.
We can also combine these two regularizations, which is called Elastic net regularization.
$$L = \underbrace{ \frac{1}{N} \sum_i L_i }_\text{data loss} + \underbrace{ \sum_k\sum_l \lambda_1 |W_{k,l}| + \frac{1}{2} \lambda_2 W_{k,l}^2 }_\text{regularization loss}$$
Apart from adding a regularization loss, another way to avoid weights with too large a magnitude is max-norm regularization. This method updates the weights as usual with the gradient descent algorithm, and then clips the weights if needed so that each weight vector's norm stays below a preset maximum value.
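The extra gradient terms and the max-norm clipping step can be sketched in NumPy as follows (the lambda, learning rate and maximum-norm values are illustrative):

import numpy as np

lam, max_norm, eta = 1e-4, 3.0, 0.1
W = np.random.randn(256, 10)
data_grad = np.random.randn(*W.shape)   # stands in for the data-loss gradient

# L2 regularization adds lambda * W to the gradient (penalty grows with |W|).
grad_l2 = data_grad + lam * W
# L1 regularization adds a constant-magnitude term lambda * sign(W).
grad_l1 = data_grad + lam * np.sign(W)

# Max-norm regularization: update as usual, then clip each weight vector's norm.
W = W - eta * grad_l2
norms = np.linalg.norm(W, axis=0, keepdims=True)
W = W * np.minimum(1.0, max_norm / norms)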
