DLNN 2020: Assignment 1
Wojtek Kowalczyk
wojtek@liacs.nl
18 February 2020
Introduction
The objective of this assignment is to develop and evaluate several algorithms for classifying
images of handwritten digits. You will work with a simplified version of the famous MNIST
data set: a collection of 2707 digits represented by vectors of 256 numbers that represent 16×16
images. The data is split into a training set (1707 images) and a test set (1000 images). These
data sets are stored in 4 files: train_in.csv, train_out.csv, test_in.csv, test_out.csv,
where in and out refer to the input records (images) and the corresponding digits (class
labels), respectively. These files are stored in data.zip.
You may find more information about the original problem of handwritten digit recognition,
more data sets, and an overview of the accuracies of the best classifiers (about 99.6%!) at
Task 1: Analyze distances between images
The purpose of this task is to develop some intuition about clouds of points in high-dimensional
spaces. In particular, you are supposed to develop a very simple algorithm for
classifying hand-written digits.
Let us start with developing a simple distance-based classifier. For each digit d, d =
0, 1, . . . , 9, let us consider a cloud of points in 256-dimensional space, Cd, which consists of
all training images (vectors) that represent d. Then for each cloud Cd we can calculate its
center, cd, which is just a 256-dimensional vector of means over all coordinates of vectors that
belong to Cd. Once we have these centers, we can easily classify new images: to classify an
image, calculate the distance from the vector that represents this image to each of the 10
centers; the closest center defines the label of the image. But first let us take a closer look at
our data.
For each cloud Cd, d = 0, 1, . . . , 9, calculate its center, cd, and the radius, rd. The
radius of Cd is defined as the biggest distance between the center of Cd and points from Cd.
Additionally, find the number of points that belong to Cd, nd. Clearly, at this stage you are
supposed to work with the training set only.
Next, calculate the distances between the centers of the 10 clouds, distij = dist(ci, cj), for
i, j = 0, 1, . . . 9. Given all these distances, try to say something about the expected accuracy
of your classifier. What pairs of digits seem to be most difficult to separate?
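A minimal sketch of these computations in Python/numpy is given below. It assumes the training files are named train_in.csv and train_out.csv, are comma-separated, and contain no header row; adjust the loading step to your setup if needed.

    import numpy as np

    # Load the training data (assumed: comma-separated, no header row).
    X = np.loadtxt("train_in.csv", delimiter=",")               # shape (1707, 256)
    y = np.loadtxt("train_out.csv", delimiter=",").astype(int)  # shape (1707,)

    centers, radii, counts = [], [], []
    for d in range(10):
        cloud = X[y == d]                            # all training images of digit d
        c = cloud.mean(axis=0)                       # center c_d: mean of each coordinate
        r = np.linalg.norm(cloud - c, axis=1).max()  # radius r_d: largest distance to the center
        centers.append(c)
        radii.append(r)
        counts.append(len(cloud))
    centers = np.array(centers)                      # shape (10, 256)

    # Pairwise distances between the 10 centers, dist_ij = dist(c_i, c_j).
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    print("counts:", counts)
    print("radii: ", np.round(radii, 2))
    print("pairwise center distances:")
    print(np.round(dist, 2))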
Task 2: Implement and evaluate the simplest classifier
Implement the simplest distance-based classifier that is described above. Apply your
classifier to all points from the training set and calculate the percentage of correctly classified
digits. Do the same with the test set, using the centers that were calculated from the training
set. In both cases, generate a confusion matrix which should provide a deeper insight into
classes that are difficult to separate. A confusion matrix is here a 10-by-10 matrix (cij ), where
cij contains the percentage (or count) of digits i that are classified as j. Which digits are most
difficult to classify correctly? For calculating and visualising confusion matrices you may use
the sklearn package. Describe your findings. Compare the performance of your classifier on the
train and test sets. How do the results compare to the observations you made in Task 1?
How would you explain it?
So far we assumed that the distance between two vectors is measured with the help of the Euclidean
distance. However, this is not the only choice. Rerun your code using alternative distance
measures that are implemented in sklearn.metrics.pairwise.pairwise_distances.
Which distance measure provides the best results (on the test set)?
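A minimal sketch of the nearest-center classifier and its evaluation, reusing the centers array from the Task 1 sketch and assuming the same naming scheme for the test files; the metric names passed to pairwise_distances below are just a few of the available options.

    import numpy as np
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics.pairwise import pairwise_distances

    # Test data (assumed file names and format as in Task 1).
    X_test = np.loadtxt("test_in.csv", delimiter=",")
    y_test = np.loadtxt("test_out.csv", delimiter=",").astype(int)

    def classify(X, centers, metric="euclidean"):
        # Distance from every image to every center; the closest center gives the label.
        D = pairwise_distances(X, centers, metric=metric)
        return np.argmin(D, axis=1)

    for metric in ["euclidean", "manhattan", "cosine"]:
        pred = classify(X_test, centers, metric)     # 'centers' as computed in Task 1
        acc = np.mean(pred == y_test)
        print(f"{metric}: accuracy = {acc:.3f}")

    # Confusion matrix (counts): entry (i, j) = number of digits i classified as j.
    print(confusion_matrix(y_test, classify(X_test, centers)))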
Task 3: Implement a multi-class perceptron algorithm
Implement (from scratch) a multi-class perceptron training algorithm (from slide 36, second
lecture) and use it for training a single layer perceptron with 10 nodes (one per digit),
each node having 256+1 inputs and 1 output. Train your network on the train set and evaluate
on both the train and the test set, in the same way as you did in the previous steps.
As your algorithm is non-deterministic (results depend on how you initialize weights), repeat
your experiments a few times to get a feeling of the reliability of your accuracy estimates.
Try to make your code efficient. In particular, try to limit the number of loops, using
matrix multiplication whenever possible. For example, append to your train and test data a
column of ones that will represent the bias. The weights of your network can be stored in a
matrix W of size 257×10. Then the output of the network on all inputs is just a dot product
of two matrices: Train and W, where Train denotes the matrix of all input vectors (one per
row), augmented with 1’s. To find the output node with the strongest activation use the
numpy argmax function. An efficient implementation of your algorithm shouldn’t take more
than a few seconds to converge on the training set (yes, the training set consists of patterns
that are linearly separable so the perceptron algorithm will converge).
Finally, notice that the perceptron algorithm that is presented in Géron’s textbook on
page 287 (Equation 10-3) is slightly different from the algorithm discussed during the lecture
(slide 36). Accordingly, slightly modify your code and rerun your experiments (assume η to
be 1). Compare the results (accuracy, convergence speed, runtime).
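A minimal sketch of a matrix-based training loop along these lines, assuming X_train, y_train, X_test, y_test are loaded as in the earlier sketches. The update shown (add the misclassified pattern to the correct class’s weights and subtract it from the predicted class’s weights) is one common multi-class perceptron rule and may differ in details from the rule on slide 36.

    import numpy as np

    rng = np.random.default_rng()

    # Augment the data with a bias column of ones: shape (n, 257).
    X_aug = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
    W = rng.normal(scale=0.01, size=(257, 10))       # one weight column per digit

    for epoch in range(100):
        pred = np.argmax(X_aug @ W, axis=1)          # strongest activation per pattern
        errors = np.flatnonzero(pred != y_train)
        if errors.size == 0:                         # converged: all training patterns correct
            break
        for i in errors:                             # perceptron-style update on mistakes
            W[:, y_train[i]] += X_aug[i]             # push the correct class up
            W[:, pred[i]]    -= X_aug[i]             # push the wrong prediction down

    # Evaluate on the test set (augmented the same way).
    X_test_aug = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
    test_acc = np.mean(np.argmax(X_test_aug @ W, axis=1) == y_test)
    print(f"training errors left: {errors.size}, test accuracy: {test_acc:.3f}")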
Task 4: Linear Separability
What do you think: is the set of all images of the digit 1 (from the training set) linearly
separable from the set of all images of the digit 7? More generally, are images of the digit
i linearly separable from the images of the digit j, for i, j in {0, 1, . . . , 9}, such that i ≠ j?
To answer these questions study Cover’s Theorem, think about the answers, and then
implement the simple perceptron algorithm and apply it to all pairs of sets of images from
classes i and j. (Hint: don’t run your algorithm for more than a few hundred iterations).
And can you linearly separate the set of all images of i (for i in {0, 1, . . . , 9}) from all
remaining images? (This time you might need to use a few thousand iterations.)
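A minimal sketch of the pairwise check, reusing the data from the Task 1 sketch. It uses the classic two-class perceptron update, and reports two digit classes as separable if the algorithm reaches zero training mistakes within the iteration budget; the budget and the interpretation of "not separable within the budget" are up to you.

    import numpy as np

    def perceptron_separable(A, B, max_iter=500):
        # True if a simple perceptron separates point sets A and B within max_iter epochs.
        X = np.vstack([A, B])
        X = np.hstack([X, np.ones((X.shape[0], 1))])   # bias column
        t = np.concatenate([np.ones(len(A)), -np.ones(len(B))])
        w = np.zeros(X.shape[1])
        for _ in range(max_iter):
            mistakes = 0
            for x, target in zip(X, t):
                if target * (x @ w) <= 0:              # misclassified (or on the boundary)
                    w += target * x                    # classic perceptron update
                    mistakes += 1
            if mistakes == 0:
                return True
        return False

    # Example: are the training images of 1 and 7 linearly separable?
    ones_imgs   = X[y == 1] if False else None         # placeholder; use your Task 1 arrays
    # With X_train, y_train from the earlier sketches:
    print(perceptron_separable(X_train[y_train == 1], X_train[y_train == 7]))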
Task 5: Implement the XOR network and the Gradient Descent Algorithm
This is probably the last time in your life that you are asked to implement a neural network
from scratch – therefore, have fun! Proceed as follows:
1. Implement the function xor_net(x1, x2, weights) that simulates a network with two
inputs, two hidden nodes and one output node. The vector weights denotes 9 weights
(tunable parameters): each non-input node has three incoming weights: one connected
to the bias node that has value 1, and two other connections leading from the
input nodes to a hidden node or from the two hidden nodes to the output node. Assume
that all non-input nodes use the sigmoid activation function.
2. Implement the error function, mse(weights), which returns the mean squared error
made by your network on 4 possible input vectors (0, 0), (0, 1), (1, 0), (1, 1) and the corresponding
targets: 0, 1, 1, 0.
3. Implement the gradient of the mse(weights) function, grdmse(weights). Note that the
vector of values that are returned by grdmse(weights) should have the same length
as the input vector weights: it should be the vector of partial derivatives of the mse
function over each element of the weights vector.
4. Finally, implement the gradient descent algorithm:
(a) Initialize weights to some random values,
(b) Iterate: weights = weights − η · grdmse(weights),
where η is a small positive constant (called “step size” or “learning rate”).
Use your program to train the network on the XOR data. During training, monitor two
values: the MSE made by your network on the training set (it should be dropping), and the
number of misclassified inputs. (The network returns a value between 0 and 1; we may agree
that values bigger than 0.5 are interpreted as “1”, otherwise as “0”.)
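A minimal sketch of the three functions and the training loop follows. It assumes sigmoid activations throughout and a weight layout of [bias, x1, x2] for each hidden node followed by [bias, h1, h2] for the output node; for brevity, grdmse is approximated here with finite differences rather than derived analytically, which you should replace with the exact partial derivatives.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def xor_net(x1, x2, weights):
        # Assumed layout: w[0:3] -> hidden 1, w[3:6] -> hidden 2, w[6:9] -> output node.
        h1 = sigmoid(weights[0] + weights[1] * x1 + weights[2] * x2)
        h2 = sigmoid(weights[3] + weights[4] * x1 + weights[5] * x2)
        return sigmoid(weights[6] + weights[7] * h1 + weights[8] * h2)

    XOR_IN  = [(0, 0), (0, 1), (1, 0), (1, 1)]
    XOR_OUT = [0, 1, 1, 0]

    def mse(weights):
        return np.mean([(xor_net(x1, x2, weights) - t) ** 2
                        for (x1, x2), t in zip(XOR_IN, XOR_OUT)])

    def grdmse(weights, eps=1e-6):
        # Finite-difference approximation of the gradient (one entry per weight).
        grad = np.zeros_like(weights)
        for k in range(len(weights)):
            bumped = weights.copy()
            bumped[k] += eps
            grad[k] = (mse(bumped) - mse(weights)) / eps
        return grad

    # Gradient descent.
    rng = np.random.default_rng()
    weights = rng.normal(scale=1.0, size=9)
    eta = 1.0                                          # learning rate; tune as needed
    for step in range(10000):
        weights -= eta * grdmse(weights)
        if step % 1000 == 0:
            wrong = sum(int((xor_net(x1, x2, weights) > 0.5) != t)
                        for (x1, x2), t in zip(XOR_IN, XOR_OUT))
            print(f"step {step}: mse = {mse(weights):.4f}, misclassified = {wrong}")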
Run your program several times using various initialization strategies and values of the
learning rate. You may experiment with alternative activation functions, e.g., hyperbolic
tangent, tanh, or a linear rectifier, relu(x) = max(0, x).
Additionally, try the “lazy approach”: just keep generating random weights of the
network, testing if it computes the XOR function, and stop as soon as you’ve found such
weights. To get an idea of how many sets of weights should be tried before finding a good one,
repeat your experiment several times. Describe your work and findings in a report.
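A minimal sketch of the lazy approach, reusing xor_net and the XOR data from the previous sketch; the sampling range for the random weights is an arbitrary assumption and is worth varying in your experiments.

    import numpy as np

    def computes_xor(weights):
        # True if thresholding the network output at 0.5 reproduces the XOR truth table.
        return all((xor_net(x1, x2, weights) > 0.5) == bool(t)
                   for (x1, x2), t in zip(XOR_IN, XOR_OUT))

    rng = np.random.default_rng()
    attempts = 0
    while True:
        attempts += 1
        weights = rng.uniform(-10, 10, size=9)         # sampling range is an assumption
        if computes_xor(weights):
            break
    print(f"found a working set of weights after {attempts} attempts")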
Organization
Your submission should consist of a report in the pdf format (at most 10 pages) and neatly
organized code that you’ve used in your experiments (but no data) so we (or others) could
reproduce your results. Submit your work via Blackboard – details of the submission
procedure will be announced soon.
The deadline for this assignment is Sunday, 15th March, 23:59.