Softmax regression
Classification by minimising cross-entropy loss
Srinandan Dasmahapatra
Classification: discrete output
Minimise deviation of prediction from annotation
• Given a training set represented by points labelled green and red (the variable Y records the label), where each point has two features:
  𝒟 := {((x_1^(n), x_2^(n)), y^(n)) | n = 1, …, N}
• Task: find a function f(x_1^(n), x_2^(n)) = ŷ^(n) that reproduces the given labels.
[Figure: the labelled points plotted in the (X1, X2) plane; the two features x_1^(n) and x_2^(n) measure Red-ness and Green-ness, and the labels are the one-hot vectors (1, 0) for red and (0, 1) for green.]
Analogy with seeing in colour
Opsins (photopigments) in cones respond to colour preferentially
• Collect the features of point n in a column vector x^(n) = (x_1^(n), x_2^(n), …, x_d^(n))⊤.
• One weight vector per colour channel: w_red := (w_red,1, w_red,2, …, w_red,d) and w_green := (w_green,1, w_green,2, …, w_green,d), giving the pair of responses (σ̂(w_red⊤ x^(n)), σ̂(w_green⊤ x^(n))).
• Find the equation of the decision boundary: f(x_1, x_2; w) = w_0 + w_1 x_1 + w_2 x_2.
• Assign a probability to each point being green/red: σ(f) = 1 / (1 + exp(−f)) (see the sketch below).
• Learning = adjusting the weights w_i until the predictions agree with the data.
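A minimal sketch in Python (NumPy) of the sigmoid and the linear decision function above; the helper names and the weight and feature values are made up for illustration.

```python
import numpy as np

def sigmoid(f):
    """Squash a score f into (0, 1): sigma(f) = 1 / (1 + exp(-f))."""
    return 1.0 / (1.0 + np.exp(-f))

def decision_function(x1, x2, w):
    """Linear score f(x1, x2; w) = w0 + w1*x1 + w2*x2."""
    w0, w1, w2 = w
    return w0 + w1 * x1 + w2 * x2

# Illustrative (made-up) weights and a single point with two features.
w = np.array([-1.0, 2.0, 0.5])
x1, x2 = 0.3, 0.8

f = decision_function(x1, x2, w)
p_green = sigmoid(f)       # probability the point is green
p_red = 1.0 - p_green      # two classes, so the probabilities sum to one
print(f, p_green, p_red)
```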
Constructing a scale for comparing predictions with training labels
• Output f(x^(n); w_i) ≜ f_i^(n), with i = red/green and C = 2 classes.
• Let ŷ_i^(n) = σ(f_i^(n)) for i = red/green; each 0 ≤ σ(f_i^(n)) ≤ 1 is a probability, with ∑_{i=1}^{C} σ(f_i^(n)) = 1.
• The labels are one-hot: y^(n) = (1, 0) or (0, 1).
• Evaluation of classification: cost(y^(n), ŷ^(n)).
• Compare two different costs for a one-component (scalar) output: the quadratic cost (y^(n) − ŷ^(n))^2 and the logarithmic cost −y^(n) ln ŷ^(n) (a numerical sketch follows below).
• The logarithm penalises mistakes more, and also has a sharper drop (a large gradient to guide the weights to lower loss).
[Figure: the two costs plotted as functions of f^(n).]
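A short sketch (with made-up predicted probabilities) contrasting the quadratic and logarithmic costs for a true label y = 1, illustrating how the logarithm penalises confident mistakes much more.

```python
import numpy as np

y = 1.0                                      # true label (scalar output)
y_hat = np.array([0.9, 0.5, 0.1, 0.01])      # predictions, from good to bad

quadratic = (y - y_hat) ** 2                 # (y - y_hat)^2
logarithmic = -y * np.log(y_hat)             # -y * ln(y_hat)

for p, q, l in zip(y_hat, quadratic, logarithmic):
    print(f"y_hat={p:5.2f}  quadratic={q:6.3f}  log={l:6.3f}")
# The log cost blows up as y_hat -> 0, giving a much larger gradient than the quadratic cost.
```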
Multiclass classification
CIFAR-10: example dataset for multi-class classification.
Input: images of dimension 32 × 32 × 3; output: one-hot encodings of dimension 10.
  cat ↦ e_4 = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)⊤
  ship ↦ e_9 = (0, 0, 0, 0, 0, 0, 0, 0, 1, 0)⊤
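A small sketch of one-hot label encoding; the helper one_hot is illustrative and uses the slide's 1-indexed convention (cat ↦ e_4, ship ↦ e_9), shifted to 0-based array indexing inside the function.

```python
import numpy as np

def one_hot(class_index, num_classes=10):
    """Return the standard basis vector e_{class_index} (1-indexed) of length num_classes."""
    e = np.zeros(num_classes)
    e[class_index - 1] = 1.0   # shift to 0-based array indexing
    return e

e_cat = one_hot(4)    # cat  -> e_4
e_ship = one_hot(9)   # ship -> e_9
print(e_cat)          # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
print(e_ship)         # [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
```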
• C classes ⟹ C different weight vectors w_i.
• For each input vector (say a representation of an image or sound file), produce an output on the (C−1)-dimensional surface embedded in C-dimensional Euclidean space.
• C = 2: the outputs (x, y) lie on the segment x ≥ 0, y ≥ 0, x + y = 1, with corners (1, 0) and (0, 1).
• C = 3: the corners are e_1 = (1, 0, 0)⊤, e_2 = (0, 1, 0)⊤, e_3 = (0, 0, 1)⊤, and the prediction is the C-dimensional vector (σ̂(f(x; w_1)), σ̂(f(x; w_2)), σ̂(f(x; w_3)))⊤.
• The hat on σ̂ indicates normalisation: the entries add up to one, σ̂(f(x; w_1)) + σ̂(f(x; w_2)) + σ̂(f(x; w_3)) = 1.
• Cost for input x^(n) = a measure of mismatch between the C-dimensional prediction σ̂(f(x^(n); w_i)) and the true label e_i.
Multiclass classification
Weight vectors for each class
• w_c = (w_0^c, w_1^c, …, w_d^c), c = 1, …, C.
• C: number of classes (10 for CIFAR-10).
• d: dimensionality of the data, x = (x_1, x_2, …, x_d).
• f(x; w_c) = w_0^c · 1 + w_1^c x_1 + ⋯ + w_d^c x_d.
• Stack the C weight vectors (e.g. w_plane, w_bird, …, w_truck) as the rows of a C × (d+1) matrix and multiply it by the augmented input (1, x_1, x_2, …, x_d)⊤: for each input data point, this computes the output for all classes at once.
• The normalised prediction is the vector (σ̂(f(x; w_plane)), σ̂(f(x; w_bird)), …, σ̂(f(x; w_truck)))⊤.
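A sketch of the C × (d+1) weight matrix acting on the augmented input (1, x_1, …, x_d); the shapes match CIFAR-10 but the weight and input values are random placeholders.

```python
import numpy as np

C, d = 10, 3072                      # CIFAR-10: 10 classes, 32*32*3 = 3072 features
rng = np.random.default_rng(0)

W = rng.normal(size=(C, d + 1))      # one row of d+1 weights (bias first) per class
x = rng.normal(size=d)               # one flattened input image (placeholder values)

x_aug = np.concatenate(([1.0], x))   # prepend 1 so the bias w_0 multiplies it
scores = W @ x_aug                   # f(x; w_c) for all C classes at once
print(scores.shape)                  # (10,)
```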
Set up gradient descent of loss for classification
Re-phrasing what has been done
• For each class c, each data point x^(n) is assigned a score s_c^(n) = f(x^(n); w_c), c = 1, …, C.
• Choose the largest of the C scores as the predicted class for x^(n): c* = arg max_{c ∈ {1,…,C}} s_c^(n).
• Replace max by softmax: max(s_1, s_2, s_3) ⟶ softmax(s_1, s_2, s_3) = ln(e^{s_1} + e^{s_2} + e^{s_3}).
• Exponential function: monotonic in its argument (x ↗ ⟹ e^x ↗).
• Normalise the exponentiated scores: s_c^(n) ↦ e^{s_c^(n)} / (e^{s_1^(n)} + e^{s_2^(n)} + ⋯ + e^{s_C^(n)}) =: σ̂(s^(n))_c = [ŷ^(n)]_c.
• Treat component c of [ŷ^(n)]_c = σ̂(s^(n))_c as the probability that x^(n) belongs to class c: P(c | x^(n)).
• For each data point x^(n), sum the costs −∑_{c=1}^{C} y_c^(n) ln ŷ_c^(n) over all classes; this quantity is called the cross-entropy.
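A sketch of the normalised exponential scores (the softmax probabilities); the maximum score is subtracted before exponentiating for numerical stability, which does not change the result because the common factor cancels in the ratio.

```python
import numpy as np

def softmax_probs(scores):
    """Map scores s_c to exp(s_c) / sum_c' exp(s_c'), i.e. the class probabilities [y_hat]_c."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp_s = np.exp(shifted)
    return exp_s / np.sum(exp_s)

s = np.array([2.0, 1.0, 0.1])           # illustrative scores for C = 3 classes
y_hat = softmax_probs(s)
print(y_hat, y_hat.sum())               # probabilities for each class; they sum to 1
print(np.argmax(s))                     # arg max over the scores gives the predicted class
```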
Multi-class loss function: cross entropy
Measures information about label distribution from input data and choice of weights
• Sum the costs over all data points: L(W) := L({w_1, …, w_C}) = −∑_{n=1}^{N} ∑_{c=1}^{C} y_c^(n) ln ŷ_c^(n), called the cross-entropy.
• Example: target y^(n) = (0, 1, 0, 0) and prediction ŷ^(n) = (ŷ_1^(n), ŷ_2^(n), ŷ_3^(n), ŷ_4^(n)) give the cost −(0 · ln ŷ_1^(n) + 1 · ln ŷ_2^(n) + 0 · ln ŷ_3^(n) + 0 · ln ŷ_4^(n)) = −ln ŷ_2^(n).
• With c_n the true class of data point n, L(W) = −ln(ŷ_{c_1}^(1) · ŷ_{c_2}^(2) ⋯ ŷ_{c_N}^(N)) = −∑_{n=1}^{N} ln ŷ_{c_n}^(n): reduce the negative of the log of the predicted probabilities of the correct classes.
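A sketch of the cross-entropy L(W) = −∑_n ∑_c y_c^(n) ln ŷ_c^(n) for one-hot targets; the arrays Y and Y_hat below are illustrative, with shape (N, C).

```python
import numpy as np

def cross_entropy(Y, Y_hat):
    """L = -sum_n sum_c y_c^(n) * ln(y_hat_c^(n)) for one-hot rows of Y."""
    return -np.sum(Y * np.log(Y_hat))

# Two data points, four classes (as in the slide's example with target e_2).
Y = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0]])
Y_hat = np.array([[0.1, 0.7, 0.1, 0.1],
                  [0.6, 0.2, 0.1, 0.1]])

# Only the predicted probability of the true class contributes: -ln(0.7) - ln(0.6).
print(cross_entropy(Y, Y_hat), -np.log(0.7) - np.log(0.6))
```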
Gradient descent on cross-entropy finds optimal weights
• Learning: reduce L(W) by changing the weights.
• Gradient-descent update: w^(t+1) = w^(t) − η ∇_w L(W), where all the weights are contained in w (a small end-to-end sketch follows below).
• For linear maps f, the cross-entropy is convex.
• See the Jupyter notebook.
[Figure: LOSS(w) plotted against w for a convex and for a non-convex loss surface.]
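A minimal gradient-descent loop for softmax regression, using the standard gradient of the cross-entropy with respect to the weights, ∇_W L = ∑_n (ŷ^(n) − y^(n)) x_aug^(n)⊤; the data, learning rate, and iteration count are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, C = 200, 2, 3                       # toy problem: 200 points, 2 features, 3 classes

X = rng.normal(size=(N, d))
labels = rng.integers(0, C, size=N)       # made-up integer class labels
Y = np.eye(C)[labels]                     # one-hot targets, shape (N, C)

X_aug = np.hstack([np.ones((N, 1)), X])   # prepend 1 for the bias weight w_0
W = np.zeros((C, d + 1))                  # one weight vector per class
eta = 0.1                                 # learning rate

for t in range(500):
    scores = X_aug @ W.T                          # s_c^(n) for all n and c
    scores -= scores.max(axis=1, keepdims=True)   # stabilise before exponentiating
    Y_hat = np.exp(scores)
    Y_hat /= Y_hat.sum(axis=1, keepdims=True)     # softmax probabilities

    loss = -np.sum(Y * np.log(Y_hat)) / N         # mean cross-entropy
    grad = (Y_hat - Y).T @ X_aug / N              # gradient of the loss w.r.t. W
    W -= eta * grad                               # w(t+1) = w(t) - eta * gradient

print(loss)
```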
Example: data x; 2-class problem
Compare the probability assignment obtained from max with the one obtained from softmax.
• With s_1 = 5 fixed and s_2 varying, set P(c = 2 | x) = f[0, (s_2 − 5)] / (f[0, (s_2 − 5)] + f[0, −(s_2 − 5)]), and analogously P(c = 1 | x) with numerator f[0, −(s_2 − 5)], where f is either Max or Softmax (a sketch follows below).
[Figure: P(c = 1 | x) and P(c = 2 | x) plotted against s_2 (from 2 to 10) for s_1 = 5, once with f = Max and once with f = Softmax.]
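A sketch of the comparison above; the helper prob_class2 is illustrative, with Softmax taken as softmax(u, v) = ln(e^u + e^v) as defined earlier.

```python
import numpy as np

def prob_class2(s2, s1=5.0, f="softmax"):
    """P(c=2 | x) built from f[0, s2-s1] / (f[0, s2-s1] + f[0, -(s2-s1)])."""
    a = s2 - s1
    if f == "max":
        num, den = max(0.0, a), max(0.0, a) + max(0.0, -a)
    else:  # softmax(u, v) = ln(e^u + e^v)
        num = np.log(np.exp(0.0) + np.exp(a))
        den = num + np.log(np.exp(0.0) + np.exp(-a))
    return num / den if den > 0 else 0.5   # den = 0 only at s2 = s1 for f = "max"

for s2 in [2.0, 4.0, 5.0, 6.0, 8.0]:
    print(s2, prob_class2(s2, f="max"), prob_class2(s2, f="softmax"))
# Max gives a hard 0/1 jump at s2 = 5; softmax changes smoothly between the two classes.
```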
Lab 2