Lecturer: Ben Rubinstein
Lecture 2. Statistical Schools of Thought
COMP90051 Statistical Machine Learning
Copyright: University of Melbourne
This lecture
How do learning algorithms come about?
• Frequentist statistics
• Statistical decision theory
• Extremum estimators
• Bayesian statistics
Types of probabilistic models
• Parametric vs. Non-parametric
• Generative vs. Discriminative
2
3
Frequentist Statistics
Wherein unknown model parameters are treated as having fixed but unknown values.
Frequentist statistics
• Abstract problem
∗ Given: X1, X2, …, Xn drawn i.i.d. from some distribution
∗ Want to: identify unknown distribution, or a property of it
• Parametric approach (“parameter estimation”)
∗ Class of models $\{p_\theta(x) : \theta \in \Theta\}$ indexed by parameters $\theta \in \Theta$ (could be a real number, or vector, or ...)
∗ Point estimate $\hat{\theta}(X_1, \ldots, X_n)$, a function (or statistic) of the data
• Examples
∗ Given n coin flips, determine probability of landing heads
∗ Learning a classifier
4
Hat means estimate or estimator
i.i.d. = independent and identically distributed
Estimator Bias
Frequentists seek good behaviour, in ideal conditions
• Bias: $B_\theta(\hat{\theta}) = E_\theta[\hat{\theta}(X_1, \ldots, X_n)] - \theta$
Example: for $i = 1, \ldots, 40$
• $X_{i,1}, \ldots, X_{i,20} \sim p_\theta = \mathrm{Normal}(\theta = 0, \sigma^2 = 1)$
• $\hat{\theta}_i = \frac{1}{20}\sum_{j=1}^{20} X_{i,j}$, the sample mean, plotted per repetition
5
Subscript $\theta$ means the data really comes from $p_\theta$
[Plot: the estimates $\hat{\theta}_1, \ldots, \hat{\theta}_{40}$ and their average against the true $\theta = 0$; the gap converges to the bias (in probability)!]
Estimator Variance
Frequentists seek good behaviour, in ideal conditions
• Variance: $\mathrm{Var}_\theta(\hat{\theta}) = E_\theta[(\hat{\theta} - E_\theta[\hat{\theta}])^2]$
Example cont.
• Plot each $(\hat{\theta}_i - E_\theta[\hat{\theta}_i])^2 = \hat{\theta}_i^2$ per repetition (see the simulation sketch below)
6
$\hat{\theta}$ is still a function of the data
[Plot: the squared deviations $\hat{\theta}_1^2, \ldots, \hat{\theta}_{40}^2$ and their average against the true $\mathrm{Var}_\theta(\hat{\theta}) = \sigma^2/20 = 0.05$; once again, the average converges to the truth (in probability)!]
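The bias and variance estimates in this running example are easy to reproduce numerically. Below is a minimal sketch (my own illustration, assuming NumPy; the seed and constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed for reproducibility
n_reps, n = 40, 20                    # 40 repetitions of 20 samples each
theta, sigma = 0.0, 1.0               # true mean and standard deviation

X = rng.normal(theta, sigma, size=(n_reps, n))
theta_hat = X.mean(axis=1)            # one sample-mean estimate per repetition

# Estimated bias: average of the estimates minus the true theta (should be near 0)
print("estimated bias:", theta_hat.mean() - theta)

# Estimated variance: average squared deviation from E[theta_hat] (= theta here);
# should be near sigma^2 / n = 0.05
print("estimated variance:", np.mean((theta_hat - theta) ** 2))
```

Increasing the number of repetitions makes both averages converge (in probability) to the true bias and variance, as the plots on the slides suggest.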
Asymptotically Well Behaved
For our example estimator (the sample mean), we could calculate its exact bias (zero) and variance ($\sigma^2/n$). Usually we can't guarantee low bias/variance exactly.
Asymptotic properties often hold!
• Consistency: $\hat{\theta}(X_1, \ldots, X_n) \to \theta$ in probability
• Asymptotic efficiency: $\mathrm{Var}_\theta(\hat{\theta}(X_1, \ldots, X_n))$ converges to the smallest possible variance of any estimator of $\theta$
Amazing Cramér-Rao lower bound (outside subject scope):
$\mathrm{Var}_\theta(\hat{\theta}) \geq \frac{1}{I(\theta)}$, with $I(\theta)$ the Fisher information of $p_\theta$, for any unbiased $\hat{\theta}$
7
Bias closer and closer to zero
Variance closer & closer to optimal
Maximum-Likelihood Estimation
• A general principle for designing estimators
• Involves optimisation
• $\hat{\theta}(x_1, \ldots, x_n) \in \operatorname{argmax}_{\theta \in \Theta} \prod_{i=1}^{n} p_\theta(x_i)$
• “The best estimate is one under
which observed data is most likely”
Later: MLE estimators usually well-behaved asymptotically
8
Fisher
Example I: Bernoulli
• Know data comes from Bernoulli distribution with
unknown parameter (e.g., biased coin); find mean
• MLE for mean
∗ $p_\theta(x) = \begin{cases} \theta, & \text{if } x = 1 \\ 1 - \theta, & \text{if } x = 0 \end{cases} = \theta^x (1 - \theta)^{1 - x}$
(note: $p_\theta(x) = 0$ for all other $x$)
∗ Maximising the likelihood yields $\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} X_i$
9
Derivation: https://stats.stackexchange.com/questions/275380/maximum-likelihood-estimation-for-bernoulli-distribution
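For reference, a brief sketch of that derivation (standard calculus; not reproduced on the slide itself): write the log-likelihood, differentiate, and set to zero.

$$\log L(\theta) = \sum_{i=1}^{n} \big[ X_i \log\theta + (1 - X_i)\log(1 - \theta) \big]$$

$$\frac{d}{d\theta}\log L(\theta) = \frac{\sum_{i=1}^{n} X_i}{\theta} - \frac{n - \sum_{i=1}^{n} X_i}{1 - \theta} = 0 \quad\Longrightarrow\quad \hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} X_i$$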
Example II: Normal
• Know data comes from Normal distribution with
variance 1 but unknown mean; find mean
• MLE for mean
∗ $p_\theta(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x - \theta)^2\right)$
∗ Maximising the likelihood yields $\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} X_i$
• Exercise: derive the MLE for variance $\sigma^2$ based on
$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$ with $\theta = (\mu, \sigma^2)$
10
Derivation: https://www.statlect.com/fundamentals-of-statistics/normal-distribution-maximum-likelihood
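As with the Bernoulli case, a short sketch of why the sample mean falls out in the known-variance case (standard calculus, included here for completeness):

$$\log L(\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(X_i - \theta)^2, \qquad \frac{d}{d\theta}\log L(\theta) = \sum_{i=1}^{n}(X_i - \theta) = 0 \;\Longrightarrow\; \hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} X_i$$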
MLE ‘algorithm’
1. Given data $X_1, \ldots, X_n$, define the probability distribution, $p_\theta$, assumed to have generated the data
2. Express the likelihood of the data, $\prod_{i=1}^{n} p_\theta(X_i)$
(usually its logarithm... why?)
3. Optimise to find the best (most likely) parameters $\hat{\theta}$
1. take partial derivatives of the log likelihood w.r.t. $\theta$
2. set to 0 and solve
(failing that, use gradient descent; see the sketch below)
11
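A minimal sketch of this recipe in code, using the Normal-with-known-variance example from the previous slide (illustration only: it assumes NumPy/SciPy, and the numerical optimiser stands in for the "gradient descent" fallback in step 3):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=100)   # pretend the mean 2.0 is unknown

def neg_log_likelihood(theta):
    # Step 2: (negative) log-likelihood of N(theta, 1); additive constants dropped
    return 0.5 * np.sum((data - theta) ** 2)

# Step 3: optimise (numerically here; the closed form is the sample mean)
result = minimize_scalar(neg_log_likelihood)
print("numerical MLE:", result.x)
print("closed-form MLE (sample mean):", data.mean())
```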
Mini Summary
• Frequentist school of thought
• Point estimates
• Quality: bias, variance, consistency, asymptotic efficiency
• Maximum-likelihood estimation (MLE)
Next: Statistical Decision Theory, Extremum estimators
12
13
Statistical Decision Theory
Branch within statistics, optimisation, economics,
control, emphasising utility maximisation.
Decision theory
• Act to maximise utility – connected to
economics and operations research
• Decision rule $\delta(\boldsymbol{x}) \in A$, an action space
∗ E.g. Point estimate $\hat{\theta}(x_1, \ldots, x_n)$
∗ E.g. Out-of-sample prediction $\hat{Y}_{n+1} \,|\, X_1, Y_1, \ldots, X_n, Y_n, X_{n+1}$
• Loss function $l(a, \theta)$: economic cost, error metric
∗ E.g. square loss of estimate $(\hat{\theta} - \theta)^2$
∗ E.g. 0-1 loss of classifier predictions $1[y \neq \hat{y}]$
14
Wald
Risk & Empirical Risk Minimisation (ERM)
• In decision theory, really care about expected loss
• Risk $R_\theta(\delta) = E_{\boldsymbol{X} \sim \theta}\left[l(\delta(\boldsymbol{X}), \theta)\right]$
∗ E.g. true test error
∗ aka generalization error
• Want: Choose $\delta$ to minimise $R_\theta(\delta)$
• Can’t directly! Why?
• ERM: Use training set $\boldsymbol{X}$ to approximate $p_\theta$
∗ Minimise empirical risk $\hat{R}_\theta(\delta) = \frac{1}{n}\sum_{i=1}^{n} l(\delta(X_i), \theta)$ (toy example below)
15
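To make risk vs. empirical risk concrete, here is a small toy illustration of my own (not from the slides): empirical risk minimisation for a one-parameter linear rule under square loss, with the true risk (generalisation error) approximated on a large fresh sample.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy supervised data: Y = 2*X + noise (the coefficient 2 plays the role of theta)
X_train = rng.normal(size=50)
Y_train = 2.0 * X_train + rng.normal(scale=0.5, size=50)

def square_loss(y_pred, y):
    return (y_pred - y) ** 2

# ERM over the class of rules delta_w(x) = w * x: minimise empirical risk in w
# (closed form here: least squares without an intercept)
w_hat = np.sum(X_train * Y_train) / np.sum(X_train ** 2)
empirical_risk = np.mean(square_loss(w_hat * X_train, Y_train))

# Approximate the true risk with a large held-out sample ("true test error")
X_test = rng.normal(size=100_000)
Y_test = 2.0 * X_test + rng.normal(scale=0.5, size=100_000)
approx_risk = np.mean(square_loss(w_hat * X_test, Y_test))

print("empirical risk:", empirical_risk, " approx. true risk:", approx_risk)
```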
Decision theory vs. Bias-variance
We’ve already seen
• Bias: $B_\theta(\hat{\theta}) = E_\theta[\hat{\theta}(X_1, \ldots, X_n)] - \theta$
• Variance: $\mathrm{Var}_\theta(\hat{\theta}) = E_\theta[(\hat{\theta} - E_\theta[\hat{\theta}])^2]$
But are they equally important? How related?
• Bias-variance decomposition of square-loss risk
$E_\theta\left[(\theta - \hat{\theta})^2\right] = [B(\hat{\theta})]^2 + \mathrm{Var}_\theta(\hat{\theta})$
16
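The decomposition follows in one line by adding and subtracting $E_\theta[\hat{\theta}]$ inside the square; the cross term has zero expectation:

$$E_\theta\big[(\theta - \hat{\theta})^2\big] = E_\theta\big[\big((\theta - E_\theta[\hat{\theta}]) + (E_\theta[\hat{\theta}] - \hat{\theta})\big)^2\big] = (\theta - E_\theta[\hat{\theta}])^2 + E_\theta\big[(\hat{\theta} - E_\theta[\hat{\theta}])^2\big] = [B_\theta(\hat{\theta})]^2 + \mathrm{Var}_\theta(\hat{\theta})$$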
17
Extremum estimators
Very general framework that covers elements of
major statistical learning frameworks; enjoys good
asymptotic behaviour in general!!
Extremum estimators
• $\hat{\theta}_n(\boldsymbol{X}) \in \operatorname{argmin}_{\theta \in \Theta} Q_n(\boldsymbol{X}, \theta)$ for any objective $Q_n(\cdot)$
• Generalises bits of all statistical frameworks. W00t!
∗ MLE and ERM seen earlier this lecture; and
∗ MAP seen later in this lecture.
∗ These are all M-estimators, with Q as a sum over data
(i.e. of log-likelihood, loss, or log-likelihood plus log prior)
• And it generalises other frameworks too!
18
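To make the M-estimator bullet concrete, one way to write the three objectives from this lecture (a sketch consistent with the slide's description; the signs are flipped because $\hat{\theta}_n$ is written as an argmin):

$$Q_n^{\mathrm{MLE}}(\boldsymbol{X}, \theta) = -\sum_{i=1}^{n} \log p_\theta(X_i), \qquad Q_n^{\mathrm{ERM}}(\boldsymbol{X}, \theta) = \frac{1}{n}\sum_{i=1}^{n} l(\delta(X_i), \theta), \qquad Q_n^{\mathrm{MAP}}(\boldsymbol{X}, \theta) = -\sum_{i=1}^{n} \log p(X_i|\theta) - \log p(\theta)$$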
Consistency of Extremum Estimators
• Recall consistency: stochastic convergence to 0 bias
• Theorem for extremum estimators: $\hat{\theta}_n \to \theta$ in probability,
if there's a ("limiting") function $Q(\cdot)$ such that:
1. $Q(\cdot)$ is uniquely maximised by $\theta$.
That is, no other parameters make $Q(\cdot)$ as large as $Q(\theta)$.
2. The parameter family $\Theta$ is "compact"
(a generalisation of the familiar "closed" & "bounded" set, like [0,1])
3. $Q(\cdot)$ is a continuous function
4. Uniform convergence: $\sup_{\theta \in \Theta} |Q_n(\theta) - Q(\theta)| \to 0$ in probability.
19
[Figure: objective value vs. parameter $\theta$, showing $Q_n(\theta)$ lying within the band $Q(\theta) \pm \varepsilon$; the maximiser $\hat{\theta}_n$ of $Q_n$ (e.g. the MLE) must lie in $[\theta_L, \theta_U]$.]
A game changer
• Frequentists: estimators that aren't even correct with infinite data (i.e., inconsistent estimators) aren't adequate in practice
• Proving consistency for every new
estimator? Ouch!
• So many estimators are extremum estimators – general guarantees make it much easier (but not easy!) to prove consistency
• Asymptotic normality
∗ Extremum estimators converge to Gaussian in distribution
∗ Asymptotic efficiency: the variance of that limiting Gaussian
• Practical: Confidence intervals – think error bars!!
Frequentists like to have this asymptotic theory for their algorithms
20
Mini Summary
• Decision theory: Utility-based, Minimise risk
• Many familiar learners minimise loss over data (ERM)
• Extremum estimators generalise ERM, MLE, (later: MAP)
∗ Amazingly, consistent: Gives us confidence that they work
(eventually)
∗ Amazingly, asymptotically normal: Helps make confidence
intervals
Next: Last but not least, the Bayesian paradigm
21
22
Bayesian Statistics
Wherein unknown model parameters have
associated distributions reflecting prior belief.
Bayesian statistics
• Probabilities correspond to beliefs
• Parameters
∗ Modeled as r.v.’s having distributions
∗ Prior belief in $\theta$ encoded by prior distribution $P(\theta)$
• Parameters are modeled like r.v.'s (even if not really random)
∗ Thus: data likelihood $P_\theta(X)$ written as conditional $P(X|\theta)$
∗ Rather than a point estimate $\hat{\theta}$, Bayesians update belief $P(\theta)$ with observed data to $P(\theta|X)$, the posterior distribution
23
Laplace
Tools of probabilistic inference
• Bayesian probabilistic inference
∗ Start with prior $P(\theta)$ and likelihood $P(X|\theta)$
∗ Observe data $X = x$
∗ Update prior to posterior $P(\theta|X = x)$
• Primary tools to obtain the posterior
∗ Bayes Rule: reverses order of conditioning
$P(\theta|X = x) = \frac{P(X = x|\theta)\,P(\theta)}{P(X = x)}$
∗ Marginalisation: eliminates unwanted variables
$P(X = x) = \sum_{t} P(X = x, \theta = t)$
24
Bayes
These are general tools of probability and not specific to Bayesian stats/ML
This quantity (the denominator $P(X = x)$) is called the evidence
Example
• We model $X|\theta$ as $N(\theta, 1)$ with prior $N(0, 1)$
• Suppose we observe X=1, then update posterior
$P(\theta|X = 1) = \frac{P(X = 1|\theta)\,P(\theta)}{P(X = 1)} \propto P(X = 1|\theta)\,P(\theta)$
$= \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(1 - \theta)^2}{2}\right)\,\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\theta^2}{2}\right)$
$\propto N(0.5, 0.5)$
NB: allowed to push constants out front and “ignore” as
these get taken care of by normalisation
25
26
$P(\theta|X = 1) = \frac{P(X = 1|\theta)\,P(\theta)}{P(X = 1)} \propto P(X = 1|\theta)\,P(\theta) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(1 - \theta)^2}{2}\right)\,\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\theta^2}{2}\right)$
Name of the game is to get the posterior into a recognisable form: an exp of a quadratic must be a Normal.
• Discard constants w.r.t. $\theta$
• Collect exp's
• Want the leading numerator term to be $\theta^2$, by moving its coefficient to the denominator
• Complete the square in the numerator: move out excess constants
• Factorise
• Recognise as an (unnormalised) Normal!
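For completeness, the steps above worked out (a sketch of the standard completing-the-square calculation that the slide carries out by hand):

$$P(\theta|X = 1) \propto \exp\!\left(-\frac{(1-\theta)^2 + \theta^2}{2}\right) = \exp\!\left(-\frac{2\theta^2 - 2\theta + 1}{2}\right) \propto \exp\!\left(-\big(\theta - \tfrac{1}{2}\big)^2\right) = \exp\!\left(-\frac{(\theta - 1/2)^2}{2 \cdot (1/2)}\right)$$

which is an unnormalised Normal density with mean $0.5$ and variance $0.5$, i.e. $N(0.5, 0.5)$.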
How Bayesians make point estimates
• They don’t, unless forced at gunpoint!
∗ The posterior carries full information, why discard it?
• But, there are common approaches
∗ Posterior mean $E_{\theta|X}[\theta] = \int \theta\, P(\theta|X)\, d\theta$
∗ Posterior mode $\operatorname{argmax}_\theta P(\theta|X)$ (max a posteriori or MAP)
∗ There are Bayesian decision-theoretic interpretations of these
27
[Figure: posterior density $P(\theta|X)$ over $\theta$, with its mode marked as $\theta_{\mathrm{MAP}}$.]
MLE in Bayesian context
• MLE formulation: find parameters that best fit data
$\hat{\theta} \in \operatorname{argmax}_\theta P(X = x|\theta)$
• Consider the MAP under a Bayesian formulation
$\hat{\theta} \in \operatorname{argmax}_\theta P(\theta|X = x) = \operatorname{argmax}_\theta \frac{P(X = x|\theta)\,P(\theta)}{P(X = x)} = \operatorname{argmax}_\theta P(X = x|\theta)\,P(\theta)$
• Prior $P(\theta)$ weights; MLE is like the uniform prior $P(\theta) \propto 1$
• Extremum estimator: max $\log P(X = x|\theta) + \log P(\theta)$
28
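A small sketch tying MAP to this extremum-estimator view, reusing the earlier $N(\theta, 1)$ likelihood with $N(0, 1)$ prior and the single observation $X = 1$ (illustration only; assumes NumPy/SciPy):

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1.0])                        # the single observation X = 1

def neg_log_likelihood(theta):
    return 0.5 * np.sum((x - theta) ** 2)  # N(theta, 1) likelihood, constants dropped

def neg_log_prior(theta):
    return 0.5 * theta ** 2                # N(0, 1) prior, constants dropped

mle = minimize_scalar(neg_log_likelihood).x                                       # uniform-prior limit
map_est = minimize_scalar(lambda t: neg_log_likelihood(t) + neg_log_prior(t)).x  # add the log prior

print("MLE:", mle, " MAP:", map_est)       # roughly 1.0 and 0.5 (the posterior mode)
```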
Frequentists vs Bayesians – Oh My!
• Two key schools of statistical
thinking
∗ Decision theory complements both
• Past: controversy; animosity;
almost a ‘religious’ choice
• Nowadays: deeply connected
29
[Image credit: https://xkcd.com/1132/ (CC-NC 2.5)]
30
(Some) Categories of
Probabilistic Models
Parametric vs non-parametric models
Parametric: determined by a fixed, finite number of parameters; limited flexibility; efficient statistically and computationally.
Non-parametric: number of parameters grows with data, potentially infinite; more flexible; less efficient.
31
Examples to come! There are non/parametric models
in both the frequentist and Bayesian schools.
Generative vs. discriminative models
• X’s are instances, Y’s are labels (supervised setting!)
∗ Given: i.i.d. data (X1, Y1),… ,(Xn, Yn)
∗ Find model that can predict Y of new X
• Generative approach
∗ Model full joint P(X, Y)
• Discriminative approach
∗ Model conditional P(Y|X) only
• Both have pros and cons
32
Examples to come! There are generative/discriminative
models in both the frequentist and Bayesian schools.
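One standard way to see the connection between the two (a general identity, not tied to any particular model): a generative model of $P(X, Y)$ always yields a predictor via Bayes rule,

$$P(Y|X) = \frac{P(X, Y)}{P(X)} = \frac{P(Y)\,P(X|Y)}{\sum_{y} P(Y = y)\,P(X|Y = y)}$$

whereas a discriminative model specifies $P(Y|X)$ directly without committing to a model of $P(X)$.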
Mini Summary
• Bayesian paradigm: It's all in the prior!
• Bayesian point estimate: MAP (an extremum estimator)
• Parametric vs Non-parametric models
• Discriminative vs. Generative models
Next: Logistic regression (unlike you’ve ever seen before)
Workshops week #2: learning Bayes one coin flip at a time!
33