
When Models Meet Data
Australian National University
1

8.1 Data, Models, and Learning
• A machine learning system has three major components: data, models, and learning
• A model is obtained by learning from the training data
• A prediction is made by applying a learned model on test data
[Diagram: (Training) Data → learning → Model; the learned Model is applied to (Test) Data to produce Predictions (cat, dog, …)]
2

8.1 Data, Models, and Learning
• We aim to learn good models.
• How is “good” defined? We need performance metrics on the test data. Examples include:
• Classification accuracy
• Distance from the ground truth
• Test time (efficiency)
• Model size
• ………
• New performance metrics are constantly being proposed by the machine learning community.
3

8.1.1 Data as Vectors
• Data, to be read by computers, should be in a numerical format.
• See the tabular format below
• Row: an instance
• Column: a particular feature
• Apart from tabular format, machine learning can be applied to many types of data, e.g., genomic sequences, text and image contents of a webpage, and social media graphs, citation networks…
4

• We convert the table into numerical format
• Gender is quantized to -1 and +1
• Degree from BS, MS to PhD: 1, 2, 3
• Postcode corresponds to Latitude and Longitude on the map
• Name is removed because of privacy and because it does not contain useful information for the machine learning system. (Exceptions? See [1])
5
[1] Chen et al., What’s in a Name? First Names as Facial Attributes. CVPR 2013
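The encodings above can be written in a few lines of code. The following is a minimal sketch (not part of the slides); the column names and values are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical raw table; column names and values are assumptions for illustration.
raw = pd.DataFrame({
    "Name":     ["Alice", "Bob"],
    "Gender":   ["F", "M"],
    "Degree":   ["MS", "PhD"],
    "Postcode": ["2601", "2000"],
    "Age":      [36, 47],
})

features = pd.DataFrame({
    "gender": raw["Gender"].map({"M": -1, "F": +1}),            # Gender quantized to -1 / +1
    "degree": raw["Degree"].map({"BS": 1, "MS": 2, "PhD": 3}),  # BS, MS, PhD -> 1, 2, 3
    "age":    raw["Age"],
})
# Postcode would be replaced by latitude/longitude via a lookup table (omitted here).
# Name is dropped: private and (usually) uninformative.
X = features.to_numpy(dtype=float)   # numerical example matrix, shape (N, D)
```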

• We use 𝑁 to denote the number of examples in a dataset and index the examples with lowercase 𝑛 = 1, ⋯ , 𝑁
• Each row is a particular individual x_n, referred to as an example or data point in machine learning
• The subscript 𝑛 refers to the fact that this is the 𝑛th example out of a total of 𝑁 examples in the dataset
• Each column represents a particular feature of interest about the example, and we index the features as 𝑑 = 1, ⋯ , 𝐷
• Each example is a 𝐷-dimensional vector
6

• Consider the problem of predicting annual salary from age
[Table of examples: N rows × D columns]
• A supervised learning algorithm
• We have a label y_n (the salary) associated with each example x_n (age).
• A dataset is written as a set of example-label pairs
{(x_1, y_1), …, (x_n, y_n), …, (x_N, y_N)}
• The examples x_1, …, x_N are concatenated (as rows) and written as X ∈ R^{N×D}
We are interested in: What is the salary (𝑦) at age 60 (𝑥 = 60)?
[Scatter plot of the data: x-axis = age, y-axis = salary]
7

8.1.2 Models as Functions
• Once we have data in an appropriate vector representation, we can construct a predictive function (known as a predictor).
• Here, a model means a predictor.
• A predictor is a function that, when given a particular input example (in our case, a vector of features), produces an output.
• For example,
f: R^D → R
where the input x is a D-dimensional vector and the output is a real-valued scalar. That is, the function f is applied to x, written as f(x), and returns a real number.
8

8.1.2 Models as Functions
• We mainly consider the special case of linear functions
f(x) = θ^T x + θ_0
• Example: predicting salary f(x) from age x.
[Figure: the black solid diagonal line is an example predictor, with f(60) = 100.]
9
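As a concrete illustration of the affine predictor above, here is a minimal Python sketch; the parameter values are made up so that f(60) = 100, matching the figure.

```python
import numpy as np

# Affine predictor f(x) = theta^T x + theta_0 for a single feature (age).
# The parameter values are assumptions chosen only so that f(60) = 100.
theta = np.array([1.5])   # slope
theta_0 = 10.0            # intercept

def predict(x):
    """Return the predicted salary for a feature vector x."""
    return float(theta @ x + theta_0)

print(predict(np.array([60.0])))  # 100.0
```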

8.1.3 Models as Probability Distributions
• The observed data is usually a combination of the true underlying data and noise, i.e., x̃ = x + n
• We wish to recover x from x̃
• So we would like to have predictors that express some sort of uncertainty, e.g., to quantify the confidence we have about the value of the prediction for a particular test data point.
[Figure: example function (black solid diagonal line) and its predictive uncertainty at x = 60, drawn as a Gaussian.]
• Instead of considering a predictor as a single function, we could consider predictors to be probabilistic models.
• We will learn probability in later lectures
10
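One simple way to make the linear predictor probabilistic is to return a Gaussian over the output rather than a single number. The sketch below is an illustration only; the parameters and the noise level sigma are assumptions, and probabilistic models are covered properly in later lectures.

```python
import numpy as np
from scipy.stats import norm

theta, theta_0, sigma = np.array([1.5]), 10.0, 8.0   # made-up parameters and noise level

def predictive_distribution(x):
    """Return a Gaussian over the label for input x (mean from the linear model)."""
    mean = float(theta @ x + theta_0)
    return norm(loc=mean, scale=sigma)

dist = predictive_distribution(np.array([60.0]))
print(dist.mean(), dist.std())   # point prediction and its uncertainty
print(dist.interval(0.95))       # a 95% predictive interval
```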

8.1.4 Learning is Finding Parameters
• The goal of learning is to find a model and its corresponding parameters such that the resulting predictor will perform well on unseen data.
• There are three algorithmic phases when discussing machine learning algorithms:
• Prediction or inference
• Training or parameter estimation
• Hyperparameter tuning or model selection
• Prediction phase: we use a trained predictor on previously unseen test data
• The training or parameter estimation phase: we adjust our predictive model based on training data. We will introduce empirical risk minimization for finding good parameters.
• We use cross-validation to assess predictor performance on unseen data.
• We also need to balance between fitting well on training data and finding “simple” explanations of the phenomenon. This trade-off is often achieved using regularization.
11

• Hyperparameter tuning or model selection
• We need to make high-level modeling decisions about the structure of the predictor. For example
• Number of layers to be used in deep learning
• Number of components in a Gaussian Mixture Model
• Weight of regularization terms
• The problem of choosing among different models/hyperparameters is called model selection
• Difference between parameters and hyperparameters
• Parameters are numerically optimized (e.g., ~10B weights in a deep network)
• Hyperparameters are typically chosen with search techniques (e.g., neural architecture search [2])
[2] Zoph et al., Neural architecture search with reinforcement learning, Arxiv 2016
12
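To make the distinction concrete, here is a minimal sketch of model selection for one hyperparameter, the weight of an L2 regularization term, using a held-out validation split. Everything here (the data, the candidate values, the helper name fit_ridge) is made up for illustration; cross-validation itself is discussed later.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                          # toy data, made-up
y = X @ rng.normal(size=5) + rng.normal(0, 0.5, 100)

X_tr, y_tr, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

def fit_ridge(X, y, lam):
    """Closed-form ridge regression: minimizes ||y - X theta||^2 + lam * ||theta||^2."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Hyperparameter search: try each candidate lam, keep the one with lowest validation error.
candidates = [0.01, 0.1, 1.0, 10.0]
val_err = {lam: np.mean((y_val - X_val @ fit_ridge(X_tr, y_tr, lam)) ** 2)
           for lam in candidates}
best_lam = min(val_err, key=val_err.get)
print(best_lam, val_err[best_lam])
```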

8.2 Empirical Risk Minimization
• What does it mean to learn?
• Estimating parameters based on training data.
• Four questions will be answered
• What is the set of functions we allow the predictor to take? – Hypothesis class of functions
• How do we measure how well the predictor performs on the training data? – Loss functions for training
• How do we construct predictors from only training data that perform well on unseen test data? – Regularization
• What is the procedure for searching over the space of models? – Cross-validation
13

8.2.1 Hypothesis Class of Functions
• We are given N examples x_n ∈ R^D and corresponding scalar labels y_n ∈ R.
• Supervised learning: we have pairs (x_1, y_1), …, (x_N, y_N)
• We want to estimate a predictor f(·, θ): R^D → R, parametrized by θ
• We hope to be able to find a good parameter θ* such that we fit the data well, that is
f(x_n, θ*) ≈ y_n for all n = 1, …, N
• We use ŷ_n = f(x_n, θ*) to represent the output of the predictor
14

Example (least-squares regression)
• When the label y_n is real-valued, a popular choice of function class for predictors is affine functions (linear functions):
f(x) = θ^T x + θ_0
• For more compact representations, we concatenate an additional unit feature x_n^(0) = 1 to x_n, i.e.,
x_n = [x_n^(0), x_n^(1), x_n^(2), …, x_n^(D)]^T = [1, x_n^(1), x_n^(2), …, x_n^(D)]^T
• The parameter vector is θ = [θ_0, θ_1, θ_2, …, θ_D]^T
• We can write the predictor as follows
f(x_n, θ) = θ^T x_n
which is equivalent to the affine model
f(x_n, θ) = θ_0 + Σ_{d=1}^{D} θ_d x_n^(d) = θ_0 x_n^(0) + Σ_{d=1}^{D} θ_d x_n^(d) = θ^T x_n
15
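The concatenation of the unit feature (sometimes called the bias trick) is easy to do in code. A minimal sketch with made-up numbers:

```python
import numpy as np

# Prepend a constant feature 1 to every example so the affine model
# theta_0 + sum_d theta_d * x_d becomes a single dot product theta^T x.
X = np.array([[25.0], [32.0], [60.0]])            # toy N x D matrix (D = 1: age), made-up values
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # N x (D + 1), first column is all ones

theta = np.array([10.0, 1.5])                     # [theta_0, theta_1], made-up values
y_hat = X_aug @ theta                             # predictions for all N examples at once
print(y_hat)                                      # [47.5, 58.0, 100.0]
```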

Example (least-squares regression)
f(x_n, θ) = θ^T x_n
• The predictor takes the vector of features representing a single example x_n as input and produces a real-valued output,
f: R^{D+1} → R
• f(x_n, θ) = θ^T x_n is a linear predictor
• There are many non-linear predictors, such as neural networks
16

8.2.2 Loss Function for Training
• In training, we aim to learn a model that fits the data well.
• To define “fits the data well”, we specify a loss function ℓ(y_n, ŷ_n)
• Input: the ground truth label y_n of a training example and the prediction ŷ_n of this training example
• Output: a non-negative number, called loss. It represents how much error we have made on this particular prediction
• To find good parameters 𝜽∗, we need to minimize the average loss on the set of 𝑁 training examples
• We usually assume training examples (x_1, y_1), …, (x_N, y_N) are independent and identically distributed (i.i.d.).
17

• Under the i.i.d assumption, the empirical mean is a good estimate of the population mean.
• We can use the empirical mean of the loss on the training data
• Given a training set {(x_1, y_1), …, (x_N, y_N)}, we use the notation of an example matrix
X := [x_1, …, x_N]^T ∈ R^{N×D}
and a label vector
y := [y_1, …, y_N]^T ∈ R^N
• The average loss is given by
R_emp(f, X, y) = (1/N) Σ_{n=1}^{N} ℓ(y_n, ŷ_n)
where ŷ_n = f(x_n, θ). The above equation is called the empirical risk. The learning strategy is called empirical risk minimization.
18
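A minimal sketch of computing the empirical risk for a generic predictor and loss; the function names and the squared loss below are just one possible choice for illustration.

```python
import numpy as np

def empirical_risk(predict, X, y, loss):
    """R_emp(f, X, y) = (1/N) * sum_n loss(y_n, f(x_n))."""
    y_hat = np.array([predict(x) for x in X])
    return float(np.mean(loss(y, y_hat)))

squared_loss = lambda y, y_hat: (y - y_hat) ** 2   # one common choice of loss

# Hypothetical usage with the earlier linear-predictor sketch:
# risk = empirical_risk(lambda x: theta @ x, X_aug, y, squared_loss)
```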

Example – Least-Squares Loss
• We use the squared loss function
ℓ(y_n, ŷ_n) = (y_n − ŷ_n)²
• We aim to minimize the empirical risk, which is the average of the losses over the training data:
min_{θ∈R^D} (1/N) Σ_{n=1}^{N} ℓ(y_n, ŷ_n) = min_{θ∈R^D} (1/N) Σ_{n=1}^{N} (y_n − f(x_n, θ))²
• Using the linear predictor f(x_n, θ) = θ^T x_n, we obtain the optimization problem
min_{θ∈R^D} (1/N) Σ_{n=1}^{N} (y_n − θ^T x_n)²
• This equation can be equivalently expressed in matrix form:
min_{θ∈R^D} (1/N) ‖y − Xθ‖²
• This is known as the least-squares problem. There exists a closed-form analytic solution for it, obtained by solving the normal equations. We will discuss it in later lectures.
19
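A minimal end-to-end sketch of this least-squares problem on synthetic data; the numbers are made up, and np.linalg.lstsq is used here simply as a stand-in for the normal-equation solution discussed later.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
age = rng.uniform(20, 65, size=(N, 1))
X = np.hstack([np.ones((N, 1)), age])            # bias trick: N x (D + 1)
true_theta = np.array([10.0, 1.5])               # made-up "true" parameters
y = X @ true_theta + rng.normal(0, 5.0, size=N)  # noisy salaries

# Solve min_theta (1/N) * ||y - X theta||^2 (the 1/N factor does not change the minimizer).
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_star)                                # should be close to [10.0, 1.5]
print(np.mean((y - X @ theta_star) ** 2))        # empirical risk at the minimizer
```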

• We actually want to find a predictor 𝑓 that minimizes the expected risk (or the population risk)
R_true(f) = E_{x,y}[ℓ(y, f(x))]
where 𝑦 is the ground truth label and 𝑓 𝒙 is the prediction
based on the example 𝒙.
• R_true(f) is the true risk; we could compute it only if we had access to an infinite amount of data
• The expectation 𝔼 is over the infinite set of all possible data and labels.
20

• Machine learning applications have different types of performance measure.
• For classification: accuracy, AUC, F1 score, etc.
• For detection: mean average precision, mIoU, etc.
• For image denoising / super-resolution: SSIM, PSNR, etc.
• In principle, the loss function should correspond to the measure.
• However, there are often mismatches between loss functions and the measures – due to implementation/optimization considerations
21

Check your understanding
• A machine learning model may contain as few as a couple of parameters
• When we use a linear regression model, f(x) = θ^T x + θ_0, we don’t have hyperparameters.
• Hyperparameters are usually learned in the same way as normal parameters.
• It’s very hard to know the expected risk, but easier to know the empirical risk
• Given a fixed task, we can only use a fixed set of evaluation metrics.
22

We start with training a classifier
[Diagram: training data → classifier]
Deng and Zheng, “Are labels always necessary for classifier accuracy evaluation?” CVPR 2021.
23

We do a bit of testing….
[Diagram: test images are fed to the classifier. Correct prediction: softmax output (cat 0.95, dog 0.05), so the prediction equals the ground truth (cat). Wrong prediction: softmax output (dog 0.85, cat 0.15), so the prediction does not equal the ground truth (cat).]
24
Deng and Zheng, “Are labels always necessary for classifier accuracy evaluation?” CVPR 2021.

We now evaluate a model
[Diagram: a classifier trained on the training data is evaluated on testing data with ground truths provided, yielding, e.g., Accuracy: 90%.]
Deng and Zheng, “Are labels always necessary for classifier accuracy evaluation?” CVPR 2021.
25

Is this way of evaluation feasible? • Yes
[Examples of labeled test sets (ground truths provided): ImageNet, MS COCO, LFW.]
Deng and Zheng, “Are labels always necessary for classifier accuracy evaluation?” CVPR 2021.
26

Is this way of evaluation feasible? • No….
Suppose we deploy our cat-dog classifier to a swimming pool
We can’t calculate the classifier’s accuracy!
Ground truths not provided
27
Deng and Zheng, “Are labels always necessary for classifier accuracy evaluation?” CVPR 2021.

We encounter this problem too many times in CV applications….
• Deploy a ReID model to a new community
• Deploy face recognition in an airport
• Deploy a 3D object detection system to a new city
• ……
We can’t quantitatively measure the performance of our model like we usually do!!
Unless we annotate the test data…, but the environment will change over time, so we would need to annotate the test data again
Deng and Zheng, “Are labels always necessary for classifier accuracy evaluation?” CVPR 2021.

Formally:
Given
– A training dataset
– A classifier trained on this dataset
– A test set without labels
We want to estimate:
Classification accuracy on the test set
29
Deng and Zheng, “Are labels always necessary for classifier accuracy evaluation?” CVPR 2021.

Method – regression
[Scatter plot (digit classification): x-axis = Fréchet distance, measuring the domain gap between a training set and test sets; y-axis = recognition accuracy (%). Every point is a dataset.]
Deng and Zheng, “Are labels always necessary for classifier accuracy evaluation?” CVPR 2021.
30
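The idea behind this regression method can be sketched as follows. Assume we already have a collection of meta test sets for which we know both the domain gap to the training set (e.g., a Fréchet distance between feature statistics) and the classifier's true accuracy; we fit a regressor on these pairs and use it to predict accuracy on a new, unlabeled test set. The numbers and the simple linear model below are assumptions for illustration only, not the exact method of the paper.

```python
import numpy as np

# (domain_gap, accuracy) pairs from labeled "meta" test sets: made-up numbers.
domain_gap = np.array([0.5, 1.0, 2.0, 3.5, 5.0])
accuracy   = np.array([0.95, 0.91, 0.83, 0.72, 0.60])

# Fit a simple linear regression: accuracy ~ a * gap + b.
a, b = np.polyfit(domain_gap, accuracy, deg=1)

# For a new unlabeled test set, compute its domain gap and predict its accuracy.
new_gap = 2.8
print(a * new_gap + b)   # estimated accuracy, no test labels needed
```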

Method – regression
[Scatter plot: x-axis = rotation prediction accuracy, y-axis = classifier accuracy. Every point is a dataset.]
Deng, Gould and Zheng, “What does rotation prediction tell us about classifier accuracy under varying testing environments?” ICML 2021
31

Experiment
Deng and Zheng, “Are labels always necessary for classifier accuracy evaluation?” CVPR 2021.
32

Experiment
“Confidence”: a simple pseudo label method.
If the maximum value of the softmax vector is greater than 𝜏, we view this sample as correctly classified.
33
Deng and Zheng, “Are labels always necessary for classifier accuracy evaluation?” CVPR 2021.
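A minimal sketch of this “Confidence” baseline: the estimated accuracy of the classifier on an unlabeled test set is the fraction of samples whose maximum softmax value exceeds the threshold τ. The softmax scores below are made-up placeholders.

```python
import numpy as np

# Each row is the classifier's softmax output for one unlabeled test image (made-up values).
softmax = np.array([
    [0.95, 0.05],   # confident -> counted as correctly classified
    [0.55, 0.45],   # not confident
    [0.15, 0.85],   # confident
])
tau = 0.8

# "Confidence" baseline: estimated accuracy = fraction of samples with max softmax > tau.
estimated_accuracy = np.mean(softmax.max(axis=1) > tau)
print(estimated_accuracy)   # 2/3 in this toy example
```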

Experiment
Deng and Zheng, “Are labels always necessary for classifier accuracy evaluation?” CVPR 2021.
34

Experiment
The two regression methods are stable and quite accurate.
35
Deng and Zheng, “Are labels always necessary for classifier accuracy evaluation?” CVPR 2021.
