Data 100, Midterm 2 Fall 2019
Name:
Email:                              @berkeley.edu
Student ID:
Exam Room:
All work on this exam is my own (please sign):
Instructions:
• This midterm exam consists of 100 points and must be completed in the 80-minute time period ending at 9:30, unless you have accommodations supported by a DSP letter.
• Note that some questions have circular bubbles to select a choice. This means that you should only select one choice. Other questions have boxes. This means you should select all that apply.
• When selecting your choices, you must fully shade in the box/circle. Check marks will likely be mis-graded.
• You may use two cheat sheets each with two sides.
• Please show your work for computation questions as we may award partial credit.
Data 100 Midterm 2, Page 2 of 12 November 13, 2019
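Reference Table
exp(x): $e^x$
log(x): $\log_e(x)$ or $\ln(x)$
Linear regression model: $\hat{y} = f_{\hat{\vec{\beta}}}(\vec{x}) = \vec{x}^T \hat{\vec{\beta}}$
Logistic (or sigmoid) function: $\sigma(t) = \frac{1}{1 + \exp(-t)}$
Logistic regression model: $\hat{y} = f_{\hat{\vec{\beta}}}(\vec{x}) = P(Y = 1 \mid \vec{x}) = \sigma(\vec{x}^T \hat{\vec{\beta}})$
Squared error loss: $L(y, \hat{y}) = (y - \hat{y})^2$
Absolute error loss: $L(y, \hat{y}) = |y - \hat{y}|$
Cross-entropy loss: $L(y, \hat{y}) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$
Model Bias: $E[f_{\hat{\vec{\beta}}}(\vec{x})] - g(x)$
Model Variance: $E[(f_{\hat{\vec{\beta}}}(\vec{x}) - E[f_{\hat{\vec{\beta}}}(\vec{x})])^2]$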
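For reference, the sigmoid and the three loss functions in the table can be written as a minimal Python sketch. The function names are illustrative choices, not notation from the course, and the code assumes scalar inputs with 0 < y_hat < 1 for the cross-entropy loss.

```python
import math


def sigmoid(t):
    """Logistic function: 1 / (1 + exp(-t)). Not hardened against overflow."""
    return 1.0 / (1.0 + math.exp(-t))


def squared_loss(y, y_hat):
    """Squared error loss: (y - y_hat)^2."""
    return (y - y_hat) ** 2


def absolute_loss(y, y_hat):
    """Absolute error loss: |y - y_hat|."""
    return abs(y - y_hat)


def cross_entropy_loss(y, y_hat):
    """Cross-entropy loss with natural log; assumes 0 < y_hat < 1."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)
```

With these definitions, sigmoid(0) evaluates to 0.5, and the cross-entropy loss for y = 1 and y_hat = 0.5 equals log(2).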
0 Howdy
[0 pts] In LASSO regression, LASSO is an acronym. What does it stand for?
1 PCA
A children’s zoo collects data about how much time 1000 visitors spend at each of 8 selected exhibits and stores them in a dataframe df_zoo. These exhibits include 6 animals and 2 activities (train and playground). An example row of df_zoo is given below.
(a) [3 Pts] Suppose we center and scale df_zoo (as we learned about in class) to form the design matrix X. X has 1000 rows and 8 columns exactly corresponding to the dataframe described above, except that it has been centered and scaled. Suppose we then use SVD to decompose X into U, Σ, and V^T. Suppose that we want to compute the principal component matrix P, where the 1st column of P is the 1st principal component, the 2nd column of P is the 2nd principal component, etc. Which of the following expressions are equal to P? Select all that apply.
U Σ V^T X UX UΣ XU XΣ XV
(b) [2 Pts] How many rows and columns are in P?
# rows =          # columns =
(c) i. [3 Pts] What is the total variance V of our centered and scaled design matrix X? If there is not enough information provided in the problem statement, write “not enough information.”
answer =
ii. [3 Pts] Suppose our first 6 singular values are 56, 53, 21, 20, 20, 19. What fraction of the variance is captured by the first two principal components? Do not carry out any arithmetic operations; just give us a numerical expression that could be evaluated into the correct answer. Regardless of your answer to the previous problem, you may assume that you know V, and may give your answer for this problem in terms of V. If there is not enough information, write “not enough information.”
answer =
(d) [6 Pts] Below is a 2D scatterplot of the first two principal components. We see that there appear to be 3 types of visitors, grouped on the top, bottom-left, and bottom-right.
Below are plots of the first and second rows of V^T.
Use these plots to describe the characteristics of each of the 3 groups in the scatterplot above. Your explanations should only be a sentence or two.
Top group description:
Bottom-left group description:
Bottom-right group description:
2 Linear Regression
Suppose we have a data set of 100 points whose first few rows are shown below, and that we’d like to predict $\vec{y}$ from $\vec{v}$ and $\vec{w}$. Suppose we create a design matrix $X$ whose first column is $\vec{v}$, second column is $\vec{w}$, and third column $\vec{u}$ is a new feature $u_i = |v_i|$. The resulting model is $\hat{y}_i = \beta_1 v_i + \beta_2 w_i + \beta_3 |v_i|$. The top row is row 1, e.g. $y_1 = 4$.
y    v    w
4    -30  1
6    -40  2
5    20   3
(a) [3 Pts] For the data above, suppose we arbitrarily pick $\vec{\beta} = [0.1, 12, 0.2]^T$. What is $\hat{y}_1$?
$\hat{y}_1$ =
(b) [2 Pts] For the data above, let $\vec{e}$ be the residual vector if $\vec{\beta} = [0.1, 12, 0.2]^T$. What is $|e_1|$?
$|e_1|$ =
(c) [3 Pts] For the data above, suppose that $\vec{e} \cdot \vec{e} = 9$. What is the MSE?
MSE =
(d) [3 Pts] Let $\hat{\vec{\beta}}$ be the exact parameter vector that minimizes the empirical L2 risk, where we write this risk as $R(\vec{\beta}, X, \vec{y})$. Also, let $\vec{e}$ be the residuals for the optimal parameter vector $\hat{\vec{\beta}}$. Which of the following quantities are guaranteed to be zero?
$\sum_i e_i$
The MSE
$\nabla_{\vec{\beta}} R(\hat{\vec{\beta}}, X, \vec{y})$
$\vec{e} \cdot \hat{\vec{y}}$
$\vec{e} \cdot \hat{\vec{\beta}}$
None of these
(e) [1 1/2 Pts] For the data above, the matrix $X$ has full rank (i.e. no columns are linear combinations of any others). Suppose we compute $Z = (X^T X)^{-1} X^T \vec{y}$. What is $Z$? Select one and fill in its blank.
⃝ It is a vector of length ____.
⃝ It is a matrix with ____ rows and ____ columns.
⃝ It does not exist because $|v_i|$ is not differentiable.
(f) [5 Pts] Let $\vec{\beta}_{ridge}$ be the $\vec{\beta}$ that minimizes the sum of the MSE plus an L2 regularization term for a positive $\lambda$. Let $\vec{e}$ be the residuals for the parameter vector $\vec{\beta}_{ridge}$. Which of the following are true? Recall that $\|\vec{\beta}\|_2^2$ is the sum of the squares of the components of $\vec{\beta}$ and $R$ is the empirical L2 risk defined in (d).
$\sum_i e_i = 0$
$\nabla_{\vec{\beta}} R(\vec{\beta}_{ridge}, X, \vec{y}) = 0$
$R(\hat{\vec{\beta}}, X, \vec{y}) \le R(\vec{\beta}_{ridge}, X, \vec{y})$
$\|\vec{\beta}_{ridge}\|_2^2 \le \|\hat{\vec{\beta}}\|_2^2$
None of these
3 Bias-Variance Tradeoff
We obtain n data points (n is some large fixed integer) which have been generated from the true model $Y = f(x) + \varepsilon$, where $\varepsilon$ is random noise ($E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$).
We fit linear models of varying complexity to our data, and plotted the bias, variance, and irreducible error below.
(a) [1 1/2 Pts] Sketch the MSE on the above graph. Where does its minimum occur? Draw a star on your MSE plot where the minimum occurs.
(b) [1 Pt] Suppose we control the complexity of the linear models using a Ridge penalty term $\lambda \sum_i \beta_i^2$. Which of the following is true?
⃝ The left side of the graph represents small λ.
⃝ The right side of the graph represents small λ.
(c) [3 Pts] Which of the following can impact our model variance? Select all that apply.
The regularization coefficient λ.
The choice of features to include in our design matrix.
The learning rate α in gradient descent.
The size of the training set.
4 Cross Validation
Suppose we have a training dataset of 90 points, and a test set of 30 points, and want to know which λ value is best for a ridge regression model. Our candidate hyperparameters are λ = 0.1, λ = 1, and λ = 10.
(a) [2 1/2 Pts] A DS100 student suggests performing 10-fold cross validation to find the optimal λ. Is the choice of 10-fold CV reasonable?
⃝ Yes.
⃝ No, since we have 3 candidate hyperparameters we should use 3-fold cross validation.
⃝ No, since we have 30 test points, we should use 30-fold cross validation.
⃝ No, CV should never be used for selecting hyperparameters.
(b) Suppose we select the best choice of λ from the three choices available using 3-fold cross validation. As mentioned in class, we can compute the optimal parameters for a ridge regression model with the expression $\vec{\beta} = (X^T X + n\lambda I)^{-1} X^T \vec{y}$. Assume that we use this closed-form expression to fit the parameters for our model.
i. [2 Pts] During the entire process of selecting our best λ, how many total times will we evaluate the expression $(X^T X + n\lambda I)^{-1} X^T \vec{y}$?
⃝1 ⃝2 ⃝3 ⃝6 ⃝9 ⃝30 ⃝60 ⃝90 ⃝270
ii. [2 Pts] How many rows will be in X each time this expression is evaluated?
⃝1 ⃝2 ⃝3 ⃝6 ⃝9 ⃝30 ⃝60 ⃝90 ⃝120 ⃝ It will vary each time. ⃝ Not enough information.
(c) As in the previous part, suppose we want to select the best λ from the three choices above using 3-fold cross validation. To evaluate the MSE for a given $\vec{\beta}$, we use the sum of squares: $\|\vec{y} - X\vec{\beta}\|_2^2$. As a reminder, this expression is just another way of writing $\sum_i (y_i - \vec{x}_i^T \vec{\beta})^2$.
i. [2 Pts] During the entire process of selecting our best λ, how many times will this expression get evaluated?
⃝1 ⃝2 ⃝3 ⃝6 ⃝9 ⃝30 ⃝60 ⃝90
ii. [2 Pts] How many rows will be in X each time this expression is evaluated?
⃝1 ⃝2 ⃝3 ⃝6 ⃝9 ⃝30 ⃝60 ⃝90 ⃝120 ⃝ It will vary each time. ⃝ Not enough information.
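For reference, the closed-form ridge expression and the sum-of-squares error used in this section can be combined into a k-fold cross-validation loop. The following is a minimal sketch, assuming NumPy is available. The helper names, the synthetic data, the fold count, and the candidate λ values are all hypothetical and deliberately differ from the numbers in the questions above.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + n*lam*I)^{-1} X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)


def sum_squared_error(X, y, beta):
    """Sum of squares ||y - X beta||_2^2."""
    residuals = y - X @ beta
    return float(residuals @ residuals)


def cross_validate(X, y, lam, k):
    """Average held-out mean squared error for one lambda over k folds."""
    indices = np.arange(len(y))
    errors = []
    for held_out in np.array_split(indices, k):
        train = np.setdiff1d(indices, held_out)
        beta = ridge_fit(X[train], y[train], lam)
        sse = sum_squared_error(X[held_out], y[held_out], beta)
        errors.append(sse / len(held_out))
    return float(np.mean(errors))


# Hypothetical synthetic data (not the exam's data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=20)

# Score each hypothetical candidate and keep the one with the lowest error.
scores = {lam: cross_validate(X, y, lam, k=4) for lam in (0.01, 1.0)}
best_lam = min(scores, key=scores.get)
```

The sketch fits on the training folds only and scores on the held-out fold. The test set plays no part in choosing λ.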
5 Gradient Descent
(a) [3 Pts] The learning rate can potentially affect which of the following? Select all that apply. Assume nothing about the function being minimized other than that its gradient exists. You may assume the learning rate is positive.
The speed at which we converge to a minimum.
Whether gradient descent converges.
The direction in which the step is taken.
Whether gradient descent converges to a local minimum or a global minimum.
(b) [3 Pts] Suppose we run gradient descent with a fixed learning rate of $\alpha = 0.1$ to minimize the 2D function $f(x, y) = 5 + x^2 + y^2 + 5xy$.
The gradient of this function is
$\nabla_{x,y} f(x, y) = \begin{bmatrix} 2x + 5y \\ 2y + 5x \end{bmatrix}$
If our starting guess is $x^{(0)} = 1$, $y^{(0)} = 2$, what will be our next guess $x^{(1)}$, $y^{(1)}$?
$x^{(1)}$ =          $y^{(1)}$ =
(c) [2 Pts] Suppose we are performing gradient descent to minimize the empirical risk of a linear regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2$ on a dataset with 100 observations. Let D be the number of components in the gradient, e.g. D = 2 for the equation in part (b). What is D for the gradient used to optimize this linear regression model?
⃝2 ⃝3 ⃝4 ⃝8 ⃝100 ⃝200 ⃝300 ⃝400 ⃝800
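For reference, a single gradient descent update subtracts the learning rate times the gradient from the current guess. The sketch below is a minimal illustration on a hypothetical function $g(x, y) = x^2 + 3y^2$, chosen to differ from the function in this section. The names gradient_step and grad_g are illustrative.

```python
def gradient_step(point, grad_fn, alpha):
    """One update: new point = point - alpha * gradient evaluated at point."""
    grad = grad_fn(*point)
    return tuple(p - alpha * g for p, g in zip(point, grad))


def grad_g(x, y):
    """Gradient of the hypothetical function g(x, y) = x^2 + 3 y^2."""
    return (2 * x, 6 * y)


# Repeated updates from a hypothetical starting guess.
point = (1.0, 1.0)
for _ in range(50):
    point = gradient_step(point, grad_g, alpha=0.1)
```

For this convex example and learning rate, the iterates shrink toward the minimizer at the origin. Other functions or learning rates may behave differently.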
6 One Hot Encoding and Feature Engineering
A Canadian study of workers in the 1980s collected the following information:
• wage (hourly in dollars)
• edu (years)
• job_type (1 for blue collar, 2 for white collar, and 3 for managerial)
A data scientist fitted a model with wage as the response, and the other two variables as features (job_type was one-hot encoded). The resulting fitted model was $\hat{y} = \vec{x} \cdot \hat{\vec{\beta}}$, where $\hat{\vec{\beta}} = [-8, 3, 6, -3]^T$, i.e.
$\hat{y} = -8 + 3x_{edu} + 6x_m - 3x_b$,
where y is the hourly wage, $x_{edu}$ is years of education, and the other two variables are the dummies for managerial and blue collar workers, respectively.
(a) [2 Pts] For a blue collar worker with 10 years of education, what is the predicted value
of wage (the predicted hourly wage) according to our model?
wage =
(b) [2 Pts] For a white collar worker with 10 years of education, what is the predicted value
of wage according to our model?
wage =
(c) [6 Pts] Sketch the fitted model on the graph below. Hint: What you did in parts (a) and (b) is useful here. When grading we will only look at y-values for x = 10 and x = 20, so don’t worry about exact values other than these. Don’t worry about exact shape.
(d) [5 Pts] The first four rows of the original data frame appear below on the left.
Create the design matrix X used to fit the model on the previous page by filling in the table below. Put the variable name in the first row and fill the remaining 4 rows with the corresponding data. You may not need all columns. Use the top row to name your columns.
wage  edu  job.type
15    10   1
28    14   2
20    12   1
35    16   3
(e) [6 Pts] Suppose we believe that the slope of the relationship between education level and wage is different for each of our 3 job types, e.g. perhaps white collar workers have salaries that are 2x their years of education, but blue collar workers only 1.5x. Create a design matrix below that will yield a model with different slopes and y-intercepts for each job type. Use the top row to name your columns. You may not need all columns.
Warning: This is a very challenging problem. Move on if you’re stuck.
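For reference, one-hot encoding replaces a categorical column with 0/1 dummy columns. When the model also has an intercept, one level is usually left out so that the dummy columns and the intercept column are not linearly dependent. The sketch below is a minimal pure-Python illustration. The function name one_hot_rows and the shift data are hypothetical and unrelated to the wage data above.

```python
def one_hot_rows(categories, levels, drop):
    """Return the kept level names and one 0/1 dummy list per observation.

    The `drop` level gets no column, so it acts as the reference category.
    """
    kept = [level for level in levels if level != drop]
    rows = [[1 if c == level else 0 for level in kept] for c in categories]
    return kept, rows


# Hypothetical observations, not the exam's data frame.
shifts = ["day", "night", "day", "swing"]
columns, dummies = one_hot_rows(shifts, levels=["day", "night", "swing"], drop="day")

# Prepend a column of ones for the intercept.
design = [[1] + row for row in dummies]
```

Here an observation in the dropped level has zeros in every dummy column, so its prediction depends only on the intercept and any remaining features.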
7 Logistic Regression
Suppose we want to build a classifier to predict whether a person survived the sinking of the Titanic. The first 5 rows of our dataset are given below.
(a) For a given classifier, suppose the first 10 predictions of our classifier and 10 true observations are as follows:
prediction  1 1 1 1 1 0 1 1 1 1
true label  0 1 1 1 0 0 0 1 1 1
i. [1 Pt] What is the accuracy of our classifier on these 10 predictions?
ii. [1 1/2 Pts] What is the precision on these 10 predictions?
iii. [1 1/2 Pts] What is the recall on these 10 predictions?
(b) [4 1/2 Pts] In general (not just for the Titanic model), if we increase the threshold for a classification model, which of the following can happen to our precision, recall, and accuracy? We have not included the option “X can stay the same”, because this is trivially true (e.g. if we increase the threshold by some tiny number, it will have no effect).
Precision can increase.
Precision can decrease.
Recall can increase.
Recall can decrease.
Accuracy can increase.
Accuracy can decrease.
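For reference, accuracy, precision, and recall can all be computed from the four confusion-matrix counts. The sketch below is a minimal pure-Python illustration on hypothetical predictions and labels that differ from the ones in part (a). It assumes binary 0/1 values and at least one predicted positive and one true positive, so that no denominator is zero.

```python
def confusion_counts(predictions, labels):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    return tp, fp, fn, tn


def accuracy(predictions, labels):
    """Fraction of predictions that match the true label."""
    tp, fp, fn, tn = confusion_counts(predictions, labels)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(predictions, labels):
    """Fraction of predicted positives that are truly positive."""
    tp, fp, _, _ = confusion_counts(predictions, labels)
    return tp / (tp + fp)


def recall(predictions, labels):
    """Fraction of true positives that were predicted positive."""
    tp, _, fn, _ = confusion_counts(predictions, labels)
    return tp / (tp + fn)


# Hypothetical example, not the exam's table.
example_predictions = [1, 0, 1, 0]
example_labels = [1, 1, 0, 0]
```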
For convenience, we repeat the figure from the previous page below.
(c) Suppose after training our model we get $\vec{\beta} = [-1.2, -0.005, 2.5]^T$, where −1.2 is an intercept term, −0.005 is the parameter corresponding to passenger’s age, and 2.5 is the parameter corresponding to sex.
i. [3 Pts] Consider Sīlānah Iskandar Nāsīf Abī Dāghir Yazbak, a 20-year-old female. What chance did she have to survive the sinking of the Titanic according to our model? Give your answer as a probability in terms of σ. If there is not enough information, write “not enough information”.
P(Y = 1|age = 20,female = 1) =
ii. [3 Pts] Sīlānah Iskandar Nāsīf Abī Dāghir Yazbak actually survived. What is the cross-entropy loss for our prediction in part i? If there is not enough information, write “not enough information.”
cross entropy loss =
iii. [6 Pts] Let m be the odds of a given male passenger’s survival according to our model, i.e. if the passenger had an 80% chance of survival, m would be 4, since their odds of survival are 0.8/0.2 = 4. It turns out we can compute f, the odds of survival for a female of the same age, even if we don’t know the age of the two people. What is this relationship? Hint: How are the odds related to $t = \vec{x}^T \vec{\beta}$ for a given observation?
Warning: This is a very challenging problem. Move on if you’re stuck.
f =
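For reference, the conversion between a probability and its odds described in part iii can be sketched in a few lines of Python. The function names are illustrative, and the code assumes 0 <= p < 1.

```python
def odds(p):
    """Odds for probability p: p / (1 - p). Assumes 0 <= p < 1."""
    return p / (1 - p)


def probability(o):
    """Inverse conversion: the probability whose odds are o."""
    return o / (1 + o)
```

For the 80% example in the question, odds(0.8) is 4 up to floating-point rounding.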