ETX2250/ETF5922: Data Visualization and
Lecturer:
Department of Econometrics and Business Statistics
Week 6
Prediction
Prediction arises in many business contexts.
There is some unknown variable that is the target of the prediction.
This is usually denoted y and may be called the dependent variable, response, or target variable.
There are some known variables that are used to make the prediction.
These are usually denoted x and may be called the independent variables, predictors, or features.
Supervised Learning
For some observations, data will be available for both x AND y.
We can use these observations to learn some rule that gives predictions of y as a function of x.
This prediction is denoted ŷ = f̂(x).
This general setup is often called supervised learning.
Unsupervised Learning
There is only information available for x.
We group together similar observations to create clusters, or learn relationships between the observations.
In machine learning, we would call this training with unlabelled data.
As a general rule, supervised learning is more likely to have greater accuracy.
Summary: Supervised learning

Variable                 Training         Evaluation
Predictor 1 (X1)         Data available   Data available
Predictor 2 (X2)         Data available   Data available
Dependent Variable (Y)   Data available   Data NOT available
Example: Supervised learning

Variable      Old Customers    New Customer
Age (X1)      Data available   Data available
Limit (X2)    Data available   Data available
Default (Y)   Data available   Data NOT available
Summary: Unsupervised learning

Variable          Training             Evaluation
Variable 1 (X1)   Data available       Data available
Variable 2 (X2)   Data available       Data available
Class             Data NOT available   Data NOT available
Example: Unsupervised learning

Variable           Observed penguin     New penguin
Bill length (X1)   Data available       Data available
Bill depth (X2)    Data available       Data available
Species            Data NOT available   Data NOT available
Supervised learning
Regression
Sometimes y is a numeric (metric) variable. For example:
Company profit next month.
Amount spent by a customer.
Demand for a new product.
In this case we are doing regression.
This can be more general than the linear regression that you may be familiar with.
Classification
Sometimes y is a categorical (nominal, non-metric) variable. For example:
Will a borrower default on a loan?
Can we detect which tax returns are fraudulent?
Can we predict which brand customers will choose?
In this case we are doing classification.
Classification example: Credit Data
Default or not?
Assessing Classification
Some math
Generally, data x_i and y_i are available for i = 1, 2, 3, …, n.
An algorithm is trained on this data. Some function of x is derived, ŷ = f̂(x).
How do we decide if f̂ is a good classifier or a bad classifier?
Misclassification
The misclassification error is given by

(1/n) Σ_{i=1}^{n} I(y_i ≠ ŷ_i)

Here I(⋅) equals 1 if the statement in parentheses is true and 0 otherwise.
Large numbers imply worse performance.
Since all points are used for training and evaluation, this measures in-sample performance.
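As a quick illustration (not from the unit materials), the misclassification error can be computed directly from the definition above; the labels below are made up:

```python
def misclassification_error(y_true, y_pred):
    # (1/n) * sum of I(y_i != yhat_i): the share of wrong predictions.
    n = len(y_true)
    return sum(yt != yp for yt, yp in zip(y_true, y_pred)) / n

# Hypothetical labels: five observations, one prediction is wrong.
y_true = ["default", "no default", "default", "no default", "default"]
y_pred = ["default", "no default", "no default", "no default", "default"]
print(misclassification_error(y_true, y_pred))  # 0.2
```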
Training v Test
In practice we want predictions for values of y that are not yet observed.
To artificially create this scenario, the data we have available can be split into two:
Training sample: used to determine f̂.
Test sample: used to evaluate f̂.
The values of y in the test sample will be treated as unknown during training.
Notation
N₁ is the set of indices for the training data.
|N₁| is the number of observations in the training data.
N₀ is the set of indices for the test data.
|N₀| is the number of observations in the test data.
Example
Suppose there are five observations, (y₁, x₁), (y₂, x₂), …, (y₅, x₅).
Suppose observations 1, 2 and 4 are used as training data, and observations 3 and 5 are used as test data.
Then N₁ = {1, 2, 4} and N₀ = {3, 5}.
And |N₁| = 3 and |N₀| = 2.
Only the data in N₁ is used to determine f̂.
Training v Test
Training error rate:

(1/|N₁|) Σ_{i∈N₁} I(y_i ≠ ŷ_i)

Test error rate:

(1/|N₀|) Σ_{i∈N₀} I(y_i ≠ ŷ_i)
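Using the notation above, both error rates are the same computation applied over different index sets; a minimal sketch with made-up labels and predictions:

```python
def error_rate(y, y_hat, idx):
    # Error rate over a given index set (training N1 or test N0).
    return sum(y[i] != y_hat[i] for i in idx) / len(idx)

# Hypothetical labels and predictions for five observations (0-based indices).
y     = [0, 1, 0, 1, 1]
y_hat = [0, 1, 1, 1, 0]
N1 = [0, 1, 3]   # training indices
N0 = [2, 4]      # test indices
print(error_rate(y, y_hat, N1))  # 0.0 (perfect in-sample)
print(error_rate(y, y_hat, N0))  # 1.0 (poor out-of-sample)
```

A perfect training error alongside a terrible test error is exactly the overfitting pattern discussed on the next slide.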
Overfitting
Some methods perform very well (even perfectly) on the training error rate. Usually these same methods will perform poorly on the test error rate. This phenomenon is called overfitting.
Generally, achieving a low test error rate (also called out-of-sample or generalisation error) is more important.
A simple example
Consider a test set of a single observation, N₀ = {j}.
The classifier is trained using all data apart from y_j.
This classifier is then used to predict the value of y_j.
The choice of j may seem arbitrary.
Extending the idea of test and training
The process can be repeated so that each observation is left out exactly once. Each time all remaining observations are used as the training set.
This process is called leave-one-out cross validation (LOOCV).
k-fold CV
A faster alternative to LOOCV is k-fold cross validation.
The data are randomly split into k partitions.
Each observation appears in exactly one partition, i.e. the partitions are non-overlapping.
Each partition is used as the test set exactly once.
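The random partitioning step can be sketched as follows; the helper name and the fixed seed are illustrative, not part of the unit materials:

```python
import random

def kfold_indices(n, k, seed=0):
    # Shuffle indices 0..n-1, then deal them into k non-overlapping folds.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(10, 5)
print(len(folds))  # 5 folds
# Every observation lands in exactly one fold.
print(sorted(i for fold in folds for i in fold))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```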
Confusion matrix
Consider predicting one of two categories (default or no default).
We have the observed data (the truth) and our predicted outcome. The set of all possible outcomes is in the confusion matrix.

Actual vs Predicted   Default (predicted)   No default (predicted)
Default (actual)      True Positive         False Negative
No default (actual)   False Positive        True Negative

The total error rate is the sum of misclassifications over the total number of classifications:

Error Rate = (False Positive + False Negative) / (True Positive + True Negative + False Positive + False Negative)
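A sketch of counting the four confusion-matrix cells and then the error rate from them; the labels are invented for illustration:

```python
def confusion_counts(actual, predicted, positive="default"):
    # Count the four cells of the 2x2 confusion matrix.
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn

actual    = ["default", "default", "no default", "no default"]
predicted = ["default", "no default", "default", "no default"]
tp, fp, fn, tn = confusion_counts(actual, predicted)
error_rate = (fp + fn) / (tp + tn + fp + fn)
print(tp, fp, fn, tn, error_rate)  # 1 1 1 1 0.5
```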
Sensitivity v Specificity
In a 2-class problem, think of one class as the presence of a condition and the other class as the absence of a condition.
In an auditing example the condition can be that the person is guilty.
Sensitivity refers to the true positive rate (also called recall): the proportion of guilty classified as guilty.
Specificity refers to the true negative rate (also called selectivity): the proportion of innocent classified as innocent.

True Positive Rate = True Positive / (True Positive + False Negative)
True Negative Rate = True Negative / (False Positive + True Negative)
False Positive Rate = False Positive / (False Positive + True Negative)
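The three rates above, computed from hypothetical audit counts (40 guilty people, 60 innocent people — numbers invented for illustration):

```python
# Hypothetical audit outcomes.
tp, fn = 30, 10   # guilty classified guilty / guilty missed
tn, fp = 50, 10   # innocent classified innocent / innocent flagged

sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (fp + tn)   # true negative rate
fpr         = fp / (fp + tn)   # false positive rate = 1 - specificity
print(sensitivity)  # 0.75
```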
Probabilistic Classification
In many cases an algorithm will predict a single "best" class: predict a customer will purchase Gucci.
In other instances an algorithm will provide probabilities:
The customer has a 40% chance of purchasing Gucci, a 35% chance of purchasing Givenchy and a 25% chance of purchasing YSL.
Probabilistic Classification
A probabilistic prediction can be converted to a point prediction: simply choose the class with the highest probability.
In the example on the previous slide the choice would be Gucci.
Two class case
In the two-class case, choosing the class with the highest probability is simple: assign to a class if the probability is greater than 0.5.
In some applications a different threshold may be used.
This is particularly the case if there are asymmetric costs involved with different types of misclassification.
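Converting probabilities to point predictions with an adjustable threshold can be sketched as follows; the function name and probabilities are hypothetical:

```python
def classify(probs, threshold=0.5):
    # Assign the positive class when the predicted probability exceeds the threshold.
    return [p > threshold for p in probs]

probs = [0.3, 0.4, 0.6, 0.7]
print(classify(probs))        # [False, False, True, True]
print(classify(probs, 0.65))  # [False, False, False, True]
```

Raising the threshold makes positive predictions rarer, which matters when the two kinds of mistakes have different costs.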
An example
Suppose you work for the tax office.
You need to decide who should be audited and who should not be audited.
When doing classification you can make two mistakes:
Audit an innocent person.
Fail to audit a guilty person.
Are these mistakes equally costly?
Tax example
Auditing an innocent person is costly since resources are used for no gain. Suppose it costs $100 to audit a person.
Failing to audit a guilty person is costly since there is a failure to recover tax revenue. Let $500 be recovered from the guilty.
In this example, it is more costly to fail to audit the guilty. However, the misclassification rate treats both errors the same.
Sensitivity v Specificity
Consider that we audit when the probability of being guilty is greater than 50%. Changing this threshold can change the sensitivity and specificity.
Reducing the threshold to 0 means everyone is audited. The sensitivity will be perfect but specificity will be zero.
Raising the threshold to 1 means no one is audited. The specificity will be perfect but sensitivity will be zero.
Example

Person   Pred. Pr. Guilty   Truth
A        0.3                Not Guilty
B        0.4                Guilty
C        0.6                Guilty
D        0.7                Guilty
Questions
For a threshold of 0.5:
What is your prediction for each individual?
What is the misclassification error?
What is the sensitivity?
What is the specificity?
What is the cost?
Answer

Person   Pred. Pr. Guilty   Prediction   Truth
A        0.3                Not Guilty   Not Guilty
B        0.4                Not Guilty   Guilty
C        0.6                Guilty       Guilty
D        0.7                Guilty       Guilty
Answers
Misclassification error is 0.25.
Sensitivity is 0.6667.
Specificity is 1.
Cost is $500.
Your turn
How do the answers change when the threshold is 0.2?
How do the answers change when the threshold is 0.65?
Answer (Threshold 0.2)

Person   Pred. Pr. Guilty   Prediction   Truth
A        0.3                Guilty       Not Guilty
B        0.4                Guilty       Guilty
C        0.6                Guilty       Guilty
D        0.7                Guilty       Guilty
Answers
Misclassification error is 0.25.
Sensitivity is 1.
Specificity is 0.
Cost is $100.
Answer (Threshold 0.65)

Person   Pred. Pr. Guilty   Prediction   Truth
A        0.3                Not Guilty   Not Guilty
B        0.4                Not Guilty   Guilty
C        0.6                Not Guilty   Guilty
D        0.7                Guilty       Guilty
Answers
Misclassification error is 0.5.
Sensitivity is 0.333.
Specificity is 1.
Cost is $1000.
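All three sets of answers can be checked with a short script. The $100 audit cost and $500 missed-revenue figures come from the tax example earlier; the variable and function names are illustrative:

```python
truth = {"A": 0, "B": 1, "C": 1, "D": 1}            # 1 = guilty
probs = {"A": 0.3, "B": 0.4, "C": 0.6, "D": 0.7}    # predicted Pr(guilty)

def evaluate(threshold):
    # Audit (predict guilty) when Pr(guilty) exceeds the threshold.
    pred = {p: int(probs[p] > threshold) for p in probs}
    tp = sum(truth[p] and pred[p] for p in truth)
    fn = sum(truth[p] and not pred[p] for p in truth)
    fp = sum(not truth[p] and pred[p] for p in truth)
    tn = sum(not truth[p] and not pred[p] for p in truth)
    error = (fp + fn) / len(truth)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    cost = 100 * fp + 500 * fn   # audit an innocent / miss a guilty
    return error, sensitivity, specificity, cost

print(evaluate(0.5))   # error 0.25, sensitivity 2/3, specificity 1.0, cost 500
print(evaluate(0.2))   # error 0.25, sensitivity 1.0, specificity 0.0, cost 100
print(evaluate(0.65))  # error 0.5, sensitivity 1/3, specificity 1.0, cost 1000
```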
Receiver operating characteristic curve
Vary the threshold and plot the TPR against the FPR.
When the AUC (area under the curve) is close to 1: close to perfect!
When the AUC (area under the curve) is 0.5: no separation between the groups.
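Tracing out the ROC curve by varying the threshold can be sketched as below, reusing the tax-example probabilities as hypothetical data; the AUC is approximated with the trapezoid rule (both helper names are invented):

```python
def roc_points(probs, truth, thresholds):
    # One (FPR, TPR) point per threshold.
    pts = []
    for t in thresholds:
        tp = sum(p > t and y for p, y in zip(probs, truth))
        fn = sum(p <= t and y for p, y in zip(probs, truth))
        fp = sum(p > t and not y for p, y in zip(probs, truth))
        tn = sum(p <= t and not y for p, y in zip(probs, truth))
        pts.append((fp / (fp + tn), tp / (tp + fn)))
    return pts

probs = [0.3, 0.4, 0.6, 0.7]
truth = [0, 1, 1, 1]          # 1 = guilty
pts = sorted(roc_points(probs, truth, [1.0, 0.65, 0.5, 0.35, 0.0]))
# Trapezoid-rule area under the curve.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
print(auc)  # 1.0 (every guilty person outranks the innocent one)
```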
Unsupervised learning
Cross validation
Split the data into two subsamples (randomly splitting).
Run your unsupervised method on each separately.
Compare the two cluster solutions for consistency:
number of clusters
cluster profiles
External validation
Examine differences on variables not included in the cluster analysis, but for which you have a theoretically relevant reason to expect variation across the clusters.
Additional issues with prediction
Predict v Explain
In this unit our emphasis will be on prediction. This is very different to explanation or causality.
Consider the example of predicting sales of a Toyota by looking at the number of internet searches for "Toyota".
If there is a large number of people searching for "Toyota", it is more likely that sales of Toyota in the following period will be higher.
Causality
This relationship is not easy to manipulate.
For instance, if Toyota instructs its employees to spend the afternoon searching the word “Toyota” on Google, sales will not go up.
In this case there is a common cause for browsing for cars and buying cars: namely, the intent to buy cars. Unlike intent to buy a car, browsing behaviour is observable and can be used for prediction.
Causal language vs correlation language
Causal language:
A causes B
A explains B
Correlation language:
A is associated with B
An increase in A is related to an increase in B
Ethics and analytics
Data driven decision making
Increasingly in society we see data used to drive decision making.
This is often to the strength and benefit of society:
the recent use of data-driven decisions to govern decisions surrounding lockdowns and safety precautions during Covid-19.
However, data-driven decisions can also have negative consequences, unintended at the initial outset:
the Robodebt saga.
We do not have time for a full exploration of data ethics, but we will touch on some ideas today.
For a checklist of ethics in data and more resources on each particular topic see
https://deon.drivendata.org/
Dataset bias
The data used to train the algorithm is not representative of the people it will be used to make decisions for, or contains undesirable instances of societal bias.
Accordingly, the algorithm might learn false or undesirable relationships.
If the test data is a random sample of the dataset used for training, this bias may not be apparent until rollout.
An example: A word network is trained using Google news archives. It learns relationships between words, like female and queen. As the dataset contains biases, the algorithm also learns gendered associations, like male and engineer.
https://arxiv.org/abs/1607.06520
Feedback loops
In deployed algorithms, the data is sometimes fed back in to continuously update predictions.
This can lead to feedback loops, where the decisions made from the algorithm impact the data, which then impacts the decisions made by the algorithm.
If the data is initially biased, this can exaggerate the bias considerably.
An example: We use an algorithm to predict where crimes are likely to occur, and allocate police officers accordingly. The areas predicted to have high crime record more crime because there are more police officers. This is fed into the algorithm, which results in even higher predictions of crime in those areas, which leads to a higher police presence, and so on.
https://arxiv.org/abs/1706.09847
Proxy discrimination
It is against the law to discriminate against certain classes of people (e.g., gender, race/ethnicity, religion), and so these variables are not always used in predictive analytics
However, proxy variables which are strongly related to these protected classes can have the same effect.
An example: Amazon explores (but doesn't use) an algorithm to assist with hiring decisions. As they are hiring in tech, there is already a substantial gender bias in the existing workforce. The algorithm rates candidates from all-women universities lower than those from other universities.
https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G
Fairness across groups
In this lecture we’ve discussed the error rate over the dataset (testing and/or training)
However, we might also look at the error rate for particular groups of individuals, particularly those in protected classes (gender, race/ethnicity, age etc)
The algorithm error should not be significantly higher for these classes when compared to the broader dataset.
An example: Recidivism (when criminals re-offend) prediction error is higher when the candidate is black when compared to white candidates.
https://www.liebertpub.com/doi/abs/10.1089/big.2016.0047?journalCode=big
Conclusion
For the remainder of the unit the focus is on different analytic techniques. The next three weeks will be unsupervised methods, followed by three weeks on supervised methods.
In a business (and any other) setting, be aware that:
Correlation does not imply causation.
Prediction should be thought about probabilistically.
Cost should be taken into account when classification is used in decision making.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.