计算机代考程序代写 data science decision tree algorithm Where are we now? – cscodehelp代写 – INTERVIEW&CODEHELP

Where are we now?
XML Template (2011) [10.8.2011–6:17pm] [1–18] K:/IVI/IVI 415994.3d (IVI) [PREPRINTER stage]
Kandel et al. 3
Figure 1. The iterative process of wrangling and analysis. One or more initial data sets may be used and new versions may come later. The wrangling and analysis phases overlap. While wrangling tools tend to be separated from the visual analysis tools, the ideal system would provide integrated tools (light yellow). The purple line illustrates a typical iterative process with multiple back and forth steps. Much wrangling may need to take place before the data can be loaded within visualization and analysis tools, which typically immediately reveals new problems with the data. Wrangling might take
2
place at all the stages of analysis as users sort out interesting insights from dirty data, or new data become available or

Classification
• Introduction to classification
• Comparing Classification and Regression
• Decision tree classification
• Will make connections to entropy
• k nearest neighbour classification
3

Classification vs Regression
4

Classification
• Classification
• Making predictions about the future, based on historical data
• Predictive modelling/classification
• The sexy part of data science ?!
• A foundation for machine intelligence, AI, machines replacing humans and taking our jobs ….
5

Classification – Example 1
Predicting disease from microarray data Training
Gene 1
Gene 2
Gene 3
…
Gene n
Develop cancer <1 year Person 1 2.3 1.1 0.3 ... 2.1 1 Person 2 3.2 0.2 1.2 ... 1.1 1 Person 3 1.9 3.8 2.7 ... 0.2 1 ... ... ... ... ... ... .. Person m 2.8 3.1 2.5 ... 3.4 0 Testing Gene 1 Gene 2 Gene 3 ... Gene n Develop cancer <1 year Person X 2.1 0.9 0.6 ... 1.9 ? 6 Classification – Example 2 Animal classification Test data https://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf 7 Classification- Example 3 Banking: classifying borrowers Test data Tid Home Owner Marital status Annual Income Defaulted Borrower 11 No Single 55K ? 8 Classification – Example 4 Detecting tax cheats Tid 1 2 3 4 5 6 7 8 9 10 10 Refund Marital Taxable Cheat No No No No Yes No No Yes No Yes Status Income Yes Single 125K No Married 100K No Single 70K Yes Married 120K No Divorced 95K No Married 60K Yes Divorced 220K No Single 85K No Married 75K No Single 90K Test data Tid Refund Marital Status Taxable Income Tax Cheat 11 Yes Married 125K ? 9 categorical categorical continuous class Classification: Definition • Given a collection of records (training set ) • Eachrecordcontainsasetofattributes,oneclasslabel. • Find a predictive model for each class label as a function of the values of all attributes, i.e. y = f(x1, x2, ..., xn) • y: discrete value, target variable • x1,... xn: attributes, predictors • f: is the predictive model (a tree, a rule, a mathematical formula) • Goal: previously unseen records should be assigned a class as accurately as possible • Atestsetisusedtodeterminetheaccuracyofthemodel,i.e.the full data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it 10 Classification framework 11 Regression: Recall Definition • Givenacollectionofrecords(trainingset) • Eachrecordcontainsasetofattributesandonetargetvariable • Learn predictive model from data, i.e. y = f(x1, x2, ..., xn) • y: target variable is continuous (key difference to classification) • x1,... xn: attributes, predictors 12 Regression example 1 Predicting ice-creams consumption from temperature: y = f(x) ? 13 Regression example 2 Predicting the activity level of a target gene Gene 1 Gene 2 Gene 3 ... Gene n Person 1 2.3 1.1 0.3 ... 2.1 Person 2 3.2 0.2 1.2 ... 1.1 Person 3 1.9 3.8 2.7 ... 0.2 ... ... ... ... ... ... Person m 2.8 3.1 2.5 ... 3.4 Gene n+1 3.2 1.1 0.2 ... 0.9 Gene n+1 ? Gene 1 Gene 2 Gene 3 ... Gene n Person m+1 2.1 0.9 0.6 ... 1.9 14 Quiz • Is it classification or regression? • Prediction of rain or no rain tomorrow based on today’s rain data • Prediction of the amount of rain tomorrow based on today’s rain data • Prediction of exam mark based on study hours • Prediction of exam pass or fail based on study hours • Prediction of salary based on degree results • Prediction of complications of surgery based on pre-op medical data 15 Decision Trees 16 Decision Tree example: Prediction of tax cheats Splitting Attributes Tid 1 2 3 4 5 6 7 8 9 10 10 Refund Marital Taxable Cheat No No No No Yes No No Yes No Yes Status Income Yes Single 125K No Married 100K No Single 70K Yes Married 120K No Divorced 95K No Married 60K Yes Divorced 220K No Single 85K No Married 75K No Single 90K Refund Yes No NO MarSt Single, Divorced Married NO TaxInc < 80K NO > 80K
YES
Training Data
Model: Decision Tree
17
categorical categorical
continuous class

Decision Tree example: Prediction of tax cheats 2
MarSt
NO
NO
Single, Divorced
Refund
TaxInc
< 80K NO Married Tid 1 2 3 4 5 6 7 8 9 10 10 Refund Marital Taxable Cheat No No No No Yes No No Yes No Yes Status Income Yes Single 125K No Married 100K No Single 70K Yes Married 120K No Divorced 95K No Married 60K Yes Divorced 220K No Single 85K No Married 75K No Single 90K Yes No > 80K
YES
Training Data
There can be more than one tree that fits the same data!
Model: Decision Tree
18
categorical categorical
continuous class

Decision Tree Classification Task
Attrib1
Yes No No Yes No No Yes No No No
Attrib2
Attrib3
125K 100K 70K 120K 95K 60K 220K 85K 75K 90K
Class
No No No No Yes No No Yes No Yes
Large
Medium
Small
Medium
Large
Medium
Large
Small
Medium
Small
Tid
1 2 3 4 5 6 7 8 9 10
Tree Induction algorithm
Induction
Learn Model
10
Model
Decision Tree
Training Set
Apply Model
Attrib2
Small
Medium
Large
Tid Attrib1 Attrib3
11 No 55K ?
12 Yes 80K ?
13 Yes 110K ?
14 No Small 95K ?
15 No Large 67K ?
Test Set
Class
Deduction
10
19

Apply DT Model to Test Data (1)
Start from the root of the tree
Refund
Yes No
Test Data
Refund
Marital
Taxable
Status
Income
Cheat
No
10
Married
80K
?
NO
MarSt
Single, Divorced
TaxInc
Married
NO
< 80K NO > 80K
YES
20

Apply DT Model to Test Data (2)
Test Data
Refund
Marital
Taxable
Status
Income
Cheat
No
10
Married
80K
?
Refund
Yes No
NO
MarSt
Single, Divorced
TaxInc
Married
NO
< 80K NO > 80K
YES
21

Apply DT Model to Test Data (3)
Test Data
Refund
Marital
Taxable
Status
Income
Cheat
No
10
Married
80K
?
Refund
Yes No
NO
MarSt
Single, Divorced
TaxInc
Married
NO
< 80K NO > 80K
YES
22

Apply DT Model to Test Data (4)
Test Data
Refund
Marital
Taxable
Status
Income
Cheat
No
10
Married
80K
?
Refund
Yes No
NO
MarSt
Single, Divorced
TaxInc
Married
NO
< 80K NO > 80K
YES
23

Apply DT Model to Test Data (5)
Test Data
Refund
Marital
Taxable
Status
Income
Cheat
No
10
Married
80K
?
Refund
Yes No
NO
MarSt
Single, Divorced
TaxInc
Married
NO
< 80K NO > 80K
YES
24

Apply DT Model to Test Data (6)
Test Data
Refund
Marital
Taxable
Status
Income
Cheat
No
10
Married
80K
?
Refund
Yes No
NO
MarSt
Single, Divorced
Married
NO
Assign Cheat to “No”
TaxInc
< 80K NO > 80K
YES
25

Apply DT Model to Test Data (7)
Start from the root of tree.
Test Data
Refund
Marital
Taxable
Cheat
Status
Income
10
No
Single
100K
?
Refund
Yes No
NO
MarSt
Single, Divorced
TaxInc
Married
NO
< 80K NO > 80K
YES
26

Decision Trees
Decision tree
• A flow-chart-like tree structure
• Internal node denotes a test on an attribute • Branch represents an outcome of the test
• Leaf nodes represent class labels
27

Decision Tree Algorithm
Tid
1 2 3 4 5 6 7 8 9 10
Tree Induction algorithm
Induction
Attrib1
Yes No No Yes No No Yes No No No
Attrib2
Attrib3
125K 100K 70K 120K 95K 60K 220K 85K 75K 90K
Class
No No No No Yes No No Yes No Yes
Large
Medium
Small
Medium
Large
Medium
Large
Small
Medium
Small
Learn Model
10
Model
Decision Tree
Training Set
Apply Model
Attrib2
Small
Medium
Large
Tid Attrib1 Attrib3
11 No 55K ?
12 Yes 80K ?
13 Yes 110K ?
14 No Small 95K ?
15 No Large 67K ?
Test Set
Class
Deduction
10
28

Decision Tree Algorithm (2)
• Many Algorithms:
• We will look at a representative one
(Hunt’s algorithm)
• How would you code an algorithm that automatically constructs a decision tree for the tax cheat data?
Tid
1 2 3 4 5 6 7 8 9 10
Refund
Yes No No Yes No No Yes No No No
Marital Status
Taxable Income
Cheat
No No No No Yes No No Yes No Yes
Single
125K
Married
100K
Single
70K
Married
120K
Divorced
95K
Married
60K
Divorced
220K
Single
85K
Married
75K
Single
90K
10
29

General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach node t • General Procedure:
• If Dt contains only records that belong to the same class yt, then t is a leaf node labeled as yt
• If Dt is an empty set, then t is a leaf node labeled by the default class yd
• If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
• Recursively apply the procedure to each subset
Tid
1 2 3 4 5 6 7 8 9 10
Refund
Yes No No Yes No No Yes No No No
Marital Status
Taxable Income
Cheat
No No No No Yes No No Yes No Yes
Single
125K
Married
100K
Single
70K
Married
120K
Divorced
95K
Married
60K
Divorced
220K
Single
85K
Married
75K
Single
90K
10
Dt
Refund Yes
No
30

Refund
Yes No No Yes No No Yes No No No
Marital Status
Taxable Income
Cheat
No No No No Yes No No Yes No Yes
Single
125K
Married
100K
Single
70K
Married
120K
Divorced
95K
Married
60K
Divorced
220K
Single
85K
Married
75K
Single
90K
Tid
Hunt’s Algorithm Example 1 2
If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
Refund Yes
3
4
5
6
7
8
9
10
10 D
t
No
31

Stopping condition: leaf node
• If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt • If Dt is an empty set, then t is a leaf node labeled as the default class yd
Refund Yes
No
32

Decision Tree splitting example
Splitting Attributes
Refund
Yes No
NO MarSt Single, Divorced
Married
NO
TaxInc
< 80K NO > 80K
YES
33

Tree induction (creation)
34

Tree Induction
• Determine how to split the records
• How to specify the attribute test condition? • How to determine the best split?
• Determine when to stop splitting
• When a node only has instances of a single class
35

How to Specify Test Conditions?
• Depends on attribute types • Nominal
• Ordinal
• Continuous
• Depends on number of ways to split • 2-way split
• Multi-way split
36

Splitting Nominal Attributes
• Multi-waysplit:Useasmanypartitionsasdistinctvalues
CarType
Sports
Family
Luxury
• Binary split: Divide values into two subsets, find an optimal partitioning
{Sports, CarType Luxury}
{Family}
OR {Family, CarType Luxury}
{Sports}
37

Splitting Ordinal Attributes
• Multi-way split: Use as many partitions as distinct values
Size
Medium
{Small, Medium}
Size
{Medium, Size OR Large}
Small
Large
• Binary split: Divides values into two subsets, find optimal partitioning
{Large}
{Small}
• What about this split?
{Small, Large}
Size
{Medium}
38

Splitting Continuous Attributes
• Different ways of handling
• Discretisation to form an ordinal categorical attribute
• Static – discretise once at the beginning
• Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles) or clustering
• Binary Decision: (A < v) or (A 3 v) • consider all possible splits and find the best cut • can be more computing intensive 39 Splitting Continuous Attributes T axable Income > 80K?
Yes No
(i) Binary split
T axable Income?
Tid Refund
1 Yes
2 No
3 No
4 Yes
5 No
6 No
7 Yes
8 No
9 No
10 No
Marital Taxable
< 10K > 80K
Single 125K Married 100K Single 70K Married 120K Divorced 95K Married 60K Divorced 220K Single 85K Married 75K Single 90K
No No No No Yes No No Yes No Yes
[10K,25K) [25K,50K) [50K,80K) (ii) Multi-way split
10
Status
Income Cheat
40

How to determine the
Before Splitting: 10 records of class 0, 10 records of class 1
Own Car?
Car Type?
Family Sports
Student ID?
c1 c
C0:1 … C0:1 C0:0 … C0:0
Yes
No
Luxury
c c20 11
Which test condition is better?
10
C1: 0 C1: 0 C1: 1
C1: 1
C0: 6 C1: 4
C0: 4 C1: 6
C0: 1 C1: 3
C0: 8 C1: 0
C0: 1 C1: 7
42

How to determine the best split
• Greedy approach: Nodes with homogeneous class distribution are better • Need a measure of node impurity:
C0: 5 C1: 5
Non-homogeneous, Homogeneous,
High degree of impurity Low degree of impurity
C0: 9 C1: 1
43

Measures of Node Impurity
• Entropy
• We have seen entropy in the feature correlation section, where it was used to
measure the amount of uncertainty in an outcome
• Entropy can also be viewed as an impurity measure
• The set {A,B,C,A,A,A,A,A} has low entropy: low uncertainty and low impurity • The set {A,B,C,D,B,E,A,F} has high entropy: high uncertainty and high impurity
44

Node Impurity Criteria based on entropy
Entropy (H) at a given node t:
H(t)=−∑p(j|t)logp(j|t) j
(NOTE: p( j | t) is the relative frequency of class j at node t) Measures homogeneity of a node:
• Maximum (log nc) when records are equally distributed amongst all classes (nc is number of classes)
• Minimum (0.0) when all records belong to one class
45

Examples of computing Entropy
P(C1)=0/6=0 P(C2)=6/6=1 Entropy=–0log2 0–1log2 1=–0–0=0
P(C1) = 1/6 P(C2) = 5/6
Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65
P(C1) = 2/6 P(C2) = 4/6
Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
H(t)=−∑p(j|t)log2 p(j|t) j
C1
0
C2
6
C1
1
C2
5
C1
2
C2
4
46

Question: What is the entropy?
P(C1) = P(C2) = Entropy =
H(t)=−∑p(j|t)log2 p(j|t) j
C1
13
C2
20
47

Question: What is the entropy?
P(C1) = 13 P(C2) = 20
H(t)=−∑p(j|t)log2 p(j|t) j
C1
13
C2
20
33
33
Entropy = – ((13/33)log(13/33) + (20/33)log(20/33)) = 0.97
48

How good is a Split?
Compare the impurity (entropy) of parent node (before splitting) with the impurity (entropy) of the children nodes (after splitting)
Gain = H(Parent) HX(Parent|Child) k N(vj)
= H(Parent)j=1 N H(vj)
H(vj): impurity measure of node vj
j: children node index
N(vj): number of data points in child node vj N: number of data points in parent node The larger the gain, the better
49

How good is a Split?
Gain = H(Parent) HX(Parent|Child) k N(vj)
= H(Parent)j=1 N H(vj)
Note: the information gain is equivalent to the mutual information
between the class feature and the feature being split on
Thus splitting using the information gain is to choose the feature with highest information shared with the class variable
50

How good is a Split?
Own Car?
Car Type?
Student ID?
c1 c
C0:1 … C0:1 C0:0 … C0:0
Yes
No Family Sports
Luxury
c c20 11
10
C1: 0 C1: 0 C1: 1
C1: 1
C0: 6 C1: 4
C0: 4 C1: 6
C0: 1 C1: 3
C0: 8 C1: 0
C0: 1 C1: 7
51

How to determine the ?
Before Splitting: 10 records of class 0, 10 records of class 1
C0:10 C1:10
C0:10 C1:10
Own Car?
Car Type?
Family Sports
Student ID?
Yes
No
c c20 11
Luxury
c1 c
C0:1 … C0:1 C0:0 … C0:0
C1: 0 C1: 0 C1: 1 C1: 1
C0: 6 C1: 4
C0: 4 C1: 6
C0: 1 C1: 3
C0: 8 C1: 0
Which test condition is better?
– Compute the gain of both splits
– Choose the one with the larger gain
52
C0: 1 C1: 7
10

How to determine the ?
Before Splitting: 10 records of class 0, 10 records of class 1
Own Car?
Car Type?
No Family Sports
Student ID?
Yes
c c20 11
Luxury
c1 c
C0:1 … C0:1 C0:0 … C0:0
C1: 0 C1: 0 C1: 1 C1: 1
10
C0: 6 C1: 4
C0: 4 C1: 6
C0: 1 C1: 3
Own Car: Information gain=0.029
Car type: Information gain=0.62
We should choose Car type as the better split
53
C0: 8 C1: 0
C0: 1 C1: 7

Creating a decision tree
• Calculate information gain [Left Child],[Right Child] for all:
• Refund [Yes], [No]
• Marital status [Single],[Married],[Divorced]
• Taxable income
• [60,60], (60,220]
• [60,70], (70,220]
• [60,75],(75,220]
• [60,85],(85,220]
• [60,90],(90,220]
• [60,95],(95,220]
• [60,100],(100,220]
• [60,120],(120,220]
• [60,125],(125,220]
• Choose feature + split with the highest information gain and use this as the root node and its split
• Do recursively, terminating when a node consists of only Cheat=No or Cheat=Yes.
Tid
1 2 3 4 5 6 7 8 9 10
Refund
Yes No No Yes No No Yes No No No
Marital Status
Single Married Single Married Divorced Married Divorced Single Married Single
Taxable Income
125K 100K 70K 120K 95K 60K 220K 85K 75K 90K
Cheat
No No No No Yes No No Yes No Yes
10
54

Question 2c) from 2016 exam
Given a dataset with two classes, A and B, suppose the root node of a decision tree has 50 instances of class A and 150 instances of class B. Consider a candidate split of this root node into two children, the first with (25 class A and 25 class B), the second with (25 class A and 125 class B). Write a formula to measure the utility of this split using the entropy criterion. Explain how this formula helps measure split utility.
A:50 B:150
A:25 B:25
A:25 B:125
57

MI trees: advantages and disadvantages
• Advantages
• Easy to interpret
• Relatively efficient to construct
• Fast for deciding about a test instance
• Disadvantages
• A simple greedy construction strategy, producing a set of If ..then rules.
Sometimes this is too simple for data with complex structure: • Everything should be as simple as possible, but not simpler
• For complex datasets, the tree might grow very big and difficult to understand 62

K-nearest neighbour classifier
63

Nearest Neighbour Classifier
Basic idea: “If it walks like a duck and quacks like a duck, then it’s probably a duck”
Training Records
Choose k of the “nearest” records
Compute Distance
Test Record
64

Nearest-Neighbour Classifier
Unknown record
! Requires three things
– The set of stored records
– Distance Metric to compute distance between records
– The value of k, the number of nearest neighbours to retrieve
! To classify an unknown record:
1. Compute distance to other
training records
2. Identify k nearest neighbors
3. Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking a majority vote)
65

Definition of Nearest Neighbour
X
X
X
(a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor
K-nearest neighbours of a record x are data points that have the k smallest distance to x
66

Distance measure
• Compute distance between two points p=(p1,p2,…), • Euclidean distance
d(p,q)= å(p-q)2
q=(q1,q2,…)
i
ii
• Can also use Pearson coefficient (similarity measure) • Determine the class from nearest neighbour list
• take the majority vote of class labels among the k-nearest neighbours
• weigh the vote according to distance, e.g. w = % &!
67

1 nearest-neighbour
Voronoi Diagram defines the classification boundary
The green area takes the class of the green point
68

K-Nearest Neighbour classifier
Choosing a value for k:
• If k is too small, its sensitive to noise points
• If k is too large, the neighbourhood may include points from other classes
X
http://2.bp.blogspot.com/-EK4tA2525EM/U-c6Q4jJuwI/AAAAAAAADjw/hdqRXuunpnQ/s1600/knn.png
69

Points to remember
• Understand what is the difference between classification and regression and why it is useful to build models for these tasks
• Understand how a decision tree may be used to make predictions about the class of a test instance
• Understand the key steps in building a decision tree: • how to split the instances
• how to specify the attribute test condition • how to determine the best split
• how to decide when to stop splitting
70

Points to remember
• Understand the operation and rationale of the k nearest neighbour algorithm for classification
• Understand the advantages and disadvantages of using the k nearest neighbour or decision tree for classification
71

计算机代考程序代写 data science decision tree algorithm Where are we now? – cscodehelp代写

Published by admin

Leave a Reply Cancel reply