Assignment 1: K-Nearest Neighbors Classification (15 marks)
Student Name:
Student ID:
General info
Due date: Friday, 19 March 2021 5pm
Submission method: Canvas submission
Submission materials: completed copy of this iPython notebook
Late submissions: -10% per day (both week and weekend days counted)
Marks: 15% of the overall mark for the class.
Materials: See the Using Jupyter Notebook and Python page on Canvas (under Modules > Coding Resources) for information on the basic setup required for this class, including an iPython notebook viewer and the Python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. In particular, if you are not using a lab computer that already has this software installed, we recommend installing all the data for NLTK, since you will need various parts of it to complete this assignment. Deep learning libraries such as keras and pytorch are also allowed. You may also use any Python built-in packages, but do not use any other 3rd-party packages (the packages listed above are all fine to use); if your iPython notebook doesn't run on the marker's machine, you will lose marks. You should use Python 3.
Evaluation: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should edit the sections below where requested, but leave the rest of the code as is. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parentheses after the instructions.
You will be marked not only on the correctness of your methods, but also on the quality and efficiency of your code: in particular, you should be careful to use Python built-in functions and operators where appropriate and to pick descriptive variable names that adhere to Python style requirements. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it. We reserve the right to deduct up to 2 marks for unreadable or excessively inefficient code.
Updates: Any major changes to the assignment will be announced via Canvas. Minor changes and clarifications will be announced on the discussion board (Piazza -> Assignment_1); we recommend you check it regularly.
Academic misconduct: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourage you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. Please check the CIS Academic Honesty training for more information. We will be checking submissions for originality and will invoke the University's Academic Misconduct policy where inappropriate levels of collusion or plagiarism are deemed to have taken place.
Overview
In this assignment, you will implement the K-nearest neighbors (KNN) classification algorithm and apply it to a real-world machine learning data set. In particular, we will classify zoo animals into seven animal categories.
First, you will read the dataset into a train and a test set (Q1). Second, you will implement different distance functions (Q2). Third, you will implement a KNN classifier (Q3). You will then apply and evaluate your classifier on the data set, exploring different parameters (Q4, Q5). Finally, you will critically discuss your results (Q6).
Question 1: Loading the data (2.0 marks)
Instructions: For this assignment we will develop a K-Nearest Neighbors (KNN) classifier to classify animals in the zoo into pre-defined categories of animals, namely
1: mammal
2: bird
3: reptile
4: fish
5: amphibian
6: insect
7: invertebrate
We use a data set of zoo animals from the UCI Machine Learning Repository.
The original data can be found here: https://archive.ics.uci.edu/ml/datasets/Zoo.
The dataset consists of 101 instances. Each instance corresponds to an animal, which has a unique identifier (the name of the animal; first field) and is characterized by 16 features:
1. hair Boolean
2. feathers Boolean
3. eggs Boolean
4. milk Boolean
5. airborne Boolean
6. aquatic Boolean
7. predator Boolean
8. toothed Boolean
9. backbone Boolean
10. breathes Boolean
11. venomous Boolean
12. fins Boolean
13. legs Numeric (set of values: {0,2,4,5,6,8})
14. tail Boolean
15. domestic Boolean
16. catsize Boolean
You first need to obtain this dataset, which is on Canvas (Assignment 1). The files we will be using are called zoo.features and zoo.labels. Make sure the files are saved in the same folder as this notebook.
Both files are in comma-separated value format.
zoo.features contains 101 instances, one line per instance. The first field is the unique instance identifier (the name of the animal). The following fields contain the 16 features, as described above.
zoo.labels contains the gold labels (i.e., one of the seven classes above), one instance per line. Again, the first field is the instance identifier, and the second field is the instance label.
Task: Read the two files and
1. create a train_features set (a list of feature lists for the first 90 instances in the zoo.* files) and a train_labels set (a list of labels for the corresponding instances);
2. create a test_features set (a list of feature lists for the remaining instances in the zoo.* files) and a test_labels set (a list of labels for the corresponding instances).
Check: Use the assertion statements in “For your testing” to validate your code.
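If you are unsure where to begin, below is a minimal, optional sketch of one way to parse a single line. It assumes each line looks like aardvark,1,0,0,1,... as described above; the helper name parse_line is purely illustrative, not something the assignment requires.
def parse_line(line):
    # Split one comma-separated line into the identifier and its values.
    fields = line.strip().split(",")
    name = fields[0]                       # instance identifier (animal name)
    values = [int(v) for v in fields[1:]]  # all remaining fields are integers
    return name, values
The train/test split is then positional: the first 90 parsed instances go into the training lists and the rest into the test lists (e.g. via slicing, features[:90] vs. features[90:]).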
In [1]:
data = open("zoo.features", 'r').readlines()
gold = open("zoo.labels", 'r').readlines()
train_features = []
train_labels = []
test_features = []
test_labels = []
###########################
## YOUR CODE BEGINS HERE
###########################

###########################
## YOUR CODE ENDS HERE
###########################
For your testing:
In [ ]:
assert len(train_features) == len(train_labels)
assert len(train_features[0])==len(train_features[-1])
assert train_features[2][12]==0 and train_labels[2]==4
Question 2: Distance Functions (2.0 marks)
Instructions: Implement the four distance functions specified below. Use only the library imported below, i.e., do not use implementations from any other Python library.
1. Euclidean distance
2. Cosine distance
3. Hamming distance
4. Jaccard distance
Each distance function takes as input
• Two feature vectors
and returns as output
• The distance between the two feature vectors (float)
Note: for the purpose of this assignment, we consider the numeric feature (legs) to be discretized already, with each individual value constituting a separate class.
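For orientation only, the sketch below shows one common formulation of each distance for equal-length vectors. It is not the required implementation: the *_sketch names are deliberately different from the function names required in the cell below, the cosine variant assumes neither vector is all-zero, and whether Hamming distance is the raw mismatch count or the normalised fraction (and how Jaccard should treat the discretized legs feature) is left for you to decide.
import math

def euclidean_sketch(fv1, fv2):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fv1, fv2)))

def cosine_sketch(fv1, fv2):
    # One minus the cosine of the angle between the vectors
    # (assumes neither vector is all-zero).
    dot = sum(a * b for a, b in zip(fv1, fv2))
    norm1 = math.sqrt(sum(a * a for a in fv1))
    norm2 = math.sqrt(sum(b * b for b in fv2))
    return 1 - dot / (norm1 * norm2)

def hamming_sketch(fv1, fv2):
    # Number of positions at which the two vectors differ;
    # dividing by len(fv1) is an equally common convention.
    return sum(a != b for a, b in zip(fv1, fv2))

def jaccard_sketch(fv1, fv2):
    # One minus |intersection| / |union| over the non-zero positions
    # (shown here for binary vectors).
    intersection = sum(1 for a, b in zip(fv1, fv2) if a and b)
    union = sum(1 for a, b in zip(fv1, fv2) if a or b)
    return 1 - intersection / union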
In [ ]:
import math
#########################
# Your answer BEGINS HERE
#########################
def eucledian_distance(fv1, fv2):
    # insert code here
    pass

def cosine_distance(fv1, fv2):
    # insert code here
    pass

def hamming_distance(fv1, fv2):
    # insert code here
    pass

def jaccard_distance(fv1, fv2):
    # insert code here
    pass
#########################
# Your answer ENDS HERE
#########################
For your testing:
In [ ]:
assert round(eucledian_distance([1,0],[0.5,1]),2)==1.12
assert jaccard_distance([1,1,1,1], [0,1,0,0])==0.75
Question 3: KNN Classifier (4.0 marks)
Instructions: Here, you implement your KNN classifier. It takes as input
• training data features
• training data labels
• test data features
• parameter K
• distance function(s) based on which nearest neighbors will be identified
It returns as output
• the predicted labels for the test data
Voting strategy: Your classifier will use majority voting (i.e., no distance weighting).
You should implement the classifier from scratch yourself, i.e., you must not use an existing implementation in any Python library.
Ties. Ties may occur when computing the K nearest neighbors, or when predicting the class based on the neighborhood. You may deal with ties whichever way you want (as long as you still use the requested distance functions).
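As an illustration of the expected control flow (not the required design), here is a sketch of a majority-vote KNN. It assumes ties are broken by whichever label Counter.most_common happens to return first, which is one acceptable choice; collections is a Python built-in package, so it is allowed under the rules above.
from collections import Counter

def knn_sketch(train_feats, train_lbls, test_feats, k, dist_fun):
    predictions = []
    for test_fv in test_feats:
        # Distance from this test instance to every training instance.
        dists = [(dist_fun(test_fv, train_fv), lbl)
                 for train_fv, lbl in zip(train_feats, train_lbls)]
        dists.sort(key=lambda pair: pair[0])            # nearest first
        neighbour_labels = [lbl for _, lbl in dists[:k]]
        # Majority vote over the K nearest neighbours.
        predictions.append(Counter(neighbour_labels).most_common(1)[0][0])
    return predictions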
In [ ]:
def KNN(train_features, train_labels, test_features, k, dist_fun):
    predictions = []
    ###########################
    ## YOUR CODE BEGINS HERE
    ###########################

    ###########################
    ## YOUR CODE ENDS HERE
    ###########################
    return predictions
For your testing:
In [ ]:
assert KNN([[1,1],[5,5],[1,2]], [1,0,1], [[1,1]], 1, eucledian_distance) == [1]
Question 4: Evaluation (1.0 marks)
Instructions: Write a function that computes the "accuracy" of your classifier. Given a set of predicted labels and a set of gold labels, it returns the fraction of correct labels over all predicted labels.
Example: The gold truth labels for four test instances are [1, 1, 1, 1]. A system predicted the labels [0, 1, 0, 0] for the same 4 instances. The accuracy of the system is 1/4 = 0.25.
Your function will take as input
• gold labels
• predicted labels
It returns as output
• the accuracy value (float).
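Since accuracy is simply the proportion of positions at which the two label lists agree, the body of this function can be very short; a minimal sketch, assuming both lists have the same length:
def accuracy_sketch(predict, gold):
    # Fraction of positions where predicted and gold labels match.
    return sum(p == g for p, g in zip(predict, gold)) / len(gold)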
In [ ]:
def accuracy(predict, gold):
    ###########################
    ## YOUR CODE BEGINS HERE
    ###########################

    ###########################
    ## YOUR CODE ENDS HERE
    ###########################
    return acc  # compute acc above (avoid shadowing the function name accuracy)
For your testing:
In [ ]:
assert accuracy([1, 1, 1, 1], [0, 1, 0, 1])==0.5
Question 5: Applying your KNN classifiers to the Zoo Dataset (3.0 marks)
Using the functions you have implemented above, please
1. For each of the distance functions you implemented in Question 2, construct four KNN classifiers with K=1, K=5, K=25, K=55. You will create a total of 16 (4 distance functions x 4 K values) classifiers.
2. Compute the test accuracy for each model
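Rather than writing 16 near-identical calls, a pair of nested loops is one clean way to structure this cell. The sketch below assumes the functions and variables defined in Questions 1-4; the print statements further below expect individually named variables (e.g. accuracy_knn_euc_1), which you can assign from a structure such as the scores dictionary used here.
scores = {}
for dist_fun in (eucledian_distance, cosine_distance,
                 jaccard_distance, hamming_distance):
    for k in (1, 5, 25, 55):
        preds = KNN(train_features, train_labels, test_features, k, dist_fun)
        scores[(dist_fun.__name__, k)] = accuracy(preds, test_labels)
# e.g. accuracy_knn_euc_1 = scores[("eucledian_distance", 1)]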
In [ ]:
########################
# Your code STARTS HERE
########################
# 1. Predict test labels with each KNN classifier
# 2. Compute the accuracy scores
########################
# Your code ENDS HERE
########################
print("Euclidean")
print(accuracy_knn_euc_1)
print(accuracy_knn_euc_5)
print(accuracy_knn_euc_25)
print(accuracy_knn_euc_55)
print("Cosine")
print(accuracy_knn_cos_1)
print(accuracy_knn_cos_5)
print(accuracy_knn_cos_25)
print(accuracy_knn_cos_55)
print("Jaccard")
print(accuracy_knn_jac_1)
print(accuracy_knn_jac_5)
print(accuracy_knn_jac_25)
print(accuracy_knn_jac_55)
print("Hamming")
print(accuracy_knn_ham_1)
print(accuracy_knn_ham_5)
print(accuracy_knn_ham_25)
print(accuracy_knn_ham_55)
Question 6: Discussion (3.0 marks)
1. (a) Which parameter K resulted in the best performance? (b) Why? (c) What could be done to improve those classifiers that are currently performing poorly?
2. The results of KNN with Euclidean distance and KNN with Cosine distance are remarkably similar. Why is that so? Refer to the definitions of the distance functions in your answer.
Each question should be answered in no more than 2-3 sentences.
1(a).
1(b).
1(c).
2.
In [ ]: