PracticalQuiz2TobeCompleted
COM6012 – 2018: Practical Quiz 2
In this exercise, we are interested in using regression trees to predict the quality of wine in the white wine dataset. The input variables are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. The output variable is the quality of the wine, a score between 0 and 10 given by expert wine tasters.
You need to complete the code below.
Deadline
09:50 14 March, 2018
How and what to submit
Submit a .txt or .pdf file containing your code, with comments. Note that you do not have to run your code on the HPC as a standalone program.
Upload your file to MOLE before the date and time specified above. Name your file as NAME_REGCOD.xxx, where NAME is your full name, and REGCOD is your registration code.
Assessment criteria
The marks awarded for each part of the code are specified below. You are free to write the code from scratch. In that case, marks will be awarded for the steps in your code that correspond to the steps you are asked to complete below.
The assessment criteria include:
Being able to create and use dataframes with the Spark ML library.
Being able to perform cross-validation with Spark in terms of the different parameters of a machine learning model.
Format
You can re-use any of your previous code. For example, the answer to the quiz may have a similar form to a previous exercise in a lab session or an assignment.
On plagiarism see this link
Predicting wine quality using regression trees
Some pieces of the code below are missing. Please complete them.
We start by loading the usual Apache Spark libraries and some other utilities that we may find useful:
In [1]:
// The usual imports
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.DecisionTreeRegressionModel
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.RegressionEvaluator
We now open a SparkSession
In [ ]:
// "Open the bridge"
val sparkSession = SparkSession.builder.
  master("local[1]").
  appName("Decision Tree Regression").
  getOrCreate()
In this solution, we decide to load the data as an RDD and then get rid of the header
In [ ]:
// Import the data as text and remove the header
// val data_string = sparkSession.sparkContext.textFile("files/winequality-white.csv")
val data_string = sparkSession.sparkContext.textFile("winequality-white.csv")
val header = data_string.first
val only_rows = data_string.filter(line => line != header)
We now convert the string data to Double, bearing in mind that the fields are separated by “;” (3 marks)
In [ ]:
//***************
// INSERT YOUR CODE HERE. It should not take more than one line.
//***************
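As a hedged illustration of the conversion step, the sketch below applies a split-and-convert `map` to an ordinary Scala `Seq` standing in for the text RDD. The names `sampleLines`, `sampleHeader`, `sampleOnlyRows`, and `parsedRows`, the shortened header, and the row values are all assumptions made up for the example; an RDD of strings accepts a `map` of the same shape.

```scala
// Illustrative stand-in for the text RDD: a header line plus two data rows.
// The header and values are made up, not taken from winequality-white.csv.
val sampleLines: Seq[String] = Seq(
  "f1;f2;f3;quality",
  "7.0;0.27;0.36;6",
  "6.3;0.30;0.34;5"
)

// Drop the header in the same way as the notebook cell above.
val sampleHeader = sampleLines.head
val sampleOnlyRows = sampleLines.filter(line => line != sampleHeader)

// Split each row on ";" and convert every field to Double.
val parsedRows: Seq[Array[Double]] =
  sampleOnlyRows.map(_.split(";").map(_.toDouble))
```

Each element of `parsedRows` is an `Array[Double]` with one entry per field, in the same order as the columns of the file.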
Organise the data into a column of “features” and a column of “labels” (2 marks)
In [ ]:
//***************
// INSERT YOUR CODE HERE. It should not take more than one line.
//***************
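One possible way to separate labels from features is sketched below on plain Scala collections. It assumes that the quality score is the last field of each parsed row; `sampleRows` and `labelled` are hypothetical names and the values are made up. The pair order (label first, features second) matches the column names passed to `toDF` in the next cell.

```scala
// Hypothetical parsed rows: the last entry plays the role of the quality score.
val sampleRows: Seq[Array[Double]] = Seq(
  Array(7.0, 0.27, 0.36, 6.0),
  Array(6.3, 0.30, 0.34, 5.0)
)

// Pair each label (last field) with its features (all remaining fields).
// With Spark ML the features would additionally be wrapped as a vector,
// for example Vectors.dense(row.init), using the import from the first cell.
val labelled: Seq[(Double, Array[Double])] =
  sampleRows.map(row => (row.last, row.init))
```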
Convert the RDD into a DataFrame for handling it using Spark ML
In [ ]:
// Convert to a dataframe
val data = sparkSession.createDataFrame(data_rdd).toDF("label", "features")
Split the data into training and test sets (30% held out for testing). Use your registration number as the seed.
In [ ]:
// Split the data into training and test sets (30% held out for testing). Use your registration number as the seed.
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = ????)
Create the decision tree for regression and specify which column holds the features and which one holds the labels (3 marks)
In [ ]:
//***************
// INSERT YOUR CODE HERE
//***************
Create the pipeline. We want to be able to tune the number of bins and the depth (number of levels) of the tree (3 marks)
In [ ]:
//***************
// INSERT YOUR CODE HERE
//***************
Create a parameter grid for the CrossValidator() that contains at least three candidate values for the number of bins and at least three candidate values for the depth of the tree (3 marks)
In [ ]:
//***************
// INSERT YOUR CODE HERE
//***************
Here we create an evaluator for the regression tree
In [ ]:
// Set up the evaluator
val metric = "rmse"
val evaluator = new RegressionEvaluator().
  setLabelCol("label").
  setPredictionCol("prediction").
  setMetricName(metric)
Create a CrossValidator() that uses your pipeline, your grid, and your evaluator. Set the number of folds to five (4 marks)
In [ ]:
//***************
// INSERT YOUR CODE HERE
//***************
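As a hedged illustration of how the estimator, pipeline, parameter grid, and cross-validator fit together, the sketch below wires them up with Spark ML. It assumes the Spark ML library is on the classpath. The candidate values for `maxBins` and `maxDepth` and the names `dt`, `pipeline`, and `paramGrid` are assumptions chosen for the example, and the evaluator is repeated so the block stands on its own. This is one possible shape, not the official solution.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Regression tree that reads the "features" column and predicts "label".
val dt = new DecisionTreeRegressor().
  setLabelCol("label").
  setFeaturesCol("features")

// A one-stage pipeline; further stages could be placed before the tree.
val pipeline = new Pipeline().setStages(Array(dt))

// Assumed candidate values: three bin counts times three depths
// gives nine parameter combinations.
val paramGrid = new ParamGridBuilder().
  addGrid(dt.maxBins, Array(16, 32, 64)).
  addGrid(dt.maxDepth, Array(3, 5, 10)).
  build()

// Same RMSE evaluator as in the notebook, repeated for self-containment.
val evaluator = new RegressionEvaluator().
  setLabelCol("label").
  setPredictionCol("prediction").
  setMetricName("rmse")

// Five-fold cross-validation over every combination in the grid.
val cv = new CrossValidator().
  setEstimator(pipeline).
  setEvaluator(evaluator).
  setEstimatorParamMaps(paramGrid).
  setNumFolds(5)
```

Calling `cv.fit` on the training DataFrame evaluates each of the nine combinations on each of the five folds and keeps the combination with the lowest average RMSE.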
Train your model using only the training data
In [ ]:
// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(trainingData)
Provide the RMSE on the test data. Note that you do not need to insert any code here. You will be given the marks if you display the correct RMSE value on the test data. Please include this RMSE value in your submission (2 marks)
In [ ]:
// Compute the root mean square error
val pred = cvModel.transform(testData)
val rmse = evaluator.evaluate(pred)
println("The RMSE on the test data is " + rmse)
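For reference, the quantity reported by the "rmse" metric is the square root of the mean squared difference between labels and predictions. The sketch below computes it by hand in plain Scala; `rmseOf`, `sampleLabels`, and `samplePredictions` are hypothetical names and the numbers are made up, so it illustrates the formula rather than the evaluator's implementation.

```scala
// Made-up labels and predictions, purely for illustration.
val sampleLabels: Seq[Double] = Seq(6.0, 5.0, 7.0)
val samplePredictions: Seq[Double] = Seq(5.0, 5.0, 9.0)

// RMSE: square root of the mean of the squared residuals.
def rmseOf(y: Seq[Double], yHat: Seq[Double]): Double = {
  require(y.nonEmpty && y.length == yHat.length,
    "inputs must be non-empty and of equal length")
  val squaredErrors = y.zip(yHat).map { case (a, b) => (a - b) * (a - b) }
  math.sqrt(squaredErrors.sum / y.length)
}

// Squared residuals here are 1, 0 and 4, so the result is sqrt(5 / 3).
val sampleRmse: Double = rmseOf(sampleLabels, samplePredictions)
```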