
CMP3036M Data Science, page 1 of 4

University of Lincoln
School of Computer Science

2016 – 2017

Assessment Item 2 of 2 Briefing Document

Title: CMP3036M Data Science Indicative Weighting: 50%

Learning Outcomes

On successful completion of this component a student will have demonstrated competence in the following
areas:

• LO1 Critically apply fundamental concepts and techniques in data science
• LO2 Utilise state-of-the-art tools to design data science applications for various types of data
• LO3 Analyze and interpret large datasets and deliver appropriate reports on them

Task Overview: Bike Rental Demand Prediction

The objective of this assignment is to analyse a dataset
concerning bike rentals. The dataset can be downloaded from
Blackboard. It is based on real data from the Capital
Bikeshare company, which maintains a bike rental network in
Washington DC. The dataset has one row for each hour of
each day in 2011 and 2012, for a total of 17,379 rows. It
contains features of the day (workday, holiday) as well as
weather parameters such as temperature and humidity. The
range of hourly bike rentals is from 1 to 977. The bike usage
is stored in the field ‘cnt’. Our task is to develop a prediction
model for the number of bike rentals such that Capital
Bikeshare can predict the bike usage in advance.

You need to write a report that discusses how you completed the task, going into sufficient depth to
demonstrate knowledge and critical understanding of the relevant processes involved. 100% of the available
marks come from the completion of the written report, with clear and separate marking criteria for each
required report section. Notably, a distinct and significant report section discussing and critiquing the
analysis and implementation processes you carried out for your data solution is required.

Attribute/feature information in the dataset:
• instant: record index
• dteday: date
• season: season (1: spring, 2: summer, 3: fall, 4: winter)
• yr: year (0: 2011, 1: 2012)
• mnth: month (1 to 12)
• hr: hour (0 to 23)
• holiday: whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedules)
• weekday: day of the week
• workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
• weathersit:

– 1: Clear, Few clouds, Partly cloudy
– 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
– 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
– 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

• temp: Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8,
t_max=+39 (only on the hourly scale)

• atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min),
t_min=-16, t_max=+50 (only on the hourly scale)

• hum: Normalized humidity. The values are divided by 100 (max)
• windspeed: Normalized wind speed. The values are divided by 67 (max)
• casual: count of casual users
• registered: count of registered users
• cnt: count of total rental bikes, including both casual and registered

Report Guidance

Your report must conform to the structure below and include the required content as described; information on
the specific marking criteria for each section is available in the accompanying CRG document. You must supply a
written report containing five distinct sections that provide a full and reflective account of the processes
undertaken.

Section 1: Data Summary, Preprocessing and Visualisation (5%)

As a first step, you need to load the dataset from the .csv file into Microsoft Azure Machine Learning Studio.
You then provide a summary of the dataset and proceed with data preprocessing. For example: what is the size of
the data? How many features are there? Which data entries are redundant and can be skipped? Are there any NAs?
Which data entries are categorical but may be marked as numeric? Do any features need to be normalized
(where appropriate)?

For data visualisation, you need to generate several plots using Python or R. For example, generate trellis plots.
As categorical features of the plot, use the ‘season’ and ‘weathersit’ features, which categorise the season of the
year and the current weather situation (sun, rain, etc.). Always use the target values for the y-axis, and for the x-
axis test the fields ‘temp’ (temperature), ‘atemp’ (feeling temperature), ‘hum’ (humidity) and ‘windspeed’. What
are your findings? What relationships can you see? You need to report the most interesting plots and interpret
your results! Note, the information you report here should be useful for your model development!
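The summary steps above can be sketched in Python with pandas (one of the script languages the brief allows inside Azure ML Studio). The frame below is synthetic stand-in data, not the real dataset; only the column names are taken from the brief, and in practice you would load the Blackboard CSV with pd.read_csv instead.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame mimicking the bike-rental schema from the brief.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "instant": np.arange(1, n + 1),       # record index: redundant for modelling
    "season": rng.integers(1, 5, n),      # categorical, but coded as numeric
    "weathersit": rng.integers(1, 5, n),  # categorical, but coded as numeric
    "temp": rng.random(n),                # already normalised to [0, 1]
    "hum": rng.random(n),
    "cnt": rng.integers(1, 978, n),       # target: hourly rental count
})

print(df.shape)         # size of the data
print(df.isna().sum())  # are there any NAs?

# Entries that are categorical but stored as numbers can be re-marked:
for col in ["season", "weathersit"]:
    df[col] = df[col].astype("category")
```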

Section 2: Comparison of Algorithms (7.5%)

You need to test different algorithms on this data. Split the dataset into a 75% training set and a 25% test set (i.e.,
the test set method). Train a linear regression model and evaluate the performance on the test set. Please use the
mean absolute error (MAE) when reporting the performance of the algorithm. Report your Azure graph
(i.e., the plots generated within Azure ML Studio through its built-in functions or your own R/Python scripts) and
your performance. Using the data visualisation, can you find some polynomial feature expansion that improves
the performance? Report your steps and your results. If your results do not improve, explain why. Train a boosted
decision tree regression model using the same data split. Use the default parameters. Does the prediction
performance improve? Repeat the same step with the decision forest algorithm, again with default parameters.
Which algorithm performs best?
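As a rough local analogue of the Azure modules, the test-set method with a linear model and MAE can be sketched with scikit-learn; the data below is synthetic, and `train_test_split`/`LinearRegression` merely stand in for the ‘Split Data’ and ‘Linear Regression’ modules.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the 'cnt' target.
rng = np.random.default_rng(1)
X = rng.random((400, 4))  # e.g. temp, atemp, hum, windspeed
y = 200 * X[:, 0] + 50 * X[:, 2] + rng.normal(0, 10, 400)

# 75% training / 25% test split, i.e. the "test set method".
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
mae = mean_absolute_error(y_te, model.predict(X_te))
print(f"linear regression MAE: {mae:.2f}")

# A polynomial feature expansion is tried the same way, e.g. adding temp**2:
X_poly = np.column_stack([X, X[:, 0] ** 2])
```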

Section 3: Model Selection (15%)

Regardless of the result of the previous section, you will now use the boosted trees (for computation-time
reasons). You want to understand their parameters a bit better. To do so, use the parameter range
option of the tree module and start with the ‘Minimum number of samples per leaf node’ parameter, using
the following values: [1,2,3,4,6,8,10,15,20,40,80]. The other parameters will be set to 32 (max number
of leaves), 0.4 (learning rate) and 100 (number of trees). Using the tune hyper-parameters module, show with a plot
how the performance depends on the ‘min number of samples’ parameter. Interpret your results: what is the best
parameter value? For which range can you see overfitting, and for which underfitting? Exemplify your conclusion
by referring to the lecture material.
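Outside Azure, the same parameter sweep can be sketched with scikit-learn's `GradientBoostingRegressor` as a stand-in for the boosted tree module; the data is synthetic and the parameter names are scikit-learn's, not Azure's.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the bike features and 'cnt'.
rng = np.random.default_rng(2)
X = rng.random((400, 4))
y = 300 * np.sin(4 * X[:, 0]) + 100 * X[:, 1] + rng.normal(0, 20, 400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Sweep the 'minimum samples per leaf' values from the brief; the other
# parameters stay fixed at 32 leaves, 0.4 learning rate and 100 trees.
leaf_sizes = [1, 2, 3, 4, 6, 8, 10, 15, 20, 40, 80]
maes = {}
for m in leaf_sizes:
    gbt = GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.4, max_leaf_nodes=32,
        min_samples_leaf=m, random_state=0,
    ).fit(X_tr, y_tr)
    maes[m] = mean_absolute_error(y_te, gbt.predict(X_te))

best = min(maes, key=maes.get)  # plot leaf_sizes (x) vs MAE (y) in the report
print("best min_samples_leaf:", best)
```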

So far, you have done the model selection with the test set method. However, as a good data scientist you know
this is not always a safe option. To be sure, you resort to 10-fold cross validation using the ‘partition and sample’
module. Redo your evaluation. Report your plots. Do you come to a different conclusion? Explain your results
also by discussing the qualitative differences between cross validation and the test set method.
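A minimal sketch of the 10-fold evaluation, again using scikit-learn as a stand-in for the ‘partition and sample’ plus cross-validation modules (synthetic data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.random((300, 4))
y = 200 * X[:, 0] + rng.normal(0, 15, 300)

gbt = GradientBoostingRegressor(n_estimators=100, learning_rate=0.4,
                                max_leaf_nodes=32, random_state=0)

# scikit-learn reports the negated MAE, so flip the sign back.
scores = -cross_val_score(
    gbt, X, y,
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
    scoring="neg_mean_absolute_error",
)
print(f"10-fold CV MAE: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Unlike a single 75/25 split, the fold-to-fold spread also indicates how stable the estimate is.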

Repeat the process for the other parameters of the algorithm. As validation method, use the method of your
choice, but justify your choice. Always leave the other parameters fixed, and use parameter values of your choice
for the parameter you vary to generate these results. Report your most interesting findings and explain them
by referring to the material you have learned in the lecture.

Section 4: Time Series Modelling (15%)

You happily present your great results to the CEO of Bikeshare and he wants to immediately test it on the data of
the new year 2013. Surprisingly, your algorithm works significantly worse on this data. The CEO is not amused
and asks you for the reasons. What do you answer?

In order to test the scenario with the new year data you will from now on only train on data from the year 2011
(yr = 0) and test on data from the year 2012 (yr=1). Use the relative expression from the ‘split data’ module for
doing so. Repeat the training of the linear model and the regression forests (using the best found parameters). Can
you confirm the findings of the CEO? What is your performance?

After 3 days of sleeping badly, you have a brilliant idea of how to fix this problem. As you have experienced, you
cannot directly predict the new year’s values. However, if you know the number of rented bikes in the last 12
hours, you might be able to predict the bike usage for the next hour. You want to test this hypothesis. For doing
so, add 12 new features to the data set using Python or R code. The features should indicate the bike usage 1 to 12
hours before the actual entry. Remove the data of the first 12 hours as they do not have a history that is long
enough. Report your code snippets. Retrain the regression forest. What performance do you obtain?
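The 12 hourly lag features and the year-based split can be sketched in pandas with `shift`; the frame below is synthetic stand-in data (100 “2011” rows followed by 100 “2012” rows), with only the `yr` and `cnt` column names taken from the brief.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in; with the real data, df would be the full dataset
# sorted chronologically.
rng = np.random.default_rng(4)
df = pd.DataFrame({"yr": [0] * 100 + [1] * 100,
                   "cnt": rng.integers(1, 978, 200)})

# Bike usage 1..12 hours before the current entry.
for k in range(1, 13):
    df[f"cnt_lag_{k}h"] = df["cnt"].shift(k)

df = df.iloc[12:]  # the first 12 rows have no complete 12-hour history

# Train on 2011 (yr == 0) and test on 2012 (yr == 1), as the brief asks.
train, test = df[df["yr"] == 0], df[df["yr"] == 1]
```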

In the shower, you have another brilliant idea. Maybe it also helps to add the progress of bike rentals over the last
12 days (using the same hour as the current entry). Again, use Python or R code to implement these 12 additional
features. Remove the first 12 * 24 rows, as the history of these entries is again not long enough. Compare the
performance of the original approach, of using the 12 hours before as additional features, of using the 12 days
before as features, and of using both the 12 hours and the 12 days. Which results will you report back to the CEO?
Also compare the decision forest algorithm to the tree boosting algorithm. Make sure that the comparison is fair
(i.e., they use close-to-optimal parameter settings). Again report your steps and your findings.
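The 12 daily lag features follow the same pattern, shifting by multiples of 24 rows (again sketched on a synthetic hourly series):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
cnt = pd.Series(rng.integers(1, 978, 24 * 20), name="cnt")  # 20 days of hourly rows
df = cnt.to_frame()

# Usage at the same hour of the day, 1..12 days earlier: shift by 24*k rows.
for k in range(1, 13):
    df[f"cnt_lag_{k}d"] = df["cnt"].shift(24 * k)

df = df.iloc[12 * 24:]  # drop the first 12 days: their history is too short
```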

Section 5: Time Series Prediction (7.5%)

The CEO of Bikeshare is now happy with your best results. However, using the last 12 hours as
features only allows prediction of the next hour. This is too short notice for Bikeshare to make use of the
prediction. The CEO asks how much the performance would decrease if you predicted 2, 3, 4
and 5 hours ahead. Can you provide him with these results? Create a plot with the prediction horizon on the x-
axis and the performance on the y-axis. Again report your code snippets, diagrams or plots.
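One way to sketch the horizon experiment: to predict h hours ahead while only using a 12-hour usage history, the lag features must trail the target by h to h+11 hours. The autocorrelated series and the linear model below are synthetic stand-ins, not the brief's data or algorithm.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic autocorrelated hourly series standing in for 'cnt'.
rng = np.random.default_rng(6)
cnt = pd.Series(rng.integers(1, 978, 500)).rolling(5, min_periods=1).mean()

maes = {}
for h in range(1, 6):  # prediction horizon in hours
    # Features: usage h .. h+11 hours before the target entry.
    X = pd.concat({f"lag{k}": cnt.shift(h + k) for k in range(12)}, axis=1)
    data = pd.concat([cnt.rename("y"), X], axis=1).dropna()
    split = int(len(data) * 0.75)
    train, test = data.iloc[:split], data.iloc[split:]
    model = LinearRegression().fit(train.drop(columns="y"), train["y"])
    maes[h] = mean_absolute_error(test["y"], model.predict(test.drop(columns="y")))

print(maes)  # plot horizon (x-axis) vs MAE (y-axis), e.g. with matplotlib
```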

Important Information

The report should be a maximum of 4500 words. A presentation penalty of 5% will be strictly applied if you
exceed the 4500-word limit (a 10% leeway applies). Keep in mind that:
• The report must contain your name, student number, module name and code;
• The report must be in PDF and no more than 4500 words (with ~10% leeway), including the cover page (if you
have one), table of contents and appendices (if any); the word count does not apply to references;

• The report must be formatted with single line spacing and use an 11pt font;
• The report does not include this briefing document;
• You should describe and justify each step needed to reproduce your results by using code snippets,
screenshots and plots;
• You should interpret the results of your data analysis and model development;
• Please explain trends, characteristics or even outliers when you summarise and describe data;
• Whenever you need to set a random seed for an Azure component, use your student ID as the seed. This should
prevent you from reporting exactly the same results as your fellow students;
• When using screenshots or plots generated from Azure, Python or R, please make sure they are clearly
readable but also do not take up excessive space where this is not necessary;
• Always refer to the mean absolute error (MAE) when reporting the ‘performance’ of the algorithm.


Submission Instructions

The deadline for submission of this work is included in the School Submission dates on Blackboard. You must
make an electronic submission of your work to Blackboard that includes the following mandatory item:

• A PDF of your written report (following the requirements above), submitted to the Turnitin upload area for
assessment 2

This assessment is an individually assessed component. Your work must be presented according to the School of
Computer Science guidelines for the presentation of assessed written work. Please make sure you have a clear
understanding of the grading principles for this component as detailed in the accompanying Criterion Reference
Grid. Your citations and referencing should be in accordance with University guidelines.

If you are unsure about any aspect of this assessment component, please seek the advice of the delivery team.
