Chair of Bioinformatics and Computational Biology Department of Informatics
Technical University of Munich
Personal sticker
Compliance to the code of conduct
I hereby assure that I solve and submit this exam myself under my own name by only using the allowed tools listed below.
Signature or full name if no pen input available
S5115
Data Analysis and Visualization in R
Exam: IN2339 / Retake
Date: Monday 29th June, 2020
Examiner: Julien Gagneur
Time: 08:15 – 09:45
Exam empty – Page 1 / 24 – IN-DataViz-1-20200629-E5115-01
Working instructions
• This exam consists of 24 pages with a total of 11 problems.
Please make sure now that you received a complete copy of the exam.
• The total amount of achievable credits in this exam is 27 credits.
• Detaching pages from the exam is prohibited.
• Allowed resources:
– slides, exercises and notes from the lectures
– Other content from the internet
– You are not allowed to communicate with anyone except with the examiners during the exam and during the oral questioning (one hour following the written exam). Hence, you can consult a forum for an existing post but you are not allowed to post any question nor result on any communication media (e.g. forum, WhatsApp group, social media, etc.) up to one hour following the exam.
– You should answer the questions using the knowledge (data analysis and statistical methods) and R packages taught during the lecture, with the sole exception of the dslabs R package, whose datasets will be needed. In this respect, consulting other content from the internet is probably a bad idea, as it may hint towards methods and code that were not taught.
– The R libraries you can use are: data.table, ggplot2, tidyr, dslabs, magrittr and dplyr. Load them in your R session by running the following code: library(data.table); library(ggplot2); library(tidyr); library(dslabs); library(magrittr); library(dplyr). Be sure to have them already installed using install.packages('data.table'), and so on for each of those libraries.
• Filling the exam
– Download the pdf to your computer and edit it there. Make sure that your pdf editor supports native text input fields. Check the list of pdf readers in this document: https://tumexam.de/static/handreichung_submissions_students.pdf
– Do not work with the pdf loaded in a web browser as it does not save your edits.
– Answer by typing, no handwriting or sketching. Write into the solution box inside the pdf document.
– R outputs (e.g. tables or plots) are not required unless they are the answer to the question. Simply copy the executed code from R to the solution box in the exam. If the question states “justify”, provide a short justification in plain English. In this case, only providing the code is not enough.
– We do not accept any additional files.
– Some questions are worth one point, others two points. No half points will be given.
• Interactions with examiners and oral questioning
– The examiners are reachable during the exam and for the oral questioning via a zoom meeting
– The zoom meeting will be open from 8.00 to 11.00
– The written exam will start at 8.15 sharp, at which point the exam will be downloadable from TUMexam.
– Your exam should be uploaded back to TUMexam by 9.50 sharp.
– Do not switch on your microphone, nor your camera, and do not share your screen during the written exam.
– If you have any question during the written exam, primarily use the zoom conference chat with direct messages to the examiner.
– If your zoom connection breaks during the written exam, try first to reconnect. If it keeps failing, you can send questions to exam@mailgagneur.informatik.tu-muenchen.de
– The oral questioning starts immediately after the written exam, from 9.45 to 10.45.
– The purpose of the oral questioning is to ensure your identity and that you did the exam by yourself. You should be able to explain why you gave a particular answer to a question (i.e. what your reasoning was). It does not matter whether your answer to the question is right or wrong. We only want to make sure that it comes from you. In the oral questioning you will no longer be allowed to consult any document.
– You are not allowed to communicate with anyone except the examiners during the entire hour reserved for oral questioning, even if you have been already orally questioned.
– You must be reachable at all times by videoconference during the oral questioning hour. If your zoom connection breaks, immediately inform us at exam@mailgagneur.informatik.tu-muenchen.de and propose an alternative videoconference channel (preferably WhatsApp). We will not store your phone number after the oral questioning.
– For the oral questioning, switch on the camera and microphone. Give us your matriculation number, first name, and last name as they appear in TUMonline by copy-pasting this information into the chat window. Show your student ID and face. We will then ask you a few questions about your submission, to verify that you wrote it yourself.
Problem 1 (6 credits)
a) (2 credits)
Question Nr. 4HR86EZ11LNA05GE44NY1
The olive dataset from the dslabs package contains the % of 8 fatty acids found in Italian olive oils.
library(dslabs)
data(olive)
head(olive)
##           region         area palmitic palmitoleic stearic oleic linoleic
## 1 Southern Italy North-Apulia    10.75        0.75    2.26 78.23     6.72
## 2 Southern Italy North-Apulia    10.88        0.73    2.24 77.09     7.81
## 3 Southern Italy North-Apulia     9.11        0.54    2.46 81.13     5.49
## 4 Southern Italy North-Apulia     9.66        0.57    2.40 79.52     6.19
## 5 Southern Italy North-Apulia    10.51        0.67    2.59 77.71     6.72
## 6 Southern Italy North-Apulia     9.11        0.49    2.68 79.24     6.78
##   linolenic arachidic eicosenoic
## 1      0.36      0.60       0.29
## 2      0.31      0.61       0.29
## 3      0.31      0.63       0.29
## 4      0.50      0.78       0.35
## 5      0.50      0.80       0.46
## 6      0.51      0.70       0.44
Write R code using ggplot2 to plot the distribution of the % of the fatty acids except oleic using density plots with different line colors to distinguish each fatty acid. Give meaningful labels to both axes.
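One possible approach, sketched under the assumption that the dslabs, data.table and ggplot2 packages are installed. The names olive_long, fatty_acid and percentage are illustrative choices and not part of the exam.

```r
library(data.table)
library(ggplot2)
library(dslabs)

data(olive)

# Reshape to long format: one row per (oil sample, fatty acid) pair.
olive_long <- melt(
  as.data.table(olive),
  id.vars = c("region", "area"),
  variable.name = "fatty_acid",
  value.name = "percentage"
)

# Drop oleic acid, then draw one density curve per remaining fatty acid.
p <- ggplot(olive_long[fatty_acid != "oleic"],
            aes(x = percentage, color = fatty_acid)) +
  geom_density() +
  labs(x = "Fatty acid content (%)", y = "Density", color = "Fatty acid")
p
```

Reshaping first keeps the plotting call short: a single color aesthetic then distinguishes all fatty acids.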
b) (2 credits)
Question Nr. 3QM41LNA09GZ73DB55AQ5
The mpg dataset from the ggplot2 package contains various features of cars, mostly related to fuel.
library(ggplot2)
data(mpg)
head(mpg)
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~
Write R code that produces the following plot.
[Figure: four panels titled cty, cyl, displ and hwy, each with its own axis range; x-axis: value, y-axis: density]
c) (2 credits)
Question Nr. 0GW52IK81LNA06UY23CW8
The admissions dataset from the dslabs package provides the number of applicants and admitted students to 6 different majors stratified by gender. Write R code using ggplot2 to plot the difference between applicants and admitted students on each major, using bars and stratified by gender using facets. Give meaningful labels to both axes. Do not mind if you obtain negative values.
library(dslabs)
data(admissions)
admissions
##    major gender admitted applicants
## 1      A    men       62        825
## 2      B    men       63        560
## 3      C    men       37        325
## 4      D    men       33        417
## 5      E    men       28        191
## 6      F    men        6        373
## 7      A  women       82        108
## 8      B  women       68         25
## 9      C  women       34        593
## 10     D  women       35        375
## 11     E  women       24        393
## 12     F  women        7        341
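A minimal sketch of one way to draw the requested bar plot, assuming dslabs and ggplot2 are installed. The difference is computed as applicants minus admitted, which is one reading of the question; the axis label texts are illustrative.

```r
library(ggplot2)
library(dslabs)

data(admissions)

# One bar per major showing applicants minus admitted, one facet per gender.
p <- ggplot(admissions, aes(x = major, y = applicants - admitted)) +
  geom_col() +
  facet_wrap(~ gender) +
  labs(x = "Major", y = "Applicants minus admitted")
p
```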
Problem 2 (2 credits)
Which operation has been applied to table A and table B to return the result table? Justify your answer. Write one line of R code that would produce the result table assuming a data table A and a data table B in the working environment.
Table A:
id CreditCard       CCV type
15 1837655746651971 582 l
21 5927428911423246 221 i
14 7393954899774435 142 r
23 7844437946592947 479 l
16 7364376521545978 881 l
13 3923818281216234 698 o
 8 1764682661721638 566 o
24 2622321425978251 528 o
19 7271112241595296 393 o
18 4225693846619738 421 r

Table B:
firstName lastName   customer_id
Aamina    el-Sinai   16
Marcus    Hendrix    13
Derek     Martinez    8
Muntasir  al-Sharifi 24
Alexis    Smith      19
Alexis    Arreola    18
Khanea    Forrest     1
Julia     Deronde     6
Keith     Hart       25
Tiana     Ramirez     9
Adam      Highman     2

Result table:
id CreditCard       CCV type firstName lastName
 1 NA               NA  NA   Khanea    Forrest
 2 NA               NA  NA   Adam      Highman
 6 NA               NA  NA   Julia     Deronde
 8 1764682661721638 566 o    Derek     Martinez
 9 NA               NA  NA   Tiana     Ramirez
13 3923818281216234 698 o    Marcus    Hendrix
14 7393954899774435 142 r    NA        NA
15 1837655746651971 582 l    NA        NA
16 7364376521545978 881 l    Aamina    el-Sinai
18 4225693846619738 421 r    Alexis    Arreola
19 7271112241595296 393 o    Alexis    Smith
21 5927428911423246 221 i    NA        NA
23 7844437946592947 479 l    NA        NA
24 2622321425978251 528 l    Muntasir  al-Sharifi
25 NA               NA  NA   Keith     Hart
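The result table keeps every id found in either table and fills the missing side with NA, which is the behaviour of a full outer join. The sketch below illustrates this on three rows of each table copied from the exam; it assumes the data.table package, and the subset of rows is chosen only to keep the example short.

```r
library(data.table)

# Three rows of each table, copied from the exam. Card numbers are stored as
# character strings to avoid precision loss on 16-digit values.
A <- data.table(
  id = c(14, 13, 8),
  CreditCard = c("7393954899774435", "3923818281216234", "1764682661721638"),
  CCV = c(142, 698, 566),
  type = c("r", "o", "o")
)
B <- data.table(
  firstName = c("Marcus", "Derek", "Khanea"),
  lastName = c("Hendrix", "Martinez", "Forrest"),
  customer_id = c(13, 8, 1)
)

# Full outer join: keep every id from either table, fill gaps with NA.
result <- merge(A, B, by.x = "id", by.y = "customer_id", all = TRUE)
result
```

In this subset, id 1 appears only in B and id 14 only in A, so both end up in the output with NA in the columns of the other table.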
Problem 3 (2 credits)
a) (1 credit)
Question Nr. 6CK12JZ6CD9OQ44JL0
Which normal Q-Q plot (i.e. Q-Q plot against the standard Normal distribution) A, B, C, or D corresponds to the distribution shown with a solid line in the plot below? The standard Normal distribution, i.e. the Gaussian distribution with mean 0 and variance 1, is shown in the plot below with a dashed line.
[Figure: density plot of the distribution (x-axis: x, y-axis: density) and four normal Q-Q plots labelled A, B, C and D (x-axis: theoretical, y-axis: sample)]
b) (1 credit)
Question Nr. 3JN34PA3DD3YU35RP0
Which normal Q-Q plot (i.e. Q-Q plot against the standard Normal distribution) A, B, C, or D corresponds to the distribution shown with a solid line in the plot below? The standard Normal distribution, i.e. the Gaussian distribution with mean 0 and variance 1, is shown in the plot below with a dashed line.
[Figure: density plot of the distribution (x-axis: x, y-axis: density) and four normal Q-Q plots labelled A, B, C and D (x-axis: theoretical, y-axis: sample)]
Problem 4 (2 credits)
Question Nr. 9OQ7L2ZT32SI73AD2
The brexit_polls dataset from the dslabs package contains poll outcomes for 127 polls performed by different pollsters either online or by telephone (poll_type).
library(dslabs)
data(brexit_polls)
head(brexit_polls)
##    startdate    enddate   pollster poll_type samplesize remain leave undecided spread
## 1 2016-06-23 2016-06-23     YouGov    Online       4772   0.52  0.48      0.00   0.04
## 2 2016-06-22 2016-06-22    Populus    Online       4700   0.55  0.45      0.00   0.10
## 3 2016-06-20 2016-06-22     YouGov    Online       3766   0.51  0.49      0.00   0.02
## 4 2016-06-20 2016-06-22 Ipsos MORI Telephone       1592   0.49  0.46      0.01   0.03
## 5 2016-06-20 2016-06-22    Opinium    Online       3011   0.44  0.45      0.09  -0.01
## 6 2016-06-17 2016-06-22     ComRes Telephone       1032   0.54  0.46      0.00   0.08
You are interested in generating a table that shows the polls of June 2016 only. Write R code to create such a table with the same column names (header displayed below).
##     startdate    enddate   pollster poll_type samplesize remain leave undecided spread
## 1: 2016-06-23 2016-06-23     YouGov    Online       4772   0.52  0.48      0.00   0.04
## 2: 2016-06-22 2016-06-22    Populus    Online       4700   0.55  0.45      0.00   0.10
## 3: 2016-06-20 2016-06-22     YouGov    Online       3766   0.51  0.49      0.00   0.02
## 4: 2016-06-20 2016-06-22 Ipsos MORI Telephone       1592   0.49  0.46      0.01   0.03
## 5: 2016-06-20 2016-06-22    Opinium    Online       3011   0.44  0.45      0.09  -0.01
## 6: 2016-06-17 2016-06-22     ComRes Telephone       1032   0.54  0.46      0.00   0.08
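A possible sketch, assuming dslabs and data.table are installed and reading "polls of June 2016" as polls that both started and ended in June 2016 (filtering on startdate alone would be another reading). The names polls_dt and june_polls are illustrative.

```r
library(data.table)
library(dslabs)

data(brexit_polls)

# Convert to a data.table and keep the polls run entirely within June 2016.
polls_dt <- as.data.table(brexit_polls)
june_polls <- polls_dt[format(startdate, "%Y-%m") == "2016-06" &
                         format(enddate, "%Y-%m") == "2016-06"]
head(june_polls)
```

Row filtering leaves the columns untouched, so the header matches the original table.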
Problem 5 (4 credits)
a) (2 credits)
Question Nr. 4FS41VJ01LT35KP42ZM8
Consider the dataset “brca”. Which statistical test that we studied do you suggest to test the association between the variable “concavity_se” and the variable “outcome”? Assume normally distributed values of the variable “concavity_se” given the value of “outcome”. Justify the choice of the test and provide the two-sided p-value rounded to two significant digits using signif(…,digits=2).
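With a quantitative variable that is assumed normal within each group of a binary variable, a two-sample t-test is a natural candidate. The sketch below assumes that “outcome” refers to brca$y in the dslabs package and that concavity_se is a column of the matrix brca$x. It uses the default of t.test, which is the Welch variant; whether the lecture expected the equal-variance version is not stated here.

```r
library(dslabs)

data(brca)

# concavity_se is a column of the feature matrix brca$x;
# brca$y holds the outcome (B = benign, M = malignant).
concavity_se <- brca$x[, "concavity_se"]
outcome <- brca$y

# Two-sided two-sample t-test (Welch variant by default).
p_value <- t.test(concavity_se ~ outcome)$p.value
signif(p_value, digits = 2)
```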
b) (2 credits)
Question Nr. 5GH17ZA51LS07UH62QI6
Consider the dataset “olive”. Which statistical test that we studied do you suggest to test the association between the variable “oleic” and the variable “stearic”? Do not make any assumption of normality. Justify the choice of the test and provide the two-sided p-value rounded to two significant digits using signif(…,digits=2).
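For two quantitative variables without a normality assumption, a rank-based test such as Spearman's correlation test is one suitable choice. A minimal sketch, assuming the dslabs package is installed; R may warn that exact p-values cannot be computed with ties.

```r
library(dslabs)

data(olive)

# Spearman's rank correlation test relies on ranks only,
# so it needs no normality assumption. Two-sided by default.
spearman_test <- cor.test(olive$oleic, olive$stearic, method = "spearman")
signif(spearman_test$p.value, digits = 2)
```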
Problem 6 (2 credits)
a) (1 credit)
QuestionId: 0PE69QK81ME3BC1VR8
Which of the following explanatory variables ‘rawpoll_clinton’ and ‘rawpoll_trump’ explains most variance of the response variable ‘rawpoll_mcmullin’ in the ‘polls_us_election_2016’ dataset from the dslabs package? Assume that the assumptions of linear regression are met. Provide code and justify your answer.
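One way to compare the two candidates is to fit one univariate linear model per variable and compare the coefficients of determination (R squared). This is a sketch assuming the dslabs package is installed; rows with missing values are dropped by lm by default, and the object names are illustrative.

```r
library(dslabs)

data(polls_us_election_2016)

# One simple linear regression per candidate explanatory variable.
fit_clinton <- lm(rawpoll_mcmullin ~ rawpoll_clinton, data = polls_us_election_2016)
fit_trump <- lm(rawpoll_mcmullin ~ rawpoll_trump, data = polls_us_election_2016)

# R squared is the share of the response variance explained by each model.
r_squared <- c(
  rawpoll_clinton = summary(fit_clinton)$r.squared,
  rawpoll_trump = summary(fit_trump)$r.squared
)
r_squared
names(which.max(r_squared))
```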
b) (1 credit)
QuestionId: 3FQ41UA96LY2U0K14MQ1
Consider the ‘polls_us_election_2016’ dataset from the dslabs package. Assuming that all assumptions of linear regression are met, is the effect of ‘rawpoll_clinton’ on ‘rawpoll_mcmullin’ significant at the significance level of 0.05? Justify.
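The significance of a single coefficient can be read from the coefficient table of the fitted model. A minimal sketch, assuming the dslabs package is installed and a simple regression with rawpoll_clinton as the only explanatory variable:

```r
library(dslabs)

data(polls_us_election_2016)

fit <- lm(rawpoll_mcmullin ~ rawpoll_clinton, data = polls_us_election_2016)

# P-value of the t-test for the null hypothesis that the slope is zero.
slope_p_value <- summary(fit)$coefficients["rawpoll_clinton", "Pr(>|t|)"]
slope_p_value
slope_p_value < 0.05
```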
Problem 7 (2 credits)
a) (1 credit)
Question Nr. 8ZF48OQ3WP45KH55GQ0
We consider a linear regression model parameterized as
$$y_i = \alpha + \beta \cdot x_i + \varepsilon_i$$
where $i = 1, \dots, N$ denotes the data point indices, $y_i$ is the response variable, $\alpha$ and $\beta$ the coefficients, $x_i$ the explanatory variable and $\varepsilon_i$ the error term. Let $\hat{y}_i$ be the fitted value.
Does the following plot provide evidence against the assumptions of the linear regression? Justify.
[Figure: Q-Q plot; x-axis: theoretical quantiles, y-axis: response quantiles]
b) (1 credit)
Question Nr. 0JU8SP106HN41KM58HL9
We consider a linear regression model parameterized as
$$y_i = \alpha + \beta \cdot x_i + \varepsilon_i$$
where $i = 1, \dots, N$ denotes the data point indices, $y_i$ is the response variable, $\alpha$ and $\beta$ the coefficients, $x_i$ the explanatory variable and $\varepsilon_i$ the error term. Let $\hat{y}_i$ be the fitted value.
Does the following plot provide evidence against the assumptions of the linear regression? Justify.
[Figure: plot of y against x; x-axis: x, y-axis: y]
Problem 8 (2 credits)
Question Nr. 4LY90YX24AQ71F66EX2
library(dslabs)
Consider the “brca” dataset from the dslabs package. Fit a logistic regression model which predicts the response variable brca$y given the feature perimeter_se. Assume that all assumptions of the logistic regression model are met. Starting from an original probability of 10% of a malignant tumor (cancer), by how much does the probability of a malignant tumor (cancer) increase when the feature perimeter_se increases by 0.8?
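In a logistic regression, raising a feature by 0.8 adds 0.8 times its coefficient to the log-odds. The sketch below applies this to a starting probability of 10%, assuming dslabs and data.table are installed. glm models the probability of the second factor level of brca$y, which is M (malignant); the object names are illustrative.

```r
library(data.table)
library(dslabs)

data(brca)

dt <- data.table(y = brca$y, perimeter_se = brca$x[, "perimeter_se"])

# Logistic regression of malignancy on perimeter_se.
fit <- glm(y ~ perimeter_se, data = dt, family = "binomial")
beta <- unname(coef(fit)["perimeter_se"])

# Convert the starting probability to log-odds, shift it, convert back.
p_start <- 0.10
p_new <- plogis(qlogis(p_start) + 0.8 * beta)

# Increase in probability.
p_new - p_start
```

Here qlogis maps a probability to log-odds and plogis maps log-odds back to a probability, so no manual odds arithmetic is needed.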
Problem 9 (2 credits)
QuestionId: 4YF2RH7TD00JE84ZN0
Consider the features smoothness_mean and radius_mean from the brca dataset. Provide R code that plots a ROC curve of both features as predictors of malignancy (variable brca$y == "M"), and indicate the feature that has the highest true positive rate when the false positive rate is 0.4.
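A hand-rolled sketch that builds the ROC points with the allowed packages only, assuming dslabs, data.table and ggplot2 are installed. The helper roc_points is hypothetical (not from the lecture or any package), it assumes that larger feature values point towards malignancy, and it does not merge tied scores into a single threshold.

```r
library(data.table)
library(ggplot2)
library(dslabs)

data(brca)

# ROC points for one score: lower the threshold one observation at a time
# (largest score first) and track cumulative false/true positive rates.
roc_points <- function(score, is_positive, feature) {
  ord <- order(score, decreasing = TRUE)
  data.table(
    feature = feature,
    fpr = c(0, cumsum(!is_positive[ord]) / sum(!is_positive)),
    tpr = c(0, cumsum(is_positive[ord]) / sum(is_positive))
  )
}

is_malignant <- brca$y == "M"
roc_dt <- rbind(
  roc_points(brca$x[, "smoothness_mean"], is_malignant, "smoothness_mean"),
  roc_points(brca$x[, "radius_mean"], is_malignant, "radius_mean")
)

p <- ggplot(roc_dt, aes(x = fpr, y = tpr, color = feature)) +
  geom_line() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "False positive rate", y = "True positive rate", color = "Feature")
p

# Highest true positive rate each feature reaches without exceeding FPR = 0.4.
tpr_at_04 <- roc_dt[fpr <= 0.4, .(tpr = max(tpr)), by = feature]
tpr_at_04
```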
Problem 10 (2 credits)
library(dslabs)
library(data.table)
Question Nr. 4IL88XL37CT72GN68OW7
Consider the variable ‘concavity_worst’ of the ‘brca’ dataset. A researcher states that no other variable from the matrix ‘brca$x’ associates with the variable ‘concavity_worst’ according to Spearman’s correlation. Do you reject this hypothesis using the significance level of 1%? Provide code and justify your answer. Do not mind warnings, if any, about exact p-values with ties.
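Testing one variable against all others means many tests, so the p-values need a multiple-testing correction before comparing them to 1%. The sketch below assumes the dslabs package is installed and uses a Bonferroni correction, which controls the family-wise error rate; this choice is an assumption, since the question does not name the correction.

```r
library(dslabs)

data(brca)

x <- brca$x
target <- "concavity_worst"
others <- setdiff(colnames(x), target)

# One Spearman test per remaining variable; warnings about ties are silenced.
p_values <- sapply(others, function(v) {
  suppressWarnings(cor.test(x[, target], x[, v], method = "spearman")$p.value)
})

# Bonferroni-adjusted p-values, compared to the 1% level.
adjusted <- p.adjust(p_values, method = "bonferroni")
reject_global_null <- any(adjusted < 0.01)
reject_global_null
```

If at least one adjusted p-value falls below 0.01, the statement that no other variable associates with concavity_worst is rejected at that level.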
Problem 11 (1 credit)
Question Nr. 0PJ29LOW8LMHZ4060
Assume 5 elements v, w, x, y and z. A first clustering gave the clusters {v} and {w,x,y,z}. One then runs k-means, which yields two clusters: {v,w,x} and {y,z}. Applying hierarchical clustering also yields two clusters: {v,w} and {x,y,z}. Which of the k-means clustering and the hierarchical clustering yielded the clustering most similar to the first clustering? Support your answer with a metric learned in the lecture.
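Assuming the intended metric is the Rand index (the share of element pairs on which two clusterings agree, i.e. pairs placed together in both or apart in both), the comparison can be sketched in base R. The function rand_index and the cluster labels 1 and 2 are illustrative.

```r
# Rand index: fraction of element pairs treated the same way by both
# clusterings (together in both, or separated in both).
rand_index <- function(a, b) {
  pairs <- combn(length(a), 2)
  same_a <- unname(a[pairs[1, ]] == a[pairs[2, ]])
  same_b <- unname(b[pairs[1, ]] == b[pairs[2, ]])
  mean(same_a == same_b)
}

# Cluster labels for the elements v, w, x, y, z.
first_cl  <- c(v = 1, w = 2, x = 2, y = 2, z = 2)  # {v}, {w,x,y,z}
kmeans_cl <- c(v = 1, w = 1, x = 1, y = 2, z = 2)  # {v,w,x}, {y,z}
hier_cl   <- c(v = 1, w = 1, x = 2, y = 2, z = 2)  # {v,w}, {x,y,z}

rand_index(first_cl, kmeans_cl)  # 0.4
rand_index(first_cl, hier_cl)    # 0.6
```

Of the 10 pairs, k-means agrees with the first clustering on 4 (wx, yz, vy, vz) and hierarchical clustering on 6 (xy, xz, yz, vx, vy, vz), so under this metric the hierarchical clustering is the more similar one.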
Additional space for solutions. Clearly mark the (sub)problem your answers relate to and strike out invalid solutions.