Name: Email: Student ID:
@berkeley.edu
DS-100 Midterm Exam Spring 2018
Instructions:
• This midterm exam must be completed in the 80 minute time period ending at
12:30PM, unless you have accommodations supported by a DSP letter.
• Note that some questions have bubbles to select a choice. This means that you should only select one choice. Other questions have boxes. This means you should select all that apply.
• When selecting your choices, you must fully shade in the box/circle. Check marks will likely be mis-graded.
• You may use a one-sheet (two-sided) study guide.
• Work quickly through each question. There are a total of 168 points on this exam.
Honor Code:
As a member of the UC Berkeley community, I act with honesty, integrity, and respect for others. I am the person whose name is on the exam and I completed this exam in accordance with the honor code.
Signature:
DS100 Midterm,
Page 2 of 15 March 8th, 2018
Syntax Reference
Regular Expressions
“^” matches the position at the beginning of the string (unless used for negation, as in “[^]”).
“$” matches the position at the end of the string.
“?” match preceding literal or sub-expression 0 or 1 times. When following “+” or “*”, results in non-greedy matching.
“+” match preceding literal or sub-expression one or more times.
“*” match preceding literal or sub-expression zero or more times.
“[ ]” match any one of the characters inside, accepts a range, e.g., “[a-c]”.
“( )” used to create a sub-expression.
“\d” match any digit character. “\D” is the complement.
“\w” match any word character (letters, digits, underscore). “\W” is the complement.
“\s” match any whitespace character including tabs and newlines. “\S” is the complement.
“\b” match boundary between words.
“.” match any character except new line.
Some useful re and requests package functions.
re.findall(pattern, st) returns the list of all sub-strings in st that match pattern.
requests.get(url, auth, params, data) makes a GET request with params in the header and data in the body.
requests.post(url, auth, params, data) makes a POST request with params in the header and data in the body.
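The regular expression entries above can be tried out with Python's standard re module. This is only a minimal sketch: the sample string and the variable names are made up for illustration and do not come from the exam.

```python
import re

# Hypothetical sample text, not taken from the exam.
text = "Call 510-642-6000 or 415-555-0199 before 12:30PM."

# \d matches a digit; + repeats it one or more times.
digit_runs = re.findall(r"\d+", text)

# ( ) creates a sub-expression; with one group, findall returns the group.
area_codes = re.findall(r"(\d+)-\d+-\d+", text)

# \b marks a word boundary; \w* matches a run of word characters.
words_starting_with_b = re.findall(r"\bb\w*", text)

# A trailing ? after + or * makes the match non-greedy.
greedy = re.findall(r"<.+>", "<a><b>")
non_greedy = re.findall(r"<.+?>", "<a><b>")

print(digit_runs)
print(area_codes)
print(words_starting_with_b)
print(greedy, non_greedy)
```

The last two calls differ only in the “?” after “+”: the greedy pattern consumes as much as it can, while the non-greedy one stops at the first closing bracket.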
Useful Pandas Syntax
df.loc[row_selection, col_list] # row selection can be boolean
df.iloc[row_selection, col_list] # row selection can be boolean
df.groupby(group_columns)[['colA', 'colB']].sum()
pd.merge(df1, df2, on='hi') # Merge df1 and df2 on the 'hi' column
pd.pivot_table(df, # The input dataframe
               index=out_rows, # values to use as rows
               columns=out_cols, # values to use as cols
               values=out_values, # values to use in table
               aggfunc="mean", # aggregation function
               fill_value=0.0) # value used for missing comb.
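As a rough illustration of the calls listed above, the sketch below runs each of them on a toy DataFrame. The data and the column names group and kind are hypothetical, chosen only so that every call has something to act on.

```python
import pandas as pd

# Hypothetical toy data, used only to exercise the calls listed above.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "kind": ["x", "y", "x", "x"],
    "colA": [1.0, 2.0, 3.0, 4.0],
    "colB": [10.0, 20.0, 30.0, 40.0],
})

# Label-based selection with a boolean row mask.
big = df.loc[df["colA"] > 2, ["group", "colA"]]

# Position-based selection: first two rows, first two columns.
corner = df.iloc[0:2, 0:2]

# Group, select columns, then aggregate.
sums = df.groupby("group")[["colA", "colB"]].sum()

# Pivot: one row per group, one column per kind, mean of colA in each cell.
# The (b, y) combination never occurs, so fill_value supplies 0.0 there.
table = pd.pivot_table(df, index="group", columns="kind",
                       values="colA", aggfunc="mean", fill_value=0.0)

print(big, corner, sums, table, sep="\n\n")
```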
Data Design and Bias
1. [1 Pt] Your letter grade (e.g., A+, A, . . . ) in a class that grades on a curve is most accurately described as what kind of data?
⃝ Nominal ⃝ Ordinal ⃝ Quantitative ⃝ Numerical
2. [1 Pt] The number of gold medals won by each country in the 2018 Olympics is an example
of what kind of data:
⃝ Nominal ⃝ Ordinal ⃝ Qualitative ⃝ Quantitative
3. A discussion leader with 32 students in her section would like to sample a single student that is representative of the total population of students in her section. She enumerates her students 0 to 31 and follows one of the following procedures:
(a) [4 Pts] She flips a fair coin 31 times and records the number of heads. She then selects the student with the number that matches the number of heads. What type of sample has the discussion leader taken? Select all that apply.
□ Simple random sample □ Probability sample □ Convenience sample □ None of the above
(b) [4 Pts] She flips a fair coin 5 times and records the sequence of heads and tails as 1’s and 0’s, respectively. She then selects the student whose number corresponds to the binary sequence. For example, if she flipped [1, 1, 0, 0, 1] then she would select:
$1 \times 2^0 + 1 \times 2^1 + 0 \times 2^2 + 0 \times 2^3 + 1 \times 2^4 = 19$, that is, student 19.
What type of sample has the discussion leader taken? Select all that apply.
□ Simple random sample □ Probability sample □ Convenience sample □ None of the above
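One way to reason about the two procedures in question 3 is to compute each student's chance of being selected. The sketch below does this exactly with the standard library; the helper student_from_flips and the variable names are illustrative assumptions, and the code only tabulates probabilities rather than deciding which labels apply.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb

N_STUDENTS = 32  # students are numbered 0 to 31

# Procedure (a): the selected number is the count of heads in 31 fair flips,
# so student k is chosen with the Binomial(31, 1/2) probability of k heads.
prob_a = [Fraction(comb(31, k), 2**31) for k in range(N_STUDENTS)]

def student_from_flips(flips):
    """Treat flips[i] as the coefficient of 2**i, as in the worked example."""
    return sum(bit * 2**i for i, bit in enumerate(flips))

# Procedure (b): enumerate all 2**5 equally likely sequences of 5 flips
# and count how many of them map to each student number.
counts = Counter(student_from_flips(seq) for seq in product([0, 1], repeat=5))
prob_b = [Fraction(counts[k], 2**5) for k in range(N_STUDENTS)]

print(student_from_flips([1, 1, 0, 0, 1]))   # 1 + 2 + 16 = 19
print(float(prob_a[0]), float(prob_a[15]))   # selection chances under (a)
print(float(prob_b[0]), float(prob_b[15]))   # selection chances under (b)
```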
4. Sampling True/False For each of the following select true or false:
(a) [1 Pt] If each element/member of the population has an equal chance of being chosen,
then we have a simple random sample. ⃝ True ⃝ False
(b) [1 Pt] In cluster sampling, each cluster has an equal chance of being chosen.
⃝ True ⃝ False
(c) [1 Pt] In stratified sampling, each element of the population is assigned to exactly one stratum.
⃝ True ⃝ False
(d) [1 Pt] A small simple random sample can often be more representative of the population than a very large convenience sample.
⃝ True ⃝ False
5. We would like to understand the sleeping habits of university students living in campus dorms across the United States.
(a) [2 Pts] To keep costs down we randomly sample a subset of dorms across the United States and then construct a simple random sample of students within each of the selected dorms. This is an example of which sampling procedure:
⃝ Simple random sample ⃝ Stratified sample ⃝ Cluster sample
(b) [2 Pts] Which of the following sampling procedures would ensure that we have good coverage of both male and female students within each dorm?
⃝ Simple random sample ⃝ Stratified sample ⃝ Cluster sample
Pandas
6. Pandas True/False
(a) [1 Pt] If the pandas DataFrame df has 10 columns, then df.iloc[:, 0:5] will return a DataFrame with 5 columns.
⃝ True ⃝ False
(b) [1 Pt] Assuming that len(df1) == 100 and len(df2) == 100 are both true, then df1.merge(df2, how='outer') produces at most 200 rows.
⃝ True ⃝ False
(c) [1 Pt] The return type of the pandas.DataFrame.groupby function can either be a DataFrame or a Series object.
⃝ True ⃝ False
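The three statements in question 6 can each be probed on small inputs. The following is a minimal sketch with made-up DataFrames (the column names c0 to c9, key, l and r are assumptions for illustration); it shows how one might check such claims, not an official solution.

```python
import pandas as pd

# Ten hypothetical columns named c0 .. c9.
df = pd.DataFrame({f"c{i}": range(3) for i in range(10)})

# The end of an iloc slice is exclusive, so 0:5 covers positions 0 to 4.
first_cols = df.iloc[:, 0:5]

# With no `on` argument, merge joins on the shared column ("key" here).
# Repeated key values pair every left row with every matching right row.
left = pd.DataFrame({"key": [1, 1, 1], "l": [0, 1, 2]})
right = pd.DataFrame({"key": [1, 1], "r": [0, 1]})
outer = left.merge(right, how="outer")

# groupby by itself performs no aggregation yet.
grouped = df.groupby("c0")

print(first_cols.shape)
print(len(left), len(right), len(outer))
print(type(grouped).__name__)
```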
7. The tables food and store contain information regarding different ingredients and where to buy them. You may assume all strings are strings and numbers are floats.
This is a preview of the first 5 rows of each DataFrame. You may assume each has many more rows than what is shown, with the same structure and no missing data.
food
index  name      color   calories  food_group
0      broccoli  green   25        vegetable
1      chicken   pink    200       meat
2      cheddar   yellow  350       dairy
3      mango     yellow  40        fruit
4      carrot    orange  50        vegetable
(a) [5 Pts] Which of the following expressions returns a Series containing only the names of all the red vegetables in the food DataFrame? Select all that apply.
□ food[(food["color"] == "red") | (food["food_group"] == "vegetable")]["name"]
□ food[(food["color"] == "red") & (food["food_group"] == "vegetable")]["name"]
□ food[(food["color"] == "red") & (food["food_group"] == "vegetable")]
□ food[(food["name"].isin(store["food_name"])) & (food["food_group"] == "vegetable")]
□ None of the above.
(b) [5 Pts] Select all of the following expressions that generate a DataFrame containing only
rows of fruit.
□ food.set_index("food_group").loc["fruit", :]
□ food.where(food["food_group"] == "fruit")
□ food[food["food_group"] == "fruit"]
□ food["food_group"] == "fruit"
□ None of the above.
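The building blocks used in parts (a) and (b) return different kinds of objects, which can be inspected on the five preview rows. The sketch below rebuilds only those preview rows (the real table is said to be much longer), and the variable names is_fruit, filtered, masked and names are introduced here for illustration.

```python
import pandas as pd

# The five preview rows of food; the full DataFrame has many more rows.
food = pd.DataFrame({
    "name": ["broccoli", "chicken", "cheddar", "mango", "carrot"],
    "color": ["green", "pink", "yellow", "yellow", "orange"],
    "calories": [25.0, 200.0, 350.0, 40.0, 50.0],
    "food_group": ["vegetable", "meat", "dairy", "fruit", "vegetable"],
})

# A comparison yields a boolean Series, one entry per row.
is_fruit = food["food_group"] == "fruit"

# Indexing with the mask keeps only the rows where it is True.
filtered = food[is_fruit]

# where() keeps the original shape and puts NaN in rows where it is False.
masked = food.where(is_fruit)

# Selecting one column after filtering gives a Series.
names = food[is_fruit]["name"]

print(type(is_fruit).__name__, filtered.shape, masked.shape)
print(names.tolist())
```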
store
index  food_name  store_name     distance  price
0      broccoli   yasai          1         1.5
1      broccoli   safeway        2         2
2      cheddar    trader joes    1         4
3      mango      berkeley bowl  3         1
4      carrot     costco         6         5
(c) [5 Pts] Select all true statements about the following expression.
cal100_foods = food[food["calories"] <= 100]
nearby_stores = store[store["distance"] <= 2]
output_df = cal100_foods.merge(nearby_stores,
how = "left",
left_on="name",
right_on="food_name")
□ output_df['name'] and output_df['food_name'] are always the same.
□ output_df could contain NaN values.
□ nearby_stores always contains the same number of rows as output_df.
□ output_df could contain more rows than the original food DataFrame.
□ None of the above.
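To see how the left merge in part (c) behaves, it can be run on the preview rows alone. This is a sketch under the assumption that the five preview rows of each table are representative; the real DataFrames are larger, so the row counts printed here are specific to this small sample.

```python
import pandas as pd

# Preview rows only; the exam states the real tables have many more rows.
food = pd.DataFrame({
    "name": ["broccoli", "chicken", "cheddar", "mango", "carrot"],
    "color": ["green", "pink", "yellow", "yellow", "orange"],
    "calories": [25.0, 200.0, 350.0, 40.0, 50.0],
    "food_group": ["vegetable", "meat", "dairy", "fruit", "vegetable"],
})
store = pd.DataFrame({
    "food_name": ["broccoli", "broccoli", "cheddar", "mango", "carrot"],
    "store_name": ["yasai", "safeway", "trader joes", "berkeley bowl", "costco"],
    "distance": [1.0, 2.0, 1.0, 3.0, 6.0],
    "price": [1.5, 2.0, 4.0, 1.0, 5.0],
})

cal100_foods = food[food["calories"] <= 100]
nearby_stores = store[store["distance"] <= 2]
output_df = cal100_foods.merge(nearby_stores,
                               how="left",
                               left_on="name",
                               right_on="food_name")

# A left merge keeps every left row: unmatched rows get NaN on the right,
# and a left row with several matches is repeated once per match.
print(len(cal100_foods), len(nearby_stores), len(output_df))
print(output_df[["name", "food_name", "store_name"]])
```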
(d) [4 Pts] Which of the following tables is represented by agg_df?
safeway_food = store[store["store_name"] == "safeway"]
merged_df = pd.merge(food, safeway_food, left_on="name",
right_on="food_name")
agg_df = (merged_df.groupby("food_group")
.mean()
.drop(columns="distance")
)
[Four candidate tables, one per bubble, appeared here as images and are not reproduced.]
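The pipeline in part (d) can be traced on the preview rows to see which columns and index survive each step. One caveat, stated as an assumption: recent pandas versions no longer drop non-numeric columns silently in a grouped mean, so the sketch passes numeric_only=True to reproduce the behaviour the exam code relies on. The result below reflects only the five preview rows.

```python
import pandas as pd

# Preview rows only; the real tables are larger.
food = pd.DataFrame({
    "name": ["broccoli", "chicken", "cheddar", "mango", "carrot"],
    "color": ["green", "pink", "yellow", "yellow", "orange"],
    "calories": [25.0, 200.0, 350.0, 40.0, 50.0],
    "food_group": ["vegetable", "meat", "dairy", "fruit", "vegetable"],
})
store = pd.DataFrame({
    "food_name": ["broccoli", "broccoli", "cheddar", "mango", "carrot"],
    "store_name": ["yasai", "safeway", "trader joes", "berkeley bowl", "costco"],
    "distance": [1.0, 2.0, 1.0, 3.0, 6.0],
    "price": [1.5, 2.0, 4.0, 1.0, 5.0],
})

safeway_food = store[store["store_name"] == "safeway"]

# Default merge is an inner join: only foods sold at safeway remain.
merged_df = pd.merge(food, safeway_food, left_on="name",
                     right_on="food_name")

agg_df = (merged_df.groupby("food_group")
          .mean(numeric_only=True)   # assumption: needed on recent pandas
          .drop(columns="distance"))

print(agg_df)
```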
(e) [4 Pts] Which of the following expressions would generate the following table?
⃝ (food.groupby(["food_group", "color"])[["calories"]].median())
⃝ pd.pivot_table(food, values="calories", index="food_group", columns="color", aggfunc=np.median)
⃝ (food.set_index("food_group").groupby("color")[["calories"]].mean())
⃝ pd.pivot_table(food, values="calories", index="color", columns="food_group", aggfunc=np.median)
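The options in part (e) differ mainly in the shape of what they return. The sketch below contrasts a two-key groupby with a pivot table on the preview rows; it uses the string "median" for aggfunc, which is an adaptation for current pandas rather than the exam's np.median, and the names long_form and wide_form are hypothetical.

```python
import pandas as pd

# Preview rows only; the real table is larger.
food = pd.DataFrame({
    "name": ["broccoli", "chicken", "cheddar", "mango", "carrot"],
    "color": ["green", "pink", "yellow", "yellow", "orange"],
    "calories": [25.0, 200.0, 350.0, 40.0, 50.0],
    "food_group": ["vegetable", "meat", "dairy", "fruit", "vegetable"],
})

# Grouping on two keys gives one row per (food_group, color) pair
# and a two-level row index.
long_form = food.groupby(["food_group", "color"])[["calories"]].median()

# A pivot table spreads one key across the columns instead,
# leaving NaN for combinations that never occur.
wide_form = pd.pivot_table(food, values="calories", index="food_group",
                           columns="color", aggfunc="median")

print(long_form)
print(wide_form)
```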
EDA and Visualization
8. [5 Pts] Which of the following claims are true for the distribution shown below? Select all that apply.
□ It is left skewed □ It is unimodal □ The right tail is longer than the left tail □ It is symmetric □ None of the above
9. [5 Pts] We wish to compare the results of kernel density estimation using a gaussian kernel and a boxcar kernel. For α > 0, which of the following statements are true? Choose all that apply.
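Gaussian Kernel:
$$K_\alpha(x, z) = \frac{1}{\sqrt{2\pi\alpha^2}} \exp\left(-\frac{(x - z)^2}{2\alpha^2}\right)$$
Box Car Kernel:
$$B_\alpha(x, z) = \begin{cases} \frac{1}{\alpha} & \text{if } -\frac{\alpha}{2} \le x - z \le \frac{\alpha}{2} \\ 0 & \text{else} \end{cases}$$
[Figure: two panels, (a) Gaussian and (b) Box Car.]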
□ Decreasing α for a gaussian kernel decreases the smoothness of the KDE.
□ The gaussian kernel is always better than the boxcar kernel for KDEs.
□ Because the gaussian kernel is smooth, we can safely use large α values for kernel density estimation without worrying about the actual distribution of data.
□ The area under the box car kernel is 1, regardless of the value of α.
□ None of the above
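The two kernels in question 9 are short enough to write out and evaluate numerically. The following NumPy sketch implements both as defined above and approximates areas with a Riemann sum; the grid, the sample observations and the helper kde are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(x, z, alpha):
    """Gaussian kernel K_alpha(x, z) with bandwidth alpha."""
    return (np.exp(-(x - z) ** 2 / (2 * alpha ** 2))
            / np.sqrt(2 * np.pi * alpha ** 2))

def boxcar_kernel(x, z, alpha):
    """Box car kernel B_alpha(x, z): height 1/alpha on a window of width alpha."""
    return np.where(np.abs(x - z) <= alpha / 2, 1.0 / alpha, 0.0)

def kde(xs, data, kernel, alpha):
    """Kernel density estimate: the average of one kernel per observation."""
    return np.mean([kernel(xs, z, alpha) for z in data], axis=0)

xs = np.linspace(-10, 10, 20001)        # evaluation grid
dx = xs[1] - xs[0]
data = np.array([-1.0, 0.0, 0.5, 2.0])  # hypothetical observations

box_areas = {}
kde_areas = {}
for alpha in (0.5, 1.0, 2.0):
    # Riemann-sum approximations of the area under each curve.
    box_areas[alpha] = float(np.sum(boxcar_kernel(xs, 0.0, alpha)) * dx)
    kde_areas[alpha] = float(np.sum(kde(xs, data, gaussian_kernel, alpha)) * dx)

print(box_areas)
print(kde_areas)
```

Plotting kde(xs, data, gaussian_kernel, alpha) for several values of alpha is a quick way to see how the bandwidth affects the smoothness of the estimate.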
10. [5 Pts] Which of the following styles of plots are good for visualizing the distribution of a continuous variable? Choose all that apply.
□ Pie Charts □ Box Plots □ Bar Plots □ Histogram □ None of the above
11. [2 Pts] Suppose you wish to compare the number of homes that homeowners in the US own with their respective salaries. Which style of plot would be the best?
⃝ Scatter Plot ⃝ Overlaid Line Plots ⃝ Side by Side Box Plots ⃝ Stacked Bar Plot
12. [5 Pts] Consider the plot below. What are some ways to improve the plot? Choose all that apply. Assume each is done individually.
□ Remove outliers and then plot on a different scale.
□ Plot as a line plot instead of a scatterplot.
□ Jitter the data with noise sampled from a uniform distribution of (-1, 1).
□ Utilize transparency.
□ None of the above
13. [5 Pts] Consider the plot below which visualizes day of the week versus the average tip given in dollars. What are serious visualization errors made with this plot? Choose all that apply.
□ Area perception □ Jittering □ Overplotting □ Stacking □ None of the above
14. True/False
(a) [1 Pt] A data scientist must always consider potential sources of bias in a given dataset.
⃝ True ⃝ False
(b) [1 Pt] It is always reasonable to drop missing values.
⃝ True ⃝ False
15. Use the following dataset to answer the following questions:
id,diet,pulse,time,kind
1,low fat,85,1 min,rest
1,low fat,85,15 min,rest
1,low fat,88,30 min,rest
2,low fat,90,1 min,rest
2,low fat,92,15 min,rest
2,low fat,93,30 min,rest
3,low fat,97,1 min,rest
3,low fat,97,15 min,rest
(a) [1 Pt] Which of the following best describes the format of this file?
⃝ Raw text
⃝ Tab Separated Values (TSV)
⃝ Comma Separated Values (CSV)
⃝ JSON
(b) [4 Pts] Select all the true statements.
□ From the data available, the id seems to be a primary key.
□ There appear to be no missing values.
□ There are nested records.
□ None of the above.
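Properties such as key uniqueness and missing values can be checked mechanically once the file is loaded. The sketch below reads the eight rows shown in question 15 from an in-memory string; the variable names are illustrative, and the checks only describe these rows, not any data beyond them.

```python
import io

import pandas as pd

# The rows shown in question 15, copied verbatim.
raw = """id,diet,pulse,time,kind
1,low fat,85,1 min,rest
1,low fat,85,15 min,rest
1,low fat,88,30 min,rest
2,low fat,90,1 min,rest
2,low fat,92,15 min,rest
2,low fat,93,30 min,rest
3,low fat,97,1 min,rest
3,low fat,97,15 min,rest
"""

df = pd.read_csv(io.StringIO(raw))

# A primary key must identify each row uniquely.
id_is_unique = bool(df["id"].is_unique)
id_time_is_unique = not bool(df.duplicated(subset=["id", "time"]).any())

# Count empty cells across the whole table.
missing_cells = int(df.isna().sum().sum())

print(df.shape, id_is_unique, id_time_is_unique, missing_cells)
```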
16. [5 Pts] Select all the true statements about the following XML file:
< email >