Name: Email: Student ID:
@berkeley.edu
DS-100 Midterm Exam Fall 2017
Instructions:
• This exam must be completed in the 1.5 hour time period ending at 8:30PM.
• Note that some questions have bubbles to select a choice. This means that you should only select one choice. Other questions have boxes. This means you should select all that apply.
• When selecting your choices, you must shade in the box/circle. Checkmarks will likely be mis-graded.
• You may use a single page (two-sided) study guide.
• Work quickly through each question. There are a total of 116 points on this exam.
Honor Code:
As a member of the UC Berkeley community, I act with honesty, integrity, and respect for others. I am the person whose name is on the exam and I completed this exam in accordance with the honor code.
Signature:
DS100 Midterm, Page 2 of 20 October 12, 2017
Syntax Reference
Regular Expressions
“^” matches the position at the beginning of the string (unless used for negation, as in “[^]”).
“$” matches the position at the end of the string.
“?” matches the preceding literal or sub-expression 0 or 1 times. When following “+” or “*”, it results in non-greedy matching.
“+” matches the preceding literal or sub-expression one or more times.
“*” matches the preceding literal or sub-expression zero or more times.
“.” matches any character except newline.
“[ ]” matches any one of the characters inside; accepts a range, e.g., “[a-c]”.
“( )” is used to create a sub-expression.
“\d” matches any digit character. “\D” is the complement.
“\w” matches any word character (letters, digits, underscore). “\W” is the complement.
“\s” matches any whitespace character, including tabs and newlines. “\S” is the complement.
“\b” matches the boundary between words.
Some useful re package functions:
re.split(pattern, string) splits the string at substrings that match the pattern. Returns a list.
re.sub(pattern, replace, string) applies the pattern to string, replacing matching substrings with replace. Returns a string.
Useful Pandas Syntax
df.loc[row_selection, col_list] # row selection can be boolean
df.iloc[row_selection, col_list] # row selection can be boolean
df.groupby(group_columns)[['colA', 'colB']].sum()
pd.merge(df1, df2, on='hi') # Merge df1 and df2 on the 'hi' column
pd.pivot_table(df, # The input dataframe
    index=out_rows, # values to use as rows
    columns=out_cols, # values to use as cols
    values=out_values, # values to use in table
    aggfunc="mean", # aggregation function
    fill_value=0.0) # value used for missing comb.
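As a rough illustration of the pandas entries in the reference above, the sketch below runs them on two tiny tables. The tables, their column names (store, item, units, region), and the variable names are invented for this example and do not come from the exam.

```python
import pandas as pd

# Hypothetical toy tables, invented purely to exercise the reference syntax.
sales = pd.DataFrame({
    "store": ["a", "a", "b", "b"],
    "item": ["x", "y", "x", "y"],
    "units": [1, 2, 3, 4],
})
regions = pd.DataFrame({"store": ["a", "b"], "region": ["north", "south"]})

# Boolean row selection with .loc, keeping only the listed columns.
big = sales.loc[sales["units"] > 1, ["store", "units"]]

# Group by one column and sum a selected column.
totals = sales.groupby("store")[["units"]].sum()

# Join the two tables on their shared 'store' column.
merged = pd.merge(sales, regions, on="store")

# One row per store, one column per item, mean units in each cell.
table = pd.pivot_table(sales, index="store", columns="item",
                       values="units", aggfunc="mean", fill_value=0.0)
```

The double brackets in the groupby line keep the result a DataFrame rather than a Series.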
Data Generation and Probability Samples
For each of the following questions select the single best answer.
1. [2 Pts] A political scientist is interested in answering a question about a country composed of three states with exactly 10000, 20000, and 30000 voting adults. To answer this question, a political survey is administered by randomly sampling 25, 50, and 75 voting adults from each state, respectively. Which sampling plan was used in the survey?
⃝ cluster sampling ⃝ stratified sampling ⃝ quota sampling ⃝ snowball sampling
2. [2 Pts] A deck with 26 cards labeled A through Z is thoroughly shuffled, and the value of the third card in the deck is recorded. What is the probability that we observe the letter C on the third card?
⃝ 1/26 ⃝ 3/26 ⃝ 25/26 · 24/26 · 1/26 ⃝ 1/26 · 1/26 · 24/26 ⃝ None of the above.
3. [3 Pts] Suppose Sam visits your store to buy some items. He buys toothpaste for $2.00 with probability 0.5. He buys a toothbrush for $1.00 with probability 0.1. Let the random variable X be the total amount Sam spends. What is E[X]? Show your work in the space provided.
⃝ $1.10
⃝ $1.5
⃝ $3.00
⃝ The toothpaste purchase may not be independent of the toothbrush purchase so we can’t compute this expectation.
You may show your work in the following box for partial credit:
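For readers reviewing how an expected total is computed, here is a minimal sketch that enumerates a joint purchase distribution exactly. The prices, probabilities, and the two joint distributions are hypothetical and deliberately differ from the numbers in the question; only the technique carries over.

```python
from fractions import Fraction as F

PRICE_A, PRICE_B = 4, 3  # hypothetical prices, not the exam's


def expected_total(joint):
    """Expected spend, where joint maps (buys_a, buys_b) to a probability."""
    return sum(p * (PRICE_A * a + PRICE_B * b) for (a, b), p in joint.items())


# Two hypothetical joint distributions with the same marginals:
# P(buy A) = 1/4 and P(buy B) = 1/2 in both.
independent = {(1, 1): F(1, 8), (1, 0): F(1, 8), (0, 1): F(3, 8), (0, 0): F(3, 8)}
dependent = {(1, 1): F(1, 4), (1, 0): F(0), (0, 1): F(1, 4), (0, 0): F(1, 2)}

e_independent = expected_total(independent)
e_dependent = expected_total(dependent)
```

Comparing e_independent with e_dependent shows how the expectation behaves when the marginals are fixed but the dependence between purchases changes.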
4. [3 Pts] Suppose we have a coin that lands heads 80% of the time. Let the random variable X be the proportion of times the coin lands tails out of 100 flips. What is Var[X]? You must show your work in the space provided.
⃝ 0.8 ⃝ 0.16 ⃝ 0.04 ⃝ 0.0016 ⃝ 0.008
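The variance of a sample proportion can be checked by brute force from the binomial probability mass function. The sketch below does that with exact fractions; the values p = 1/2 and n = 25 are hypothetical and chosen to differ from the question.

```python
from fractions import Fraction
from math import comb


def proportion_variance(p, n):
    """Exact Var[K/n] for K ~ Binomial(n, p), computed from the pmf."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(Fraction(k, n) * w for k, w in enumerate(pmf))
    return sum((Fraction(k, n) - mean) ** 2 * w for k, w in enumerate(pmf))


p, n = Fraction(1, 2), 25  # hypothetical values, not the exam's
brute_force = proportion_variance(p, n)
closed_form = p * (1 - p) / n  # standard formula for a proportion
```

Both quantities are exact Fraction objects, so they can be compared with == without floating-point tolerance.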
5. A small town has 5 houses with the following people living in each house:
Abe, Ben
Cat, Dan, Emma
Frank, George
Hank, Ira, Jen
Kim, Lars
Suppose we take a cluster sample of 2 houses (without replacement). What is the chance that:
(1) [2 Pts] Kim and Lars are in the sample
⃝ 0 ⃝ 1/20 ⃝ 1/10 ⃝ 1/6 ⃝ 1/5 ⃝ 2/5 ⃝ 1
You may show your work in the following box for partial credit:
(2) [2 Pts] Kim, Abe, and Ben are in the sample
⃝ 0 ⃝ 1/20 ⃝ 1/10 ⃝ 1/6 ⃝ 1/5 ⃝ 2/5 ⃝ 1
You may show your work in the following box for partial credit:
(3) [1 Pt] Kim and Dan are in the sample – Select all that apply
□ The same as the chance Kim and Lars are in the sample
□ The same as the chance Kim, Abe, and Ben are in the sample
□ Neither of the above
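Cluster-sample probabilities of this kind can be found by listing every equally likely set of clusters. The following sketch assumes a simple random sample of k clusters without replacement; the four clusters and their one-letter members are hypothetical, not the houses in the question.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical clusters, not the exam's houses.
clusters = [{"P", "Q"}, {"R"}, {"S", "T", "U"}, {"V"}]


def chance_all_sampled(people, clusters, k):
    """Chance that every person in `people` lands in a sample of k clusters."""
    samples = list(combinations(clusters, k))
    # A sample is a hit when the union of its clusters covers all the people.
    hits = sum(set(people) <= set().union(*s) for s in samples)
    return Fraction(hits, len(samples))


same_cluster = chance_all_sampled({"P", "Q"}, clusters, 2)
two_clusters = chance_all_sampled({"P", "R"}, clusters, 2)
three_clusters = chance_all_sampled({"P", "R", "S"}, clusters, 2)
```

People in the same cluster are always sampled together, so only the number of distinct clusters they occupy matters.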
Data Cleaning and EDA
6. True or False. For each of the following statements select true or false.
(1) [1 Pt] Exploratory data analysis is the process of testing key hypotheses. ⃝ True ⃝ False
(2) [1 Pt] The structure of the data describes how it is formatted and organized. ⃝ True ⃝ False
(3) [1 Pt] Throughout the process of exploratory data analysis it is often necessary to transform and clean data. ⃝ True ⃝ False
(4) [1 Pt] During the data cleaning process it is generally a good idea to drop records that contain missing values. ⃝ True ⃝ False
7. In homework 3, we analyzed ride sharing data comparing the weekday and weekend patterns for both casual and registered riders.
(1) [1 Pt] On weekdays, the number of casual riders was most frequently ________ the number of registered riders.
⃝ higher than ⃝ lower than ⃝ similar to
(2) [1 Pt] Which group of riders demonstrated a pronounced bi-modal daily usage pattern:
⃝ Casual Riders ⃝ Registered Riders ⃝ Both casual and registered riders.
8. Use the following snippet of data to answer each of the questions below.
Business.data
"business_id","name","address","phone"
10,"TIRAMISU KITCHEN","033 BELDEN PL","+14154217044"
19,"LIFESTYLE CAFE","1200 VAN NESS AVE","+14157763262"
24,"OMNI S.F. HOTEL"," ","9999999999999999"
42,"The "Best", Food!","500 CALIFORNIA ST","+14156211114"
43,"The "Best", Food!","3716 Cesar Chavez","+14156211114"
(1) [1 Pt] Which of the following best describes the format of this file?
⃝ Raw Text ⃝ Tab Separated Values ⃝ Comma Separated Values ⃝ JSON
(2) [1 Pt] Which of the following best describes the granularity of each record?
⃝ Restaurant Chains ⃝ Individual Restaurant Locations ⃝ Strings ⃝ Daily
(3) [4 Pts] Select all the true statements.
□ From the available data the business id appears to be a primary key.
□ There appear to be no missing values.
□ While the data appears to be quoted there may be issues with the quote character.
□ There are nested records.
□ None of the above statements is true.
Transformations and Smoothing
9. [3 Pts] Which of the following are reasonable motivations for applying a power transformation? Select all that apply:
□ To help visualize highly skewed distributions
□ Bring data distribution closer to random sampling
□ To help straighten relationships between pairs of variables
□ Reduce the dimension of data
□ Remove missing values
□ None of the above
10. [3 Pts] Which of the following transformations could help make linear the relationship shown in the plot below? Select all that apply:
□ log(y) □ x² □ √y □ log(x) □ y² □ None of the above
[Figure 1: plot of Y (axis from 0 to 80) against X (axis from 1 to 9)]
[Figure 2: histogram, rug plot, and Gaussian kernel density estimate; vertical axis from 0.0 to 0.7]
11. [2 Pts] The above plot contains a histogram, rug plot, and Gaussian kernel density estimator. The Gaussian kernel is defined by:
Kα(x, z) = (1 / √(2πα²)) · exp(−(x − z)² / (2α²))
Judging from the shape of separate standing peaks, which of the following is the most likely value for the kernel parameter α.
⃝ α=0 ⃝ α=0.1 ⃝ α=10 ⃝ α=100
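To get a feel for how the kernel parameter changes the estimate, the Gaussian kernel defined above can be written out directly. This is a minimal sketch using only the standard library; the three data points and the two α values compared are hypothetical and are not read off the figure.

```python
import math


def gaussian_kernel(alpha, x, z):
    """Gaussian kernel with bandwidth alpha, evaluated at x for a point z."""
    return math.exp(-(x - z) ** 2 / (2 * alpha ** 2)) / math.sqrt(2 * math.pi * alpha ** 2)


def kde(alpha, x, data):
    """Kernel density estimate at x: the average of one kernel per data point."""
    return sum(gaussian_kernel(alpha, x, z) for z in data) / len(data)


data = [-1.0, 0.0, 2.0]  # hypothetical sample, not taken from the figure
narrow_at_point = kde(0.1, 0.0, data)  # small alpha: tall, separate spikes
wide_at_point = kde(1.0, 0.0, data)    # larger alpha: lower, smoother curve
```

Evaluating kde on a grid of x values for several α values shows the transition from isolated peaks to a single smooth bump.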
Regular Expressions
12. [2 Pts] Select all the strings that fully match the regular expression: [^dp]an
□ Dan □ pan □ fan □ man □ None of the above.
13. [2 Pts] Select all the strings that fully match the regular expression: <[a-z]*@\w+.edu>
□ <@berkeley$edu>
□ None of the above strings match.
14. [2 Pts] Select all the strings that fully match the regular expression: ^Go.*
□ Way to ^Go!
□ Go Bears!
□ go trees?
□ None of the above strings match.
15. [2 Pts] What is the result of evaluating the following Python command?
len(re.split(r"\d+", "You get a 99.9 on the exam."))
⃝ 2 ⃝ 3 ⃝ 4 ⃝ 5
16. For the following tasks, write the corresponding Python code or regular expression.
(1) [2 Pts] Write a regular expression that only matches substrings consisting of an a immediately followed by zero or one b characters.
regx = r'_________________________________________________'
(2) [3 Pts] Suppose we’ve run the code below:
text = 'Data Science 100'
Use a method in the re module to replace all the continuous segments of spaces with a single comma. The resulting string should look like “Data,Science,100”.
re._______________________________________________________
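The difference between a full match and a partial match, and the behavior of re.split and re.sub, can be tried out in a few lines. The pattern and strings below are hypothetical and were picked so that they do not coincide with any question in this section.

```python
import re

pattern = r"[^bc]at"  # hypothetical pattern, not one of the exam's
candidates = ["hat", "bat", "that", "Cat"]

# fullmatch succeeds only when the whole string matches the pattern.
full = [s for s in candidates if re.fullmatch(pattern, s)]

# search succeeds when any substring matches.
partial = [s for s in candidates if re.search(pattern, s)]

# split cuts at every match; matches at the ends leave empty strings.
pieces = re.split(r"-", "-x-")

# sub replaces each maximal run of vowels with a single asterisk.
masked = re.sub(r"[aeiou]+", "*", "queue")
```

Character classes are case-sensitive by default, which is why "Cat" is treated differently from a lowercase "cat".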
DataFrames, Joins, and Aggregation
17. The ti and fare DataFrames contain data of the people aboard the Titanic when it crashed:
>>> ti.head()
   survived  class     sex    id
0         0  Third    male  1410
1         1  First  female  1522
2         1  Third  female  1864
3         1  First  female  1687
4         0  Third    male  1173

>>> fare.head()
      fare  alone    id
0  73.5000   True  1457
1   9.2250   True  1645
2   8.6625   True  1716
3  59.4000  False  1367
4  18.0000  False  1639
Both tables contain one row for each passenger, uniquely identified by the id column. Here’s a description of the columns in each DataFrame:
DataFrame ti
survived: 1 if the person survived, else 0
class: ticket class (First, Second, or Third)
sex: Sex of person (male or female)
DataFrame fare
fare: Price of ticket in USD
alone: True if the person was alone at purchase.
Fill in the blanks to compute the following statements. You may assume that the pandas module is imported as pd. You may not use more lines than the ones provided.
(1) [2 Pts] The total number of survivors.
(2) [4 Pts] The proportion of females who survived (a float).
ti.loc[_______________________________,___________].mean()
(3) [4 Pts] A DataFrame containing the proportion of survivors for each sex. It should look like:
(4) [5 Pts] A DataFrame containing the proportion of survivors for each sex and class. It should look like:
(5) [8 Pts] A DataFrame containing the proportion of survivors for each sex after filtering out those that bought their ticket alone. The table should have the same structure as (3) but with different numbers.
merged = ___________________________________________________
(merged_____________________________________________________
___________________________________________________________
__________________________________________________________)
18. [3 Pts] From the following list select all statements that are true for Pandas DataFrames.
□ All data frames must have an index.
□ All columns must be the same type.
□ You can always index a record by its row number.
□ Missing values in string columns are always encoded as NaN.
□ None of the above
Visualizations
19. The figure below is a scatter plot of the heights of mothers (in) and fathers (in) of a sample of 1000 UC Berkeley students.
(1) [2 Pts] What is the main problem with this plot?
⃝ Choice of scale
⃝ Jiggling the baseline
⃝ Aspect ratio
⃝ Overplotting
⃝ Lack of context
⃝ Perception (length, angle, area)
(2) [2 Pts] What is the remedy for this problem?
⃝ Overlay plots
⃝ Jitter values
⃝ Use color to condition on student’s gender
⃝ Transform one variable or the other or both
⃝ Improve labels and legends
20. [2 Pts] The following figure is a line plot of CO2 emissions over time. What is the main problem with this plot?
[Figure: line plot of co2 against date, 1960 to 2010]
⃝ Empty data region ⃝ Jiggling the baseline ⃝ Overplotting ⃝ Lack of context ⃝ Perception (length, angle, area)
21. Consider the following visualization of the number of casual riders per hour by day of the week, which has been constructed from the bike sharing data used in Homework 3.
[Figure: casual riders per hour (vertical axis from 0 to 350) by weekday (Sat, Sun, Mon, Tue, Wed, Thu, Fri)]
(1) [2 Pts] Which days of the week frequently (at least 75% of the time) had fewer than 50 casual riders? Select all that apply.
□ Saturday □ Sunday □ Monday □ Tuesday □ None of the above.
(2) [3 Pts] Which of the following describe conclusions that we can draw about the distribution of rider counts on Tuesdays using the above plot? Select all that apply.
□ Skewed left □ Symmetric □ Skewed right □ Unimodal □ Has outliers □ None of the above
Estimation and Loss Minimization
22. Consider the following loss function.
L(θ, x) = 4(θ − x) if θ ≥ x
L(θ, x) = x − θ if θ < x
(1) [2 Pts] Select all statements that are true.
□ The loss function is concave.
□ The loss function is convex.
□ The loss function is smooth.
□ None of the above statements are true.
(2) [4 Pts] Given a sample x1, ..., xn, which value of θ minimizes the average loss? Show your work in the space provided.
⃝ 20th percentile ⃝ 25th percentile ⃝ 75th percentile ⃝ 80th percentile
(3) [2 Pts] The optimal value θ∗ is a percentile for the
⃝ sample ⃝ population
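The piecewise loss can be coded directly to experiment with the average loss on a sample. In this sketch the second branch (x − θ when θ < x) is an assumption, since the condition is cut off in the source; the sample values are hypothetical.

```python
def loss(theta, x):
    """Piecewise loss: 4(theta - x) when theta >= x, else x - theta.

    The condition on the second branch is assumed to be theta < x.
    """
    return 4 * (theta - x) if theta >= x else x - theta


def average_loss(theta, xs):
    """Mean loss of a single estimate theta over the sample xs."""
    return sum(loss(theta, x) for x in xs) / len(xs)


xs = [1, 2, 3]  # hypothetical sample
at_two = average_loss(2, xs)
```

Plotting average_loss over a grid of θ values for a larger sample is one way to explore where the minimum falls.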