Business 260 Homework Assignments

Professor David Mease



Homework 3 - Due Thursday 4/23:

1) Read Chapter 1 (all sections), Chapter 4 (all sections) and Chapter 5 (Sections 5.2, 5.5, 5.6 and 5.7) in the textbook Introduction to Data Mining by Tan, Steinbach and Kumar.

2) Repeat In Class Exercise #90 using the misclassification error rate instead of information gain to determine the best split. Which of the splits considered is best according to the misclassification error rate?

3) Repeat In Class Exercise #91 using the misclassification error rate instead of information gain to determine the best split. Which of the splits considered is best according to the misclassification error rate?
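
As a rough illustration of the criterion used in problems 2 and 3, the sketch below scores a split by its weighted misclassification error: each child node's error is one minus its majority class share, weighted by the fraction of observations in that node. It is written in Python rather than R, and the class counts are made up for illustration; they are not the counts from In Class Exercises #90 or #91.

```python
def node_error(counts):
    """Misclassification error of one node: 1 - (majority class share)."""
    total = sum(counts)
    return 1 - max(counts) / total if total else 0.0


def split_error(children):
    """Weighted misclassification error over the child nodes of a split."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * node_error(c) for c in children)


# Hypothetical class counts [class 0, class 1] in each child node.
split_a = [[4, 1], [1, 4]]
split_b = [[5, 3], [0, 2]]

error_a = split_error(split_a)
error_b = split_error(split_b)

# The split with the smaller weighted error is preferred.
best = "A" if error_a < error_b else "B"
print(f"split A: {error_a:.3f}, split B: {error_b:.3f}, best: {best}")
```

A lower value is better, which is the opposite direction from information gain, where a higher value is better.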

4) The file http://www-stat.wharton.upenn.edu/~dmease/rpart_text_example.txt gives an example of text output for a tree fit using the rpart() function in R from the library rpart. Use this tree to predict the class labels for the 10 observations in the test data at http://www-stat.wharton.upenn.edu/~dmease/test_data.csv. Do this manually - do not use R or any software.

5) I split the popular sonar data set into a training set (http://www-stat.wharton.upenn.edu/~dmease/sonar_train.csv) and a test set (http://www-stat.wharton.upenn.edu/~dmease/sonar_test.csv). Use R to fit a tree of depth 5 to the training set and compute its misclassification error rate on the test set, using all the default values except control=rpart.control(minsplit=0, minbucket=0, cp=-1, maxcompete=0, maxsurrogate=0, usesurrogate=0, xval=0, maxdepth=5). Remember that the 61st column is the response and the other 60 columns are the predictors.
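
For reference, the misclassification error rate in problem 5 is just the fraction of test observations whose predicted label differs from the actual label. A minimal sketch in Python (not R), using made-up labels rather than the sonar data:

```python
def misclassification_rate(actual, predicted):
    """Fraction of observations whose predicted label differs from the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


# Hypothetical labels, for illustration only.
actual = ["R", "M", "M", "R", "M"]
predicted = ["R", "M", "R", "R", "R"]

rate = misclassification_rate(actual, predicted)
print(rate)
```

The same calculation applies to the training set; only the labels passed in change.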

6) Do Chapter 5 textbook problem #17 (parts a and c only) on pages 322-323. Note that there is a typo in part c - it should read "Repeat the analysis for part (b)". We will do part b in class.

7) The Random Forest classifier often grows all of its individual trees so large that the misclassification error for each tree on the training data is 0%. Using this fact, what can you say will often be true about the misclassification error on the training data for the Random Forest classifier? Verify this by computing the misclassification error on the training data for the Random Forest classifier from In Class Exercise #99. Show your R code for doing this.

8) This question deals with In Class Exercise #94.

a) Repeat In Class Exercise #94 for the k-nearest neighbor classifier for k=5 and k=6.

b) Repeat part a using the exact same R code a few times. Explain why both the training errors and the test errors often change each time you repeat it for k=6 but not for k=5. Hint: Read the help on the knn function if you do not know. (Type "?knn" in R to read the help documentation.)



Homework 2 - Due Thursday 4/2:

1) For the clam example (In Class Exercise #60) complete the following:
a) What decision would the optimistic (maximax) approach favor?
b) What decision would the conservative (maximin) approach favor?
c) What decision would the minimax regret approach favor?
d) What decision would the expected value approach favor?
e) Compute the standard deviations for the three alternative decisions.
f) What is the EVPI for this problem?
g) Explain the meaning of EVPI in the context of this problem.
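
The criteria in parts a through f can be sketched on a small payoff table. The Python code below is an illustration only: the decisions, payoffs and probabilities are hypothetical and are not the values from the clam example.

```python
from math import sqrt

# Hypothetical payoff table: rows are decisions, columns are states of nature.
payoffs = {
    "small": [8, 7],
    "medium": [14, 5],
    "large": [20, -9],
}
probs = [0.6, 0.4]  # assumed state probabilities

# Optimistic (maximax): best of the best payoffs.
maximax = max(payoffs, key=lambda d: max(payoffs[d]))
# Conservative (maximin): best of the worst payoffs.
maximin = max(payoffs, key=lambda d: min(payoffs[d]))

# Minimax regret: regret = best payoff in that state minus payoff received.
best_by_state = [max(row[j] for row in payoffs.values()) for j in range(len(probs))]
max_regret = {
    d: max(b - x for b, x in zip(best_by_state, row)) for d, row in payoffs.items()
}
minimax_regret = min(max_regret, key=max_regret.get)

# Expected value and standard deviation of each decision.
ev = {d: sum(p * x for p, x in zip(probs, row)) for d, row in payoffs.items()}
sd = {
    d: sqrt(sum(p * (x - ev[d]) ** 2 for p, x in zip(probs, row)))
    for d, row in payoffs.items()
}
best_ev = max(ev, key=ev.get)

# EVPI = expected value with perfect information - best expected value without it.
ev_with_pi = sum(p * b for p, b in zip(probs, best_by_state))
evpi = ev_with_pi - ev[best_ev]

print(maximax, maximin, minimax_regret, best_ev, round(evpi, 2))
```

Note that the first three criteria ignore the probabilities entirely, while the expected value, standard deviation and EVPI all depend on them.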

2) Complete problem #19 from pages 140-141 in Chapter 4 of the QBA book.

3) For the following problems from the QBA textbook, ignore the specific questions asked in the textbook and instead solve the two-dimensional linear programming problem. Be sure to include a graph of the feasible region in your solution.
a) Chapter 7, Question #24, pp. 272-273
b) Chapter 7, Question #31, p. 275
c) Chapter 7, Question #42, p. 278
d) Chapter 7, Question #43, p. 278
e) Chapter 8, Question #1, p. 321

4) For the following problems from the QBA textbook, ignore the specific questions asked in the textbook and instead write down the correct and complete formulation of the linear programming problem. Write all inequalities with the variables on the left side and the constant on the right side. (You do not need to find the solutions for these.)
a) Chapter 8, Question #20, p. 335
b) Chapter 8, Question #26, p. 338
c) Chapter 9, Question #1, p. 397



Homework 1 - Due Thursday 3/19:

1) In the chapter entitled “Numerical Descriptive Measures” do textbook problem 5 AND check the answers using Excel.

2) The data at http://www.cob.sjsu.edu/mease_d/freethrows.xls gives free throw percentages for NBA basketball players for the 2005-2006 season.
a) Give the 5 number summary for the free throw percentages.
b) Graph the box-and-whisker plot by hand.
c) Based on your box-and-whisker plot, describe the shape of the data as left-skewed, symmetric or right-skewed.
d) Use Excel to compute the mean, variance and standard deviation.
e) Using your values from part d, give the intervals predicted by the empirical rule for this data.
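
The quantities in parts a, d and e can be sketched as follows. This is a Python illustration on made-up percentages, not the free throw data. It assumes the sample versions of the variance and standard deviation (as Excel's VAR and STDEV compute), and it uses the default quartile rule of Python's statistics.quantiles; quartile conventions differ between textbooks and software, so hand or Excel answers may differ slightly.

```python
from statistics import mean, quantiles, stdev, variance

# Hypothetical percentages, for illustration only.
pcts = [62, 70, 75, 78, 80, 83, 85, 88, 90]

# Five number summary: minimum, Q1, median, Q3, maximum.
q1, med, q3 = quantiles(pcts, n=4)
five_number = (min(pcts), q1, med, q3, max(pcts))

# Sample mean, variance and standard deviation.
m = mean(pcts)
var = variance(pcts)
sd = stdev(pcts)

# Empirical rule: roughly 68%, 95% and 99.7% of roughly bell-shaped data
# fall within 1, 2 and 3 standard deviations of the mean.
empirical = {
    share: (m - k * sd, m + k * sd)
    for k, share in ((1, "68%"), (2, "95%"), (3, "99.7%"))
}

print(five_number)
print(m, var, sd)
print(empirical)
```

The empirical rule is only a reasonable approximation when the data are roughly symmetric and bell-shaped, which is worth keeping in mind alongside the answer to part c.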

3) Review In Class Exercise #19. (There will be one like this on the exam.)

4) The dataset at http://www.cob.sjsu.edu/mease_d/gpa-data.xls contains data from 20 San Jose State University graduating seniors who were asked to report their high school GPA (first column) and their current college GPA (second column).
a) Make a scatter plot of this data with High School GPA on the X-axis and College GPA on the Y-axis using Excel.
b) Give the equation of the least squares regression line using Excel.
c) What is the slope of the least squares regression line?
d) Interpret the slope of the least squares regression line.
e) What is the coefficient of correlation?
f) What is the value of R-squared?
g) Use the least squares regression line to predict the college GPA of a student who had a high school GPA of 2.7.
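
The calculations behind parts b through g can be sketched directly from the least squares formulas: slope = Sxy / Sxx, intercept = (mean of y) - slope * (mean of x), and r = Sxy / sqrt(Sxx * Syy). The Python code below uses made-up GPA pairs, not the values in gpa-data.xls.

```python
from math import sqrt

# Hypothetical (high school GPA, college GPA) pairs, for illustration only.
x = [2.0, 2.5, 3.0, 3.5, 4.0]
y = [2.2, 2.4, 3.1, 3.0, 3.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross products about the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
r = sxy / sqrt(sxx * syy)  # coefficient of correlation
r_squared = r ** 2


def predict(hs_gpa):
    """Predicted college GPA from the fitted line."""
    return intercept + slope * hs_gpa


print(f"y = {intercept:.3f} + {slope:.3f} x, r = {r:.3f}, R^2 = {r_squared:.4f}")
print(f"prediction at 2.7: {predict(2.7):.3f}")
```

In simple linear regression, R-squared is the square of the coefficient of correlation, and the slope has the same sign as r.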

5) In the chapter entitled “Basic Probability” do textbook problems 4, 8, 12, 14, 17, 20, 23, 31 and 34.

6) In the chapter entitled “Some Important Discrete Probability Distributions” do textbook problems 3, 4, 14, 15 and 20 (but skip part g in 20).

7) Stock X has a mean of $50 and a standard deviation of $10. Stock Y has a mean of $100 and a standard deviation of $20. Find the mean and standard deviation of the total value when buying one share of each:
a) If they are independent (so the covariance is 0)
b) If the covariance is 30
c) If the covariance is -30
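
The rule needed for problem 7 is that the mean of a sum is the sum of the means, and Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y). A small Python sketch, using hypothetical stocks rather than the values in the problem:

```python
from math import sqrt


def sum_mean_sd(mean_x, sd_x, mean_y, sd_y, cov_xy):
    """Mean and standard deviation of X + Y given the covariance of X and Y."""
    total_mean = mean_x + mean_y
    total_var = sd_x ** 2 + sd_y ** 2 + 2 * cov_xy
    return total_mean, sqrt(total_var)


# Hypothetical stocks: means $20 and $40, standard deviations $3 and $4.
independent = sum_mean_sd(20, 3, 40, 4, 0)
positive_cov = sum_mean_sd(20, 3, 40, 4, 12)
negative_cov = sum_mean_sd(20, 3, 40, 4, -12)

print(independent, positive_cov, negative_cov)
```

The mean is unaffected by the covariance. A positive covariance increases the standard deviation of the total and a negative covariance decreases it.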

8) In the chapter entitled “The Normal Distribution and Other Continuous Distributions” do textbook problems 6 and 38.