1. In the notes we give the tree of possible tests for a process with actual probabilities .6 of yes and .4 of no. Recall that 1 indicates a successful prediction and 0 a failure. Extend the tree two levels so that we have data for 5 tests. Then calculate (basically by counting) the following probabilities. (Express your answers as unsimplified fractions.)

   Pr(5 successes and 0 failures) = ______
   Pr(4 successes and 1 failure)  = ______
   Pr(3 successes and 2 failures) = ______
   Pr(2 successes and 3 failures) = ______
   Pr(1 success and 4 failures)   = ______
   Pr(0 successes and 5 failures) = ______

   What is the most likely outcome for a test of 5 instances with a rule that has real error rate ? _____________________

2. Consider the contact lens data on page 4. The data has three groups of 8 instances each, i.e. 8 instances young ..., 8 pre-presbyopic ..., and 8 presbyopic .... Let the training set be the first 4 of each of these groups and the test set the second 4 from each. Let the learned rules be exactly the tuples that appear. That is, if we have a tuple a, b, c, d, e then we will have a rule a & b & c & d => e (also add the rule => none, so that we will always get a conclusion).

   What is the re-substitution error rate of the rules? _________
   What is the error rate of the rules on the test set? _________

3. A test set had 570 tuples. We correctly predicted 400. What is the expected actual success rate with 95% confidence?

4. Cross-validation should use stratification. Stratification requires that the training set have proportions of the classes being learned similar to those in the sample data as a whole. 10-way cross-validation does not really apply to small data sets. However, suggest two splits of the data on page 4 such that we have roughly 90% of the data for training and 10% for testing and they are nearly stratified.
   (Identify the tuples by their number in order on page 4.)

   split   training data                                         testing data
   1       ___________________________________________________   _____________________
   2       ___________________________________________________   _____________________

5. How many folds would leave-one-out cross-validation do for the data on page 4? ________

6. Assume the following instances are randomly selected by bootstrap from the example on page 4:

   1,1,3,4,5,6,7,7,9,11,11,12,12,14,15,15,15,18,18,20,21,22,22,24

   What is the test set (list in order): __________________________________________________

   What is the bootstrap error rate of the rule on page 5? (Add the rule: else => recommendation = hard, to make it complete.)

   error rate = ______________________ + ____________________ = ______________

7. Two data learning schemes have been tested on a set of sample data using 10-fold cross-validation. Their average success rate for each fold is given in the table below:

   scheme 1   .7    .65   .66   .57   .45   .67   .71   .68   .65   .59
   scheme 2   .45   .8    .51   .48   .5    .67   .68   .72   .66   .65

   Is either scheme better at the 90% confidence level? (Show the important values in your calculation.)

8. Using the weather data with probabilities (p. 82):

   a) Calculate the quadratic loss for the weather tuples on page 9. Your answer should be the sum of the loss for each instance.
   b) What is the value of the informational loss function for the same data?

9. Consider a data set of 10,000 instances with 1,000 positives. Assume we can tune our ML-learned rule to yield either of the results shown in figure 5.1, p. 141. That is, setup A picks about 10% of the 10,000 as positive, and 400 of these can be expected to respond (error rate 6/10). Setup B picks 4,000 of the 10,000, and 800 are expected to respond (error rate (40-8)/40). If it costs $10 to call on a customer (he is picked) and we gain $100 if he buys (responds), which setup is better to use, based on dollars gained? (Show how much is gained for each.)

10.
    Assume the total sample was 10,000 again and that 8,000 are truly negative and 2,000 are truly positive. Assume the smooth ROC curve of figure 5.2 describes our ML rule for the data. Assume basic costs as above. What is our benefit/cost to get a 40% response rate from the selected (positive) instances?

11. Use just the first five instances in the CPU performance data, Table 1.5, p. 15. Find their PRP using the regression formula in chapter 3, p. 71. Now compare these values to the actual values listed in the table by calculating the error estimates requested below:

    mean squared error          = _________
    root mean squared error     = _________
    mean absolute error         = _________
    relative squared error      = _________
    root relative squared error = _________
    relative absolute error     = _________

Dr. Riggs    Homework 4    Datamining    07/23/00
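
As a way to check hand calculations for the error estimates in problem 11, the usual numeric-prediction measures can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function name error_measures is hypothetical, the sample numbers are made up and are NOT taken from Table 1.5, and the baseline for the relative measures is assumed to be the mean of the actual values passed in.

```python
import math

def error_measures(predicted, actual):
    """Return common numeric-prediction error measures as a dict."""
    n = len(actual)
    mean_actual = sum(actual) / n
    # Summed squared and absolute errors of the predictions.
    sq_err = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    abs_err = sum(abs(p - a) for p, a in zip(predicted, actual))
    # Assumed baseline: always predict the mean of the supplied actual values.
    sq_base = sum((a - mean_actual) ** 2 for a in actual)
    abs_base = sum(abs(a - mean_actual) for a in actual)
    return {
        "mean squared error": sq_err / n,
        "root mean squared error": math.sqrt(sq_err / n),
        "mean absolute error": abs_err / n,
        "relative squared error": sq_err / sq_base,
        "root relative squared error": math.sqrt(sq_err / sq_base),
        "relative absolute error": abs_err / abs_base,
    }

# Hypothetical values for illustration only; NOT from Table 1.5.
predicted = [2.0, 4.0, 6.0, 8.0]
actual = [1.0, 5.0, 5.0, 9.0]
for name, value in error_measures(predicted, actual).items():
    print(f"{name} = {value:.4f}")
```

To use it on the homework data, replace the two lists with the five PRP values from the regression formula and the five actual values from the table.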