Statistics 401 C
Final Exam
Name:
December 20, 2001

INSTRUCTIONS: Read the questions carefully and completely. Answer the questions and show work in the space provided not on extra sheets. Credit will not be given if work is not shown. Turn in the exam and printouts at the end of the examination period.

1. [35 pts] The small winged fruit of maple trees are called samara. When a samara falls, it spins to the ground and, if conditions are right, starts to grow into a maple tree. A forest scientist studied the velocity (Y) with which samara fall. Below are summaries of data for random samples of samara from two different maple trees.

    Tree 1        Tree 2
    n1 = 12       n2 = 12
    Ȳ1 = 1.23     Ȳ2 = 0.95
    s1 = 0.098    s2 = 0.084

(a) [11] Use these summaries to test the hypothesis that the mean velocities for samara from the two trees are the same against the alternative that they are different.

The analysis in (a) is criticized because it fails to account for a covariate, the load of each samara. The load is a quantity based on the size and weight. Refer to the JMP output entitled samara.

(b) [7] Consider the full model that predicts Velocity based on Load, an indicator variable (Ind=0 if Tree 1 and Ind=1 if Tree 2) and the interaction between Load and Ind. Is the interaction term statistically significant? Support your answer.

(c) [5] What does the result in (b) indicate about the linear relationship between Velocity and Load for the two trees?

(d) [5] Compute the adjusted means for the two trees. Note that the average load is 0.20425.

(e) [7] Based on the adjusted means, are the two trees significantly different? Support your answer.

2. [35 pts] Marine biologists made measurements on the density of coral in the Great Barrier Reef off the coast of Australia. They also measured the distance to shore (km). Below is a plot of the data. You should also refer to the JMP outputs: coralden-Fit Y by X.

(a) [5] From the plot, describe the relationship between Density and Distance from shore.
(b) [10] Comment on the least squares fit of Density on Distance. In particular, is Distance, by itself, a significant predictor of Density? How much of the variability in Density is explained by the simple linear fit? Is there a pattern in the plot of residuals?

(c) [10] Comment on the least squares fit of Density on Distance and Distance^2. In particular, does Distance^2 add significant explanatory power to the simple linear fit? How much of the variability in Density is explained by the model? Is there a pattern in the plot of residuals?

(d) [10] Comment on the least squares fit of Density on Distance, Distance^2 and Distance^3. In particular, does Distance^3 add significant explanatory power to the quadratic fit? How much of the variability in Density is explained by the model? Is there a pattern in the plot of residuals?

3. [45 pts] Our population of interest is Major League baseball players who played at least one game in both the 1991 and 1992 seasons, excluding pitchers. A random sample of 80 players is taken from the population of interest. The 1992 salary is the response variable. The explanatory variables relate to various performance measures. A list of the variables appears below. Refer to the JMP output BBsalary. “Best” is defined as the highest R^2 with all variables significant at the 5% level.

• Salary: 1992 Salary in thousands of dollars
• BA: Batting average
• OBP: On base percentage
• Runs: Number of runs scored
• Hits: Number of hits
• Doubles: Number of doubles
• Triples: Number of triples
• HRs: Number of home runs
• RBI: Number of runs batted in
• Walks: Number of walks
• SOs: Number of strike outs
• SBs: Number of stolen bases
• Errors: Number of errors
• FAElig: Indicator of Free Agent Eligibility (Yes=1, No=0)
• FA91/2: Indicator of Free Agent 1991/92 (Yes=1, No=0)
• ArbElig: Indicator of Arbitration Eligibility (Yes=1, No=0)
• Arb91/2: Indicator of Arbitration 1991/92 (Yes=1, No=0)
• Name: Player’s name

(a) [6] Which of the variables, if used by itself in a simple linear regression, would provide the highest predictive power? What is the value of R^2 for this simple linear regression? Is the simple linear regression using this variable statistically significant?

(b) [6] Using the Forward selection procedure, what variables are in the final model? Give the R^2, adjR^2 and Cp values for this final model. Could this be the “Best” model? Explain briefly.

(c) [6] Using the Backward selection procedure, what variables are in the final model? Give the R^2, adjR^2 and Cp values for this final model. Could this be the “Best” model? Explain briefly.

(d) [6] Using the Mixed selection procedure, what variables are in the final model? Give the R^2, adjR^2 and Cp values for this final model. Could this be the “Best” model? Explain briefly.

(e) [5] Below are models with the highest R^2 for various numbers of variables.

    Number
    in Model   R^2      adjR^2   Cp        Variables in Model
    2          0.6275   0.6178   30.9999   RBI FAElig
    3          0.6675   0.6647   18.9151   RBI FAElig Arb91/2
    4          0.6907   0.6742   17.1737   RBI FAElig ArbElig Arb91/2
    5          0.7095   0.6898   13.8957   Runs HRs Walks FAElig Arb91/2
    6          0.7319   0.7099    9.5622   Runs Hits RBI Walks FAElig ArbElig
    7          0.7489   0.7245    6.7673   Runs Hits RBI Walks FAElig ArbElig Arb91/2
    8          0.7613   0.7345    5.2718   Runs Hits RBI Walks FAElig FA91/2 ArbElig Arb91/2
    9          0.7670   0.7370    5.6915   Runs Hits Triples RBI Walks FAElig FA91/2 ArbElig Arb91/2
    10         0.7693   0.7358    7.0334   Runs Hits Triples RBI Walks SOs FAElig FA91/2 ArbElig Arb91/2

• Does Forward selection find the 5 variable model with the highest R^2?
• Does Backward selection find the 9 variable model with the highest R^2?
• Does Mixed selection find the 9 variable model with the highest R^2?

A “Best” model is found using 7 variables. The MSError is 431384.83. The analysis of residuals from this model appears in the JMP output BBSalary-resid.

(f) [4] Lance Parish was paid 109 thousand dollars. The “Best” model predicts he would be paid 1710.6 thousand dollars. What is the residual for Lance Parish? What is the standardized residual?

(g) [6] What is the value of the most extreme studentized residual? Is this value significantly different from zero? Use an overall level of 0.08 and adjust for the fact that you could do 80 tests. Report the degrees of freedom you should use. If this number is not in your t-table, use the t value for the closest degrees of freedom.

(h) [6] What is the value of the most extreme h value? Is this value significantly different from zero? Use an overall level of 0.08 and adjust for the fact that you could do 80 tests. Report the degrees of freedom you should use. If these are not in your F-table, use the F value for the closest degrees of freedom.

4. [10] In class we looked at the relationship between iris color (Blue, Brown and Green) and the critical flicker frequency (cff) using the ANOVA.
Another way to analyze these data is with two dummy variables:

• X1 = 1 if iris color is Blue, X1 = 0 otherwise
• X2 = 1 if iris color is Brown, X2 = 0 otherwise

Refer to the JMP output entitled Eyecff-Fit Least Squares.

(a) [6] Give the prediction equation and an interpretation, within the context of the problem, of each of the estimated coefficients.

(b) [4] According to this analysis, are Blue eyes different from Green eyes? Brown eyes from Green eyes? Support your answers and use a 0.05 level.

(c) [5 Extra Credit] Is there a significant difference between Blue and Brown eyes? You must support your answer by reasoning from the information in (b).
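
Study note: the dummy-variable coding in question 4 can be tried out with a small sketch. The cff readings below are invented for illustration and are not the class data or the JMP output; the helper `solve` and all variable names are likewise assumptions of this sketch. It fits cff on X1 and X2 by least squares using the normal equations and prints the estimates next to the three group means so the two sets of numbers can be compared.

```python
from statistics import mean

# Hypothetical cff readings, invented for illustration only (NOT the class data).
cff = {
    "Blue":  [28.0, 29.5, 27.5],
    "Brown": [25.0, 26.0, 24.5, 26.5],
    "Green": [26.5, 27.0, 26.0],
}

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# One design row per observation: (intercept, X1 = Blue indicator, X2 = Brown indicator).
rows, y = [], []
for color, values in cff.items():
    for v in values:
        rows.append([1.0, float(color == "Blue"), float(color == "Brown")])
        y.append(v)

# Least squares estimates from the normal equations (X'X) b = X'y.
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
b0, b1, b2 = solve(xtx, xty)

means = {color: mean(values) for color, values in cff.items()}
print(f"b0 = {b0:.4f}   Green mean = {means['Green']:.4f}")
print(f"b1 = {b1:.4f}   Blue mean - Green mean  = {means['Blue'] - means['Green']:.4f}")
print(f"b2 = {b2:.4f}   Brown mean - Green mean = {means['Brown'] - means['Green']:.4f}")
```

Green gets no indicator of its own, so it serves as the baseline category in this coding. Choosing a different baseline would change the individual coefficients but not the fitted group values.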