Software effort estimation, measured in the number of hours required to develop software, is an important activity for any software development company. It is used for investment planning and for pricing software development. One common approach to software effort estimation is Function Point Analysis (FPA). First made public by Allan Albrecht of IBM in 1979, the FPA technique quantifies the functions contained within software in terms that are meaningful to the software users. The measure relates directly to the business requirements that the software is intended to address. It can therefore be readily applied across a wide range of development environments and throughout the life of a development project, from early requirements definition to full operational use. Other business measures, such as the productivity of the development process and the cost per unit to support the software, can also be readily derived.

Data was collected from all software projects completed by an AT&T data center from 1986 through 1991. The data contains 104 observations and 5 variables, namely: 1. number of worker hours, 2. function point count, 3. operating system used, 4. database management system and 5. programming language. Function point count, operating system, database management system and programming language are often used to help predict the number of worker hours that will be required to complete a proposed software project.

Data description:

S.No  Variable Name               Variable Type                                                      Code used in Regression
1     Number of Worker Hours      Continuous                                                         NWH
2     Functional Point Counts     Continuous                                                         FPC
3     Operating System            Categorical (0: Unix, 1: MVS)                                      0 = Unix, 1 = MVS
4     Database Management System  Categorical (1: IDMS, 2: IMS, 3: INFORMIX, 4: INGRESS, 5: Other)   D1: IDMS, D2: IMS, D3: INFORMIX, D4: INGRESS
5     Programming Language        Categorical (1: COBOL, 2: PLI, 3: C, 4: Other)                     L1, L2, L3
A simple linear regression is carried out between the number of worker hours (response) and functional point counts (predictor) using SPSS. The SPSS output for the model NWH = β0 + β1 × FPC is shown in Tables 1.1-1.3.
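As a rough illustration of the model form above, the following sketch fits a line of the form NWH = b0 + b1 × FPC by ordinary least squares using the closed-form slope and intercept. The function name `fit_simple_ols` and the five data points are hypothetical and chosen only to show the mechanics; they are not the AT&T data, and this is not the SPSS procedure itself.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope: change in y per unit change in x
    b0 = y_bar - b1 * x_bar   # intercept: fitted y when x = 0
    return b0, b1


# Hypothetical function point counts and worker hours (illustration only).
fpc = [100, 200, 400, 800, 1600]
nwh = [1500, 3000, 6000, 12000, 24000]  # constructed as exactly 15 hours per point

b0, b1 = fit_simple_ols(fpc, nwh)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

The slope plays the role of β1 (worker hours per additional function point) and the intercept the role of β0.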
Table 1.1 Descriptive Statistics
                    N    Minimum  Maximum  Mean     Std. Deviation
Function Points     104  102      3472     620.89   639.324
Number of Hours     104  283      72219    9976.23  11944.580
Valid N (listwise)  104
[Tables 1.2 and 1.3: SPSS regression output for the model above (dependent variable: number of worker hours); the values are not reproduced in this copy.]

Use Tables 1.1-1.3 to answer questions 1.1-1.3.
1.1 Is there a statistically significant (assume α = 0.05) relationship between the function points and the number of worker hours? (2 points)
1.2 What is the rate at which the number of worker hours changes when there is a change in the functional point count? (1 point)
1.3 For software with a functional point count of 2000, what is the maximum number of worker hours required to develop the software at the 95% confidence level? (3 points)
Table 1.4 shows the regression output between the number of worker hours and the predictors. Answer questions 1.4-1.5 based on the output provided in table 1.4
Table 1.4 Coefficients
                  Unstandardized Coefficients  Standardized Coefficients
Model 1           B           Std. Error       Beta    t       Sig.
(Constant)        -1054.810   2788.973                 -.378   .706
Function Points   14.101      1.025            .755    13.761  .000
D1                6718.913    1941.627         .271    3.460   .001
D2                10043.06    2419.735         .279    4.150   .000
D3                -1239.219   2197.356         -.041   -.564   .574
D4                -589.396    2942.136         -.012   -.200   .842
L1                2400.876    1745.254         .095    1.376   .172
L2                185.927     2432.629         .007    .076    .939
L3                -279.390    2755.877         -.006   -.101   .919
Operating System  -3368.162   2998.553         -.140   -1.123  .264
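The columns of Table 1.4 are linked by the usual relationship t = B / Std. Error. The sketch below recomputes t for every row from the reported B and Std. Error and compares it with the reported t. The dictionary layout and variable names are illustrative; small gaps are expected because the table values are rounded.

```python
# (B, Std. Error, reported t) for each row of Table 1.4.
table_1_4 = {
    "(Constant)": (-1054.810, 2788.973, -0.378),
    "Function Points": (14.101, 1.025, 13.761),
    "D1": (6718.913, 1941.627, 3.460),
    "D2": (10043.06, 2419.735, 4.150),
    "D3": (-1239.219, 2197.356, -0.564),
    "D4": (-589.396, 2942.136, -0.200),
    "L1": (2400.876, 1745.254, 1.376),
    "L2": (185.927, 2432.629, 0.076),
    "L3": (-279.390, 2755.877, -0.101),
    "Operating System": (-3368.162, 2998.553, -1.123),
}

# t statistic for H0: coefficient = 0 is the estimate divided by its standard error.
t_recomputed = {name: b / se for name, (b, se, _) in table_1_4.items()}

# Largest absolute gap between recomputed and reported t values.
max_gap = max(
    abs(t_recomputed[name] - t_reported)
    for name, (_, _, t_reported) in table_1_4.items()
)

for name, t in t_recomputed.items():
    print(f"{name:18s} t = {t:8.3f}")
print(f"largest gap vs reported t: {max_gap:.4f}")
```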
1.4 On average, how many additional worker hours are required if the software is developed using the database INFORMIX instead of INGRESS? Explain. (2 points)
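One way to think about a comparison between two dummy-coded categories: each dummy coefficient is measured against the same omitted base category, so the expected difference between two non-base categories is the difference of their coefficients, holding the other predictors fixed. The sketch below applies that arithmetic to the D3 and D4 coefficients of Table 1.4. It assumes D3 and D4 share the base category "Other", as the data description suggests; the variable names are illustrative.

```python
# Coefficients from Table 1.4 (unstandardized B).
d3_informix = -1239.219   # INFORMIX vs base category
d4_ingress = -589.396     # INGRESS vs base category

# Expected difference in worker hours, INFORMIX minus INGRESS,
# with function points, language and operating system held fixed.
diff_informix_vs_ingress = d3_informix - d4_ingress
print(f"INFORMIX - INGRESS = {diff_informix_vs_ingress:.3f} worker hours")
```

Neither coefficient is individually significant in Table 1.4 (Sig. .574 and .842), so the point difference should be read with that in mind.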
1.5 Among the predictors, which predictor has the least influence on the number of worker hours? (2 points)
A stepwise regression is carried out between number of worker hours (response variable) with functional point counts and the operating system as predictors. The results are shown in tables 1.5 and 1.6. Use tables 1.5 and 1.6 to answer question 1.6.
Table 1.5 Model Summary(c)
Model  R        R Square  Adjusted R Square  Std. Error of the Estimate
1      .810(a)  .655      .652               7046.957
2                                            6848.999
a. Predictors: (Constant), Function Points
b. Predictors: (Constant), Function Points, Operating System
c. Dependent Variable: Number of Worker Hours

Table 1.6 Coefficients
Rows: Model 1: (Constant), Function Point Counts; Model 2: (Constant), Function Point Counts, Operating System
Columns: Unstandardized Coefficients (B, Std. Error), t, Correlations (Zero-order, Partial, Part)
Values (column alignment lost in the source): -1303.13 14.546 3614.936 B 585.664 Std. Error 965.521 1.086 1179.658 1.062 404.753 0.810 0.240 0.811 0.254 0.787 0.149 0.810 0.810 0.810
1.6 Which of the following statements is true? Justify your response based on regression models 1 and 2. (1 point)
(A) Unix has more functional points than MVS.
(B) MVS has more functional points than Unix.
A stepwise regression output after including all the predictors in the model building is shown in the following table (table 1.7)
Table 1.7 Coefficients (dependent variable: number of worker hours)
                        Unstandardized Coefficients  Standardized Coefficients
Model  Predictor        B           Std. Error       Beta   t       Sig.
1      (Constant)       585.664     965.521                 .607    .545
       Function Points  15.124      1.086            .810   13.926  .000
2      (Constant)       -202.474    956.404                 -.212   .833
       Function Points  15.101      1.040            .808   14.524  .000
       D2               6418.217    2000.313         .179   3.209   .002
3      (Constant)       -1958.301   998.000                 -1.962  .053
       Function Points  14.382      .989             .770   14.548  .000
       D2               8628.863    1951.181         .240   4.422   .000
       D1               5413.766    1368.884         .218   3.955   .000
1.7 At the 95% confidence level, test whether using the database IMS requires, on average, at least 2000 more worker hours than the base category. (2 points)
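A test of a coefficient against a non-zero threshold uses t = (B - threshold) / Std. Error. The sketch below works through that arithmetic for the IMS dummy (D2). It rests on several assumptions that the source does not state: that the final stepwise model (Model 3 of Table 1.7) is the one to use, that the hypotheses are framed as H0: β ≤ 2000 versus H1: β > 2000, and that the one-sided 5% critical value for 100 degrees of freedom is about 1.660 (taken from a standard t table, not computed here).

```python
n_obs = 104                       # observations in the data set
k_predictors = 3                  # Model 3: Function Points, D2, D1
b_d2, se_d2 = 8628.863, 1951.181  # D2 row of Model 3, Table 1.7
threshold = 2000.0                # hypothesized minimum extra worker hours

# Test statistic for H0: beta_D2 <= 2000 vs H1: beta_D2 > 2000 (assumed framing).
t_stat = (b_d2 - threshold) / se_d2
df_resid = n_obs - k_predictors - 1

# Assumed one-sided critical value at alpha = 0.05 for df = 100 (t-table value).
T_CRIT_ONE_SIDED_05 = 1.660
reject_h0 = t_stat > T_CRIT_ONE_SIDED_05

print(f"t = {t_stat:.3f}, df = {df_resid}, reject H0: {reject_h0}")
```

A different framing of the hypotheses (for example, H0: β ≥ 2000) would use the same statistic but the opposite rejection region.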
1.8 When all the predictor values are zero, the regression model gives a negative value for worker hours. How can you explain the negative value of the constant? (2 points)
Question 2 (10 points)
Highly publicized CEO salaries in the US have generated sustained interest in the factors related to their compensation packages. Data on the total compensation (in dollars) of 164 CEOs in the financial sector, defined as the sum of salary plus any bonuses including stock options, was collected along with some potential explanatory variables. The details are provided in the following table:

Variable Name                   Variable Type
MBA                             Categorical (1 for MBA and 0 for No MBA)
Age (in years)                  Numerical
Years in firm (in years)        Numerical
Return over 5 years (%)         Numerical
Sales (in millions of dollars)  Numerical
Several models were developed for analysis. These, along with related information, are given below:

Model 1: R square = .439

Coefficients(a)
             Unstandardized Coefficients  Standardized Coefficients
Model 1      B        Std. Error          Beta   t       Sig.
(Constant)   10.443   .325                       32.098  .000
ln(Sales)    .523     .047                .662   11.253  .000
a. Dependent Variable: ln(Total Comp)
b. ln(Sales) is the natural logarithm of sales and ln(Total Comp) is the natural logarithm of total compensation.
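Because both sides of Model 1 are in natural logs, the slope can be read as an elasticity: scaling sales by a factor c scales the fitted compensation by c raised to the power .523. The sketch below illustrates this with the Model 1 coefficients. The function name `predict_total_comp` and the sales figures are hypothetical, sales are assumed to be entered in millions of dollars as in the variable table, and exponentiating a fitted log value is only a rough back-transformation rather than an exact estimate of mean compensation.

```python
import math

B0, B1 = 10.443, 0.523  # Model 1: ln(Total Comp) = B0 + B1 * ln(Sales)


def predict_total_comp(sales_millions):
    """Back-transform the fitted log compensation to dollars (rough estimate)."""
    return math.exp(B0 + B1 * math.log(sales_millions))


# Doubling sales multiplies fitted compensation by 2 ** B1, whatever the start level.
ratio = predict_total_comp(2000.0) / predict_total_comp(1000.0)
print(f"compensation ratio when sales double: {ratio:.3f}")
print(f"2 ** B1 = {2 ** B1:.3f}")
```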
Model 2:

Coefficients(a)
             Unstandardized Coefficients  Standardized Coefficients
Model 1      B        Std. Error          Beta   t       Sig.
(Constant)   10.420   .327                       31.865  .000
ln(Sales)    .522     .047                .661   11.213  .000
MBA          .093     .118                .047   .790    .431
a. Dependent Variable: ln(Total Comp)
Model 3: Model
R square = .463 Coefficientsa Unstandardized Standardized Coefficients Coefficients B Std. Error Beta 10.575 .380 .500 .094 .054 .046 .633
t 27.834 9.183
Sig.
lnSales_MBA .085 .040
a. Dependent Variable: ln(Total Comp)
b. lnSales_MBA is an interaction between ln(Sales) and MBA, that is, ln(Sales)*MBA.

2.1 Which variables in Model 3 have a significant relationship with Total Compensation? Clearly state any hypotheses used and assumptions made to draw your inference. (2 points)
2.2 From the models given above what can you conclude about having an MBA - does it or does it not have a significant impact on Total Compensation? If so, what is the impact? Support your inference with adequate explanation/work. (2 points)
2.3 Is Model 3 above better than Model 1? Why or why not? Use appropriate test(s) and give adequate explanations in support of your answer. (2 points)
Model 4: Model Coefficientsa Unstandardized Standardized Coefficients Coefficients B Std. Error Beta 2330831.84 450243.626 7
(Constant)
t 5.177 1.98771
2.4 Using Model 4 can we conclude that the average total compensation for CEOs who have an MBA is at least 5% more than those who do not have an MBA? State the hypotheses clearly and show all work. (2 points)
For question 2.5, use the following information: The stepwise method was used to develop a model for predicting the total compensation of CEOs using the data and independent variables described above. The SPSS output obtained is given below:
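Coefficients
                          Unstandardized Coefficients  Standardized Coefficients
Model  Predictor          B        Std. Error          Beta    t
1      (Constant)         10.443   .325                        32.098
       ln(Sales)          .523     .047                .662    11.253
2      (Constant)         10.223   .322                        31.753
       ln(Sales)          .532     .045                .673    11.784
       Return over 5 yrs  .010     .003                .194    3.389
3      (Constant)         9.233    .522                        17.694
       ln(Sales)          .516     .045                .652    11.451
       Return over 5 yrs  .010     .003                .199    3.523
       Age                .020     .008                .136    2.389
4      (Constant)         8.826    .548                        16.105
       ln(Sales)          .523     .045                .662    11.724
       Return over 5 yrs  .011     .003                .211    3.776
       Age                .030     .009                .208    3.194
       YearsFirm          -.011    .005                -.143   -2.196
Correlations (Zero-order, Partial, Part; column alignment lost in the source): .680 .156 .258 .671 .156 .228 .268 .186 .681 .156 .228 .070 .287 .246 -.172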
ANOVA(e)
Model  Source
1      Regression
       Residual
       Total
2      Regression
       Residual
       Total
3      Regression
       Residual
       Total
4      Regression  72.883
       Residual    70.290
       Total
a. Predictors: (Constant), ln(Sales)
b. Predictors: (Constant), ln(Sales), Return over 5 yrs
c. Predictors: (Constant), ln(Sales), Return over 5 yrs, Age
d. Predictors: (Constant), ln(Sales), Return over 5 yrs, Age, YearsFirm
e. Dependent Variable: ln(Total Comp)

2.5 Determine the value of the coefficient of determination for Model 3 above. (2 points)
Question 3 (10 marks)
With agriculture in crisis, the issue of crop insurance has become more important of late. Since it is difficult to verify the crop yield for each farmer, rainfall-based insurance has been introduced on a pilot scale in selected districts. It was introduced in Anantapur, a district about 100 km from Bangalore with low and erratic rainfall. Given the poor soil conditions, only groundnut is grown there. Insurance payments to farmers in a district are based on the rainfall recorded there. The ABC Insurance Company wanted to come up with a model to see how the total production depends on the rainfall. The complication is that production also depends on other factors, such as the total acreage under irrigation. After some trial regressions, the company analysts settled on a model with the following variables:

PROD  Total production in thousands of tons
IRR   Total irrigated area in thousands of hectares
NON   Total non-irrigated area in thousands of hectares
RAIN  Total rainfall in millimeters
Table 3.1 Model Summary
R                           .895(a)
R Square                    .801
Adjusted R Square           .787
Std. Error of the Estimate  703.6283
Change Statistics:
  R Square Change           .801
  F Change
  Degrees of freedom (two columns)
  Sig. F Change             .000
3.1 If stepwise regression was used to arrive at the table above (table 3.1), how many models did SPSS consider? Give reasons. [1 point]
[1 point]
Table 3.2 ANOVA(b)
Model       Sum of Squares  Df  Mean Square  F  Sig.
Regression  8.762E7                             .000(a)
Residual
Total
a. Predictors: (Constant), RAIN, IRR, NON
b. Dependent Variable: PROD
3.3 Fill in the blanks in the ANOVA table (Table 3.2) above for Sum of Squares, Df, Mean Square and F value. [2 points]
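The entries of a regression ANOVA table are tied together by a few identities: R Square = SS(Regression) / SS(Total), Mean Square(Residual) = (Std. Error of the Estimate)^2, and F = MS(Regression) / MS(Residual). The sketch below applies them to the figures reported in the model summary and in Table 3.2, assuming three predictors (RAIN, IRR, NON). Variable names are illustrative, and the results are approximate because the inputs are rounded.

```python
ss_regression = 8.762e7          # Table 3.2, regression sum of squares
r_square = 0.801                 # model summary
std_error_estimate = 703.6283    # model summary
k_predictors = 3                 # RAIN, IRR, NON (assumed from the footnote)

ss_total = ss_regression / r_square             # R^2 = SSR / SST
ss_residual = ss_total - ss_regression          # SST = SSR + SSE
ms_residual = std_error_estimate ** 2           # MSE = (std. error of estimate)^2
df_residual = round(ss_residual / ms_residual)  # SSE / MSE, rounded to an integer
df_total = k_predictors + df_residual
ms_regression = ss_regression / k_predictors
f_value = ms_regression / ms_residual

print(f"SS total ~ {ss_total:.4e}, SS residual ~ {ss_residual:.4e}")
print(f"df: regression {k_predictors}, residual {df_residual}, total {df_total}")
print(f"MS regression ~ {ms_regression:.4e}, MS residual ~ {ms_residual:.4e}")
print(f"F ~ {f_value:.2f}")
```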
Table 3.3 Coefficients(a)
             Unstandardized Coefficients  Standardized Coefficients                  95.0% Confidence Interval for B  Collinearity Statistics
Model 1      B         Std. Error         Beta    t       Sig.   Lower Bound  Upper Bound  Tolerance  VIF
(Constant)   -929.61   1097.238                   -.847   .401   -3140.948    1281.728
IRR          2.281     .192               .809    11.906  .000   1.895        2.667        .979       1.021
NON          .622      .159               .266    3.906   .000   .301         .943         .977       1.023
RAIN         2.112     .670               .213    3.152   .003   .761         3.462        .992       1.008
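Two relationships can be checked directly on Table 3.3: each confidence interval has the form B ± t_crit × Std. Error, and VIF is the reciprocal of Tolerance. The sketch below backs out the critical value implied by each reported interval and recomputes VIF. It takes the constant's interval as -3140.948 to 1281.728, which is an interpretation of the extracted values; the names used are illustrative.

```python
# (B, Std. Error, lower bound, upper bound) from Table 3.3.
rows = {
    "(Constant)": (-929.61, 1097.238, -3140.948, 1281.728),
    "IRR": (2.281, 0.192, 1.895, 2.667),
    "NON": (0.622, 0.159, 0.301, 0.943),
    "RAIN": (2.112, 0.670, 0.761, 3.462),
}

# Half-width of the interval divided by the standard error gives the
# critical t value that was used to build the interval.
implied_t_crit = {
    name: (upper - lower) / (2 * se) for name, (_, se, lower, upper) in rows.items()
}

# Variance inflation factor is the reciprocal of tolerance.
tolerance = {"IRR": 0.979, "NON": 0.977, "RAIN": 0.992}
vif = {name: 1 / tol for name, tol in tolerance.items()}

for name, t_crit in implied_t_crit.items():
    print(f"{name:11s} implied critical t = {t_crit:.4f}")
for name, value in vif.items():
    print(f"{name:11s} VIF = {value:.3f}")
```

The implied critical values cluster near 2.015, which is in line with a two-sided 95% interval on roughly 44 residual degrees of freedom.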
[1 point]
3.5 The Normal and Residual Plots are given below. Which assumptions of regression are tested by them, and are they satisfied? [1 point]
Selected SPSS output is given below:
Table 3.4 SPSS output on influential observations for a portion of the sample
Year     Prod ('000 ton)  ZRE      MAH      COO     LEV     DFF       DFB0      DFB1    DFB2    DFB3
1952-53  2930             0.05093  9.3422   0.0002  0.1988  10.0838   30.8270   0.0019  0.0036  0.0101
1953-54  3450             0.28396  14.4868  0.0147  0.3082  -97.9940  160.9821  0.0090  0.0297  0.0715
1954-55  4250             1.18262  5.1603   0.0604  0.1098  125.0308  421.8451  0.0426  0.0409  0.1986
1955-56  3860             0.07316  5.0852   0.0002  0.1082  7.6260    22.6753   0.0022  0.0035  0.0041
1956-57  4370             0.15086  6.0918   0.0012  0.1296  -18.7976  -17.8381  0.0047  0.0046  0.0307
1957-58  4710             0.211    1.1081   0.0005  0.0236  6.8998    8.5996    0.0064  0.0000  0.0007
1958-59  5180             1.1752   1.3166   0.0186  0.0280  42.4644   101.3696  0.0372  0.0064  0.0259
1959-60  4560             0.21501  1.4856   0.0007  0.0316  -8.3727   1.6727    0.0068  0.0006  0.0116
1960-61  4810             0.32379  1.0865   0.0013  0.0231  10.4735   11.8070   0.0098  0.0003  0.0027
1961-62  4990             0.45626  1.9852   0.0037  0.0422  21.6117   -2.3511   0.0147  0.0066  0.0383
1962-63  5060             0.14592  2.9114   0.0005  0.0619  -9.2660   27.0517   0.0054  0.0050  0.0038
1963-64  5300             0.74572  1.6166   0.0086  0.0344  30.6734   -32.8552  0.0246  0.0122  0.0258
1964-65  6000             0.81488  3.6257   0.0200  0.0771  62.2774   198.7153  0.0290  0.0311  0.0662
1965-66  4260             1.15514  5.6126   0.0633  0.1194  132.5887  208.5132  0.0454  0.0551  0.1539
1966-67  4410             2.07939  4.1774   0.1496  0.0889  180.3088  527.7487  0.0528  0.0634  0.3594
1967-68  5730             0.40227  2.4797   0.0035  0.0528  22.4848   -58.1897  0.0102  0.0133  0.0209
1968-69  4630             0.96035  1.0713   0.0110  0.0228  -30.8239  83.6995   0.0228  0.0170  0.0143
1969-70  5130             0.33885  0.8750   0.0012  0.0186  -9.7921   25.1432   0.0071  0.0056  0.0000
3.6 Identify observations that are leveraged and/or influential in the table (table 3.4). Explain clearly. [2 points]
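The MAH and LEV columns of Table 3.4 carry the same information: Mahalanobis distance equals (n - 1) times the centered leverage value, so the ratio MAH / LEV reveals the sample size behind the table. The sketch below checks this and ranks the listed years by leverage. Which cutoff to apply for "high" leverage or influence is a rule-of-thumb choice that the source does not specify, so none is hard-coded here; the list names are illustrative.

```python
years = [
    "1952-53", "1953-54", "1954-55", "1955-56", "1956-57", "1957-58",
    "1958-59", "1959-60", "1960-61", "1961-62", "1962-63", "1963-64",
    "1964-65", "1965-66", "1966-67", "1967-68", "1968-69", "1969-70",
]
mah = [9.3422, 14.4868, 5.1603, 5.0852, 6.0918, 1.1081, 1.3166, 1.4856, 1.0865,
       1.9852, 2.9114, 1.6166, 3.6257, 5.6126, 4.1774, 2.4797, 1.0713, 0.8750]
lev = [0.1988, 0.3082, 0.1098, 0.1082, 0.1296, 0.0236, 0.0280, 0.0316, 0.0231,
       0.0422, 0.0619, 0.0344, 0.0771, 0.1194, 0.0889, 0.0528, 0.0228, 0.0186]

# Mahalanobis distance = (n - 1) * centered leverage, so each ratio estimates n - 1.
ratios = [m / l for m, l in zip(mah, lev)]
implied_n = round(sum(ratios) / len(ratios)) + 1

# Rank the listed observations from highest to lowest leverage.
ranked = sorted(zip(lev, years), reverse=True)
top_three_by_leverage = [year for _, year in ranked[:3]]

print(f"implied sample size n = {implied_n}")
print("highest leverage years:", top_three_by_leverage)
```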
Year 1952-53 in the table has the following values for the dependent and independent variables:

Year     IRR   NON     RAIN  PROD*
1952-53  57.6  4742.4  351   2894.16551

PROD* is the predicted value of PROD using the model with all observations.

3.7 If the regression model was developed without this observation (1952-53) in the sample, how much would the predicted value for this observation change? [2 points]
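As a consistency check on how these figures fit together, the standardized residual (ZRE) for a case is its raw residual divided by the standard error of the estimate. The sketch below recomputes ZRE for 1952-53 from the actual production in Table 3.4 (2930), the predicted value PROD*, and the standard error of the estimate reported in the model summary (703.6283). Variable names are illustrative.

```python
prod_actual = 2930.0             # Table 3.4, 1952-53, in thousand tons
prod_predicted = 2894.16551      # PROD* from the full-sample model
std_error_estimate = 703.6283    # from the model summary

residual = prod_actual - prod_predicted
zre = residual / std_error_estimate   # standardized residual

print(f"residual = {residual:.5f}, ZRE = {zre:.5f}")
```

The result agrees with the ZRE of 0.05093 listed for 1952-53, which suggests the rows of Table 3.4 line up with the years as read here.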