Effort Estimation of Web Based Applications
A thesis submitted in partial fulfilment of the requirements for the degree of Bachelor of Technology in Computer Science and Engineering

By Abhijit Ghosh
Roll No:

Under the guidance of Prof. B. D. Sahoo

Department of Computer Science and Engineering
National Institute of Technology, Rourkela
May, 2007
National Institute of Technology, Rourkela

CERTIFICATE

This is to certify that the thesis entitled "Effort Estimation of Web Based Applications", submitted by Abhijit Ghosh in partial fulfilment of the requirements for the award of the Bachelor of Technology degree in Computer Science and Engineering at the National Institute of Technology, Rourkela (Deemed University), is an authentic work carried out by him under my supervision and guidance. To the best of my knowledge, the matter embodied in the thesis has not been submitted to any other university or institute for the award of any degree or diploma.

Date: 10 May, 2007

Prof. B. D. Sahoo
Dept. of Computer Science and Engineering
National Institute of Technology, Rourkela
ACKNOWLEDGEMENT

I am thankful to Prof. B. D. Sahoo, Professor in the Department of Computer Science and Engineering, NIT Rourkela, for giving me the opportunity to work under him and for extending every support at each stage of this project work. I would also like to convey my sincerest gratitude and indebtedness to all other faculty members and staff of the Department of Computer Science and Engineering, NIT Rourkela, who offered their effort and guidance at appropriate times, without which it would have been very difficult for me to finish the project work.

Date: May 10, 2007
Abhijit Ghosh
CONTENTS

A. ABSTRACT
B. LIST OF FIGURES
C. CHAPTERS
1. Introduction
1.1 Introduction
1.2 Difficulties in Software Effort Estimation
1.3 Objective of the Estimate
1.4 Benefits of Software Effort Estimation
1.5 Effort Estimation for Web Based Hypermedia Applications
1.6 Conclusion
2. Regression Analysis: An Overview
2.1 Introduction
2.2 History of regression
2.3 Definitions and notation used in regression
2.4 Types of regression
2.4.1 Linear regression
2.4.2 Nonlinear regression models
2.4.3 Non-continuous variables
2.4.4 Other models
2.4.5 Nonparametric regression
3. Methodology
4. Analysis and Results
4.1 Multiple Linear Regression Analysis
4.2 Stepwise Multiple Linear Regression Analysis
4.3 Polynomial Regression Analysis
5. Conclusion and Future Work
D. REFERENCES
ABSTRACT

Countless organizations around the world have developed commercial and educational applications for the World Wide Web, the best known example of a hypermedia system. But developing good Web applications is expensive, mostly in terms of time and degree of difficulty for the authors. Our study tries to predict the effort needed to develop web pages of a particular category; here we have restricted ourselves to the domain of news web sites. We try to forecast the average effort required to code a page of a new site belonging to the same category, based on an analysis of the data available from existing sites. Here we have considered data from the top ten news sites.

The number of pages in the site, the average number of lines per page, the average number of scripts per page, the link density, and the media density are taken into account while predicting the effort required to code a page of the site. We consider the effort required to be directly proportional to the number of lines of code. It can be expressed in man-hours only when we have information about the actual effort in man-hours required to code the site. We use multiple linear regression, stepwise regression and polynomial regression to analyze the data and obtain graphs which can be used to predict the approximate effort required to code a web page of a news site.

Finally, we devised a method to estimate the effort required, or the number of lines of code required, for a web page of a news site. These results can be used to devise a standard for the coding of newer news web sites. If they are incorporated into web authoring software, they would help the author stick to the guidelines and standards automatically.
List of Figures

Figure 1: Steps involved in software prediction
Figure 2: Residual case order plot
Figure 3: Tool used for stepwise regression analysis
Figure 4: Shows how the lines of code vary as a polynomial of degree 2 with the number of pages in the website
Figure 5: Shows how the lines of code vary as a polynomial of degree 4 with the number of links per page
Figure 6: Shows how the lines of code vary as a polynomial of degree 5 with the number of scripts per page
Figure 7: Shows how the lines of code vary as a polynomial of degree 5 with the number of images per page
Chapter 1
INTRODUCTION

1.1 Introduction
1.2 Difficulties in Software Effort Estimation
1.3 Objective of the Estimate
1.4 Benefits of Software Effort Estimation
1.5 Effort Estimation for Web Based Hypermedia Applications
1.6 Conclusion
1.1 Introduction

Effective software project estimation is one of the most challenging and important activities in software development. Let us first define what software is. Software is (1) instructions that, when executed, provide the desired function and performance, (2) data structures that enable the programs to adequately manipulate information, and (3) documents that describe the operation and use of the programs. The technological and managerial discipline concerned with the systematic production and maintenance of software products that are developed and modified on time and within cost estimates is known as software engineering. The primary goals of software engineering are to improve the quality of software products and to increase the productivity and job satisfaction of software engineers.

Software cost estimation is the process of predicting the effort required to develop a software system. Estimating the cost of a software product is one of the most difficult and error-prone tasks in software engineering. It is difficult to make an accurate cost estimate during the planning phase of software development because of the large number of unknown factors at that time, yet contracting practices often require a firm cost commitment as part of the feasibility study.

An accurate cost estimate is necessary to develop a reliable software system. Underestimating a project leads to under-staffing it (resulting in staff burnout), under-scoping the quality assurance effort (running the risk of low quality deliverables), and setting too short a schedule (resulting in loss of credibility as deadlines are missed). Overestimating a project is likely to make it cost more than it should (a negative impact on the bottom line), take longer to deliver than necessary (resulting in lost opportunities), and delay the use of resources on the next project.
1.2 Difficulties in Software Effort Estimation

Software effort estimation has grown steadily in importance. When the computer era began in the 1940s, there were few computers in use and applications were mostly small, one-person projects. As time moved on, computers became widespread. Applications grew in number, size and importance, and the costs of developing software grew as well. As a result, the consequences of errors in software cost estimation became more severe too. Even today, many cost estimates of software projects are not very accurate, and most are too low. This is not surprising if we look at the various difficulties we face when estimating software costs.

By far the greatest share of the total cost of a project arises from the salaries of the personnel. Other costs, such as license fees or new equipment, occur only once and are not too hard to estimate. Personnel costs, on the other hand, are highly correlated with the effort needed to carry out the project. Therefore we need a sufficiently accurate estimate of the total effort in order to make a reasonable estimate of the costs.

The effort is estimated from the size and complexity of the project, both of which derive from the specification. Because the requirements of the software are likely to change, we have to take this into account too when estimating the effort.

The large difference in productivity among software developers is one of the hardest problems to solve during the estimation process. An experienced developer will produce far more than a beginner. But because each project is unique and uses its own tools and languages, the experience level of the development team is hard to judge.

Another problem appears when humans are estimating. We all tend to underestimate immaterial things like software. This is also a reason why software is considered expensive by most people, although there is nothing to compare its costs with. Today's world would not be the same if there were no software.

1.3 Objective of the Estimate

Software cost estimating covers more than the name indicates. It is not just the total cost that is of interest; quite a few other things are estimated during the process. Of main interest are the following:
- Size of all deliverables
- Staff needed
- Schedule
- Effort
- Cost to develop
- Cost for maintenance/enhancement
- Quality
- Reliability

In this thesis, however, we are mainly concerned with the software effort in person-months, the project duration in calendar time, and the cost to develop in dollars (or local currency).

1.4 Benefits of Software Effort Estimation

The major benefit of a good software cost estimation model is that it provides a clear and consistent universe of discourse within which to address many of the software engineering issues that arise throughout the software life cycle. A well-defined software cost estimation model can help avoid the frequent misinterpretations, underestimates, overexpectations, and outright buy-ins which still plague the software field.

In a good cost estimation model, there is no way of reducing the estimated cost without changing some objectively verifiable property of the software project. This does not make it impossible to create an unachievable buy-in, but it significantly raises the threshold of credibility.

A related benefit of software cost estimation technology is that it provides a powerful set of insights into how a software organization can improve its productivity. Many of a software cost model's cost-driver attributes are management controllables: use of software tools and modern programming practices, personnel capability and experience, available computer speed, memory, and turnaround time, and software reuse. The cost model helps us determine how to adjust these management controllables to increase productivity, and further provides an estimate of how much of a productivity increase we are likely to achieve with a given level of investment.

1.5 Effort Estimation of Web Based Hypermedia Applications

By using measurement principles to evaluate the quality and development of existing Web applications, we can obtain feedback that will help us understand, control, improve, and make predictions about these products and their development processes. Prediction is a necessary part of an effective software process, whether it be authoring, design, testing, or Web development as a whole. As with any software project, having realistic estimates of the required effort early in a Web application's life cycle lets project managers and development organizations manage resources effectively.

As shown in Figure 1, the prediction process involves capturing data about past projects or past development phases within the same project, identifying size metrics and cost drivers and formulating theories about their relationship with effort, generating prediction models to apply to the project, and assessing the effectiveness of the prediction models.

This work focuses on effort prediction for the design and authoring processes. Design covers the methods used for generating the structure and functionality of the application, and typically doesn't include aspects such as hypermedia application requirements analysis, feasibility consideration, and application maintenance. In addition, our design phase also incorporates the application's conceptual design, reflected in map diagrams showing documents and links.

The data we used to generate our prediction models came from a quantitative case study evaluation, in which we measured a set of suggested size metrics and cost drivers for effort prediction. Some of our complexity metrics are adaptations from the literature of software engineering and multimedia. The metrics we propose characterize Web application size from two different perspectives: length and complexity.

Our prediction models derive from statistical techniques, specifically linear regression and stepwise multiple regression. We also use polynomial regression to study the effect of each predictor individually.
Figure 1: Steps involved in software prediction. [figure not preserved]

1.6 Conclusion

Software cost estimation, and particularly that of web based applications, is a very complex process. Many factors influence the cost of a project, so estimation tools are necessary to produce reliable estimates. Those tools still require some subjective inputs, but they also make sure that no important factor is omitted in the estimating process. Because every project is unique and there is no single best method, the best way to estimate software costs is to use several estimation methods. Using various regression techniques as prediction tools, we try to estimate the effort required in terms of the lines of code required per page.
Chapter 2
Regression Analysis: An Overview

2.1 Introduction
2.2 History of regression
2.3 Definitions and notation used in regression
2.4 Types of regression
2.4.1 Linear regression
2.4.2 Nonlinear regression models
2.4.3 Non-continuous variables
2.4.4 Other models
2.4.5 Nonparametric regression
Regression analysis

In statistics, regression analysis examines the dependence of a random variable, called the dependent variable (response variable, regressand), on other random or deterministic variables, called independent variables (predictors). The mathematical model of their relationship is the regression equation. Well-known types of regression equations are linear regression, logistic regression for discrete responses (both are generalized by the generalized linear model), and nonlinear regression. Besides the dependent and independent variables, the regression equations usually contain one or more unknown regression parameters (constants), which are estimated from given data. Applications of regression include curve fitting, forecasting of time series, modeling of causal relationships, and testing scientific hypotheses about relationships between variables. In real-world applications, data could come from any combination of public or private sources.

2.1 Introduction

Regression analysis estimates the strength of a modeled relationship between one or more response variables (also called dependent variables, explained variables, predicted variables, or regressands, usually named Y) and the predictors (also called independent variables, explanatory variables, control variables, or regressors, usually named X). The strengths of the relationships, given that the model is correct, are parameters of the model, which are estimated from a sample. Other parameters which are sometimes specified include error variances and covariances of the variables.

The theoretical population parameters are commonly designated by Greek letters (e.g. β), their estimated values by a "hatted" Greek letter (e.g. $\hat{\beta}$), and the sample coefficients by a Latin letter (e.g. b). This stresses the fact that the sample coefficients are not the same as the population parameters, but the distribution of those parameters in the population can be inferred from the estimates and the sample size. This allows researchers to test for the statistical significance of estimated parameters and to measure the goodness of fit of the model.

Still more generally, regression may be viewed as a special case of density estimation. The joint distribution of the response and explanatory variables can be constructed from the conditional distribution of the response variable and the marginal distribution of the explanatory variables. In some problems, it is convenient to work in the other direction: from the joint distribution, the conditional distribution of the response variable can be derived.

Regression lines can be extrapolated, that is, extended to predict values of the response for explanatory variables outside their original range. However, extrapolation may be very inaccurate and can only be used reliably in certain instances.

2.2 History of regression

The term "regression" was used in the nineteenth century to describe a biological phenomenon, namely that the progeny of exceptional individuals tend on average to be less exceptional than their parents, and more like their more distant ancestors. Francis Galton studied this phenomenon and applied the slightly misleading term "regression towards mediocrity" to it. For Galton, regression had only this biological meaning, but his work [1] was later extended by Udny Yule and Karl Pearson to a more general statistical context. [2]

2.3 Definitions and notation used in regression

The measured variable, y, is conventionally called the "response variable". Other terms include "endogenous variable," "output variable," "criterion variable," and "dependent variable." The controlled or manipulated variables, x, are called the explanatory variables. Other terms include "exogenous variables," "input variables," "predictor variables" and "independent variables."
2.4 Types of regression

Several types of regression analysis can be distinguished; all of these can be seen as special cases of the Generalized Linear Model.

2.4.1 Linear regression

Linear regression is a method for determining the parameters of a linear system, that is, a system that can be expressed as follows:

$$y = \sum_{i=1}^{p} \beta_i f_i(x)$$

where $\beta_i$ is called the parameter and $f_i$ is a function of $x$ only. This can be rewritten in matrix form as

$$y = \mathbf{f}(x)\,\boldsymbol{\beta}$$

where $\mathbf{f}(x)$ is a row vector that contains each of the functions, that is, $\mathbf{f}(x) = (f_1(x), f_2(x), \ldots, f_p(x))$, and $\boldsymbol{\beta}$ is a column vector containing the parameters, that is, $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_p)^T$.

The explanatory and response variables may be scalars or vectors. When both the explanatory and response variables are scalars, the resulting regression is called simple linear regression. When there is more than one explanatory variable, the resulting regression is called multiple linear regression. The general formulae are the same for both cases.

Two common techniques for solving linear regression models are least squares analysis and robust regression. We use least squares regression in all cases.
2.4.2 Nonlinear regression models

A number of nonlinear regression techniques may be used to obtain a more accurate regression. An often-used alternative is a transformation of the variables such that the relationship of the transformed variables is again linear.

2.4.3 Non-continuous variables

If the variable is not continuous, specific techniques are available. For binary (zero or one) variables, there are the probit and logit models. The multivariate probit model makes it possible to estimate jointly the relationship between several binary dependent variables and some independent variables. For categorical variables with more than two values there is the multinomial logit. For ordinal variables with more than two values, there are the ordered logit and ordered probit models. An alternative to such procedures is linear regression based on polychoric or polyserial correlations between the categorical variables. Such procedures differ in the assumptions made about the distribution of the variables in the population. If the variable is positive with low values and represents the repetition of the occurrence of an event, count models like Poisson regression or the negative binomial model may be appropriate.

2.4.4 Other models

Although these three types are the most common, other approaches exist, such as supervised learning methods and unit-weighted regression.

2.4.5 Nonparametric regression

The models described above are called parametric because the researcher must specify the nature of the relationships between the variables in advance. Several nonparametric techniques may also be used to estimate the impact of an explanatory variable on a dependent variable. Nonparametric regressions, like kernel regression, require a high number of observations and are computationally intensive.
Chapter 3
Methodology

We first downloaded the top ten news sites from the World Wide Web. For each site we consider only the pages in its own domain.

TOP TEN NEWS SITES [list not preserved in the source]

We used a VBScript to list all downloaded web pages and their directories for each website in a text file. Then a MATLAB script was run to compute the number of pages, the average number of lines per page, the average number of scripts per page, the link density, and the media density.

We assume the number of lines of code to be directly proportional to the effort required. It is therefore taken as the observation (y), and the other four parameters are taken as the regressors (x). The regressor data is represented as a 10 * 4 matrix X, and the observed data as a 10 * 1 matrix y.

X = [values not preserved in the source]

y = [values not preserved in the source]
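The counting step described above can be sketched as follows. This is a minimal illustration in Python, not the VBScript and MATLAB scripts used in the thesis, which are not reproduced in the source. It assumes that "scripts" means `<script>` tags, that link density is the number of `<a href=...>` tags per page, and that media density is the number of `<img>` tags per page. The names `PageMetrics`, `measure_page` and `site_row` are hypothetical.

```python
from html.parser import HTMLParser


class PageMetrics(HTMLParser):
    """Counts script, link, and image tags in one HTML page."""

    def __init__(self):
        super().__init__()
        self.scripts = 0
        self.links = 0
        self.images = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        elif tag == "a" and any(name == "href" for name, _ in attrs):
            # Assumption: only anchors that carry an href count as links.
            self.links += 1
        elif tag == "img":
            # Assumption: images stand in for "media".
            self.images += 1


def measure_page(html):
    """Return (lines, scripts, links, images) for one page's source text."""
    parser = PageMetrics()
    parser.feed(html)
    parser.close()
    return (len(html.splitlines()), parser.scripts, parser.links, parser.images)


def site_row(pages):
    """Average the per-page counts over all pages of one site.

    Returns (number_of_pages, avg_lines, avg_scripts, avg_links, avg_images).
    """
    counts = [measure_page(page) for page in pages]
    n = len(counts)
    averages = [sum(c[i] for c in counts) / n for i in range(4)]
    return (n, *averages)
```

Under these assumptions, the average lines per page would supply one entry of y, and the remaining four values one row of X, for each of the ten sites.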
Chapter 4
Analysis and Results

4.1 Multiple Linear Regression Analysis
4.2 Stepwise Multiple Linear Regression Analysis
4.3 Polynomial Regression Analysis
Analysis and Results

4.1 Multiple Linear Regression Analysis

Mathematical Foundations of Multiple Linear Regression

The linear model takes its common form

$$y = X\beta + \varepsilon$$

where:
- y is an n-by-1 vector of observations.
- X is an n-by-p matrix of regressors.
- β is a p-by-1 vector of parameters.
- ε is an n-by-1 vector of random disturbances.

The solution to the problem is a vector, b, which estimates the unknown vector of parameters, β. The least squares solution is

$$b = \hat{\beta} = (X^T X)^{-1} X^T y$$

This equation is useful for developing later statistical formulas, but has poor numeric properties. The MATLAB function regress uses QR decomposition of X followed by the backslash operator to compute b. The QR decomposition is not necessary for computing b, but the matrix R is useful for computing confidence intervals.

You can plug b back into the model formula to get the predicted y values at the data points:

$$\hat{y} = Xb = Hy, \qquad H = X (X^T X)^{-1} X^T$$

The residuals are the differences between the observed and predicted y values:

$$r = y - \hat{y} = (I - H)\,y$$

The residuals are useful for detecting failures in the model assumptions, since they correspond to the errors, ε, in the model equation.

A confidence interval is a range of values for a variable of interest, constructed so that the range has a specified probability of including the true value of the variable. The specified probability is called the confidence level, and the endpoints are called the confidence limits.

[b,bint,r,rint,stats] = regress(y,x)
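[Output of regress not preserved in the source. The listing showed stats, rint, r and b, with b printed using a scale factor of 1.0e+003.]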
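The least squares formulas of Section 4.1 can be checked on a small example. The sketch below is in Python rather than MATLAB and solves the normal equations $(X^T X)\,b = X^T y$ directly by Gaussian elimination. That is the numerically weaker route the text warns about, chosen here only for brevity; it is not how regress works internally. The data are made up for illustration, and the names `solve` and `least_squares` are hypothetical.

```python
def solve(a, rhs):
    """Solve the square system a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares(X, y):
    """Return (b, residuals) for the model y = X b + e."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    b = solve(xtx, xty)
    fitted = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return b, residuals


# Made-up data: a column of ones (intercept) plus one regressor.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [3.1, 4.9, 7.2, 8.8]
b, r = least_squares(X, y)
```

Because X contains a column of ones, the residuals in `r` sum to zero up to rounding, which is a quick sanity check on any least squares fit with an intercept.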
rcoplot(r,rint)

Figure 2: Residual case order plot. [figure not preserved]

Figure 2 shows a plot of the residuals with error bars showing 95% confidence intervals on the residuals. The fifth error bar doesn't pass through the zero line, indicating that the fifth observation is an outlier in the data.

4.2 Stepwise Multiple Regression Analysis

Stepwise Regression

Stepwise regression is a technique for choosing the variables, i.e., terms, to include in a multiple regression model. Forward stepwise regression starts with no model terms. At each step it adds the most statistically significant term (the one with the highest F statistic or lowest p-value) until there are none left. Backward stepwise regression starts with all the terms in the model and removes the least significant terms until all the remaining terms are statistically significant. It is also possible to start with a subset of all the terms and then add significant terms or remove insignificant terms.

An important assumption behind the method is that some input variables in a multiple regression do not have an important explanatory effect on the response. If this assumption is true, then it is a convenient simplification to keep only the statistically significant terms in the model.

One common problem in multiple regression analysis is multicollinearity of the input variables. The input variables may be as correlated with each other as they are with the response. If this is the case, the presence of one input variable in the model may mask the effect of another input. Stepwise regression might include different variables depending on the choice of starting model and inclusion strategy.

Figure 3: Tool used for stepwise regression analysis. [figure not preserved]

From the stepwise regression analysis it was found that the average number of lines per page, the average number of scripts per page, the link density, and the media density were statistically significant, whereas the number of pages per site was not statistically significant in determining the number of lines of code required for a web page.
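The forward variant of the procedure described in Section 4.2 can be sketched as below. This is a simplified Python illustration, not the interactive MATLAB tool of Figure 3: it admits a term when its partial F statistic exceeds a fixed threshold (`f_to_enter=4.0`, an assumed value) instead of computing p-values, it never removes a term once added, and it assumes the candidate columns are not perfectly collinear. The names `rss` and `forward_stepwise` are hypothetical.

```python
def rss(columns, y):
    """Residual sum of squares of the least squares fit of y on the given columns.

    An intercept column is always included. `columns` is a list of
    equal-length lists, one per regressor.
    """
    n = len(y)
    cols = [[1.0] * n] + columns
    p = len(cols)
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    m = [[sum(a * c for a, c in zip(cols[i], cols[j])) for j in range(p)]
         + [sum(a * yi for a, yi in zip(cols[i], y))] for i in range(p)]
    for k in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(k, p), key=lambda r: abs(m[r][k]))
        m[k], m[piv] = m[piv], m[k]
        for r in range(k + 1, p):
            f = m[r][k] / m[k][k]
            for c in range(k, p + 1):
                m[r][c] -= f * m[k][c]
    b = [0.0] * p
    for i in range(p - 1, -1, -1):  # back substitution
        b[i] = (m[i][p] - sum(m[i][j] * b[j] for j in range(i + 1, p))) / m[i][i]
    fitted = [sum(b[j] * cols[j][i] for j in range(p)) for i in range(n)]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def forward_stepwise(candidates, y, f_to_enter=4.0):
    """Greedy forward selection; returns the chosen regressor names in order."""
    n = len(y)
    chosen = []
    current = rss([], y)  # intercept-only model
    while len(chosen) < len(candidates):
        best_name, best_f, best_rss = None, f_to_enter, None
        for name, col in candidates.items():
            if name in chosen:
                continue
            new = rss([candidates[c] for c in chosen] + [col], y)
            dof = n - (len(chosen) + 2)  # intercept + chosen terms + new term
            if dof <= 0:
                continue
            # Partial F statistic for adding this one term.
            f_stat = (current - new) / max(new / dof, 1e-12)
            if f_stat > best_f:
                best_name, best_f, best_rss = name, f_stat, new
        if best_name is None:
            break  # no remaining term clears the threshold
        chosen.append(best_name)
        current = best_rss
    return chosen
```

As the text notes, the outcome depends on the inclusion strategy: with correlated inputs, a greedy forward pass of this kind can leave out a variable that a backward pass would keep.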
4.3 Polynomial Regression Analysis

Polynomial Regression

Based on the plot, it is possible that the data can be modeled by a polynomial function

$$y = a_0 + a_1 t + a_2 t^2 + \cdots$$

The unknown coefficients $a_0$, $a_1$, $a_2$, and so on can be computed by doing a least squares fit, which minimizes the sum of the squares of the deviations of the data from the model. If a fit does not approximate the data well, we can either increase the order of the polynomial fit or explore some other functional form to get a better approximation. Here we have taken each of the predictors separately and tried to get an optimal polynomial fit at the lowest possible order.

Figure 4: [figure not preserved] Figure 4 shows how the lines of code vary as a polynomial of degree 2 with the number of pages in the website.
Figure 5: [figure not preserved] Figure 5 shows how the lines of code vary as a polynomial of degree 4 with the number of links per page.

Figure 6: [figure not preserved] Figure 6 shows how the lines of code vary as a polynomial of degree 5 with the number of scripts per page.

Figure 7: [figure not preserved] Figure 7 shows how the lines of code vary as a polynomial of degree 5 with the number of images per page.
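The single-predictor polynomial fits of Section 4.3 can be sketched as follows. This Python illustration fits the coefficients by least squares through the normal equations and reports the sum of squared deviations, so that fits of increasing order can be compared. The data are made up; the thesis's measurements are not preserved in the source. The names `polyfit`, `polyval` and `sse` are hypothetical helpers defined here, not library calls.

```python
def polyfit(t, y, degree):
    """Least squares coefficients [a0, a1, ..., a_degree] of a polynomial in t."""
    p = degree + 1
    # Normal equations: sum_k a_k * sum(t^(i+k)) = sum(y * t^i), for i = 0..degree.
    m = [[sum(ti ** (i + k) for ti in t) for k in range(p)]
         + [sum(yi * ti ** i for ti, yi in zip(t, y))] for i in range(p)]
    for k in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(k, p), key=lambda r: abs(m[r][k]))
        m[k], m[piv] = m[piv], m[k]
        for r in range(k + 1, p):
            f = m[r][k] / m[k][k]
            for c in range(k, p + 1):
                m[r][c] -= f * m[k][c]
    a = [0.0] * p
    for i in range(p - 1, -1, -1):  # back substitution
        a[i] = (m[i][p] - sum(m[i][j] * a[j] for j in range(i + 1, p))) / m[i][i]
    return a


def polyval(a, t):
    """Evaluate a0 + a1*t + a2*t^2 + ... at a single point t."""
    return sum(ak * t ** k for k, ak in enumerate(a))


def sse(a, t, y):
    """Sum of squared deviations of the data from the fitted polynomial."""
    return sum((yi - polyval(a, ti)) ** 2 for ti, yi in zip(t, y))


# Made-up data lying exactly on y = 1 + 2t + 3t^2.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 6.0, 17.0, 34.0, 57.0]
linear = polyfit(t, y, 1)
quadratic = polyfit(t, y, 2)
```

Comparing `sse(linear, t, y)` with `sse(quadratic, t, y)` mirrors the choice described in the text: the order is raised only while the fit improves noticeably.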
Chapter 5
Conclusion and Future Work

5.1 Conclusion
5.2 Future Work
5.1 Conclusion

We were able to predict how the development of a website depends on various parameters, and to find out to what extent it depends on them, i.e. which parameters affect the observations more and which can be neglected. In this project work we devised a method to estimate the effort required, or the number of lines of code required, for a web page of a news site. These results can be used to devise a standard for the coding of newer news web sites. If they are incorporated into web authoring software, they would help the author stick to the guidelines and standards automatically.

5.2 Future Work

In this project work we devised a method to estimate the effort required, in terms of the number of lines of code, for a web page of a news site. However, these results may not be accurate in all cases, as it is the actual number of man-hours that matters. Total effort should be calculated as:

Total-effort = Σ PAE + Σ MAE + Σ PRE

where PAE is the page authoring effort, MAE the media authoring effort and PRE the program authoring effort. This data can be obtained only from sources internal to the development organization.

We considered only four predictors to construct this model, but there are other measures which should also be included, such as:
1. Reused Media Count (RMC) - Number of reused/modified media files.
2. Reused Program Count (RPC) - Number of reused/modified programs.
3. Total Page Complexity (TPC) - Average number of different types of media per page.
4. Total Effort (TE) - Effort in person hours to design and author the application.
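The total-effort formula in Section 5.2 is a plain sum over three kinds of authoring effort. A minimal sketch in Python, assuming each effort is recorded in person-hours per page, media file, or program; the function name `total_effort` and the sample figures are hypothetical:

```python
def total_effort(page_efforts, media_efforts, program_efforts):
    """Total-effort = sum of PAE + sum of MAE + sum of PRE, all in person-hours."""
    return sum(page_efforts) + sum(media_efforts) + sum(program_efforts)


# Made-up figures: two pages, one media file, two programs.
effort = total_effort([2.0, 3.5], [1.0], [4.0, 0.5])
```

With real per-item records from a development organization, a total of this kind could replace lines of code as the response variable y in the regressions of Chapter 4.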
REFERENCES

[1] Chris Triggs and Ian Watson, A Comparison of Development Effort Estimation Techniques for Web Hypermedia Applications.
[2] Emilia Mendes and Barbara Kitchenham, Within-company Effort Estimation Models for Web Applications.
[3] M. J. Shepperd, C. Schofield, and B. Kitchenham, "Effort Estimation Using Analogy," Proc. ICSE-18, IEEE Computer Society Press, Berlin,
[4] R. Gray, S. G. MacDonell, and M. J. Shepperd, "Factors Systematically Associated with Errors in Subjective Estimates of Software Development Effort: The Stability of Expert Judgement," IEEE 6th International Metrics Symposium, Boca Raton, November 5-6,
[5] P. Kok, B. A. Kitchenham, and J. Kirakowski, "The MERMAID Approach to Software Cost Estimation," ESPRIT Annual Conference, Brussels, p: ,
[6] T. DeMarco, Controlling Software Projects: Management, Measurement and Estimation, Yourdon: New York,
[7] B. W. Boehm, Software Engineering Economics, Prentice-Hall: Englewood Cliffs, N.J.,
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
SAS Software to Fit the Generalized Linear Model
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
Simple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)
CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.
Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.
Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are
Volume 5 No. 4, June2015 A Software Cost and Effort Estimation for web Based Application
A Software Cost and Effort Estimation for web Based Application Dr. Tulika Pandey, tulika.tulika @ shiats.edu.in Assistant professor Department of Computer Science & Engineering SHIATS, Allahabad,India