A Direct Approach to Data Fusion


Zvi Gilula
Department of Statistics, Hebrew University

Robert E. McCulloch and Peter E. Rossi
Graduate School of Business, University of Chicago
1101 E. 58th Street, Chicago, IL

July 2003; this version, May 2004

Abstract

The generic data fusion problem is to make inferences about the joint distribution of two sets of variables without any direct observations of the joint distribution. Instead, information is available only about each set separately, along with some other set of common variables. The standard approach to data fusion creates a fused data set with the variables of interest and the common variables. Our approach directly estimates the joint distribution of just the variables of interest. For the case of either discrete or continuous variables, our approach yields a solution that can be implemented with standard statistical models and software. In typical marketing applications, the common variables are psychographic or demographic variables and the variables to be fused involve media viewing and product purchase. For this example, our approach will directly estimate the joint distribution of media viewing and product purchase without including the common variables. This is the object required for marketing decisions. In marketing applications, fusion of discrete variables is required. We develop a method for relaxing the assumption of conditional independence for this case. We illustrate our approach with product purchase and media viewing data from a large survey of British consumers.

Keywords: data fusion, predictive distributions, Bayesian analysis, media planning and buying

Acknowledgements: The authors wish to thank BMRB, London for access to their survey data. Rossi thanks the James M. Kilts Center for Marketing, Graduate School of Business, University of Chicago for partial support of this research.
1. Introduction

Data fusion is the problem of making inferences about the joint distribution of two sets of random variables (hereafter called the target variables) when only information on the marginal distribution of each set is available. For example, separate surveys are conducted about buying or purchase behavior and about media viewing behavior. Information is available on the marginal distributions of buying behavior and of media viewing, but there is no direct observation of the joint distribution. In media planning problems, inferences about the joint distribution of buying and viewing are desired. The problem, therefore, is to make inferences about the joint distribution of media viewing and buying without direct observation of these two sets of variables jointly.

The general problem of inference about a joint distribution on the basis of its marginals alone is not solvable. There are many possible joint distributions consistent with the same marginal distributions, so the joint distribution is not identified by knowledge of the marginals alone. Additional information must be brought to bear on the problem to solve it. In the case of the data fusion problem, fusion is made possible by some group of variables common to both sources of information on the marginals of the two sets of target variables. An example of this common information is demographic or psychographic variables. In the media planning example, demographic information is available both in the survey of buying and in the survey of media viewership. Data fusion methods use this common information to make inferences about the joint distribution. It should be clear that the presence of common variables alone is not sufficient to identify the joint distribution of the two sets of target variables. Additional assumptions must be made about the conditional distribution of the target variables given the common variables in order to achieve identification.
The term data fusion was coined for this problem to connote the merging or fusion of two data sets. One data set has one set of target variables and the common variables; the other data set has the other set of target variables (and the same common variables). The data set with the buying data, for example, must be fused with the data set with media viewing habits via the information in the common set of demographic variables. If all dependence between buying and media viewing is via the common variables, then it might be natural to view the data fusion problem as a sort of matching problem (see Kadane (1978) and Rodgers (1984)). A given record from the buying data must be matched with one or more records from the viewing data. One example of this matching is the hot deck method, in which records from one file are matched with records from the other file. If the common variables are discrete, one could simply find all the records in the other file that have exactly the same values of the common variables. If there is only one such record in the other file, we simply impute or match this one record. If there is more than one record in the other file with identical values of the common variables, we use some measure of the average value of the target variable or use multiple imputed values as in Rubin (1986). In marketing problems, the data sets are often produced by surveys and all variables are discrete. Moreover, many of the important variables are categorical in nature. The ultimate target variables whose joint distribution we wish to estimate are also discrete. Examples include media viewership and purchase, which are binary variables. Multiple imputation methods based on multivariate normal distributional assumptions (see Rassler (2002) for an excellent discussion of multiple imputation methods) are not applicable to many if not most marketing applications.
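To make the hot deck idea concrete, here is a minimal sketch of exact-match imputation on discrete common variables. All field names and records below are hypothetical, invented for illustration, not taken from any survey described in this paper:

```python
import random

def hot_deck_impute(recipient, donors, common_keys):
    """Impute the missing target variable for one recipient record by
    drawing at random from donor records whose common variables match
    exactly (illustrative sketch; field names are hypothetical)."""
    matches = [d for d in donors
               if all(d[k] == recipient[k] for k in common_keys)]
    if not matches:
        return None  # no exact match; a distance-based rule would be needed
    return random.choice(matches)["viewing"]

# Toy data: the buying file lacks 'viewing'; the viewing file lacks 'buying'.
buy_record = {"age_band": 2, "region": "north"}
view_file = [
    {"age_band": 2, "region": "north", "viewing": 1},
    {"age_band": 2, "region": "north", "viewing": 0},
    {"age_band": 3, "region": "south", "viewing": 1},
]
imputed = hot_deck_impute(buy_record, view_file, ["age_band", "region"])
```

When several donors match, drawing at random (rather than averaging) corresponds to the multiple-imputation flavor of the method; averaging the matched values would correspond to the "average value" variant mentioned above.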
The basic idea behind matching is to form groups of observations that are similar as measured by their common variable values. The groups of observations can be used to impute the values of the target variables that are not observed or missing in a particular data set. As such, the problem of data fusion can be cast as a missing data problem, as emphasized by Rubin (1986). The imputation groups can be formed by simple rules, such as having identical values of the demographic variables or close values as defined by a distance metric (Rassler (2002, pp. 19, 56, 68), Moriarity and Scheuren (2001, 2003)). Kamakura and Wedel (1997, 2000) generalize this notion quite elegantly by defining imputation groups implicitly via a finite mixture model. All observations in a given mixture component form the imputation group, and the imputation group memberships are inferred from the data rather than imposed via some sort of ad hoc distance metric. In the Kamakura and Wedel approach, multiple imputations or simulation methods can be used to conduct formal inference, properly accounting for uncertainty about imputation group assignment.

Our approach is to directly estimate the joint distribution of the target variables rather than use a matching or concatenation approach. The joint distribution can then be used to solve the inference problems desired for marketing decisions. Our approach works equally well with either discrete or continuous target and common variables. In particular, we do not require any explicit modeling of the distribution of the common variables and instead condition on these variables. This reduces the number of parameters estimated as well as the possible specification errors that might occur from postulating a joint distribution of the common variables. Our focus is on marketing decisions for which the joint distribution is the ultimate goal of the analysis. Multiple imputation and other fusion approaches are designed for
more generic situations in which the ultimate goal of the analysis is not known at the time of fusion. Our approach is also designed to exploit existing methods for modeling the conditional distribution of the target variables given the common variables rather than requiring specialized code. Standard methods (such as logit or regression models) can be used interchangeably with more involved state-of-the-art techniques.

The paper is organized as follows. In the next section, we outline a general framework for the data fusion problem and present our general approach. The assumption of conditional independence plays an important role in many approaches to data fusion. We discuss how other approaches relate to our general formulation of the problem and how these approaches either do or do not employ conditional independence. In section 3, we develop a method for relaxing the assumption of conditional independence that is useful when some fused data or prior information is available. In section 4, we illustrate the value of our approach by using buying and media viewing data from a large survey of British consumers. We show that our approach achieves highly accurate fusion without the use of highly parameterized models or specialized code.
2. A Framework for Data Fusion

In order to develop a general framework for data fusion, we need to propose a precise definition of the data fusion problem. Much of the data fusion literature takes the view that the goal of data fusion is to combine or fuse two datasets into one complete data set. To establish notation, let $D_b = \{(x_i, b_i),\ i = 1, \ldots, N_b\}$ denote a dataset of observations on one target variable (denoted by $b$) and the common variables $x$. $D_m = \{(x_i, m_i),\ i = 1, \ldots, N_m\}$ denotes the dataset of observations on the other target variable (with their associated common variables). We label the target variables $b$ and $m$ to suggest a media buying situation in which $b$ would denote product purchase or usage and $m$ would denote media viewing. Typically, $x$ is a high-dimensional vector of variables. While this notation conforms to our data application, the problem is obviously more general.

The data fusion literature regards the problem as creating one combined dataset with observations on all three sets of variables $(x, b, m)$ (this is sometimes referred to as file concatenation). For example, Rubin (1986) views the problem as a missing data problem: the information on $m$ is missing from the $D_b$ dataset and must be imputed. Our view is that the goal of data fusion is to form inferences about the joint distribution of $(b, m)$ using the information in the data $D = (D_b, D_m)$. Estimates of the joint distribution of $(b, m)$ can then be used to solve whatever decision problem is required by the marketing application. For example, in media planning, media choices that have a high proportion of viewers who purchase the advertised product are considered desirable. Therefore, media choice depends on aspects of the joint distribution of $b$ and $m$. Below we discuss under what circumstances we require the joint probability of $b$ and $m$ or simply the conditional probability of $b$ given $m$.
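A toy rendering of this setup (all field names hypothetical) shows the structure: two files share the common variables $x$, but each observes only one of the two target variables:

```python
# Toy illustration of the data fusion setup (hypothetical field names):
# D_b observes (x, b) pairs, D_m observes (x, m) pairs; b and m are
# never observed together, but both files share the common x variables.
D_b = [
    {"x": {"age_band": 2, "region": "north"}, "b": 1},  # purchase record
    {"x": {"age_band": 3, "region": "south"}, "b": 0},
]
D_m = [
    {"x": {"age_band": 2, "region": "north"}, "m": 0},  # viewing record
    {"x": {"age_band": 1, "region": "south"}, "m": 1},
]

# The fusion target is the joint distribution of (b, m),
# which no single record in either file identifies on its own.
common_vars = set(D_b[0]["x"]) & set(D_m[0]["x"])
```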
Our goal can then be succinctly stated as the computation of the predictive distribution of $(b, m)$ given the data $D = (D_b, D_m)$. The predictive distribution is obtained by integrating out over the parameters of the joint distribution:

$$p(b, m \mid D) = \int p(b, m \mid \theta, D)\, p(\theta \mid D)\, d\theta .$$

Since $b$ and $m$ are not observed jointly but only separately along with the common variables, we must provide a model for the conditional distribution of $(b, m)$ given $x$. As discussed below in the section on identification, some further assumptions are required to identify this model. We will start with the assumption of conditional independence; that is,

(1) $p(b, m \mid x, \theta) = p(b \mid x, \theta)\, p(m \mid x, \theta)$

The idea here is that the source of commonality between $b$ and $m$ is the $x$ variables, and that after controlling for or conditioning on these variables the dependence between $b$ and $m$ is eliminated. For situations with a rich array of $x$ variables, this may be a reasonable approximation. What is important to emphasize is that some assumption regarding dependence must be made to solve the data fusion problem. We start with the assumption of conditional independence, for which we believe a reasonable argument can be made. However, without direct information on the joint distribution of $b$ and $m$, this assumption cannot be tested. Parts of the literature on data fusion make no explicit mention of the assumption of conditional independence but are implicitly assuming it. Others, such as Rodgers (1984), make the assumption explicitly. In the identification section below, we discuss the other approaches and the implicit and explicit assumptions made regarding conditional independence. We will also develop a method for relaxing conditional independence that can apply to many marketing applications.
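The content of assumption (1) can be illustrated by a small simulation (a sketch with invented probabilities, not the authors' data): a single binary $x$ drives both $b$ and $m$, so the two are correlated marginally, yet conditional on $x$ they are drawn independently:

```python
import random

random.seed(0)

# Simulate the conditional-independence structure of equation (1):
# x drives both b and m; given x, the two are drawn independently.
draws = []
for _ in range(50000):
    x = random.choice([0, 1])       # a single binary common variable
    pb = 0.7 if x else 0.2          # Pr(b = 1 | x)
    pm = 0.6 if x else 0.1          # Pr(m = 1 | x)
    b = int(random.random() < pb)
    m = int(random.random() < pm)
    draws.append((x, b, m))

# Marginally, b and m move together because both are driven by x ...
p_bm = sum(b * m for _, b, m in draws) / len(draws)
p_b = sum(b for _, b, _ in draws) / len(draws)
p_m = sum(m for _, _, m in draws) / len(draws)
# ... so p_bm exceeds p_b * p_m, even though within each x-cell
# the product rule p(b, m | x) = p(b | x) p(m | x) holds.
```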
Under the assumption of conditional independence, the predictive distribution of $b$ and $m$ can be computed as follows:

$$p(b, m \mid D) = \int\!\!\int p(b, m \mid x, \theta)\, p(x)\, p(\theta \mid D)\, dx\, d\theta = \int\!\!\int p(b \mid x, \theta)\, p(m \mid x, \theta)\, p(x)\, p(\theta \mid D)\, dx\, d\theta$$

Here $p(\theta \mid D)$ is the posterior distribution of the parameters given the two datasets and $p(x)$ is the marginal distribution of the common variables. In general, $x$ may not be continuous. Therefore, it may be useful to view the inner integral above as the expectation of the conditional distribution of $(b, m)$ given $x$ and $\theta$ with respect to the marginal distribution of the $x$ variables:

$$p(b, m \mid D) = \int E_x\big[\, p(b, m \mid x, \theta) \,\big]\, p(\theta \mid D)\, d\theta = E_{\theta \mid D}\, E_x\big[\, p(b, m \mid x, \theta) \,\big]$$

In order to compute the expectation with respect to the marginal distribution of $x$, it is not necessary to model the distribution of $(b, x)$ and $(m, x)$ or even just the marginal of $x$. We only require the ability to take the expectation with respect to this distribution. The $x$ variables may exhibit many forms of dependence and mixtures of discrete and continuous distributions. Given that we only require the expectation and not the entire distribution, we can simply approximate this expectation by summing over the observations. This avoids making arbitrary distributional assumptions or confronting the very difficult nonparametric problem of approximating the high-dimensional distribution of $x$. In survey work, we typically have samples of several thousand or more, so this approximation is apt to be very accurate. Our approach to computing the predictive distribution of $b$ and $m$ is simply to form the expectation,
(2) $p(b, m \mid D) = E_{\theta \mid D}\, E_x\big[\, p(b, m \mid x, \theta) \,\big] \approx E_{\theta \mid D} \left[ \frac{1}{N_b + N_m} \sum_{x_i \in D} p(b \mid x_i, \theta)\, p(m \mid x_i, \theta) \right]$

The summation is over all observations of $x$ in both datasets. The outer expectation with respect to the posterior distribution of $\theta$ can easily be computed using draws from modern MCMC methods or by even less computationally demanding methods such as importance sampling. As a practical matter, this means that we only have to model the conditional distribution of $b$ given $x$ and of $m$ given $x$ to perform data fusion. In typical situations, each element of $b$ and $m$ is either binary, in which case simple logit models might suffice, or continuous, in which case standard regression models could be used. Diagnostics can be performed on these model fits to select among alternative specifications. Modeling the conditional distribution of $b$ given $x$ or $m$ given $x$ is considerably less demanding than modeling the joint distribution of $(b, x)$ and/or $(m, x)$. This reduces computational requirements and guards against model misspecification.

2.1 Joint or Conditional Probability?

In order to determine which aspects of the joint distribution of $b$ and $m$ are required, we must examine the media buying decision. Consider the problem of allocating a media buying budget over $K$ possible media (in our case, over $K$ possible TV shows). We view the objective as maximizing the total exposure to consumers who have revealed an interest in the product via purchase ($b$). Thus, the media buying decision can be formalized as

$$\max_{Q_1, \ldots, Q_K} \sum_k \Pr(b = 1 \text{ and } m_k = 1)\, Q_k \quad \text{s.t.} \quad \sum_k P_k Q_k = E$$
Here $E$ is the total media budget, $P_k$ is the price per exposure for medium $k$, and $Q_k$ is the number of exposures purchased for medium $k$. Note that the total number of exposures will be proportional to the probability of consumers viewing medium $k$ and purchasing the product, which is simply the joint probability of $b$ and $m_k$. This takes into account both the total viewership of medium $k$ and the proportion of viewers of medium $k$ who have expressed an interest in the product category via purchase. The solution to the problem posed above is to purchase the medium with the highest ratio, $\Pr(b \text{ and } m_k)/P_k$. This implies that the joint probability of $b$ and $m_k$ is the object of interest for media planning. However, if the price of a medium is proportional to the size of its viewership, $P_k = c \Pr(m_k = 1)$, then this optimality condition simply becomes to choose the medium with the highest conditional probability:

$$\frac{\Pr(b \text{ and } m_k)}{P_k} = \frac{\Pr(b \text{ and } m_k)}{c \Pr(m_k = 1)} \propto \Pr(b \mid m_k)$$

2.2 Situations in which only $b \mid m$ is Desired

As discussed above, there are some situations in which we do not require estimates of the entire joint distribution of $(b, m)$ but only the conditional distribution of $b \mid m$. In these cases, some computational simplifications can be achieved over the approach outlined above. The goal now becomes to compute the predictive distribution of the conditional distribution of $b \mid m$:

$$p(b \mid m, D) = \int p(b, \theta \mid m, D)\, d\theta = \int p(b \mid \theta, m, D)\, p(\theta \mid m, D)\, d\theta$$

We now introduce the conditioning $x$ variables into the expression:

$$= \int\!\!\int p(b, x \mid \theta, m)\, dx\; p(\theta \mid D)\, d\theta = \int\!\!\int p(b \mid x, \theta, m)\, p(x \mid \theta, m)\, dx\; p(\theta \mid D)\, d\theta$$
Using the assumption of conditional independence, we obtain

$$= \int\!\!\int p(b \mid x, \theta)\, p(x \mid \theta, m)\, dx\; p(\theta \mid D)\, d\theta = \int\!\!\int p(b \mid x, \theta)\, p(x \mid m)\, dx\; p(\theta \mid D)\, d\theta$$

This expression means that we average the conditional distribution of $b \mid x$ with respect to the conditional distribution of $x \mid m$:

$$= \int E_{x \mid m}\big[\, p(b \mid x, \theta) \,\big]\, p(\theta \mid D)\, d\theta$$

We can approximate this conditional expectation by summing over the observations of $x$ for the given value of $m$:

(3) $p(b \mid m, D) \approx \frac{1}{N_{x \mid m}} \sum_{x_i \mid m} \int p(b \mid x_i, \theta)\, p(\theta \mid D)\, d\theta$

$N_{x \mid m}$ is the number of $x$ observations for which $m$ takes on the particular value. In the media viewing case, this simply means that we sum over the empirical distribution of $x$ for a specific medium. Thus, if we are only interested in computing $b \mid m$, we can simply model $b \mid x$ and sum over the relevant values of the $x$ variables. We avoid the effort and possible model misspecification errors associated with modeling $m \mid x$.

2.3 Identification and the Assumption of Conditional Independence

There is a fundamental identification issue in the data fusion problem (see also Rassler (2002), p. 5). The identification problem stems from the fact that we only observe data on the two marginal distributions of $(b, x)$ and $(m, x)$. The goal is to make inferences about the joint distribution of $(b, m)$. In our data fusion method, the distribution of $(b, m)$ is obtained from the conditional distribution by averaging over the marginal distribution of $x$: $p(b, m) = \int p(b, m \mid x)\, p(x)\, dx$. To see the identification problem, consider the alternative
definition that the joint distribution of $(b, m)$ is a marginal of the joint $(b, m, x)$: $p(b, m) = \int p(b, m, x)\, dx$. For any given marginal distributions of $(b, x)$ and $(m, x)$, there are many possible joint distributions $(b, m, x)$. This means that the data fusion problem is fundamentally unidentified without some sort of restriction on the joint distribution of $(b, m, x)$ or, equivalently, on the conditional distribution of $b, m \mid x$. We start with the restriction that $b$ and $m$ are independent conditional on $x$. This is based on the view that if the $x$ vector is rich enough, then $b$ and $m$ can be approximately independent. Clearly, if the $x$ vector does not have sufficient explanatory power, the assumption of conditional independence can be violated. If a source of prior information (e.g., a sample of fused data) is available, we can incorporate deviations from conditional independence, as illustrated in section 3. In many situations, the assumption of conditional independence may be reasonable. However, it is clear that there may be some situations in which the content of the $x$ vector is not sufficient to ensure conditional independence. For example, consider the case in which the $x$ vector contains only demographic variables. In order to ensure conditional independence, there must be no common component between category purchase and media viewership conditional on $x$. If the medium is narrowly focused on a specific interest, then the assumption of conditional independence might be violated. For example, consider the category of photographic equipment. Interest in photography is certainly related to demographics but is unlikely to be perfectly predicted by demographics alone. This means that there is likely to be a common component (interest in photography) present in both $b$ (purchase of cameras) and $m$ (readership of a photographic magazine).
However, for more general media such as TV programs, radio shows, newspapers, and general interest magazines, this is less likely to be a problem. It is important to realize that some restriction
is required and that, without an additional source of data on $(b, m)$, this assumption cannot be tested.

It is instructive to examine other approaches to data fusion to see what identification assumptions are imposed, either explicitly or implicitly. The oldest approaches to data fusion involve data matching of some sort. Equivalent groups of observations are identified using the $x$ variables. For example, in the hot deck approach, observations with the same values of the $x$ vector are assumed to be equivalent, or to be a random sample from the conditional distribution of $(b, m)$ given $x$. While it is not explicitly stated, the justification for these matching procedures is that conditional independence approximately holds (see also Rodgers (1984)). Data matching approaches that define a distance metric in the $x$ space (e.g., Soong and de Montigny (2001)) and use observations that are close in terms of their $x$ values also rely on the assumption of conditional independence. Kamakura and Wedel (1997) do not assume conditional independence and use a finite mixture of independent multinomials to approximate the joint distribution $(b, m, x)$. It is not clear whether their procedure will give rise to estimates of the joint distribution that display conditional independence.
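Once the two conditional models are fitted, the predictive computations of equations (2) and (3) reduce to simple averaging. The sketch below uses synthetic data, logit probabilities, and random parameter vectors as stand-ins for posterior draws of $\theta$ (nothing here comes from the actual survey):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x_all = rng.normal(size=(1000, 3))          # pooled x's from D_b and D_m
theta_b = rng.normal(size=(50, 3)) * 0.5    # stand-ins for posterior draws
theta_m = rng.normal(size=(50, 3)) * 0.5

# Equation (2), "joint" method: average p(b=1|x) p(m=1|x) over all x rows
# in both datasets (inner sum), then over the parameter draws (outer mean).
p_joint = np.mean([np.mean(sigmoid(x_all @ tb) * sigmoid(x_all @ tm))
                   for tb, tm in zip(theta_b, theta_m)])

# Equation (3), "direct" method: average p(b=1|x) over only the x rows
# for which m = 1, avoiding any model for m | x.
m_obs = (rng.random(1000) < 0.3).astype(int)   # observed viewing indicator
x_viewers = x_all[m_obs == 1]
p_direct = np.mean([np.mean(sigmoid(x_viewers @ tb)) for tb in theta_b])
```

In a real application the logit coefficients and their posterior draws would come from fitting $b \mid x$ and $m \mid x$ to the two survey files; only the averaging step is shown here.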
3. Relaxing the Conditional Independence Assumption

Our view is that conditional independence is a useful default or maintained model assumption. If the set of $x$ variables is comprehensive and predictive of $b$ and $m$ behavior, conditional independence is likely to hold. Relaxing the assumption of conditional independence requires additional information beyond the sample information, as the joint distribution of $b$ and $m$ is not identified. Supplemental information can come from a variety of sources. We will consider the possibility that a subset of data is available for which the complete distribution of $b$, $m$ and $x$ is observed.

There are many ways to incorporate conditional dependence by replacing (1) with some model of the conditional joint distribution of $(b, m) \mid x$. For example, Rassler (2002) introduces a prior distribution that captures some view of dependence for the case of multivariate normal $b$, $m$, and $x$ variables. The problem is that the results are very sensitive to the choice of this prior, and assessment of the prior can be difficult. Our view is that this prior information must ultimately come from jointly observed data on comparable $(b, m, x)$. Models of conditional dependence will depend on whether $b$ and $m$ are discrete or continuous and, even in the discrete case, on the number of values that $b$ and $m$ can take on. The literature has focused on multivariate normal models, which are of questionable relevance in marketing applications. Here we develop an approach to adding dependence for binary $b$ and $m$ variables, which is the most important case for many marketing applications. The table of the joint distribution is a four-cell multinomial distribution given $x$, with probabilities
$$\begin{aligned}
p_{11}(x) &= \Pr(b = 0, m = 0 \mid x) \\
p_{12}(x) &= \Pr(b = 0, m = 1 \mid x) \\
p_{21}(x) &= \Pr(b = 1, m = 0 \mid x) \\
p_{22}(x) &= \Pr(b = 1, m = 1 \mid x).
\end{aligned}$$

In general, our approach involves building models for $b \mid x$ and $m \mid x$. Let $\theta = (\theta_b, \theta_m)$, where $\theta_b$ denotes the parameters of the model for $b \mid x$ and $\theta_m$ denotes the parameters of the model for $m \mid x$. Let $p_b = \Pr(b = 1 \mid x, \theta_b)$ and $p_m = \Pr(m = 1 \mid x, \theta_m)$. For example, if we use a binary logit model, $p_b = \exp(x'\theta_b)/\big(1 + \exp(x'\theta_b)\big)$. If we assume conditional independence, the multinomial probabilities are given in the array

$$P = \big[\, p_{ij} \,\big] = \begin{bmatrix} (1 - p_b)(1 - p_m) & (1 - p_b)\, p_m \\ p_b (1 - p_m) & p_b\, p_m \end{bmatrix}$$

We can provide for a departure from conditional independence by introducing a parameter $\lambda$ ($|\lambda| \le 1$). For positive $\lambda$, let $a = \lambda \min\big( (1 - p_b)\, p_m,\; p_b (1 - p_m) \big)$. For negative $\lambda$, let $a = \lambda \min\big( (1 - p_b)(1 - p_m),\; p_b\, p_m \big)$. The quantity $a$ can be used to perturb the $P$ array to represent a new multinomial distribution with conditional dependence:

(4) $P = \begin{bmatrix} (1 - p_b)(1 - p_m) + a & (1 - p_b)\, p_m - a \\ p_b (1 - p_m) - a & p_b\, p_m + a \end{bmatrix}$

If $|\lambda| \le 1$, this constitutes a valid multinomial distribution. Positive values of $\lambda$ provide for positive conditional dependence and vice versa. We note that the parameterization in (4) preserves the marginals of $b$ and $m$ while accommodating a specific degree of conditional dependence indexed by $\lambda$. The likelihood function for $\lambda, \theta_b, \theta_m$ is given by
(5) $L(\lambda, \theta_b, \theta_m) = \prod_{\ell=1}^{n} \prod_{i=1}^{2} \prod_{j=1}^{2} p_{i,j,\ell}^{\,I_{i,j,\ell}}$

$I_{i,j,\ell}$ is an indicator function for each of the four possibilities represented in the multinomial distribution. Given a prior on $\lambda$, we can easily implement a conditional Bayesian analysis. We have prior information that whatever conditional dependence exists is likely to be small. A reasonable prior for this case would be

(6) $p(\lambda) \propto (1 - \lambda)^{\alpha} (1 + \lambda)^{\alpha}$

Note that (4) gives a joint model for $(b, m) \mid x, \theta_b, \theta_m, \lambda$. If we integrate out either $b$ or $m$ from the joint model, we obtain the same marginal model for $b \mid x$ or $m \mid x$ as was used to construct the joint. Thus, in the empirical application, we will infer $\lambda$ conditional on the values $\hat\theta_b, \hat\theta_m$ obtained by fitting the models $b \mid x, \theta_b$ and $m \mid x, \theta_m$. While one could estimate all model parameters jointly, we do not expect to lose much precision by our conditional approach. The conditional approach has the benefit of a simple implementation.

4. An Empirical Application

One of the most common applications of data fusion methods is the fusing of buying behavior and media exposure. There are general purpose surveys of exposure to print and television media. Typically, these surveys collect demographic information as well. If a marketer is designing a marketing communication strategy for a product or group of products in a particular category, it is useful to know which media types are efficient for communication. This means that the marketer is interested in $b \mid m$ for a specific set of $m$'s whose coverage is observed in a media exposure survey. Fusion is made feasible by a set of
demographic variables, available in a separate buying survey, that are common to both the $b$ and $m$ data sets.

4.1 The BMRB dataset

Our data come from a survey of British consumers conducted by the British Market Research Bureau (BMRB). This is a general purpose survey of more than 20,000 consumers. The BMRB survey collects detailed information on viewership of the most popular GB TV shows along with extensive demographic information. Table 1 lists 19 demographic variables available in these data, including standard household demographics as well as geographic information. The BMRB survey also collects information on purchases of a variety of different categories of products. Table 1 lists 15 such product categories, with penetration rates of between 20 and 86 per cent. We chose these 15 product categories from the approximately 35 available in the data, including only those categories for which there was no missing data. In a typical application, the buying information would come from a separate survey designed specifically for this task or from diary panel data sets in which there would be few missing values. The BMRB survey is designed mostly for the purpose of obtaining purchase data and lifestyle information; this includes measuring media exposure. We confined attention to the 64 TV shows surveyed with no missing data (table 1 provides a list of the shows). All of our $b$ variables and $m$ variables are binary, with $b = 1$ defined as using the product and $m = 1$ as specifically choosing to watch the program. The sample size is 24,497. The BMRB data set provides fused data in the sense that both $b$ and $m$ variables are observed for the same survey respondent. This enables us to gauge the performance of our proposed methods.
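Before turning to the results, the perturbation scheme of equation (4), which the analysis below draws on when interpreting deviations from conditional independence, can be checked numerically. A sketch with made-up cell probabilities confirms that the $\lambda$-indexed shift preserves both marginals:

```python
def perturbed_table(pb, pm, lam):
    """Build the 2x2 cell probabilities of equation (4): start from the
    conditional-independence table and shift mass by a = lam * min(...),
    which preserves both marginals (sketch; pb, pm stand in for fitted
    logit probabilities, lam for the dependence parameter lambda)."""
    if lam >= 0:
        a = lam * min((1 - pb) * pm, pb * (1 - pm))
    else:
        a = lam * min((1 - pb) * (1 - pm), pb * pm)
    return [[(1 - pb) * (1 - pm) + a, (1 - pb) * pm - a],
            [pb * (1 - pm) - a,       pb * pm + a]]

P = perturbed_table(pb=0.3, pm=0.6, lam=0.5)
total = sum(sum(row) for row in P)   # cells still sum to 1
marg_b = P[1][0] + P[1][1]           # Pr(b = 1) unchanged (= 0.3)
marg_m = P[0][1] + P[1][1]           # Pr(m = 1) unchanged (= 0.6)
```

The `min` terms bound the shift so that no cell goes negative, which is why any $|\lambda| \le 1$ yields a valid multinomial.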
Ultimately the goal of data fusion is to estimate the joint distribution of $b$ and $m$. Specifically, we will estimate the conditional distribution $b \mid m$, which we have indicated would be used to make media selection decisions. In the BMRB dataset, each of the $b$ and $m$ variables is binary and we have an extensive set of $x$ variables. Our predictive approach requires estimation of the conditional distributions $b \mid x$ and, in the case of the joint approach (eqn (2)), $m \mid x$ as well. We start with a logistic regression specification of both conditional distributions. The $x$ variables are a mixture of ordinal, categorical and discretized continuous variables (age and education). We specify a logit fit with all variables, except age, entered as dummy variables for all (except one, of course) of their possible values. This logit specification guards against potential misspecifications in which the independent variables enter via additive, but possibly nonlinear, functions, but does not defend against misspecification of the probability locus and the single-index assumption.

To check for violations of model assumptions, we perform a simple graphical diagnostic. For each of the 15 $b$ variables, we have a separate fitted logit model and associated fitted probabilities. We sort the data into $k$ groups on the basis of the fitted probability. Actual frequencies for the dependent variable are then plotted against the expected frequency or average probability from the model fit. We use $k = 20$ groups in this example, which means that each group comprises over 1000 observations and the sample frequencies will be highly accurate estimates of the true probability that the dependent variable is 1.
If the model is properly specified, the sample frequencies and model-based expected frequencies should agree closely (note: the Hosmer-Lemeshow test for model misspecification is based on a test statistic which is the sum of the squared discrepancies between the sample and expected frequencies). We find the plot to be more informative.
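The grouped diagnostic described above can be sketched as follows, with synthetic data standing in for the survey (in the paper, `y` would be a binary purchase indicator and `p_hat` the fitted logit probabilities):

```python
import numpy as np

def calibration_groups(y, p_hat, k=20):
    """Grouped calibration check: sort observations by fitted probability,
    split them into k roughly equal groups, and compare each group's
    observed frequency with its average fitted probability."""
    order = np.argsort(p_hat)
    groups = np.array_split(order, k)
    observed = np.array([y[g].mean() for g in groups])
    expected = np.array([p_hat[g].mean() for g in groups])
    return observed, expected

# Synthetic, well-calibrated example: outcomes drawn from the very
# probabilities we then treat as the model's fitted values.
rng = np.random.default_rng(2)
p_true = rng.uniform(0.05, 0.95, size=5000)
y = (rng.random(5000) < p_true).astype(int)
observed, expected = calibration_groups(y, p_true, k=20)
max_gap = float(np.max(np.abs(observed - expected)))  # small if calibrated
```

Plotting `observed` against `expected` reproduces the graphical check; the Hosmer-Lemeshow statistic mentioned above is (up to weighting) the sum of the squared gaps between these two vectors.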
Figure 1 displays the best (7. Restaurant patronage in the evening) and the worst performing $b$ variables (15. Vitamins). Even the worst case (in the sense of greatest deviation from a line) provides strong evidence in favor of the logit specification. Similar results were obtained for the 64 models of $m \mid x$.

Given the fitted logit models, we implement the joint method (equation (2)), in which we average the distribution of $(b, m) \mid x$ over all possible values in the data. An alternative is to fit only $b \mid x$ and average over only the $x$ values for which $m = 1$ or $0$ (equation (3)). We call the first method the joint method and the second the direct method. It should be noted that it is not clear which method will do a better job of estimating the conditional distribution of $b \mid m$. The joint method uses a large sample of $x$ values (the entire dataset) to average, but incurs the sampling and misspecification error associated with modeling $m \mid x$. The direct method avoids the cost of modeling $m \mid x$ but averages over only a smaller subset of $x$ values. Given the large size of our dataset and the fact that the logit models seem to be very well specified, it is no surprise that the results based on the joint and direct methods agree quite closely. There are negligible differences between these two estimates over the 960 = 15 ($b$ vars) x 64 ($m$ vars) pairs in our data (the correlation between these probabilities is .99985).

Given that we have direct measurement of the joint distribution of $b$ and $m$ in our data, we can check our estimates against the value of $b \mid m$ in the data:

$$\hat p^{\,actual}_{i,j} = \frac{1}{\dim(M_j)} \sum_{\ell \in M_j} b_{i,\ell}\,, \qquad M_j = \{\ell : m_{j,\ell} = 1\}$$

We do not need to subset our data to test our method, since we do not use any aspect of the joint distribution of $b$ and $m$ in computing our estimator in (3). Figure 2 plots $\hat p^{\,direct}_{i,j}$ vs. $\hat p^{\,actual}_{i,j}$ for all of the 960 pairs. As pointed out in Kamakura and Wedel (1997), it would be
deceptive to plot the raw values of these probability estimates against each other. If the marginal probability of b varies a great deal over the set of 15 b variables, then even a terrible estimate, such as reporting only the marginal of each b variable, would still have a reasonably high correlation with the actual sample values. For this reason, we subtract the marginal probability of each b variable from the estimates. That is, we plot p̃_direct_{i,j} vs. p̃_actual_{i,j}, where p̃_{i,j} = p̂_{i,j} − p̂_i and p̂_i is the marginal probability of b variable i. Figure 2 shows a very close correspondence between our estimates and the actual sample values based on the full sample of 24,497 pairs of b and m. The correlation is .98, and the mean absolute deviation, MAD = Σ_i Σ_j | p̃_direct_{i,j} − p̃_actual_{i,j} | / (I J), is small. The dark line in Figure 2 is the 45 degree line. The dotted line is a least squares line fitted through the cloud of points. There are two dimensions along which the direct and actual estimates differ. The first, and most evident, is that the bulk of the points lie below the 45 degree line, indicating that our estimates are a bit too low. This downward bias is slight but discernible. The second is that the point cloud is rotated slightly clockwise from the 45 degree line, as indicated by the difference between the 45 degree and least squares lines. As we will show, both of these discrepancies from perfect fit (up to sampling error) are the result of the assumption of conditional independence. The rotation is caused by a combination of both positive and negative association deviations from conditional independence. The downward bias is caused by the preponderance of positive relative to negative association deviations.

4.2 Comparison to Matching Procedures