A Direct Approach to Data Fusion


Zvi Gilula
Department of Statistics
Hebrew University

Robert E. McCulloch
Peter E. Rossi
Graduate School of Business
University of Chicago
1101 E. 58th Street
Chicago, IL

July 2003. This Version, May 2004.

Abstract

The generic data fusion problem is to make inferences about the joint distribution of two sets of variables without any direct observations of the joint distribution. Instead, information is available only about each set separately, along with some other set of common variables. The standard approach to data fusion creates a fused data set with the variables of interest and the common variables. Our approach directly estimates the joint distribution of just the variables of interest. For the case of either discrete or continuous variables, our approach yields a solution that can be implemented with standard statistical models and software. In typical marketing applications, the common variables are psychographic or demographic variables and the variables to be fused involve media viewing and product purchase. For this example, our approach will directly estimate the joint distribution of media viewing and product purchase without including the common variables. This is the object required for marketing decisions. In marketing applications, fusion of discrete variables is required. We develop a method for relaxing the assumption of conditional independence for this case. We illustrate our approach with product purchase and media viewing data from a large survey of British consumers.

Keywords: Data fusion, Predictive distributions, Bayesian analysis, Media planning and buying

Acknowledgements: The authors wish to thank BMRB, London for access to their survey data. Rossi thanks the James M. Kilts Center for Marketing, Graduate School of Business, University of Chicago for partial support of this research.

1. Introduction

Data fusion is the problem of how to make inferences about the joint distribution of two sets of random variables (hereafter called the target variables) when only information on the marginal distribution of each set is available. For example, separate surveys are conducted about buying or purchase behavior and media viewing behavior. Information is available on the marginal distribution of buying behavior and of media viewing, but there is no direct observation of the joint distribution. In media planning problems, inferences about the joint distribution of buying and viewing are desired. The problem, therefore, is to make inferences about the joint distribution of media viewing and buying without direct observation of these two sets of variables jointly.

The general problem of making inferences about a joint distribution on the basis of its marginals is not solvable. There are many possible joint distributions consistent with the same marginal distributions, so the joint distribution is not identified by knowledge of the marginals alone. Additional information must be brought to bear on the problem to solve it. In the case of the data fusion problem, fusion is made possible by some group of variables common to both sources of information on the marginals of the two sets of target variables. An example of this common information is demographic or psychographic variables. In the media planning example, demographic information is available both in the survey of buying and in the survey of media viewership. Data fusion methods utilize this common information to make inferences about the joint distribution. It should be clear that the presence of common variables alone is not sufficient to identify the joint distribution of the two sets of target variables. Additional assumptions must be made about the conditional distribution of the target variables given the common variables in order to achieve identification.

The term data fusion was coined for this problem to connote the merging or fusion of two data sets. One data set has one set of target variables and the common variables, and the other data set has the other set of target variables (and the same common variables). The data set with the buying data, for example, must be fused with the data set with media viewing habits via the information in the common set of demographic variables. If all dependence between buying and media viewing is via the common variables, then it might be natural to view the data fusion problem as a sort of matching problem (see Kadane (1978) and Rodgers (1984)). A given record from the buying data must be matched with one or more records from the viewing data. One example of this matching is the hot deck method, in which records from one file are matched with one record from the other file. If the common variables are discrete, one could simply find all the records in the other file that have exactly the same values of the common variables. If there is only one such record in the other file, then we simply impute or match this one record. If there is more than one record in the other file with identical values of the common variables, we use some measure of the average value of the target variable or use multiple imputed values as in Rubin (1986).

In marketing problems, the data sets are often produced by surveys and all variables are discrete. Moreover, many of the important variables are categorical in nature. The ultimate target variables whose joint distribution we wish to estimate are also discrete. Examples include media viewership and purchase, which are binary variables. Multiple imputation methods based on multivariate normal distributional assumptions (see Rassler (2002) for an excellent discussion of multiple imputation methods) are not applicable to many, if not most, marketing applications.

The basic idea behind matching is to form groups of observations that are similar as measured by their common variable values. These groups of observations can be used to impute the values of the target variables that are not observed or missing in a particular data set. As such, the problem of data fusion can be cast as a missing data problem, as emphasized by Rubin (1986). The imputation groups can be formed by simple rules such as having identical values of the demographic variables or close values as defined by a distance metric (Rassler 2002 (pp. 19, 56, 68), Moriarity and Scheuren (2001, 2003)). Kamakura and Wedel (1997, 2000) generalize this notion quite elegantly by defining imputation groups implicitly via a finite mixture model. All observations in a given mixture component form the imputation group, and the imputation group memberships are inferred from the data rather than imposed via some sort of ad hoc distance metric. In the Kamakura and Wedel approach, multiple imputations or simulation methods can be used to conduct formal inference, properly accounting for uncertainty about imputation group assignment.

Our approach is to directly estimate the joint distribution of the target variables rather than use a matching or concatenation approach. The joint distribution can then be used to solve the inference problems desired for marketing decisions. Our approach works equally well with either discrete or continuous target and common variables. In particular, we do not require any explicit modeling of the distribution of the common variables and instead condition on these variables. This reduces the number of parameters estimated as well as the possible specification errors that might occur from postulating a joint distribution of the common variables.

Our focus is on marketing decisions for which the joint distribution is the ultimate goal of the analysis. Multiple imputation and other fusion approaches are designed for more generic situations in which the ultimate goal of the analysis is not known at the time of fusion. Our approach is also designed to exploit existing methods for modeling the conditional distribution of the target variables given the common variables rather than requiring specialized code. Standard methods (such as logit or regression models) can be used interchangeably with more involved state-of-the-art techniques.

The paper is organized as follows. In the next section, we outline a general framework for the data fusion problem and present our general approach. The assumption of conditional independence plays an important role in many approaches to data fusion. We discuss how other approaches relate to our general formulation of the problem and how these approaches either do or do not employ conditional independence. In section 3, we develop a method for relaxing the assumption of conditional independence that is useful for the case where some fused data or prior information is available. In section 4, we illustrate the value of our approach by using buying and media viewing data from a large survey of British consumers. We show that our approach achieves highly accurate fusion without the use of highly parameterized models or specialized code.

2. A Framework for Data Fusion

In order to develop a general framework for data fusion, we need to propose a precise definition of the data fusion problem. Much of the data fusion literature takes the view that the goal of data fusion is to combine or fuse two datasets into one complete data set. To establish notation, let $D_b = \{(x_i, b_i),\ i = 1, \ldots, N_b\}$ denote a dataset of observations on one target variable (denoted by $b$) and the common variables $x$. $D_m = \{(x_i, m_i),\ i = 1, \ldots, N_m\}$ denotes the dataset of observations on the other target variable (with their associated common variables). We label the target variables $b$ and $m$ to suggest a media buying situation in which $b$ would denote product purchase or usage and $m$ would denote media viewing. Typically, $x$ is a high-dimensional vector of variables. While this notation conforms to our data application, the problem is obviously more general.

The data fusion literature regards the problem as the creation of one combined dataset with observations on all three sets of variables $(x, b, m)$ (this is sometimes referred to as file concatenation). For example, Rubin (1986) views the problem as a missing data problem. That is, the information on $m$ is missing from the $D_b$ dataset and must be imputed. Our view is that the goal of data fusion is to form inferences about the joint distribution of $(b,m)$ using the information in the data $D = (D_b, D_m)$. Estimates of the joint distribution of $(b,m)$ can then be used to solve whatever decision problem is required by the marketing application. For example, in media planning, media choices that have a high proportion of viewers who purchase the advertised product are considered desirable. Media choice therefore depends on aspects of the joint distribution of $b$ and $m$. Below we discuss under what circumstances we require the joint probability of $b$ and $m$ or simply the conditional probability of $b$ given $m$.
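To fix ideas, here is a minimal sketch of this data structure in Python; the variable names and values are hypothetical, not from our application. Each file contains the common variables, but $b$ and $m$ never appear together in the same file.

```python
import pandas as pd

# D_b: buying survey -- common variables x plus the purchase indicator b
D_b = pd.DataFrame({
    "age":    [34, 51, 27],        # common variable
    "region": ["N", "S", "N"],     # common variable
    "b":      [1, 0, 1],           # target: product purchase (binary)
})

# D_m: media survey -- the same common variables, different respondents,
# plus the viewing indicator m; (b, m) are never observed jointly
D_m = pd.DataFrame({
    "age":    [45, 30, 62],
    "region": ["S", "S", "N"],
    "m":      [0, 1, 1],           # target: media viewing (binary)
})
```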

Our goal can then be succinctly stated as the computation of the predictive distribution of $(b,m)$ given the data $D = (D_b, D_m)$. The predictive distribution is obtained by integrating out over the parameters of the joint distribution:

$$p(b,m|D) = \int p(b,m|\theta,D)\, p(\theta|D)\, d\theta$$

Since $b$ and $m$ are not observed jointly but only separately along with the common variables, we must provide a model for the conditional distribution of $(b,m)$ given $x$. As discussed below in the section on identification, some further assumptions are required to identify this model. We will start with the assumption of conditional independence; that is,

(1) $\quad p(b,m|x,\theta) = p(b|x,\theta)\, p(m|x,\theta)$

The idea here is that the source of commonality between $b$ and $m$ is the $x$ variables and that, after controlling for or conditioning on these variables, the dependence between $b$ and $m$ is eliminated. For situations with a rich array of $x$ variables, this may be a reasonable approximation. What is important to emphasize is that some assumption regarding dependence must be made to solve the data fusion problem. We start with the assumption of conditional independence, for which we believe a reasonable argument can be made. However, without direct information on the joint distribution of $b$ and $m$, this assumption cannot be tested.

Parts of the literature on data fusion do not make explicit mention of the assumption of conditional independence but are implicitly assuming it. Others, such as Rodgers (1984), make the assumption explicitly. In the identification section below, we discuss the other approaches and the implicit and explicit assumptions made regarding conditional independence. We will also develop a method for relaxing conditional independence that can apply to many marketing applications.

Under the assumption of conditional independence, the predictive distribution of $b$ and $m$ can be computed as follows:

$$p(b,m|D) = \int\!\!\int p(b,m|x,\theta)\, p(x)\, p(\theta|D)\, dx\, d\theta = \int\!\!\int p(b|x,\theta)\, p(m|x,\theta)\, p(x)\, p(\theta|D)\, dx\, d\theta$$

$p(\theta|D)$ is the posterior distribution of the parameters given the two datasets, and $p(x)$ is the marginal distribution of the common variables. In general, $x$ may not be continuous. Therefore, it may be useful to view the inner integral above as the expectation of the conditional distribution of $b,m$ given $x$ and $\theta$ with respect to the marginal distribution of the $x$ variables:

$$p(b,m|D) = \int E_x\!\left[p(b,m|x,\theta)\right] p(\theta|D)\, d\theta = E_{\theta|D}\!\left[E_x\!\left[p(b,m|x,\theta)\right]\right]$$

In order to compute the expectation with respect to the marginal distribution of $x$, it is not necessary to model the distribution of $(b,x)$ and $(m,x)$ or even just the marginal of $x$. We only require the ability to take the expectation with respect to this distribution. The $x$ variables may exhibit many forms of dependence and mixtures of discrete and continuous distributions. Given that we only require the expectation and not the entire distribution, we can simply approximate this expectation by summing over the observations. This avoids making arbitrary distributional assumptions or the very difficult non-parametric problem of approximating the high-dimensional distribution of $x$. In survey work, we typically have samples of several thousand or more, so this approximation is apt to be very accurate.

Our approach to computing the predictive distribution of $b$ and $m$ is simply to form the expectation

(2) $\quad p(b,m|D) = E_{\theta|D}\!\left[E_x\!\left[p(b,m|x,\theta)\right]\right] \approx E_{\theta|D}\!\left[\frac{1}{N_b+N_m}\sum_{x_i \in D} p(b|x_i,\theta)\, p(m|x_i,\theta)\right]$

The summation is over all observations of $x$ in both datasets. The outer expectation with respect to the posterior distribution of $\theta$ can easily be computed by using draws from modern MCMC methods or by even less computationally demanding methods such as importance sampling. As a practical matter, this means that we only have to model the conditional distribution of $b$ given $x$ and of $m$ given $x$ to perform data fusion. In typical situations, each element of $b$ and $m$ can be either a binary variable, in which case simple logit models might suffice, or a continuous variable, for which standard regression models could be used. Diagnostics can be performed on these model fits to select among alternative specifications. Modeling the conditional distribution of $b$ given $x$ or $m$ given $x$ is considerably less demanding than modeling the joint distribution of $(b,x)$ and/or $(m,x)$. This reduces computational requirements and guards against model mis-specification.
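To make the computation in (2) concrete, the following is a minimal sketch assuming binary $b$ and $m$ modeled with logits, and assuming posterior draws of $(\theta_b, \theta_m)$ from the two fits are already in hand; the function and array names are ours, not part of any package.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_prob_bm(X, theta_b_draws, theta_m_draws):
    """Estimate Pr(b=1, m=1 | D) as in equation (2): average
    p(b=1|x,theta_b) * p(m=1|x,theta_m) over all observed x
    (pooled from both datasets) and over posterior draws of theta."""
    est = 0.0
    for tb, tm in zip(theta_b_draws, theta_m_draws):
        p_b = sigmoid(X @ tb)          # p(b=1 | x_i, theta_b) for every x_i
        p_m = sigmoid(X @ tm)          # p(m=1 | x_i, theta_m)
        est += np.mean(p_b * p_m)      # inner average over the N_b + N_m x's
    return est / len(theta_b_draws)    # outer average over posterior draws

# X stacks the common variables from D_b and D_m (N_b + N_m rows);
# theta_*_draws are R x dim(x) arrays of posterior draws from the logit fits.
```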

2.1 Joint or Conditional Probability?

In order to determine which aspects of the joint distribution of $b$ and $m$ are required, we must examine the media buying decision. Consider the problem of allocating a media buying budget over $k$ possible media (in our case, over $k$ possible TV shows). We view the objective as maximizing the total exposure to consumers who have revealed an interest in the product via purchase ($b$). Thus, the media buying decision could be formalized as

$$\max_{\{Q_k\}}\ \sum_k \Pr(b=1 \text{ and } m_k=1)\, Q_k \quad \text{s.t.} \quad \sum_k P_k Q_k = E$$

$E$ is the total media budget, $P_k$ is the price per exposure for medium $k$, and $Q_k$ is the number of exposures purchased for medium $k$. Note that the total number of exposures to interested consumers will be proportional to the probability of a consumer both viewing medium $k$ and purchasing the product, which is simply the joint probability of $b$ and $m_k$. This takes into account both the total viewership of medium $k$ as well as the proportion of viewers of medium $k$ who have expressed an interest in the product category via purchase. The solution to the problem posed above is to purchase the medium with the highest ratio, $\Pr(b \text{ and } m_k)/P_k$. This implies that the joint probability of $b$ and $m_k$ is the object of interest for media planning. However, if the price of a medium is proportional to the size of its viewership, $P_k = c\,\Pr(m_k = 1)$, then this optimality condition simply becomes to choose the medium with the highest conditional probability:

$$\frac{\Pr(b \text{ and } m_k)}{P_k} = \frac{\Pr(b \text{ and } m_k)}{c\,\Pr(m_k)} \propto \Pr(b\,|\,m_k)$$
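As a small illustration of this allocation rule (with made-up numbers), one would rank media by the ratio of the estimated joint probability to the price per exposure:

```python
import numpy as np

# Hypothetical estimates of Pr(b=1 and m_k=1) and prices per exposure P_k
joint_bm = np.array([0.08, 0.05, 0.12])
price    = np.array([1.00, 0.40, 1.50])

ratio = joint_bm / price          # Pr(b and m_k) / P_k for each medium k
best  = int(np.argmax(ratio))     # allocate the budget to this medium
```

When $P_k = c\,\Pr(m_k = 1)$, the same ranking is produced by $\Pr(b\,|\,m_k)$.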

2.2 Situations in which only $b|m$ is Desired

As discussed above, there are some situations in which we do not require estimates of the entire joint distribution of $(b,m)$ but only require the conditional distribution of $b|m$. In these cases, some computational simplifications can be achieved over the approach outlined above. The goal now becomes to compute the predictive distribution of the conditional distribution of $b|m$:

$$p(b|m,D) = \int p(b,\theta|m,D)\, d\theta = \int p(b|\theta,m,D)\, p(\theta|m,D)\, d\theta$$

We now introduce the conditioning $x$ variables into the expression:

$$= \int\!\!\int p(b,x|\theta,m)\, dx\ p(\theta|D)\, d\theta = \int\!\!\int p(b|x,\theta,m)\, p(x|\theta,m)\, dx\ p(\theta|D)\, d\theta$$

Using the assumption of conditional independence, we obtain

$$= \int\!\!\int p(b|x,\theta)\, p(x|\theta,m)\, dx\ p(\theta|D)\, d\theta = \int\!\!\int p(b|x,\theta)\, p(x|m)\, dx\ p(\theta|D)\, d\theta$$

This expression means that we average the conditional distribution of $b|x$ with respect to the conditional distribution of $x|m$:

$$= \int E_{x|m}\!\left[p(b|x,\theta)\right] p(\theta|D)\, d\theta$$

We can approximate this conditional expectation by summing over the observations of $x$ for the given value of $m$:

(3) $\quad p(b|m,D) \approx \int \frac{1}{N_{x|m}} \sum_{x_i :\, m} p(b|x_i,\theta)\ p(\theta|D)\, d\theta$

$N_{x|m}$ is the number of $x$ observations for which $m$ takes on the particular value. In the media viewing case, this simply means that we sum over the empirical distribution of $x$ for a specific medium. Thus, if we are only interested in computing $b|m$, we can simply model $b|x$ and sum over the relevant values of the $x$ variables. We avoid the effort and possible model mis-specification errors associated with modeling $m|x$.
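A corresponding sketch of the direct method in (3), under the same assumptions as before (logit model for $b|x$, posterior draws available; names are hypothetical):

```python
import numpy as np

def direct_cond_prob(X_m1, theta_b_draws):
    """Estimate Pr(b=1 | m=1, D) as in equation (3): average p(b=1|x,theta_b)
    over the x's of respondents with m=1, then over posterior draws."""
    probs = [np.mean(1.0 / (1.0 + np.exp(-(X_m1 @ tb)))) for tb in theta_b_draws]
    return float(np.mean(probs))

# X_m1 holds the rows of the media survey's x matrix for which m = 1;
# note that no model for m|x is needed here.
```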

2.3 Identification and the Assumption of Conditional Independence

There is a fundamental identification issue in the data fusion problem (see also Rassler (2002), p. 5). The identification problem stems from the fact that we only observe data on the two marginal distributions of $(b,x)$ and $(m,x)$. The goal is to make inferences about the joint distribution of $(b,m)$. In our data fusion method, the distribution of $(b,m)$ is obtained from the conditional distribution by averaging over the marginal distribution of $x$, $p(b,m) = \int p(b,m|x)\, p(x)\, dx$. To see the identification problem, consider the alternative definition that the joint distribution of $(b,m)$ is a marginal of the joint $(b,m,x)$, $p(b,m) = \int p(b,m,x)\, dx$. For any given marginal distributions of $(b,x)$ and $(m,x)$, there are many possible joint distributions $(b,m,x)$. This means that the data fusion problem is fundamentally unidentified without some sort of restriction on the joint distribution of $(b,m,x)$ or, equivalently, on the conditional distribution of $(b,m)|x$.

We start with the restriction that $b$ and $m$ are independent conditional on $x$. This is based on the view that if the $x$ vector is rich enough, then $b$ and $m$ can be approximately independent. Clearly, if the $x$ vector does not have sufficient explanatory power, the assumption of conditional independence can be violated. If a source of prior information (e.g., a sample of fused data) is available, we can incorporate deviations from conditional independence, as illustrated in section 3.

In many situations, the assumption of conditional independence may be reasonable. However, it is clear that there may be some situations in which the content of the $x$ vector may not be sufficient to ensure conditional independence. For example, consider the case in which the $x$ vector contains only demographic variables. In order to ensure conditional independence, there must be no common component between category purchase and media viewership conditional on $x$. If the media is narrowly focused on a specific interest, then the assumption of conditional independence might be violated. For example, consider the category of photographic equipment. Interest in photography is certainly related to demographics but is unlikely to be perfectly predicted by demographics alone. This means that there is likely to be a common component (interest in photography) that is present in both $b$ (purchase of cameras) and $m$ (readership of a photographic magazine). However, for more general media such as TV programs, radio shows, newspapers, and general interest magazines, this is less likely to be a problem.

It is important to realize that some restriction is required and that, without an additional source of data on $(b,m)$, this assumption cannot be tested.

It is instructive to examine other approaches to data fusion to see what identification assumptions are imposed, either explicitly or implicitly. The oldest approaches to data fusion involve data matching of some sort. Equivalent groups of observations are identified using the $x$ variables. For example, in the hot deck approach, observations with the same values of the $x$ vector are assumed to be equivalent, or to be a random sample from the conditional distribution of $(b,m)$ given $x$. While it is not explicitly stated, the justification for these matching procedures is that conditional independence approximately holds (see also Rodgers (1984)). Data matching approaches that define a distance metric in the $x$ space (e.g., Soong and de Montigny (2001)) and use observations that are close in terms of their $x$ values also rely on the assumption of conditional independence. Kamakura and Wedel (1997) do not assume conditional independence and use a finite mixture of independent multinomials to approximate the joint distribution $(b,m,x)$. It is not clear whether their procedure will give rise to estimates of the joint distribution that display conditional independence.

3. Relaxing the Conditional Independence Assumption

Our view is that conditional independence is a useful default or maintained model assumption. If the set of $x$ variables is comprehensive and predictive of $b$ and $m$ behavior, conditional independence is likely to hold. Relaxing the assumption of conditional independence requires additional information beyond the sample information, as the joint distribution of $b$ and $m$ is not identified. Supplemental information can come from a variety of sources. We will consider the case in which a subset of data is available for which $b$, $m$, and $x$ are jointly observed.

There are many ways to incorporate conditional dependence by replacing (1) with some model of the conditional joint distribution of $(b,m)|x$. For example, Rassler (2002) introduces a prior distribution that captures some view of dependence for the case of multivariate normal $b$, $m$, and $x$ variables. The problem is that the results are very sensitive to the choice of this prior, and assessment of the prior can be difficult. Our view is that this prior information must ultimately come from jointly observing data on comparable $(b,m,x)$.

Models of conditional dependence will depend on whether $b$ and $m$ are discrete or continuous and, even in the discrete case, on the number of values that $b$ and $m$ can take on. The literature has focused on multivariate normal models, which are of questionable relevance in marketing applications. Here we develop an approach to adding dependence for binary $b$ and $m$ variables, which is the most important case for many marketing applications.

The table of the joint distribution given $x$ is a multinomial distribution with four cells, with probabilities

$$\begin{aligned}
p_{11}(x) &= \Pr(b=0, m=0\,|\,x) & \quad p_{12}(x) &= \Pr(b=0, m=1\,|\,x) \\
p_{21}(x) &= \Pr(b=1, m=0\,|\,x) & \quad p_{22}(x) &= \Pr(b=1, m=1\,|\,x)
\end{aligned}$$

In general, our approach involves building models for $b|x$ and $m|x$. Let $\theta = (\theta_b, \theta_m)$, where $\theta_b$ denotes the parameters of the model for $b|x$ and $\theta_m$ denotes the parameters of the model for $m|x$. Let $p_b = \Pr(b=1\,|\,x, \theta_b)$ and $p_m = \Pr(m=1\,|\,x, \theta_m)$. For example, if we use a binary logit model, $p_b = \exp(x'\theta_b)/\left(1 + \exp(x'\theta_b)\right)$. If we assume conditional independence, the multinomial probabilities are given in the array

$$P = \left[p_{ij}\right] = \begin{bmatrix} (1-p_b)(1-p_m) & (1-p_b)\,p_m \\ p_b(1-p_m) & p_b\,p_m \end{bmatrix}$$

We can provide for a departure from conditional independence by introducing a parameter $\lambda$ ($|\lambda| \le 1$). For positive $\lambda$, let $a = \lambda \min\left((1-p_b)\,p_m,\ p_b(1-p_m)\right)$. For negative $\lambda$, let $a = \lambda \min\left((1-p_b)(1-p_m),\ p_b\,p_m\right)$. $a$ can be used to perturb the $P$ array to represent a new multinomial distribution with conditional dependence:

(4) $\quad P = \begin{bmatrix} (1-p_b)(1-p_m) + a & (1-p_b)\,p_m - a \\ p_b(1-p_m) - a & p_b\,p_m + a \end{bmatrix}$

If $|\lambda| \le 1$, this constitutes a valid multinomial distribution. Positive values of $\lambda$ provide for positive conditional dependence and vice versa. We note that the parameterization in (4) preserves the marginals of $b$ and $m$ while accommodating a specific degree of conditional dependence indexed by $\lambda$.
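A sketch of the perturbation in (4): given $p_b$, $p_m$, and $\lambda$, the function below (our own, purely illustrative) returns the 2x2 cell probabilities:

```python
def dependent_table(p_b, p_m, lam):
    """Perturb the conditional-independence table by a, as in equation (4).
    The marginals Pr(b=1)=p_b and Pr(m=1)=p_m are preserved, and all four
    cells remain non-negative for any |lam| <= 1."""
    if lam >= 0:
        a = lam * min((1 - p_b) * p_m, p_b * (1 - p_m))
    else:
        a = lam * min((1 - p_b) * (1 - p_m), p_b * p_m)
    return [[(1 - p_b) * (1 - p_m) + a, (1 - p_b) * p_m - a],
            [p_b * (1 - p_m) - a,       p_b * p_m + a]]
```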

The likelihood function for $(\lambda, \theta_b, \theta_m)$ is given by

(5) $\quad L(\lambda, \theta_b, \theta_m) = \prod_{\ell=1}^{n} \prod_{i=1}^{2} \prod_{j=1}^{2} p_{i,j,\ell}^{\,I_{i,j,\ell}}$

$I_{i,j,\ell}$ is an indicator function for each of the four possibilities represented in the multinomial distribution. Given a prior on $\lambda$, we can easily implement a conditional Bayesian analysis. We have prior information that whatever conditional dependence exists is likely to be small. A reasonable prior for this case would be

(6) $\quad p(\lambda) \propto (1-\lambda)^{\alpha}\,(1+\lambda)^{\alpha}$

Note that (4) gives a joint model for $(b,m)\,|\,x, \theta_b, \theta_m, \lambda$. If we integrate out either $b$ or $m$ from the joint model, we will obtain the same marginal model for $b|x$ or $m|x$ as used to construct the joint. Thus, in the empirical application, we will infer about $\lambda$ conditional on the values of $\hat\theta_b, \hat\theta_m$ obtained by fitting the models $b\,|\,x, \theta_b$ and $m\,|\,x, \theta_m$. While one could estimate all model parameters jointly, we do not expect to lose much precision with our conditional approach. The conditional approach has the benefit of a simple implementation.
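A sketch of this conditional analysis: holding the fitted probabilities from $b\,|\,x,\hat\theta_b$ and $m\,|\,x,\hat\theta_m$ fixed, evaluate the posterior of $\lambda$ on a grid using the likelihood (5) and prior (6). All names are hypothetical, and the prior exponent alpha is a user-chosen constant.

```python
import numpy as np

def lambda_posterior(p_b, p_m, cell, alpha=1.0,
                     grid=np.linspace(-0.99, 0.99, 199)):
    """Grid posterior of lambda given fitted p_b, p_m (arrays over the fused
    sample) and cell = 2*b + m indexing the four (b,m) outcomes, per (5)-(6)."""
    log_post = np.empty_like(grid)
    n = len(cell)
    for g, lam in enumerate(grid):
        a = np.where(lam >= 0,
                     lam * np.minimum((1 - p_b) * p_m, p_b * (1 - p_m)),
                     lam * np.minimum((1 - p_b) * (1 - p_m), p_b * p_m))
        P = np.stack([(1 - p_b) * (1 - p_m) + a, (1 - p_b) * p_m - a,
                      p_b * (1 - p_m) - a,       p_b * p_m + a])
        loglik = np.sum(np.log(P[cell, np.arange(n)]))   # likelihood (5)
        log_post[g] = loglik + alpha * (np.log1p(lam) + np.log1p(-lam))  # prior (6)
    w = np.exp(log_post - log_post.max())                # normalize on the grid
    return grid, w / w.sum()
```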

4. An Empirical Application

One of the most common applications of data fusion methods is the fusing of buying behavior and media exposure. There are general purpose surveys of exposure to print and television media. Typically, these surveys collect demographic information as well. If a marketer is designing a marketing communication strategy for a product or group of products in a particular category, it is useful to know which media types are efficient for communication. This means that the marketer is interested in $b|m$ for a specific set of $m$'s whose coverage is observed in a media exposure survey. Fusion is made feasible by a set of demographic variables, available in a separate buying survey, that are common to both the b and m data sets.

4.1 The BMRB dataset

Our data comes from a survey of Great Britain consumers conducted by the British Market Research Bureau (BMRB). This is a general purpose survey of more than 20,000 consumers. The BMRB survey collects detailed information on viewership of the most popular GB TV shows along with extensive demographic information. Table 1 lists 19 demographic variables available in these data. These data include standard household demographics as well as geographic information.

The BMRB survey also collects information on purchases of a variety of different categories of products. Table 1 lists 15 such product categories. These product categories have penetration rates of between 20 and 86 per cent. We chose these 15 product categories from the approximately 35 available in the data, including only those categories for which there was no missing data. In a typical application, the buying information would come from a separate survey designed specifically for this task or from diary panel data sets in which there would be few missing values. The BMRB survey is designed mostly for the purpose of obtaining purchase data and lifestyle information. This includes measuring media exposure. We confined attention to TV viewing information for the 64 surveyed shows with no missing data (table 1 provides a list of the shows). All of our b variables and m variables are binary, with b=1 defined as using the product and m=1 as "specifically chooses to watch this program." The sample size is 24,497.

The BMRB data set provides fused data in the sense that both b and m variables are observed for the same survey respondent. This enables us to gauge the performance of our proposed methods.

Ultimately, the goal of data fusion is to estimate the joint distribution of b and m. Specifically, we will estimate the conditional distribution, $b|m$, which we have indicated would be used to make media selection decisions. In the BMRB dataset, each of the b and m variables is binary, and we have an extensive set of x variables. Our predictive approach requires estimation of the conditional distributions, $b|x$ and, in the case of the joint approach (eqn (2)), $m|x$ as well.

We start with a logistic regression specification of both conditional distributions. The x variables are a mixture of ordinal, categorical, and discretized continuous variables (age and education). We specify a logit fit with all variables, except age, entered as dummy variables for all (except one, of course) of their possible values. This logit specification guards against potential mis-specifications of the form that the independent variables enter via additive, but possibly nonlinear, functions, but it does not defend against mis-specification of the probability locus and the single-index assumption.

To check for violations of model assumptions, we perform a simple graphical diagnostic. For each of the 15 b variables, we have a separate fitted logit model and associated fitted probabilities. We sort the data into k groups on the basis of the fitted probability. Actual frequencies for the dependent variable are then plotted against the expected frequency or average probability from the model fit. We use k=20 groups in this example, which means that each group comprises over 1000 observations and the sample frequencies will be highly accurate estimates of the true probability that the dependent variable is 1. If the model is properly specified, the sample frequencies and model-based expected frequencies should agree closely (note: the Hosmer-Lemeshow test for model mis-specification is based on a test statistic which is the sum of the squared discrepancies between the sample and expected frequencies). We find the plot to be more informative.
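The diagnostic just described can be sketched as follows (our own illustrative implementation; plotting boilerplate omitted):

```python
import numpy as np

def calibration_points(y, p_hat, k=20):
    """Sort observations by fitted probability into k roughly equal-size
    groups and return (expected, actual) frequency pairs for each group."""
    order = np.argsort(p_hat)
    groups = np.array_split(order, k)
    expected = np.array([p_hat[g].mean() for g in groups])  # avg fitted prob
    actual = np.array([y[g].mean() for g in groups])        # sample freq of y=1
    return expected, actual

# A well-specified model places these points close to the 45 degree line;
# the sum of squared discrepancies underlies the Hosmer-Lemeshow statistic.
```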

Figure 1 displays the best (7. Restaurant patronage in the evening) and the worst performing b variables (15. Vitamins). Even the worst case (in the sense of greatest deviation from a line) provides strong evidence in favor of the logit specification. Similar results were obtained for the 64 models of $m|x$.

Given the fitted logit models, we implement the joint method (equation 2), in which we average the distribution of $(b,m)|x$ over all x values in the data. An alternative is to fit only $b|x$ and average over only the x values for which m = 1 or 0 (equation 3). We call the first method the joint method; the second is termed the direct method. It should be noted that it is not clear which method will do a better job of estimating the conditional distribution of $b|m$. The joint method uses a large sample of x values (the entire dataset) to average over, but incurs the sampling and mis-specification error associated with modeling $m|x$. The direct method avoids the cost of modeling $m|x$ but averages over only a smaller subset of x values. Given the large size of our dataset and the fact that the logit models seem to be very well specified, it is no surprise that the results based on the joint and direct methods agree quite closely. There are negligible differences between these two estimates over the 960 = 15 (b variables) x 64 (m variables) pairs in our data (the correlation between these probabilities is .99985).

Given that we have direct measurement of the joint distribution of b and m in our data, we can check our estimates against the value of $b|m$ in the data:

$$\hat p^{\,actual}_{i,j} = \frac{1}{\dim(M_j)} \sum_{\ell \in M_j} b_{i,\ell}\,; \qquad M_j = \{\ell : m_{j,\ell} = 1\}$$

We do not need to subset our data to test our method, since we do not use any aspect of the joint distribution of b and m in computing our estimator in (3). Figure 2 plots $\hat p^{\,direct}_{i,j}$ vs. $\hat p^{\,actual}_{i,j}$ for all of the 960 pairs.
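The benchmark $\hat p^{\,actual}_{i,j}$ is simply a conditional sample frequency, e.g. (hypothetical array names):

```python
import numpy as np

def p_actual(b_i, m_j):
    """Sample estimate of Pr(b_i = 1 | m_j = 1): the frequency of b_i = 1
    among the respondents with m_j = 1."""
    return b_i[m_j == 1].mean()
```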

As pointed out in Kamakura and Wedel (1997), it would be deceptive to plot the raw values of these probability estimates against each other. If the marginal probability of b varies a great deal over the set of 15 b variables, then a terrible estimate, such as reporting only the marginal of each b variable, would still have a reasonably high correlation with the actual sample values. For this reason, we subtract the marginal probability of each b variable from the estimates. That is, we plot $\tilde p^{\,direct}_{i,j}$ vs. $\tilde p^{\,actual}_{i,j}$, where $\tilde p_{i,j} = \hat p_{i,j} - \hat p_i$ and $\hat p_i$ is the marginal probability of b variable i.

Figure 2 shows a very close correspondence between our estimates and the actual sample values based on the full sample of 24,497 pairs of b and m. The correlation is .98, and the mean absolute deviation, $MAD = \sum_{i,j} \left|\tilde p^{\,direct}_{i,j} - \tilde p^{\,actual}_{i,j}\right| / (I \cdot J)$, is correspondingly small. The dark line in figure 2 is the 45 degree line. The dotted line is a least squares fitted line through the cloud of points. It is evident that there are two dimensions along which the direct and actual estimates differ. The first, and most evident, is that the bulk of the points lie below the 45 degree line, indicating that our estimates are a bit too low. This downward bias is slight but discernible. The second is that the point cloud is rotated slightly in the clockwise direction from the 45 degree line, as indicated by the difference between the 45 degree and least squares fitted lines. As we will show, both of these discrepancies from perfect fit (up to sampling error) are the result of the assumption of conditional independence. The rotation is caused by a combination of both positive and negative association deviations from conditional independence. The downward bias is caused by the preponderance of positive association deviations relative to negative.

4.2 Comparison to Matching Procedures


Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:

More information

The Gravity Model: Derivation and Calibration

The Gravity Model: Derivation and Calibration The Gravity Model: Derivation and Calibration Philip A. Viton October 28, 2014 Philip A. Viton CRP/CE 5700 () Gravity Model October 28, 2014 1 / 66 Introduction We turn now to the Gravity Model of trip

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

More information

Correlation key concepts:

Correlation key concepts: CORRELATION Correlation key concepts: Types of correlation Methods of studying correlation a) Scatter diagram b) Karl pearson s coefficient of correlation c) Spearman s Rank correlation coefficient d)

More information

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether

More information

Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

A THEORETICAL COMPARISON OF DATA MASKING TECHNIQUES FOR NUMERICAL MICRODATA

A THEORETICAL COMPARISON OF DATA MASKING TECHNIQUES FOR NUMERICAL MICRODATA A THEORETICAL COMPARISON OF DATA MASKING TECHNIQUES FOR NUMERICAL MICRODATA Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University Agency Internal User Unmasked Result Subjects

More information

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics. Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

More information

Probabilistic user behavior models in online stores for recommender systems

Probabilistic user behavior models in online stores for recommender systems Probabilistic user behavior models in online stores for recommender systems Tomoharu Iwata Abstract Recommender systems are widely used in online stores because they are expected to improve both user

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

More information

Ira J. Haimowitz Henry Schwarz

Ira J. Haimowitz Henry Schwarz From: AAAI Technical Report WS-97-07. Compilation copyright 1997, AAAI (www.aaai.org). All rights reserved. Clustering and Prediction for Credit Line Optimization Ira J. Haimowitz Henry Schwarz General

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

The Optimality of Naive Bayes

The Optimality of Naive Bayes The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most

More information

Organizing Your Approach to a Data Analysis

Organizing Your Approach to a Data Analysis Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize

More information

Addressing Analytics Challenges in the Insurance Industry. Noe Tuason California State Automobile Association

Addressing Analytics Challenges in the Insurance Industry. Noe Tuason California State Automobile Association Addressing Analytics Challenges in the Insurance Industry Noe Tuason California State Automobile Association Overview Two Challenges: 1. Identifying High/Medium Profit who are High/Low Risk of Flight Prospects

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

More information

Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance. Chapter 6: Behavioural models Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

More information

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This

More information

Numerical Summarization of Data OPRE 6301

Numerical Summarization of Data OPRE 6301 Numerical Summarization of Data OPRE 6301 Motivation... In the previous session, we used graphical techniques to describe data. For example: While this histogram provides useful insight, other interesting

More information

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 - Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create

More information

Dealing with Missing Data

Dealing with Missing Data Res. Lett. Inf. Math. Sci. (2002) 3, 153-160 Available online at http://www.massey.ac.nz/~wwiims/research/letters/ Dealing with Missing Data Judi Scheffer I.I.M.S. Quad A, Massey University, P.O. Box 102904

More information

Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy

Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy Astronomical Data Analysis Software and Systems XIV ASP Conference Series, Vol. XXX, 2005 P. L. Shopbell, M. C. Britton, and R. Ebert, eds. P2.1.25 Making the Most of Missing Values: Object Clustering

More information

Regression 3: Logistic Regression

Regression 3: Logistic Regression Regression 3: Logistic Regression Marco Baroni Practical Statistics in R Outline Logistic regression Logistic regression in R Outline Logistic regression Introduction The model Looking at and comparing

More information