
Lasso on Categorical Data

Yunjin Choi, Rina Park, Michael Seo

December 14, 2012

1 Introduction

In social science studies, the variables of interest are often categorical, such as race, gender, and nationality. However, it is often difficult to fit a linear model to such data, especially when some or all of the explanatory variables are categorical. When the response variable is the only categorical variable, it is common to use the logit model to overcome the defects of ordinary least squares. When the covariates are also categorical, however, they are coded into the design matrix as dummy variables. With this approach the data matrix becomes sparse, its column dimension grows, and its columns may be highly correlated. The result can be a singular data matrix, making the least squares estimate (LSE) of the coefficients impossible to compute. To avoid this pitfall, researchers use the Group Lasso (GL). Although GL has beneficial properties when dealing with categorical data, it was not devised specifically for factor variables, and there remains room for improvement. In this project we propose a Modified Group Lasso (MGL) for categorical explanatory data. It performs better than Lasso or GL in various settings, particularly when the column dimension is large and the groups are big. MGL is also robust to parameter selection and runs less risk of being critically damaged by a biased training sample.

2 Background

2.1 Lasso

Lasso is a penalized linear regression. In linear regression the underlying assumption is $E(Y \mid X = x) = x^T\beta$, where $Y \in \mathbb{R}^N$ and $X \in \mathbb{R}^{N \times p}$. While the LSE minimizes $\frac{1}{2}\|Y - X\beta\|_2^2$ with respect to $\beta$, Lasso minimizes the penalized objective $\frac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1$. Because the additional $L_1$ penalty term is non-smooth, Lasso has a variable selection property that can cope with multicollinearity in the data matrix: only the variables selected by the procedure enter the model. In this sense Lasso is a suitable method for factor data, as it addresses the difficulties described above. However, the variable selection property of Lasso creates a new problem: partial selection of dummy variables. It is not reasonable to select only a portion of the dummy variables derived from one categorical variable. Many researchers use the Group Lasso to bypass this problem.
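As an illustration of the non-smooth $L_1$ penalty and of the partial-selection issue just described, the following is a minimal sketch (ours, not part of the original report) that minimizes the lasso objective by proximal gradient descent; the data, step size, and $\lambda$ are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Element-wise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Minimize 0.5 * ||y - X beta||_2^2 + lam * ||beta||_1 by proximal gradient (ISTA)."""
    beta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2      # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)                 # gradient of the squared-error term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Illustrative data: one 4-level factor expanded into dummy columns.
rng = np.random.default_rng(0)
levels = rng.integers(0, 4, size=200)
X = np.eye(4)[levels]                               # 200 x 4 dummy matrix
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.3 * rng.standard_normal(200)
print(lasso_ista(X, y, lam=5.0))                    # dummies of the same factor may be zeroed individually
```

With a suitably large $\lambda$, the fit can zero out some dummy columns of the factor while keeping others, which is exactly the partial selection that motivates grouped penalties.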

2.2 Group Lasso (GL)

Group Lasso works like Lasso except that, at each step of the selection, it selects a pre-assigned group of variables rather than a single variable. In the categorical case we therefore group the dummy variables originating from one factor variable, so that each group represents the corresponding factor. The group selection property of GL is obtained by minimizing

    $\frac{1}{2}\Big\|Y - \sum_{l=1}^{L} X^{(l)}\beta^{(l)}\Big\|_2^2 + \lambda \sum_{l=1}^{L} \sqrt{p_l}\,\|\beta^{(l)}\|_2$.    (1)

Here $l \in \{1, 2, \ldots, L\}$ indexes the groups, $p_l$ is the size of the $l$-th group, $X^{(l)}$ is the corresponding submatrix, and $\beta^{(l)}$ is the corresponding coefficient vector. The additional $L_2$ terms (not squared) in (1) play the role of the $L_1$ term in Lasso.

3 Modified Group Lasso for a Categorical Data Matrix (MGL)

GL was developed to select an assigned group of variables at each selection stage, regardless of the data type, and it is not specialized to the categorical or mixed-data case. Note that the penalty in (1) grows as $p_l$ increases, so GL tends to exclude groups of large dimension. For categorical data, all the variables in a group jointly represent a single explanatory variable, so favoring small groups can cause severe bias. For example, a nationality variable with more than 150 levels (US, Canada, Mexico, etc.) and a gender variable with two levels (male and female) should have an equal chance of entering the model when other conditions are the same. Our Modified Group Lasso removes this effect by minimizing

    $\frac{1}{2}\Big\|Y - \sum_{l=1}^{L} X^{(l)}\beta^{(l)}\Big\|_2^2 + \lambda \sum_{l=1}^{L} \|\beta^{(l)}\|_2$.    (2)

Unlike (1), the $L_2$ terms in (2) are equally weighted. This approach ignores the group sizes in the selection procedure and agrees with Ravikumar's sparse additive model fitting idea [8]:

    $\min_f \; \frac{1}{2}\Big\|Y - \sum_{l=1}^{L} f_l(X)\Big\|_2^2 + \lambda \sum_{l=1}^{L} \|f_l(X)\|_2$.

One great advantage of this formulation is that we can still use the optimization algorithms developed for GL: GL and MGL have target functions of the same functional form, so adaptive LARS can be adopted in both cases [3]. In this project we implemented a modified LARS [8].
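To make the difference between (1) and (2) concrete, here is a small sketch (our own illustration, not the modified LARS implementation used in the project) of the block-wise shrinkage each penalty induces: group lasso shrinks block $l$ with threshold $\lambda\sqrt{p_l}$, while the modified penalty uses the same threshold $\lambda$ for every block. The group sizes, coefficient values, and $\lambda$ below are assumptions chosen to mirror the nationality/gender example.

```python
import numpy as np

def group_soft_threshold(beta_block, t):
    """Proximal operator of t * ||.||_2 on one coefficient block: shrink the whole
    block toward zero, and set it exactly to zero when its norm is at most t."""
    norm = np.linalg.norm(beta_block)
    if norm <= t:
        return np.zeros_like(beta_block)
    return (1.0 - t / norm) * beta_block

lam = 1.0
blocks = {
    "gender (2 dummy columns)": np.array([1.5, -1.5]),
    "nationality (150 dummy columns)": 0.2 * np.ones(150),
}

for name, b in blocks.items():
    p_l = b.size
    gl  = group_soft_threshold(b, lam * np.sqrt(p_l))   # group lasso: threshold grows with sqrt(p_l)
    mgl = group_soft_threshold(b, lam)                   # modified group lasso: equal threshold for all groups
    print(f"{name}: ||GL estimate|| = {np.linalg.norm(gl):.2f}, "
          f"||MGL estimate|| = {np.linalg.norm(mgl):.2f}")
```

With these numbers the size-weighted threshold eliminates the nationality block even though its coefficient norm exceeds the gender block's, purely because of the $\sqrt{p_l}$ weight; the unweighted penalty treats both groups on the same footing, which is the behaviour MGL is after.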

4 Application

4.1 Simulation

We applied MGL under various settings. In each case the total number of observations is N = 200. Each row of a data sub-matrix follows a multinomial distribution with equal cell probabilities when the corresponding group size is larger than 1, and a standard normal distribution otherwise. Thus a multinomially distributed row of the sub-matrix $X^{(l)}$ represents the dummy variables of one categorical variable. Using this categorical data matrix $X$, the response vector $Y$ was generated as $Y = X\beta + \epsilon = \sum_{l=1}^{L} X^{(l)}\beta^{(l)} + \epsilon$. The coefficient vector $\beta$ and the noise vector $\epsilon$ were adjusted to give a signal-to-noise ratio of 3, and 30 percent of the elements of $\beta$ were set to 0 in order to observe the model selection property. The fitted models were compared using two types of error: the prediction error $\|y - \hat{y}\|_2^2$ and the coefficient error $\|\beta - \hat{\beta}\|_1$.

In the figures below, black, red, and green lines represent MGL, GL, and Lasso respectively. The prediction error and the coefficient error are shown in the first and second rows, and in every graph the x-axis is the penalty coefficient $\lambda$. Each column in a figure shows the two error curves from the same simulation run. For model fitting we follow the convention of choosing the $\lambda$ that minimizes the prediction error, so to compare the performance of the methods it is reasonable to compare their minimum values over $\lambda$. In this simulation, with respect to prediction error, MGL surpasses the other methods when the number of covariates is large and the groups are big. This coincides with our minimization criterion, since when the groups are small the distinction between MGL, GL, and Lasso vanishes. With a few exceptions, MGL performs better than GL and Lasso. In terms of coefficient error, the results are unstable and no method dominates the others. This can be explained by multicollinearity in the data matrix: as mentioned above, dummy coding can introduce severe multicollinearity as the number of covariates and the group sizes grow, and under multicollinearity there may exist a linear model other than the true one that explains the data equally well. This explains the poor coefficient error despite the good prediction performance. One remarkable property of MGL is its robustness to the choice of $\lambda$: in every case its error curve has the smallest curvature, so even if a poor $\lambda$ is chosen in an application, the fitted y will not deviate catastrophically from the true y.

Figure 1: Number of covariates = 2; from left to right, categorical variables with levels 1) 1 and 4, 2) 1 and 8.
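For reference, here is a hedged sketch of the data-generating process described at the start of this section, under our reading of the setup (equal-probability multinomial dummy blocks for group sizes larger than 1, a standard normal column otherwise, a fraction of $\beta$ set to zero, and noise scaled so that the ratio of signal to noise standard deviations equals 3); the group sizes passed in are illustrative.

```python
import numpy as np

def simulate(group_sizes, n=200, snr=3.0, zero_frac=0.3, seed=0):
    """Build a design matrix from categorical/continuous groups and a response with a given SNR."""
    rng = np.random.default_rng(seed)
    blocks = []
    for p_l in group_sizes:
        if p_l > 1:
            levels = rng.integers(0, p_l, size=n)        # equal-probability categories
            blocks.append(np.eye(p_l)[levels])           # dummy coding of one factor variable
        else:
            blocks.append(rng.standard_normal((n, 1)))   # a continuous covariate
    X = np.hstack(blocks)
    beta = rng.standard_normal(X.shape[1])
    zero_idx = rng.choice(X.shape[1], size=int(zero_frac * X.shape[1]), replace=False)
    beta[zero_idx] = 0.0                                 # a fraction of coefficients is truly zero
    signal = X @ beta
    noise = rng.standard_normal(n)
    noise *= np.std(signal) / (snr * np.std(noise))      # scale noise to the target signal-to-noise ratio
    return X, signal + noise, beta

X, y, beta_true = simulate(group_sizes=[1, 4])           # e.g. one continuous variable and one 4-level factor
print(X.shape, y.shape)
```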

Figure 2: Number of covariates = 4; from left to right, categorical variables with levels 1) 2 and 4, 2) 2 and 8, 3) 1 and 8, 4) 4 and 8.

Figure 3: Number of covariates = 4; from left to right, categorical variables with levels 1) 1 and 2, 2) 1 and 4, 3) 2 and 8, 4) 4 and 8.
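The error curves in Figures 1-3 can be reproduced in outline by sweeping a grid of $\lambda$ values, refitting, and recording both errors. The sketch below (ours, with scikit-learn's plain lasso standing in for the three methods and an assumed toy data set) shows the bookkeeping, including picking the $\lambda$ that minimizes the prediction error.

```python
import numpy as np
from sklearn.linear_model import Lasso

def error_curves(X, y, beta_true, lambdas):
    """Prediction error ||y - yhat||_2^2 and coefficient error ||beta - betahat||_1
    over a grid of penalty values."""
    pred_err, coef_err = [], []
    for lam in lambdas:
        fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y)
        pred_err.append(np.sum((y - fit.predict(X)) ** 2))
        coef_err.append(np.sum(np.abs(beta_true - fit.coef_)))
    return np.array(pred_err), np.array(coef_err)

# Toy data: a 4-level factor as dummies plus one standard normal column.
rng = np.random.default_rng(1)
X = np.hstack([np.eye(4)[rng.integers(0, 4, size=200)], rng.standard_normal((200, 1))])
beta_true = np.array([1.0, -1.0, 0.0, 0.5, 2.0])
y = X @ beta_true + 0.5 * rng.standard_normal(200)

lambdas = np.linspace(0.01, 1.0, 20)
pred_err, coef_err = error_curves(X, y, beta_true, lambdas)
print("lambda minimizing prediction error:", lambdas[pred_err.argmin()])
```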

4.2 Real Data

MGL, GL, and Lasso were applied to real mixed-type data in education, collected for the 1988-92 National Education Longitudinal Study (NELS). The data consist of 278 observations with 31 covariates of mixed type, such as gender and race as factor variables, and units of math courses and average science score as quantitative variables. The response variable is the average math score.

Figure 4: 20-fold cross-validation error (CV) on the educational study data for the modified group lasso, group lasso, and lasso.

Out of the 278 data points we chose 200 at random as a training set and tested on the remaining points. Figure 4 shows the 20-fold cross-validation results; broken lines represent one standard deviation of the error for each method. The tuning parameters $\lambda$ were chosen by the one-standard-deviation rule, following the usual convention for Lasso regression: choose the largest $\lambda$ whose CV error lies within one standard deviation of the minimum, in order to avoid overfitting and bias toward the test data. MGL achieves the smallest error on the test set even though its CV error is the worst, which is a good example of the robustness of MGL. On the test set, the error ratios of MGL to Lasso, MGL to GL, and GL to Lasso were 0.947, 0.951, and 1.005 respectively, showing an improvement for MGL. The CV performance could be improved with more factor variables.
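The one-standard-deviation rule used to pick $\lambda$ here can be stated in a few lines. This is a minimal sketch under our own variable names, assuming the cross-validated error and its spread have already been computed on a $\lambda$ grid (the numbers are illustrative, not the NELS results).

```python
import numpy as np

def one_sd_rule(lambdas, cv_error, cv_sd):
    """Choose the largest lambda whose CV error is within one standard deviation of the
    minimum CV error (a larger penalty gives a simpler, less overfit model)."""
    best = np.argmin(cv_error)
    threshold = cv_error[best] + cv_sd[best]
    eligible = np.where(cv_error <= threshold)[0]
    return lambdas[eligible].max()

# Illustrative values only.
lambdas = np.linspace(0.1, 2.0, 10)
cv_error = np.array([5.20, 5.00, 4.90, 4.85, 4.90, 5.00, 5.05, 5.10, 5.15, 5.20])
cv_sd = np.full(10, 0.08)
print(one_sd_rule(lambdas, cv_error, cv_sd))
```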

5 Conclusion and Discussion

Among the linear models introduced above, MGL performs relatively well. It has two main advantages on categorical data: small prediction error and robustness to parameter selection. With these advantages it can be applied across the wide range of academic fields that deal with categorical data, such as social science and education. Future research goals are to 1) characterize, both theoretically and empirically, the cases in which MGL dominates, and 2) develop an analogous approach for logistic regression.

References

[1] Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.

[2] Park, M. and Hastie, T. Regularization path algorithms for detecting gene interactions. Available at http://www.stat.stanford.edu/~hastie/pub.htm, 2006.

[3] Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49-67, 2006.

[4] Chesneau, C. and Hebiri, M. Some theoretical results on the grouped variables Lasso. Mathematical Methods of Statistics, 17:317-326, 2008.

[5] Friedman, J. H., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 2008.

[6] Wang, H. and Leng, C. A note on adaptive group lasso. Computational Statistics and Data Analysis, 2008.

[7] Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301-320, 2005.

[8] Ravikumar, P., Lafferty, J., Liu, H., and Wasserman, L. Sparse additive models. Journal of the Royal Statistical Society, Series B, 71(5):1009-1030, 2009.

[9] Kim, S. and Xing, E. Tree-guided group lasso for multi-task regression with structured sparsity. Proceedings of the 27th International Conference on Machine Learning, 2010.

[10] Simon, N. and Tibshirani, R. Standardization and the group lasso penalty. Statistica Sinica, 22:983-1001, 2012.