DATA SCIENCE Workshop November 12-13, 2015

Transcription

1 DATA SCIENCE Workshop November 12-13, 2015 Paris-Dauphine University Place du Maréchal de Lattre de Tassigny, Paris DATA SCIENCE Workshop will be held at Paris-Dauphine university, on November 12 th and 13 th, 2015 «DATA SCIENCE» Workshop (cf. is a satellite workshop of SDA2015 (cf. ) Organizers: Edwin Diday, Patrice Bertrand (CEREMADE Paris-Dauphine University) Tristan Cazenave, Suzanne Pinson (LAMSADE Paris-Dauphine University) Registration is free but mandatory due to a limited number of participants. To register, please send an to E. Diday (diday@ceremade.dauphine.fr) Interested students may have their travel expenses reimbursed; Please send your request with a CV.when your inscription is confirmed. CONTEXT AND AIM OF THIS «DATA SCIENCE» SATELLITE WORKSHOP. A Data Scientist is someone who is able to extract knew knowledge from Standard, Big and Complex Data: unstructured data, unpaired samples, multi sources data (as mixture of numerical, textual, image, social networks data). The fusion of such data can be done into classes of row statistical units which are considered as new statistical units. The description of these classes can be vectors of intervals, probability distributions, weighted sequences, functions, and the like, in order to express the within-class variability. One of the advantage of this approach is that unstructured data and unpaired samples at the level of row units, become structured and paired at the level of classes. The study of such new type of data, built in order to describe classes in an explanatory way, has led to a new domain called Symbolic Data Analysis (SDA). Recently, four international journals have published special issues on SDA, including ADAC which is now known as a leading international journal of classification.

2 In this satellite meeting of the next SDA 2015 workshop (Orléans, November 17-19, the talks will concern the state of the art and recent advances in SDA, or more generally visualization in Data Science. SCHEDULE Nov.12 th. Welcome speech: 14:00 to 14:15 Lynne Billard: 14:15 to 15:15 Data Science and Statistics followed by Maximum Likelihood Estimation for Interval-valued Data Chun-houh Chen: 15:15 to 15:45 Matrix Visualization: New Generation of Exploratory Data Analysis Coffee Break: 15:45 to 16:15 Oldemar Rodrıguez: 16:15 to 16:45 Shrinkage linear regression for symbolic interval-valued variables Edwin Diday, Richard Emilion: 16:45 to 17:30 Nov. 13 th Symbolic Bayesian Networks Welcome Breakfast: 8h30 Edwin Diday: 9:15 to 10:00 Thinking by classes in Data Sciences: the Symbolic Data Analysis paradigm for Big and Complex Data Oldemar Rodrıguez:: 10:00 to 10:45 Probabilistic/statistical setting of SDA Coffee Break: 10:45 to 11:00 Richard Emilion 11:00 to 11:45 Latest developments of the RSDA: An R package for Symbolic Data Analysis Lunch: 12:00 to 14:00 Manabu Ichino: 14:00 to 14:45 The Lookup Table Regression Model for Symbolic Data Paula Brito: 14:45 to 15:30 Multivariate Parametric Analysis of Interval Data Coffee Break: 15:30 to 16:00 Chun-houh Chen: 16:00 to 16:45 Some Extensions of Matrix Visualization: the GAP Approach for Standard and Symbolic Data Analysis. Cheng Wang: 16:45 to 17:30 Multiple Correspondence Analysis for Mixed Symbolic Data

3 ABSTRACTS Nov. 12 th Lynne Billard (University of Georgia, USA) Title: Maximum Likelihood Estimation for Interval-valued Data Abstract: Bertrand and Goupil (2000) obtained empirical formulas for the mean and variance of interval-valued observations. These are in effect moment estimators. We show how, under certain probability assumptions, these are the same as the maximum likelihood estimators for the corresponding population parameters. Chun-houh Chen (Institute of Stat. Science, Academia Sinica, Taiwan) Title: Matrix Visualization: New Generation of Exploratory Data Analysis Abstract: It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it (Exploratory Data Analysis: John Tukey, 1977). Data analysts and statistics practitioners nowadays are facing difficulties in understanding higher and higher dimensional data with more and more complex nature while conventional graphics/visualization tools do not answer the needs. It is statisticians responsibility for coming up with graphics/visualization environment that can help users really understand what one CAN DO for complex data generated from modern techniques and sophisticated experiments. Matrix visualization (MV) for continuous, binary, ordinal, and nominal data with various types of extensions provide users more comprehensive information embedded in complex high dimensional data than conventional EDA tools such as boxplot, scatterplot, with dimension reduction techniques such as principal component analysis and multiple correspondence analysis. In this talk I ll summarize our works on creating MV environment for conducting statistical analyses and introducing statistical concepts into MV environment for visualizing more versatile and complex data structure. Many real world examples will be demonstrated in this talk for illustrating the strength of MV for visualizing all types of datasets collected from scientific experiments and social surveys. Oldemar Rodrıguez (University of Costa Rica, San José, Costa Rica) Title: Shrinkage linear regression for symbolic interval-valued variables Abstract: This paper proposes a new approach to fit a linear regression for symbolic internalvalued variables, which improves both the Center Method suggested by Billard and Diday (2006) and the Center and Range Method suggested by Lima-Neto, E.A. and De Carvalho, F.A.T. (2008). Just in the Centers Method and the Center and Range Method, the new methods proposed fit the linear regression model on the midpoints and in the half of the length of the intervals as an additional variable (ranges) assumed by the predictor variables in the training data set, but to make these fitments in the regression models, the methods Ridge Regression, Lasso, and Elastic Net proposed by Tibshirani, R. Hastie, T., and Zou H are used. The prediction of the lower and

4 upper of the interval response (dependent) variable is carried out from their midpoints and ranges, which are estimated from the linear regression models with shrinkage generated in the midpoints and the ranges of the interval-valued predictors. Methods presented in this document are applied to three real data sets cardiologic interval data set, Prostate interval data set and US Murder interval data set to then compare their performance and facility of interpretation regarding the Center Method and the Center and Range Method. For this evaluation, the rootmean-squared error and the correlation coefficient are used. Besides, the reader may use all the methods presented herein and verify the results using the RSDA package written in R language, that can be downloaded and installed directly from CRAN. Edwin Diday (Ceremade, University Paris-Dauphine, France), Richard Emilion (University of Orléans, France) Title: Symbolic Bayesian Networks Abstract: We first consider a n x p table of probability vectors. Each vector in column j is the probability distribution of a random variable taking values in a finite set Vj that only depends on j, with Vj = dj. Column j is considered as a sample of size n of a random distribution Pj on Vj. We are considering the problem of building a Bayesian network from these samples in order to express the joint distribution of (P1,, Pj,..., Pp). This problem is very popular for estimating the joint distribution of p real-valued random variables, we extend it to the case it of random distributions. A first solution to the finite sets case consists in discretizing the probability vectors. A second solution consists in using partial distance correlations (Székély-Rizzo, Ann. Stat. 2014) that evaluate the influence of Pj on Pj'. The general case will be discussed. Nov. 13 th Edwin Diday (Ceremade, Paris-Dauphine University, France) Title: Thinking by classes in Data Sciences: the Symbolic Data Analysis paradigm for Big and Complex Data Abstract: Data science is, in general terms, the extraction of knowledge from data, considered as a science by itself. The Symbolic Data Analysis (SDA) gives a new way of thinking in Data Sciences by extending standard data to symbolic data in order to extract knowledge from aggregated classes of individual entities. The SDA is born from the classification domain by considering classes of a given population to be units of a higher level population to be studied. Such classes allow a summary of the population and often represent the real units of interest. In order to take care of the variability between the members of each class, these classes are described by intervals, distributions, set of categories or numbers sometimes weighted and the like. In that way, we obtain new kinds of data expressing variability, called "symbolic" as they cannot be reduced to numbers without losing much information. The aim of SDA is to study and

5 extract new knowledge from these new kinds of data by at least an extension of Computer Statistics and Data Mining to symbolic data. We show that SDA is a new paradigm which opens up a vast domain of research and applications to standard, complex and big data. Richard Emilion (University of Orléans, France) Title: Probabilistic/statistical setting of SDA Abstract: Given some raw units described by some variables and a specific class variable, Symbolic Data Analysis (SDA) deals with objects described by probability distributions describing classes of raw units. Our SDA formalism hinges on the notion of random distribution. In the case of paired samples, we show that this formalism depends on a regular conditional probability existence theorem. We also show the interest of SDA in the case of unpaired samples. We then discuss on the extension of some classical methods such as PCA and probabilistic classification. Oldemar Rodrıguez (University of Costa Rica, San José, Costa Rica) Title: Latest developments of the RSDA: An R package for Symbolic Data Analysis Abstract: This package aims to execute some models on Symbolic Data Analysis. Symbolic Data Analysis was propose by the professor E. DIDAY in 1987 in his paper Introduction à l approche symbolique en Analyse des Données. Premières Journées Symbolique-Numérique. Université Paris IX Dauphine. Décembre A very good reference to symbolic data analysis can be found in From the Statistics of Data to the Statistics of Knowledge: Symbolic Data Analysis of L. Billard and E. Diday that is the journal American Statistical Association Journal of the American Statistical Association June 2003, Vol. 98. The main purpose of Symbolic Data Analysis is to substitute a set of rows (cases) in a data table for an concept (second order statistical unit). For example, all of the transactions performed by one person (or any object) for a single transaction that summarizes all the original ones (Symbolic-Object) so that millions of transactions could be summarized in only one that keeps the customary behavior of the person. This is achieved thanks to the fact that the new transaction will have in its fields, not only numbers (like current transactions), but can also have objects such as intervals, histograms, or rules. This representation of an object as a conjunction of properties fits within a data analytic framework concerning symbolic data and symbolic objects, which has proven useful in dealing with big databases. In RSDA version 1.2, methods like centers interval principal components analysis, histogram principal components analysis, multi-valued correspondence analysis, interval multidemensional scaling (INTERSCAL), symbolic hierarchical clustering, CM, CRM, Lasso, Ridge and Elastic Net Linear regression model to interval variables have been implemented. This new version also includes new features to manipulate symbolic data through a new data structure that implements Symbolic Data Frames and methods for converting SODAS and XML SODAS files to RSDA files. Manabu Ichino (Tokyo Denki University, Japan) Title: The Lookup Table Regression Model for Symbolic Data Abstract: This paper presents a preliminary research on the lookup table regression model for symbolic data. We apply the quantile method to the given symbolic data table of the size (N

6 objects) (d feature variables), and we represent each object by (m+1) d-dimensional numerical vectors, called the quantile vectors, for a preselected integer m. The integer m controls the granularity of the descriptions for symbolic objects. In the new data table of the size {N (m+1)} d, we interchange N (m+1) rows according to the values of the selected objective variable from the smallest to the largest. For each of remained d-1 features, i.e., columns, we execute the segmentation of feature values into blocks so that the generated blocks satisfy the monotone property. We discard columns that have only a single block. Then, we execute the segmentation of the objective variable according to the blocks of the remained explanatory feature variables. Finally, we obtain the lookup table of the size N d, where N N (m+1) and d d. Each element of the table is an interval value corresponding to the segmented block. We realize the interval-value estimation rule for the objective variable by the search of the nearest element in the lookup table. We present examples to illustrate the lookup table regression model. Paula Brito (Porto University, Portugal) Title: Multivariate Parametric Analysis of Interval Data Abstract: In this work we focus on the study of interval data, i.e., when the variables' values are intervals of R. Parametric probabilistic models for interval-valued variables have been proposed and studied in (Brito & Duarte Silva, 2012). These models are based on the representation of each observed interval by its MidPoint and LogRange, and Multivariate Normal and Skew-Normal distributions are assumed for the whole set of 2p MidPoints and LogRanges of the original p interval-valued variables. The intrinsic nature of the interval-valued variables leads to different structures of the variance-covariance matrix, represented by different possible configurations. For all cases, maximum likelihood estimators of the corresponding parameters have been derived. This framework may be applied to different statistical multivariate methodologies, thereby allowing for inference approaches for symbolic data. The proposed modelling has first been applied to (M)ANOVA of interval data, using a likelihood-ratio approach. Linear and quadratic models for discriminant analysis of data described by interval-valued variables have been obtained, and their performance compared with alternative distance-based approaches. We have also addressed the problem of mixture distributions, developing model-based clustering using the proposed models. For the Gaussian model, the problem of outlier identification is addressed, using Mahalanobis distances based on robust estimations of the joint mean values and the covariance matrices. The referred modelling, for the Gaussian case, has been implemented in the R-package MAINT.DATA, available on CRAN. MAINT.DATA introduces a data class for representing interval data and includes functions for modeling and analysing these data. In particular, maximum likelihood estimation and statistical tests for the different considered configurations are addressed. Methods for (M)ANOVA and Linear and Quadratic Discriminant Analysis of this data class are also currently provided. Chun-houh Chen (Institute of Stat. Science, Academia Sinica, Taiwan) Title: Some Extensions of Matrix Visualization: the GAP Approach

7 Abstract: Exploratory data analysis (EDA, Tukey, 1977) has been extensively used for nearly 40 years yet boxplot and scatterplot are still the major EDA tools for visualizing continuous data in the 21st century. Many extended modules of matrix visualization via the Generalized Association Plots (GAP) approach have been developed or under developing. Some details of the following MV modules will be provided in this talk: 1. Matrix visualization for high-dimensional categorical data structure. For categorical data, MCA (multiple correspondence analysis) is most popular for visualizing reduced joint space for samples and variables of categorical nature. But similar to it s continuous counter part: PCA (principal component analysis), MCA loses its efficiency when data dimensionality gets really high. In this study we extend the framework of matrix visualization from continuous data to categorical data. Categorical matrix visualization can effectively present complex information patterns for thousands of subjects on thousands of categorical variables in a single matrix visualization display. 2. Matrix Visualization for High-Dimensional Data with a Cartography Link. When a cartography link is attached to each subject of a high-dimensional categorical data, it is necessary to use a geographical map to illustrate the pattern of subject (region)-clusters with variable-groups embedded in the high-dimensional space. This study presents an interactive cartography system with systematic color-coding by integrating the homogeneity analysis into matrix visualization. 3. Matrix visualization for symbolic data analysis. Symbolic data analysis (SDA) has gained popularity over the past few years because of its potential for handling data having a dependent and hierarchical nature. Here we introduce matrix visualization (MV) for visualizing and clustering SDA data using interval-valued symbolic data as an example; it is by far the most popular SDA data type in the literature and the most commonly encountered one in practice. Many MV techniques for visualizing and clustering conventional data are converted to SDA data, and several techniques are newly developed for SDA data. Various examples of data with simple to complex structures are brought in to illustrate the proposed methods. 4. Covariate-adjusted matrix visualization via correlation decomposition. In this study, we extend the framework of matrix visualization (MV) by incorporating a covariate adjustment process through the estimation of conditional correlations. MV can explore the grouping and/or clustering structure of high-dimensional large-scale data sets effectively without dimension reduction. The benefit is in the exploration of conditional association structures among the subjects or variables that cannot be done with conventional MV. Several biomedical examples will be employed for illustrating the versatility of the GAP approach matrix visualization. (cf. Cheng Wang (Beihang University), Edwin Diday, Richard Emilion, Huiwen Wang Title: Multiple Correspondence Analysis for Mixed Symbolic Data Abstract: Under the circumstance of cross-platform data collection technology develops rapidly and the big data era is coming, there are always a mixture of single-valued data, histogram data, composition data and functional data in one table, which can be called mixed feature-data. Different types of data may be belong to different space, which leads to that it is a pretty complicated problem to conduct crosstab analysis among several data types. In this paper, we propose a Multiple Correspondence Analysis (MCA) for mixed data to detect and represent underlying structures involved. Before MCA, we first transfer different types of data to vector

8 data, which is further converted to nominal data. Two ways are considered to convert the vector data to nominal data, respectively is hierarchal clustering and discretization. An empirical analysis is conducted to compare the performance of MCA for mixed data based on these two different ways.