DATA SCIENCE Workshop November 12-13, 2015
Paris-Dauphine University
Place du Maréchal de Lattre de Tassigny, Paris

The DATA SCIENCE Workshop will be held at Paris-Dauphine University on November 12th and 13th, 2015. It is a satellite workshop of SDA2015.

Organizers:
Edwin Diday, Patrice Bertrand (CEREMADE, Paris-Dauphine University)
Tristan Cazenave, Suzanne Pinson (LAMSADE, Paris-Dauphine University)

Registration is free but mandatory because the number of participants is limited. To register, please send an e-mail to E. Diday (diday@ceremade.dauphine.fr). Interested students may have their travel expenses reimbursed; please send your request with a CV once your registration is confirmed.

CONTEXT AND AIM OF THIS «DATA SCIENCE» SATELLITE WORKSHOP

A Data Scientist is someone who is able to extract new knowledge from Standard, Big and Complex Data: unstructured data, unpaired samples, and multi-source data (such as mixtures of numerical, textual, image, and social network data). Such data can be fused into classes of row statistical units, which are then considered as new statistical units. These classes can be described by vectors of intervals, probability distributions, weighted sequences, functions, and the like, in order to express the within-class variability. One advantage of this approach is that data that are unstructured and unpaired at the level of row units become structured and paired at the level of classes. The study of these new types of data, built to describe classes in an explanatory way, has led to a new domain called Symbolic Data Analysis (SDA). Recently, four international journals have published special issues on SDA, including ADAC, which is now known as a leading international journal of classification.
In this satellite meeting of the upcoming SDA 2015 workshop (Orléans, November 17-19), the talks will concern the state of the art and recent advances in SDA or, more generally, visualization in Data Science.

SCHEDULE

Nov. 12th
Welcome speech: 14:00 to 14:15
Lynne Billard: 14:15 to 15:15 Data Science and Statistics, followed by Maximum Likelihood Estimation for Interval-valued Data
Chun-houh Chen: 15:15 to 15:45 Matrix Visualization: New Generation of Exploratory Data Analysis
Coffee Break: 15:45 to 16:15
Oldemar Rodríguez: 16:15 to 16:45 Shrinkage linear regression for symbolic interval-valued variables
Edwin Diday, Richard Emilion: 16:45 to 17:30 Symbolic Bayesian Networks

Nov. 13th
Welcome Breakfast: 8:30
Edwin Diday: 9:15 to 10:00 Thinking by classes in Data Sciences: the Symbolic Data Analysis paradigm for Big and Complex Data
Oldemar Rodríguez: 10:00 to 10:45 Probabilistic/statistical setting of SDA
Coffee Break: 10:45 to 11:00
Richard Emilion: 11:00 to 11:45 Latest developments of the RSDA: An R package for Symbolic Data Analysis
Lunch: 12:00 to 14:00
Manabu Ichino: 14:00 to 14:45 The Lookup Table Regression Model for Symbolic Data
Paula Brito: 14:45 to 15:30 Multivariate Parametric Analysis of Interval Data
Coffee Break: 15:30 to 16:00
Chun-houh Chen: 16:00 to 16:45 Some Extensions of Matrix Visualization: the GAP Approach for Standard and Symbolic Data Analysis
Cheng Wang: 16:45 to 17:30 Multiple Correspondence Analysis for Mixed Symbolic Data
ABSTRACTS

Nov. 12th

Lynne Billard (University of Georgia, USA)
Title: Maximum Likelihood Estimation for Interval-valued Data
Abstract: Bertrand and Goupil (2000) obtained empirical formulas for the mean and variance of interval-valued observations. These are in effect moment estimators. We show how, under certain probability assumptions, these are the same as the maximum likelihood estimators for the corresponding population parameters.

Chun-houh Chen (Institute of Statistical Science, Academia Sinica, Taiwan)
Title: Matrix Visualization: New Generation of Exploratory Data Analysis
Abstract: "It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it" (Exploratory Data Analysis: John Tukey, 1977). Data analysts and statistics practitioners nowadays face difficulties in understanding data of higher and higher dimension and of more and more complex nature, while conventional graphics and visualization tools do not meet their needs. It is the statisticians' responsibility to come up with a graphics/visualization environment that can help users really understand what one CAN DO with complex data generated by modern techniques and sophisticated experiments. Matrix visualization (MV) for continuous, binary, ordinal, and nominal data, with various types of extensions, provides users with more comprehensive information embedded in complex high-dimensional data than conventional EDA tools such as the boxplot and scatterplot combined with dimension reduction techniques such as principal component analysis and multiple correspondence analysis. In this talk I'll summarize our work on creating an MV environment for conducting statistical analyses and on introducing statistical concepts into the MV environment for visualizing more versatile and complex data structures. Many real-world examples will be demonstrated to illustrate the strength of MV for visualizing all types of datasets collected from scientific experiments and social surveys.
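As an illustration of the empirical mean and variance of interval-valued observations mentioned in L. Billard's abstract above, here is a minimal Python sketch. It assumes the commonly cited form of these moment estimators, in which each observed interval [a_i, b_i] is treated as uniformly distributed; the function name `interval_mean_variance` is hypothetical and is not taken from the talk or from any package.

```python
from typing import Sequence, Tuple


def interval_mean_variance(
    intervals: Sequence[Tuple[float, float]]
) -> Tuple[float, float]:
    """Empirical mean and variance of interval-valued observations.

    Assumed formulas (values uniformly spread within each interval):
        mean     = (1 / 2n) * sum(a_i + b_i)
        variance = (1 / 3n) * sum(a_i^2 + a_i*b_i + b_i^2) - mean^2
    """
    n = len(intervals)
    if n == 0:
        raise ValueError("at least one interval is required")
    mean = sum(a + b for a, b in intervals) / (2 * n)
    second_moment = sum(a * a + a * b + b * b for a, b in intervals) / (3 * n)
    return mean, second_moment - mean ** 2


# Two adjacent intervals behave like one uniform distribution on [0, 4].
mean, variance = interval_mean_variance([(0.0, 2.0), (2.0, 4.0)])
```

When every interval is degenerate (a_i = b_i), these expressions reduce to the usual sample mean and the variance with divisor n, which is a quick sanity check on the sketch.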
Oldemar Rodríguez (University of Costa Rica, San José, Costa Rica)
Title: Shrinkage linear regression for symbolic interval-valued variables
Abstract: This paper proposes a new approach to fitting a linear regression for symbolic interval-valued variables, which improves both the Center Method suggested by Billard and Diday (2006) and the Center and Range Method suggested by Lima-Neto, E.A. and De Carvalho, F.A.T. (2008). Just as in the Center Method and the Center and Range Method, the proposed methods fit the linear regression model on the midpoints and, as an additional variable, on the half-lengths of the intervals (ranges) assumed by the predictor variables in the training data set. To fit these regression models, however, the Ridge Regression, Lasso, and Elastic Net methods proposed by Tibshirani, R., Hastie, T., and Zou, H. are used. The lower and upper bounds of the interval response (dependent) variable are predicted from their midpoints and ranges, which are estimated from the linear regression models with shrinkage fitted on the midpoints and the ranges of the interval-valued predictors. The methods presented in this document are applied to three real data sets (the cardiologic interval data set, the Prostate interval data set and the US Murder interval data set) in order to compare their performance and ease of interpretation with those of the Center Method and the Center and Range Method. For this evaluation, the root-mean-squared error and the correlation coefficient are used. In addition, the reader may use all the methods presented herein and verify the results using the RSDA package written in the R language, which can be downloaded and installed directly from CRAN.

Edwin Diday (CEREMADE, Paris-Dauphine University, France), Richard Emilion (University of Orléans, France)
Title: Symbolic Bayesian Networks
Abstract: We first consider an n × p table of probability vectors. Each vector in column j is the probability distribution of a random variable taking values in a finite set Vj that depends only on j, with |Vj| = dj. Column j is considered as a sample of size n of a random distribution Pj on Vj. We consider the problem of building a Bayesian network from these samples in order to express the joint distribution of (P1, ..., Pj, ..., Pp). This problem is well studied for estimating the joint distribution of p real-valued random variables; we extend it to the case of random distributions. A first solution in the finite-sets case consists in discretizing the probability vectors. A second solution consists in using partial distance correlations (Székely-Rizzo, Ann. Stat. 2014) that evaluate the influence of Pj on Pj'. The general case will be discussed.

Nov. 13th

Edwin Diday (CEREMADE, Paris-Dauphine University, France)
Title: Thinking by classes in Data Sciences: the Symbolic Data Analysis paradigm for Big and Complex Data
Abstract: Data science is, in general terms, the extraction of knowledge from data, considered as a science in itself. Symbolic Data Analysis (SDA) offers a new way of thinking in Data Sciences by extending standard data to symbolic data in order to extract knowledge from aggregated classes of individual entities. SDA was born from the classification domain by considering classes of a given population to be units of a higher-level population to be studied. Such classes allow a summary of the population and often represent the real units of interest. In order to take into account the variability between the members of each class, these classes are described by intervals, distributions, sets of categories or numbers (sometimes weighted), and the like. In that way, we obtain new kinds of data expressing variability, called "symbolic" because they cannot be reduced to numbers without losing much information. The aim of SDA is to study and extract new knowledge from these new kinds of data by, at the least, an extension of Computer Statistics and Data Mining to symbolic data. We show that SDA is a new paradigm which opens up a vast domain of research and applications to standard, complex and big data.

Richard Emilion (University of Orléans, France)
Title: Probabilistic/statistical setting of SDA
Abstract: Given some raw units described by some variables and a specific class variable, Symbolic Data Analysis (SDA) deals with objects described by probability distributions describing classes of raw units. Our SDA formalism hinges on the notion of random distribution. In the case of paired samples, we show that this formalism depends on a regular conditional probability existence theorem. We also show the interest of SDA in the case of unpaired samples. We then discuss the extension of some classical methods such as PCA and probabilistic classification.

Oldemar Rodríguez (University of Costa Rica, San José, Costa Rica)
Title: Latest developments of the RSDA: An R package for Symbolic Data Analysis
Abstract: This package aims to execute some models on Symbolic Data Analysis. Symbolic Data Analysis was proposed by Professor E. Diday in 1987 in his paper "Introduction à l'approche symbolique en Analyse des Données", Premières Journées Symbolique-Numérique, Université Paris IX Dauphine, Décembre. A very good reference on symbolic data analysis is "From the Statistics of Data to the Statistics of Knowledge: Symbolic Data Analysis" by L. Billard and E. Diday, published in the Journal of the American Statistical Association, June 2003, Vol. 98. The main purpose of Symbolic Data Analysis is to replace a set of rows (cases) in a data table with a concept (second-order statistical unit).
For example, all of the transactions performed by one person (or any object) can be replaced by a single transaction that summarizes all the original ones (a Symbolic Object), so that millions of transactions can be summarized in only one that preserves the customary behavior of the person. This is possible because the fields of the new transaction can hold not only numbers (as in ordinary transactions) but also objects such as intervals, histograms, or rules. This representation of an object as a conjunction of properties fits within a data analytic framework concerning symbolic data and symbolic objects, which has proven useful in dealing with big databases. In RSDA version 1.2, methods such as centers interval principal components analysis, histogram principal components analysis, multi-valued correspondence analysis, interval multidimensional scaling (INTERSCAL), symbolic hierarchical clustering, and CM, CRM, Lasso, Ridge and Elastic Net linear regression models for interval variables have been implemented. This new version also includes new features for manipulating symbolic data through a new data structure that implements Symbolic Data Frames, as well as methods for converting SODAS and XML SODAS files to RSDA files.

Manabu Ichino (Tokyo Denki University, Japan)
Title: The Lookup Table Regression Model for Symbolic Data
Abstract: This paper presents preliminary research on the lookup table regression model for symbolic data. We apply the quantile method to the given symbolic data table of size (N objects) × (d feature variables), and we represent each object by (m+1) d-dimensional numerical vectors, called the quantile vectors, for a preselected integer m. The integer m controls the granularity of the descriptions of the symbolic objects. In the new data table of size {N × (m+1)} × d, we reorder the N × (m+1) rows according to the values of the selected objective variable, from the smallest to the largest. For each of the remaining d-1 features, i.e., columns, we segment the feature values into blocks so that the generated blocks satisfy the monotone property. We discard columns that have only a single block. Then, we segment the objective variable according to the blocks of the remaining explanatory feature variables. Finally, we obtain the lookup table of size N' × d', where N' ≤ N × (m+1) and d' ≤ d. Each element of the table is an interval value corresponding to the segmented block. We realize the interval-value estimation rule for the objective variable by searching for the nearest element in the lookup table. We present examples to illustrate the lookup table regression model.

Paula Brito (Porto University, Portugal)
Title: Multivariate Parametric Analysis of Interval Data
Abstract: In this work we focus on the study of interval data, i.e., data for which the variables' values are intervals of R. Parametric probabilistic models for interval-valued variables have been proposed and studied in (Brito & Duarte Silva, 2012). These models are based on the representation of each observed interval by its MidPoint and LogRange, and Multivariate Normal and Skew-Normal distributions are assumed for the whole set of 2p MidPoints and LogRanges of the original p interval-valued variables. The intrinsic nature of the interval-valued variables leads to different structures of the variance-covariance matrix, represented by different possible configurations. For all cases, maximum likelihood estimators of the corresponding parameters have been derived.
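The MidPoint and LogRange representation described in this abstract can be sketched as follows. This is a minimal illustration, not the MAINT.DATA implementation: it assumes that LogRange means the natural logarithm of the interval length, it covers only the Gaussian model with an unrestricted covariance matrix, and the function names `midpoint_logrange` and `gaussian_mle` are hypothetical.

```python
import numpy as np


def midpoint_logrange(lower, upper):
    """Map n observations of p intervals to an n x 2p matrix.

    The first p columns are the MidPoints, the last p the LogRanges
    (assumed here to be log(upper - lower)).
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if lower.shape != upper.shape or lower.ndim != 2:
        raise ValueError("lower and upper must be n x p arrays of equal shape")
    if np.any(upper <= lower):
        raise ValueError("the log-range requires upper > lower everywhere")
    return np.hstack([(lower + upper) / 2.0, np.log(upper - lower)])


def gaussian_mle(z):
    """Maximum likelihood estimates of a multivariate normal on the rows of z.

    The covariance uses divisor n (the ML estimate), not n - 1.
    """
    z = np.asarray(z, dtype=float)
    mu = z.mean(axis=0)
    centered = z - mu
    sigma = centered.T @ centered / z.shape[0]
    return mu, sigma


# Two observations of p = 2 interval-valued variables (made-up numbers).
lower = np.array([[0.0, 1.0], [2.0, 3.0]])
upper = np.array([[2.0, 2.0], [4.0, 5.0]])
z = midpoint_logrange(lower, upper)   # shape (2, 4)
mu, sigma = gaussian_mle(z)           # 4-vector and 4 x 4 matrix
```

The restricted covariance configurations mentioned in the abstract would constrain blocks of `sigma`; they are not reproduced in this sketch.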
This framework may be applied to different statistical multivariate methodologies, thereby allowing for inference approaches for symbolic data. The proposed modelling has first been applied to (M)ANOVA of interval data, using a likelihood-ratio approach. Linear and quadratic models for discriminant analysis of data described by interval-valued variables have been obtained, and their performance compared with alternative distance-based approaches. We have also addressed the problem of mixture distributions, developing model-based clustering using the proposed models. For the Gaussian model, the problem of outlier identification is addressed, using Mahalanobis distances based on robust estimates of the joint mean values and the covariance matrices. This modelling, for the Gaussian case, has been implemented in the R package MAINT.DATA, available on CRAN. MAINT.DATA introduces a data class for representing interval data and includes functions for modelling and analysing these data. In particular, maximum likelihood estimation and statistical tests for the different configurations considered are addressed. Methods for (M)ANOVA and Linear and Quadratic Discriminant Analysis of this data class are also currently provided.

Chun-houh Chen (Institute of Statistical Science, Academia Sinica, Taiwan)
Title: Some Extensions of Matrix Visualization: the GAP Approach
Abstract: Exploratory data analysis (EDA, Tukey, 1977) has been used extensively for nearly 40 years, yet the boxplot and scatterplot are still the major EDA tools for visualizing continuous data in the 21st century. Many extended modules of matrix visualization via the Generalized Association Plots (GAP) approach have been developed or are under development. Some details of the following MV modules will be provided in this talk:
1. Matrix visualization for high-dimensional categorical data structure. For categorical data, MCA (multiple correspondence analysis) is the most popular method for visualizing the reduced joint space of samples and variables of categorical nature. But, like its continuous counterpart PCA (principal component analysis), MCA loses its efficiency when data dimensionality gets really high. In this study we extend the framework of matrix visualization from continuous data to categorical data. Categorical matrix visualization can effectively present complex information patterns for thousands of subjects on thousands of categorical variables in a single matrix visualization display.
2. Matrix Visualization for High-Dimensional Data with a Cartography Link. When a cartography link is attached to each subject of a high-dimensional categorical data set, it is necessary to use a geographical map to illustrate the pattern of subject (region) clusters with variable groups embedded in the high-dimensional space. This study presents an interactive cartography system with systematic color-coding, obtained by integrating homogeneity analysis into matrix visualization.
3. Matrix visualization for symbolic data analysis. Symbolic data analysis (SDA) has gained popularity over the past few years because of its potential for handling data having a dependent and hierarchical nature.
Here we introduce matrix visualization (MV) for visualizing and clustering SDA data, using interval-valued symbolic data as an example; it is by far the most popular SDA data type in the literature and the one most commonly encountered in practice. Many MV techniques for visualizing and clustering conventional data are adapted to SDA data, and several techniques are newly developed for SDA data. Various examples of data with simple to complex structures are used to illustrate the proposed methods.
4. Covariate-adjusted matrix visualization via correlation decomposition. In this study, we extend the framework of matrix visualization (MV) by incorporating a covariate adjustment process through the estimation of conditional correlations. MV can explore the grouping and/or clustering structure of high-dimensional large-scale data sets effectively without dimension reduction. The benefit lies in the exploration of conditional association structures among the subjects or variables, which cannot be done with conventional MV. Several biomedical examples will be used to illustrate the versatility of GAP-approach matrix visualization.

Cheng Wang (Beihang University), Edwin Diday, Richard Emilion, Huiwen Wang
Title: Multiple Correspondence Analysis for Mixed Symbolic Data
Abstract: As cross-platform data collection technology develops rapidly and the big data era arrives, a single table often contains a mixture of single-valued data, histogram data, compositional data and functional data, which can be called mixed feature data. Different types of data may belong to different spaces, which makes cross-tabulation analysis across several data types a rather complicated problem. In this paper, we propose a Multiple Correspondence Analysis (MCA) for mixed data to detect and represent the underlying structures involved. Before MCA, we first transform the different types of data into vector data, which are then converted to nominal data. Two ways of converting the vector data to nominal data are considered: hierarchical clustering and discretization. An empirical analysis is conducted to compare the performance of MCA for mixed data based on these two approaches.
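The discretization route mentioned in the abstract above can be sketched as follows. This is only an illustrative assumption, not the authors' procedure: it uses equal-frequency (quantile) binning for a single numeric variable and then builds the 0/1 indicator matrix on which MCA is conventionally computed. The function names `discretize_to_nominal` and `indicator_matrix` are hypothetical.

```python
import numpy as np


def discretize_to_nominal(values, n_bins=3):
    """Turn a numeric variable into nominal labels 0 .. n_bins - 1.

    Cut points are the interior quantiles, so the bins hold roughly
    equal numbers of observations (an assumed binning scheme).
    """
    values = np.asarray(values, dtype=float)
    interior_probs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    cuts = np.quantile(values, interior_probs)
    return np.digitize(values, cuts)


def indicator_matrix(labels, n_categories):
    """One-hot (complete disjunctive) coding of nominal labels."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(n_categories, dtype=int)[labels]


# Six made-up values split into three nominal categories.
labels = discretize_to_nominal([1, 2, 3, 4, 5, 6], n_bins=3)
Z = indicator_matrix(labels, 3)   # one row per observation, one column per category
```

Stacking such indicator blocks side by side, one per variable, gives the table that a standard MCA would analyse; the hierarchical-clustering alternative named in the abstract would replace `discretize_to_nominal` with cluster labels.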
THE MULTIVARIATE ANALYSIS RESEARCH GROUP Carles M Cuadras Departament d Estadística Facultat de Biologia Universitat de Barcelona The set of statistical methods known as Multivariate Analysis covers a
More informationKATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics
ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of
More informationData Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining
Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 8/05/2005 1 What is data exploration? A preliminary
More informationEM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
More informationSimple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
More informationDISCRIMINANT FUNCTION ANALYSIS (DA)
DISCRIMINANT FUNCTION ANALYSIS (DA) John Poulsen and Aaron French Key words: assumptions, further reading, computations, standardized coefficents, structure matrix, tests of signficance Introduction Discriminant
More informationMultivariate Statistical Inference and Applications
Multivariate Statistical Inference and Applications ALVIN C. RENCHER Department of Statistics Brigham Young University A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim
More informationCOM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3
COMP 5318 Data Exploration and Analysis Chapter 3 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping
More informationIntroduction to Regression and Data Analysis
Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it
More informationUsing Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. Philip Kostov and Seamus McErlean
Using Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. by Philip Kostov and Seamus McErlean Working Paper, Agricultural and Food Economics, Queen
More informationEXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models
More informationEasily Identify the Right Customers
PASW Direct Marketing 18 Specifications Easily Identify the Right Customers You want your marketing programs to be as profitable as possible, and gaining insight into the information contained in your
More informationHow to report the percentage of explained common variance in exploratory factor analysis
UNIVERSITAT ROVIRA I VIRGILI How to report the percentage of explained common variance in exploratory factor analysis Tarragona 2013 Please reference this document as: Lorenzo-Seva, U. (2013). How to report
More informationSAS Certificate Applied Statistics and SAS Programming
SAS Certificate Applied Statistics and SAS Programming SAS Certificate Applied Statistics and Advanced SAS Programming Brigham Young University Department of Statistics offers an Applied Statistics and
More informationMultivariate Analysis of Ecological Data
Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology
More informationCS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing
CS Master Level Courses and Areas The graduate courses offered may change over time, in response to new developments in computer science and the interests of faculty and students; the list of graduate
More informationClass #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
More informationAdditional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
More informationIBM SPSS Direct Marketing 20
IBM SPSS Direct Marketing 20 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This edition applies to IBM SPSS Statistics 20 and to
More informationHow To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
More informationHandling attrition and non-response in longitudinal data
Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationCustomer Analytics. Turn Big Data into Big Value
Turn Big Data into Big Value All Your Data Integrated in Just One Place BIRT Analytics lets you capture the value of Big Data that speeds right by most enterprises. It analyzes massive volumes of data
More informationElements of statistics (MATH0487-1)
Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium December 10, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -
More informationMATH2210 Notebook 1 Fall Semester 2016/2017. 1 MATH2210 Notebook 1 3. 1.1 Solving Systems of Linear Equations... 3
MATH0 Notebook Fall Semester 06/07 prepared by Professor Jenny Baglivo c Copyright 009 07 by Jenny A. Baglivo. All Rights Reserved. Contents MATH0 Notebook 3. Solving Systems of Linear Equations........................
More informationLinear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
More informationInstitute of Actuaries of India Subject CT3 Probability and Mathematical Statistics
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in
More informationMachine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
More informationCS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen
CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 3: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major
More informationData Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining
Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.
More informationIntroduction to Principal Components and FactorAnalysis
Introduction to Principal Components and FactorAnalysis Multivariate Analysis often starts out with data involving a substantial number of correlated variables. Principal Component Analysis (PCA) is a
More informationData Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland
Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data
More informationStatistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees
Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.
More informationBusiness Analytics using Data Mining Project Report. Optimizing Operation Room Utilization by Predicting Surgery Duration
Business Analytics using Data Mining Project Report Optimizing Operation Room Utilization by Predicting Surgery Duration Project Team 4 102034606 WU, CHOU-CHUN 103078508 CHEN, LI-CHAN 102077503 LI, DAI-SIN
More information2. Simple Linear Regression
Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according
More informationBig Data: a new era for Statistics
Big Data: a new era for Statistics Richard J. Samworth Abstract Richard Samworth (1996) is a Professor of Statistics in the University s Statistical Laboratory, and has been a Fellow of St John s since
More informationDimensionality Reduction: Principal Components Analysis
Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely
More informationCHAPTER 4 EXAMPLES: EXPLORATORY FACTOR ANALYSIS
Examples: Exploratory Factor Analysis CHAPTER 4 EXAMPLES: EXPLORATORY FACTOR ANALYSIS Exploratory factor analysis (EFA) is used to determine the number of continuous latent variables that are needed to
More informationStrategic Online Advertising: Modeling Internet User Behavior with
2 Strategic Online Advertising: Modeling Internet User Behavior with Patrick Johnston, Nicholas Kristoff, Heather McGinness, Phuong Vu, Nathaniel Wong, Jason Wright with William T. Scherer and Matthew
More informationVisualization of textual data: unfolding the Kohonen maps.
Visualization of textual data: unfolding the Kohonen maps. CNRS - GET - ENST 46 rue Barrault, 75013, Paris, France (e-mail: ludovic.lebart@enst.fr) Ludovic Lebart Abstract. The Kohonen self organizing
More informationSPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011
SPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011 Statistical techniques to be covered Explore relationships among variables Correlation Regression/Multiple regression Logistic regression Factor analysis
More informationImplications of Big Data for Statistics Instruction 17 Nov 2013
Implications of Big Data for Statistics Instruction 17 Nov 2013 Implications of Big Data for Statistics Instruction Mark L. Berenson Montclair State University MSMESB Mini Conference DSI Baltimore November
More informationSpatial Statistics Chapter 3 Basics of areal data and areal data modeling
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data
More informationMachine Learning for Data Science (CS4786) Lecture 1
Machine Learning for Data Science (CS4786) Lecture 1 Tu-Th 10:10 to 11:25 AM Hollister B14 Instructors : Lillian Lee and Karthik Sridharan ROUGH DETAILS ABOUT THE COURSE Diagnostic assignment 0 is out:
More informationMonitoring chemical processes for early fault detection using multivariate data analysis methods
Bring data to life Monitoring chemical processes for early fault detection using multivariate data analysis methods by Dr Frank Westad, Chief Scientific Officer, CAMO Software Makers of CAMO 02 Monitoring
More informationAcknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues
Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the
More informationHigh-Dimensional Data Visualization by PCA and LDA
High-Dimensional Data Visualization by PCA and LDA Chaur-Chin Chen Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan Abbie Hsu Institute of Information Systems & Applications,
More informationMS1b Statistical Data Mining
MS1b Statistical Data Mining Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Administrivia and Introduction Course Structure Syllabus Introduction to
More informationMachine Learning and Data Mining. Fundamentals, robotics, recognition
Machine Learning and Data Mining Fundamentals, robotics, recognition Machine Learning, Data Mining, Knowledge Discovery in Data Bases Their mutual relations Data Mining, Knowledge Discovery in Databases,
More information