Exploratory Data Analysis

Exploratory Data Analysis
4 March 2009
Research Methods for Empirical Computer Science, CMPSCI 691DD

Edwin Hubble

What did Hubble see?

Hubble's Law
V = H_0 r
Where:
- V = recessional velocity
- H_0 = Hubble constant
- r = distance (Mpc)
E. Hubble (1929). A relation between distance and radial velocity among extra-galactic nebulae. Proceedings of the National Academy of Sciences 15(3).
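As a rough illustration of how the constant in V = H_0 r might be estimated from paired distance and velocity measurements, here is a minimal sketch that fits a line through the origin by least squares. The data points are made up for illustration and are not Hubble's 1929 measurements; the function name `fit_through_origin` is hypothetical.

```python
def fit_through_origin(distances, velocities):
    """Least-squares slope for the no-intercept model v = h0 * r."""
    numerator = sum(r * v for r, v in zip(distances, velocities))
    denominator = sum(r * r for r in distances)
    return numerator / denominator


# Made-up illustrative values: distance in Mpc, velocity in km/s.
distances = [0.5, 1.0, 1.5, 2.0]
velocities = [250.0, 500.0, 750.0, 1000.0]

h0 = fit_through_origin(distances, velocities)
print(h0)  # estimated slope, in km/s per Mpc
```

With real measurements the points would scatter around the line, and the scatter itself is something to explore before trusting the fitted slope.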

Hubble's Law

"The tool that is so dull that you cannot cut yourself on it is not likely to be sharp enough to be either useful or helpful." (John W. Tukey)

Random variables
- The embarrassingly dogmatic misnomer
  - They are neither random, nor are they variables
- A random variable is...
  - a function that maps from instances to scales
  - the numeric result of a non-deterministic experiment
- They can be distinguished from fixed variables, whose value can be set or predetermined before the experiment
- They are not the individual values (e.g., 5.92), but rather the process of assigning value to instances or (colloquially) the set of values so assigned

Examples
- Recall of an IR system, given query, corpus, and designated relevant documents
- Size and speed of code produced by a compiler, given source and a target processor
- Number of database rows returned, given an anytime query processor, query, database, and time
- Lines of code written, given an assignment, language, development environment, and programmer
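To make the "function that maps from instances to scales" view concrete, the first example (recall of an IR system) can be sketched as follows. The query names, document identifiers, and the `recall` helper are all hypothetical and exist only for illustration.

```python
def recall(retrieved, relevant):
    """Fraction of the designated relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical instances: each query has a retrieved list and a relevant set.
instances = {
    "q1": (["d1", "d2", "d3"], {"d1", "d3", "d7"}),
    "q2": (["d4"], {"d4"}),
    "q3": (["d5", "d6"], {"d8", "d9"}),
}

# The random variable is the mapping itself; these are the values it assigns.
recall_values = {q: recall(ret, rel) for q, (ret, rel) in instances.items()}
print(recall_values)
```

The individual numbers are not the random variable; the procedure that assigns a number to each query is.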

Notes
- The objects of study are usually the systems that give rise to random variables (e.g., IR systems), rather than the instances on which the measurements are taken (e.g., queries).
- What we define as a random variable for a particular experiment can change as we discover deterministic and causal relationships in a given system.

Representation of data instances
- i.i.d. instances are commonly assumed
  - Independent: knowing something about one instance tells you nothing about another
  - Identically distributed: drawn from the same probability distribution
- Examples?
  - Queries in TREC data
  - Programs in SPEC benchmarks
  - Data sets in UCI repository
- Some alternatives
  - Time series (e.g., users submitting sets of slightly modified queries)
  - Relational (e.g., router performance embedded in a network)

Populations and samples
- A population is a specified set of instances
  - An actual finite set of instances (e.g., the UCI data sets for machine learning research)
  - A generalization of an actual finite set (e.g., the set of all data sets that might be produced by a particular simulator in infinite time)
  - A purely hypothetical set which can be described mathematically (e.g., the set of all correct Java programs)
- Samples are finite subsets of populations

Examples
- Populations
  - All possible IR queries
  - All possible programs written in Java
  - All Java programmers active in 2005
  - The SPECjvm98 benchmarks
- Actual data samples
  - The TREC 2005 HARD queries
  - The SPECjvm98 benchmarks
  - Students taking CMPSCI 320 in Fall 2005
  - A subset of the benchmarks

Four stages of defining a sample
- The target population (e.g., all computer programs)
- The sampling frame (e.g., all programs written in Java or C++)
- The selected sample (e.g., all programs written by CS undergraduate students in 200-level courses at UMass)
- The actual sample (e.g., all programs actually turned in)

Why is sampling difficult?

Sampling problems (figure): the target population, the sampling frame, the selected sample, the actual sample

Random sampling in CS
- Random sampling isn't easy in CS... but it's not easy in most sciences
- The answer isn't to give up, but to consider how to get closer to the ideal
  - Define the ideal population
  - Identify sources of bias in sampling and in subsequent steps of sample definition
  - Remove or mitigate as many sources of bias as possible
  - Modify your confidence in your ability to generalize based on your assessment of the match between your actual sample and your desired population
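The gap between a selected sample and an actual sample can be sketched in a few lines. This is only an illustration under stated assumptions: the sampling frame is a made-up list of 200 program names, the attrition pattern (every fourth selected program is never turned in) is invented, and `select_sample` is a hypothetical helper.

```python
import random


def select_sample(sampling_frame, k, seed=0):
    """Draw a simple random sample of size k without replacement."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return rng.sample(sampling_frame, k)


# Hypothetical sampling frame: identifiers for 200 programs.
frame = [f"program_{i:03d}" for i in range(200)]

selected = select_sample(frame, k=20, seed=42)

# Invented attrition: every fourth selected program is never turned in.
actual = [p for i, p in enumerate(selected) if i % 4 != 3]

response_rate = len(actual) / len(selected)
print(len(selected), len(actual), response_rate)
```

Random selection removes bias only at the step from frame to selected sample; whatever causes items to drop out afterwards still has to be identified and assessed separately.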

Types of scales
- Categorical, discrete, or nominal: values contain no ordering information (e.g., multiple access protocols for underwater networking)
- Ordinal: values indicate order, but no arithmetic operations are meaningful (e.g., "novice", "experienced", and "expert" as designations of programmers participating in an experiment)
- Interval: distances between values are meaningful, but the zero point is not (e.g., degrees Fahrenheit)
- Ratio: distances are meaningful and the zero point is meaningful (e.g., kelvins)
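The difference between interval and ratio scales can be checked numerically. The sketch below uses two arbitrary example temperatures; the helper name is hypothetical. Differences carry over between Fahrenheit and kelvins (up to a unit factor), but ratios only make sense on the scale with a meaningful zero.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert a Fahrenheit temperature to kelvins."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


low_f, high_f = 40.0, 80.0  # arbitrary example temperatures

# Ratio on an interval scale: numerically 2.0, but physically meaningless.
naive_ratio = high_f / low_f

# Ratio on a ratio scale: the meaningful comparison, only about 8% higher.
kelvin_ratio = fahrenheit_to_kelvin(high_f) / fahrenheit_to_kelvin(low_f)

# Distances are meaningful on both scales and differ only by the unit factor 5/9.
difference_f = high_f - low_f
difference_k = fahrenheit_to_kelvin(high_f) - fahrenheit_to_kelvin(low_f)

print(naive_ratio, kelvin_ratio, difference_f, difference_k)
```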

Data transformations
- Downgrading type (e.g., interval to ordinal)
- Shifting intervals
- Tukey's "ladder of powers": trans = original^(1 - b)
  - E.g.: b = -2 -> original^3, b = 0.5 -> sqrt(original), b = 2 -> 1/original
- Combining several variables
  - Normalize measurements (e.g., Simsek & Jensen 2005, normalized to optimal)
  - Remove unwanted factors (e.g., remove file read times from total compile times)
  - Consider relation of two variables (e.g., Kirkpatrick & Selman, vertex/edge ratio)
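A minimal sketch of the ladder-of-powers transformation, assuming the slide's formula reads trans = original^(1 - b), which is consistent with its three examples (b = -2 gives the cube, b = 0.5 the square root, b = 2 the reciprocal). Using the logarithm when the exponent is zero is a common convention for power ladders and is an assumption here, not something stated on the slide; `ladder_transform` is a hypothetical name.

```python
import math


def ladder_transform(values, b):
    """Apply trans = original ** (1 - b) to each value.

    When the exponent 1 - b is zero, fall back to the natural log,
    a common convention for power ladders (assumed, not from the slide).
    """
    power = 1 - b
    if power == 0:
        return [math.log(x) for x in values]
    return [x ** power for x in values]


data = [1.0, 4.0, 9.0]  # positive example values

cubed = ladder_transform(data, -2)       # exponent 3
roots = ladder_transform(data, 0.5)      # exponent 0.5
reciprocals = ladder_transform(data, 2)  # exponent -1

print(cubed, roots, reciprocals)
```

Moving up or down the ladder changes how strongly large values are stretched or compressed, which is why it is useful for straightening skewed data before plotting.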

Exploratory data analysis
- Exploratory data analysis (EDA) employs a variety of techniques to:
  - maximize insight into a data set;
  - uncover underlying structure;
  - extract important variables;
  - detect outliers and anomalies;
  - test underlying assumptions;
  - develop parsimonious models; and
  - determine optimal factor settings
- The EDA approach is precisely that: an approach, not a set of techniques, but an attitude/philosophy about how a data analysis should be carried out.
Source: http://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
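As one small example of "detect outliers and anomalies", the sketch below computes a five-number summary and flags points outside the 1.5 x IQR fences. The fence rule is a widely used convention and is an assumption here rather than something the slide prescribes; the data are made up, and the quartiles come from Python's `statistics.quantiles`, which may differ slightly from Tukey's hinges.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (conventional rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]


# Made-up measurements with one suspicious value.
data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 50]

summary = five_number_summary(data)
outliers = iqr_outliers(data)
print(summary, outliers)
```

A flagged point is a prompt for a closer look, not an instruction to delete it.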

Why EDA?
- Data analysis tools are typically used for
  - Hypothesis testing
  - Parameter estimation
- Graphics tools are typically used for presentation
- However, much of the quality of scientific work is determined by the quality of the hypotheses and models used by the researcher
- Can data analysis help suggest hypotheses?

Resources
- Books
  - Exploratory Data Analysis, Tukey (1977)
  - Data Analysis and Regression, Mosteller and Tukey (1977)
  - Interactive Data Analysis, Hoaglin (1977)
  - The ABC's of EDA, Velleman and Hoaglin (1981)
- Software
  - Data Desk (Data Description)
  - Fathom (Keypress)
  - XGobi (AT&T Research)

Demo