Today. COM 578 Empirical Methods in Machine Learning and Data Mining. Staff, Office Hours, Topics. Dull organizational stuff.

Transcription:

COM 578: Empirical Methods in Machine Learning and Data Mining
Rich Caruana
http://www.cs.cornell.edu/courses/cs578/2007fa

Today
  Dull organizational stuff
    Course Summary
    Grading
    Office hours
    Homework
    Final Project
  Fun stuff
    Historical Perspective on Statistics, Machine Learning, and Data Mining

Staff, Office Hours
  Rich Caruana, Upson Hall 4157, Tue 4:30-5:00pm, Wed 10:30-11:00am, caruana@cs.cornell.edu
  TA: Daria Sorokina, Upson Hall 5156, TBA, daria@cs.cornell.edu
  TA: Ainur Yessenalina, Upson Hall 4156, TBA, ainur@cs.cornell.edu
  TA: Alex Niculescu-Mizil, Upson Hall 5154, TBA, alexn@cs.cornell.edu
  Admin: Melissa Totman, Upson Hall 4147, M-F 9:00am-4:00pm

Topics (~30% overlap with CS478)
  Decision Trees
  K-Nearest Neighbor
  Artificial Neural Nets
  Support Vector Machines
  Association Rules
  Clustering
  Boosting/Bagging
  Cross Validation
  Performance Metrics
  Data Transformation
  Feature Selection
  Missing Values
  Case Studies:
    Medical prediction
    Protein folding
    Autonomous vehicle navigation

Grading
  4 credit course
  25% take-home mid-term (late October)
  25% open-book final (????)
  30% homework assignments (3 assignments)
  20% course project (teams of 1-4 people)
  late penalty: one letter grade per day
  90-100 = A-, A, A+
  80-90 = B-, B, B+
  70-80 = C-, C, C+

Homeworks
  short programming and experiment assignments
    e.g., implement backprop and test on a dataset
  goal: get familiar with a variety of learning methods
  two or more weeks to complete each assignment
  C, C++, Java, Perl, shell scripts, or Matlab
  must be done individually
  hand in code with summary and analysis of results
  emphasis on understanding and analysis of results, not generating a pretty report
  short course in Unix and writing shell scripts

Project
  Data Mining Mini Competition
  Train best model on problem(s) we give you
    decision trees
    k-nearest neighbor
    artificial neural nets
    SVMs
    bagging, boosting, model averaging,...
  Given train and test sets
    Have target values on train set
    No target values on test set
    Send us predictions and we calculate performance
    Performance on test sets is part of project grade
  Due before exams & study period

Text Books
  Required Text:
    Machine Learning by Tom Mitchell
  Optional Texts:
    Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani, and Friedman
    Pattern Classification, 2nd ed., by Richard Duda, Peter Hart, & David Stork
    Pattern Recognition and Machine Learning by Chris Bishop
    Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber
  Selected papers
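The Project slide describes a protocol: targets are given for the train set, withheld for the test set, and predictions are sent in to be scored. A minimal sketch of that flow follows. The slides do not say which performance measure is used, so plain accuracy is an assumption here, and the majority-class "model", the function names, and the toy data are all hypothetical placeholders.

```python
from collections import Counter

def train_majority_baseline(train_targets):
    """Return the most common target value seen in the training set."""
    return Counter(train_targets).most_common(1)[0][0]

def accuracy(predictions, hidden_targets):
    """Fraction of test cases where the prediction equals the hidden target."""
    if len(predictions) != len(hidden_targets):
        raise ValueError("need exactly one prediction per test case")
    correct = sum(p == t for p, t in zip(predictions, hidden_targets))
    return correct / len(hidden_targets)

# Student side: targets are available for the training set only.
train_targets = [1, 0, 1, 1, 0, 1]
test_inputs = [[0.2], [0.9], [0.4], [0.7]]  # no targets supplied
label = train_majority_baseline(train_targets)
predictions = [label for _ in test_inputs]

# Staff side: the hidden test targets are used only for scoring.
# Accuracy is an assumed metric; the slides do not name one.
hidden_targets = [1, 0, 1, 1]
print(accuracy(predictions, hidden_targets))
```

Any of the listed methods (decision trees, k-nearest neighbor, neural nets, SVMs, ensembles) would take the place of the baseline; the scoring step stays the same.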

Fun Stuff

Statistics, Machine Learning, and Data Mining
  Past, Present, and Future

Once upon a time...

before statistics

Pre-Statistics: Ptolemy-1850
  First Data Sets created
    Positions of Mars in orbit: Tycho Brahe (1546-1601)
    Star catalogs
      Tycho catalog had 777 stars with 1-2 arcmin precision
      Messier catalog (100+ dim fuzzies that look like comets)
    Triangulation of meridian in France
  Not just raw data - processing is part of data
    Tychonic System: anti-Copernican, many epicycles
  No theory of errors - human judgment
    Kepler knew Tycho's data was never in error by 8 arcmin
  Few models of data - just learning about modeling
    Kepler's Breakthrough: Copernican model and 3 laws of orbits

Pre-Statistics: 1790-1850
  The Metric System: uniform system of weights and measures
  Meridian from Dunkirk to Barcelona through Paris
    triangulation
  Meter = Distance (pole to equator)/10,000,000
  Most accurate survey made at that time
  1000s of measurements spanning 10-20 years!
  Data is available in a 3-volume book that analyses it
  No theory of error: surveyors use judgment to correct data for better consistency and accuracy!

Statistics: 1850-1950
  Data collection starts to separate from analysis
  Hand-collected data sets
    Physics, Astronomy, Agriculture,...
    Quality control in manufacturing
    Many hours to collect/process each data point
  Usually Small: 1 to 1000 data points
  Low dimension: 1 to 10 variables
  Exist only on paper (sometimes in text books)
  Experts get to know data inside out
  Data is clean: human has looked at each point

Statistics: 1850-1950
  Calculations done manually
    manual decision making during analysis
    Mendel's genetics
    human calculator pools for larger problems
  Simplified models of data to ease computation
    Gaussian, Poisson, ...
    Keep computations tractable
  Get the most out of precious data
    careful examination of assumptions
    outliers examined individually

Statistics: 1850-1950
  Analysis of errors in measurements
  What is most efficient estimator of some value?
  How much error in that estimate?
  Hypothesis testing:
    is this mean larger than that mean?
    are these two populations different?
  Regression:
    what is the value of y when x = x_i or x = x_j?
  How often does some event occur?
    p(fail(part_1)) = p_1; p(fail(part_2)) = p_2; p(crash(plane)) = ?

Statistics would look very different if it had been born after the computer instead of 100 years before the computer

Statistics meets Computers
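Two of the classical questions above can be made concrete with a short sketch. It assumes the value being estimated is a mean (so the estimator is the sample mean and its error is the standard error), and it assumes, for illustration only, that parts fail independently and that any single part failure crashes the plane; the slide leaves p(crash(plane)) open and does not state those assumptions. Function names and sample numbers are hypothetical.

```python
import math

def mean_and_standard_error(samples):
    """Sample mean and its standard error (sample std dev / sqrt(n))."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate the error")
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, math.sqrt(variance / n)

def prob_any_failure(part_failure_probs):
    """P(at least one part fails), ASSUMING parts fail independently."""
    p_all_ok = 1.0
    for p in part_failure_probs:
        p_all_ok *= 1.0 - p
    return 1.0 - p_all_ok

# Five hand-made measurements of the same quantity (illustrative values).
mean, stderr = mean_and_standard_error([9.8, 10.1, 10.0, 10.3, 9.8])
print(mean, stderr)

# Two parts with failure probabilities p_1 = 0.01 and p_2 = 0.02.
print(prob_any_failure([0.01, 0.02]))
```

With independent parts the combined probability is 1 - (1 - p_1)(1 - p_2); correlated failures or redundant parts would need a different model.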

Machine Learning: 1950-2000...
  Medium size data sets become available
    100 to 100,000 records
    Higher dimension: 5 to 250 dimensions (more if vision)
    Fit in memory
  Exist in computer, usually not on paper
  Too large for humans to read and fully understand
  Data not clean
    Missing values, errors, outliers, ...
    Many attribute types: boolean, continuous, nominal, discrete, ordinal
    Humans can't afford to understand/fix each point

Machine Learning: 1950-2000...
  Computers can do very complex calculations on medium size data sets
  Models can be much more complex than before
  Empirical evaluation methods instead of theory
    don't calculate expected error, measure it from sample
    cross validation
    e.g., 95% confidence interval from data, not Gaussian model
  Fewer statistical assumptions about data
  Make machine learning as automatic as possible
  Don't know right model => OK to have multiple models (vote them)

Machine Learning: 1950-2000...
  Regression
  Multivariate Adaptive Regression Splines (MARS)
  Linear perceptron
  Artificial neural nets
  Decision trees
  K-nearest neighbor
  Support Vector Machines (SVMs)
  Ensemble Methods: Bagging and Boosting
  Clustering

ML: Pneumonia Risk Prediction
  [Figure: inputs labeled Pre-Hospital Attributes and In-Hospital Attributes (Age, Gender, Blood Pressure, Chest X-Ray, Albumin, Blood pO2, White Count, RBC Count); output: Pneumonia Risk]
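The "don't calculate expected error, measure it from sample" point can be sketched with k-fold cross validation. The example below uses a 1-nearest-neighbor classifier as the model because k-nearest neighbor appears in the method list; the fold assignment (index modulo k), the function names, and the toy data are assumptions made for illustration, not the course's procedure.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Label of the training point closest to query (squared Euclidean distance)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best]

def cross_validated_accuracy(xs, ys, k):
    """Average held-out accuracy over k folds; fold f holds out indices f, f+k, ..."""
    if not 2 <= k <= len(xs):
        raise ValueError("k must be between 2 and the number of examples")
    fold_scores = []
    for fold in range(k):
        test_idx = [i for i in range(len(xs)) if i % k == fold]
        train_idx = [i for i in range(len(xs)) if i % k != fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        # Each held-out point is predicted by a model that never saw it.
        correct = sum(
            nearest_neighbor_predict(train_x, train_y, xs[i]) == ys[i]
            for i in test_idx
        )
        fold_scores.append(correct / len(test_idx))
    return sum(fold_scores) / k

# Toy one-dimensional data: two well-separated classes.
xs = [[0.0], [0.1], [0.2], [0.3], [1.0], [1.1], [1.2], [1.3]]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
print(cross_validated_accuracy(xs, ys, k=4))
```

The spread of the per-fold scores gives a rough, data-driven sense of uncertainty without assuming a Gaussian error model.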

ML: Autonomous Vehicle Navigation
  [Figure: output labeled Steering Direction]

Can't yet buy cars that drive themselves, and few hospitals use artificial neural nets yet to make critical decisions about patients.

Machine Learning: 1950-2000...
  New Problems:
    Can't understand many of the models
    Less opportunity for human expertise in process
    Good performance in lab doesn't necessarily mean good performance in practice
    Brittle systems, work well on typical cases but often break on rare cases
    Can't handle heterogeneous data sources

Machine Learning Leaves the Lab

Computers get Bigger/Faster but Data gets Bigger/Faster, too

Data Mining: 1995-20??
  Huge data sets collected fully automatically
    large scale science: genomics, space probes, satellites
    Cornell's Arecibo Radio Telescope Project:
      terabytes per day
      petabytes over life of project
      too much data to move over internet -- they use FedEx!

Protein Folding

Data Mining: 1995-20??
  Huge data sets collected fully automatically
    large scale science: genomics, space probes, satellites
    consumer purchase data
    web: > 500,000,000 pages of text
    clickstream data (Yahoo!: terabytes per day!)
    many heterogeneous data sources
  High dimensional data
    low of 45 attributes in astronomy
    100s to 1000s of attributes common
    linkage makes many 1000s of attributes possible

Data Mining: 1995-20??
  Data exists only on disk (can't fit in memory)
  Experts can't see even modest samples of data
  Calculations done completely automatically
    large computers
    efficient (often simplified) algorithms
    human intervention difficult
  Models of data
    complex models possible
    but complex models may not be affordable (Google)
  Get something useful out of massive, opaque data
    data tombs

Data Mining: 1990-20??
  What customers will respond best to this coupon?
  Who is it safe to give a loan to?
  What products do consumers purchase in sets?
  What is the best pricing strategy for products?
  Are there unusual stars/galaxies in this data?
  Do patients with gene X respond to treatment Y?
  What job posting best matches this employee?
  How do proteins fold?

Data Mining: 1995-20??
  New Problems:
    Data too big
    Algorithms must be simplified and very efficient (linear in size of data if possible, one scan is best!)
    Reams of output too large for humans to comprehend
    Very messy uncleaned data
    Garbage in, garbage out
    Heterogeneous data sources
    Ill-posed questions
    Privacy

Statistics, Machine Learning, and Data Mining
  Historic revolution and refocusing of statistics
  Statistics, Machine Learning, and Data Mining merging into a new multi-faceted field
  Old lessons and methods still apply, but are used in new ways to do new things
  Those who don't learn the past will be forced to reinvent it
  => Computational Statistics, ML, DM, ...

Change in Scientific Methodology
  Traditional:
    Formulate hypothesis
    Design experiment
    Collect data
    Analyze results
    Review hypothesis
    Repeat/Publish
  New:
    Design large experiment
    Collect large data
    Put data in large database
    Formulate hypothesis
    Evaluate hypothesis on database
    Run limited experiments to drive nail in coffin
    Review hypothesis
    Repeat/Publish
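The "linear in size of data if possible, one scan is best" requirement under New Problems can be illustrated with a single-pass computation of mean and variance. The sketch below uses Welford's online update, which is a standard technique but is not named in the slides; the function name and the sample values are hypothetical.

```python
def one_pass_mean_variance(stream):
    """Mean and sample variance from one scan, keeping O(1) state (Welford's update)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count < 2:
        raise ValueError("need at least two values for a sample variance")
    return count, mean, m2 / (count - 1)

def records():
    """Stand-in for rows read sequentially from disk; each value is seen once."""
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield value

print(one_pass_mean_variance(records()))
```

Because the function only iterates over its input, the same code works on a generator that reads a file too large to fit in memory.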

ML/DM Here to Stay
  Will infiltrate all areas of science, engineering, public policy, marketing, economics, ...
  Adaptive methods as part of engineering process
    Engineering from simulation
    Wright brothers on steroids!
  But we can't manually verify models are right!
  Can we trust results of automatic learning/mining?