Introduction to Big Data with Apache Spark UC BERKELEY

Similar documents

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Azure Machine Learning, SQL Data Mining and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Data, Measurements, Features

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

An Introduction to Data Mining

HT2015: SC4 Statistical Data Mining and Machine Learning

Big Data Analytics and Optimization

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Big Data Analytics and Optimization

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

Machine Learning for Data Science (CS4786) Lecture 1

Analytics on Big Data

Multivariate Normal Distribution

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Social Media Mining. Data Mining Essentials

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Machine Learning using MapReduce

Microsoft Azure Machine learning Algorithms

MS1b Statistical Data Mining

2015 Workshops for Professors

Statistics Graduate Courses

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Statistics for BIG data

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Predictive Modeling Techniques in Insurance

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

New Work Item for ISO Predictive Analytics (Initial Notes and Thoughts) Introduction

Learning outcomes. Knowledge and understanding. Competence and skills

Fairfield Public Schools

Lecture 1: Review and Exploratory Data Analysis (EDA)

STAT 360 Probability and Statistics. Fall 2012

Data Exploration Data Visualization

Using Data Mining for Mobile Communication Clustering and Characterization

Our Raison d'être. Identify major choice decision points. Leverage Analytical Tools and Techniques to solve problems hindering these decision points

Bayesian networks - Time-series models - Apache Spark & Scala

Exploratory Data Analysis

Environmental Remote Sensing GEOG 2021

Server Load Prediction

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Big Data and Marketing

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.

Stat 20: Intro to Probability and Statistics

Scalable Machine Learning - or what to do with all that Big Data infrastructure

Measures of Central Tendency and Variability: Summarizing your Data for Others

Statistics Review PSY379

ANALYTICS CENTER LEARNING PROGRAM

How To Understand The Theory Of Probability

Descriptive Statistics

Machine Learning Big Data using Map Reduce

Machine Learning with MATLAB David Willingham Application Engineer

Visualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

Data Mining and Visualization

DATA ANALYSIS. QEM Network HBCU-UP Fundamentals of Education Research Workshop Gerunda B. Hughes, Ph.D. Howard University

STATISTICAL ANALYSIS WITH EXCEL COURSE OUTLINE

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

Street Address: 1111 Franklin Street Oakland, CA Mailing Address: 1111 Franklin Street Oakland, CA 94607

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

Introduction to Data Science: CptS Syllabus First Offering: Fall 2015

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

MLlib: Scalable Machine Learning on Spark

Statistical Machine Learning

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

Big Data Analytics CSCI 4030

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Normality Testing in Excel

Machine learning for algo trading

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

CS Data Science and Visualization Spring 2016

2013 MBA Jump Start Program. Statistics Module Part 3

MATH BOOK OF PROBLEMS SERIES. New from Pearson Custom Publishing!

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Advanced In-Database Analytics

Maschinelles Lernen mit MATLAB

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre

Data Mining with SQL Server Data Tools

The Data Mining Process

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

SEIZE THE DATA SEIZE THE DATA. 2015

Analysis Tools and Libraries for BigData

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

Additional sources Compilation of sources:

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Transcription:

Introduction to Big Data with Apache Spark UC BERKELEY

This Lecture Exploratory Data Analysis Some Important Distributions Spark mllib Machine Learning Library

Descriptive vs. Inferential Statistics Descriptive:» E.g., Median describes data but can t be generalized beyond that» We will talk about Exploratory Data Analysis in this lecture Inferential:» E.g., t-test enables inferences about population beyond our data» Techniques leveraged for Machine Learning and Prediction

Examples of Business Questions Simple (descriptive) Stats» Who are the most profitable customers? Hypothesis Testing» Is there a difference in value to the company of these customers? Segmentation/Classification» What are the common characteristics of these customers? Prediction» Will this new customer become a profitable customer?» If so, how profitable? adapted from Provost and Fawcett, Data Science for Business

Applying Techniques Most business questions are causal» What would happen if I show this ad? Easier to ask correlational questions» What happened in this past when I showed this ad? Supervised Learning: Classification and Regression Unsupervised Learning: Clustering and Dimension reduction Note: UL often used inside a larger SL problem» E.g., auto-encoders for image recognition neural nets

Supervised Learning:» knn (k Nearest Neighbors)» Naive Bayes» Logistic Regression» Support Vector Machines» Random Forests Unsupervised Learning:» Clustering» Factor Analysis» Latent Dirichlet Allocation Learning Techniques

Exploratory Data Analysis (1977) Based on insights developed at Bell Labs in 1960 s Techniques for visualizing and summarizing data What can the data tell us? (vs confirmatory data analysis) Introduced many basic techniques:» 5-number summary, box plots, stem and leaf diagrams, 5-Number summary:» Extremes (min and max)» Median & Quartiles» More robust to skewed and long-tailed distributions

The Trouble with Summary Stats Property in each set Value Mean of x 9 Sample variance of x 11 Mean of y 7.50 Sample variance of y 4.122 Linear Regression y = 3 + 0.5x Anscombe's Quartet 1973

Looking at The Data

Looking at The Data Takeaways: Important to look at data graphically before analyzing it Basic statistics properties often fail to capture real-world complexities

Data Presentation Data Art Visualizing Friendships https://www.facebook.com/note.php?note_id=469716398919

The R Language Evolution of the S language developed at Bell labs for EDA Idea: allow interactive exploration and visualization of data Preferred language for statisticians, used by many data scientists Features:» The most comprehensive collection of statistical models and distributions» CRAN: large resource of open source statistical models Jeff Hammerbacher 2012 course at UC Berkeley

Normal Distributions, Mean, Variance The mean of a set of values is the average of the values Variance is a measure of the width of a distribution The standard deviation is the square root of variance A normal distribution is characterized by mean and variance mean Standard deviation

Central Limit Theorem The distribution of sum (or mean) of n identically-distributed random variables X i approaches a normal distribution as n Common parametric statistical tests (t-test & ANOVA) assume normally-distributed data, but depend on sample mean and variance Tests work reasonably well for data that are not normally distributed as long as the samples are not too small

Correcting Distributions Many statistical tools (mean, variance, t-test, ANOVA) assume data are normally distributed Very often this is not true examine the histogram

Other Important Distributions Poisson: distribution of counts that occur at a certain rate» Observed frequency of a given term in a corpus» Number of visits to web site in a fixed time interval» Number of web site clicks in an hour Exponential: interval between two such events

Other Important Distributions Zipf/Pareto/Yule distributions:» Govern frequencies of different terms in a document, or web site visits Binomial/Multinomial:» Number of counts of events» Example: 6 die tosses out of n trials Understand your data s distribution before applying any model

Autonomy Corp Rhine Paradox* Joseph Rhine was a parapsychologist in the 1950 s» Experiment: subjects guess whether 10 hidden cards were red or blue He found that about 1 person in 1,000 had Extra Sensory Perception!» They could correctly guess the color of all 10 cards *Example from Jeff Ullman/Anand Rajaraman

Autonomy Corp Rhine Paradox Called back psychic subjects and had them repeat test» They all failed Concluded that act of telling psychics that they have psychic abilities causes them to lose it (!) Q: What s wrong with his conclusion?

Autonomy Corp Rhine s Error What s wrong with his conclusion? 2 10 = 1,024 combinations of red and blue of length 10 0.98 probability at least 1subject in 1,000 will guess correctly

Spark s Machine Learning Toolkit mllib: scalable, distributed machine learning library» Scikit-learn like ML toolkit, Interoperates with NumPy Classification:» SVM, Logistic Regression, Decision Trees, Naive Bayes, Regression: Linear, Lasso, Ridge, Miscellaneous:» Alternating Least Squares, K-Means, SVD» Optimization primitives (SGD, L-BGFS)»

Lab: Collaborative Filtering Goal: predict users movie ratings based on past ratings of other movies Ratings = Movies 1?? 4 5? 3?? 3 5?? 3 5? 5??? 1 4???? 2? Users

Model and Algorithm Model Ratings as product of User (A) and Movie Feature (B) matrices of size U K and M K K: rank R = A B T Learn K factors for each user Learn K factors for each movie

Model and Algorithm Model Ratings as product of User (A) and Movie Feature (B) matrices of size U K and M K Alternating Least Squares (ALS)» Start with random A and B vectors» Optimize user vectors (A) based on movies» Optimize movie vectors (B) based on users» Repeat until converged R = A B T

Learn More about Spark and ML Scalable ML BerkeleyX MOOC» Starts June 29, 2015