Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD
Presenter
Proven Performance Since 1995. TDWI helps business and IT professionals gain insight about data warehousing, BI, and analytics: high-quality, vendor-neutral educational offerings; independent analyst research staff and thought leadership; trusted sources of emerging information and trends; ability to bring together qualified BI/DW professionals and solution providers. www.tdwi.org. Premium Membership, conferences, seminars, research, publications, topical portals, whitepaper library, and numerous online programs.
Agenda: Introduction to big data and predictive analytics; Popular predictive analytics methodologies; Examples; Guidelines; Deployment models
Big Data: M2M/IoT, Volume, Mobile/Location, Social, Text, Formats
A Confusion of Words
Big Data Analytics: text analytics, stream mining, predictive analytics, link analysis, slicing and dicing, visual discovery, etc.
Predictive Analytics A statistical or data mining solution consisting of algorithms and techniques that can be used on both structured and unstructured data to determine outcomes
A Lot of It Is Used to Predict Behavior. People: churn, marketing, fraud detection. Machine: operations, maintenance. And much, much more! Good source for use cases
Of Course, It Isn't Just About Modeling: CRISP Lifecycle
A Vast Array of Techniques Source: TDWI BPR on Predictive Analytics, 2014; n=242
Supervised: Use it when you know the outcomes of interest (leave vs. stay, revenue prediction). Need enough data for training, testing, and validation.
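The need for separate training, validation, and testing data can be sketched as a simple three-way split. This is a minimal illustration, not a method from the presentation: the function name `train_val_test_split`, the 60/20/20 proportions, and the toy customer records are all assumptions made for the example.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and cut them into training, validation, and test partitions."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must leave room for training data")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical labeled records: (customer_id, known outcome)
customers = [(i, "leave" if i % 4 == 0 else "stay") for i in range(100)]
train, val, test = train_val_test_split(customers)
print(len(train), len(val), len(test))
```

The model is fit on `train`, tuned against `val`, and scored once on `test`, so the final score reflects data the model has not seen.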
Unsupervised: Does not include target information. Looks for commonalities/hidden structures in data. May not produce useful insight. Is it prediction?
Techniques. Supervised: classification, regression, neural networks. Unsupervised: clustering, association. Supervised: deep learning, auto-encoders; decision trees, random forests, gradient boosting; support vector machines, Bayesian classifiers, principal component, discriminant analysis. Unsupervised: nearest-neighbor mapping, k-means clustering, self-organizing maps; factor analysis, link analysis.
Decision Trees Good for classification and prediction with known, discrete outcomes
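As a rough sketch of how a decision tree learns rules for a discrete outcome, the code below grows a small tree by repeatedly choosing the threshold split with the lowest Gini impurity. The Gini criterion, the function names, and the two toy features (months as a customer, support calls) are assumptions for illustration; the presentation does not name a specific tree algorithm.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature_index, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """A node is (feature, threshold, left_subtree, right_subtree); a leaf is a label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical training data: (months_as_customer, support_calls) -> outcome
rows = [(2, 5), (3, 4), (24, 1), (36, 0), (5, 6), (30, 2)]
labels = ["leave", "leave", "stay", "stay", "leave", "stay"]
tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (4, 3)), predict(tree, (28, 1)))
```

The learned tree reads as a rule ("if months as a customer is at most 5, predict leave"), which is what makes this family of models easy to explain.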
Linear Regression Used to predict a continuous variable from independent variables
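With a single independent variable, the fitted line can be computed directly with ordinary least squares. The sketch below assumes made-up data (store visits versus revenue) and a hypothetical helper named `fit_line`; it is meant only to show the arithmetic behind the prediction.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: visits per month (independent) and revenue (continuous target)
visits = [1, 2, 3, 4, 5]
revenue = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(visits, revenue)
predicted = slope * 6 + intercept  # revenue predicted for a customer with 6 visits
print(slope, intercept, predicted)
```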
Artificial Neural Networks (1) Biological to Mathematical Source: agh.edu
Artificial Neural Networks (2) Can be used on a range of problems; good for classification and estimation Source: Commonsenseatheism.com
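The step from a biological neuron to a mathematical one can be sketched as a weighted sum of inputs passed through an activation function. In the example below the sigmoid activation and all weights are assumptions chosen by hand, not trained values; wiring three such neurons together is enough to classify XOR, a pattern no single straight line can separate.

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def tiny_network(x1, x2):
    """Two hidden neurons feeding one output neuron (weights picked by hand)."""
    h1 = neuron([x1, x2], [20.0, 20.0], -10.0)     # behaves roughly like OR
    h2 = neuron([x1, x2], [-20.0, -20.0], 30.0)    # behaves roughly like NAND
    return neuron([h1, h2], [20.0, 20.0], -30.0)   # behaves roughly like AND

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(tiny_network(a, b)))
```

In practice the weights are learned from training data rather than set by hand, which is where the long training times and hard-to-explain outputs come from.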
Clustering Used to group observations by perceived similarity Source: Babelomics
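One common way to group observations by similarity is k-means, sketched below in plain Python. The initialization (taking the first k points as starting centroids), the fixed iteration count, and the sample points are simplifying assumptions for the example; production implementations usually choose starting centroids more carefully.

```python
def kmeans(points, k, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then recompute means."""
    centroids = list(points[:k])  # naive initialization, assumed for this sketch
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical observations with two numeric attributes each
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(clusters[0])
print(clusters[1])
```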
Association Rule Mining
Transaction  Items
1            milk, lettuce
2            lettuce, diapers, beer, cookies
3            milk, diapers, beer, plastic bags
4            lettuce, milk, diapers, beer
5            lettuce, milk, diapers, plastic bags
Diapers -> Beer
Used to find relationships. Two concepts: support and confidence
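Support and confidence for the Diapers -> Beer rule can be computed directly from the five transactions on the slide. The sketch uses the standard definitions (support is the share of transactions containing every item in the rule; confidence is the share of transactions containing the antecedent that also contain the consequent); the function names are illustrative.

```python
transactions = [
    {"milk", "lettuce"},
    {"lettuce", "diapers", "beer", "cookies"},
    {"milk", "diapers", "beer", "plastic bags"},
    {"lettuce", "milk", "diapers", "beer"},
    {"lettuce", "milk", "diapers", "plastic bags"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent):
    """Share of all transactions that contain both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Share of antecedent transactions that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

print(support({"diapers"}, {"beer"}))     # 3 of 5 transactions -> 0.6
print(confidence({"diapers"}, {"beer"}))  # 3 of 4 diaper transactions -> 0.75
```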
Quick Quiz: How much revenue will this customer bring? Regression. Who is going to take a certain action? Classification. What are my customer segments? Clustering. If a customer buys X, what else might it buy? Association rules.
Strengths & Weaknesses: Decision Trees. Strengths: easy to understand; rules vs. equations; easy to explain; not a black box; data doesn't have to follow any distribution; can handle interactions between variables. Weaknesses: continuous value predictions; can be computationally expensive to train; can have problems if many classes and few training examples; overfitting.
Strengths & Weaknesses: Regression. Strengths: simple to use; easy to explain through independent variables. Weaknesses: relationship needs to be linear; hard to handle categorical variables or variables that interact; outliers are hard to model.
Strengths & Weaknesses: Neural Networks. Strengths: good for a specific class of problems; may be easy to implement; non-linear/interaction variables. Weaknesses: hard-to-explain output (black box); output might be unpredictable; training can take a long time.
Strengths & Weaknesses: K Means. Strengths: good for large datasets; simple; efficient. Weaknesses: need to specify K upfront; sensitive to outliers, which may result in incorrect cluster boundaries; needs a mean (categorical data?).
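The outlier sensitivity comes from the mean itself, since each k-means centroid is the mean of its members. A small sketch with made-up values shows how one extreme observation drags the mean while the median barely moves; the numbers are purely illustrative.

```python
import statistics

typical = [1.0, 2.0, 3.0]
with_outlier = typical + [100.0]  # one extreme observation joins the cluster

# The mean (the k-means centroid) jumps far from the bulk of the data
print(statistics.mean(typical), statistics.mean(with_outlier))

# The median stays close to the typical values
print(statistics.median(typical), statistics.median(with_outlier))
```

The same dependence on a mean is why purely categorical attributes are awkward for k-means: there is no meaningful average of values such as product categories.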
Strengths & Weaknesses: Association Rules. Strengths: simple; text data (categorical). Weaknesses: can be computationally expensive; potential for spurious patterns; rules do not mean causality.
Ensemble Modeling Multiple models are combined to solve a problem
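One simple way to combine multiple models is majority voting, sketched below. The three rule-based "models", their thresholds, and the customer fields are hypothetical stand-ins for trained classifiers; other ensemble schemes (averaging, bagging, boosting) combine models differently.

```python
from collections import Counter

# Three hypothetical churn models, each looking at a different signal
def by_tenure(customer):
    return "leave" if customer["months"] < 6 else "stay"

def by_support_calls(customer):
    return "leave" if customer["support_calls"] >= 4 else "stay"

def by_usage(customer):
    return "leave" if customer["logins_per_week"] < 1 else "stay"

def majority_vote(models, customer):
    """Return the prediction that the largest number of models agree on."""
    votes = Counter(model(customer) for model in models)
    return votes.most_common(1)[0][0]

models = [by_tenure, by_support_calls, by_usage]
customer = {"months": 3, "support_calls": 5, "logins_per_week": 4}
print(majority_vote(models, customer))  # two of the three models vote "leave"
```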
Vendors Are Offering a Range of Options for Predictive Analytics: UI easier to use (visual vs. code based); automation; collaboration/interactivity; cloud options; operationalizing and embedding advanced analytics
Operationalizing: An example
TDWI Big Data Maturity Model
QUESTIONS?