An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015




An Introduction to Data Mining for Wind Power Management
Spring 2015

Big Data World
Every minute:
- Google receives over 4 million search queries
- Facebook users share almost 2.5 million pieces of content
- Instagram users post almost 220,000 photos
- 72 new hours of video are uploaded to YouTube
- Twitter users tweet 300,000 times
Every day:
- We create over 2.5 quintillion bytes of data ($2.5 \times 10^{18}$ bytes)
90% of the data in the world today was created in the last two years.

What is Data Mining?
The process of discovering patterns in data. (Witten et al.)
Data mining provides the capabilities to:
- predict the outcome of a future observation
- uncover relationships between data attributes (features)
- uncover relationships between observations
- learn how to best react to situations through trial and error (reinforcement learning)

Related Fields and Disciplines
- Machine Learning
- Computational Intelligence
- Big Data
- Knowledge Discovery
- Artificial Intelligence
- Data Analytics
- Computer Science
- Engineering
- Statistics
- Mathematics
- Data Visualization
- High Performance Computing

Applications of Data Mining
- Medical Diagnostics
- Speech Recognition
- Market Prediction
- Sports
- Fraud Detection
- Online Dating
- Credit Worthiness
- Surveillance
- Cosmology
- Law Enforcement/NSA
- DNA Sequence Mapping
- Equipment Condition Monitoring
- Pharmaceutical Research
- Sales Prediction
- Product Recommendation
- Image Recognition

Applications in Wind Industry
Wind Power Forecasting
- Long term forecasts for planning
- Medium and short term forecasts for power generation commitment
- Time series, neural networks, fuzzy intelligent systems

Applications in Wind Industry
Wind Power Firming
- Securing availability of the wind power source at a defined output level and time duration
- Uses an energy storage system to capture excess energy to release later, or
- Uses a gas system that can be deployed quickly
- Avoids rapid voltage and power swings on the grid

Types of Learning
Supervised Learning
- Learns from labeled data
- Regression
- Classification
Unsupervised Learning
- Finds groups or structure in unlabeled data
- Clustering
- Dimensionality reduction
Reinforcement Learning
- Learns the best reaction through trial and error

Common Algorithms
- Neural Networks
- Decision Trees
- K-Nearest Neighbor
- K-Means Clustering
- Support Vector Machines
- Extreme Learning Machines
- Naïve Bayes
- Logistic Regression
- Principal Component Analysis

K-Nearest Neighbor
- Type of supervised learning
- Can be used for classification or regression
- One parameter to tune (k)
- Parameter k is usually an odd number, which avoids tied votes between two classes
- Can suffer from the curse of dimensionality in high dimensional space

Distance
- Must define what is meant by "nearest"
- The distance should be large for dissimilar objects
- Objects to compare may be:
  - Numbers (real numbers, binary numbers)
  - Strings
  - Images
  - Videos
  - Documents
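The K-Nearest Neighbor idea above can be sketched in a few lines. This is only an illustration, not the method used in the slides: the function name `knn_predict`, the toy data, and the choice of Euclidean distance with a simple majority vote are all assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Classify `query` by majority vote among its k nearest training points.

    Hypothetical helper: the whole training set is scanned on every call,
    because for k-nearest neighbor the stored data set is the model.
    """
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up training data: two well separated groups.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

near_a = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
near_b = knn_predict(train_X, train_y, (5.0, 5.1), k=3)
```

Passing a different `distance` function changes what "nearest" means without touching the voting logic, which is why the distance and k are the two choices to tune.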

Properties of a Distance Metric
- Non-negativity: $d(x, y) \ge 0$
- Isolation: $d(x, y) = 0$ if and only if $x = y$
- Symmetry: $d(x, y) = d(y, x)$
- Triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$

Common Distances
- Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- Manhattan distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
- Minkowski distance: $d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
- Hamming distance: number of positions at which two strings differ (if the strings are of equal length)
- String edit distance

Skin Detection Example
- Objective: predict skin/nonskin for each pixel in the testing set using a training set of known classes

Data Features?
- A skin pixel is adjacent to other skin pixels
- A surrounding 7x7 grid is considered for each pixel
- The RGB color values for each pixel in the grid are assembled as a feature set
- 7x7x3 = 147 total features for each pixel
- The pairwise Euclidean distance is used to determine the k-nearest neighbors
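The distances listed above can be written directly from their definitions. The following is a minimal sketch; the function names are chosen for this example only, and Manhattan and Euclidean are expressed as the p = 1 and p = 2 cases of the Minkowski distance.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan(a, b):
    """Minkowski distance with p = 1: sum of absolute differences."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Minkowski distance with p = 2: straight-line distance."""
    return minkowski(a, b, 2)


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s, t))


d_manhattan = manhattan((0, 0), (3, 4))
d_euclidean = euclidean((0, 0), (3, 4))
d_hamming = hamming("karolin", "kathrin")
```

In the skin detection example each pixel would be a vector of 147 numbers, so `euclidean` would be applied to two such vectors in exactly the same way.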

K-Nearest Neighbor
Advantages
- Easy to understand and implement
- Highly nonlinear separator
- Only two parameters to tune (distance and k)
- Model can be updated easily
Disadvantages
- Sensitive to noise or irrelevant attributes
- Expensive testing of each instance, because all pairwise distances must be calculated to determine the k-nearest neighbors
- The model is the data set and must be stored for future predictions

Decision Trees
- Can be used for classification or regression
- In classification, the leaves of the tree represent the predicted class
- Decision trees are often used to represent decision making, but decision tree learning results in a prediction (that can be used for decision making)
- Interior nodes filter features until a leaf node is reached and a prediction is made
- Example [figure]

Decision Tree Learning
- Learning is achieved by splitting the training set into subsets based on an attribute value test
- At each step, a variable is chosen to best split the set
- One criterion for "best" is information gain (Kullback-Leibler divergence)
- Entropy measures the level of impurity of a split
- Minimum impurity (all one class) has entropy of 0
- Maximum impurity (equal number of all classes) has entropy of 1 for two classes
- Entropy: $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the probability of class $i$
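The entropy formula above can be checked numerically. This is a small sketch under the assumption that a node is represented simply as a list of class labels; the function name `entropy` and the labels are made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels.

    Each class contributes p * log2(1 / p), which equals -p * log2(p).
    """
    total = len(labels)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(labels).values())


pure = entropy(["skin", "skin", "skin", "skin"])        # all one class
balanced = entropy(["skin", "skin", "other", "other"])  # two equal classes
skewed = entropy(["skin", "other", "other", "other"])   # 1/4 versus 3/4
```

The pure node gives the minimum impurity and the balanced two-class node gives the maximum, matching the two extremes stated above.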

Information Gain
- Information gain is used to decide the order of each attribute split
- Tells us the importance of an attribute of a given feature vector for discriminating between classes
- Information gain = (entropy of parent) - (weighted average entropy of children)
- Choose the split with the highest information gain

Entropy Examples
- Minimum impurity: $-1 \log_2 1 = 0$
- Maximum impurity: $-0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$

Information Gain Example
- Parent: Entropy = 1
- Child 1: Weight = 2/6, Entropy = 0
- Child 2: Weight = 4/6, Entropy = 0.81
- Information gain: $1 - \left( \frac{2}{6} \cdot 0 + \frac{4}{6} \cdot 0.81 \right) = 0.46$

Decision Trees
Advantages
- Easy to understand and interpret results
- Requires less data preprocessing
- Can handle discrete and continuous data
- Can quickly predict new instances
Disadvantages
- Based on heuristics such as greedy algorithms, and therefore cannot guarantee a globally optimal solution
- Pruning is necessary to avoid overfitting
- Information gain is biased towards features with more levels; however, methods exist for avoiding this bias
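The information gain example above (parent entropy 1, children weighted 2/6 and 4/6) can be reproduced with a short script. The slides give only the weights and entropies, so the concrete class labels below are an assumption chosen to produce those same values; `information_gain` is a name made up for this sketch.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Assumed labels: 3 "yes" and 3 "no" in the parent (entropy 1),
# split into a pure child of 2 and a mixed child of 4 (entropy about 0.81).
parent = ["yes", "yes", "yes", "no", "no", "no"]
child_1 = ["yes", "yes"]
child_2 = ["yes", "no", "no", "no"]

gain = information_gain(parent, [child_1, child_2])
```

A tree learner would compute this value for every candidate split and keep the one with the highest gain.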

Ensembles
- Some algorithms are best suited for specific characteristics in data (for example linear or nonlinear)
- Data may have a combination of characteristics
- A single algorithm might not be able to model this combination
- Ensembles combine the results of individual algorithms to obtain better performance than can be achieved by each algorithm alone
- There are several ways to combine the results
- Many data mining competitions are won by ensembles

Training and Validation
- What is the best model?
- It depends on the evaluation criteria
- Some errors may be more costly than others
  - Credit card fraud
  - Medical diagnosis
- Confusion matrix

Training and Validation
- Overfitting versus underfitting [figure]

Training and Validation
- Cross validation
- The model should work well for new (unused) data
- Training set: used to build the model and select parameters
- Validation set: estimates the error and supports model selection
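The evaluation ideas above (holding data out for validation, and weighing some errors more heavily than others) can be sketched as follows. Everything here is illustrative: the function names, the fraud labels, and the cost values are assumptions, and the folds are taken in order without shuffling to keep the example short.

```python
def k_fold_splits(n_samples, k):
    """Yield (training_indices, validation_indices) for each of k folds.

    Every sample lands in exactly one validation fold; fold sizes differ
    by at most one when n_samples is not divisible by k.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def weighted_cost(counts, fp_cost, fn_cost):
    """Total cost when the two kinds of error are priced differently."""
    return counts["fp"] * fp_cost + counts["fn"] * fn_cost


folds = list(k_fold_splits(10, 3))

# Made-up labels: a missed fraud (false negative) is priced 10x a false alarm.
actual = ["fraud", "ok", "ok", "fraud", "ok", "ok"]
predicted = ["fraud", "ok", "fraud", "ok", "ok", "ok"]
counts = confusion_counts(actual, predicted, positive="fraud")
cost = weighted_cost(counts, fp_cost=1, fn_cost=10)
```

In practice a model would be trained on each `training` index list and scored on the matching `validation` list, with the scores averaged across folds.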

Training and Validation
- k-fold Cross Validation [figure]

Knowledge Discovery in Databases
Input Data -> Data Preprocessing -> Data Mining -> Postprocessing -> Information
Data Preprocessing:
- Feature Selection
- Dimensionality Reduction
- Normalization
- Data Subsetting
Postprocessing:
- Filtering Patterns
- Visualization
- Pattern Interpretation

Software
- Weka (free)
- R (free)
- MATLAB
- Statistica
- Many others

Data Sets/Resources
- UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
- UCI KDD Archive: http://kdd.ics.uci.edu/
- Carnegie Mellon University StatLib: http://lib.stat.cmu.edu/index.php
- University of Toronto Delve Datasets: http://www.cs.toronto.edu/~delve/data/datasets.html

Classes at UI
- Knowledge Discovery, Professor Nick Street, CS:6421:0001 or MSCI:6421:0001
- Computational Intelligence, Professor Amaury Lendasse, IE:6350:0001 or NURS:6900:0001
- Big Data Analytics, Professor Amaury Lendasse, IE:4172:0001
- Statistical Pattern Recognition, Professor Yong Chen, IE:6760:0001
- Information Visualization, Professor Amaury Lendasse, IE:3149:0001

References
- Tan, P. N., Steinbach, M., & Kumar, V. Introduction to Data Mining. Boston: Pearson. 2006.
- Witten, I. H., Frank, E., & Hall, M. A. Data Mining: Practical Machine Learning Tools and Techniques. Amsterdam: Morgan Kaufmann. 2011.
- Santoso, S., Negnevitsky, M., & Hatziargyriou, N. Applications of Data Mining and Analysis Techniques in Wind Power Systems. Power Engineering Society General Meeting, 2006. IEEE.
- http://en.wikipedia.org/wiki/data_mining

Figures
- http://www.telegraph.co.uk/finance/newsbysector/energy/8575639/green-taxes-hit-wind-turbine-makers.html
- http://aimotion.blogspot.com/2010/08/tools-for-machine-learning-performance.html
- http://scikit-learn.org/stable/auto_examples/plot_underfitting_overfitting.html
- https://chrisjmccormick.wordpress.com/2013/07/31/k-fold-cross-validation-with-matlab-code/
- http://fiji.sc/trainable_weka_segmentation_-_How_to_compare_classifiers
- http://new.abb.com/substations/energy-storage-applications/capacity-firming
- http://www.proscoutleadgeneration.com/prospect-list-generation-data-mining/
- http://www.philippe-fournier-viger.com/spmf/index.php?link=documentation.php