Big Data: The Science of Patterns. Dr. Lutz Hamel Dept. of Computer Science and Statistics hamel@cs.uri.edu



The Blessing and the Curse: Lots of Data

Outlook   Temp  Humidity  Wind    Play
Sunny     Hot   High      Weak    No
Sunny     Hot   High      Strong  No
Overcast  Hot   High      Weak    Yes
Rain      Mild  High      Weak    Yes
Rain      Cool  Normal    Weak    Yes
Rain      Cool  Normal    Strong  No
Overcast  Cool  Normal    Strong  Yes
Sunny     Mild  High      Weak    No
Sunny     Cool  Normal    Weak    Yes
Rain      Mild  Normal    Weak    Yes
Sunny     Mild  Normal    Strong  Yes
Overcast  Mild  High      Strong  Yes
Overcast  Hot   Normal    Weak    Yes
Rain      Mild  High      Strong  No

1 petabyte is 2^50 bytes, that is, 1024 terabytes or roughly a million gigabytes.
Take Yahoo Inc.'s 2-petabyte, specially built data warehouse, which it uses to analyze the behavior of its half-billion Web visitors per month (2008).

The Problem: Data vs. Information
[Slide repeats the weather table above: Data -> Information]

Information as Patterns
From an AI perspective, information is represented as patterns.
- Patterns summarize large collections of data.
- Patterns can be converted into actionable information. In Yahoo's case, web behavior patterns can be connected to the kinds of online ads Yahoo might show to its customers.
- Patterns come in all kinds of shapes and forms: graphical, rule-based, visual, numeric, etc.

Can You Find Some Patterns Here?
[Slide repeats the weather table above: Data -> Information]

A Tree-Based Pattern
[Figure: the weather table above (Data) and a tree-based pattern derived from it (Information)]

Rule-Based Patterns
Best rules found:
 1. outlook=overcast 4 ==> play=yes 4 conf:(1)
 2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
 3. humidity=normal windy=false 4 ==> play=yes 4 conf:(1)
 4. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1)
 5. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1)
 6. outlook=rainy play=yes 3 ==> windy=false 3 conf:(1)
 7. outlook=rainy windy=false 3 ==> play=yes 3 conf:(1)
 8. temperature=cool play=yes 3 ==> humidity=normal 3 conf:(1)
 9. outlook=sunny temperature=hot 2 ==> humidity=high 2 conf:(1)
10. temperature=hot play=no 2 ==> outlook=sunny 2 conf:(1)
These are called association rules.
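The counts and confidence values in the rule listing can be reproduced by hand. The following is a minimal Python sketch, not the algorithm the rule miner itself uses. It assumes the rule listing refers to the same 14-row weather table, with windy=false corresponding to Wind=Weak and rainy to Rain. The names COLUMNS, ROWS and rule_stats are hypothetical helpers introduced only for this illustration.

```python
# Weather table from the slides (assumed to be the data behind the rules).
COLUMNS = ("Outlook", "Temp", "Humidity", "Wind", "Play")
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def matches(row, conditions):
    """True if the row has the required value for every named attribute."""
    return all(row[COLUMNS.index(name)] == value
               for name, value in conditions.items())


def rule_stats(antecedent, consequent):
    """Return (rows matching the antecedent,
               rows also matching the consequent,
               confidence = second count / first count)."""
    covered = [row for row in ROWS if matches(row, antecedent)]
    correct = [row for row in covered if matches(row, consequent)]
    confidence = len(correct) / len(covered) if covered else 0.0
    return len(covered), len(correct), confidence


# Rule 1: outlook=overcast 4 ==> play=yes 4 conf:(1)
print(rule_stats({"Outlook": "Overcast"}, {"Play": "Yes"}))
# Rule 3: humidity=normal windy=false 4 ==> play=yes 4 conf:(1)
print(rule_stats({"Humidity": "Normal", "Wind": "Weak"}, {"Play": "Yes"}))
```

A rule that does not make the list, such as Humidity=High ==> Play=No, covers 7 rows but is right for only 4 of them, so its confidence is 4/7 rather than 1.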

Visual Pattern
[Figure: self-organizing map of the bacterium B. cereus grown on different agars (mannitol, chocolate, blood, nutrient)]
You are what you eat!

Numeric Patterns
Artificial neural networks (ANNs) learn numeric patterns, which are stored on the weighted connections of their neurons.

Decision Tree Learning
Supervised learning: we have a key concept (the target attribute) that we want to learn, e.g. when to play tennis.
The key idea is that the attributes and their values should be used to sort the data instances in such a way that the target attribute is as non-random as possible, that is, its entropy is as close to 0 as possible.

Entropy
- S is a sample of training examples
- $p_+$ is the proportion of positive examples in S
- $p_-$ is the proportion of negative examples in S
Entropy measures the impurity (randomness) of S:

$$\mathrm{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$$
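To make the entropy definition concrete, here is a small Python sketch that evaluates it on the Play column of the weather table (9 Yes, 5 No). The function name entropy and the list play are illustrative choices, not part of the slides. The function sums over whatever class labels occur, which for two classes reduces to the formula above.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels, in bits.

    Classes that do not occur contribute nothing, which matches
    the usual convention that 0 * log2(0) is treated as 0.
    """
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


# Play column of the 14-row weather table, in row order.
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

print(entropy(play))           # about 0.940: a fairly impure sample
print(entropy(["Yes", "No"]))  # 1.0: a 50/50 split is maximally random
```

A perfectly sorted sample, such as four Yes labels and nothing else, has entropy 0.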

Decision Tree Learning: Recursive Algorithm
Main loop:
1. Let A be the attribute that minimizes the average entropy at the current node.
2. For each value of A, create a new descendant of the current node.
3. Sort the training examples to the descendants.
4. If the training examples are perfectly sorted (entropy = 0), then STOP; else iterate over the descendants.
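The four steps of the main loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the presenter's implementation: the tree representation (a tuple of attribute name and a dict of branches), the helper names, and the fallback to the majority label when no attributes remain are all choices made for this example. The data is the 14-row weather table with Play as the target.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temp", "Humidity", "Wind"]  # last column is Play
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def average_entropy(rows, col):
    """Size-weighted entropy of the target after splitting on column col."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def build_tree(rows, cols):
    """Return a class label (leaf) or (attribute name, {value: subtree})."""
    labels = [row[-1] for row in rows]
    # Step 4: stop when perfectly sorted. Assumption: if no attributes
    # are left, fall back to the majority label.
    if entropy(labels) == 0 or not cols:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: pick the attribute with the lowest average entropy.
    best = min(cols, key=lambda c: average_entropy(rows, c))
    remaining = [c for c in cols if c != best]
    branches = {}
    # Steps 2 and 3: one descendant per value, examples sorted to it.
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining)
    return ATTRIBUTES[best], branches


tree = build_tree(ROWS, [0, 1, 2, 3])
print(tree)
```

On this table the sketch splits on Outlook first, then on Humidity under Sunny and on Wind under Rain, while Overcast is already a pure Yes leaf.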

How exactly does that work?
[Slide repeats the weather table above: Data -> Information]
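As a worked sketch of the first split, the average entropy of Play can be computed by hand for each attribute of the weather table. The notation $\bar{E}(A)$ for the size-weighted average entropy after splitting on attribute A is introduced here for illustration only, and the Temp values are taken to be Hot, Mild and Cool. All figures are rounded to three decimals.

```latex
\begin{align*}
\mathrm{Entropy}(S) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
\bar{E}(\text{Outlook}) &= \tfrac{5}{14}(0.971) + \tfrac{4}{14}(0) + \tfrac{5}{14}(0.971) \approx 0.694 \\
\bar{E}(\text{Temp}) &= \tfrac{4}{14}(1.000) + \tfrac{6}{14}(0.918) + \tfrac{4}{14}(0.811) \approx 0.911 \\
\bar{E}(\text{Humidity}) &= \tfrac{7}{14}(0.985) + \tfrac{7}{14}(0.592) \approx 0.788 \\
\bar{E}(\text{Wind}) &= \tfrac{8}{14}(0.811) + \tfrac{6}{14}(1.000) \approx 0.892
\end{align*}
```

Outlook gives the lowest average entropy, so it becomes the root of the tree. Its Overcast branch (4 Yes, 0 No) already has entropy 0, while the Sunny and Rain branches are sorted further in the same way.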

Decision Tree Learning


The Limits of Big Data
Some domains are infinite, that is, there is an infinite number of distinct observations one could collect. Even the biggest data warehouse will therefore not be able to store all the observations. We are thus building patterns from only partial information, which is a problem for inductive reasoning.

The Limits of Big Data
Karl Popper (1902-1994), a philosopher of science, illustrated the problem of inductive reasoning with the black swan problem. With a limited sample of swans you will most likely conclude that all swans are white. But it turns out there are black swans in Australia. Paraphrasing this: our patterns are only as good as the data that generated them.

The Limits of Big Data
If your data warehouse captures only a subset D of the overall data universe X, then a pattern such as "all swans are white" can hold for D and still be incorrect for X.

Weka Demo
Weka is a data mining tool freely available on the web. It is written in Java and should probably not be used for big data projects, but it is a great way to experiment with tools and techniques important to big data projects.
http://www.cs.waikato.ac.nz/ml/weka/
repository.seasr.org/datasets/uci/arff/mushroom.arff

Resources and References
- Tom Mitchell -- Machine Learning, McGraw Hill, 1997
- Karl Popper -- The Logic of Scientific Discovery, 1934 (as Logik der Forschung, English translation 1959), ISBN 0-415-27844-9
- Weka Data Mining -- http://www.cs.waikato.ac.nz/ml/weka/
- Yahoo! data warehouse -- http://www.computerworld.com/article/2535825/business-intelligence/sizematters--yahoo-claims-2-petabyte-database-is-world-s-biggest--busiest.html
- Decision tree learning -- http://youtu.be/ekd5gxppey0?list=plbv09bd7ez_4tembw7vla19p3tdqh6fyo
- Classifying Bacteria using SOM -- Bayesian Probability Approach to Feature Significance for Infrared Spectra of Bacteria, Lutz Hamel, Chris W. Brown, Applied Spectroscopy, Volume 66, Number 1, 2012.