Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI)
Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis: the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information Tools for data analysis: Database management Business intelligence Data mining (a step of Knowledge Discovery) Machine learning 2
Goals of data analysis Future Past
Data analysis Analyze/mine/summarize large datasets Extract knowledge from past data Predict trends in future data Involve machine learning and data mining tools 4
Common Use Cases Recommend products/friends/dates Classify content into predefined groups Group similar content together Find associations/patterns in actions/behaviors Detect anomalies Identify key topics/summarize text Ranking search results Others.. 5
6 Examples of application
Data mining & machine learning Data analysis relies on database systems + DM & ML Machine learning: branch of artificial intelligence capability to learn from data without being explicitly programmed example: distinguish between spam and non-spam messages Data mining: step of the Knowledge Discovery process discovering patterns in large data sets for decision support example: regularities between products sold in large transaction data Data mining involves machine learning techniques 7
8 The knowledge discovery process
Data mining Discovering patterns in large data sets and transform them into an understandable structure for further use Methods from: database management, statistics, artificial intelligence, machine learning, visualization 9
Data & Patterns Input: raw data Big data Usually unstructured Coming from different sources Output: patterns Expression, in an appropriate language, that describes succinctly some information extracted from the data Features Regularities on data High-level information 10
Example [Scatter plot of bank loans by salary; x = customers with missed rate payments, o = customers with regular payments] 11
Example [Same scatter plot with a salary threshold k separating the two groups] Pattern: IF salary < k THEN missed payments 12
Properties of patterns Validity within some degree (shifting k to the right reduces validity) Novelty with respect to previous knowledge Utility for example: increase in expected profit from a bank Comprehensibility Syntactical measures Semantic measures 13
Data mining techniques Association rules (aka frequent pattern mining) Classification Clustering Collaborative Filtering Regression Anomaly detection Structured prediction Supervised/Semi-supervised learning Feature learning Online learning Grammar induction 14
15 A possible classification
Tools Apache Mahout MLlib Weka KNIME mlpy OpenNN dlib Orange Scikit-learn R RapidMiner Shogun 16
In Our Context Data mining & machine learning tools: efficient in analyzing/mining data, but do not scale Big data platforms: efficient in managing big data, but not so easy to analyze or mine the data How to integrate these two worlds together? 17
Big data projects R over a cluster computing framework RHadoop: open-source extension of R on Hadoop Revolution R: R distribution from Revolution Analytics Many other connectors Apache Mahout Open-source package on Hadoop for data mining and machine learning Apache MLlib Spark's scalable machine learning library consisting of common learning algorithms and utilities 18
R A software environment for statistical computing and graphics Widely used among statisticians Open source (GNU project) Binary versions for various operating systems Uses a command line interface but GUIs are also available 19
Main features of R Data structures: vectors, matrices, arrays, data frames (tables) and lists a scalar is represented as a vector with length one in R. Supports: procedural programming with tons of built-in functions a wide variety of statistical techniques linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering,... matrix arithmetic several graphical facilities highly extensible object-oriented programming with generic functions 20
Examples in R
# Function that returns multiple objects
powers <- function(x) {
  parcel <- list(x2=x*x, x3=x*x*x, x4=x*x*x*x)
  return(parcel)
}
x <- powers(3)
print("showing powers of 3 --")
print(x)
# Read values from tab-delimited autos.dat
autos_data <- read.table("c:/r/autos.dat", header=TRUE, sep="\t")
# Graph autos with adjacent bars using rainbow colors
barplot(as.matrix(autos_data), main="Autos", ylab="Total", beside=TRUE, col=rainbow(5))
# Place the legend at the top-left corner with no frame, using rainbow colors
legend("topleft", c("Mon","Tue","Wed","Thu","Fri"), cex=0.6, bty="n", fill=rainbow(5))
Contents of autos.dat:
cars trucks suvs
1    2      4
3    5      4
6    4      6
4    5      6
9    12     16
21
Apache Mahout Apache Software Foundation project Goals: Build a scalable machine learning library, on large data sets, with a community of developers Be as fast and efficient as the intrinsic design of the algorithm allows Core algorithms implemented on top of scalable, distributed systems Most Mahout implementations are MapReduce-enabled Solutions that run on a single machine are also provided Be open source 22
Components of Apache Mahout An environment for building scalable algorithms Mahout MapReduce A library of ML algorithm implementations using MapReduce It runs over Hadoop with the MapReduce engine Mahout Samsara A library of ML algorithm implementations using linear algebra and statistical operations along with the data structures to support them Scala with specific extensions that look like R It runs over Hadoop with the Spark engine 23
Problems tackled in Mahout 3C: Collaborative Filtering Clustering Classification FPM: Frequent Pattern Mining Miscellaneous: Dimensionality Reduction (for feature extraction) Topic Models (for topic discovery) Others 24
C1: Collaborative Filtering A set of techniques for automatic recommendation Goal: Predict the interest of a user for an item Filter only interesting items "Collaborative" approach: Collecting preferences information from many users Rationale: if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue x than to have the opinion on x of a person chosen randomly. Requirement: Collect a large number of user preferences 25
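The "collaborative" rationale can be sketched as a minimal user-based recommender: rate a candidate item by a similarity-weighted average of other users' ratings. The user names, the `ratings` data, and the `predict` helper below are hypothetical, invented for illustration; this is a single-machine sketch, not Mahout's implementation.

```python
import math

# Toy user -> {item: rating} preference data (hypothetical values)
ratings = {
    "alice": {"film1": 5, "film2": 3, "film3": 4},
    "bob":   {"film1": 4, "film2": 3, "film3": 5, "film4": 4},
    "carol": {"film1": 1, "film2": 5, "film4": 2},
}

def cosine_sim(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = (math.sqrt(sum(u[i] ** 2 for i in common))
           * math.sqrt(sum(v[i] ** 2 for i in common)))
    return num / den

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, prefs in ratings.items():
        if other == user or item not in prefs:
            continue
        s = cosine_sim(ratings[user], prefs)
        num += s * prefs[item]
        den += abs(s)
    return num / den if den else None

print(predict("alice", "film4"))
```

Alice's prediction for "film4" lands close to Bob's rating, because her rating vector is far more similar to Bob's than to Carol's: exactly the rationale stated above.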
C2: Clustering Group similar objects together Different distance measures Manhattan, Euclidean, Different algorithms: Connectivity based, Centroid-based, Distribution-based, Density-Based, Others 26
C3: Classification Place items into predefined categories: Entertainment, politics, risks,.. Recommenders Approaches: Linear, Decision trees, Bayesian networks, Support vector machines, Neural networks, 27
FPM: Find the frequent itemsets E.g., milk, bread, and cheese are frequently sold together Very common in market basket analysis, access pattern analysis, ... 28
We Focus On Clustering K-Means Classification Naïve Bayes Frequent Pattern Mining Apriori 29
K-Means Algorithm Iterative algorithm until converges 30
K-Means Algorithm Step 1: Select K points at random (Centers) Step 2: For each data point, assign it to the closest center Now we formed K clusters Step 3: For each cluster, re-compute the centers E.g., in the case of 2D points X: average over all x-axis points in the cluster Y: average over all y-axis points in the cluster Step 4: If the new centers are different from the old centers (previous iteration) then Go to Step 2 31
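Steps 1-4 above can be sketched in plain Python; this is a single-machine sketch on 2D points (tuples), not the distributed Mahout implementation:

```python
import random

def kmeans(points, k, max_iter=100):
    """Iterative K-Means on 2D points, following Steps 1-4."""
    centers = random.sample(points, k)              # Step 1: K random points as centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                            # Step 2: assign to closest center
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        new_centers = [                             # Step 3: average x and y per cluster
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:                  # Step 4: stop when centers are stable
            break
        centers = new_centers
    return centers
```

Note the convergence test in Step 4 compares the new centers against those of the previous iteration, exactly as in the slide.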
K-Means in MapReduce Input Dataset (set of points in 2D) Large Initial centroids (K points) Small Map step Each map reads the K-centroids + one block from dataset Assign each point to the closest centroid Output <centroid, point> 32
K-Means in MapReduce (Cont'd) Reduce step Gets all points for a given centroid Re-computes a new centroid for this cluster Output: <new centroid> Iteration control Compare the old and new set of K centroids If similar, then stop Else, if the maximum number of iterations has been reached, then stop; else start another MapReduce iteration 33
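One such iteration can be simulated in memory; `map_step` and `reduce_step` below are hypothetical stand-ins for the real Hadoop map and reduce tasks, with blocks standing in for HDFS splits:

```python
def closest(point, centroids):
    """Return the centroid nearest to the point (squared Euclidean distance)."""
    return min(centroids, key=lambda c: (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2)

def map_step(block, centroids):
    """Map: read the K centroids + one block, emit <centroid, point> pairs."""
    return [(closest(p, centroids), p) for p in block]

def reduce_step(pairs):
    """Reduce: gather all points per centroid and average them into a new centroid."""
    groups = {}
    for c, p in pairs:
        groups.setdefault(c, []).append(p)
    return [(sum(x for x, _ in ps) / len(ps), sum(y for _, y in ps) / len(ps))
            for ps in groups.values()]

# One iteration over two "blocks" of the dataset
centroids = [(0.0, 0.0), (10.0, 10.0)]
blocks = [[(0, 1), (1, 0)], [(10, 11), (11, 10)]]
pairs = [kv for b in blocks for kv in map_step(b, centroids)]
print(reduce_step(pairs))   # new centroids for the next iteration
```

The driver would then compare the printed centroids to the old ones and decide whether to launch another MapReduce job, as in the iteration control above.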
K-Means Optimizations Use of Combiners Similar to the reducer Computes for each centroid the local sums (and counts) of the assigned points Sends to the reducer <centroid, <partial sums>> Use of Single Reducer Amount of data to reducers is very small Single reducer can tell whether any of the centers has changed or not Creates a single output file 34
Naïve Bayes Classifier Given a dataset (training data), we learn (build) a statistical model This model is called a classifier Each point in the training data is of the form: <label, feature1, feature2, ..., featureN> Label: the class label Features 1..N: the features (dimensions of the point) Then, given a point without a label <?, feature1, ..., featureN> Use the model to decide on its label 35
Naïve Bayes Classifier: Example Three features Class label (male or female) Training dataset 36
Naïve Bayes Classifier (Cont d) For each feature in each label Compute the mean and variance That is the model (classifier) 37
Naïve Bayes: Classify New Object Male or female? For each label: Compute posterior value The label with the largest posterior is the suggested label 38
Naïve Bayes: Classify New Object (Cont'd) Male or female? Evidence: can be ignored since it is the same constant for all labels P(label): % of training points with this label p(feature = f | label) = (1 / √(2πσ²)) · exp(−(f − μ)² / (2σ²)), where f is the feature value in the sample and μ, σ² are the mean and variance of that feature for the label 39
Naïve Bayes: Classify New Object (Cont d) Male or female? 40
Naïve Bayes: Classify New Object (Cont d) Male or female? The sample is predicted to be female 41
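The whole procedure (per-label mean/variance model, Gaussian likelihood, posterior comparison with the evidence ignored) can be sketched as follows. The training values are the hypothetical height/weight/foot-size numbers of the classic textbook version of this example, not necessarily those in the slide figures:

```python
import math

# Hypothetical training data: (height, weight, foot size) rows per label
training = {
    "male":   [(6.00, 180, 12), (5.92, 190, 11), (5.58, 170, 12), (5.92, 165, 10)],
    "female": [(5.00, 100,  6), (5.50, 150,  8), (5.42, 130,  7), (5.75, 150,  9)],
}

def mean_var(values):
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)   # sample variance
    return m, v

# The "model" (classifier): a (mean, variance) pair per label and per feature
model = {label: [mean_var([row[i] for row in rows]) for i in range(3)]
         for label, rows in training.items()}

def gaussian(f, m, v):
    """p(feature = f | label) under a normal distribution with mean m, variance v."""
    return math.exp(-(f - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def classify(sample):
    total = sum(len(rows) for rows in training.values())
    posteriors = {}
    for label, stats in model.items():
        p = len(training[label]) / total            # prior P(label)
        for f, (m, v) in zip(sample, stats):
            p *= gaussian(f, m, v)                  # naive independence assumption
        posteriors[label] = p                       # evidence omitted (same for all labels)
    return max(posteriors, key=posteriors.get)

print(classify((6.0, 130, 8)))   # -> female
```

Even though the sample's height leans male, the weight and foot-size likelihoods dominate the product, so the female posterior is larger: the same conclusion as the slide.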
Frequent Pattern Mining Input: a set of items I ={milk, bread, jelly, } a set of transactions where each transaction contains subset of items t1 = {milk, bread, water} t2 = {milk, nuts, butter, rice} Goal: What are the itemsets frequently sold together? Is there a relationship between itemsets in transactions? 42
Output: association rules Rules X → Y, where X and Y appear together in a transaction Support S: #transactions containing X ∪ Y / #transactions in D (statistical relevance) Confidence C: #transactions containing X ∪ Y / #transactions containing X (relevance of the implication) 43
Example Milk → Eggs Support: 2% of transactions contain both elements Confidence: 30% of transactions that contain milk contain eggs as well 44
Example
TID    Items
2000   A, B, C
1000   A, C
4000   A, D
5000   B, E, F
Let us assume: minimal support 50%, minimal confidence 50% 45
Examples (cont)
TID    Items
2000   A, B, C
1000   A, C
4000   A, D
5000   B, E, F
Extracted rules: A → C (support 50%, confidence 66.6%), C → A (support 50%, confidence 100%) 46
Approach Step 1: extraction of frequent itemsets (minimal support 50%)
TID    Items
2000   A, B, C
1000   A, C
4000   A, D
5000   B, E, F
Frequent itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%
47
Approach (cont) Step 2: rule extraction (minimal confidence 50%) Confidence of rule A → C: support{A,C} / support{A} = 66.6% Confidence of rule C → A: support{A,C} / support{C} = 100% Extracted rules: A → C (support 50%, confidence 66.6%), C → A (support 50%, confidence 100%) 48
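The two measures can be checked on the example transactions with a short sketch; the `support` and `confidence` helpers are illustrative names, not a library API:

```python
# The four example transactions, as sets of items
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}))          # 0.5
print(confidence({"A"}, {"C"}))     # 0.666...
print(confidence({"C"}, {"A"}))     # 1.0
```

The printed values match the slide: both rules clear the 50% support threshold, A → C has confidence 66.6%, and C → A has confidence 100%.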
Application Market basket analysis * → eggs: what should we promote to increase the sales of eggs? Milk → *: what products need to be sold by a supermarket that sells milk? 49
How to find frequent itemsets Naïve Approach Enumerate all possible itemsets and then count each one All possible itemsets of size 1 All possible itemsets of size 2 All possible itemsets of size 3 All possible itemsets of size 4 50
Can we optimize?? {Bread}: 80% {PeanutButter}: 60% {Bread, PeanutButter}: 60% Important property: For itemset S={X, Y, Z, } of size n to be frequent, all its subsets of size n-1 must be frequent as well 51
Apriori Algorithm Executes in scans (iterations); each scan has two phases Given a list of candidate itemsets of size n, count their appearances and find the frequent ones From the frequent ones, generate candidates of size n+1 (the previous property must hold) Start the algorithm with n = 1, then repeat 52
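The two-phase loop can be sketched as a single-machine Python function, run here on the example transactions from the previous slides (this is an illustrative sketch, not Mahout's distributed implementation):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: count size-n candidates, then generate size n+1."""
    n_trans = len(transactions)
    frequent = {}
    # Start with n = 1: every single item is a candidate
    current = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    n = 1
    while current:
        # Scan phase: count appearances, keep the frequent itemsets
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n_trans for c, cnt in counts.items()
                 if cnt / n_trans >= min_support}
        frequent.update(level)
        # Generation phase: size n+1 candidates whose size-n subsets are all frequent
        current, seen = [], set()
        for a, b in combinations(list(level), 2):
            cand = a | b
            if len(cand) == n + 1 and cand not in seen:
                if all(frozenset(s) in level for s in combinations(cand, n)):
                    current.append(cand)
                    seen.add(cand)
        n += 1
    return frequent

trans = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(trans, 0.5))   # {A}: 0.75, {B}: 0.5, {C}: 0.5, {A,C}: 0.5
```

On these transactions the first scan keeps {A}, {B}, {C}; the subset check (the property from the previous slide) prunes nothing here, but counting eliminates {A,B} and {B,C}, leaving {A,C}: the same frequent itemsets as the worked example.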
53 Apriori Example
54 Apriori Example (Cont d)
Machine learning over Hadoop How to implement K-Means, Naïve Bayes and FPM in MapReduce? 55
Apache Mahout http://mahout.apache.org/ 56
57 Apache Mahout