Mammoth Scale Machine Learning!



Similar documents
COSC 6397 Big Data Analytics. Mahout and 3 rd homework assignment. Edgar Gabriel Spring Mahout

! E6893 Big Data Analytics:! Demo Session II: Mahout working with Eclipse and Maven for Collaborative Filtering

Machine Learning using MapReduce

Text Clustering Using LucidWorks and Apache Mahout

Analytics on Big Data

Advanced In-Database Analytics

Scalable Machine Learning - or what to do with all that Big Data infrastructure

Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP

Big Data Analytics Opportunities and Challenges

MLlib: Scalable Machine Learning on Spark

Big Data and Scripting Systems build on top of Hadoop

Map-Reduce for Machine Learning on Multicore

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

BIG DATA What it is and how to use?

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

MADlib. An open source library for in-database analytics. Hitoshi Harada PGCon 2012, May 17th

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Radoop: Analyzing Big Data with RapidMiner and Hadoop

On the Performance of High Dimensional Data Clustering and Classification Algorithms

About this Tutorial. Audience. Prerequisites. Copyright & Disclaimer

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.

Machine Learning Big Data using Map Reduce

PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012

Journée Thématique Big Data 13/03/2015

Hadoop Ecosystem B Y R A H I M A.

Using Data Mining and Machine Learning in Retail

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

HiBench Introduction. Carson Wang Software & Services Group

Apache Mahout's new DSL for Distributed Machine Learning. Sebastian Schelter GOTO Berlin 11/06/2014

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Spark and the Big Data Library

Connecting library content using data mining and text analytics on structured and unstructured data

Social Media Mining. Data Mining Essentials

HiBench Installation. Sunil Raiyani, Jayam Modi

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Exploring Big Data in Social Networks

Active Learning SVM for Blogs recommendation

Distributed Computing and Big Data: Hadoop and MapReduce

Hadoop: challenge accepted!

What s Cooking in KNIME

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Sentiment analysis on tweets in a financial domain

Large-Scale Test Mining

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Practical Introduction to Machine Learning and Optimization. Alessio Signorini

Data Algorithms. Mahmoud Parsian. Tokyo O'REILLY. Beijing. Boston Farnham Sebastopol

Big Data and Marketing

L3: Statistical Modeling with Hadoop

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Architectures for Big Data Analytics A database perspective

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Introduction Predictive Analytics Tools: Weka

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Expedia

BUDT 758B-0501: Big Data Analytics (Fall 2015) Decisions, Operations & Information Technologies Robert H. Smith School of Business

BIG DATA TRENDS AND TECHNOLOGIES

Role of Social Networking in Marketing using Data Mining

HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

The Data Mining Process

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Source: Tutorial: Introduction to Big Data Marko Grobelnik, Blaz Fortuna, Dunja Mladenic Jozef Stefan Institute, Slovenia

Fast Data in the Era of Big Data: Twitter s Real-

Getting Started with Hadoop with Amazon s Elastic MapReduce

Hadoop. Sunday, November 25, 12

DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights

Lecture #2. Algorithms for Big Data

ISSN: CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

1 Topic. 2 Scilab. 2.1 What is Scilab?

Unified Big Data Processing with Apache Spark. Matei

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE

The Need for Training in Big Data: Experiences and Case Studies

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features

Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies

Data processing goes big

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box

Prerequisites. Course Outline

Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved.

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II

How To Scale Out Of A Nosql Database

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Big Data and Data Science: Behind the Buzz Words

Collaborative Filtering. Radek Pelánek

Transcription:

Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010!

Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes and Terabytes of data to analyze?!# Do you have Hadoop or MapReduce experience?!# Thanks for the survey!

Little bit about me!# Apache Mahout PMC member!# A ML Enthusiast "!# Software Engineer @ Google!# Google Summer of Code Mentor!# Previous Life: Google Summer of Code student for 2 years.

Agenda!# Introducing Mahout!# Different classes of problems!# And their Mahout based solutions!# Basic data structure!# Usage examples!# Sneak peek at our Next Release

The Mission To build a scalable machine learning library

!# Scale to large datasets Scale! -# Hadoop MapReduce implementations that scales linearly with data. -# Fast sequential algorithms whose runtime doesn t depend on the size of the data -# Goal: To be as fast as possible for any algorithm!# Scalable to support your business case -# Apache Software License 2!# Scalable community -# Vibrant, responsive and diverse -# Come to the mailing list and find out more

The Mission To build a scalable machine learning library

Why a new Library!# Plenty of open source Machine Learning libraries either -# Lack community -# Lack scalability -# Lack documentations and examples -# Lack Apache licensing -# Are not well tested -# Are Research oriented

Agenda!# Introducing Mahout!# Different classes of problems!# And their Mahout based solutions!# Basic data structure!# Usage examples!# Sneak peek at our Next Release

ML on Twitter!# Collection of tweets in the last hour!# Each 140 character or token stream!# We will keep using this example throughout this talk

What is Clustering!# Call it fuzzy grouping based on a notion of similarity

!# Plenty of Algorithms: K-Means, Fuzzy K-Means, Mean Shift, Canopy, Dirichlet Mahout Clustering!# Group similar looking objects!# Notion of similarity: Distance measure: -# Euclidean -# Cosine -# Tanimoto -# Manhattan

Clustering Tweets Identify tweets that are similar and group them

Topic modeling!# Grouping similar or co-occurring features into a topic -# Topic Lol Cat : -# Cat -# Meow -# Purr -# Haz -# Cheeseburger -# Lol

Mahout Topic Modeling!# Algorithm: Latent Dirichlet Allocation -# Input a set of documents -# Output top K prominent topics and the features in each topic

Filtering Topics from Tweets Identify emerging topics in a collection of tweets

Classification!# Predicting the type of a new object based on its features!# The types are predetermined Dog Cat

Mahout Classification!# Plenty of algorithms -# Naïve Bayes -# Complementary Naïve Bayes -# Random Forests -# Logistic Regression (Almost done) -# Support Vector Machines (patch ready)!# Learn a model from a manually classified data!# Predict the class of a new object based on its features and the learned model

Detect OSCON Tweets Tweets without #OSCON Use tweets mentioning #OSCON to train and Classify incoming tweets

Recommendations!# Predict what the user likes based on -# His/Her historical behavior -# Aggregate behavior of people similar to him

Mahout Recommenders!# Different types of recommenders -# User based -# Item based!# Full framework for storage, online online and offline computation of recommendations!# Like clustering, there is a notion of similarity in users or items -# Cosine, Tanimoto, Pearson and LLR

Recommended Tweets Discover interesting tweets without Re-Tweeting or Replying

Frequent Pattern Mining!# Find interesting groups of items based on how they co-occur in a dataset

Mahout Parallel FPGrowth!# Identify the most commonly occurring patterns from -# Sales Transactions buy Milk, eggs and bread -# Query Logs ipad -> apple, tablet, iphone -# Spam Detection Yahoo! http://www.slideshare.net/hadoopusergroup/mail-antispam

Frequent patterns in Tweets Identify groups of words that occur together Or Identify related searches from search logs

Mahout is Evolving!# Mapreduce enabled fitness functions for Genetic programming -# Integration with Watchmaker -# Solves: Travelling salesman, class discovery and many others!# Singular Value decomposition [SVD] of large matrices -# Reduce a large matrix into a smaller one by identifying the key rows and columns and discarding the others -# Mapreduce implementation of Lanczos algorithm

Agenda!# Introducing Mahout!# Different classes of problems!# And their Mahout based solutions!# Basic data structure!# Usage examples!# Sneak peek at our Next Release

Vector

Y! Representing Data as Vectors X = 5, Y = 3! (5, 3)! X!!# The vector denoted by point (5, 3) is simply Array([5, 3]) or HashMap([0 => 5], [1 => 3])

Representing Vectors The basics!# Now think 3, 4, 5,.. n-dimensional!# Think of a document as a bag of words. she sells sea shells on the sea shore!# Now map them to integers she => 0 sells => 1 sea => 2 and so on!# The resulting vector [1.0, 1.0, 2.0, ]

Vectorizer tools!# Map/Reduce tools to convert text data to vectors -# Use collate multiple words (n-grams) eg: San Francisco -# Normalization -# Optimize for sequential or random access -# TF-IDF calculation -# Pruning -# Stop words removal

Agenda!# Introducing Mahout!# Different classes of problems!# And their Mahout based solutions!# Basic data structure!# Usage examples!# Sneak peek at our Next Release

How to use mahout!# Command line launcher bin/mahout!!# See the list of tools and algorithms by running bin/mahout!!# Run any algorithm by its shortname: -# bin/mahout kmeans help!!# By default runs locally!# export HADOOP_HOME = /pathto/hadoop-0.20.2/! -# Runs on the cluster configured as per the conf files in the hadoop directory!# Use driver classes to launch jobs: -# KMeansDriver.runjob(Path input, Path output )!

Clustering Walkthrough (tiny example)!# Input: set of text files in a directory!# Download Mahout and unzip -# mvn install! -# bin/mahout seqdirectory i <input> o <seqoutput>! -# bin/mahout seq2sparse i seq-output o <vectoroutput>! -# bin/mahout kmeans i<vector-output>!!-c <cluster-temp> -o <cluster-output> -k 10 cd 0.01 x 20!

Clustering Walkthrough (a bit more)!# Use bigrams: -ng 2!!# Prune low frequency: s 10!!# Normalize: -n 2!!# Use a distance measure : -dm org.apache.mahout.common.distance.cosinedistancem easure!

Clustering Walkthrough (viewing results)!# bin/mahout clusterdump " s cluster-output/clusters-9/part-00000 " -d vector-output/dictionary.file-* " -dt sequencefile -n 5 -b 100"!# Top terms in a typical cluster comic => 9.793121272867376! comics => 6.115341078151356! con => 5.015090566692931! sdcc => 3.927590843402978! webcomics => 2.916910980686997!

Agenda!# Introducing Mahout!# Different classes of problems!# And their Mahout based solutions!# Basic data structure!# Usage examples!# Sneak peek at our Next Release

!# New breed of classifiers: Mahout 0.4 (trunk) -# Stochastic Gradient Descent (SGD) -# Pegasos SVM (Order of magnitude faster than SVM Perf) -# Lib Linear (Winner, ICML 2008)!# New Recommenders: -# Restricted Boltzmann Machine (RBM) based recommender -# SVD++ recommender!# New Clustering algorithms: -# Spectral Clustering -# K-Means++!# Full Hadoop 0.20 API compliance and performance improvements

Get Started!# http://mahout.apache.org!# dev@mahout.apache.org - Developer mailing list!# user@mahout.apache.org - User mailing list!# Check out the documentations and wiki for quickstart!# http://svn.apache.org/repos/asf/mahout/trunk/ Browse Code

Resources!# Mahout in Action Owen, Anil, Dunning, Friedman http://www.manning.com/owen!# Taming Text Ingersoll, Morton, Farris http://www.manning.com/ingersoll!# Introducing Apache Mahout http://www.ibm.com/developerworks/java/library/j-mahout/

Thanks to!# Apache Foundation!# Mahout Committers!# Google Summer of Code Organizers!# And Students!# OSCON!# Open source!

!# news.google.com References!# Cat http://www.flickr.com/photos/gattou/3178745634/!# Dog http://www.flickr.com/photos/30800139@n04/3879737638/!# Milk Eggs Bread http://www.flickr.com/photos/nauright/4792775946/!# Amazon Recommendations!# twitter