E6893 Big Data Analytics Lecture 5: Big Data Analytics Algorithms -- II

1 E6893 Big Data Analytics Lecture 5: Big Data Analytics Algorithms -- II. Ching-Yung Lin, Ph.D., Adjunct Professor, Dept. of Electrical Engineering and Computer Science; Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center. October 2nd, 2014

2 Course Structure
Class Date  Number  Topics Covered
09/04/14  1  Introduction to Big Data Analytics
09/11/14  2  Big Data Analytics Platforms
09/18/14  3  Big Data Storage and Processing
09/25/14  4  Big Data Analytics Algorithms -- I
10/02/14  5  Big Data Analytics Algorithms -- II (recommendation)
10/09/14  6  Big Data Analytics Algorithms -- III (clustering)
10/16/14  7  Big Data Analytics Algorithms -- IV (classification)
10/23/14  8  Linked Big Data Graph Computing
10/30/14  9  Big Data Visualization
11/06/14  10  Mobile Data Collection, Analysis, and Interface
11/13/14  11  Hardware, Processors, and Cluster Platforms
11/20/14  12  Big Data Next Challenges: IoT, Cognition, and Beyond
11/27/14  Thanksgiving Holiday
12/04/14  13  Final Projects Discussion (Optional)
12/11/14 & 12/12/  Two-Day Big Data Analytics Workshop: Final Project Presentations

3 Review Key Components of Mahout

4 Mahout reference book

5 Setting Up Mahout Step 1: Java JVM and IDEs (e.g., Eclipse) Step 2: Maven Step 3: Mahout Eclipse Luna (June 2014)

6 Recommender Inputs Solid lines: positively related Dashed lines: negatively related Input Data: User, Item, Rating

7 User-based Recommendation Scenario I gettofail.com

8 User-based Recommendation Scenario II

9 User-based Recommendation Scenario III

10 User-based Recommendation Algorithms

11 Example Recommender Code via Mahout

12 Process and output of the example. Recommendation for Person 1: Item 104 > Item 106; Item 107 is not favored

13 Refresh (Reload) Data

14 Update data

15 User Similarity Measurements
Pearson Correlation Similarity
Euclidean Distance Similarity
Cosine Measure Similarity
Spearman Correlation Similarity
Tanimoto Coefficient Similarity (Jaccard coefficient)
Log-Likelihood Similarity

16 Pearson Correlation Similarity Data: missing data

17 On Pearson Similarity. Three problems with the Pearson similarity: 1. It does not take into account the number of items on which two users' preferences overlap (e.g., two overlapping items can yield a correlation of 1, while a larger overlap may yield a lower value). 2. If two users overlap on only one item, no correlation can be computed. 3. The correlation is undefined if all preference values in either series are identical. Passing Weighting.WEIGHTED as the second parameter of the constructor can cause the resulting correlation to be pushed towards 1.0 or -1.0, depending on how many points are used.
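The three problems listed on this slide can be reproduced with a few lines of code. The following is a minimal Python sketch, not Mahout's implementation; the function name `pearson` and the dictionary-based preference format are assumptions made for illustration.

```python
from math import sqrt

def pearson(prefs_a, prefs_b):
    """Pearson correlation over co-rated items; returns None when undefined.

    prefs_a and prefs_b map item IDs to ratings (hypothetical format).
    """
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return None  # problem 2: a single overlapping item gives no correlation
    xs = [prefs_a[i] for i in common]
    ys = [prefs_b[i] for i in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # problem 3: one series is constant
    return sxy / sqrt(sxx * syy)
```

With only two co-rated items whose ratings differ, the result is always exactly 1 or -1 (problem 1), regardless of how the two users rate everything else.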

18 Euclidean Distance Similarity. Similarity = 1 / (1 + d), where d is the Euclidean distance between the two users' preference vectors.
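As an illustration of the formula on this slide, here is a small Python sketch. It assumes the distance is taken over the items both users have rated, and the name `euclidean_similarity` is made up for this example; a library implementation may normalize differently.

```python
from math import sqrt

def euclidean_similarity(prefs_a, prefs_b):
    """Return 1 / (1 + d), with d measured over co-rated items only."""
    common = set(prefs_a) & set(prefs_b)
    if not common:
        return None  # no shared items: similarity is left undefined here
    d = sqrt(sum((prefs_a[i] - prefs_b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + d)
```

Identical ratings give d = 0 and a similarity of 1; the similarity decays towards 0 as the distance grows.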

19 Cosine Similarity. Cosine similarity and Pearson similarity give the same result if the data are normalized (mean == 0).
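The equivalence stated on this slide can be checked numerically. The sketch below is an illustration with made-up ratings: it computes Pearson from raw sums, then compares it with the cosine of the mean-centered vectors. All function names are assumptions for this example.

```python
from math import sqrt

def cosine(xs, ys):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(xs, ys))
    return dot / (sqrt(sum(x * x for x in xs)) * sqrt(sum(y * y for y in ys)))

def center(xs):
    """Subtract the mean so the vector has mean 0."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]

def pearson(xs, ys):
    """Pearson correlation computed directly from raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

# Hypothetical ratings of two users on the same three items.
a = [5.0, 3.0, 2.5]
b = [2.0, 2.5, 5.0]
```

On the uncentered vectors the two measures disagree (here even in sign), which is why the normalization condition matters.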

20 Spearman Correlation Similarity. Example for ties. Pearson value on the relative ranks
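A rough sketch of "Pearson value on the relative ranks" follows. It assumes tied ratings receive the average of the ranks they span, which is one common convention and may differ from the tie handling shown on the slide. The names `average_ranks` and `spearman` are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks of the ratings."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Because only the ordering matters, any monotonically increasing relationship between the two users' ratings yields a Spearman value of 1, even when it is far from linear.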

21 Caching User Similarity. Spearman correlation similarity is time-consuming to compute. Use caching: remember user-user similarities that were previously computed.
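The caching idea can be sketched with Python's standard `functools.lru_cache`. This is only an illustration of the principle, not Mahout's caching wrapper; the data, the call counter, and the function names are hypothetical, and a cheap set-overlap measure stands in for the expensive similarity.

```python
from functools import lru_cache

# Hypothetical item sets per user ID (illustrative data, not from the lecture).
ITEMS = {1: {101, 102, 103}, 2: {101, 103, 104}, 3: {105}}
CALLS = {"count": 0}  # counts how often the expensive function really runs

def expensive_similarity(user_a, user_b):
    """Stand-in for a costly measure such as Spearman correlation."""
    CALLS["count"] += 1
    a, b = ITEMS[user_a], ITEMS[user_b]
    return len(a & b) / len(a | b)

@lru_cache(maxsize=1024)
def _cached(lo, hi):
    return expensive_similarity(lo, hi)

def cached_similarity(user_a, user_b):
    """Similarity is symmetric, so order the pair before the cache lookup."""
    lo, hi = sorted((user_a, user_b))
    return _cached(lo, hi)
```

Ordering the pair before the lookup means (1, 2) and (2, 1) share one cache entry. The cache must be cleared (`_cached.cache_clear()`) whenever the underlying preference data is refreshed.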

22 Tanimoto (Jaccard) Coefficient Similarity. Discards preference values. Tanimoto similarity is the same as Jaccard similarity, but Tanimoto distance is not the same as Jaccard distance.
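Since preference values are discarded, the coefficient depends only on which items each user has expressed a preference for. A minimal Python sketch, with a hypothetical function name:

```python
def tanimoto(items_a, items_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return None  # neither user has any items
    return len(a & b) / len(union)
```

For example, users with items {101, 102, 103} and {101, 103, 104, 105} share 2 of 5 distinct items, giving a similarity of 0.4.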

23 Log-Likelihood Similarity. Assesses how unlikely it is that the overlap between the two users is just due to chance.
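One way to quantify "unlikely to be due to chance" is the log-likelihood ratio (G-squared) statistic on a 2x2 table of item counts. The sketch below is a hedged illustration: the table layout, the function names, and the final mapping 1 - 1/(1 + LLR) into the range [0, 1) are assumptions made for this example rather than something the slide specifies.

```python
from math import log

def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    """Unnormalized entropy term used by the G-squared statistic."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """k11: items both users have; k12: only A; k21: only B; k22: neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp rounding noise below 0

def loglikelihood_similarity(items_a, items_b, n_items):
    """Map the statistic into [0, 1); the mapping is an assumed convention."""
    a, b = set(items_a), set(items_b)
    k11 = len(a & b)
    k12 = len(a) - k11
    k21 = len(b) - k11
    k22 = n_items - k11 - k12 - k21
    return 1.0 - 1.0 / (1.0 + llr(k11, k12, k21, k22))
```

When the overlap is exactly what independence would predict, the statistic is 0 and so is the similarity; a large overlap relative to chance pushes the similarity towards 1.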

24 Performance measurements. Using GroupLens data (http://grouplens.org): the 10-million-rating MovieLens dataset. Spearman: 0.8 Tanimoto: 0.82 Log-Likelihood: 0.73 Euclidean: 0.75 Pearson (weighted): 0.77 Pearson:

25 Performance measurements 10 nearest neighbors: nearest neighbors: nearest neighbors: % of training; 5% of testing

26 Selecting the number of neighbors Based on number of neighbors Based on a fixed threshold, e.g., 0.7 or
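The two neighborhood strategies on this slide can be sketched as follows. This is an illustrative Python version with hypothetical function names and similarity values; it assumes the similarities from the target user to every other user are already available as a dictionary.

```python
def nearest_n(similarities, n):
    """Keep the n most similar users, regardless of how similar they are."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]

def above_threshold(similarities, threshold):
    """Keep every user whose similarity reaches the threshold."""
    return sorted(u for u, s in similarities.items() if s >= threshold)

# Hypothetical similarities from one target user to users 2..5.
sims = {2: 0.9, 3: 0.2, 4: 0.75, 5: 0.7}
```

A fixed count always returns a neighborhood but may include weakly similar users; a fixed threshold guarantees quality but may return few or no neighbors for some users.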

27 Item-based recommendation

28 Item-based recommendation algorithm

29 Code and Performance of Item-Based Recommendation

30 Slope-One Recommender

31 Slope-One Algorithm. Difference values from the example. Slope-One got a result near 0.65 on the GroupLens data.
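The role of the difference values can be illustrated with a small weighted Slope-One sketch in Python. The ratings below are made up for this example and are not the ones from the slide; the function name is also hypothetical.

```python
def slope_one_predict(prefs, user, target):
    """Predict the user's rating for target from average item differences."""
    numerator, denominator = 0.0, 0
    for item, rating in prefs[user].items():
        if item == target:
            continue
        # Differences (target - item) over all users who rated both items.
        diffs = [p[target] - p[item]
                 for p in prefs.values() if target in p and item in p]
        if diffs:
            average_diff = sum(diffs) / len(diffs)
            # Weight each estimate by the number of users supporting it.
            numerator += (rating + average_diff) * len(diffs)
            denominator += len(diffs)
    return numerator / denominator if denominator else None

# Hypothetical ratings: user -> {item: rating}.
prefs = {
    "A": {"x": 5.0, "y": 3.0, "z": 2.0},
    "B": {"x": 3.0, "y": 4.0},
    "C": {"y": 2.0, "z": 5.0},
}
```

For user B and item z: the average difference z - x is -3 (one user), giving an estimate of 0; the average difference z - y is 1 (two users), giving an estimate of 5. The weighted prediction is (0 * 1 + 5 * 2) / 3, about 3.33.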

32 Other recommenders. SVD recommender parameters: number of features, number of training steps, lambda (factor for regularization). The SVD method got 0.69 on the GroupLens data.

33 Linear Interpolation Item-based recommender SVD method got 0.76 on the GroupLens data

34 Cluster-based Recommendation

35 Other Recommenders not in Mahout: Groups (SDM 06). A third-party knowledge repository: 30K users and 20K documents. Study of the most active 697 users, who have at least 20 downloads in a year. Results beyond collaborative filtering: (1) Collaborative + Content Filtering (53% improvement); (2) CBDR: Collaborative + Content Filtering + Graph Community Analytics (259% accuracy improvement over collaborative filtering).

36 Other Recommenders not in Mahout: Info Flow (SIGIR 06). IF: Graphical Information Flow Model. TIF: Joint Topic Detection + Information Flow Model. Tests: 1 month, 586 new docs, 1,170 users. Compared to Collaborative Filtering (CF) + Similar People (SP): Precision: IF is 91% better, TIF is 108% better. Recall: IF is 87% better, TIF is 113% better. [Figure: network information flow and number of recommended users for CF + SP, IF, and TIF, across adopter categories (innovators, early adopters, early majority, late majority, laggards).]

37 Distributed Item-based Recommender

38 Distributed recommender: get the co-occurrence matrix. Data:

39 Multiply the co-occurrence matrix with the user preference vector. The highest score is for item 103 (101, 104, 105, 107 have already been purchased by user 3).
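The multiplication described on this slide can be sketched in plain Python. The data below is hypothetical (it is not the slide's data set), and the function names are made up for this illustration. The diagonal of the co-occurrence matrix is skipped, since it only affects items the user already has, and those are excluded from the result anyway.

```python
from collections import defaultdict

def cooccurrence(prefs):
    """co[a][b] = number of users who have a preference for both a and b."""
    co = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a in items:
            for b in items:
                if a != b:
                    co[a][b] += 1
    return co

def recommend(prefs, user, top_n=1):
    """Score = co-occurrence row times the user's preference vector."""
    co = cooccurrence(prefs)
    scores = defaultdict(float)
    for item, rating in prefs[user].items():
        for other, count in co[item].items():
            if other not in prefs[user]:  # skip items the user already has
                scores[other] += count * rating
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical preferences: user -> {item: rating}.
prefs = {
    1: {"a": 5.0, "b": 3.0},
    2: {"a": 2.0, "b": 2.5, "c": 5.0},
    3: {"a": 4.0, "c": 4.5, "d": 1.0},
}
```

For user 1, item c co-occurs twice with a and once with b, so its score is 2 * 5.0 + 1 * 3.0 = 13.0; item d co-occurs once with a only, scoring 5.0.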

40 Translating to MapReduce: generating user vectors

41 Translating to MapReduce: calculating co-occurrence
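The first two jobs (user vectors, then co-occurrence) can be simulated in memory to show the key/value flow. This is a sketch under stated assumptions: `run_mapreduce` is a toy stand-in for the Hadoop map, shuffle, and reduce phases, the "user,item,pref" input format is assumed, and the sample lines are hypothetical.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy driver: map every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Job 1: "user,item,pref" lines -> one preference vector per user.
def map_line(line):
    user, item, pref = line.split(",")
    yield int(user), (int(item), float(pref))

def reduce_user_vector(user, item_prefs):
    yield user, dict(item_prefs)

# Job 2: user vectors -> one co-occurrence row per item.
def map_cooccurrence(record):
    _, vector = record
    for a in vector:
        for b in vector:
            yield a, b  # includes a == b, which fills the diagonal

def reduce_cooccurrence(item, co_items):
    row = defaultdict(int)
    for other in co_items:
        row[other] += 1
    yield item, dict(row)

# Hypothetical input lines.
lines = ["1,101,5.0", "1,102,3.0", "2,101,2.0", "2,102,2.5", "2,103,5.0"]
user_vectors = run_mapreduce(lines, map_line, reduce_user_vector)
rows = dict(run_mapreduce(user_vectors, map_cooccurrence, reduce_cooccurrence))
```

The output of the first job is the input of the second, which mirrors how the jobs are chained on a cluster. Here `rows[101]` counts two users for items 101 and 102 and one user for item 103.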

42 Translating to MapReduce: matrix multiplication

43 Translating to MapReduce: partial products

44 Translating to MapReduce: partial product II
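The partial-product formulation can be illustrated as follows: instead of taking one dot product per matrix row, each (user, item) preference scales the co-occurrence column of that item, and the scaled columns are summed per user. This is a minimal in-memory sketch with hypothetical names and data, not the Mahout job itself; because the co-occurrence matrix is symmetric, rows are used as columns.

```python
from collections import defaultdict

def map_partial_products(cooccurrence, user_vectors):
    """Emit (user, preference * co-occurrence column) for each preference."""
    for user, vector in user_vectors.items():
        for item, pref in vector.items():
            column = cooccurrence.get(item, {})
            yield user, {other: count * pref for other, count in column.items()}

def reduce_recommendations(partials, user_vectors):
    """Sum the partial vectors per user and drop items the user already has."""
    totals = defaultdict(lambda: defaultdict(float))
    for user, partial in partials:
        for item, score in partial.items():
            totals[user][item] += score
    return {
        user: {i: s for i, s in scores.items() if i not in user_vectors[user]}
        for user, scores in totals.items()
    }

# Hypothetical co-occurrence matrix (item -> {item: count}) and user vector.
cooc = {
    101: {101: 2, 102: 2, 103: 1},
    102: {101: 2, 102: 2, 103: 1},
    103: {101: 1, 102: 1, 103: 1},
}
vectors = {1: {101: 5.0, 102: 3.0}}
result = reduce_recommendations(map_partial_products(cooc, vectors), vectors)
```

Summing the two partial vectors gives user 1 a score of 1 * 5.0 + 1 * 3.0 = 8.0 for item 103, the same value a row-by-vector dot product would produce.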

45 Running Recommender on MapReduce and HDFS

46 Questions?


Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook Part 2:Mining using MapReduce Mining algorithms using MapReduce

More information

Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

More information

15.564 Information Technology I. Business Intelligence

15.564 Information Technology I. Business Intelligence 15.564 Information Technology I Business Intelligence Outline Operational vs. Decision Support Systems What is Data Mining? Overview of Data Mining Techniques Overview of Data Mining Process Data Warehouses

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Spontaneous Code Recommendation based on Open Source Code Repository

Spontaneous Code Recommendation based on Open Source Code Repository Spontaneous Code Recommendation based on Open Source Code Repository Hidehiko Masuhara masuhara@acm.org Tokyo Tech joint work with Takuya Watanabe, Naoya Murakami, Tomoyuki Aotani Do you program with Google?

More information

E6893 Big Data Analytics: Yelp Fake Review Detection

E6893 Big Data Analytics: Yelp Fake Review Detection E6893 Big Data Analytics: Yelp Fake Review Detection Mo Zhou, Chen Wen, Dhruv Kuchhal, Duo Chen Columbia University in the City of New York December 11th, 2014 Overview 1 Problem Summary 2 Technical Approach

More information

They can be obtained in HQJHQH format directly from the home page at: http://www.engene.cnb.uam.es/downloads/kobayashi.dat

They can be obtained in HQJHQH format directly from the home page at: http://www.engene.cnb.uam.es/downloads/kobayashi.dat HQJHQH70 *XLGHG7RXU This document contains a Guided Tour through the HQJHQH platform and it was created for training purposes with respect to the system options and analysis possibilities. It is not intended

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework Jakrarin Therdphapiyanak Dept. of Computer Engineering Chulalongkorn University

More information

Recommender Systems for Large-scale E-Commerce: Scalable Neighborhood Formation Using Clustering

Recommender Systems for Large-scale E-Commerce: Scalable Neighborhood Formation Using Clustering Recommender Systems for Large-scale E-Commerce: Scalable Neighborhood Formation Using Clustering Badrul M Sarwar,GeorgeKarypis, Joseph Konstan, and John Riedl {sarwar, karypis, konstan, riedl}@csumnedu

More information

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center 1 Outline Part I - Applications Motivation and Introduction Patient similarity application Part II

More information

W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set

W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer

More information

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008 Ensemble Learning Better Predictions Through Diversity Todd Holloway ETech 2008 Outline Building a classifier (a tutorial example) Neighbor method Major ideas and challenges in classification Ensembles

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

L1: Introduction to Hadoop

L1: Introduction to Hadoop L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General

More information

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Is a Data Scientist the New Quant? Stuart Kozola MathWorks Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by

More information

RECOMMENDATION METHOD ON HADOOP AND MAPREDUCE FOR BIG DATA APPLICATIONS

RECOMMENDATION METHOD ON HADOOP AND MAPREDUCE FOR BIG DATA APPLICATIONS RECOMMENDATION METHOD ON HADOOP AND MAPREDUCE FOR BIG DATA APPLICATIONS T.M.S.MEKALARANI #1, M.KALAIVANI *2 # ME, Computer Science and Engineering, Dhanalakshmi College of Engineering, Tambaram, India.

More information

A Web Recommender System for Recommending, Predicting and Personalizing Music Playlists

A Web Recommender System for Recommending, Predicting and Personalizing Music Playlists A Web Recommender System for Recommending, Predicting and Personalizing Music Playlists Zeina Chedrawy 1, Syed Sibte Raza Abidi 1 1 Faculty of Computer Science, Dalhousie University, Halifax, Canada {chedrawy,

More information

Performance evaluation of Web Information Retrieval Systems and its application to e-business

Performance evaluation of Web Information Retrieval Systems and its application to e-business Performance evaluation of Web Information Retrieval Systems and its application to e-business Fidel Cacheda, Angel Viña Departament of Information and Comunications Technologies Facultad de Informática,

More information

CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka

CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka CLASSIFICATION AND CLUSTERING Anveshi Charuvaka Learning from Data Classification Regression Clustering Anomaly Detection Contrast Set Mining Classification: Definition Given a collection of records (training

More information

Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights

Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights Seventh IEEE International Conference on Data Mining Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights Robert M. Bell and Yehuda Koren AT&T Labs Research 180 Park

More information

E6893 Big Data Analytics Lecture 4: Big Data Analytics Algorithms -- I

E6893 Big Data Analytics Lecture 4: Big Data Analytics Algorithms -- I E6893 Big Data Analytics Lecture 4: Big Data Analytics Algorithms -- I Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information