E6893 Big Data Analytics Lecture 5: Big Data Analytics Algorithms II


1 E6893 Big Data Analytics Lecture 5: Big Data Analytics Algorithms II. Ching-Yung Lin, Ph.D., Adjunct Professor, Dept. of Electrical Engineering and Computer Science; Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center. October 2nd, 2014
2 Course Structure
Class Date          Number  Topics Covered
09/04/14            1       Introduction to Big Data Analytics
09/11/14            2       Big Data Analytics Platforms
09/18/14            3       Big Data Storage and Processing
09/25/14            4       Big Data Analytics Algorithms I
10/02/14            5       Big Data Analytics Algorithms II (recommendation)
10/09/14            6       Big Data Analytics Algorithms III (clustering)
10/16/14            7       Big Data Analytics Algorithms IV (classification)
10/23/14            8       Linked Big Data Graph Computing
10/30/14            9       Big Data Visualization
11/06/14            10      Mobile Data Collection, Analysis, and Interface
11/13/14            11      Hardware, Processors, and Cluster Platforms
11/20/14            12      Big Data Next Challenges: IoT, Cognition, and Beyond
11/27/14                    Thanksgiving Holiday
12/04/14            13      Final Projects Discussion (Optional)
12/11/14 & 12/12/           Two-Day Big Data Analytics Workshop: Final Project Presentations
3 Review: Key Components of Mahout
4 Mahout reference book
5 Setting Up Mahout. Step 1: Java JVM and IDEs (e.g., Eclipse). Step 2: Maven. Step 3: Mahout. Eclipse Luna (June 2014)
6 Recommender Inputs. Solid lines: positively related. Dashed lines: negatively related. Input data: User, Item, Rating
7 User-based Recommendation Scenario I (gettofail.com)
8 User-based Recommendation Scenario II
9 User-based Recommendation Scenario III
10 User-based Recommendation Algorithms
11 Example Recommender Code via Mahout
12 Process and output of the example. Recommendation for Person 1: Item 104 > Item 106. Item 107 is not favored.
13 Refresh (Reload) Data
14 Update data
15 User Similarity Measurements: Pearson Correlation Similarity; Euclidean Distance Similarity; Cosine Measure Similarity; Spearman Correlation Similarity; Tanimoto Coefficient Similarity (Jaccard coefficient); Log-Likelihood Similarity
16 Pearson Correlation Similarity. Data: missing data
17 On Pearson Similarity. Three problems with the Pearson similarity: 1. It does not take into account the number of items on which two users' preferences overlap (e.g., 2 overlapping items ==> 1; more items may not be better). 2. If two users overlap on only one item, no correlation can be computed. 3. The correlation is undefined if all the preference values in either series are identical. Adding Weighting.WEIGHTED as the 2nd parameter of the constructor causes the resulting correlation to be pushed towards 1.0 or -1.0, depending on how many points are used.
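The three problems can be seen in a small sketch of Pearson correlation restricted to co-rated items. This is an illustrative Python version, not Mahout's implementation; the function name, the dict-based preference format, and returning None for the undefined cases are assumptions made for this example.

```python
import math

def pearson_similarity(prefs_a, prefs_b):
    """Pearson correlation over co-rated items; None when it is undefined.

    prefs_a and prefs_b map item IDs to ratings (hypothetical format).
    """
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return None  # problem 2: one overlapping item gives no correlation
    xs = [prefs_a[i] for i in common]
    ys = [prefs_b[i] for i in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    var_x = sum(d * d for d in dx)
    var_y = sum(d * d for d in dy)
    if var_x == 0 or var_y == 0:
        return None  # problem 3: a constant series has zero variance
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(var_x * var_y)
```

With only two overlapping items whose ratings move in the same direction, the result is 1 regardless of how little evidence there is, which is problem 1.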
18 Euclidean Distance Similarity. Similarity = 1 / (1 + d), where d is the Euclidean distance between the two users' preference values.
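A minimal sketch of the formula on this slide, assuming preferences are stored as item-to-rating dicts and the distance is taken over co-rated items only. Returning 0.0 when there is no overlap is a choice made for this example, not something the slide states.

```python
import math

def euclidean_similarity(prefs_a, prefs_b):
    """Similarity = 1 / (1 + d), with d the distance over co-rated items."""
    common = set(prefs_a) & set(prefs_b)
    if not common:
        return 0.0  # assumption: no shared items means no evidence of similarity
    d = math.sqrt(sum((prefs_a[i] - prefs_b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + d)
```

Identical ratings give d = 0 and a similarity of 1; the similarity falls towards 0 as the distance grows.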
19 Cosine Similarity. Cosine similarity and Pearson similarity give the same result if the data are normalized (mean == 0).
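The equivalence can be checked numerically. The sketch below computes Pearson correlation from its sum-based formula and compares it with the cosine of the mean-centered vectors; the function names and sample values are made up for illustration.

```python
import math

def cosine(xs, ys):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum(x * x for x in xs))
    norm_y = math.sqrt(sum(y * y for y in ys))
    return dot / (norm_x * norm_y)

def center(xs):
    """Subtract the mean so that the vector has mean 0."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]

def pearson(xs, ys):
    """Pearson correlation computed directly from sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

xs = [5.0, 3.0, 2.5]
ys = [2.0, 2.5, 5.0]
raw_cosine = cosine(xs, ys)                       # positive on raw ratings
centered_cosine = cosine(center(xs), center(ys))  # matches pearson(xs, ys)
```

On raw ratings the two measures can disagree even in sign, because all ratings are positive and the raw vectors point in broadly the same direction.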
20 Spearman Correlation Similarity. The Pearson value computed on the relative ranks. Example for ties.
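As a rough sketch, Spearman correlation replaces each rating by its rank and then applies Pearson to the ranks. Giving tied values the average of their ranks is a common convention and is an assumption here; the slide's own tie handling is not preserved in this transcript.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def spearman(xs, ys):
    """Pearson correlation of the rank-transformed series."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Because only the ordering matters, an outlier rating such as 1000 does not change the result as long as the ranks agree.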
21 Caching User Similarity. Spearman Correlation Similarity is time-consuming to compute, so caching is needed: remember user-user similarities that were previously computed.
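The caching idea can be sketched as a thin wrapper that stores each user-user result the first time it is computed. The class below is a hypothetical illustration, not Mahout's caching class; it assumes the wrapped similarity is symmetric, so (a, b) and (b, a) share one cache entry.

```python
class CachingSimilarity:
    """Memoize a symmetric user-user similarity function."""

    def __init__(self, similarity_fn):
        self._fn = similarity_fn
        self._cache = {}
        self.misses = 0  # number of times the wrapped function was called

    def __call__(self, user_a, user_b):
        # Order the pair so that (a, b) and (b, a) map to the same key.
        key = (user_a, user_b) if user_a <= user_b else (user_b, user_a)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._fn(*key)
        return self._cache[key]

    def clear(self):
        """Drop cached values, e.g. after the underlying data is refreshed."""
        self._cache.clear()
```

The cache has to be cleared whenever the preference data changes, otherwise stale similarities would be returned.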
22 Tanimoto (Jaccard) Coefficient Similarity. Discards preference values. Tanimoto similarity is the same as Jaccard similarity, but Tanimoto distance is not the same as Jaccard distance.
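Since preference values are discarded, each user reduces to a set of items, and the coefficient is the size of the intersection divided by the size of the union. A minimal sketch, with the 0.0 result for two empty sets being an assumption of this example:

```python
def tanimoto_similarity(items_a, items_b):
    """|A ∩ B| / |A ∪ B| for two sets of item IDs."""
    union = items_a | items_b
    if not union:
        return 0.0  # assumption: two users with no items are not similar
    return len(items_a & items_b) / len(union)
```

For example, users with items {101, 102, 103} and {101, 103, 104, 105} share 2 of 5 distinct items, giving 0.4.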
23 Log-Likelihood Similarity. Assesses how unlikely it is that the overlap between the two users is just due to chance.
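One way to make "unlikely to be due to chance" concrete is the log-likelihood ratio (G statistic) of the 2x2 table counting items both users have, items only one has, and items neither has. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and squashing the ratio into [0, 1) with 1 - 1/(1 + LLR) is a choice made for this example.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G statistic of a 2x2 contingency table; 0 when rows and columns are independent."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = ((k11, row1, col1), (k12, row1, col2),
             (k21, row2, col1), (k22, row2, col2))
    total = 0.0
    for k, row, col in cells:
        if k > 0:  # empty cells contribute nothing
            total += k * math.log(k * n / (row * col))
    return max(0.0, 2.0 * total)

def llr_similarity(items_a, items_b, num_items):
    """Map the ratio to [0, 1); num_items is the catalog size (an assumed input)."""
    both = len(items_a & items_b)
    only_a = len(items_a - items_b)
    only_b = len(items_b - items_a)
    neither = num_items - both - only_a - only_b
    llr = log_likelihood_ratio(both, only_a, only_b, neither)
    return 1.0 - 1.0 / (1.0 + llr)
```

An overlap that is exactly what independence would predict scores 0, however large the overlap is in absolute terms.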
24 Performance measurements. Using GroupLens data (http://grouplens.org): the 10-million-rating MovieLens dataset. Spearman: 0.8; Tanimoto: 0.82; Log-Likelihood: 0.73; Euclidean: 0.75; Pearson (weighted): 0.77; Pearson:
25 Performance measurements 10 nearest neighbors: nearest neighbors: nearest neighbors: % of training; 5% of testing
26 Selecting the number of neighbors. Based on a number of neighbors. Based on a fixed threshold, e.g., 0.7 or
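The two selection strategies can be sketched as follows, assuming the similarities of a target user to every other user are already available in a dict. The function names and sample values are hypothetical.

```python
def nearest_n(similarities, n):
    """Keep the n most similar users, most similar first."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]

def above_threshold(similarities, threshold):
    """Keep every user whose similarity reaches the threshold."""
    return sorted(user for user, sim in similarities.items() if sim >= threshold)

sims = {"u2": 0.9, "u3": 0.75, "u4": 0.4, "u5": 0.1}
top_two = nearest_n(sims, 2)            # always returns 2 users if available
similar = above_threshold(sims, 0.7)    # may return any number, including none
```

A fixed count always yields a neighborhood, even a poor one, while a threshold guarantees quality but may leave some users with no neighbors at all.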
27 Item-based recommendation
28 Item-based recommendation algorithm
29 Code and Performance of Item-Based Recommendation
30 Slope-One Recommender
31 Slope-One Algorithm. Difference values from the example. Slope-One got a result of near 0.65 on the GroupLens data.
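The difference values can be illustrated with a small weighted Slope-One sketch: for every pair of items, average the rating difference over users who rated both, then predict a missing rating from the user's known ratings plus those differences. The data and function names below are made up for this example and are not the slide's data.

```python
from collections import defaultdict

def build_diffs(ratings):
    """Average rating difference (i minus j) and co-rating count per item pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prefs in ratings.values():
        for i, ri in prefs.items():
            for j, rj in prefs.items():
                if i != j:
                    sums[(i, j)] += ri - rj
                    counts[(i, j)] += 1
    diffs = {pair: sums[pair] / counts[pair] for pair in sums}
    return diffs, dict(counts)

def predict(user_prefs, item, diffs, counts):
    """Weighted Slope-One estimate of the user's rating for item, or None."""
    numerator = 0.0
    denominator = 0
    for j, rj in user_prefs.items():
        if (item, j) in diffs:
            weight = counts[(item, j)]  # pairs seen more often count more
            numerator += (rj + diffs[(item, j)]) * weight
            denominator += weight
    return numerator / denominator if denominator else None

ratings = {
    "john": {"A": 5.0, "B": 3.0, "C": 2.0},
    "mark": {"A": 3.0, "B": 4.0},
    "lucy": {"B": 2.0, "C": 5.0},
}
diffs, counts = build_diffs(ratings)
lucy_a = predict(ratings["lucy"], "A", diffs, counts)
```

Here A is rated 0.5 higher than B on average (two users) and 3 higher than C (one user), so the estimate for lucy is ((2 + 0.5) * 2 + (5 + 3) * 1) / 3, about 4.33.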
32 Other recommenders. SVD recommender parameters: number of features; number of training steps; lambda (factor for regularization). The SVD method got 0.69 on the GroupLens data.
33 Linear Interpolation Item-based recommender. SVD method got 0.76 on the GroupLens data
34 Cluster-based Recommendation
35 Other Recommenders not in Mahout: Groups (SDM 06). A third-party knowledge repository: 30K users and 20K documents. Study of the most active 697 users, who had at least 20 downloads in a year. Results beyond Collaborative Filtering: (1) Collaborative + Content Filtering (53% improvement); (2) CBDR: Collaborative + Content Filtering + Graph Community Analytics (259% accuracy improvement over collaborative filtering).
36 Other Recommenders not in Mahout: Info Flow (SIGIR 06). IF: Graphical Information Flow Model. TIF: Joint Topic Detection + Information Flow Model. Tests: 1 month, 586 new docs, 1,170 users. Compared to Collaborative Filtering (CF) + Similar People (SP): Precision: IF is 91% better, TIF is 108% better. Recall: IF is 87% better, TIF is 113% better. [Figure: network information flow across adopter categories (innovators, early adopters, early majority, late majority, laggards) and people with similar tastes; plots of CF + SP, IF, and TIF against the number of recommended users.]
37 Distributed Item-based Recommender
38 Distributed recommender: get the co-occurrence matrix
39 Multiply the co-occurrence matrix with the user preference vector. The highest score is for item 103 (101, 104, 105, and 107 have already been purchased by user 3).
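A single-machine sketch of this step, using a tiny made-up dataset rather than the slide's data: count how often each pair of items appears together in a user's history, multiply that matrix by one user's preference vector, and drop the items the user already has. Leaving the diagonal (an item with itself) out of the counts is a choice made for this example.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence(prefs):
    """matrix[i][j] = number of users who have both item i and item j (i != j)."""
    matrix = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for i, j in permutations(items, 2):
            matrix[i][j] += 1
    return matrix

def recommend(user, prefs, matrix):
    """Score unseen items as (co-occurrence row) dot (user preferences)."""
    seen = prefs[user]
    all_items = {item for items in prefs.values() for item in items}
    scores = {}
    for candidate in all_items - set(seen):
        scores[candidate] = sum(matrix[candidate][j] * p for j, p in seen.items())
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

prefs = {
    "u1": {"A": 5.0, "B": 3.0},
    "u2": {"A": 4.0, "B": 2.0, "C": 5.0},
    "u3": {"B": 4.0, "C": 1.0},
}
ranked = recommend("u1", prefs, cooccurrence(prefs))
```

For u1, item C co-occurs once with A and twice with B, so its score is 1 * 5 + 2 * 3 = 11.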
40 Translating to MapReduce: generating user vectors
41 Translating to MapReduce: calculating co-occurrence
42 Translating to MapReduce: matrix multiplication
43 Translating to MapReduce: partial products
44 Translating to MapReduce: partial products II
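The partial-product formulation can be imitated in plain Python to show why it fits MapReduce: each map call needs only one column of the co-occurrence matrix and one preference value, and the reduce step simply adds vectors. This is a local simulation with hypothetical function names and data, not Hadoop or Mahout code.

```python
from collections import defaultdict

def map_partial_products(user, user_vector, columns):
    """Emit (user, preference * co-occurrence column) for each item the user has."""
    for item, pref in user_vector.items():
        column = columns.get(item, {})
        yield user, {other: count * pref for other, count in column.items()}

def reduce_sum(partials):
    """Add up the partial product vectors emitted for one user."""
    totals = defaultdict(float)
    for vector in partials:
        for item, value in vector.items():
            totals[item] += value
    return dict(totals)

def run(user_vectors, columns):
    """Simulate map, shuffle (group by user), and reduce on one machine."""
    grouped = defaultdict(list)
    for user, vector in user_vectors.items():
        for key, partial in map_partial_products(user, vector, columns):
            grouped[key].append(partial)
    return {user: reduce_sum(parts) for user, parts in grouped.items()}

columns = {
    "A": {"B": 2, "C": 1},
    "B": {"A": 2, "C": 2},
    "C": {"A": 1, "B": 2},
}
scores = run({"u1": {"A": 5.0, "B": 3.0}}, columns)
```

The summed vector equals the matrix-vector product, so items the user already has still need to be filtered out before recommending.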
45 Running Recommender on MapReduce and HDFS
46 Questions?