Naïve Bayes and Hadoop. Shannon Quinn
|
|
|
- Randolph Wood
- 10 years ago
- Views:
Transcription
1 Naïve Bayes and Hadoop Shannon Quinn
2
3 Coupled Temporal Scoping of Relational Facts. P.P. Talukdar, D.T. Wijaya and T.M. Mitchell. In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2012 Understanding Semantic Change of Words Over Centuries. D.T. Wijaya and R. Yeniterzi. In Workshop on Detecting and Exploiting Cultural Diversity on the Social Web (DETECT), 2011 at CIKM 2011
4
5
6 Some of A Joint Distribution A B C D E p is the effect of the is the effect of a The effect of this to this effect : be the effect of the not the effect of any does not affect the general does not affect the question any manner affect the principle
7 Flashback Size of table: 2 15 =32768 (if all binary) è avg of 1.5 examples per row Actual m = 1,974,927, 360 (if continuous attributes binarized) P( E) = rows matching E P(row) Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset. [Kohavi, 1996] Number of Instances: 48,842 Number of Attributes: 14 (in UCI s copy of dataset) + 1; 3 (here)
8 Naïve Density Estimation The problem with the Joint Estimator is that it just mirrors the training data (and is *hard* to compute). We need something which generalizes more usefully. Copyright Andrew W. Moore The naïve model generalizes strongly: Assume that each attribute is distributed independently of any of the other attributes.
9 Using the Naïve Distribution Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)? Copyright Andrew W. Moore
10 Using the Naïve Distribution Once you have a Naïve Distribution you can easily compute any row of the joint distribution. Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)? P(A) P(~B) P(C) P(~D) Copyright Andrew W. Moore
11 Naïve Distribution General Case Suppose X 1,X 2,,X d are independently distributed. Pr( X = x1,..., X d = xd ) = Pr( X1 = x1 )... Pr( X d = 1 d x ) So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand. How do we learn this? Copyright Andrew W. Moore
12 Learning a Naïve Density Estimator #records with X i = xi P( X i = xi ) = #records MLE P( X i = x i ) = #records with X i = x #records + m i + mq Dirichlet (MAP) Another trivial learning algorithm! Copyright Andrew W. Moore
13 Independently Distributed Data Review: A and B are independent if Pr(A,B)=Pr(A)Pr(B) Sometimes written: A and B are conditionally independent given C if Pr(A,B C)=Pr(A C)*Pr(B C) Written A B A B C Copyright Andrew W. Moore
14 Bayes Classifiers If we can do inference over Pr(X,Y) in particular compute Pr(X Y) and Pr(Y). We can compute Pr( Y X,..., X ) = 1 d Pr( X1,..., X d Y ) Pr( Y ) Pr( X,..., X ) 1 d
15 Can we make this interesting? Yes! Key ideas: Pick the class variable Y Instead of estimating P(X 1,,X n,y) = P(X 1 )* *P(X n )*Y, estimate P(X 1,,X n Y) = P(X 1 Y)* *P(X n Y) Or, assume P(X i Y)=Pr(X i X 1,,X i-1,x i+1, X n,y) Or, that X i is conditionally independent of every X j, j!=i, given Y. How to estimate? MLE
16 The Naïve Bayes classifier v1 Dataset: each example has A unique id id Why? For debugging the feature extractor d attributes X 1,,X d Each X i takes a discrete value in dom(x i ) One class label Y in dom(y) You have a train dataset and a test dataset
17 The Naïve Bayes classifier v1 You have a train dataset and a test dataset Initialize an event counter (hashtable) C For each example id, y, x 1,.,x d in train: C( Y=ANY ) ++; C( Y=y ) ++ For j in 1..d: C( Y=y ^ X j =x j ) ++ For each example id, y, x 1,.,x d in test: For each y in dom(y): d Compute Pr(y,x 1,.,x d ) = Pr( X x Y y') j = j = Pr( Y = j= 1 d Pr( X j = x j, Y = y') = Pr( Y = y') j 1 Pr( Y y') = = Return the best y y')
18 The Naïve Bayes classifier v1 You have a train dataset and a test dataset Initialize an event counter (hashtable) C For each example id, y, x 1,.,x d in train: C( Y=ANY ) ++; C( Y=y ) ++ For j in 1..d: C( Y=y ^ X j =x j ) ++ For each example id, y, x 1,.,x d in test: For each y in dom(y): d Compute Pr(y,x 1,.,x d ) = Pr( X x Y y') j = j = Pr( Y = j= 1 d C( X j = x j Y = y') C( Y = y') = j 1 C( Y y') = = C( Y = ANY ) This will overfit, so Return the best y y')
19 The Naïve Bayes classifier v1 You have a train dataset and a test dataset Initialize an event counter (hashtable) C For each example id, y, x 1,.,x d in train: C( Y=ANY ) ++; C( Y=y ) ++ For j in 1..d: C( Y=y ^ X j =x j ) ++ For each example id, y, x 1,.,x d in test: For each y in dom(y): d Compute Pr(y,x 1,.,x d ) = Pr( X x Y y') j = j = Pr( Y = j= 1 d C( X j = x j Y = y') + mq j C( Y = y') + mq j = j C Y y m = 1 ( = ') + C( Y = ANY ) + m Return the best y where: q j = 1/ dom(x j ) q y = 1/ dom(y) m=1 This will underflow, so y')
20 The Naïve Bayes classifier v1 You have a train dataset and a test dataset Initialize an event counter (hashtable) C For each example id, y, x 1,.,x d in train: C( Y=ANY ) ++; C( Y=y ) ++ For j in 1..d: C( Y=y ^ X j =x j ) ++ For each example id, y, x 1,.,x d in test: For each y in dom(y): Compute log Pr(y,x 1,.,x d ) = C( X = log j C( Y Y Return the best y j = x j = = y') + m y') + mq j C( Y = + log C( Y = where: q j = 1/ dom(x j ) q y = 1/ dom(y) m=1 y') + mq ANY ) + m j
21 The Naïve Bayes classifier v2 For text documents, what features do you use? One common choice: X 1 = first word in the document X 2 = second word in the document X 3 = third X 4 = But: Pr(X 13 =hockey Y=sports) is probably not that different from Pr(X 11 =hockey Y=sports) so instead of treating them as different variables, treat them as different copies of the same variable
22 The Naïve Bayes classifier v1 You have a train dataset and a test dataset Initialize an event counter (hashtable) C For each example id, y, x 1,.,x d in train: C( Y=ANY ) ++; C( Y=y ) ++ For j in 1..d: C( Y=y ^ X j =x j ) ++ For each example id, y, x 1,.,x d in test: For each y in dom(y): d Compute Pr(y,x 1,.,x d ) = Pr( X x Y y') j = j = Pr( Y = j= 1 d Pr( X j = x j, Y = y') = Pr( Y = y') j 1 Pr( Y y') = = Return the best y y')
23 The Naïve Bayes classifier v2 You have a train dataset and a test dataset Initialize an event counter (hashtable) C For each example id, y, x 1,.,x d in train: C( Y=ANY ) ++; C( Y=y ) ++ For j in 1..d: C( Y=y ^ X j =x j ) ++ For each example id, y, x 1,.,x d in test: For each y in dom(y): d Compute Pr(y,x 1,.,x d ) = Pr( X x Y y') j = j = Pr( Y = j= 1 d Pr( X j = x j, Y = y') = Pr( Y = y') j 1 Pr( Y y') = = Return the best y y')
24 The Naïve Bayes classifier v2 You have a train dataset and a test dataset Initialize an event counter (hashtable) C For each example id, y, x 1,.,x d in train: C( Y=ANY ) ++; C( Y=y ) ++ For j in 1..d: C( Y=y ^ X=x j ) ++ For each example id, y, x 1,.,x d in test: For each y in dom(y): d Compute Pr(y,x 1,.,x d ) = Pr( X x Y y') = j = Pr( Y = j= 1 d Pr( X = x j, Y = y') = Pr( Y = y') j 1 Pr( Y y') = = Return the best y y')
25 The Naïve Bayes classifier v2 You have a train dataset and a test dataset Initialize an event counter (hashtable) C For each example id, y, x 1,.,x d in train: C( Y=ANY ) ++; C( Y=y ) ++ For j in 1..d: C( Y=y ^ X=x j ) ++ For each example id, y, x 1,.,x d in test: For each y in dom(y): Compute log Pr(y,x 1,.,x d ) = # = log C(X = x Y = y')+ mq & j x % $ j C(X = ANY ^Y = y')+ m ( + log C(Y = y')+ mq y ' C(Y = ANY )+ m Return the best y where: q j = 1/ V q y = 1/ dom(y) m=1
26 Complexity of Naïve Bayes You have a train dataset and a test dataset Initialize an event counter (hashtable) C For each example id, y, x 1,.,x d in train: C( Y=ANY ) ++; C( Y=y ) ++ For j in 1..d: C( Y=y ^ X=x j ) ++ For each example id, y, x 1,.,x d in test: For each y in dom(y): Compute log Pr(y,x 1,.,x d ) = # = log C(X = x Y = y')+ mq & j x % $ j C(X = ANY ^Y = y')+ m ( + log C(Y = y')+ mq y ' C(Y = ANY )+ m Sequential reads Complexity: O(n), n=size of train where: q j = 1/ V q y = 1/ dom(y) m=1 Sequential reads Return the best y Complexity: O( dom(y) *n ), n =size of test
27 MapReduce!
28 Inspiration not Plagiarism This is not the first lecture ever on Mapreduce I borrowed from William Cohen, who borrowed from Alona Fyshe, and she borrowed from: Jimmy Lin Google Cloudera
29 Introduction to MapReduce 3 main phases Map (send each input record to a key) Sort (put all of one key in the same place) Handled behind the scenes in Hadoop Reduce (operate on each key and its set of values) Terms come from functional programming: map(lambda x:x.upper(), [ shannon", p", quinn"])è [ SHANNON', P', QUINN'] reduce(lambda x,y:x+"-"+y, [ shannon", p", quinn"])è shannon-p-quinn
30 MapReduce overview Map Shuffle/Sort Reduce
31 MapReduce in slow motion Canonical example: Word Count Example corpus:
32
33
34
35
36
37
38
39
40 others: KeyValueInputFormat SequenceFileInputFormat
41
42
43 Is any part of this wasteful? Remember - moving data around and writing to/reading from disk are very expensive operations No reducer can start until: all mappers are done data in its partition has been sorted
44 Common pitfalls You have no control over the order in which reduces are performed You have no control over the order in which you encounter reduce values Sort of (more later). The only ordering you should assume is that Reducers always start after Mappers
45 Common pitfalls You should assume your Maps and Reduces will be taking place on different machines with different memory spaces Don t make a static variable and assume that other processes can read it They can t. It appear that they can when run locally, but they can t No really, don t do this.
46 Common pitfalls Do not communicate between mappers or between reducers overhead is high you don t know which mappers/reducers are actually running at any given point there s no easy way to find out what machine they re running on because you shouldn t be looking for them anyway
47 Assignment 1 Naïve Bayes on Hadoop Document classification Due Thursday, Jan 29 by 11:59:59pm
48 Other reminders If you haven t yet signed up for a student presentation, do so! I will be assigning slots starting Friday at 5pm, so sign up before then You need to be on BitBucket to do this Mailing list for questions Office hours Mondays 9-10:30am
CSCI6900 Assignment 2: Naïve Bayes on Hadoop
DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF GEORGIA CSCI6900 Assignment 2: Naïve Bayes on Hadoop DUE: Friday, September 18 by 11:59:59pm Out September 4, 2015 1 IMPORTANT NOTES You are expected to use
Data Mining with R. Decision Trees and Random Forests. Hugh Murrell
Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge
Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data
ITG Software Engineering
Introduction to Cloudera Course ID: Page 1 Last Updated 12/15/2014 Introduction to Cloudera Course : This 5 day course introduces the student to the Hadoop architecture, file system, and the Hadoop Ecosystem.
MapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) [email protected] http://www.cse.buffalo.edu/faculty/bina Partially
CSCI 5417 Information Retrieval Systems Jim Martin!
CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 9 9/20/2011 Today 9/20 Where we are MapReduce/Hadoop Probabilistic IR Language models LM for ad hoc retrieval 1 Where we are... Basics of ad
CS 6220: Data Mining Techniques Course Project Description
CS 6220: Data Mining Techniques Course Project Description College of Computer and Information Science Northeastern University Spring 2013 General Goal In this project, you will have an opportunity to
Classification On The Clouds Using MapReduce
Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal [email protected] Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal [email protected]
Analysis of MapReduce Algorithms
Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 [email protected] ABSTRACT MapReduce is a programming model
10605 BigML Assignment 4(a): Naive Bayes using Hadoop Streaming
10605 BigML Assignment 4(a): Naive Bayes using Hadoop Streaming Due: Friday, Feb. 21, 2014 23:59 EST via Autolab Late submission with 50% credit: Sunday, Feb. 23, 2014 23:59 EST via Autolab Policy on Collaboration
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: ([email protected]) TAs: Pierre-Luc Bacon ([email protected]) Ryan Lowe ([email protected])
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
Introduction to Learning & Decision Trees
Artificial Intelligence: Representation and Problem Solving 5-38 April 0, 2007 Introduction to Learning & Decision Trees Learning and Decision Trees to learning What is learning? - more than just memorizing
Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.
Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features
16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
Big Data and Scripting map/reduce in Hadoop
Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Data Mining Classification: Decision Trees
Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous
Machine Learning Capacity and Performance Analysis and R
Machine Learning and R May 3, 11 30 25 15 10 5 25 15 10 5 30 25 15 10 5 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 100 80 60 40 100 80 60 40 100 80 60 40 30 25 15 10 5 25 15 10
The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia
The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit
Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University [email protected]
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University [email protected] 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
MapReduce: Algorithm Design Patterns
Designing Algorithms for MapReduce MapReduce: Algorithm Design Patterns Need to adapt to a restricted model of computation Goals Scalability: adding machines will make the algo run faster Efficiency: resources
Bayes and Naïve Bayes. cs534-machine Learning
Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule
A Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
University of Maryland. Tuesday, February 2, 2010
Data-Intensive Information Processing Applications Session #2 Hadoop: Nuts and Bolts Jimmy Lin University of Maryland Tuesday, February 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
MapReduce and Hadoop Distributed File System V I J A Y R A O
MapReduce and Hadoop Distributed File System 1 V I J A Y R A O The Context: Big-data Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009) Google collects 270PB data in a month (2007), 20000PB
CS455 - Lab 10. Thilina Buddhika. April 6, 2015
Thilina Buddhika April 6, 2015 Agenda Course Logistics Quiz 8 Review Giga Sort - FAQ Census Data Analysis - Introduction Implementing Custom Data Types in Hadoop Course Logistics HW3-PC Component 1 (Giga
B490 Mining the Big Data. 0 Introduction
B490 Mining the Big Data 0 Introduction Qin Zhang 1-1 Data Mining What is Data Mining? A definition : Discovery of useful, possibly unexpected, patterns in data. 2-1 Data Mining What is Data Mining? A
HT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
Big Data, Fast Data, Complex Data. Jans Aasman Franz Inc
Big Data, Fast Data, Complex Data Jans Aasman Franz Inc Private, founded 1984 AI, Semantic Technology, professional services Now in Oakland Franz Inc Who We Are (1 (2 3) (4 5) (6 7) (8 9) (10 11) (12
Bayesian networks - Time-series models - Apache Spark & Scala
Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
Hadoop Performance Diagnosis By Post-execution Log Analysis
Hadoop Performance Diagnosis By Post-execution Log Analysis cs598 Project Proposal Zhijin Li, Chris Cai, Haitao Liu ([email protected],[email protected],[email protected]) University of Illinois
Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk
Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless
Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2)
Sorting revisited How did we use a binary search tree to sort an array of elements? Tree Sort Algorithm Given: An array of elements to sort 1. Build a binary search tree out of the elements 2. Traverse
Machine Learning in Spam Filtering
Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov [email protected] Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.
From GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu [email protected] MapReduce/Hadoop
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang
Graph Mining on Big Data System Presented by Hefu Chai, Rui Zhang, Jian Fang Outline * Overview * Approaches & Environment * Results * Observations * Notes * Conclusion Overview * What we have done? *
Impact of Boolean factorization as preprocessing methods for classification of Boolean data
Impact of Boolean factorization as preprocessing methods for classification of Boolean data Radim Belohlavek, Jan Outrata, Martin Trnecka Data Analysis and Modeling Lab (DAMOL) Dept. Computer Science,
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG
1 RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG Background 2 Hive is a data warehouse system for Hadoop that facilitates
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
Classification Problems
Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems
Data Mining Yelp Data - Predicting rating stars from review text
Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University [email protected] Chetan Naik Stony Brook University [email protected] ABSTRACT The majority
Introduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
Big Data and Scripting. (lecture, computer science, bachelor/master/phd)
Big Data and Scripting (lecture, computer science, bachelor/master/phd) Big Data and Scripting - abstract/organization abstract introduction to Big Data and involved techniques lecture (2+2) practical
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Introduction to Hadoop
Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh [email protected] October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples
Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class
Problem 1. (10 Points) James 6.1 Problem 2. (10 Points) James 6.3 Problem 3. (10 Points) James 6.5 Problem 4. (15 Points) James 6.7 Problem 5. (15 Points) James 6.10 Homework 4 Statistics W4240: Data Mining
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Learning from Data: Naive Bayes
Semester 1 http://www.anc.ed.ac.uk/ amos/lfd/ Naive Bayes Typical example: Bayesian Spam Filter. Naive means naive. Bayesian methods can be much more sophisticated. Basic assumption: conditional independence.
MapReduce for Statistical NLP/Machine Learning
MapReduce for Statistical NLP/Machine Learning Sameer Maskey Week 7, October 17, 2012 1 Topics for Today MapReduce Running MapReduce jobs in Amazon Hadoop MapReduce for Machine Learning Naïve Bayes, K-Means,
Class Imbalance Learning in Software Defect Prediction
Class Imbalance Learning in Software Defect Prediction Dr. Shuo Wang [email protected] University of Birmingham Research keywords: ensemble learning, class imbalance learning, online learning Shuo Wang
A Logistic Regression Approach to Ad Click Prediction
A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi [email protected] Satakshi Rana [email protected] Aswin Rajkumar [email protected] Sai Kaushik Ponnekanti [email protected] Vinit Parakh
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee
Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee {yhlee06, lee}@cnu.ac.kr http://networks.cnu.ac.kr/~yhlee Chungnam National University, Korea January 8, 2013 FloCon 2013 Contents Introduction
PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo.
PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo. VLDB 2009 CS 422 Decision Trees: Main Components Find Best Split Choose split
Decision Tree Learning on Very Large Data Sets
Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa
Hadoop SNS. renren.com. Saturday, December 3, 11
Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December
CloudRank-D:A Benchmark Suite for Private Cloud Systems
CloudRank-D:A Benchmark Suite for Private Cloud Systems Jing Quan Institute of Computing Technology, Chinese Academy of Sciences and University of Science and Technology of China HVC tutorial in conjunction
Data Structures and Performance for Scientific Computing with Hadoop and Dumbo
Data Structures and Performance for Scientific Computing with Hadoop and Dumbo Austin R. Benson Computer Sciences Division, UC-Berkeley ICME, Stanford University May 15, 2012 1 1 Matrix storage 2 Data
What is Big Data? Concepts, Ideas and Principles. Hitesh Dharamdasani
What is Big Data? Concepts, Ideas and Principles Hitesh Dharamdasani # whoami Security Researcher, Malware Reversing Engineer, Developer GIT > George Mason > UC Berkeley > FireEye > On Stage Building Data-driven
CS 378 Big Data Programming. Lecture 5 Summariza9on Pa:erns
CS 378 Big Data Programming Lecture 5 Summariza9on Pa:erns Review Assignment 2 Ques9ons? If you d like to use guava (Google collec9ons classes) pom.xml available for assignment 2 Includes dependency for
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Decision Trees. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University. www.cs.cmu.edu/~awm [email protected].
Decision Trees Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm [email protected] 42-268-7599 Copyright Andrew W. Moore Slide Decision Trees Decision trees
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
Lecture 10: Regression Trees
Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Developing MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
Relational Processing on MapReduce
Relational Processing on MapReduce Jerome Simeon IBM Watson Research Content obtained from many sources, notably: Jimmy Lin course on MapReduce. Our Plan Today 1. Recap: Key relational DBMS notes Key Hadoop
GraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
Spark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
Introduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
Library Services 2014-15
Library Services 2014-15 Your Guide to Solihull College Library Services 2014-15 Welcome Library Services Welcome to Solihull College from the library team! Whether you are studying full-time, part-time,
Content-Based Recommendation
Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches
Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre Pinto [email protected] @alexcpsec @MLSecProject
Defending Networks with Incomplete Information: A Machine Learning Approach Alexandre Pinto [email protected] @alexcpsec @MLSecProject Agenda Security Monitoring: We are doing it wrong Machine Learning
Tutorial for Assignment 2.0
Tutorial for Assignment 2.0 Web Science and Web Technology Summer 2012 Slides based on last years tutorials by Chris Körner, Philipp Singer 1 Review and Motivation Agenda Assignment Information Introduction
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
Bayesian Spam Filtering
Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary [email protected] http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating
Implement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
Big Data Systems CS 5965/6965 FALL 2015
Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html
Developing a MapReduce Application
TIE 12206 - Apache Hadoop Tampere University of Technology, Finland November, 2014 Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3 Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3 MapReduce
Lecture 9: Introduction to Pattern Analysis
Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
Categorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
Map-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,
