Sibyl: a system for large scale machine learning
|
|
|
- Jocelyn Strickland
- 9 years ago
- Views:
Transcription
1 Sibyl: a system for large scale machine learning Tushar Chandra, Eugene Ie, Kenneth Goldman, Tomas Lloret Llinares, Jim McFadden, Fernando Pereira, Joshua Redstone, Tal Shaked, Yoram Singer
2 Machine Learning Background Use the past to predict the future Core technology for internet-based prediction tasks Examples of problems that can be solved with machine learning: Classify as spam or not Estimate relevance of an impression in context: Search, advertising, videos, etc. Rank candidate impressions The internet adds a scaling challenge: 100s of millions of users interacting every day Good solutions require a mix of theory and systems
3 Overview of Results Built a large scale machine learning system: Used recently developed machine learning algorithm Algorithms have provable convergence & quality guarantees Solves internet scale problems with reasonable resources Flexible: various loss functions and regularizations Used numerous well known systems techniques MapReduce for scalability Multiple cores and threads per computer for efficiency GFS to store lots of data Compressed column-oriented data format for performance
4 Inference and Learning Objective: draw reliable inferences from all the evidence in our data Is this SPAM? Is this webpage porn? Will this user click on that ad? Learning: create concise representations of the data to support good inferences
5 Many, Sparse Features Many elementary features: words, etc. Complex features: combination of elementary features Most elementary features are infrequent discretization of real-valued features Most complex features don t occur at all We want algorithms that scale well with number of features that are actually present, not with the number of possible features
6 Supervised Learning Given feature-based representation Feedback through a label: Good or Bad Spam or Not-spam Relevant or Not-relevant Supervised learning task: Given training examples, find an accurate model that predicts their labels
7 Machine learning overview Training data Label Feature 1,... Feature n Label Feature 1,... Feature n Label Feature 1,... Feature n
8 Machine learning overview Training data Label Feature 1,... Feature n Label Feature 1,... Feature n Label Feature 1,... Feature n Model Feature 1 = 0.2,... Feature n = -0.5
9 Machine learning overview Training data Label Feature 1,... Feature n Label Feature 1,... Feature n Label Feature 1,... Feature n Model Feature 1 = 0.2,... Feature n = Feature 1,... Feature n
10 Machine learning overview Training data Label Feature 1,... Feature n Label Feature 1,... Feature n Label Feature 1,... Feature n Model Feature 1 = 0.2,... Feature n = Feature 1,... Feature n Predicted label
11 Machine learning overview Training data Model Label Feature 1,... Feature n Label Feature 1,... Feature n Label Feature 1,... Feature n Label Feature 1,... Feature n Feature 1 = 0.2,... Feature n = Feature 1,... Feature n Predicted label
12 Example: Spam Prediction Feedback on s: Move to Spam, Move to Inbox Lots of features: Viagra Document IP Address of sender is bad Sender s Feedback returned daily and grows with time New features appear every day
13 From s to Vectors User receives an from an unknown sender is tokenized:... Viagra Document Sudafed Document Find a young wife Document... Compressed instance: x {0, 1} n (0, 0, 1, 0, 1, 0,...,0, 0, 1, 0)
14 From s to Vectors User receives an from an unknown sender is tokenized:... Viagra Document Sudafed Document Find a young wife Document... Compressed instance: x {0, 1} n (0, 0, 1, 0, 1, 0,...,0, 0, 1, 0)
15 Prediction Models Captures importance of features Viagra Document => score +2.0 Sudafed Document => score +0.5 Sender s => score -1.0 Represented as a vector of weights w = (0, 0, 2.0, -0.1, 0.5,..., -1.0,...) Scoring the w.x = Logistic regression (used for probability predictions) Probability =
16 Prediction Models Captures importance of features Viagra Document => score +2.0 Sudafed Document => score +0.5 Sender s => score -1.0 Represented as a vector of weights w = (0, 0, 2.0, -0.1, 0.5,..., -1.0,...) Scoring the w.x = Logistic regression (used for probability predictions) Probability =
17 Parallel Boosting (Collins, Schapire, Singer 2001) Iterative algorithm, each iteration improves model Number of iterations to get within of the optimum: log(m)/ 2 Updates correlated with gradients, but not a gradient algorithm Self-tuned step size, large when instances are sparse
18 Boosting: Illustration
19 Boosting: Illustration
20 Boosting: Illustration
21 Parallel Boosting Algorithm instances features q(i) = exp(y i (w x i )) µ + j = µ j = w j += η log i:y i =1 x ij =1 i:y i = 1 x ij =1 µ + j µ j q(i) q(i)
22 Parallel Boosting Algorithm instances features mistake probability q(i) = exp(y i (w x i )) µ + j = µ j = w j += η log i:y i =1 x ij =1 i:y i = 1 x ij =1 µ + j µ j q(i) q(i)
23 Parallel Boosting Algorithm instances features positive correlation mistake probability q(i) = exp(y i (w x i )) µ + j = µ j = w j += η log i:y i =1 x ij =1 i:y i = 1 x ij =1 µ + j µ j q(i) q(i)
24 Parallel Boosting Algorithm instances features positive correlation negative correlation mistake probability q(i) = exp(y i (w x i )) µ + j = µ j = w j += η log i:y i =1 x ij =1 i:y i = 1 x ij =1 µ + j µ j q(i) q(i)
25 Parallel Boosting Algorithm instances features positive correlation negative correlation mistake probability q(i) = exp(y i (w x i )) µ + j = µ j = w j += η log i:y i =1 x ij =1 i:y i = 1 x ij =1 step size µ + j µ j q(i) q(i)
26 Properties of parallel boosting Embarrassingly parallel: 1. Computes feature correlations for each example in parallel 2. Feature are updated in parallel We need to shuffle the outputs of Step 1 for Step 2 Step size inversely proportional to number of active features per example Not total number of features Good for sparse training data Needs some form of regularization
27 Learning w/ L1 Regularization
28 Learning w/ L1 Regularization
29 Learning w/ L1 Regularization Loss + Regularization Iterations
30 Implementing Parallel Boosting Embarrassingly parallel Stateless, so robust to transient data errors Each model is consistent, sequence of models for debugging iterations to converge Data Model i MapReduce Model i+1 MapReduce
31 Some observations We typically train multiple models To explore different types of features Don t read unnecessary features To explore different levels of regularization Amortize fixed costs across similar models Computers have lots of RAM Store the model and training stats in RAM at each worker Design for multi-core Computers have lots of cores Training data is highly compressible
32 Design principle: use column-oriented data store Column for each field Benefits Learners read much less data Efficient to transform fields Data compresses better Each learner only reads relevant columns
33 Design principle: use model sets Train multiple similar models together Cost of reading training data Cost of transforming data Downsides Need more RAM Shuffle more data Benefit: amortize fixed costs across models
34 Design principle: Integerize features Each column has its own dense integer space Encode features in decreasing order of frequency Variable-length encoding of integers Benefits: Training data compression Store in-memory model and statistics as arrays rather than hash tables Compact, faster, less data to shuffle
35 Design principle: store model and stats in RAM Each worker keeps in RAM A copy of the previous model Learning statistics for its training data Boosting requires O(10 bytes) per feature Possible to handle billions of features
36 Design principle: optimize for multi-core Share model across cores MapReduce optimizations Multi-shard combiners Share training statistics across cores
37 Design principle: use combiners to limit communication + Standard Mapper Mapper with Combiner
38 Design principle: use combiners to limit communication Fewer large shards mean less shuffling, but possible stragglers when shards fail Map Shard Output to Shuffle Less shuffling Faster recovery Map Shard Input Size
39 Design principle: use combiners to limit communication Solution: Multishard Combining Multiple threads per worker Many small map shards per thread One accumulator shared across threads One supershard per worker... less shuffling Spread shards from failed workers across the remaining workers... fewer stragglers
40 Design principle: use combiners to limit communication Standard Mapper Mapper with Combiner Combiner per Map Thread Multishard Combiner
41 Compression results Data Set 1 3.2x compression (source is unsorted and has medium compression) 2.6x compression (source is sorted and has medium compression) 1.7x compression (source is sorted and has max compression) string -> int map overhead < 0.5% Data Set 2 string -> int map overhead < 0.5% 1.8x compression (default compression options)
42 Performance results Number of models in model set Cores M 4.0M 4.4M 5.4M 4.5M 1.3M 2.4M 3.0M 4.4M 3.5M 1.4M 2.2M 3.0M 3.9M 3.5M 1.2M 2.0M 2.4M 2.9M 3.3M 1.1M 1.7M 2.4M 2.1M 2.7M Measurements in features/second per core
43 Infrastructure challenges Sibyl is an HPC workload running on infrastructure designed for the web Rapidly opens lots of files GFS master overload Concurrently reads 100s of files per machine Cluster cross-sectional bandwidth overload Denial of service for co-resident processes Random accesses into large vectors Prefetch performance Page-table performance MapReduce challenges Multi-shard combiners, column-oriented format Column oriented data format creates lots of small files Outside the GFS sweet spot
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
A Logistic Regression Approach to Ad Click Prediction
A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi [email protected] Satakshi Rana [email protected] Aswin Rajkumar [email protected] Sai Kaushik Ponnekanti [email protected] Vinit Parakh
Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
Developing MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
Hadoop SNS. renren.com. Saturday, December 3, 11
Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
Introduction to Online Learning Theory
Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent
In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia
The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit
Large Scale Learning
Large Scale Learning Data hypergrowth: an example Reuters- 21578: about 10K docs (ModApte) Bekkerman et al, SIGIR 2001 RCV1: about 807K docs Bekkerman & Scholz, CIKM 2008 LinkedIn job Mtle data: about
Linear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Scalable Machine Learning - or what to do with all that Big Data infrastructure
- or what to do with all that Big Data infrastructure TU Berlin blog.mikiobraun.de Strata+Hadoop World London, 2015 1 Complex Data Analysis at Scale Click-through prediction Personalized Spam Detection
Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
MapReduce: Algorithm Design Patterns
Designing Algorithms for MapReduce MapReduce: Algorithm Design Patterns Need to adapt to a restricted model of computation Goals Scalability: adding machines will make the algo run faster Efficiency: resources
Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
Physical Data Organization
Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor
16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems
215 IEEE International Conference on Big Data (Big Data) Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems Guoxin Liu and Haiying Shen and Haoyu Wang Department of Electrical
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
Introduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
Invited Applications Paper
Invited Applications Paper - - Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK [email protected] [email protected]
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S
Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform Amir H. Payberah Swedish Institute of Computer Science [email protected] June 4, 2014 Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44 Big Data small data
Hypertable Architecture Overview
WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for
Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.
Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult
Lecture 2: The SVM classifier
Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Bringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the
Statistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
Vowpal Wabbit 7 Tutorial. Stephane Ross
Vowpal Wabbit 7 Tutorial Stephane Ross Binary Classification and Regression Input Format Data in text file (can be gziped), one example/line Label [weight] Namespace Feature... Feature Namespace... Namespace:
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Distributed R for Big Data
Distributed R for Big Data Indrajit Roy HP Vertica Development Team Abstract Distributed R simplifies large-scale analysis. It extends R. R is a single-threaded environment which limits its utility for
Analysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013
Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 * Other names and brands may be claimed as the property of others. Agenda Hadoop Intro Why run Hadoop on Lustre?
Data Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
Data Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification
Dynamical Clustering of Personalized Web Search Results
Dynamical Clustering of Personalized Web Search Results Xuehua Shen CS Dept, UIUC [email protected] Hong Cheng CS Dept, UIUC [email protected] Abstract Most current search engines present the user a ranked
MAPREDUCE Programming Model
CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce
Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014
Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about
Active Learning with Boosting for Spam Detection
Active Learning with Boosting for Spam Detection Nikhila Arkalgud Last update: March 22, 2008 Active Learning with Boosting for Spam Detection Last update: March 22, 2008 1 / 38 Outline 1 Spam Filters
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
Parquet. Columnar storage for the people
Parquet Columnar storage for the people Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter Nong Li [email protected] Software engineer, Cloudera Impala Outline Context from various
Amadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator
WHITE PAPER Amadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com SAS 9 Preferred Implementation Partner tests a single Fusion
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Machine Learning and Pattern Recognition Logistic Regression
Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,
RevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
External Sorting. Why Sort? 2-Way Sort: Requires 3 Buffers. Chapter 13
External Sorting Chapter 13 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Why Sort? A classic problem in computer science! Data requested in sorted order e.g., find students in increasing
COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction
COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised
Lecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
Healthcare data analytics. Da-Wei Wang Institute of Information Science [email protected]
Healthcare data analytics Da-Wei Wang Institute of Information Science [email protected] Outline Data Science Enabling technologies Grand goals Issues Google flu trend Privacy Conclusion Analytics
L3: Statistical Modeling with Hadoop
L3: Statistical Modeling with Hadoop Feng Li [email protected] School of Statistics and Mathematics Central University of Finance and Economics Revision: December 10, 2014 Today we are going to learn...
Quantcast Petabyte Storage at Half Price with QFS!
9-131 Quantcast Petabyte Storage at Half Price with QFS Presented by Silvius Rus, Director, Big Data Platforms September 2013 Quantcast File System (QFS) A high performance alternative to the Hadoop Distributed
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Introduction to Machine Learning Using Python. Vikram Kamath
Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression
10605 BigML Assignment 4(a): Naive Bayes using Hadoop Streaming
10605 BigML Assignment 4(a): Naive Bayes using Hadoop Streaming Due: Friday, Feb. 21, 2014 23:59 EST via Autolab Late submission with 50% credit: Sunday, Feb. 23, 2014 23:59 EST via Autolab Policy on Collaboration
Rethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
REVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research
Algorithmic Techniques for Big Data Analysis Barna Saha AT&T Lab-Research Challenges of Big Data VOLUME Large amount of data VELOCITY Needs to be analyzed quickly VARIETY Different types of structured
This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.
Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,
Object Recognition and Template Matching
Object Recognition and Template Matching Template Matching A template is a small image (sub-image) The goal is to find occurrences of this template in a larger image That is, you want to find matches of
Introduction to Hadoop
Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh [email protected] October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples
Data, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
Journée Thématique Big Data 13/03/2015
Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets
APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder
APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large
Big Data Text Mining and Visualization. Anton Heijs
Copyright 2007 by Treparel Information Solutions BV. This report nor any part of it may be copied, circulated, quoted without prior written approval from Treparel7 Treparel Information Solutions BV Delftechpark
MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context
MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able
COSHH: A Classification and Optimization based Scheduler for Heterogeneous Hadoop Systems
COSHH: A Classification and Optimization based Scheduler for Heterogeneous Hadoop Systems Aysan Rasooli a, Douglas G. Down a a Department of Computing and Software, McMaster University, L8S 4K1, Canada
JAVA IEEE 2015. 6 Privacy Policy Inference of User-Uploaded Images on Content Sharing Sites Data Mining
S.NO TITLES Domains 1 Anonymity-based Privacy-preserving Data Reporting for Participatory Sensing 2 Anonymizing Collections of Tree-Structured Data 3 Making Digital Artifacts on the Web Verifiable and
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
A survey on click modeling in web search
A survey on click modeling in web search Lianghao Li Hong Kong University of Science and Technology Outline 1 An overview of web search marketing 2 An overview of click modeling 3 A survey on click models
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Networking in the Hadoop Cluster
Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop
How To Create A Multi Disk Raid
Click on the diagram to see RAID 0 in action RAID Level 0 requires a minimum of 2 drives to implement RAID 0 implements a striped disk array, the data is broken down into blocks and each block is written
Search Engines. Stephen Shaw <[email protected]> 18th of February, 2014. Netsoc
Search Engines Stephen Shaw Netsoc 18th of February, 2014 Me M.Sc. Artificial Intelligence, University of Edinburgh Would recommend B.A. (Mod.) Computer Science, Linguistics, French,
Big Data and Scripting map/reduce in Hadoop
Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb
The Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 [email protected] ABSTRACT Many cluster owners and operators have
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
Hadoop. Bioinformatics Big Data
Hadoop Bioinformatics Big Data Paolo D Onorio De Meo Mattia D Antonio [email protected] [email protected] Big Data Too much information! Big Data Explosive data growth proliferation of data capture
ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #2610771
ENHANCEMENTS TO SQL SERVER COLUMN STORES Anuhya Mallempati #2610771 CONTENTS Abstract Introduction Column store indexes Batch mode processing Other Enhancements Conclusion ABSTRACT SQL server introduced
