Large Scale (Machine) Learning at Twitter. Aleksander Kołcz Twitter, Inc.
|
|
|
- Cameron Watkins
- 10 years ago
- Views:
Transcription
1 Large Scale (Machine) Learning at Twitter Aleksander Kołcz Twitter, Inc.
2
3
4
5 What is Twitter Micro-blogging service Open exchange of information/opinions Private communication possible via DMs Content restricted to 140 characters Seems tiny but
6 #egypt, #tunesia, #libya, #syria
7 #ocws, #occupywallstreet,
8 The Scale of Twitter Twitter has more than 140 million active users 340 million Tweets are sent per day Over 400 million monthly unique visitors to twitter.com 50 million people log into Twitter every day
9 Large scale infrastructure of information delivery Users interact via web-ui, sms, and various apps Over 55% of our active users are mobile users Real-time redistribution of content Plus search and other services living on top of it E.g., 2.3B search queries/day (26K/sec)
10 Support for user interaction Search Relevance ranking User recommendation WTF or Who To Follow Content recommendation Relevant news, media, trends ML is an important component but really just a part of a larger whole
11 Problems we are trying to solve Relevance ranking in search
12 Problems we are trying to solve Who to follow
13 Problems we are trying to solve Content recommendation (stories/media)
14 (other) problems we are trying to solve Trending topics Language detection Anti-spam Revenue optimization User interest modeling Growth optimization
15 Recommendation/Personalization Other users friends Other users opinion sources News/information sources Specific tweets Specific urls Lists/hashtags Categorization/Tagging Clustering/grouping
16 Is this BIG data?
17 Challenges Twitter is International 70% of accounts are outside the US Twitter supports more than 28 different languages Twitter is real time The now factor is very important for people interacting with Twitter The concept of relevance is highly time dependent 140 characters : documents are very short and vocabulary contains many non-standard Cool Please please please
18 What type of machine learning? At this point ML is an extensive and diverse field Algorithms differ widely in terms of implementational and operational complexity Decisions which techniques to use are driven by the nature of the problems and infrastructure/operational constraints
19 ML for social networks Power of simple models It has been found again and again that with enough data relatively simple models (e.g., Naïve Bayes or Logistic Regression) work remarkably well Ease of integration with data processing flows (both training and inference) trumps model sophistication
20 ML for social networks Power of graph aggregation Even an imperfect signal (e.g., classifier) can be useful when its effect on a graph node is taken in the context of the node s neighborhood
21 ML for social networks Challenges of scalability and adaptive processing Learning and inference need to handle data streams and combinations thereof Models need to be able to quickly adapt to data stream change
22 Are there wheels not to reinvent? Temptation is always there. Data processing infrastructure? Core ML algorithm libraries? Data pre/post processing? Visualization? Glue code?
23 Apache Mesos Analytics Ecosystem
24 Maximizing the use of Hadoop We cannot afford too many diverse computing environments Most of analytics job are run using Hadoop cluster Hence, that s where the data live It is natural to structure ML computation so that it takes advantage of the cluster and is performed close to the data
25 AVOID: janky analysis of messy data Too Much Data Down-Sample messy Run Model on all data Build model on a local box Probably lose intermediate results and data
26 Leveraging off-line tools While the data is big, a lot of useful feature components can be learned with smaller datasets The ML tool ecosystem provides a wide range of options for optimization and tuning various models Once tuned, the models can be applied to big data in a distributed fashion often not as final models but as feature extractors The key is not to rely primarily on ad-hockery in production pipelines
27 Large scale learning frameworks Participation in the open source community There are a number of initiatives for using ML over Hadoop (e.g., Mahout) Upsides Larger support and developer network Downsides Not always convenient to integrate with internal analytics and data processing flows
28 Our extensions PigML A library of UDFs wrapping the ML functionality Scalding/PyCascading Cascading with Scala/Jython ML Java lib Used in both cases to capture the low level ML functionality
29 Build/reuse/integrate ML library Big Data processing pipeline External/Open source + internal Community (Hadoop/Pig) PigML / Scalding
30 MapReduce k 1 v 1 k 2 v 2 k 3 v 3 k 4 v 4 k 5 v 5 k 6 v 6 map map map map a 1 b 2 c 3 c 6 a 5 c 2 b 7 c 8 Shuffle and Sort: aggregate values by keys a 1 5 b 2 7 c reduce reduce reduce r 1 s 1 r 2 s 2 r 3 s 3 Source: Lin and Dyer (2010)
31 java.io.ioexception; java.util.arraylist; java.util.iterator; java.util.list; org.apache.hadoop.fs.path; org.apache.hadoop.io.longwritable; org.apache.hadoop.io.text; org.apache.hadoop.io.writable; org.apache.hadoop.io.writablecomparable; org.apache.hadoop.mapred.fileinputformat; org.apache.hadoop.mapred.fileoutputformat; org.apache.hadoop.mapred.jobconf; org.apache.hadoop.mapred.keyvaluetextinputformat; org.apache.hadoop.mapred.mapper; org.apache.hadoop.mapred.mapreducebase; org.apache.hadoop.mapred.outputcollector; org.apache.hadoop.mapred.recordreader; org.apache.hadoop.mapred.reducer; org.apache.hadoop.mapred.reporter; visits org.apache.hadoop.mapred.sequencefileinputformat; gvisits = group visits by url; org.apache.hadoop.mapred.sequencefileoutputformat; org.apache.hadoop.mapred.textinputformat; org.apache.hadoop.mapred.jobcontrol.job; urlinfo org.apache.hadoop.mapred.jobcontrol.jobcontrol; org.apache.hadoop.mapred.lib.identitymapper; = load /data/visits as (user, url, time); visitcounts = foreach gvisits generate url, count(urlvisits); = load /data/urlinfo as (url, category, prank); visitcounts = join visitcounts by url, urlinfo by url; gcategories = group visitcounts by category; = foreach gcategories generate top(visitcounts,10); class MRExample { blic static class topurls LoadPages extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { public void map(longwritable k, Text val, OutputCollector<Text, store topurls into Text> /data/topurls ; oc, Reporter reporter) throws IOException { // Pull the key out String line = val.tostring(); int firstcomma = line.indexof(','); String key = line.substring(0, firstcomma); String value = line.substring(firstcomma + 1); Text outkey = new Text(key); // Prepend an index to the value so we know which file // it came from. Text outval = new Text("1" + value); if (value.charat(0) = first.add(value.substring(1)); else second.add(value reporter.setstatus("o } // Do the cross product a for (String s1 : first) { for (String s2 : seco String outval = k oc.collect(null, reporter.setstatu } } } } public static class LoadJoined ex implements Mapper<Text, Text, public void map( Text k, Text val, OutputCollector<Text, Reporter reporter) th // Find the url String line = val.tostrin int firstcomma = line.ind int secondcomma = line.in String key = line.substri // drop the rest of the r // just pass a 1 for the Text outkey = new Text(ke oc.collect(outkey, new Lo } } public static class ReduceUrls ex implements Reducer<Text, Long Writable> { public void reduce( Text key, Iterator<LongWritable OutputCollector<Writa
32 PigML Embedding ML learning/eval functionality directly in Pig Flows naturally with other data processing operations Minimal learning curve for users
33 Training a model in Pig It s just a store function! training = load trn_data' using piggybank.ml.storage() as (target: double, features: map[]); store training into 'model-lr' using piggybank.ml.train.online.lrclassifierbuilder('with Pegasos withlambda:0.1');
34 Training a model in Pig It s just a store function! training = load trn_data' using piggybank.ml.storage() as (target: double, features: map[]); store training into 'model-lr' using piggybank.ml.train.online.lrclassifierbuilder('with Pegasos withlambda:0.1');
35 Training a model in Pig It s just a store function! training = load trn_data' using piggybank.ml.storage() as (target: double, features: map[]); store training into 'model-lr' using piggybank.ml.train.online.lrclassifierbuilder('with Pegasos withlambda:0.1');
36 Applying a model in Pig DEFINE Classify piggybank.ml.classify.classifyfeatureswithlrclassifier('model-lr'); data = load 'test_data' using piggybank.ml.storage() as (target: double, features: map[]); data = foreach data generate target, Classify(features) as prediction; results = foreach data generate (target == prediction.label? 1 : 0) as matching; dump results;
37 Applying a model in Pig DEFINE Classify piggybank.ml.classify.classifyfeatureswithlrclassifier('model-lr'); data = load 'test_data' using piggybank.ml.storage() as (target: double, features: map[]); data = foreach data generate target, Classify(features) as prediction; results = foreach data generate (target == prediction.label? 1 : 0) as matching; dump results;
38 Model training UDF internals A single node in the Hadoop cluster does not have extensive memory resources The learner cannot cache too much data Natural fit for: Stochastic gradient descent (SGD), possibly with mini-batching Effective for streaming the whole dataset through a single learner
39 Supervised classification in a nutshell Given Induce D f arg x i,y n i i empirical loss = : X Y s.t. loss is minimized n 1 f (x i ), y i n 0 Consider functions of a parametric form: min n 1 ), n 0 f (x i; y i i i label (sparse) feature vector loss function model parameters Key insight: machine learning as an optimization problem! (closed form solutions generally not possible)
40 Gradient Descent w (t 1) w (t ) (t ) 1 n n i 0 l f (x i ; (t) ),y i batch learning: update model after considering all training instances Stochastic Gradient Descent (SGD) w (t 1) w (t ) (t ) l f (x; (t) ),y online learning: update model after considering each (randomly-selected) training instance In practice just as good! Solves the iteration problem! What about the single reducer problem?
41 Ensembles Classifier committees are one of the best performing types of learners Some of these algorithms are sequential (not very MR friendly) Boosting But others rely mostly on randomization Each learner is trained over a different split (features and/or instances) of the data
42 previous Pig dataflow previous Pig dataflow map Classifier Training reduce Pig storage function label, feature vector model model model Making Predictions model feature vector UDF model prediction UDF feature vector prediction
43 Ensembles: continued Ensembles of linear classifiers E.g., each trained using SGD over different subset of features Ensembles of decision trees (random forest) Each tree is seeing only a subset of the dataset and node split variables are randomized at each split
44 Further advantages of parallelism Although stream based learners look at all of the data, a number of them can be executed in parallel Effective for tuning hyper-parameters Generative models such as Naïve Bayes are naturally well suited to distributed learning Just counting
45 Example: tweet sentiment detection Training/Test data: tweets with emoticons Emoticons provide surrogate labels for training Emoticons are removed from the data Logistic Regression trained over character 4grams Single classifier vs. use of ensembles
46 Ensembles with 10m examples better than 100m single classifier! Diminishing returns for free single classifier 10m ensembles 100m ensembles
47 Iterative algorithms What if one REALLY wants to iterative over big data till convergence? Unrolling of small loops in Pig Implementing the algorithm in Cascading with Scala or Python Using Python Control Flow in Pig Offload to custom MR job (e.g., Mahout)
48
49 Example: modeling topic distribution Latent Dirichlet Allocation (LDA) Bayesian soft-clustering technique Each document can naturally belong to multiple clusters with degree-of-membership Popular in topics modeling for text data Why should we care? Topics capture user interests Knowledge of interests makes recommendation easier
50 Mahout/PigML integration Data preprocessing PigML Mahout Iterate LDA till convergence LDA inference PigML
51 LDA applications User interest modeling facilitates user matching as well as predicting engagement. It does not directly solve the more general problem of finding users who might be interested in a tweet, story, web-page, etc LDA offers a probabilistic model of user interests (e.g., based on what they tweet about).
52 Anchoring LDA Runs of LDA can lead to significantly different results from one run to the next This may hamper the usefulness/interpretability of the results Anchoring clusters based on the known tag labels regularizes the clustering procedure Also known as Labeled LDA (LLDA)
53 ML/Data Mining we contribute to Cassowary Large graph mining in Java (single box) Pig Scalding cascading with Scala PyCascading Cascading with Python Mahout
54 ML outside of Twitter What does the academic community consider important? Sentiment detection Event tracking (e.g., epidemics) Politics Locality detection Spam detection
55 Publications mentioning Google Scholar (late 2011) Query Result count Twitter spam 9,800 Web spam 52,300 spam 41,600 IM spam 30,000 SMS spam 10,500
56 Quick search Number of titles on Amazon (early 2011) Query Result count Twitter marketing 544 Twitter profits 100 Twitter money 236 Twitter rich 36 Twitter seo 21
57 Spam/spammer modeling Why spam Twitter (or other social media)? The types of spam seen Trend/search spam Follow spam Long term vs. short term Spammers favor pump and dump Swift reaction to attacks is key
58 Example: normal interactions
59 Example: spammy interactions
60 Kumman Chellapila Jimmy Lin Yue Lu Jake Mannix Gilad Mishne Ram Ravichandran Miguel Rios Andy Schlaikjer Yifan Shi +others Twitter
61 THANK YOU
web-scale data processing Christopher Olston and many others Yahoo! Research
web-scale data processing Christopher Olston and many others Yahoo! Research Motivation Projects increasingly revolve around analysis of big data sets Extracting structured data, e.g. face detection Understanding
Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com
Hadoop and Eclipse Eclipse Hawaii User s Group May 26th, 2009 Seth Ladd http://sethladd.com Goal YOU can use the same technologies as The Big Boys Google Yahoo (2000 nodes) Last.FM AOL Facebook (2.5 petabytes
Hadoop Pig. Introduction Basic. Exercise
Your Name Hadoop Pig Introduction Basic Exercise A set of files A database A single file Modern systems have to deal with far more data than was the case in the past Yahoo : over 170PB of data Facebook
Distributed and Cloud Computing
Distributed and Cloud Computing K. Hwang, G. Fox and J. Dongarra Chapter 6: Cloud Programming and Software Environments Part 1 Adapted from Kai Hwang, University of Southern California with additions from
Connecting Hadoop with Oracle Database
Connecting Hadoop with Oracle Database Sharon Stephen Senior Curriculum Developer Server Technologies Curriculum The following is intended to outline our general product direction.
MR-(Mapreduce Programming Language)
MR-(Mapreduce Programming Language) Siyang Dai Zhi Zhang Shuai Yuan Zeyang Yu Jinxiong Tan sd2694 zz2219 sy2420 zy2156 jt2649 Objective of MR MapReduce is a software framework introduced by Google, aiming
Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop
Journée Thématique Big Data 13/03/2015
Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets
Mammoth Scale Machine Learning!
Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Big Data Technology Pig: Query Language atop Map-Reduce
Big Data Technology Pig: Query Language atop Map-Reduce Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov & Ronny Lempel Roadmap Previous class MR Implementation This class
Scalable Machine Learning - or what to do with all that Big Data infrastructure
- or what to do with all that Big Data infrastructure TU Berlin blog.mikiobraun.de Strata+Hadoop World London, 2015 1 Complex Data Analysis at Scale Click-through prediction Personalized Spam Detection
Spark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
Bayesian networks - Time-series models - Apache Spark & Scala
Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify
Big Data at Spotify Anders Arpteg, Ph D Analytics Machine Learning, Spotify Quickly about me Quickly about Spotify What is all the data used for? Quickly about Spark Hadoop MR vs Spark Need for (distributed)
Recap. CSE 486/586 Distributed Systems Data Analytics. Example 1: Scientific Data. Two Questions We ll Answer. Data Analytics. Example 2: Web Data C 1
ecap Distributed Systems Data Analytics Steve Ko Computer Sciences and Engineering University at Buffalo PC enables programmers to call functions in remote processes. IDL (Interface Definition Language)
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
Moving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich
Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce The Hadoop
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Spark: Cluster Computing with Working Sets
Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs
Unified Big Data Analytics Pipeline. 连 城 [email protected]
Unified Big Data Analytics Pipeline 连 城 [email protected] What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an
Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
How Companies are! Using Spark
How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made
Map-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,
large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)
large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven
CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving
L3: Statistical Modeling with Hadoop
L3: Statistical Modeling with Hadoop Feng Li [email protected] School of Statistics and Mathematics Central University of Finance and Economics Revision: December 10, 2014 Today we are going to learn...
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com Spark Fast & Expressive Cluster computing engine Compatible with Hadoop Came
COSC 6397 Big Data Analytics. Mahout and 3 rd homework assignment. Edgar Gabriel Spring 2014. Mahout
COSC 6397 Big Data Analytics Mahout and 3 rd homework assignment Edgar Gabriel Spring 2014 Mahout Scalable machine learning library Built with MapReduce and Hadoop in mind Written in Java Focusing on three
Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: ([email protected]) TAs: Pierre-Luc Bacon ([email protected]) Ryan Lowe ([email protected])
Spark. Fast, Interactive, Language- Integrated Cluster Computing
Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC
Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic
Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the
A Logistic Regression Approach to Ad Click Prediction
A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi [email protected] Satakshi Rana [email protected] Aswin Rajkumar [email protected] Sai Kaushik Ponnekanti [email protected] Vinit Parakh
Copy the.jar file into the plugins/ subfolder of your Eclipse installation. (e.g., C:\Program Files\Eclipse\plugins)
Beijing Codelab 1 Introduction to the Hadoop Environment Spinnaker Labs, Inc. Contains materials Copyright 2007 University of Washington, licensed under the Creative Commons Attribution 3.0 License --
The Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
Introduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Fast Data in the Era of Big Data: Twitter s Real-
Fast Data in the Era of Big Data: Twitter s Real- Time Related Query Suggestion Architecture Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin Presented by: Rania Ibrahim 1 AGENDA Motivation
The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia
The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit
Step 4: Configure a new Hadoop server This perspective will add a new snap-in to your bottom pane (along with Problems and Tasks), like so:
Codelab 1 Introduction to the Hadoop Environment (version 0.17.0) Goals: 1. Set up and familiarize yourself with the Eclipse plugin 2. Run and understand a word counting program Setting up Eclipse: Step
Introduc8on to Apache Spark
Introduc8on to Apache Spark Jordan Volz, Systems Engineer @ Cloudera 1 Analyzing Data on Large Data Sets Python, R, etc. are popular tools among data scien8sts/analysts, sta8s8cians, etc. Why are these
Xiaoming Gao Hui Li Thilina Gunarathne
Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal
Introduction to Apache Pig Indexing and Search
Large-scale Information Processing, Summer 2014 Introduction to Apache Pig Indexing and Search Emmanouil Tzouridis Knowledge Mining & Assessment Includes slides from Ulf Brefeld: LSIP 2013 Organizational
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
Applied Multivariate Analysis - Big data analytics
Applied Multivariate Analysis - Big data analytics Nathalie Villa-Vialaneix [email protected] http://www.nathalievilla.org M1 in Economics and Economics and Statistics Toulouse School of
Big Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 36 Outline
Big Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
Machine Learning for Understanding User Behaviours. Semi-Supervised Learning Applied to Click Streams
Machine Learning for Understanding User Behaviours Semi-Supervised Learning Applied to Click Streams Goals Motivation for semi-supervised learning and log analytics Overall Methodology Leveraging Hadoop
Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University [email protected]
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University [email protected] 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet (@abifet)
HUAWEI Advanced Data Science with Spark Streaming Albert Bifet (@abifet) Huawei Noah s Ark Lab Focus Intelligent Mobile Devices Data Mining & Artificial Intelligence Intelligent Telecommunication Networks
OpenChorus: Building a Tool-Chest for Big Data Science
OpenChorus: Building a Tool-Chest for Big Data Science Milind Bhandarkar Chief Scientist, Machine Learning Platforms EMC Greenplum 1 Agenda! Tools for Data Science! Data Science Workflow! Greenplum OpenChorus!
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics
E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
HPC ABDS: The Case for an Integrating Apache Big Data Stack
HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox [email protected] http://www.infomall.org
Advanced In-Database Analytics
Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
MLlib: Scalable Machine Learning on Spark
MLlib: Scalable Machine Learning on Spark Xiangrui Meng Collaborators: Ameet Talwalkar, Evan Sparks, Virginia Smith, Xinghao Pan, Shivaram Venkataraman, Matei Zaharia, Rean Griffith, John Duchi, Joseph
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Advanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
Case-Based Reasoning Implementation on Hadoop and MapReduce Frameworks Done By: Soufiane Berouel Supervised By: Dr Lily Liang
Case-Based Reasoning Implementation on Hadoop and MapReduce Frameworks Done By: Soufiane Berouel Supervised By: Dr Lily Liang Independent Study Advanced Case-Based Reasoning Department of Computer Science
Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl
Big Data for the JVM developer Costin Leau, Elasticsearch @costinl Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system Data Landscape Data Trends http://www.emc.com/leadership/programs/digital-universe.htm
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
Healthcare data analytics. Da-Wei Wang Institute of Information Science [email protected]
Healthcare data analytics Da-Wei Wang Institute of Information Science [email protected] Outline Data Science Enabling technologies Grand goals Issues Google flu trend Privacy Conclusion Analytics
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany MapReduce II MapReduce II 1 / 33 Outline 1. Introduction
Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata
Up Your R Game James Taylor, Decision Management Solutions Bill Franks, Teradata Today s Speakers James Taylor Bill Franks CEO Chief Analytics Officer Decision Management Solutions Teradata 7/28/14 3 Polling
Data Mining with Hadoop at TACC
Data Mining with Hadoop at TACC Weijia Xu Data Mining & Statistics Data Mining & Statistics Group Main activities Research and Development Developing new data mining and analysis solutions for practical
Sibyl: a system for large scale machine learning
Sibyl: a system for large scale machine learning Tushar Chandra, Eugene Ie, Kenneth Goldman, Tomas Lloret Llinares, Jim McFadden, Fernando Pereira, Joshua Redstone, Tal Shaked, Yoram Singer Machine Learning
ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS
CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant
Big Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo.
Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo.com Hadoop Hadoop is an open source distributed platform for data storage
Click Stream Data Analysis Using Hadoop
Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna
W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.
Introduction p. xvii Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p. 9 State of the Practice in Analytics p. 11 BI Versus
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
Big Data Research in Berlin BBDC and Apache Flink
Big Data Research in Berlin BBDC and Apache Flink Tilmann Rabl [email protected] dima.tu-berlin.de bbdc.berlin 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2015 Agenda About Data Management,
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform Amir H. Payberah Swedish Institute of Computer Science [email protected] June 4, 2014 Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44 Big Data small data
