Large Scale (Machine) Learning at Twitter. Aleksander Kołcz Twitter, Inc.





What is Twitter? A micro-blogging service: open exchange of information/opinions; private communication possible via DMs; content restricted to 140 characters. Seems tiny, but...

#egypt, #tunesia, #libya, #syria

#ocws, #occupywallstreet,

The Scale of Twitter: Twitter has more than 140 million active users; 340 million Tweets are sent per day; over 400 million monthly unique visitors come to twitter.com; 50 million people log into Twitter every day.

Large-scale infrastructure for information delivery: users interact via web UI, SMS, and various apps, and over 55% of our active users are mobile users. Real-time redistribution of content, plus search and other services living on top of it, e.g., 2.3B search queries/day (26K/sec).

Support for user interaction: search relevance ranking; user recommendation (WTF, or Who To Follow); content recommendation (relevant news, media, trends). ML is an important component, but really just a part of a larger whole.

Problems we are trying to solve Relevance ranking in search

Problems we are trying to solve Who to follow

Problems we are trying to solve Content recommendation (stories/media)

(Other) problems we are trying to solve: trending topics, language detection, anti-spam, revenue optimization, user interest modeling, growth optimization.

Recommendation/Personalization: other users (friends, opinion sources), news/information sources, specific tweets, specific URLs, lists/hashtags. Also categorization/tagging and clustering/grouping.

Is this BIG data?

Challenges: Twitter is international (70% of accounts are outside the US; Twitter supports more than 28 different languages). Twitter is real time: the "now" factor is very important for people interacting with Twitter, and the concept of relevance is highly time-dependent. 140 characters: documents are very short and the vocabulary contains many non-standard acronyms, e.g., "@XXXX_ Cool Please please please".

What type of machine learning? At this point ML is an extensive and diverse field, and algorithms differ widely in terms of implementation and operational complexity. Decisions about which techniques to use are driven by the nature of the problems and by infrastructure/operational constraints.

ML for social networks: the power of simple models. It has been found again and again that with enough data, relatively simple models (e.g., Naïve Bayes or Logistic Regression) work remarkably well. Ease of integration with data processing flows (both training and inference) trumps model sophistication.

ML for social networks: the power of graph aggregation. Even an imperfect signal (e.g., a classifier) can be useful when its effect on a graph node is taken in the context of the node's neighborhood.
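The neighborhood-aggregation idea can be sketched as follows (a hypothetical Python illustration, not Twitter's implementation): a noisy per-node classifier score, say a spam score, is smoothed by averaging it with the scores of the node's graph neighbors.

```python
from collections import defaultdict

def neighborhood_score(scores, edges):
    """Smooth a noisy per-node classifier score by averaging each node's
    score with those of its graph neighbors (hypothetical sketch)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    smoothed = {}
    for node, s in scores.items():
        vals = [s] + [scores[n] for n in neighbors[node] if n in scores]
        smoothed[node] = sum(vals) / len(vals)
    return smoothed

# A node surrounded by high-scoring neighbors is pulled up even if its
# own score is ambiguous, and vice versa.
scores = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.1}
edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(neighborhood_score(scores, edges))
```

Here node "c" has an ambiguous score of 0.5, but its neighborhood (one high-scoring, one low-scoring neighbor) shifts the aggregate.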

ML for social networks: the challenges of scalability and adaptive processing. Learning and inference need to handle data streams and combinations thereof, and models need to be able to adapt quickly to changes in the data stream.

Are there wheels not to reinvent? The temptation is always there. Data processing infrastructure? Core ML algorithm libraries? Data pre/post-processing? Visualization? Glue code?

Apache Mesos Analytics Ecosystem

Maximizing the use of Hadoop: we cannot afford too many diverse computing environments. Most analytics jobs are run on the Hadoop cluster; hence, that's where the data live. It is natural to structure ML computation so that it takes advantage of the cluster and is performed close to the data.

AVOID: janky analysis of messy data. [Diagram: faced with too much messy data, one down-samples, builds a model on a local box, then runs the model on all the data, probably losing intermediate results and data along the way.]

Leveraging off-line tools: while the data is big, a lot of useful feature components can be learned with smaller datasets, and the ML tool ecosystem provides a wide range of options for optimizing and tuning various models. Once tuned, the models can be applied to big data in a distributed fashion, often not as final models but as feature extractors. The key is not to rely primarily on ad-hockery in production pipelines.

Large-scale learning frameworks: participation in the open source community. There are a number of initiatives for using ML over Hadoop (e.g., Mahout). Upside: a larger support and developer network. Downside: not always convenient to integrate with internal analytics and data processing flows.

Our extensions: PigML, a library of UDFs wrapping the ML functionality; Scalding/PyCascading, Cascading with Scala/Jython; and a Java ML lib, used in both cases to capture the low-level ML functionality.

Build/reuse/integrate: an ML library on top of a Big Data processing pipeline; external/open source (the Hadoop/Pig community) plus internal (PigML / Scalding).

MapReduce. [Diagram: mappers consume (key, value) input pairs and emit intermediate pairs, e.g., (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,8); the shuffle-and-sort phase aggregates values by key, e.g., a→[1,5], b→[2,7], c→[2,3,6,8]; reducers then produce the final (key, value) outputs. Source: Lin and Dyer (2010).]
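The flow in the MapReduce diagram above can be sketched in-memory (a hypothetical Python illustration of the model, not Hadoop itself): map over the records, group intermediate values by key (shuffle and sort), then reduce each group.

```python
from itertools import groupby

def map_reduce(records, mapper, reducer):
    """Minimal in-memory sketch of MapReduce:
    map -> shuffle/sort (group values by key) -> reduce."""
    mapped = [kv for rec in records for kv in mapper(rec)]
    mapped.sort(key=lambda kv: kv[0])  # shuffle and sort by key
    return [reducer(k, [v for _, v in group])
            for k, group in groupby(mapped, key=lambda kv: kv[0])]

# Word count, the canonical example: each mapper emits (word, 1),
# each reducer sums the counts for one word.
docs = ["a b b", "b c"]
out = map_reduce(docs,
                 mapper=lambda doc: [(w, 1) for w in doc.split()],
                 reducer=lambda k, vs: (k, sum(vs)))
print(out)  # [('a', 1), ('b', 3), ('c', 1)]
```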

[The slide interleaves a hand-coded Java MapReduce implementation (hundreds of lines of Mapper/Reducer boilerplate, imports, job configuration, and manual key parsing) with the equivalent Pig script:]

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);
store topUrls into '/data/topUrls';

PigML: embedding ML learning/eval functionality directly in Pig. It flows naturally with other data processing operations, with a minimal learning curve for users.

Training a model in Pig: it's just a store function!

training = load 'trn_data' using piggybank.ml.Storage() as (target: double, features: map[]);
store training into 'model-lr' using piggybank.ml.train.online.LRClassifierBuilder('withPegasos withLambda:0.1');


Applying a model in Pig:

DEFINE Classify piggybank.ml.classify.ClassifyFeaturesWithLRClassifier('model-lr');
data = load 'test_data' using piggybank.ml.Storage() as (target: double, features: map[]);
data = foreach data generate target, Classify(features) as prediction;
results = foreach data generate (target == prediction.label ? 1 : 0) as matching;
dump results;


Model training UDF internals: a single node in the Hadoop cluster does not have extensive memory resources, so the learner cannot cache too much data. A natural fit: stochastic gradient descent (SGD), possibly with mini-batching, which is effective for streaming the whole dataset through a single learner.

Supervised classification in a nutshell. Given $D = \{(x_i, y_i)\}_{i=0}^{n}$, where $x_i$ is a (sparse) feature vector and $y_i$ its label, induce $f: X \to Y$ such that the empirical loss $\frac{1}{n}\sum_{i=0}^{n} \ell(f(x_i), y_i)$ is minimized. Consider functions of a parametric form: $\arg\min_{\theta} \frac{1}{n}\sum_{i=0}^{n} \ell(f(x_i; \theta), y_i)$, where $\ell$ is the loss function and $\theta$ the model parameters. Key insight: machine learning as an optimization problem! (Closed-form solutions are generally not possible.)

Gradient Descent: $w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n}\sum_{i=0}^{n} \nabla\ell(f(x_i; w^{(t)}), y_i)$. Batch learning: update the model after considering all training instances. Stochastic Gradient Descent (SGD): $w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla\ell(f(x; w^{(t)}), y)$. Online learning: update the model after considering each (randomly-selected) training instance. In practice just as good! Solves the iteration problem! What about the single reducer problem?
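The SGD update can be sketched for logistic regression over sparse feature maps (a hypothetical Python illustration; in the deck the actual learner lives inside a Pig UDF): one pass over randomly shuffled instances, updating the weights after each example.

```python
import math
import random

def sgd_logistic(data, epochs=5, gamma=0.1, seed=0):
    """SGD for logistic regression on sparse feature dicts: update the
    weight vector after each (randomly selected) training instance,
    as in the online update on the slide. Hypothetical minimal sketch."""
    rng = random.Random(seed)
    w = {}
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:                       # y in {0, 1}
            z = sum(w.get(f, 0.0) * v for f, v in x.items())
            p = 1.0 / (1.0 + math.exp(-z))      # predicted probability
            g = p - y                           # gradient of log loss wrt z
            for f, v in x.items():              # sparse update
                w[f] = w.get(f, 0.0) - gamma * g * v
    return w

data = [({"bias": 1.0, "f1": 1.0}, 1),
        ({"bias": 1.0, "f2": 1.0}, 0)]
w = sgd_logistic(data)
```

Only features present in an instance are touched, which is what makes the memory footprint small enough for a single cluster node.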

Ensembles: classifier committees are among the best-performing types of learners. Some of these algorithms are sequential and not very MR-friendly (e.g., boosting), but others rely mostly on randomization: each learner is trained over a different split (features and/or instances) of the data.

[Diagram: classifier training appended to a previous Pig dataflow. In the map phase, (label, feature vector) pairs are emitted; in the reduce phase, models are trained and written out by a Pig storage function. Making predictions: a UDF applies a model to each feature vector to produce a prediction.]

Ensembles, continued. Ensembles of linear classifiers: e.g., each trained using SGD over a different subset of the features. Ensembles of decision trees (random forests): each tree sees only a subset of the dataset, and node-split variables are randomized at each split.
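The feature-subset ensemble idea can be sketched as follows (a hypothetical illustration using a simple perceptron as the base learner, rather than the SGD-trained logistic regression of the previous slides): each member trains on a different subset of the features, and predictions are combined by majority vote.

```python
def perceptron(data, feats, epochs=10):
    """Train a simple perceptron restricted to a given feature subset
    (labels y in {-1, +1}). Hypothetical minimal base learner."""
    w = {f: 0.0 for f in feats}
    for _ in range(epochs):
        for x, y in data:
            score = sum(w[f] * x.get(f, 0.0) for f in feats)
            if y * score <= 0:                  # mistake-driven update
                for f in feats:
                    w[f] += y * x.get(f, 0.0)
    return w

def ensemble_predict(models, x):
    """Majority vote over the members' sign predictions."""
    votes = sum(1 if sum(w[f] * x.get(f, 0.0) for f in w) > 0 else -1
                for w in models)
    return 1 if votes > 0 else -1

data = [({"f1": 1.0}, 1), ({"f2": 1.0}, -1), ({"f1": 1.0, "f2": 1.0}, 1)]
# Each member sees a different subset of the features.
models = [perceptron(data, ["f1"]),
          perceptron(data, ["f2"]),
          perceptron(data, ["f1", "f2"])]
print(ensemble_predict(models, {"f1": 1.0}))
```

Because the members are independent, each can be trained in its own reducer, which is what makes this family of ensembles MR-friendly.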

Further advantages of parallelism: although stream-based learners look at all of the data, a number of them can be executed in parallel, which is effective for tuning hyper-parameters. Generative models such as Naïve Bayes are naturally well suited to distributed learning: it's just counting.
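The "just counting" point can be illustrated with a hypothetical sketch (not Twitter's code): Naïve Bayes training reduces to class and token counts, and counts from different partitions of the data can simply be summed, so each partition can be counted independently and the results merged.

```python
import math
from collections import Counter

def count_partition(docs):
    """Per-partition Naive Bayes 'training': count class priors and
    per-class token frequencies. Counters from separate partitions can
    be merged by addition, which is what makes this easy to distribute."""
    class_counts, token_counts = Counter(), {}
    for tokens, label in docs:
        class_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokens)
    return class_counts, token_counts

def predict(model, tokens, alpha=1.0):
    """Classify by maximum log-posterior with add-alpha smoothing."""
    class_counts, token_counts = model
    vocab = {t for c in token_counts.values() for t in c}
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, n in class_counts.items():
        denom = sum(token_counts[label].values()) + alpha * len(vocab)
        lp = math.log(n / total) + sum(
            math.log((token_counts[label][t] + alpha) / denom)
            for t in tokens)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = count_partition([(["good", "great"], "pos"),
                         (["bad", "awful"], "neg")])
print(predict(model, ["good"]))
```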

Example: tweet sentiment detection. Training/test data: tweets with emoticons, where the emoticons provide surrogate labels for training and are removed from the data. Logistic Regression is trained over character 4-grams; a single classifier vs. the use of ensembles.

Ensembles with 10m examples beat a single classifier trained on 100m! Diminishing returns for free. [Plot: accuracy of a single classifier vs. ensembles, at 10m and 100m training examples.]

Iterative algorithms: what if one REALLY wants to iterate over big data till convergence? Options: unrolling of small loops in Pig; implementing the algorithm in Cascading with Scala or Python; using Python control flow in Pig; offloading to a custom MR job (e.g., Mahout).

Example: modeling topic distributions with Latent Dirichlet Allocation (LDA), a Bayesian soft-clustering technique: each document can naturally belong to multiple clusters with a degree of membership. Popular in topic modeling for text data. Why should we care? Topics capture user interests, and knowledge of interests makes recommendation easier.

Mahout/PigML integration: data preprocessing (PigML) → iterate LDA till convergence (Mahout) → LDA inference (PigML).

LDA applications: user interest modeling facilitates user matching as well as predicting engagement, though it does not directly solve the more general problem of finding users who might be interested in a tweet, story, web page, etc. LDA offers a probabilistic model of user interests (e.g., based on what they tweet about).

Anchoring LDA: runs of LDA can lead to significantly different results from one run to the next, which may hamper the usefulness/interpretability of the results. Anchoring clusters to known tag labels regularizes the clustering procedure; this is also known as Labeled LDA (LLDA).

ML/Data Mining projects we contribute to: Cassowary (large graph mining in Java, single box), Pig, Scalding (Cascading with Scala), PyCascading (Cascading with Python), and Mahout.

ML outside of Twitter: what does the academic community consider important? Sentiment detection, event tracking (e.g., epidemics), politics, locality detection, spam detection.

Publications mentioning spam, Google Scholar (late 2011):

Query         Result count
Twitter spam   9,800
Web spam      52,300
Email spam    41,600
IM spam       30,000
SMS spam      10,500

Quick search: number of titles on Amazon (early 2011):

Query              Result count
Twitter marketing  544
Twitter profits    100
Twitter money      236
Twitter rich        36
Twitter seo         21

Spam/spammer modeling: why spam Twitter (or other social media)? The types of spam seen: @replies/@mentions, trend/search spam, follow spam. Long term vs. short term: spammers favor pump-and-dump, so a swift reaction to attacks is key.

Example: normal interactions

Example: spammy interactions

ML @ Twitter: Kumman Chellapila, Jimmy Lin, Yue Lu, Jake Mannix, Gilad Mishne, Ram Ravichandran, Miguel Rios, Andy Schlaikjer, Yifan Shi, + others

THANK YOU