Challenges of Cloud Scale Natural Language Processing Mark Dredze Johns Hopkins University
My Interests? Information expressed in human language: Machine Learning, Natural Language Processing, Intelligent User Interfaces
Some History Large scale computing resources Large scale data
Information Growth In the past 10 years [slide graphic: growth figures such as 200 billion daily, 100 million users, 1 trillion URLs, 18 million users, 300 million users; MySpace, Blogs, Podcasts, YouTube]
Intelligent Information Systems Computers help us organize and understand information! Linguistically informed, data-driven learning; user interfaces backed by intelligent systems. Intelligent Email Management: email behaviors by role (CHI 2005), activity management (IUI 2006), summarization (IUI 2008), triage and search (IJCAI 2009). Large scale data = tremendous opportunities; statistical NLP can change how we process information.
Challenges of Cloud Scale With Great Data Comes Great Responsibility Learning high quality, advanced NLP systems from data is not trivial. The old way: carefully curated, controlled corpora. Advantages: easy to learn from. Disadvantages: small datasets. The new way: large amounts of raw data. Advantages: data is everywhere you look! Disadvantages: the data is messy, heterogeneous, and unpredictable.
Today: Learning Challenges Outline Large scale learning. Challenge: how can algorithms designed for thousands of examples scale to billions? Solution: Confidence-Weighted Learning. Heterogeneous data. Challenge: data is messy, highly varied, and unpredictable: different domains, genres, languages, users, etc. Solution: apply Confidence-Weighted learning to multi-domain learning and to recognizing domain shifts.
A Learning Foundation Online learning algorithms for linear classifiers update the hypothesis after every example (streaming). Ex. Perceptron, Winnow, MIRA. Strength in simplicity: naturally handles many examples; widely used in many statistical NLP systems. Weakness in naïve assumptions: these algorithms make few assumptions about the data, which is naïve and limits the update options.
Online Linear Classifiers A linear classifier has a parameter for each feature. Prediction: a linear combination of the parameters and the example's features. Binary classification = sign(prediction); margin = abs(prediction). Worked example: classifier parameters (weight vector) (0, -0.5, 0.2, 1.5, 1.4, -1.2, 0.1), example features (1, 0, 0, 1, 3, 0, 2), prediction = 5.9. Update: this example is actually negative! Change the parameters to be more negative. A sketch of the prediction and a simple update follows below.
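Here is a minimal Python sketch that reproduces the slide's numbers; the perceptron-style correction is only the simplest of the online updates named above, not the Confidence-Weighted update introduced later.

```python
import numpy as np

# Weight vector and feature counts from the slide's illustration.
w = np.array([0.0, -0.5, 0.2, 1.5, 1.4, -1.2, 0.1])  # classifier parameters
x = np.array([1.0, 0.0, 0.0, 1.0, 3.0, 0.0, 2.0])    # one example's features

prediction = w @ x            # linear combination of parameters: 5.9
label = np.sign(prediction)   # binary classification: +1
margin = abs(prediction)      # distance from the decision boundary

# This example is actually negative, so a perceptron-style update shifts
# every active parameter in the negative direction.
y_true = -1
if np.sign(prediction) != y_true:
    w = w + y_true * x
```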
Representing Data NLP represents data as sparse feature vectors. "I loved watching this sensational movie." becomes a vector that is almost all zeros, with a 1 at the position of each word that appears. Even for simple tasks we have tens of thousands of features! Some are much more common than others: loved vs. sensational.
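A minimal sketch of building such a sparse representation; the simple whitespace tokenization is an assumption for illustration, not the feature extraction used in the experiments.

```python
from collections import Counter

def bag_of_words(text):
    """Sparse bag-of-words features: each example activates only a handful of
    the tens of thousands of features in the full vocabulary."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return Counter(tokens)

features = bag_of_words("I loved watching this sensational movie.")
# Counter({'i': 1, 'loved': 1, 'watching': 1, 'this': 1,
#          'sensational': 1, 'movie': 1})
```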
Rare Features are Useful
Parameter Confidence An online classifier does not track feature frequency. Intuition: the more a parameter is updated, the less it should change. Solution: introduce parameter confidence. More parameter confidence means smaller changes.
Confidence Weighted Learning Represent each parameter value as a Gaussian. Why Gaussian? Mean: the parameter's value. Variance: confidence in the parameter's value. Learning: update the parameter (move the mean) and increase confidence (reduce the variance). Dredze et al. ICML 2008, Crammer et al. NIPS 2009, Crammer et al. EMNLP 2009
Confidence Weighted Update Objective: smallest possible change to the parameters. Condition: classify the example correctly.
1) $\min_{\mu,\Sigma} D_{\mathrm{KL}}\big(\mathcal{N}(\mu,\Sigma)\,\|\,\mathcal{N}(\mu_i,\Sigma_i)\big)$ (smallest change), subject to
2) $\Pr_{w \sim \mathcal{N}(\mu,\Sigma)}\big[y_i\,(w \cdot x_i) \ge 0\big] \ge \eta$ (correct with probability $\eta$), with $\eta \in (0.5, 1)$.
Sigma always decreases (more confident); the update is weighted by the covariance.
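Below is a minimal diagonal-covariance sketch of this update in Python. It follows the closed-form step size reported for the variance formulation in Dredze et al. (ICML 2008); treat the exact constants and the class name as illustrative assumptions rather than the reference implementation.

```python
import numpy as np
from scipy.stats import norm

class DiagonalCW:
    """Illustrative diagonal Confidence-Weighted classifier (variance form)."""

    def __init__(self, n_features, eta=0.9, init_var=1.0):
        self.mu = np.zeros(n_features)            # parameter means
        self.var = np.full(n_features, init_var)  # per-feature variances (confidence)
        self.phi = norm.ppf(eta)                  # confidence parameter, eta in (0.5, 1)

    def predict(self, x):
        return np.sign(self.mu @ x)

    def update(self, x, y):
        """One online step on example x with label y in {-1, +1}."""
        m = y * (self.mu @ x)        # signed margin
        v = (self.var * x) @ x       # variance of the margin
        phi = self.phi
        # Closed-form step size; alpha > 0 only on a probabilistic mistake.
        disc = (1 + 2 * phi * m) ** 2 - 8 * phi * (m - phi * v)
        alpha = max((-(1 + 2 * phi * m) + np.sqrt(disc)) / (4 * phi * v + 1e-12), 0.0)
        if alpha > 0:
            self.mu += alpha * y * self.var * x                            # move means
            self.var = 1.0 / (1.0 / self.var + 2 * alpha * phi * x ** 2)   # shrink variances
```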
Low Variance for Frequent or Useful Features
Take Away Message Intuition about language improves learning Parameter confidence improves learning CW beats Perceptron, MIRA, SGD, Maxent, SVM Useful in other settings Large scale learning Parallel training Heterogeneous data Multi-domain learning Recognizing domain shifts
Scaling Online Learning Cloud systems: many machines to process data. Learn many linear classifiers across many machines, then combine the final classifiers. How should we combine many classifiers? Option 1: average. Option 2: CW combinations. [Chart: accuracy on 1 million sentiment examples trained across 10 machines, comparing a single machine, simple averaging, and the CW combination.]
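A hedged sketch of the two options: plain averaging of per-machine means versus a confidence-weighted combination in which each machine contributes in proportion to its confidence (inverse variance). The precision-weighting rule here is an illustrative assumption; the published CW combinations may differ in detail.

```python
import numpy as np

def average_combination(mus):
    """Option 1: simple parameter averaging across machines."""
    return np.mean(mus, axis=0)

def cw_combination(mus, variances):
    """Option 2 (sketch): precision-weighted average of the per-machine means,
    so parameters a machine is confident about (low variance) count for more."""
    precisions = 1.0 / np.asarray(variances)
    weights = precisions / precisions.sum(axis=0)   # normalize per feature
    return (weights * np.asarray(mus)).sum(axis=0)

# Toy example: three machines, two features.
mus = [np.array([1.0, 0.2]), np.array([0.8, -0.5]), np.array([1.2, 0.0])]
variances = [np.array([0.1, 1.0]), np.array([0.2, 1.0]), np.array([0.1, 5.0])]
combined = cw_combination(mus, variances)
```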
Heterogeneous Data More data doesn't mean more of the same data: more domains, genres, languages. Algorithms must handle heterogeneous data. Multi-domain learning: a single classifier for many different domains. Detecting domain shift: when has the topic changed and impacted accuracy? We care about scale, so use the online setting.
Domain Change Example Sentiment classification: predict if a product review is positive or negative. Training data: "This book has interesting characters, a well developed plot, suspense, action, adventure. What I would expect from an award winning author." Test data: "This blender is durable and affordable. It comes with a five year warranty and creates tasty smoothies."
Learning Across Domains Setting: domains interleaved for sentiment classification; assume we know the domain for each example. Training: labels are given for learning. A stream of product reviews from Kitchen, Electronics, Movies, Books, Appliances. Learn all domains at once!
Naïve Approaches Assume one data set? Domains are different! ("Very long battery life" vs. "Very long movie") Assume different data sets? More similarities than differences! ("I loved this book" vs. "I loved this movie")
Multi-Domain Learning How can we learn a system for a single task across many domains? Examples Sentiment classification across product types Spam classification across different users Named entity recognition across different genres
Combined Approach Shared parameters: a parameter for each feature regardless of domain Captures shared behaviors I loved this book vs. I loved this movie Domain parameters: a parameter for each feature in each domain Captures domain behaviors Very long battery life vs. Very long movie
Learning with New Parameters Combine domain specific and shared parameters for learning Classify examples with combined parameters Update parameters to change combined behavior How to combine parameters? How to learn with the combination? Confidence Weighted Learning
Combining Parameters Recall combining parameters from many machines. Averaging parameters: e.g. a shared value of 2 and a domain-specific value of -1 average to a combined value of 0.5. The CW combination instead weights each value by its confidence.
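The sketch below contrasts the two combinations on the slide's toy values (shared 2, domain-specific -1). The confidence-weighted rule is the same precision weighting sketched for multi-machine combination above, and is an illustrative assumption rather than the exact published formula.

```python
import numpy as np

def combine(shared_mu, shared_var, domain_mu, domain_var):
    """Average vs. confidence-weighted combination of shared and
    domain-specific parameters."""
    averaged = (shared_mu + domain_mu) / 2.0
    precision = 1.0 / shared_var + 1.0 / domain_var
    cw_combined = (shared_mu / shared_var + domain_mu / domain_var) / precision
    return averaged, cw_combined

avg, cw = combine(np.array([2.0]), np.array([1.0]),
                  np.array([-1.0]), np.array([1.0]))
# avg == 0.5; with equal variances the CW combination is also 0.5, but a
# lower-variance (more confident) parameter would pull the result toward itself.
```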
Learning We know how to combine parameters for prediction. How do we update parameters? Shared behavior should go to the shared parameters; domain behavior should go to the domain parameters. How do we know which features are which? Recall: low variance means useful for prediction, and in the combination, low variance contributes more. A new online update using the combination!
Multi-Domain Regularization Domain parameters regularize each other; we want parameters to be similar if possible (shared). New update using the combination: 1) smallest parameter change, 2) classify the example correctly. Dredze and Crammer, 2008; Dredze et al. 2009
Evaluation on Sentiment Methods Proposed method: multi-domain regularization. Single classifier: best for shared behaviors. Separate classifiers: best for domain specific behaviors. Sentiment classification: rate product reviews as positive/negative. 4 datasets: All (7 Amazon product types), Books (different rating thresholds), DVDs (different rating thresholds), Books+DVDs. 1500 train, 100 test per domain.
Results [Chart: test error (smaller is better) for Single, Separate, and MDR on Books, DVD, Books+DVD, and All.] 10-fold CV, one pass of online training. Improvements on Books, DVDs, and Books+DVDs are significant at p=.001.
Discovering Domain Change [Illustration: a sentiment classification system as the review stream shifts from Movies to Kitchen.]
Changing Domains Data changes in the real world and hurts accuracy. If we knew we had a new domain, we could turn off a badly performing system or fix it. How do we know that we have a new domain? Detect when we encounter a new domain!
Detecting Domain Shifts Assumptions: a new domain will be signaled by (1) Accuracy: classifier accuracy drops; (2) Margin: some features disappear, so margins shrink. We can't measure accuracy at test time; can we use margins instead?
Improved Margins Margins are a signal of confidence: fewer important features means less confidence. Is there a better way to get confidence estimates? Use Confidence Weighted margin values from a Confidence Weighted classifier. A linear combination of scalar parameters gives a scalar margin; a linear combination of Gaussian parameters gives a Gaussian margin, with mean = margin and variance = confidence in the margin. Normalized margin = mean / variance.
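A small sketch of the normalized margin described above, assuming the diagonal Gaussian parameterization used earlier; the exact normalization in the papers may differ.

```python
import numpy as np

def normalized_cw_margin(mu, var, x):
    """The margin of a CW classifier on example x is itself Gaussian:
    mean = mu . x, and (for a diagonal model) variance = sum_j var_j * x_j^2.
    Dividing the mean by the variance gives a confidence-aware margin."""
    margin_mean = mu @ x
    margin_var = (var * x) @ x
    return margin_mean / margin_var

# Track this value over the stream: a sudden, sustained drop in the
# normalized margins suggests the domain has changed.
```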
Domain Shift [Chart: accuracy and average margin over a stream of book reviews that shifts to DVD reviews.]
Experiments Data: sentiment classification between domains; spam classification between users; named entity classification between genres (news articles, broadcast news, telephone, blogs, etc.). Simulate domain shifts between each pair: 500 source examples, then 1500 target examples. Compute the CW margin for each example using the source-domain classifier. Baseline: Support Vector Machine margin. When does an A-distance tracker detect the change?
[Scatter plot: number of examples after the change needed before the shift is detected, comparing the SVM margin against the CW normalized margin; both axes run from 0 to 1200.]
Summary: Learning Challenges Large scale learning Scaling NLP systems using CW learning Parallelizes across the cloud Heterogeneous data Learn from heterogeneous data in an online setting Learn a single system across many domains Recognizing when data sources shift
Cloud Computing Opportunities Enormous data for NLP. Challenge: diverse data processing (domains, genres, dialects, languages, users). Challenge: scaling up methods. Real systems informed by real users. Challenge: building intelligent user-facing systems. Key: understanding what users want. We can change how people interact with information.
Thank You Data, Code, More Info? www.dredze.com mdredze@cs.jhu.edu Collaborators Koby Crammer: The Technion Alex Kulesza: University of Pennsylvania Tim Oates: University of Maryland - Baltimore County Fernando Pereira: Google Inc. Christine Piatko: Johns Hopkins University