Challenges of Cloud Scale Natural Language Processing



Mark Dredze
Johns Hopkins University

My Interests
Information expressed in human language
Machine learning, natural language processing, intelligent user interfaces

Some History
Large scale computing resources
Large scale data

Information Growth
[Slide graphic: growth over the past 10 years across web services such as MySpace, blogs, podcasts, and YouTube, with figures including 200 billion daily, 1 trillion URLs, and user bases of 18 million, 100 million, and 300 million]

Intelligent Information Systems
Computers help us organize and understand information!
Linguistically informed, data driven learning
User interfaces backed by intelligent systems
Intelligent email management: email behaviors by role (CHI 2005), activity management (IUI 2006), summarization (IUI 2008), triage and search (IJCAI 2009)
Large scale data = tremendous opportunities
Statistical NLP can change how we process information

Challenges of Cloud Scale
With great data comes great responsibility: learning high quality, advanced NLP systems from data is not trivial.
The old way: carefully curated, controlled corpora. Advantages: easy to learn from. Disadvantages: small datasets.
The new way: large amounts of raw data. Advantages: data is everywhere you look! Disadvantages: the data is messy, highly varied, and unpredictable.

Today: Learning Challenges (Outline)
Large scale learning. Challenge: how can algorithms designed for thousands of examples scale to billions? Solution: Confidence-Weighted learning.
Heterogeneous data. Challenge: data is messy, highly varied, and unpredictable: different domains, genres, languages, users, etc. Solution: apply Confidence-Weighted learning to multi-domain learning and to recognizing domain shifts.

A Learning Foundation
Online learning algorithms for linear classifiers update the hypothesis after every example (streaming), e.g. Perceptron, Winnow, MIRA.
Strength in simplicity: naturally handles many examples; widely used in many statistical NLP systems.
Weakness in naïve assumptions: these algorithms make naïve assumptions about the data, which limits the update options.

Online Linear Classifiers
A linear classifier keeps a parameter for each feature. Prediction: a linear combination of parameters and features. Binary classification = sign(prediction); margin = abs(prediction).
[Slide figure: the weight vector (0, -0.5, 0.2, 1.5, 1.4, -1.2, 0.1) applied to the feature vector (1, 0, 0, 1, 3, 0, 2) gives the prediction 5.9]
Update: this example is negative! Change the parameters to be more negative.
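As a concrete sketch of such an online linear classifier, here is a minimal perceptron-style learner; the dense NumPy representation is an assumption for readability:

```python
import numpy as np

class OnlineLinearClassifier:
    """Minimal perceptron-style online learner: one weight per
    feature, prediction is the sign of a linear combination, and
    weights are updated after every example (streaming)."""

    def __init__(self, n_features):
        self.w = np.zeros(n_features)   # classifier parameters (weight vector)

    def predict(self, x):
        score = self.w @ x              # linear combination of parameters
        return 1 if score >= 0 else -1  # binary label; abs(score) is the margin

    def update(self, x, y):             # y in {-1, +1}
        if self.predict(x) != y:        # mistake-driven: update only on errors
            self.w += y * x             # move parameters toward the true label
```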

Representing Data
NLP represents data as sparse feature vectors. "I loved watching this sensational movie." becomes a mostly-zero 0/1 vector with ones only at the positions of its words.
Even for simple tasks we have tens of thousands of features! Some are much more common than others: loved vs. sensational.
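A sketch of how such a sparse vector might be built; the tokenizer and the tiny vocabulary below are illustrative assumptions:

```python
def featurize(text, vocabulary):
    """Map a sentence to a sparse 0/1 bag-of-words vector,
    stored as a dict so the tens of thousands of absent
    features cost nothing."""
    features = {}
    for token in text.lower().replace(".", "").split():
        if token in vocabulary:
            features[vocabulary[token]] = 1.0
    return features

# Hypothetical vocabulary; real systems have tens of thousands of entries.
vocab = {"i": 0, "loved": 1, "watching": 2, "this": 3,
         "sensational": 4, "movie": 5}
print(featurize("I loved watching this sensational movie.", vocab))
# {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}
```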

Rare Features are Useful

Parameter Confidence
An online classifier does not track feature frequency.
Intuition: the more a parameter has been updated, the less it should change.
Solution: introduce parameter confidence. More parameter confidence means smaller changes.

Confidence Weighted Learning
Represent each parameter as a Gaussian. Why Gaussian? Mean: the parameter's value. Variance: confidence in the parameter's value.
Learning: update the parameter by moving the mean; increase confidence by reducing the variance.
Dredze et al., ICML 2008; Crammer et al., NIPS 2009; Crammer et al., EMNLP 2009.

Confidence Weighted Update
Objective: make the smallest possible change to the parameters (1) such that the example is classified correctly with probability at least $\eta$ (2):

1) $\min_{\mu,\Sigma} D_{\mathrm{KL}}\big(\mathcal{N}(\mu,\Sigma)\,\|\,\mathcal{N}(\mu_i,\Sigma_i)\big)$
2) subject to $\Pr_{w\sim\mathcal{N}(\mu,\Sigma)}\big[\,y_i\,(w\cdot x_i)\ge 0\,\big]\ge\eta$, with $\eta\in(0.5,1)$

Sigma always decreases (the model grows more confident), and the update is weighted by the covariance.
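The closed-form CW solution is involved; below is a minimal runnable sketch of the closely related AROW update (Crammer et al., NIPS 2009, cited above) with a diagonal covariance. It keeps the key ingredients: a mean and a variance per parameter, mean updates scaled by variance, and variances that only shrink. The regularization constant r and the dense NumPy representation are assumptions for readability.

```python
import numpy as np

class DiagonalAROW:
    """Confidence-weighted-style online learner (AROW variant)
    with a diagonal covariance: each weight j has a mean mu[j]
    and a variance sigma[j]; small variance = high confidence
    = small future updates."""

    def __init__(self, n_features, r=1.0):
        self.mu = np.zeros(n_features)    # parameter means
        self.sigma = np.ones(n_features)  # parameter variances (uncertainty)
        self.r = r                        # regularizer (assumed hyperparameter)

    def predict(self, x):
        return 1 if self.mu @ x >= 0 else -1

    def update(self, x, y):               # y in {-1, +1}
        margin = y * (self.mu @ x)
        confidence = np.sum(self.sigma * x * x)  # x^T Sigma x, diagonal Sigma
        if margin >= 1.0:                 # correct with margin: no change
            return
        beta = 1.0 / (confidence + self.r)
        alpha = (1.0 - margin) * beta
        # Mean update is scaled per-feature by variance: confident
        # (low-variance) weights move less than uncertain ones.
        self.mu += alpha * y * self.sigma * x
        # Variance only shrinks, so every observed feature gains confidence.
        self.sigma -= beta * (self.sigma * x) ** 2
```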

Low Variance for Frequent or Useful Features

Take Away Message
Intuition about language improves learning: parameter confidence improves learning, and CW beats Perceptron, MIRA, SGD, Maxent, and SVM.
Useful in other settings: large scale learning (parallel training), heterogeneous data (multi-domain learning, recognizing domain shifts).

Scaling Online Learning
Cloud systems have many machines to process data: learn many linear classifiers across many machines, then combine the final classifiers.
How should we combine many classifiers? Option 1: average. Option 2: CW combinations.
[Slide chart: accuracy on 1 million sentiment examples split across 10 machines, comparing a single machine, the average, and the CW combination; accuracy axis from 92.5 to 95.5]
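As a sketch of the two options, assume each machine returns the means (and, for CW, the per-parameter variances) of its trained classifier. The precision-weighted average below is one natural reading of "CW combination"; the exact combination rule used in the talk may differ.

```python
import numpy as np

def average_combination(mus):
    """Option 1: plain average of the per-machine weight means."""
    return np.mean(np.asarray(mus), axis=0)

def cw_combination(mus, sigmas):
    """Option 2 (a sketch): precision-weighted average, where each
    machine's weight counts in proportion to its confidence
    (inverse variance), so confident parameters dominate."""
    mus, sigmas = np.asarray(mus), np.asarray(sigmas)
    precision = 1.0 / sigmas
    return (mus * precision).sum(axis=0) / precision.sum(axis=0)
```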

Heterogeneous Data
More data doesn't mean more of the same data: more domains, genres, languages. Algorithms must handle heterogeneous data.
Multi-domain learning: a single classifier for many different domains.
Detecting domain shift: when has the topic changed and impacted accuracy?
We care about scale, so we use the online setting.

Domain Change Example
Sentiment classification: predict whether a product review is positive or negative.
Training data: "This book has interesting characters, a well developed plot, suspense, action, adventure. What I would expect from an award winning author."
Test data: "This blender is durable, and affordable. It comes with a five year warranty and creates tasty smoothies." Label: ?

Learning Across Domains
Setting: domains are interleaved for sentiment classification, and we assume we know the domain of each example. Training: labels are given for learning.
[Slide figure: a stream of product reviews drawn from Kitchen, Electronics, Movies, Books, and Appliances]
Learn all domains at once!

Naïve Approaches
Assume one data set? But domains are different: "Very long battery life" vs. "Very long movie".
Assume different data sets? But there are more similarities than differences: "I loved this book" vs. "I loved this movie".

Multi-Domain Learning
How can we learn a system for a single task across many domains? Examples: sentiment classification across product types, spam classification across different users, named entity recognition across different genres.

Combined Approach
Shared parameters: one parameter per feature regardless of domain, capturing shared behaviors ("I loved this book" vs. "I loved this movie").
Domain parameters: one parameter per feature in each domain, capturing domain behaviors ("Very long battery life" vs. "Very long movie").

Learning with New Parameters
Combine domain specific and shared parameters for learning: classify examples with the combined parameters, and update the parameters to change the combined behavior.
How do we combine parameters? How do we learn with the combination? Confidence Weighted learning (see the sketch after the next slide).

Combining Parameters
Recall combining parameters from many machines: averaging, or the CW combination.
[Slide figure: a shared parameter of 2 and a domain specific parameter of -1 combine into a combined parameter of 0.5]
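A minimal sketch of classification and learning with shared plus domain-specific parameters. The plain average and the perceptron-style update are simplifying assumptions standing in for the CW combination and CW update the talk actually uses.

```python
import numpy as np

class MultiDomainLinear:
    """Sketch of multi-domain learning: shared parameters capture
    behavior common to all domains, per-domain parameters capture
    domain-specific behavior, and prediction uses their combination."""

    def __init__(self, n_features, domains, lr=0.1):
        self.shared = np.zeros(n_features)
        self.domain = {d: np.zeros(n_features) for d in domains}
        self.lr = lr                          # assumed learning rate

    def score(self, x, d):
        # Classify with the combination of shared and domain weights
        # (plain average here; the talk uses a CW combination).
        return 0.5 * (self.shared + self.domain[d]) @ x

    def update(self, x, y, d):                # y in {-1, +1}
        # Update through the combination: both parameter sets move,
        # so shared behavior accumulates across domains while
        # domain quirks stay in the domain weights.
        if y * self.score(x, d) <= 0:
            self.shared += self.lr * y * x
            self.domain[d] += self.lr * y * x
```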

Learning
We know how to combine parameters for prediction, but how do we update them? Shared behavior should go to the shared parameters; domain behavior should go to the domain parameters. How do we know which features are which?
Recall: low variance means useful for prediction, and in the combination, low variance contributes more. This yields a new online update using the combination!

Multi-Domain Regularization
The domain parameters regularize each other: we want parameters to be similar (shared) when possible.
The new update uses the combination: 1) make the smallest parameter change 2) such that the example is classified correctly.
Dredze and Crammer, 2008; Dredze et al., 2009.

Evaluation on Sentiment
Methods: the proposed multi-domain regularization (MDR); a single classifier (best for shared behaviors); separate classifiers (best for domain specific behaviors).
Task: sentiment classification, rating product reviews positive/negative.
4 datasets: All (7 Amazon product types), Books (different rating thresholds), DVDs (different rating thresholds), Books+DVDs. 1500 train and 100 test examples per domain.

Results
[Slide chart: test error (smaller is better) for Single, Separate, and MDR on Books, DVD, Books+DVD, and All; error axis from 0 to 25]
10-fold CV, one pass of online training; Books, DVDs, and Books+DVDs significant at p=.001.

Discovering Domain Change
[Slide figure: a sentiment classification system whose input stream shifts from Movies to Kitchen reviews]

Changing Domains
Data changes in the real world, and that hurts accuracy. If we knew we had a new domain, we could turn off a badly performing system, or fix it. How do we know that we have a new domain? Detect when we encounter a new domain!

Detecting Domain Shifts
Assumption: a new domain will be signaled by accuracy (classifier accuracy drops) and by margins (some features disappear, yielding smaller margins).
We can't measure accuracy, but can we use margins?

Improved Margins
Margins are a signal of confidence: fewer important features means less confidence. Is there a better way to get confidence estimates?
Confidence Weighted margins: margin values from a Confidence Weighted classifier. A linear combination of scalar parameters gives a scalar margin; a linear combination of Gaussian parameters gives a Gaussian margin, with mean = the margin and variance = confidence in the margin.
Normalized margins: mean/variance.
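A sketch of the normalized CW margin under a diagonal covariance, following the slide's mean-over-variance normalization (the exact normalization in the papers may differ):

```python
import numpy as np

def normalized_cw_margin(x, mu, sigma):
    """The margin of a CW classifier is itself Gaussian: its mean
    is the usual score and its variance is x^T Sigma x (diagonal
    Sigma assumed here). Dividing mean by variance yields a
    confidence-aware margin usable for detecting domain shift."""
    mean = mu @ x                  # margin value
    var = np.sum(sigma * x * x)    # confidence in that margin
    return mean / var              # normalized margin
```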

Domain Shift
[Slide chart: classifier accuracy and average margin over a stream of book reviews followed by DVD reviews, with the domain shift marked]

Experiments
Data: sentiment classification between domains; spam classification between users; named entity classification between genres (news articles, broadcast news, telephone, blogs, etc.).
Setup: simulate domain shifts between each pair, with 500 source examples and 1500 target examples. Compute the CW margin of examples under the source domain classifier. Baseline: Support Vector Machine margin. Question: when does an A-distance tracker detect the change?

[Slide plot: number of examples after the change before detection, SVM margin vs. CW normalized margin; both axes from 0 to 1200]

Summary: Learning Challenges
Large scale learning: scaling NLP systems using CW learning, which parallelizes across the cloud.
Heterogeneous data: learning from heterogeneous data in an online setting, learning a single system across many domains, and recognizing when data sources shift.

Cloud Computing Opportunities
Enormous data for NLP. Challenge: diverse data processing (domains, genres, dialects, languages, users). Challenge: scaling up methods.
Real systems informed by real users. Challenge: building intelligent user facing systems. Key: understanding what users want.
We can change how people interact with information.

Thank You
Data, code, more info: www.dredze.com, mdredze@cs.jhu.edu
Collaborators: Koby Crammer (The Technion), Alex Kulesza (University of Pennsylvania), Tim Oates (University of Maryland, Baltimore County), Fernando Pereira (Google Inc.), Christine Piatko (Johns Hopkins University)