Challenges of Cloud Scale Natural Language Processing

1 Challenges of Cloud Scale Natural Language Processing
  Mark Dredze, Johns Hopkins University

2 My Interests?
  - Information Expressed in Human Language
  - Machine Learning
  - Natural Language Processing
  - Intelligent User Interfaces

3 Some History
  - Large scale computing resources
  - Large scale data

4 Information Growth
  In the past 10 years:
  - 200 billion daily
  - 100 million users
  - 1 trillion URLs
  - 18 million users
  - 300 million users
  - MySpace, Blogs, Podcasts, YouTube

5 Intelligent Information Systems
  - Computers help us organize and understand information!
  - Linguistically informed data driven learning
  - User interfaces backed by intelligent systems
  - Intelligent Management behaviors by role (CHI 2005)
  - Activity management (IUI 2006)
  - Summarization (IUI 2008)
  - Triage and search (IJCAI 2009)
  - Large scale data = tremendous opportunities
  - Statistical NLP can change how we process information

6 Challenges of Cloud Scale
  With Great Data Comes Great Responsibility
  - Learning high quality advanced NLP systems from data is not trivial
  - The old way: carefully curated controlled corpora
    - Advantages: easy to learn
    - Disadvantages: small datasets
  - The new way: large amounts of raw data
    - Advantages: data is everywhere you look!
    - Disadvantages:

7 Today: Learning Challenges
  Outline
  - Large scale learning
    - Challenge: How can algorithms designed for thousands of examples scale to billions?
    - Solution: Confidence-Weighted Learning
  - Heterogeneous data
    - Challenge: data is messy, highly varied and unpredictable (different domains, genres, languages, users, etc.)
    - Solution: Apply Confidence-Weighted learning
      - Multi-domain learning
      - Recognizing domain shifts

8 A Learning Foundation
  - Online learning algorithms for linear classifiers
    - Update the hypothesis after every example (streaming)
    - Ex. Perceptron, Winnow, MIRA
  - Strength in simplicity
    - Naturally handles many examples
    - Widely used in many statistical NLP systems
  - Weakness in naïve assumptions
    - Making few assumptions about the data is naïve
    - Limits the update options

9 Online Linear Classifiers
  - Linear classifier
    - A parameter for each feature
    - Prediction: linear combination of parameters
    - Binary classification = sign(prediction)
    - Margin = abs(prediction)
  - [Figure: classifier parameters (weight vector), example, prediction]
  - Update: this example is negative! Change parameters to be more negative
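The prediction, margin, and update described on this slide can be sketched with a plain perceptron, one of the algorithms named on the previous slide. This is a minimal illustration, not code from the talk; the function names and the toy example features are hypothetical.

```python
from typing import Dict

Features = Dict[str, float]  # sparse feature vector: feature name -> value


def predict(weights: Dict[str, float], x: Features) -> float:
    """Linear combination of the parameters for the features present in x."""
    return sum(weights.get(f, 0.0) * v for f, v in x.items())


def perceptron_update(weights: Dict[str, float], x: Features, y: int) -> bool:
    """One online step; y is +1 or -1. Returns True if the weights changed."""
    score = predict(weights, x)
    if y * score > 0:        # already on the correct side: leave weights alone
        return False
    for f, v in x.items():   # move the weights toward the true label
        weights[f] = weights.get(f, 0.0) + y * v
    return True


weights: Dict[str, float] = {}
example = {"boring": 1.0, "plot": 1.0}               # hypothetical toy features
changed = perceptron_update(weights, example, y=-1)  # "this example is negative"
score = predict(weights, example)
label = 1 if score > 0 else -1   # binary classification = sign(prediction)
margin = abs(score)              # margin = abs(prediction)
```

Note that every feature in the example moves by the same amount, however often it has been seen before; that is the limitation the parameter-confidence slides address.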

10 Representing Data
  - NLP represents data as sparse feature vectors
  - Example: "I loved watching this sensational movie"
  - Even for simple tasks we have tens of thousands of features!
  - Some much more common than others: "loved" vs. "sensational"
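As a rough illustration of a sparse feature vector, the sentence from this slide can be mapped to token counts, storing only the features that actually occur. The whitespace tokenizer and the `bag_of_words` name are assumptions made for this sketch; real systems typically use richer features.

```python
from collections import Counter
from typing import Dict


def bag_of_words(text: str) -> Dict[str, float]:
    """Map a sentence to a sparse vector of lowercase token counts."""
    tokens = text.lower().split()
    return {tok: float(n) for tok, n in Counter(tokens).items()}


x = bag_of_words("I loved watching this sensational movie")
# Only the six tokens in the sentence are stored; every other feature in the
# vocabulary is implicitly zero.
```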

11 Rare Features are Useful

12 Parameter Confidence
  - Online classifier does not track feature frequency
  - Intuition: the more a parameter is updated, the less it should change
  - Solution: introduce parameter confidence
  - More parameter confidence → smaller changes

13 Confidence Weighted Learning
  - Represent each parameter value as a Gaussian (Why Gaussian?)
    - Mean: the parameter's value
    - Variance: confidence in the parameter's value
  - Learning:
    - Update parameter: move mean
    - Increase confidence: reduce variance
  - Dredze et al. ICML 2008, Crammer et al. NIPS 2009, Crammer et al. EMNLP 2009

14 Confidence Weighted Update
  - Objective: smallest possible change to parameters
  - Condition: classify example correctly
  1) Smallest change:
     \min_{\mu,\sigma} D_{KL}\big( \mathcal{N}(\mu,\sigma) \,\|\, \mathcal{N}(\mu_i,\sigma_i) \big)
  2) Correct with probability \eta:
     \text{s.t.}\quad \Pr\big[\, y_i (w \cdot x_i) \ge 0 \,\big] \ge \eta, \qquad \eta \in (0.5, 1)
  - Sigma always decreases (more confident)
  - Update weighted by covariance
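To make "move the mean, shrink the variance" concrete, here is a minimal sketch with one Gaussian (mean and variance) per feature. It does not solve the constrained problem on this slide; instead it swaps in the simpler closed-form update of AROW (adaptive regularization of weight vectors, the Crammer et al. NIPS 2009 paper cited on the previous slide), restricted to a diagonal covariance. The class name, the regularizer `r`, and the initial variance of 1.0 are assumptions of this sketch.

```python
from typing import Dict

Features = Dict[str, float]


class DiagonalGaussianClassifier:
    """Per-feature Gaussian weights, updated with a diagonal AROW-style rule."""

    def __init__(self, r: float = 1.0, initial_variance: float = 1.0) -> None:
        self.mean: Dict[str, float] = {}
        self.var: Dict[str, float] = {}
        self.r = r
        self.initial_variance = initial_variance

    def _variance(self, f: str) -> float:
        return self.var.get(f, self.initial_variance)

    def margin_mean(self, x: Features) -> float:
        return sum(self.mean.get(f, 0.0) * v for f, v in x.items())

    def margin_variance(self, x: Features) -> float:
        return sum(self._variance(f) * v * v for f, v in x.items())

    def update(self, x: Features, y: int) -> None:
        """One online step; y is +1 or -1."""
        m = y * self.margin_mean(x)
        if m >= 1.0:                      # confident and correct: no change
            return
        beta = 1.0 / (self.margin_variance(x) + self.r)
        alpha = (1.0 - m) * beta
        for f, v in x.items():
            s = self._variance(f)
            # High-variance (rarely seen) features take larger steps.
            self.mean[f] = self.mean.get(f, 0.0) + alpha * y * s * v
            # Precision only grows, so the variance can only decrease.
            self.var[f] = 1.0 / (1.0 / s + v * v / self.r)


clf = DiagonalGaussianClassifier()
clf.update({"loved": 1.0, "movie": 1.0}, y=+1)
before = dict(clf.mean)
clf.update({"loved": 1.0, "sensational": 1.0}, y=+1)
step_loved = clf.mean["loved"] - before["loved"]
step_sensational = clf.mean["sensational"]  # unseen before, so it started at 0
```

In the second update, "loved" has already been seen once and has lower variance, so it moves less than the new feature "sensational", which matches the intuition on the Parameter Confidence slide.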

15 Low Variance for Frequent or Useful Features

16 Take Away Message
  - Intuition about language improves learning
  - Parameter confidence improves learning
  - CW beats Perceptron, MIRA, SGD, Maxent, SVM
  - Useful in other settings
    - Large scale learning
      - Parallel training
    - Heterogeneous data
      - Multi-domain learning
      - Recognizing domain shifts

17 Scaling Online Learning
  - Cloud systems: many machines to process data
  - Learn many linear classifiers across many machines
  - Combine the final classifiers
  - How should we combine many classifiers?
    - Option 1: average
    - Option 2: CW combinations
  - [Figure: accuracy of Average, CW Combination, and Single Machine; 1 million sentiment examples on 10 machines]
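The two combination options can be contrasted on a single parameter. The slide does not spell out how the CW combination is computed, so the sketch below assumes one natural reading: weight each machine's mean by its inverse variance, so confident machines count for more. The function names and the numbers are hypothetical.

```python
from typing import List, Tuple

Estimate = Tuple[float, float]  # (mean, variance) for one feature on one machine


def average(estimates: List[Estimate]) -> float:
    """Option 1: plain average of the means, ignoring confidence."""
    return sum(m for m, _ in estimates) / len(estimates)


def precision_weighted(estimates: List[Estimate]) -> float:
    """Assumed reading of option 2: weight each mean by 1 / variance."""
    weights = [1.0 / var for _, var in estimates]
    total = sum(w * m for w, (m, _) in zip(weights, estimates))
    return total / sum(weights)


# Two machines saw the feature often (low variance); one barely saw it.
per_machine = [(2.0, 0.1), (1.8, 0.1), (-1.0, 10.0)]
plain = average(per_machine)
weighted = precision_weighted(per_machine)
```

With these numbers the plain average is pulled well below the two confident estimates by the uncertain machine, while the weighted combination stays close to them.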

18 Heterogeneous Data
  - More data doesn't mean more of the same data
    - More domains, genres, languages
    - Algorithms must handle heterogeneous data
  - Multi-domain learning
    - A single classifier for many different domains
  - Detecting domain shift
    - When has the topic changed and impacted accuracy?
  - We care about scale
    - Use the online setting

19 Domain Change Example
  - Sentiment classification: predict if a product review is positive or negative
  - "This book has interesting characters, a well developed plot, suspense, action, adventure. What I would expect from an award winning author."
  - "This blender is durable, and affordable. It comes with a five year warranty and creates tasty smoothies."
  - (Slide labels: Training Data, Test Data)

20 Learning Across Domains
  - Setting: domains interleaved for sentiment classification
  - Assume we know domain for each example
  - Training: given labels for learning
  - Stream of product reviews: Kitchen, Electronics, Movies, Books, Appliances
  - Learn all domains at once!

21 Naïve Approaches
  - Assume one data set
    - Domains are different!
    - "Very long battery life" vs. "Very long movie"
  - Assume different data sets
    - More similarities than differences!
    - "I loved this book" vs. "I loved this movie"

22 Multi-Domain Learning
  - How can we learn a system for a single task across many domains?
  - Examples
    - Sentiment classification across product types
    - Spam classification across different users
    - Named entity recognition across different genres

23 Combined Approach
  - Shared parameters: a parameter for each feature regardless of domain
    - Captures shared behaviors
    - "I loved this book" vs. "I loved this movie"
  - Domain parameters: a parameter for each feature in each domain
    - Captures domain behaviors
    - "Very long battery life" vs. "Very long movie"

24 Learning with New Parameters
  - Combine domain specific and shared parameters for learning
    - Classify examples with combined parameters
    - Update parameters to change combined behavior
  - How to combine parameters? How to learn with the combination?
  - Confidence Weighted Learning

25 Combining Parameters
  - Recall combining parameters from many machines
  - [Figure: Averaging Parameters vs. CW Combination; example values: Shared 2, Domain Specific -1, Combined .5]
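The averaging case on this slide (shared 2, domain specific -1, combined .5) can be reproduced with a few lines. This is only a sketch of the averaging option; the feature name "long", the domain names, and the function names are hypothetical, and a feature missing from one parameter set is treated as zero.

```python
from typing import Dict

Weights = Dict[str, float]


def combine_by_average(shared: Weights, domain: Weights) -> Weights:
    """Average the shared and domain-specific value of every feature."""
    features = set(shared) | set(domain)
    return {f: (shared.get(f, 0.0) + domain.get(f, 0.0)) / 2.0 for f in features}


def score(weights: Weights, x: Dict[str, float]) -> float:
    """Linear prediction with the combined parameters."""
    return sum(weights.get(f, 0.0) * v for f, v in x.items())


shared = {"long": 2.0}
per_domain = {"movies": {"long": -1.0}, "electronics": {"long": 1.0}}

movie_weights = combine_by_average(shared, per_domain["movies"])
electronics_weights = combine_by_average(shared, per_domain["electronics"])
movie_score = score(movie_weights, {"long": 1.0})
```

Prediction for an example uses the shared parameters together with the parameters of that example's own domain, so the same feature can end up with a different combined weight in each domain.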

26 Learning
  - We know how to combine parameters for prediction
  - How do we update parameters?
    - Shared behavior → shared parameters
    - Domain behavior → domain parameters
  - How do we know which features are which?
    - Recall: low variance means useful for prediction
    - In combination, low variance contributes more
  - New online update using combination!

27 Multi-Domain Regularization
  - Domain parameters regularize each other
  - We want parameters to be similar if possible (shared)
  - New update using combination
    1) Smallest parameter change
    2) Classify example correctly
  - Dredze and Crammer, 2008; Dredze et al. 2009

28 Evaluation on Sentiment
  - Methods
    - Proposed method: Multi-domain regularization
    - Single classifier: best for shared behaviors
    - Separate classifiers: best for domain specific behaviors
  - Sentiment classification
    - Rate product reviews: positive/negative
    - 4 datasets
      - All: 7 Amazon product types
      - Books: different rating thresholds
      - DVDs: different rating thresholds
      - Books+DVDs
    - 1500 train, 100 test per domain

29 Results
  - [Figure: test error (smaller is better) for Single, Separate, and MDR on Books, DVD, Books+DVD, and All]
  - 10-fold CV, one pass online training
  - Books, DVDs, Books+DVDs: p=.001

30 Discovering Domain Change
  - [Figure: Sentiment Classification System; Movies, Kitchen]

31 Changing Domains
  - Data changes in the real world and hurts accuracy
  - If we knew we had a new domain
    - Turn off a badly performing system!
    - Fix it
  - How do we know that we have a new domain?
  - Detect when we encounter a new domain!

32 Detecting Domain Shifts
  - Assumptions: a new domain will be signaled by
    - Accuracy: classifier accuracy drops
    - Margin: some features disappear = smaller margins
  - We can't measure accuracy; can we use margins?

33 Improved Margins
  - Margins are a signal of confidence
    - Fewer important features → less confidence
  - Is there a better way to get confidence estimates?
  - Confidence Weighted margin values from a Confidence Weighted classifier
  - Linear combinations of
    - Scalar parameters → scalar margin
    - Gaussian parameters → Gaussian margin
      - Mean = margin
      - Variance = confidence in margin
  - Normalized margins mean/variance 2
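A linear combination of Gaussian parameters is itself Gaussian, which is what gives the margin a mean and a variance. The sketch below assumes independent parameters (a diagonal covariance), so the variances simply add after scaling by the squared feature values. The function name, the default variance for unseen features, and the toy numbers are hypothetical, and the normalization step is left out because the slide text does not preserve its exact form.

```python
from typing import Dict, Tuple


def gaussian_margin(
    mean: Dict[str, float],
    var: Dict[str, float],
    x: Dict[str, float],
    default_var: float = 1.0,
) -> Tuple[float, float]:
    """Return (mean, variance) of the margin w . x for independent Gaussian w."""
    margin_mean = sum(mean.get(f, 0.0) * v for f, v in x.items())
    margin_var = sum(var.get(f, default_var) * v * v for f, v in x.items())
    return margin_mean, margin_var


mean = {"loved": 1.0, "sensational": 0.5}
var = {"loved": 0.1, "sensational": 0.9}

m_seen, v_seen = gaussian_margin(mean, var, {"loved": 1.0})
m_new, v_new = gaussian_margin(mean, var, {"loved": 1.0, "blender": 1.0})
```

The second example adds a feature the classifier has never seen: the margin mean is unchanged, but its variance jumps, which is the kind of signal a plain scalar margin cannot provide.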

34 Domain Shift
  - [Figure: accuracy and margin averages around a shift between Book Reviews and DVD Reviews]

35 Experiments
  - Data
    - Sentiment classification between domains
    - Spam classification between users
    - Named entity classification between genres: news articles, broadcast news, telephone, blogs, etc.
  - Simulate domain shifts between each pair
    - 500 source examples, 1500 target examples
  - CW margin for examples with source domain classifier
  - Baseline: Support Vector Machine margin
  - When does an A-Distance tracker detect change?
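The idea of watching a stream of margins for a change can be sketched with a much simpler detector than the A-Distance tracker named on this slide: compare the average absolute margin in a sliding window against a fixed reference window taken from the start of the stream. This is a stand-in for illustration only; the function name, window size, and `drop` threshold are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, List, Optional


def detect_shift(
    margins: Iterable[float], window: int = 50, drop: float = 0.5
) -> Optional[int]:
    """Return the first index where the recent average absolute margin falls
    below `drop` times the reference average, or None if it never does."""
    reference: List[float] = []
    recent: Deque[float] = deque(maxlen=window)
    for i, m in enumerate(margins):
        if len(reference) < window:      # first `window` margins: reference
            reference.append(abs(m))
            continue
        recent.append(abs(m))            # later margins: sliding window
        if len(recent) == window:
            ref_avg = sum(reference) / window
            if sum(recent) / window < drop * ref_avg:
                return i
    return None


# Simulated stream: margins shrink after example 100 (the "domain shift").
stream = [2.0] * 100 + [0.5] * 100
detected_at = detect_shift(stream, window=20, drop=0.5)
```

The detector fires some examples after the true change point, because the sliding window needs to fill with enough small margins; that delay is the quantity the slide's question is asking about.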

36 [Figure: SVM Margin vs. CW Normalized Margin; number of examples after change]

37 Summary: Learning Challenges
  - Large scale learning
    - Scaling NLP systems using CW learning
    - Parallelizes across the cloud
  - Heterogeneous data
    - Learn from heterogeneous data in an online setting
    - Learn a single system across many domains
    - Recognizing when data sources shift

38 Cloud Computing Opportunities
  - Enormous data for NLP
    - Challenge: diverse data processing (domains, genres, dialects, languages, users)
    - Challenge: scaling up methods
  - Real systems informed by real users
    - Challenge: building intelligent user facing systems
    - Key: understanding what users want
  - We can change how people interact with information

39 Thank You
  Data, Code, More Info?
  Collaborators
  - Koby Crammer: The Technion
  - Alex Kulesza: University of Pennsylvania
  - Tim Oates: University of Maryland - Baltimore County
  - Fernando Pereira: Google Inc.
  - Christine Piatko: Johns Hopkins University


Semi-Supervised Support Vector Machines and Application to Spam Filtering Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery

More information

203.4770: Introduction to Machine Learning Dr. Rita Osadchy

203.4770: Introduction to Machine Learning Dr. Rita Osadchy 203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Projektgruppe. Categorization of text documents via classification

Projektgruppe. Categorization of text documents via classification Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction

More information

Machine Learning model evaluation. Luigi Cerulo Department of Science and Technology University of Sannio

Machine Learning model evaluation. Luigi Cerulo Department of Science and Technology University of Sannio Machine Learning model evaluation Luigi Cerulo Department of Science and Technology University of Sannio Accuracy To measure classification performance the most intuitive measure of accuracy divides the

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)

More information

Statistics for BIG data

Statistics for BIG data Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before

More information

Investigation of Support Vector Machines for Email Classification

Investigation of Support Vector Machines for Email Classification Investigation of Support Vector Machines for Email Classification by Andrew Farrugia Thesis Submitted by Andrew Farrugia in partial fulfillment of the Requirements for the Degree of Bachelor of Software

More information

Segmentation and Classification of Online Chats

Segmentation and Classification of Online Chats Segmentation and Classification of Online Chats Justin Weisz Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 jweisz@cs.cmu.edu Abstract One method for analyzing textual chat

More information

Classifiers & Classification

Classifiers & Classification Classifiers & Classification Forsyth & Ponce Computer Vision A Modern Approach chapter 22 Pattern Classification Duda, Hart and Stork School of Computer Science & Statistics Trinity College Dublin Dublin

More information

Predicting Soccer Match Results in the English Premier League

Predicting Soccer Match Results in the English Premier League Predicting Soccer Match Results in the English Premier League Ben Ulmer School of Computer Science Stanford University Email: ulmerb@stanford.edu Matthew Fernandez School of Computer Science Stanford University

More information

1 Maximum likelihood estimation

1 Maximum likelihood estimation COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science Dr. Daisy Zhe Wang CISE Department University of Florida August 25th 2014 20 Review Overview of Data Science Why Data

More information

WEB PAGE CATEGORISATION BASED ON NEURONS

WEB PAGE CATEGORISATION BASED ON NEURONS WEB PAGE CATEGORISATION BASED ON NEURONS Shikha Batra Abstract: Contemporary web is comprised of trillions of pages and everyday tremendous amount of requests are made to put more web pages on the WWW.

More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a

More information

Learning is a very general term denoting the way in which agents:

Learning is a very general term denoting the way in which agents: What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);

More information

Why Semantic Analysis is Better than Sentiment Analysis. A White Paper by T.R. Fitz-Gibbon, Chief Scientist, Networked Insights

Why Semantic Analysis is Better than Sentiment Analysis. A White Paper by T.R. Fitz-Gibbon, Chief Scientist, Networked Insights Why Semantic Analysis is Better than Sentiment Analysis A White Paper by T.R. Fitz-Gibbon, Chief Scientist, Networked Insights Why semantic analysis is better than sentiment analysis I like it, I don t

More information

Keywords social media, internet, data, sentiment analysis, opinion mining, business

Keywords social media, internet, data, sentiment analysis, opinion mining, business Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Real time Extraction

More information

Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

More information

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Yue Dai, Ernest Arendarenko, Tuomo Kakkonen, Ding Liao School of Computing University of Eastern Finland {yvedai,

More information

HELP DESK SYSTEMS. Using CaseBased Reasoning

HELP DESK SYSTEMS. Using CaseBased Reasoning HELP DESK SYSTEMS Using CaseBased Reasoning Topics Covered Today What is Help-Desk? Components of HelpDesk Systems Types Of HelpDesk Systems Used Need for CBR in HelpDesk Systems GE Helpdesk using ReMind

More information

Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

More information

The Big Data Paradigm Shift. Insight Through Automation

The Big Data Paradigm Shift. Insight Through Automation The Big Data Paradigm Shift Insight Through Automation Agenda The Problem Emcien s Solution: Algorithms solve data related business problems How Does the Technology Work? Case Studies 2013 Emcien, Inc.

More information

Feature Subset Selection in E-mail Spam Detection

Feature Subset Selection in E-mail Spam Detection Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature

More information

Identifying Personal Stories in Millions of Weblog Entries

Identifying Personal Stories in Millions of Weblog Entries Identifying Personal Stories in Millions of Weblog Entries Andrew S. Gordon and Reid Swanson Institute for Creative Technologies University of Southern California 13274 Fiji Way, Marina del Rey, CA 90292

More information

Using Twitter as a source of information for stock market prediction

Using Twitter as a source of information for stock market prediction Using Twitter as a source of information for stock market prediction Ramon Xuriguera (rxuriguera@lsi.upc.edu) Joint work with Marta Arias and Argimiro Arratia ERCIM 2011, 17-19 Dec. 2011, University of

More information

How can we discover stocks that will

How can we discover stocks that will Algorithmic Trading Strategy Based On Massive Data Mining Haoming Li, Zhijun Yang and Tianlun Li Stanford University Abstract We believe that there is useful information hiding behind the noisy and massive

More information