Computational Linguistics and Learning from Big Data. Gabriel Doyle UCSD Linguistics

From not enough data to too much
- Finding people: 90s, 700 datapoints, 7 years
- People finding you: 00s, 30,000 datapoints, 3 years
- People just talking: 10s, 10,000 datapoints, 5 days

Big data
- Benefits
  - Cheap to collect
  - Unsolicited
  - Huge size
  - Covers rare events
- Problems
  - Little control
  - Noisy data
  - Difficult to analyze

Need for intelligent analysis
- Big data is too big to analyze dumbly
  - no one can read millions of tweets
- Analysis needed to establish
  - relevance: are they talking about what we're interested in?
  - meaning: what are they saying about it?
  - use: what does it mean to us?

Structured & Unstructured Data
- Surveys, focus groups, questionnaires, etc. yield structured data
  - we know what we're asking
  - we force the respondents to fit that structure
- Imposing structure is costly
  - can only get answers to the questions we ask
  - respondents can't tell us what they might think
  - need to design & implement the structure

Structured & Unstructured Data
- The internet / social media / devices provide unstructured data
- People tell us what they want to say, not what we want to know
- Modern computational linguistic analyses can bridge the gap between our interests
  - fewer constraints on data coming in
  - low cost to speaker, medium cost to analyst

The dangers of simplistic analysis
- Don't want ads for cutlery on a story about a stabbing
- Eastland Mall in Pittsburgh's closed BUT Eastland Mall in Bloomington isn't
- "I'm not happy the food was expensive" vs. "I'm happy the food was not expensive"
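To make the last contrast concrete, here is a minimal sketch (not from the slides) of why a word-counting analysis cannot separate the two restaurant sentences. The helper names `bag_of_words` and `word_after_not` are hypothetical, and whitespace tokenization is an assumption made for brevity.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (word order is discarded)."""
    return Counter(text.lower().split())

def word_after_not(text):
    """Return the token that directly follows 'not', or None if there is none."""
    tokens = text.lower().split()
    for i, tok in enumerate(tokens[:-1]):
        if tok == "not":
            return tokens[i + 1]
    return None

a = "I'm not happy the food was expensive"
b = "I'm happy the food was not expensive"

# Both sentences contain exactly the same words, so the bags are identical.
same_bag = bag_of_words(a) == bag_of_words(b)

# Only word order (a crude stand-in for structure) shows what is being negated.
negated_a = word_after_not(a)
negated_b = word_after_not(b)
```

Here `same_bag` is true even though the sentences mean opposite things, while the word following "not" differs ("happy" vs. "expensive"). Looking at the adjacent word is only a rough proxy; real negation scope needs a parse.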

Computational approaches
- Word-sense disambiguation
- Named-entity recognition
- Automated parsing
- Sentiment analysis
- Information extraction
- Topic modeling
Questions these address:
- what are people talking about?
- what are people saying about it?
- putting it together

Word-sense disambiguation
- Language is ambiguous: what does "mean" mean?
- Distinguish between multiple meanings of a word
  - "going to the park" vs. "will park my car"
  - connotations: "chintzy" cheap vs. "frugal" cheap
- can be done with supervision (e.g., WordNet) or unsupervised
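One simple way to pick between senses is gloss overlap, often called simplified Lesk: choose the sense whose definition shares the most words with the surrounding context. The sketch below is an illustration only. The two glosses in `SENSES` are hand-written stand-ins for a real lexical resource such as WordNet, and the stopword list is an arbitrary assumption.

```python
# Hypothetical toy glosses; a real system would draw these from a lexical resource.
SENSES = {
    "park_noun": "public green area with grass trees and benches for walking or recreation",
    "park_verb": "bring a car or vehicle to a stop and leave it in a spot or garage",
}

# Small, arbitrary stopword list (an assumption for this sketch).
STOPWORDS = {"a", "the", "to", "in", "of", "and", "or", "it", "for", "with", "my", "will", "i"}

def content_words(text):
    """Lowercased whitespace tokens with stopwords removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def disambiguate(context, senses=SENSES):
    """Return the sense whose gloss shares the most content words with the context."""
    ctx = content_words(context)
    return max(senses, key=lambda s: len(ctx & content_words(senses[s])))

sense_1 = disambiguate("walking on the grass at the park")
sense_2 = disambiguate("will park my car in the garage")
```

With these toy glosses the first context selects `park_noun` and the second `park_verb`. A context with no overlapping words falls back to whichever sense is listed first, which is one reason practical systems use richer features.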

Named-entity recognition
- Identifying names of people & things
  - finding out what people are talking about
- Identifies & connects information about an object
  - central to information extraction
- Can be tied to other modalities
  - identifying people in photos from captions (Berg et al 2004)

Cross-modal named-entities

Named-entity recognition

Named-entity resources
- ANNIE, Stanford NER
  - excellent performance on edited newsprint [90%+]
  - poor performance on tweets & social media [40-70%]
- Derczynski & Bontcheva 2014
  - increased noise-tolerance, post-editing improves performance to 84% on tweets

Automated parsing
- Extracting the structure of a sentence

Automated parsing
- Core step for getting specific semantic information
- Structure of a sentence has a huge effect on meaning
  - "I'm not happy the food was expensive"
  - "I'm happy the food was not expensive"
- Existing parsers are really good, as long as the text isn't too bad

Sentiment analysis
- Basic idea: what emotion is being expressed here?
  - who has the emotion?
  - what's the emotion directed at?
  - what reason is offered?
- Learning: train with known data and then extend to unknown
  - e.g., given a set of reviews, what features do the good/bad have?
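The "train with known data, then extend to unknown" idea can be sketched with a tiny Naive Bayes classifier over word counts. This is one possible learner, chosen here for brevity and not named in the slides. The four training reviews are invented for illustration, and the function names `train` and `classify` are hypothetical.

```python
import math
from collections import Counter

# Invented toy reviews with known labels (an assumption for this sketch).
TRAIN = [
    ("great food and friendly service", "good"),
    ("delicious meal great value", "good"),
    ("terrible service and cold food", "bad"),
    ("overpriced and bland meal", "bad"),
]

def train(examples):
    """Count words per label and documents per label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for w in text.split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
label_1 = classify("great friendly meal", model)
label_2 = classify("cold bland food", model)
```

On this toy data the first unseen review is labeled "good" and the second "bad". Because the model only counts words, it inherits the negation problem from the earlier restaurant example, which is the motivation for combining sentiment with parsing.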

Sentiment analysis + parsing
- Socher et al 2013: sentiment percolates up a parse tree
- "This movie doesn't care about [anything good]"

Topic models
- Want to bundle documents/words into groups covering similar topics (Blei, Ng, & Jordan 03)
- Intuition: Words appearing in the same document are more likely to be related
- Documents built by choosing topics then choosing words from topics
- Topic model infers the topics per document & words per topic
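The generative story ("choose a topic, then choose a word from that topic") can be simulated in a few lines. This sketch covers only the forward direction. Inference, which recovers topics from observed documents, is the hard part and is not shown. The topic names and word probabilities in `TOPICS` are made up, and the Dirichlet priors of the full model are left out.

```python
import random

# Hypothetical topics: each maps words to probabilities that sum to 1.
TOPICS = {
    "computers": {"computer": 0.5, "internet": 0.3, "laptop": 0.2},
    "shopping": {"store": 0.4, "buy": 0.4, "price": 0.2},
}

def generate_document(topic_mix, n_words, rng):
    """For each word slot: sample a topic from the mix, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_mix), weights=list(topic_mix.values()))[0]
        dist = TOPICS[topic]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append((topic, word))
    return words

# A document that is mostly about computers, with some shopping vocabulary.
doc = generate_document({"computers": 0.8, "shopping": 0.2}, 10, random.Random(0))
```

Each entry of `doc` pairs a word with the topic that produced it. A topic model sees only the words and has to infer both the per-document topic mix and the per-topic word distributions.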

Buying a computer
- Computers: 45% (computer: 23%, internet: 14%, laptop: 12%)
- Shopping: 13% (store: 20%, buy: 19%, price: 11%)
- Research: 19%
"When it came time to upgrade our computer, when I had to figure out the meanings of solid-state drives and quad-cores, I headed to the Internet to do my research, finding the right stores and the right sites to answer my questions"

Topic models
- Good for general semantic classification
  - grouping news stories, blog posts, etc.
  - categorizing documents into known classes
- Many extensions, not just text
  - timeseries data, author recognition
  - connecting text to images (Costa Pereira et al 13)
  - financial data (Doyle & Elkan 09)
  - Pompeiian households (Mimno 09)

Information extraction
- Produces a structured representation of information ("knowledge base")
  - human-readable or machine-readable
- information as relations between entities
  - throw(quarterback,pass)
- within- or across-document learning

IE example: learning football
- Hovy et al 2011: Unsupervised Discovery of Domain-Specific Knowledge from Text
- "The last time the Detroit Lions won a game in the Metrodome, Scott Mitchell threw a touchdown pass to Herman Moore"
  - throw(scottmitchell,touchdown,hermanmoore)
  - is.a(scottmitchell,quarterback)
  - is.a(hermanmoore,widereceiver)
  - throw(qb,touchdown,wr)
- "Big, young, talented and inexperienced, Scott Mitchell, the former backup quarterback for the Miami Dolphins, was in prime position to profit"
- "Lions wide receiver Herman Moore reflects on the Detroit-Chicago rivalry"
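The step from a specific proposition to a general rule can be illustrated by storing relations as tuples and replacing each entity with its known class. This is a toy sketch of the idea only, not the procedure of Hovy et al. The names `generalize` and `format_prop` are hypothetical, and the full class names are used where the slide abbreviates them as qb and wr.

```python
# Specific proposition extracted from one sentence: (predicate, arg1, arg2, ...).
fact = ("throw", "scottmitchell", "touchdown", "hermanmoore")

# is.a knowledge gathered from other sentences or documents.
is_a = {
    "scottmitchell": "quarterback",
    "hermanmoore": "widereceiver",
}

def generalize(fact, is_a):
    """Replace each argument that has a known class with that class."""
    predicate, *args = fact
    return (predicate, *[is_a.get(arg, arg) for arg in args])

def format_prop(prop):
    """Render a tuple in the predicate(arg,arg,...) notation used on the slide."""
    predicate, *args = prop
    return f"{predicate}({','.join(args)})"

rule = generalize(fact, is_a)
rule_text = format_prop(rule)
```

Here `rule_text` is `throw(quarterback,touchdown,widereceiver)`. Arguments without a known class, such as `touchdown`, pass through unchanged.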

IE example: learning football
- Parse input using automated parser
- Use parse + named entities to build semantic structure
- Use multiple levels of semantic representation to identify general rules
- Learn on 33,000 New York Times articles
- 95% sensible propositions extracted

Overview
- Big data demands intelligent analysis
  - methods are out there already
  - plus new ones all the time
- Think through the problem you want to solve
  - what data sources do you have?
  - what information would you ask for if you could?
  - what structure do you want to impose?
  - which method(s) yield that structure?

Computational methods summary
- Automated parsing
  - basic step in structuring natural language data
  - "won't fail, will buy" vs. "will fail, won't buy"
  - key to extracting specific information
- Word-sense disambiguation
  - basic step for assessing what's being discussed
  - toilet tank vs. military tank
  - makes sure you're looking at relevant data

Computational methods summary
- Sentiment analysis
  - general emotional assessment
  - automatic ratings, user triage
  - noisy due to irony, sarcasm, etc.
- Named-entity recognition
  - figuring out the lexicon
  - what do people talk about?
  - building knowledge of things

Computational methods summary
- Topic models
  - document-level semantic classification
  - overall gist of an article
  - good for multimedia linkages
- Information Extraction
  - specific semantic structures
  - Who's doing what to whom?
  - establishing rules & knowledge

Overall summary
- Computational methods exist to structure large-scale unstructured data
- Identify what structure you want to get out
  - find the class of methods that develop such structure
  - combine multiple methods if necessary
- Test extensively!
  - lots of noise in unstructured data

Starting-Point References
- NER: Derczynski & Bontcheva 2014, Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Recognizing Person Entities in Tweets
- NER/MM: Berg, Berg, Edwards, & Forsyth 2004, Who's in the Picture?
- Sentiment: Socher, Bauer, Manning, & Ng 2013, Parsing with Compositional Vector Grammars
- IE: Hovy, Zhang, Hovy, & Peñas 2011, Unsupervised Discovery of Domain-Specific Knowledge from Text