Computational Linguistics and Learning from Big Data. Gabriel Doyle, UCSD Linguistics

From not enough data to too much. Finding people: 1990s, 700 datapoints, 7 years. People finding you: 2000s, 30,000 datapoints, 3 years. People just talking: 2010s, 10,000 datapoints, 5 days.

Big data. Benefits: cheap to collect, unsolicited, huge size, covers rare events. Problems: little control, noisy data, difficult to analyze.

Need for intelligent analysis. Big data is too big to analyze dumbly: no one can read millions of tweets. Analysis is needed to establish relevance (are they talking about what we're interested in?), meaning (what are they saying about it?), and use (what does it mean to us?).

Structured & Unstructured Data. Surveys, focus groups, questionnaires, etc. yield structured data: we know what we're asking, and we force the respondents to fit that structure. Imposing structure is costly: we can only get answers to the questions we ask, respondents can't tell us what they might think, and we need to design & implement the structure.

Structured & Unstructured Data. The internet, social media, and devices provide unstructured data: people tell us what they want to say, not what we want to know. Modern computational linguistic analyses can bridge the gap between our interests and what people say: fewer constraints on data coming in; low cost to speaker, medium cost to analyst.

The dangers of simplistic analysis. Don't want ads for cutlery on a story about a stabbing. Eastland Mall in Pittsburgh's closed, BUT Eastland Mall in Bloomington isn't. "I'm not happy the food was expensive" vs. "I'm happy the food was not expensive".
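
One way to see why the last pair of sentences defeats a simplistic analysis is to count words while ignoring their order. The sketch below is only an illustration (the function name `bag_of_words` is made up here): both sentences reduce to exactly the same bag of words, so any method that looks only at word counts cannot tell them apart.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase the text and count word tokens, ignoring their order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

unhappy = "I'm not happy the food was expensive"
happy = "I'm happy the food was not expensive"

# Same words in a different order: the two bags are identical.
same_bag = bag_of_words(unhappy) == bag_of_words(happy)
print(same_bag)
```

Telling the two apart requires knowing what "not" attaches to, which is the job of parsing.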

Computational approaches: word-sense disambiguation, named-entity recognition, automated parsing, sentiment analysis, information extraction, topic modeling. What are people talking about? What are people saying about it? Putting it together.

Word-sense disambiguation. Language is ambiguous: what does "mean" mean? Distinguish between multiple meanings of a word: "going to the park" vs. "will park my car"; connotations: chintzy "cheap" vs. frugal "cheap". Can be done with supervision (e.g., WordNet) or unsupervised.
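
A minimal sketch of dictionary-based disambiguation, in the spirit of the simplified Lesk approach: pick the sense whose definition shares the most words with the sentence. The two-sense inventory, the stopword list, and the function names are all invented for this illustration; a real system would take its glosses from a resource such as WordNet.

```python
# Hypothetical two-sense inventory for "park" (glosses written for this sketch).
SENSES = {
    "park_noun": "public area of land with grass and trees for recreation",
    "park_verb": "leave a car or other vehicle in a place for a time",
}
STOPWORDS = {"a", "an", "the", "of", "to", "in", "for", "with", "and", "or",
             "my", "i", "will", "on"}

def content_words(text):
    """Split on whitespace and drop very common words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def disambiguate(sentence, senses=SENSES):
    """Return the sense whose gloss overlaps most with the sentence."""
    context = content_words(sentence)
    return max(senses, key=lambda s: len(context & content_words(senses[s])))

print(disambiguate("I will park my car near the office"))
print(disambiguate("going to the park to sit on the grass under the trees"))
```

With no overlapping words at all (as in the bare phrase "going to the park"), this sketch simply falls back to the first sense listed, which is one reason real systems use richer context and more training data.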

Named-entity recognition. Identifying names of people & things: finding out what people are talking about. Identifies & connects information about an object; central to information extraction. Can be tied to other modalities, e.g., identifying people in photos from captions (Berg et al. 2004).

Cross-modal named-entities

Named-entity recognition

Named-entity resources. ANNIE, Stanford NER: excellent performance on edited newsprint [90%+], poor performance on tweets & social media [40-70%]. Derczynski & Bontcheva 2014: increased noise-tolerance and post-editing improve performance to 84% on tweets.
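
To give a feel for why performance drops on social media, here is a deliberately naive entity spotter that relies on capitalization alone. It is a toy heuristic written for this illustration, not how ANNIE or Stanford NER work, and the tweet is made up. It does fine on an edited sentence and finds nothing once the capitalization cue disappears.

```python
import re

# Runs of capitalized words, e.g. "Herman Moore" or "Detroit Lions".
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

def naive_entities(text):
    """Return capitalized word runs, skipping a run that starts the text."""
    return [m.group() for m in CAPITALIZED_RUN.finditer(text) if m.start() > 0]

edited = ("The last time the Detroit Lions won a game in the Metrodome, "
          "Scott Mitchell threw a touchdown pass to Herman Moore")
tweet = "omg scott mitchell just threw a td to herman moore"  # hypothetical tweet

print(naive_entities(edited))
print(naive_entities(tweet))
```

Real recognizers use many more cues than capitalization, but they are still trained mostly on edited text, so noisy input removes much of what they depend on.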

Automated parsing. Extracting the structure of a sentence.

Automated parsing. Core step for getting specific semantic information. Structure of a sentence has a huge effect on meaning: "I'm not happy the food was expensive" vs. "I'm happy the food was not expensive". Existing parsers are really good, as long as the text isn't too bad.

Sentiment analysis. Basic idea: what emotion is being expressed here? Who has the emotion? What's the emotion directed at? What reason is offered? Learning: train with known data and then extend to unknown, e.g., given a set of reviews, what features do the good/bad ones have?
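
The "train with known data, then extend to unknown" idea can be sketched with a very small word-weighting scheme: words seen more often in good reviews get positive weights, words seen more often in bad reviews get negative ones. The four training reviews are invented for this illustration, and the smoothed log-ratio used here is just one simple choice among many.

```python
from collections import Counter
import math

# Hypothetical labeled reviews standing in for the "known data".
TRAIN = [
    ("the food was great and the service was friendly", "good"),
    ("great value and friendly staff", "good"),
    ("the food was cold and the service was slow", "bad"),
    ("slow service and rude staff", "bad"),
]

def train(examples):
    """Give each word a weight: log of its smoothed good/bad frequency ratio."""
    counts = {"good": Counter(), "bad": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    vocab = set(counts["good"]) | set(counts["bad"])
    totals = {label: sum(c.values()) for label, c in counts.items()}
    weights = {}
    for word in vocab:
        p_good = (counts["good"][word] + 1) / (totals["good"] + len(vocab))
        p_bad = (counts["bad"][word] + 1) / (totals["bad"] + len(vocab))
        weights[word] = math.log(p_good / p_bad)
    return weights

def classify(text, weights):
    """Sum the weights of known words; unseen words contribute nothing."""
    score = sum(weights.get(word, 0.0) for word in text.split())
    return "good" if score > 0 else "bad"

weights = train(TRAIN)
print(classify("friendly staff and great food", weights))
print(classify("the staff was rude and slow", weights))
```

The learned weights are the "features of good and bad reviews"; words that appear equally often in both classes, such as "the" or "staff" here, end up with weight zero.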

Sentiment analysis + parsing. Socher et al. 2013: sentiment percolates up a parse tree. "This movie doesn't care about [anything good]".
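
A rough sketch of what "percolating up a parse tree" can mean, using hand-written rules rather than the learned neural model of Socher et al.: each word has a polarity, each constituent sums its children, and a negator flips whatever follows it inside the same constituent. The word polarities and the bracketings below are written by hand for this illustration; in practice the tree would come from an automated parser.

```python
# Hypothetical word polarities; a trained model would supply these.
POLARITY = {"good": 1, "happy": 1, "expensive": -1}
NEGATORS = {"not", "doesn't"}

def sentiment(tree):
    """Score a parse tree given as nested tuples of words and subtrees."""
    if isinstance(tree, str):
        return POLARITY.get(tree, 0)
    score, sign = 0, 1
    for child in tree:
        if isinstance(child, str) and child in NEGATORS:
            sign = -1  # flips every later child of this constituent
        else:
            score += sign * sentiment(child)
    return score

# Hand-built bracketings (assumed, not parser output).
unhappy_tree = ("I'm", (("not", "happy"), ("the", "food", "was", "expensive")))
happy_tree = ("I'm", ("happy", ("the", "food", "was", ("not", "expensive"))))
movie_tree = (("this", "movie"),
              ("doesn't", ("care", ("about", ("anything", "good")))))

print(sentiment(unhappy_tree), sentiment(happy_tree), sentiment(movie_tree))
```

The two food sentences contain the same words but get opposite scores because "not" sits in a different constituent, and the positive "anything good" ends up negative once it passes through "doesn't".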

Topic models. Want to bundle documents/words into groups covering similar topics (Blei, Ng, & Jordan 2003). Intuition: words appearing in the same document are more likely to be related. Documents are built by choosing topics, then choosing words from topics. The topic model infers the topics per document & words per topic.
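
The "choose topics, then choose words from topics" story can be sketched as a tiny generator. The two topics, their word probabilities, and the function name are invented for this illustration. This shows only the forward (generative) direction; an actual topic model runs the inference in reverse, recovering the topics and mixtures from observed documents.

```python
import random

# Hypothetical topics: each maps words to probabilities that sum to 1.
TOPICS = {
    "computers": {"computer": 0.5, "laptop": 0.3, "internet": 0.2},
    "shopping": {"store": 0.4, "buy": 0.4, "price": 0.2},
}

def generate_document(topic_mix, n_words, rng):
    """For each word slot, draw a topic from the mix, then a word from it."""
    names = list(topic_mix)
    mix_weights = [topic_mix[name] for name in names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=mix_weights)[0]
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return words

# A document that is mostly about computers, with a little shopping.
doc = generate_document({"computers": 0.8, "shopping": 0.2}, 10, random.Random(0))
print(doc)
```

Because each document mixes topics in its own proportions, words from the same topic tend to co-occur across documents, which is the signal inference exploits.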

Buying a computer. Computers: 45% (computer: 23%, internet: 14%, laptop: 12%). Shopping: 13% (store: 20%, buy: 19%, price: 11%). Research: 19%. "When it came time to upgrade our computer, when I had to figure out the meanings of solid-state drives and quad-cores, I headed to the Internet to do my research, finding the right stores and the right sites to answer my questions."
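
Reading the slide's figures as topic shares for this document and word probabilities within each topic (an interpretation, not something the slide states outright), the chance of seeing a given word is the sum over topics of topic share times word probability. The sketch below does that arithmetic for the two topics whose words are listed; it assumes a word has probability 0 in any topic where it is not listed, and leaves out the Research topic because its words are not given.

```python
# Figures read off the slide. Unlisted words are treated as probability 0
# within a topic, which is an assumption of this sketch.
doc_topics = {"computers": 0.45, "shopping": 0.13}
topic_words = {
    "computers": {"computer": 0.23, "internet": 0.14, "laptop": 0.12},
    "shopping": {"store": 0.20, "buy": 0.19, "price": 0.11},
}

def word_probability(word):
    """Sum over topics of (topic share in document) * (word share in topic)."""
    return sum(share * topic_words[topic].get(word, 0.0)
               for topic, share in doc_topics.items())

print(round(word_probability("computer"), 4))  # 0.45 * 0.23
print(round(word_probability("store"), 4))     # 0.13 * 0.20
```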

Topic models. Good for general semantic classification: grouping news stories, blog posts, etc.; categorizing documents into known classes. Many extensions, not just text: timeseries data, author recognition, connecting text to images (Costa Pereira et al. 2013), financial data (Doyle & Elkan 2009), Pompeiian households (Mimno 2009).

Information extraction. Produces a structured representation of information (a "knowledge base"), human-readable or machine-readable. Information as relations between entities: throw(quarterback,pass). Within- or across-document learning.

IE example: learning football. Hovy et al. 2011, Unsupervised Discovery of Domain-Specific Knowledge from Text. "The last time the Detroit Lions won a game in the Metrodome, Scott Mitchell threw a touchdown pass to Herman Moore." throw(scottmitchell,touchdown,hermanmoore); is.a(scottmitchell,quarterback); is.a(hermanmoore,widereceiver); throw(qb,touchdown,wr). "Big, young, talented and inexperienced, Scott Mitchell, the former backup quarterback for the Miami Dolphins, was in prime position to profit." "Lions wide receiver Herman Moore reflects on the Detroit-Chicago rivalry."
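
The step from the specific proposition to the general rule can be sketched as a substitution: replace each named entity with the class that an is.a relation assigns it. This is only an illustration of the representation, not the procedure of Hovy et al.; the `Proposition` type and `generalize` function are names made up here, and the classes are spelled out where the slide abbreviates them as qb and wr.

```python
from collections import namedtuple

# A proposition is a predicate plus an ordered tuple of arguments.
Proposition = namedtuple("Proposition", ["predicate", "args"])

# is.a relations, as learned from the other two sentences on the slide.
IS_A = {"scottmitchell": "quarterback", "hermanmoore": "widereceiver"}

def generalize(prop, is_a=IS_A):
    """Replace each entity that has a known class; keep other arguments."""
    return Proposition(prop.predicate,
                       tuple(is_a.get(arg, arg) for arg in prop.args))

specific = Proposition("throw", ("scottmitchell", "touchdown", "hermanmoore"))
general = generalize(specific)
print(general)
```

Combining facts found in different documents (who threw what, who plays which position) is what the slide calls across-document learning.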

IE example: learning football. Parse input using an automated parser. Use parse + named entities to build semantic structure. Use multiple levels of semantic representation to identify general rules. Learned on 33,000 New York Times articles; 95% sensible propositions extracted.

Overview. Big data demands intelligent analysis: methods are out there already, plus new ones all the time. Think through the problem you want to solve: What data sources do you have? What information would you ask for if you could? What structure do you want to impose? Which method(s) yield that structure?

Computational methods summary. Automated parsing: basic step in structuring natural language data; "won't fail, will buy" vs. "will fail, won't buy"; key to extracting specific information. Word-sense disambiguation: basic step for assessing what's being discussed; "toilet tank" vs. "military tank"; makes sure you're looking at relevant data.

Computational methods summary. Sentiment analysis: general emotional assessment; automatic ratings, user triage; noisy due to irony, sarcasm, etc. Named-entity recognition: figuring out the lexicon; what do people talk about?; building knowledge of things.

Computational methods summary. Topic models: document-level semantic classification; overall gist of an article; good for multimedia linkages. Information extraction: specific semantic structures; who's doing what to whom?; establishing rules & knowledge.

Overall summary. Computational methods exist to structure large-scale unstructured data. Identify what structure you want to get out: find the class of methods that develop such structure; combine multiple methods if necessary. Test extensively! There is lots of noise in unstructured data.

Starting-Point References. NER: Derczynski & Bontcheva 2014, Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Recognizing Person Entities in Tweets. NER/MM: Berg, Berg, Edwards, & Forsyth 2004, Who's in the Picture? Sentiment: Socher, Bauer, Manning, & Ng 2013, Parsing with Compositional Vector Grammars. IE: Hovy, Zhang, Hovy, & Peñas 2011, Unsupervised Discovery of Domain-Specific Knowledge from Text.