Information Retrieval, Information Extraction and Social Media Analytics

Transcription

1 Anwendersoftware: Information Retrieval, Information Extraction and Social Media Analytics. Based on chapter 10 of the Advanced Information Management lecture. Laura Kassner, Universität Stuttgart, Winter Term 2014

2 Overview: Information Retrieval (Introduction, Relevance Ranking, TF-IDF, Similarity-Based Retrieval, Measuring Retrieval Effectiveness, Concept-Based Querying); Information Extraction (Text Analytics); Social Media Analytics (Introduction, SMA on structured data, Sentiment Detection, Examples/Discussion)

3 Information Retrieval Systems: IR systems use a simpler data model than database systems. Information is organized as a collection of documents; documents are unstructured and have no schema. Goal: locate relevant documents based on user input such as keywords or example documents, e.g., find documents containing the words "database systems". [Figure: the query "database system" is input to the IR system, which searches a collection of documents (document_x, document_y, document_z).] IR also works on textual descriptions provided with non-textual data such as images. Examples: Web search engines, desktop file search. Dr. Holger Schwarz, Universität Stuttgart, IPVS

4 Information Retrieval Systems (differences from database systems): IR systems have no transactional updates (including concurrency control and recovery). Database systems deal with structured data, with schemas that define the data organization. IR systems deal with some querying issues not generally addressed by database systems: - approximate searching by keywords - ranking of retrieved answers by estimated degree of relevance

5 Keyword Search: In full-text retrieval, all the words in each document are considered to be keywords (word in a document = term). Query expressions consist of keywords and the logical connectives "and", "or", and "not"; "and" is implicit for queries with several words. Ranking of documents on the basis of estimated relevance to a query is critical! Factors for relevance: Term frequency: frequency of occurrence of the query keyword in the document. Inverse document frequency: how many documents the query keyword occurs in; the fewer documents, the more importance is given to the keyword. Hyperlinks to documents: the more links point to a document, the more important the document is (cf. PageRank).

6 Document Indexing: An inverted index maps each keyword $K_i$ to the set $S_i$ of documents that contain the keyword; documents are identified by identifiers. The inverted index may also record keyword locations within a document, to allow proximity-based ranking, and counts of the number of occurrences of a keyword, to compute TF. The "and" operation finds documents that contain all of $K_1, K_2, \dots, K_n$: the intersection $S_1 \cap S_2 \cap \dots \cap S_n$. The "or" operation finds documents that contain at least one of $K_1, K_2, \dots, K_n$: the union $S_1 \cup S_2 \cup \dots \cup S_n$. Each $S_i$ is kept sorted to allow efficient intersection/union by merging; "not" can also be implemented efficiently by merging of sorted lists.
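
The indexing scheme on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the three toy documents, the whitespace tokenization, and the function names `build_inverted_index`, `intersect`, and `union` are assumptions made for the example and do not come from the lecture.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def intersect(a, b):
    """'and': merge two sorted posting lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """'or': merge two sorted posting lists, keeping every id exactly once."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


# Toy collection (hypothetical documents, for illustration only).
docs = {
    1: "database systems store data",
    2: "retrieval systems rank documents",
    3: "database retrieval systems",
}
index = build_inverted_index(docs)
both = intersect(index["database"], index["retrieval"])
either = union(index["database"], index["retrieval"])
print(both, either)
```

Because both posting lists are sorted, each merge touches every entry at most once, which is why the slide insists on keeping each set sorted.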

7 Relevance Ranking Using Terms: TF-IDF (term frequency / inverse document frequency) ranking. Notation: $n(d)$ = number of terms in the document $d$; $n(d, t)$ = number of occurrences of term $t$ in the document $d$; $n(t)$ = number of documents containing term $t$. Relevance of a document $d$ to a term $t$: $TF(d, t) = \log\left(1 + \frac{n(d, t)}{n(d)}\right)$, where the log factor avoids giving excessive weight to frequent terms. Relevance of a term $t$ in the document collection $D$: $IDF(t) = \log\frac{|D|}{n(t)}$.

8 Relevance Ranking Using Terms: Relevance of document $d$ to term $t$: $r(d, t) = TF(d, t) \cdot IDF(t)$. Relevance of document $d$ to query $Q$: $r(d, Q) = \sum_{t \in Q} \frac{TF(d, t)}{n(t)}$.

9 Relevance Ranking Using Terms (example): Assume: - document A of 100 words contains the term "database" 3 times and the term "system" 6 times - the document base D consists of 1 million documents - 1,000 documents contain the term "database" and 50,000 documents contain the term "system" (counts as implied by the IDF values below). Logarithms are base 10. Relevance of document A to a term: TF(A, "database") = log(1 + 3/100) ≈ 0.013; TF(A, "system") = log(1 + 6/100) ≈ 0.025. Relevance of a term in the document collection D: IDF("database") = log(1000) = 3; IDF("system") = log(20) ≈ 1.301. TF-IDF(A, "database") = 0.013 · 3 = 0.039; TF-IDF(A, "system") = 0.025 · 1.301 ≈ 0.033.
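
The arithmetic of this example can be checked with a short script. It is a sketch under stated assumptions: the logarithm is taken to be base 10 (which is what log(1000) = 3 on the slide suggests), and the document counts of 1,000 and 50,000 are chosen so that |D| / n(t) equals the 1000 and 20 that appear in the slide's IDF lines. The helper names `tf` and `idf` are made up for the illustration.

```python
import math


def tf(term_count, doc_length):
    """TF(d, t) = log10(1 + n(d, t) / n(d))."""
    return math.log10(1 + term_count / doc_length)


def idf(collection_size, docs_with_term):
    """IDF(t) = log10(|D| / n(t))."""
    return math.log10(collection_size / docs_with_term)


COLLECTION_SIZE = 1_000_000  # |D|: 1 million documents
DOC_LENGTH = 100             # n(A): document A has 100 words

# term -> (occurrences in A, assumed number of documents containing the term)
term_stats = {"database": (3, 1_000), "system": (6, 50_000)}

scores = {}
for term, (count_in_doc, docs_with_term) in term_stats.items():
    scores[term] = tf(count_in_doc, DOC_LENGTH) * idf(COLLECTION_SIZE, docs_with_term)
    print(f"{term}: TF-IDF = {scores[term]:.3f}")
```

Although "system" occurs twice as often in A as "database", its lower IDF gives it the smaller score, which is the effect the IDF factor is meant to produce.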

10 Relevance Ranking Using Terms: Most systems are more complex than that. Words that occur in the title, author list, section headings, etc. are given greater importance. Words whose first occurrence is late in the document are given lower importance. Very common words such as "a", "an", "the", "it" etc. are eliminated (stop words). Proximity: if the keywords of a query occur close together in the document, the document has higher importance than if they occur far apart. Documents are returned in decreasing order of relevance score (usually only the top n documents).

11 Similarity-Based Retrieval: Similarity-based retrieval retrieves documents similar to a given document A. Similarity may be defined on the basis of common words, e.g., find the k terms in A with the highest $TF(A, t) / n(t)$ and use these terms to find the relevance of other documents. Relevance feedback: similarity can be used to refine the answer set of a keyword query. The user selects a few relevant documents from those retrieved by the keyword query, and the system finds other documents similar to these.

12 Similarity-Based Retrieval: Vector space model: define an n-dimensional space, where n is the number of terms in the document set. The vector for document d goes from the origin to a point whose i-th coordinate is $TF(d, t_i) / n(t_i)$ for the i-th term $t_i$. The cosine of the angle between the vectors of two documents is used as a measure of their similarity. Usage in keyword search: transform the set of keywords into a document vector, calculate its cosine with every document vector in D, and use these cosines to rank documents for retrieval.
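
A minimal sketch of cosine ranking in the vector space model follows. Vectors are stored sparsely as term-to-weight dictionaries; the weights below are invented placeholders rather than real TF(d, t) / n(t) values, and the names `cosine`, `doc_vectors`, and `query` are assumptions for the illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (term -> weight dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical document vectors; a real system would fill in TF(d, t) / n(t).
doc_vectors = {
    "d1": {"database": 0.5, "system": 0.2},
    "d2": {"system": 0.4, "image": 0.3},
    "d3": {"image": 0.6},
}

# The keyword query is treated as a (very short) document vector.
query = {"database": 1.0, "system": 1.0}

ranking = sorted(doc_vectors, key=lambda d: cosine(query, doc_vectors[d]), reverse=True)
print(ranking)
```

Since the cosine depends only on the angle between vectors, a long and a short document with the same term proportions receive the same score.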

13 Measuring Retrieval Effectiveness: Information retrieval systems save space by using index structures that support only approximate retrieval. This may result in: false negatives (false drops): some relevant documents may not be retrieved; false positives: some irrelevant documents may be retrieved. For many applications, false positives are more tolerable than false negatives.

14 Measuring Retrieval Effectiveness: Relevant performance metrics: precision = |relevant documents ∩ retrieved documents| / |retrieved documents|, the percentage of retrieved documents that are relevant; recall = |relevant documents ∩ retrieved documents| / |relevant documents|, the percentage of relevant documents that were retrieved. [Figure: the retrieved documents shown against the relevant and not relevant documents.]
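
The two metrics can be computed directly from sets of document identifiers. The sketch below is illustrative only: the function name `precision_recall` and the document ids are invented, and the convention of returning 0.0 for an empty set is an assumption of this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
retrieved = [1, 2, 3, 4]
relevant = [1, 2, 3, 7, 8, 9]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Retrieving more documents can only keep recall the same or raise it, while precision typically falls as irrelevant documents enter the result.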

15 Measuring Retrieval Effectiveness: Recall vs. precision tradeoff: retrieving many documents increases recall, but reduces precision because many irrelevant documents are among them. Measures of retrieval effectiveness: recall as a function of the number of documents fetched, or precision as a function of recall (equivalently, as a function of the number of documents fetched). E.g., precision of 75% at a recall of 50%, and 60% at a recall of 75%. Problem: how to measure relevance in the first place.

16 Information Retrieval and Structured Data: Information retrieval systems originally treated documents as a collection of words. Information extraction systems infer structure from documents, e.g.: extraction of house attributes (size, address, number of bedrooms, etc.) from a text advertisement; extraction of the topic and the people named from a news article. Relations or XML structures are used to store the extracted data. The system seeks connections among the data to answer queries (question answering systems).

17 Concept-Based Querying: Approach: for each word, determine the concept it represents from context, using one or more ontologies: - hierarchical structures showing the relationships between concepts - e.g., elephant IS-A mammal. Ontologies can be used to standardize terminology in a specific field, can link multiple languages, and are a foundation of the Semantic Web (not covered here). Useful for building concept-based querying: information extraction. Which concepts make sense for this document collection? Which relations do we detect between concepts in this collection?

18 Concept Resource: WordNet. Lexical database of English verbs, nouns, and adjectives. Taxonomy of concepts as represented by words. Links concepts via semantic relations: synonyms (happy, glad), grouped into synsets; hypernyms and hyponyms (dog, mammal); meronyms (wheel, tire). Disambiguates word senses. Freely available. Equivalents exist for several natural languages, e.g. GermaNet.

20 Beyond Search: Information Extraction. Information retrieval only cares about retrieving documents containing certain content. Information extraction distills content from documents, i.e., it uses documents as a source for question answering, summary creation, compiling structured data, and discovering new facts and relations. This (often) requires text analytics!

21 Beyond Search: Text Analytics. Tokenization: splitting a text into words (tokens) - simple: on whitespace and punctuation - complex: what about compound nouns, multiword expressions, abbreviations, etc.? Sentence splitting: finding sentence boundaries - non-trivial: punctuation can also mark an abbreviation ('Dr. W. Jones is out of office today.'), not every sentence is delimited by punctuation (headlines), and what about mid-sentence quotes? Stemming / lemmatization: reducing words to base forms - e.g. running → run, horses → horse. Part-of-speech tagging: assigning each word its part of speech - tagsets cover noun, verb, preposition, adverb, etc. - challenges: ambiguous word class, e.g. 'I run a mile every day' vs. 'Today's run was great!' Chunking: combining several tokens into syntactic chunks, e.g. corresponding to noun phrases, prepositional phrases, adverbial phrases, etc. Parsing: assigning structure to entire sentences - constituent vs. dependency
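
The sentence-splitting problem named on this slide can be made concrete with a small sketch using only the standard `re` module. Everything here is an assumption made for illustration: the abbreviation list is a tiny hand-picked sample, and `tokenize`, `naive_split`, and `split_sentences` are hypothetical helper names, not functions of any toolkit.

```python
import re

# Tiny, hand-picked abbreviation list (an assumption for this example only).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof."}


def tokenize(text):
    """Simple tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def split_sentences(text):
    """Like naive_split, but do not break after known abbreviations or initials."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        is_abbreviation = (
            chunk.lower() in ABBREVIATIONS
            or re.fullmatch(r"[A-Z]\.", chunk) is not None  # initials such as "W."
        )
        if chunk[-1] in ".!?" and not is_abbreviation:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation, e.g. a headline
        sentences.append(" ".join(current))
    return sentences


text = "Dr. W. Jones is out of office today. He returns Monday."
print(naive_split(text))      # breaks after "Dr." and "W." as well
print(split_sentences(text))  # keeps the name together
print(tokenize("Today's run was great!"))
```

The abbreviation rule is still fragile: a sentence that really ends in a single capital letter would be merged with the next one, which is one reason trained sentence splitters are usually preferred over hand-written rules.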

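22 Text Analytics Example Pipeline: Text Files → Natural Language Processing et al. → Structured Information. Example input (German, from Wikipedia): "S-Klasse bezeichnet die Oberklasse der Automarke Mercedes-Benz. Sie steht für luxuriöse Limousinen und Coupés. Im Jahr 1972 erschien mit der Baureihe 116 die erste offiziell von Mercedes-Benz (MB) so bezeichnete S-Klasse." (English: "S-Klasse denotes the luxury class of the car brand Mercedes-Benz. It stands for luxurious sedans and coupés. In 1972, with model series 116, the first S-Klasse officially so named by Mercedes-Benz (MB) appeared.") Extracted structured information: Entstehungsjahr(S-Klasse): 1972 (year of origin); IS-A(S-Klasse, Luxusauto) (luxury car).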

23 Text Analytics Example Pipeline: Words → Parts of Speech → Named Entities → Sentence Structure, applied to the same S-Klasse example text (Wikipedia) as on slide 22. [Figure: annotated first sentence with the labels Verbs and Names and a parse tree: S consists of NP "S-Klasse (N)", VP "bezeichnet (VFIN)", and the NPs "die (ART) Oberklasse (N)", "der (ART) Automarke (N)", "Mercedes-Benz (N)".]

24 Text Analytics - Challenges. Language-specific: different structures, e.g. English / Turkish / Chinese. Statistical tools perform well, but training requires large amounts of (annotated) data; annotation is labor-intensive, so the best performance is usually achieved for English. Web data: often written by non-native speakers and full of slang, abbreviations, and nonstandard language, so robust tools for 'ungrammatical' input are needed. Domain-specific: narrow, fixed-structure idioms from one domain are easier to handle but may require manual calibration; free text with no topic restrictions is more difficult to process. Complexity: full-blown text analytics is costly and not always precise enough; for some applications, surface-level approaches such as regular expression pattern matching may be better suited.

25 Text Analytics Frameworks and Toolkits. Frameworks: Apache UIMA, GATE. Java toolkits: OpenNLP, Stanford CoreNLP. Python toolkits: NLTK, TextBlob.

27 Social Media Analytics: Central questions: Who cares about what on the web? What are people saying about [brand | person | event] online? Which topics are popular / trending? Positive or negative opinions? Which voices are influential? How does opinion spread? Can we identify recurring root causes? Are there correlations with [marketing campaigns | product releases | new strategies]? Company: Which products should I recommend to customer X based on his buying behavior? User: Which product should I buy? Is this movie worth watching? Do people like my blog?

28 Social Media Analytics: structured sources. Structured data sources: page views, clicks, likes, followers, friend graphs, retweet/reblog statistics.

30 Social Media Analytics: unstructured sources. Unstructured data sources: news texts, blog content, reviews, comment sections, tweets and status updates.

31 Sentiment Detection (a.k.a. opinion mining): Performed mainly on unstructured, free-text data sources. Research focus since the early 2000s: machine learning became available, large text collections became available (the internet), and the field was fed by interest in text summarization throughout the 1990s. Classifies text snippets or entire documents as subjective / objective, positive / negative / (neutral), and strongly or weakly opinionated (intensity). Connects sentiment to topics / entities, e.g. products, productions, persons.

32 Sentiment Detection: Not as easy as it seems

33 Text Features for Sentiment Detection: Features for sentiment and subjectivity classification. Keywords with positive or negative sentiment: frequency or occurrence (yes/no); occurrence is more effective. Bigram or trigram features? Conflicting evidence, but bag-of-words models are problematic, e.g. with regard to negation. Parts of speech: the only reliable feature is that frequent adjectives signal subjectivity. Syntax: no clear evidence that parsing is helpful, but syntactic knowledge helps identify valence shifters, e.g. negation, intensifiers, diminishers. Collocations / syntactic patterns may be useful. Predicate-argument combinations may carry sentiment where the single terms do not (latent sentiment), e.g. 'The price is low' = positive. Rule-based classification vs. machine learning approaches.
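
Why bag-of-words keyword features struggle with negation, and how valence shifters help, can be illustrated with a toy scorer. All word lists, weights, and function names below are invented for the example, and the rule that a shifter applies to the next sentiment word is a deliberate simplification, not the method of any system named in these slides.

```python
# Hypothetical mini-lexicons for the illustration.
LEXICON = {"good": 1, "great": 1, "excellent": 1, "bad": -1, "poor": -1}
NEGATIONS = {"not", "never"}
INTENSITY = {"very": 2.0, "slightly": 0.5}  # intensifier and diminisher


def bag_of_words_score(tokens):
    """Sum keyword polarities and ignore word order entirely."""
    return sum(LEXICON.get(token, 0) for token in tokens)


def shifted_score(tokens):
    """Apply pending negation/intensity to the next sentiment word."""
    score = 0.0
    negate = False
    weight = 1.0
    for token in tokens:
        if token in NEGATIONS:
            negate = True
        elif token in INTENSITY:
            weight = INTENSITY[token]
        elif token in LEXICON:
            value = LEXICON[token] * weight
            score += -value if negate else value
            negate, weight = False, 1.0  # shifters are consumed by this word
    return score


tokens = "the camera is not good".split()
print(bag_of_words_score(tokens))  # positive, although the sentence is negative
print(shifted_score(tokens))       # negative once the negation is applied
```

A real system would limit the scope of a negation with syntactic information; in this sketch the negation simply waits for the next lexicon word, however far away it is.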

34 Creating a Sentiment Dictionary: Hand-craft? Extremely time-consuming, and even human annotators do not agree on all polarities. Cluster terms according to frequencies, context, and constructions ('elegant but over-priced', 'clever and informative'): this yields 2 clusters, to which an orientation is then assigned (e.g., labeling the cluster with the higher average term frequency as positive seems to work). Use seed words with known polarity: find words with a similar distribution or co-occurrence, or which are synonymous, and propagate the polarity, e.g. across WordNet links.
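
The seed-word idea can be sketched as a breadth-first traversal over synonym links. The link table below is a hand-made stand-in, not real WordNet data, and `propagate_polarity` is a hypothetical helper; a practical version would also have to handle antonym links and conflicting evidence.

```python
from collections import deque


def propagate_polarity(seeds, links):
    """Spread seed polarities to every word reachable via synonym links."""
    polarity = dict(seeds)
    queue = deque(seeds)  # start from the seed words
    while queue:
        word = queue.popleft()
        for neighbor in links.get(word, ()):
            if neighbor not in polarity:  # first label wins
                polarity[neighbor] = polarity[word]
                queue.append(neighbor)
    return polarity


# Hypothetical synonym links (word -> similar words).
links = {
    "good": ["fine", "nice"],
    "nice": ["pleasant"],
    "bad": ["awful"],
    "awful": ["dreadful"],
}
seeds = {"good": +1, "bad": -1}

lexicon = propagate_polarity(seeds, links)
print(lexicon)
```

Words that are not connected to any seed stay unlabeled, so the choice of seed words determines the coverage of the resulting dictionary.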

35 Sentiment and Topic: What units are we looking at? Do we want to classify the document / paragraph / sentence / snippet? Local vs. global sentiment of a text. Distance between topic and sentiment term: same sentence, same paragraph, title of document? Topic-dependent sentiment: 'Wal-mart reports that profits rose' is positive in an article about Wal-mart, negative in an article about Target; 'the Samsung Galaxy S5 is better than the LG 3G' is positive for Samsung, negative for LG. Making things (slightly) easier: let the user specify which topic they want to consider. Discourse structure: headlines, position in paragraph, quoting and responding behavior in conversation threads.

36 Resources for Sentiment Detection: Polarity word lists / nets: English: Harvard General Inquirer, SentiWordNet; German: SentiWS. Reviews with both unstructured and structured content: labeled data for learning sentiment.

37 Social Media Analytics: Demographic Information. What kind of people talk about a product? Men, women, children? Parents? Do they own the product? Are they potential customers? Where do they live? Example: Username: supermama_10; Location: Houston, Texas; posts: 'I usually buy Pampers diapers, they are the best' and 'I gave my older daughter a Samsung S3 for Xmas, but now my husband uses it all the time lol'.

38 Social Media Analytics: a concrete architecture. IBM Social Media Analytics (Coutinho et al.)

40 Social Media Analytics: Refining Concepts. Concept suggestion component: select a representative sample of the gathered documents (downsampling); extract the most relevant terms from these documents as keywords; cluster documents based on these keywords. Control cluster: a cluster built using just the initially specified concepts. For clusters similar to the control cluster, add their keywords as new concept suggestions; for clusters different from the control cluster, add their keywords as blacklist suggestions. Feedback to user → refined concept selection → new crawl for documents.

42 Sentiment Detection and Concept Extraction: Sentiment detection (similar, published approach: WebFountain sentiment miner, which also belongs to IBM). Linguistic preprocessing: tokenization, POS tagging, parsing of phrase and sentence structures. Identify concepts and feature terms: a feature term stands in a part-of or attribute-of relationship with the concept or a known feature (e.g. 'lens' part-of 'camera', 'price' attribute-of 'camera'). Candidates: beginning definite base noun phrases, i.e. POS-tag/word sequences 'the NN', 'the JJ NN', 'the NN NN' etc. (NN = noun, JJ = adjective) (Yi et al., 2005).

43 Sentiment Detection and Concept Extraction: Sentiment detection. Sentiment lexicon: entries of the form <entry> <POS-tag> <polarity>, e.g. 'excellent JJ +'. Sentiment patterns: entries of the form <predicate> <sentence-category> <target>, where <predicate> is a verb; <sentence-category> is a subject phrase, object phrase, complement / adjective phrase or prepositional phrase, associated with a polarity + or - (flipped polarity on the target is signified by a ~ marker); and <target> is a subject or object phrase at which the sentiment is directed.

44 Sentiment Detection and Concept Extraction: Semantic relationship analysis: identify pattern elements from parse trees, starting with predicates. In a pattern, assign sentiment to the target based on the source sentiment. If the phrase or the sentence contains a negation, reverse the sentiment polarity. Precision: 86%, recall: 56%.

45 Social Media Analytics: a concrete architecture. IBM Social Media Analytics (Alper et al.; Coutinho et al.)

46 Resources / Further Reading
Information retrieval: Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Vol. 1. Cambridge: Cambridge University Press.
Sentiment detection: Pang, Bo, and Lillian Lee. "Opinion Mining and Sentiment Analysis." Foundations and Trends in Information Retrieval (2008).
Social media analytics: Coutinho, Fabio Cardoso, Alexander Lang, and Bernhard Mitschang. "Making Social Media Analysis More Efficient Through Taxonomy Supported Concept Suggestion." Proceedings of the BTW.
Alper, Basak, et al. "OpinionBlocks: Visualizing Consumer Reviews." Proceedings of the IEEE VisWeek Workshop on Interactive Text Analytics for Decision Making.
Yi, Jeonghee, and Wayne Niblack. "Sentiment Mining in WebFountain." Proceedings of the 21st ICDE.