Information Retrieval, Information Extraction and Social Media Analytics



Information Retrieval, Information Extraction and Social Media Analytics
Based on chapter 10 of the Advanced Information Management lecture
Laura Kassner, Anwendersoftware, Universität Stuttgart, Winter Term 2014

Overview
- Information Retrieval
  - Introduction
  - Relevance Ranking
  - TF-IDF
  - Similarity-Based Retrieval
  - Measuring Retrieval Effectiveness
  - Concept-Based Querying
- Information Extraction
  - Text Analytics
- Social Media Analytics
  - Introduction
  - SMA on structured data
  - Sentiment Detection
  - Examples/Discussion

Information Retrieval Systems
- Simpler data model than database systems
  - Information is organized as a collection of documents
  - Documents are unstructured and have no schema
- Goal: locate relevant documents based on user input
  - Keywords, e.g., find documents containing the words "database systems"
  - Example documents
- [Figure: the query "database system" is passed to the IR system, which searches a collection of documents (document_x, document_y, document_z)]
- Also works on textual descriptions provided with non-textual data such as images
- Examples: Web search engines, desktop file search
Dr. Holger Schwarz, Universität Stuttgart, IPVS

Information Retrieval Systems
Differences from database systems:
- IR systems do not deal with transactional updates (including concurrency control and recovery)
- Database systems deal with structured data, with schemas that define the data organization
- IR systems deal with some querying issues not generally addressed by database systems
  - Approximate searching by keywords
  - Ranking of retrieved answers by estimated degree of relevance

Keyword Search
- In full-text retrieval, all the words in each document are considered to be keywords
  - A word in a document is called a term
- Query expressions consist of keywords and the logical connectives "and", "or", and "not"
  - "and" is implicit for queries with several words
- Ranking of documents on the basis of estimated relevance to a query is critical
- Factors for relevance:
  - Term frequency: how often the query keyword occurs in the document
  - Inverse document frequency: how many documents the query keyword occurs in; the fewer documents, the more importance is given to the keyword
  - Hyperlinks to documents: the more links point to a document, the more important the document is (cf. PageRank)

Document Indexing
- An inverted index maps each keyword $K_i$ to the set $S_i$ of documents that contain the keyword
  - Documents are identified by identifiers
- The inverted index may record
  - Keyword locations within each document, to allow proximity-based ranking
  - Counts of the number of occurrences of each keyword, to compute TF
- "and" operation: finds documents that contain all of $K_1, K_2, \dots, K_n$
  - Intersection $S_1 \cap S_2 \cap \dots \cap S_n$
- "or" operation: finds documents that contain at least one of $K_1, K_2, \dots, K_n$
  - Union $S_1 \cup S_2 \cup \dots \cup S_n$
- Each $S_i$ is kept sorted to allow efficient intersection/union by merging
  - "not" can also be implemented efficiently by merging sorted lists
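The merging of sorted document lists can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the function names (`build_index`, `intersect`, `union`, `difference`) and the three toy documents are assumptions made for the example, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def build_index(docs):
    """Map each keyword to a sorted list of identifiers of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

def intersect(a, b):
    """'and': merge two sorted lists, keeping identifiers present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """'or': merge two sorted lists, keeping each identifier once."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def difference(a, b):
    """'not': identifiers in sorted list a that do not occur in sorted list b."""
    i = j = 0
    out = []
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] == b[j]:
            i += 1
            j += 1
        else:
            j += 1
    return out

# Hypothetical toy collection, used only for illustration.
docs = {1: "database system design", 2: "operating system", 3: "database tuning"}
index = build_index(docs)
print(intersect(index["database"], index["system"]))   # database and system
print(union(index["database"], index["system"]))       # database or system
print(difference(index["database"], index["system"]))  # database and not system
```

Because every list is sorted, each operation makes a single pass over its two inputs, which is why the index keeps each $S_i$ sorted.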

Relevance Ranking Using Terms
TF-IDF (term frequency / inverse document frequency) ranking:
- $n(d)$ = number of terms in the document $d$
- $n(d, t)$ = number of occurrences of term $t$ in the document $d$
- $n(t)$ = number of documents containing term $t$
- Relevance of a document $d$ to a term $t$:
  $TF(d, t) = \log\left(1 + \frac{n(d, t)}{n(d)}\right)$
  - The log factor avoids giving excessive weight to frequent terms
- Relevance of a term $t$ in the document collection $D$:
  $IDF(t) = \log\frac{|D|}{n(t)}$

Relevance Ranking Using Terms
- Relevance of document $d$ to term $t$:
  $r(d, t) = TF(d, t) \cdot IDF(t)$
- Relevance of document $d$ to query $Q$:
  $r(d, Q) = \sum_{t \in Q} \frac{TF(d, t)}{n(t)}$

Relevance Ranking Using Terms
Assume:
- Document A of 100 words contains the term "database" 3 times and the term "system" 6 times
- The document base D consists of 1 million documents
- 1000 documents contain the term "database"
- 50000 documents contain the term "system"
Relevance of document A to each term (logarithms are base 10):
- TF(A, "database") = log(1 + 3/100) = 0.013
- TF(A, "system") = log(1 + 6/100) = 0.025
Relevance of each term in the document collection D:
- IDF("database") = log(1000000/1000) = log(1000) = 3
- IDF("system") = log(1000000/50000) = log(20) = 1.301
Resulting scores:
- TF-IDF(A, "database") = 0.013 * 3 = 0.039
- TF-IDF(A, "system") = 0.025 * 1.301 = 0.033
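The worked example can be checked with a few lines of Python. This is a sketch under the assumption that the logarithms are base 10, which is what the figures on the slide imply; the function names `tf`, `idf`, and `tf_idf` are chosen for this example only.

```python
import math

def tf(occurrences, doc_length):
    """TF(d, t) = log10(1 + n(d, t) / n(d))."""
    return math.log10(1 + occurrences / doc_length)

def idf(collection_size, docs_with_term):
    """IDF(t) = log10(|D| / n(t))."""
    return math.log10(collection_size / docs_with_term)

def tf_idf(occurrences, doc_length, collection_size, docs_with_term):
    """r(d, t) = TF(d, t) * IDF(t)."""
    return tf(occurrences, doc_length) * idf(collection_size, docs_with_term)

# Numbers from the example: document A has 100 words, |D| = 1,000,000.
database_score = tf_idf(3, 100, 1_000_000, 1_000)
system_score = tf_idf(6, 100, 1_000_000, 50_000)
print(round(database_score, 3), round(system_score, 3))
```

Although "system" occurs twice as often in document A, "database" receives the higher score because it appears in far fewer documents of the collection.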

Relevance Ranking Using Terms
Most systems are more complex than that:
- Words that occur in the title, author list, section headings, etc. are given greater importance
- Words whose first occurrence is late in the document are given lower importance
- Very common words such as "a", "an", "the", "it" are eliminated (stop words)
- Proximity: if the query keywords occur close together in a document, the document is ranked higher than if they occur far apart
- Documents are returned in decreasing order of relevance score (usually only the top n documents)

Similarity-Based Retrieval
- Similarity-based retrieval: retrieve documents similar to a given document A
- Similarity may be defined on the basis of common words
  - E.g. find the k terms in A with the highest $TF(A, t) / n(t)$ and use these terms to compute the relevance of other documents
- Relevance feedback: similarity can be used to refine the answer set of a keyword query
  - The user selects a few relevant documents from those retrieved by the keyword query, and the system finds other documents similar to these

Similarity-Based Retrieval
Vector space model:
- Define an n-dimensional space, where n is the number of terms in the document set
- The vector for document $d$ goes from the origin to a point whose $i$-th coordinate is $TF(d, t_i) / n(t_i)$
- The cosine of the angle between the vectors of two documents is used as a measure of their similarity
Usage in keyword search:
- Transform the set of keywords into a document vector
- Calculate the cosine with every document vector in D
- Use these values to rank documents for retrieval
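A small sketch of the vector space model, assuming base-10 logarithms for TF as in the earlier example. The three documents, the helper names (`doc_vector`, `cosine`), and the choice to store vectors as sparse dictionaries are assumptions of this illustration; query terms that do not occur in the collection are simply ignored.

```python
import math
from collections import Counter

def tf(count, length):
    """TF(d, t) = log10(1 + n(d, t) / n(d))."""
    return math.log10(1 + count / length)

def doc_vector(text, doc_freq):
    """Sparse vector: coordinate for term t is TF(d, t) / n(t)."""
    words = text.lower().split()
    counts = Counter(words)
    return {t: tf(c, len(words)) / doc_freq[t]
            for t, c in counts.items() if t in doc_freq}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection, used only for illustration.
docs = {
    "d1": "database system design",
    "d2": "database system tuning",
    "d3": "social media analytics",
}
# n(t): number of documents containing term t.
doc_freq = Counter(t for text in docs.values() for t in set(text.lower().split()))
vectors = {d: doc_vector(text, doc_freq) for d, text in docs.items()}

# Keyword search: treat the query as a short document and rank by cosine.
query = doc_vector("database system", doc_freq)
ranking = sorted(docs, key=lambda d: cosine(query, vectors[d]), reverse=True)
print(ranking)
```

The two database documents score equally against the query, while the document that shares no term with it has cosine 0 and is ranked last.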

Measuring Retrieval Effectiveness
- Information retrieval systems save space by using index structures that support only approximate retrieval. This may result in:
  - False negatives (false drops): some relevant documents may not be retrieved
  - False positives: some irrelevant documents may be retrieved
- For many applications, false positives are more tolerable than false negatives

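Measuring Retrieval Effectiveness
Relevant performance metrics:
- Precision: percentage of retrieved documents that are relevant
  $\text{precision} = \frac{|\text{relevant documents} \cap \text{retrieved documents}|}{|\text{retrieved documents}|}$
- Recall: percentage of relevant documents that were retrieved
  $\text{recall} = \frac{|\text{relevant documents} \cap \text{retrieved documents}|}{|\text{relevant documents}|}$
- [Figure: the set of retrieved documents overlapping the set of relevant documents within the collection of relevant and non-relevant documents]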
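Both metrics follow directly from set intersection. In the sketch below, the function name `precision_recall` and the document identifiers are made up for illustration; the numbers are chosen so that the result is a precision of 75% at a recall of 50%.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
retrieved = ["d1", "d2", "d3", "d9"]
relevant = ["d1", "d2", "d3", "d4", "d5", "d6"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```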

Measuring Retrieval Effectiveness
- Recall vs. precision tradeoff:
  - Recall can be increased by retrieving many documents
  - This reduces precision, because many irrelevant documents are retrieved among them
- Measures of retrieval effectiveness:
  - Recall as a function of the number of documents fetched, or
  - Precision as a function of recall (equivalently, as a function of the number of documents fetched)
  - E.g. precision of 75% at a recall of 50%, and 60% at a recall of 75%
- Problem: how to measure relevance

Information Retrieval and Structured Data
- Information retrieval systems originally treated documents as a collection of words
- Information extraction systems infer structure from documents, e.g.:
  - Extraction of house attributes (size, address, number of bedrooms, etc.) from a text advertisement
  - Extraction of the topic and the people named from a news article
- Relations or XML structures are used to store the extracted data
  - The system seeks connections among the data to answer queries
- Question answering systems

Concept-Based Querying
- Approach
  - For each word, determine the concept it represents from context
  - Use one or more ontologies:
    - Hierarchical structure showing the relationships between concepts
    - E.g.: elephant IS-A mammal
- Ontologies can be used to standardize terminology in a specific field
- Ontologies can link multiple languages
- Foundation of the Semantic Web (not covered here)
- Useful for building concept-based querying: information extraction
  - Which concepts make sense for this document collection?
  - Which relations do we detect between concepts in this collection?

Concept Resource: WordNet
- Lexical database of English verbs, nouns, and adjectives: http://wordnet.princeton.edu/
- Taxonomy of concepts as represented by words
- Links concepts via semantic relations
  - Synonyms, e.g. happy, glad; grouped into synsets
  - Hypernyms and hyponyms, e.g. dog, mammal
  - Meronyms, e.g. wheel, tire
- Disambiguates word senses
- Freely available
- Equivalents exist for several natural languages, e.g. GermaNet

Beyond Search: Information Extraction
- Information Retrieval only cares about retrieving documents containing a certain content
- Information Extraction distills content from documents, i.e. uses documents as a source for
  - Question answering
  - Summary creation
  - Compiling structured data
  - Discovering new facts and relations
- This (often) requires text analytics!

Beyond Search: Text Analytics
- Tokenization: splitting a text into words (tokens)
  - Simple: on whitespace and punctuation
  - Complex: what about compound nouns, multiwords, abbreviations, etc.?
- Sentence splitting: finding sentence boundaries
  - Non-trivial: punctuation can also mark an abbreviation ('Dr. W. Jones is out of office today.'), not every sentence is delimited by punctuation (headlines), and what about mid-sentence quotes?
- Stemming / lemmatization: reducing words to base forms
  - E.g. running → run, horses → horse
- Part-of-speech tagging: assigning a word its part of speech
  - Noun, verb, preposition, adverb; tagsets
  - Challenges: ambiguous word class, e.g. 'I run a mile every day' vs. 'Today's run was great!'
- Chunking: combining several tokens into syntactic chunks, e.g. corresponding to noun phrases, prepositional phrases, adverbial...
- Parsing: assigning structure to entire sentences
  - Constituent vs. dependency
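The "simple" variants of tokenization and sentence splitting can be sketched with regular expressions from the Python standard library. The function names and patterns below are assumptions of this illustration, not part of any toolkit; the point is to show where the naive approach breaks on the abbreviation example above.

```python
import re

def tokenize(text):
    """Split into word tokens (keeping internal hyphens/apostrophes) and punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace; deliberately naive."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

tokens = tokenize("Today's run was great!")
print(tokens)

# Two sentences, but the periods after the abbreviations are mistaken for boundaries.
sentences = naive_sentences("Dr. W. Jones is out of office today. He is back on Monday.")
print(sentences)
```

The splitter returns four pieces instead of two, which is why practical sentence splitters use abbreviation lists or trained models rather than punctuation alone.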

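Text Analytics: Example Pipeline
- Pipeline: Text Files → Natural Language Processing et al. → Structured Information
- Input text (German Wikipedia): "S-Klasse bezeichnet die Oberklasse der Automarke Mercedes-Benz. Sie steht für luxuriöse Limousinen und Coupés. Im Jahr 1972 erschien mit der Baureihe 116 die erste offiziell von Mercedes-Benz (MB) so bezeichnete S-Klasse."
  - English: "S-Klasse (S-Class) denotes the luxury class of the car brand Mercedes-Benz. It stands for luxurious sedans and coupés. In 1972, with the 116 series, the first S-Klasse officially designated as such by Mercedes-Benz (MB) appeared."
- Extracted structured information:
  - Entstehungsjahr(S-Klasse): 1972 (year of origin)
  - IS-A(S-Klasse, Luxusauto) (Luxusauto = luxury car)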

Text Analytics: Example Pipeline
- Annotation layers: Words, Parts of Speech, Named Entities, Sentence Structure
- Sample text: the German Wikipedia passage on the S-Klasse from the previous slide
- [Figure: annotated first sentence. Part-of-speech tags: S-Klasse (N), bezeichnet (VFIN), die (ART), Oberklasse (N), der (ART), Automarke (N), Mercedes-Benz (N). Highlighted categories: verbs, names. Sentence structure: S consisting of NP and VP, with nested NPs]

Text Analytics: Challenges
- Language-specific:
  - Different structures, e.g. English / Turkish / Chinese
  - Statistical tools perform well, but training requires large amounts of (annotated) data; the best performance is usually achieved for English, and annotation is labor-intensive
- Web data: often written by non-native speakers and full of slang, abbreviations, and nonstandard language
  - Robust tools are needed for 'ungrammatical' input
- Domain-specific:
  - Narrow, fixed-structure idioms from one domain are easier to handle but may require manual calibration
  - Free text with no topic restrictions is more difficult to process
- Complexity: full-blown text analytics is costly and not always precise enough
  - For some applications, shallow approaches such as regular expression pattern matching may be better suited

Text Analytics Frameworks and Toolkits
- Frameworks:
  - Apache UIMA: http://uima.apache.org/
  - GATE: https://gate.ac.uk/
- Java toolkits:
  - OpenNLP: https://opennlp.apache.org/
  - Stanford CoreNLP: http://nlp.stanford.edu/software/corenlp.shtml
- Python toolkits:
  - NLTK: http://www.nltk.org/
  - TextBlob: http://textblob.readthedocs.org/

Social Media Analytics
Central questions: Who cares about what on the web?
- What are people saying about [brand | person | event] online?
- Which topics are popular / trending?
- Positive or negative opinions?
- Which voices are influential? How does opinion spread?
- Can we identify recurring root causes?
- Are there correlations with [marketing campaigns | product releases | new strategies]?
- Company: Which products should I recommend to customer X based on their buying behavior?
- User: Which product should I buy? Is this movie worth watching? Do people like my blog?

Social Media Analytics: structured sources
Structured data sources:
- Page views
- Clicks
- Likes
- Followers
- Friend graphs
- Retweet/reblog statistics

Social Media Analytics: unstructured sources
Unstructured data sources:
- News texts
- Blog content
- Reviews
- Comment sections
- Tweets and status updates

Sentiment Detection
- A.k.a. opinion mining
- Performed mainly on unstructured, free-text data sources
- Research focus since the early 2000s
  - Machine learning available
  - Large text collections available (the internet)
  - Fed by interest in text summarization throughout the 1990s
- Classifies text snippets or entire documents as
  - Subjective / objective
  - Positive / negative / (neutral)
  - Strongly or weakly opinionated (intensity)
- Connects sentiment to topics / entities, e.g. products, productions, persons

Sentiment Detection: not as easy as it seems

Text Features for Sentiment Detection
Features for sentiment and subjectivity classification:
- Keywords with positive or negative sentiment
  - Frequency
  - Occurrence (yes/no): more effective
- Bigram or trigram features?
  - Conflicting evidence, but bag-of-words models are problematic, e.g. with regard to negation
- Parts of speech
  - Only reliable feature: frequent adjectives signal subjectivity
- Syntax
  - No clear evidence that parsing is helpful
  - But: syntactic knowledge helps identify valence shifters, e.g. negation, intensifiers, diminishers
  - Collocations / syntactic patterns may be useful
- Predicate-argument combinations may carry sentiment where the single terms do not (latent sentiment)
  - 'The price is low' = positive
- Rule-based classification vs. machine learning approaches
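Why bag-of-words features struggle with negation can be shown with a tiny keyword scorer. Everything in this sketch is an assumption made for illustration: the word lists are hypothetical, and the rule that a negation word flips the next sentiment keyword is one simple way to treat a valence shifter, not a method prescribed by the lecture.

```python
# Hypothetical mini word lists, for illustration only.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}
NEGATIONS = {"not", "never", "no"}

def score(text, handle_negation=True):
    """Sum keyword polarities; optionally flip the keyword following a negation."""
    total = 0
    flip = False
    for raw in text.lower().split():
        token = raw.strip(".,!?")
        if token in NEGATIONS:
            flip = handle_negation
            continue
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if polarity:
            total += -polarity if flip else polarity
            flip = False
    return total

sentence = "The movie was not good"
print(score(sentence, handle_negation=False))  # plain keyword counting
print(score(sentence, handle_negation=True))   # negation treated as a valence shifter
```

Plain keyword counting rates the sentence as positive because it only sees "good"; the variant with the negation rule rates it as negative.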

Creating a Sentiment Dictionary
- Hand-craft?
  - Extremely time-consuming
  - Even human annotators do not agree on all polarities
- Cluster terms according to frequencies, context, and constructions
  - 'elegant but over-priced', 'clever and informative'
  - 2 clusters: assign orientation (e.g. labeling the cluster with the higher average frequency as positive seems to work)
- Use seed words with known polarity
  - Find words with a similar distribution or co-occurrence, or which are synonymous
  - Propagate polarity, e.g. across WordNet links
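Propagating polarity from seed words can be sketched as a breadth-first traversal over word links. The sketch assumes that words joined by 'and' share a polarity and words joined by 'but' have opposite polarity, in the spirit of the two constructions quoted above; the link list, the seed words, and the function name `propagate` are hypothetical.

```python
from collections import deque

def propagate(seeds, links):
    """Spread polarity (+1 / -1) from seed words across conjunction links.

    links: (word_a, word_b, same) triples, same=True for 'and', False for 'but'.
    """
    neighbours = {}
    for a, b, same in links:
        neighbours.setdefault(a, []).append((b, same))
        neighbours.setdefault(b, []).append((a, same))
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for other, same in neighbours.get(word, []):
            if other not in polarity:  # first assignment wins in this sketch
                polarity[other] = polarity[word] if same else -polarity[word]
                queue.append(other)
    return polarity

# Hypothetical seed words and observed constructions.
seeds = {"elegant": 1, "clever": 1}
links = [
    ("elegant", "over-priced", False),  # 'elegant but over-priced'
    ("clever", "informative", True),    # 'clever and informative'
    ("informative", "useful", True),    # 'informative and useful'
]
lexicon = propagate(seeds, links)
print(lexicon)
```

The same traversal would work over WordNet synonym links, treating every link as polarity-preserving.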

Sentiment and Topic
- What units are we looking at?
  - Do we want to classify the document / paragraph / sentence / snippet?
  - Local vs. global sentiment of a text
  - Distance between topic and sentiment term: same sentence, same paragraph, title of document?
- Topic-dependent sentiment
  - 'Wal-mart reports that profits rose': positive in an article about Wal-mart, negative in an article about Target
  - 'the Samsung Galaxy S5 is better than the LG 3G': positive for Samsung, negative for LG
  - Making things (slightly) easier: let users specify which topic they want to consider
- Discourse structure
  - Headlines, position in paragraph
  - Quoting and responding behavior in conversation threads

Resources for Sentiment Detection
- Polarity word lists / nets
  - English: Harvard General Inquirer http://www.wjh.harvard.edu/~inquirer/, SentiWordNet http://sentiwordnet.isti.cnr.it/
  - German: SentiWS http://asv.informatik.uni-leipzig.de/download/sentiws.html
- Reviews with both unstructured and structured content: labeled data for learning sentiment

Social Media Analytics: Demographic Information
- What kind of people talk about a product?
  - Men, women, children? Parents?
  - Do they own the product? Are they potential customers?
  - Where do they live?
- Example profile:
  - Username: supermama_10
  - Location: Houston, Texas
  - 'I usually buy Pampers diapers, they are the best'
  - 'I gave my older daughter a Samsung S3 for Xmas, but now my husband uses it all the time lol'

Social Media Analytics: a concrete architecture
[Figure: IBM Social Media Analytics architecture (Coutinho et al., 2013)]

Social Media Analytics: Refining Concepts
Refining concepts: concept suggestion component
- Select a representative sample of the gathered documents (downsampling)
- Extract the most relevant terms from these documents as keywords
- Cluster documents based on these keywords
  - Control cluster: using just the initially specified concepts
  - Similar to control cluster → add keywords as new concept suggestions
  - Different from control cluster → add keywords as blacklist suggestions
- Feedback to user → refined concept selection → new crawl for documents

Sentiment Detection and Concept Extraction
Sentiment detection (similar, published approach: WebFountain sentiment miner, which also belongs to IBM)
- Linguistic preprocessing:
  - Tokenization
  - POS tagging
  - Parsing phrase and sentence structures
- Identify concepts and feature terms
  - Part-of or attribute-of relationship with a concept or known feature (e.g. 'lens' part-of 'camera', 'price' attribute-of 'camera')
  - Candidates: beginning definite base noun phrases, i.e. POS-tag/word sequences 'the NN', 'the JJ NN', 'the NN NN' etc. (NN = noun, JJ = adjective) (Yi et al., 2005)
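Candidate extraction with the three tag sequences named above can be sketched over sentences that are already POS-tagged. This is a simplified illustration, not the WebFountain implementation: the input is hand-tagged (an assumption, so that no tagger is required), only the three listed patterns are covered, "beginning" is interpreted as "at the start of the sentence", and the function name `candidate_feature` is hypothetical.

```python
# Tag sequences that may follow a sentence-initial 'the'; longest patterns first.
PATTERNS = [("JJ", "NN"), ("NN", "NN"), ("NN",)]

def candidate_feature(tagged_sentence):
    """Return the phrase after a sentence-initial 'the' if it matches a pattern, else None."""
    if not tagged_sentence or tagged_sentence[0][0].lower() != "the":
        return None
    for pattern in PATTERNS:
        window = tagged_sentence[1:1 + len(pattern)]
        if tuple(tag for _, tag in window) == pattern:
            return " ".join(word for word, _ in window)
    return None

# Hand-tagged example sentences (hypothetical input).
s1 = [("The", "DT"), ("lens", "NN"), ("is", "VBZ"), ("sharp", "JJ")]
s2 = [("The", "DT"), ("battery", "NN"), ("life", "NN"), ("is", "VBZ"), ("short", "JJ")]
s3 = [("I", "PRP"), ("like", "VBP"), ("the", "DT"), ("lens", "NN")]

print(candidate_feature(s1))
print(candidate_feature(s2))
print(candidate_feature(s3))
```

Trying the longer patterns first makes 'the battery life' yield the full noun compound rather than just 'battery'; the third sentence yields no candidate because its definite noun phrase is not sentence-initial.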

Sentiment Detection and Concept Extraction
Sentiment detection
- Sentiment lexicon
  - Format: <entry> <POS-tag> <polarity>
  - Example: excellent JJ +
- Sentiment patterns
  - Format: <predicate> <sentence-category> <target>
  - <predicate>: a verb
  - <sentence-category>: a subject phrase, object phrase, complement / adjective phrase, or prepositional phrase, associated with a polarity + or -
    - Flipped polarity on the target is signified by the ~ marker
  - <target>: a subject or object phrase at which the sentiment is directed

Sentiment Detection and Concept Extraction
- Semantic relationship analysis: identify pattern elements from parse trees, starting with predicates
- In a pattern, assign sentiment to the target based on the source sentiment
- If the phrase or the sentence contains a negation, reverse the sentiment polarity
- Precision: 86 %, Recall: 56 %
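How a sentiment pattern transfers polarity from a source phrase to a target can be sketched on clauses that have already been parsed. The sketch is heavily simplified and rests on stated assumptions: clauses arrive as hand-built dictionaries instead of parse trees, the phrase labels `SP` (subject phrase) and `CP` (complement phrase) are shorthand chosen here, and the single pattern and two lexicon entries are illustrative rather than taken from the published system.

```python
# Illustrative lexicon: entry -> polarity.
LEXICON = {"excellent": "+", "poor": "-"}
# Illustrative pattern table: predicate -> (source phrase category, target phrase category).
PATTERNS = {"be": ("CP", "SP")}
FLIP = {"+": "-", "-": "+"}

def analyze(clause):
    """Return (target phrase, polarity) for a pre-parsed clause, or None if no pattern applies."""
    pattern = PATTERNS.get(clause["predicate"])
    if pattern is None:
        return None
    source, target = pattern
    polarity = None
    for word in clause.get(source, "").lower().split():
        if word in LEXICON:
            polarity = LEXICON[word]
            break
    if polarity is None:
        return None
    if clause.get("negated", False):
        polarity = FLIP[polarity]  # a negation reverses the sentiment polarity
    return clause[target], polarity

# 'The lens is excellent.' and 'The lens is not excellent.' as pre-parsed clauses.
plain = {"predicate": "be", "SP": "the lens", "CP": "excellent", "negated": False}
negated = {"predicate": "be", "SP": "the lens", "CP": "excellent", "negated": True}
print(analyze(plain))
print(analyze(negated))
```

The analysis starts from the predicate, looks up which phrase carries the sentiment and which phrase receives it, and only then consults the lexicon, mirroring the order of steps listed above.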

Social Media Analytics: a concrete architecture
[Figure: IBM Social Media Analytics (Alper et al., 2011; Coutinho et al., 2013)]

Resources / Further Reading
- Information retrieval:
  - Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Vol. 1. Cambridge: Cambridge University Press, 2008.
- Sentiment detection:
  - Pang, Bo, and Lillian Lee. "Opinion mining and sentiment analysis." Foundations and Trends in Information Retrieval 2.1-2 (2008): 1-135.
- Social media analytics:
  - Coutinho, Fabio Cardoso, Alexander Lang, and Bernhard Mitschang. "Making Social Media Analysis More Efficient Through Taxonomy Supported Concept Suggestion." Proceedings of the BTW. 2013.
  - Alper, Basak, et al. "OpinionBlocks: Visualizing Consumer Reviews." Proceedings of the IEEE VisWeek Workshop on Interactive Text Analytics for Decision Making. 2011.
  - Yi, Jeonghee, and Wayne Niblack. "Sentiment Mining in WebFountain." Proceedings of the 21st ICDE. 2005.