Where Text Analytics is at Today

Similar documents
Text Analytics Evaluation Case Study - Amdocs

Why is Internal Audit so Hard?

The Text Analytics Market(s)

Computer-Based Text- and Data Analysis Technologies and Applications. Mark Cieliebak

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A

Search and Information Retrieval

Text Analytics Software Choosing the Right Fit

Get the most value from your surveys with text analysis

Expert System. Deep Semantic vs. Keyword and Shallow Linguistic: A New Approach for Supporting Exploitation

CAPTURING THE VALUE OF UNSTRUCTURED DATA: INTRODUCTION TO TEXT MINING

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

Using Text and Data Mining Techniques to extract Stock Market Sentiment from Live News Streams

Introduction to IE with GATE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Microsoft FAST Search Server 2010 for SharePoint Evaluation Guide

Social Media Analytics Summit April 17-18, 2012 Hotel Kabuki, San Francisco WELCOME TO THE SOCIAL MEDIA ANALYTICS SUMMIT #SMAS12

Fraud Detection In Insurance Claims. Bob Biermann April 15, 2013

Text Mining - Scope and Applications

Hexaware E-book on Predictive Analytics

What is a Domain Name?

Fact Sheet In-Memory Analysis

SUPEROFFICE 7 ONLINE. t e s t s

Beyond listening Driving better decisions with business intelligence from social sources

Find the signal in the noise

IBM Content Analytics with Enterprise Search, Version 3.0

Specialists in Strategic, Enterprise and Project Risk Management. Enterprise Risk Management. the effect of uncertainty on objectives.

Hadoop and NoSQL Basics: Big Data Demystified. NYS Innovation Summit, 12/17/2013. Matt

Survey Results: Requirements and Use Cases for Linguistic Linked Data

Identifying Focus, Techniques and Domain of Scientific Papers

Data Mining Boot Camp 2. Text Mining

SaxoTrader Quick Start Guide

Text Analytics for Competitive Analysis and Market Intelligence Aiaioo Labs

What is the number one issue that Organizational Leaders are facing today?

Introduction to Cloud Services

CLOUD ANALYTICS: Empowering the Army Intelligence Core Analytic Enterprise

Managed Exchange. Access secure from any Internet connection in the world

Domain Adaptive Relation Extraction for Big Text Data Analytics. Feiyu Xu

Taxonomies in Practice Welcome to the second decade of online taxonomy construction

THE AUSTRALIAN PUBLIC SERVICE BIG DATA STRATEGY. Comments from AIIA

Data Warehousing and Data Mining in Business Applications

Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization

Introduction to Information Retrieval

RevoScaleR Speed and Scalability

Scenic Video Transcript Direct Cash Flow Statements Topics. Introduction. Purpose of cash flow statements. Level 1 analysis: major categories

A 2015 guide to HTML marketing

Knowledge Discovery from patents using KMX Text Analytics

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

BusinessMan CRM. Contents. Walkthrough. Computech IT Services Ltd Tuesday, June 1 st 2014 Technical Document Version 6.

Slide 7. Jashapara, Knowledge Management: An Integrated Approach, 2 nd Edition, Pearson Education Limited Nisan 14 Pazartesi

Corporate Presentation

CONTACT DATABASES IN MICROSOFT OUTLOOK

Semantic SharePoint. Technical Briefing. Helmut Nagy, Semantic Web Company Andreas Blumauer, Semantic Web Company

Network Patent Analytics

Text Analytics Industry Use Cases (& the Path Forward for Text Analytics) Aiaioo Labs Bangalore, India. team@aiaioo.com

Off-market Buy-Back booklet

White Paper April 2006

How To Write A Summary Of A Review

Taxonomies for Auto-Tagging Unstructured Content. Heather Hedden Hedden Information Management Text Analytics World, Boston, MA October 1, 2013

The Marketing Manager s Ultimate Cheat Sheet for Google Analytics

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Financial Text Mining

PDF PREVIEW EMERGING TECHNOLOGIES. Applying Technologies for Social Media Data Analysis

While a number of technologies fall under the Big Data label, Hadoop is the Big Data mascot.

Implementing Data Models and Reports with Microsoft SQL Server 2012 MOC 10778

THE POWER OF BIG DATA

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER

From Grassroots to the Cloud

Big Data. What is Big Data? Over the past years. Big Data. Big Data: Introduction and Applications

Company Improves Data Analysis and Reporting with Business Intelligence Solution

Automatic Document Categorization A Hummingbird White Paper

Natural Language to Relational Query by Using Parsing Compiler

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis

DATA INTEGRATION CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

The end of the road for traditional mail and document storage services?

Cloud Computing at Google. Architecture

Configuring SharePoint 2013 Document Management and Search. Scott Jamison Chief Architect & CEO Jornata scott.jamison@jornata.com

Digitally Sign an InfoPath form and Submit using

Introduction. A. Bellaachia Page: 1

Data Mining Yelp Data - Predicting rating stars from review text

Data loss prevention and endpoint security. Survey findings

Why are Organizations Interested?

Content Analyst's Cerebrant Combines SaaS Discovery, Machine Learning, and Content to Perform Next-Generation Research

27 August Dear Shareholder,

CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.

Setting up Junk Filters By Louise Ryan, NW District IT Expert

Twitter sentiment vs. Stock price!

Corporate Presentation. Mercato AIM - Italia: opportunità e rischi per emittenti ed investitori

Data-Driven Performance Management in Practice for Online Services

IBM Next Best Action. Tony Hocevar Business Analytics Growth Markets

Record keeping. Course 7

Information extraction from texts. Technical and business challenges

Activities Report & Cash Flow Statement For the 3 months ending 30 September 2015

Supply Chains: From Inside-Out to Outside-In

MICROSOFT AZURE MACHINE LEARNING. Oscar Naim Microsoft

Case study 1. Security operations. Step 1: Identify core skills required for work. Step 2: Identify learner s core skill levels

Act! Training Guide

Semantic Search in E-Discovery. David Graus & Zhaochun Ren

Word Completion and Prediction in Hebrew

Business 360 Online - Product concepts and features

Cognitive z. Mathew Thoennes IBM Research System z Research June 13, 2016

Transcription:

Where Text Analytics is at Today Robert Dale Centre for Language Technology www.clt.mq.edu.au Sydney Data Miners 2008-07-29 Robert Dale 1

The Aim of This Talk To give you an idea of where the commercial activity is in text analytics To give you an idea of where current research is focussed in text analytics Sydney Data Miners 2008-07-29 Robert Dale 2

The Problem: Too Much Information 80% of data is locked up in text How big is the web? A recent frequently cited answer: 155,583,825 sites Google indexes in excess of 8 billion pages FAST s enterprise search solution handles up to10 billion documents My hard disk contains 29.7Gb of work-related data in 106k documents My Outlook file is 800Mb with no attachments included Sydney Data Miners 2008-07-29 Robert Dale 3

A Solution: Text Analytics Text Analytics is the process of automatically extracting structured or semi-structured information from unstructured machine-readable documents Sydney Data Miners 2008-07-29 Robert Dale 4

Text Analytics Components Search Text Categorisation and Clustering Entity Recognition Cross-Document Name Matching Text Summarisation Event Recognition Sentiment Analysis Question Answering Sydney Data Miners 2008-07-29 Robert Dale 5

Search aka Information Retrieval Basic techniques: treat documents as bags of words, look for overlap with query Smarter ideas: Page rank; link analysis; hubs and authorities Using linguistic knowledge State of the art: Still really searching for words rather than searching for concepts Sydney Data Miners 2008-07-29 Robert Dale 6

Linguistic Knowledge in Search High frequency closed class words are not very informative: the, and, of Automatic detection of multiword concepts: New York City subway Stemming or morphological analysis can find pages otherwise missed: apples apple, application apply Query expansion via synonyms: car automobile Sydney Data Miners 2008-07-29 Robert Dale 7

Sydney Data Miners 2008-07-29 Robert Dale 8

Sydney Data Miners 2008-07-29 Robert Dale 9

Sydney Data Miners 2008-07-29 Robert Dale 10

Sydney Data Miners 2008-07-29 Robert Dale 11

Text Analytics Components Search Text Categorisation and Clustering Entity Recognition Cross-Document Name Matching Text Summarisation Event Recognition Sentiment Analysis Question Answering Sydney Data Miners 2008-07-29 Robert Dale 12

Text Categorisation Given a set of categories, automatically determine which category a new item belongs to Machine learning identifies the characteristics of existing documents that cause them to belong to their categories Sydney Data Miners 2008-07-29 Robert Dale 13

Uses of Text Categorisation Organising information into a structure Email folders Spam and junk mail detection ASX company announcements around 150 distinct document types Text categorisation can be rule-based or machine-learned Sydney Data Miners 2008-07-29 Robert Dale 14

Text Clustering From a given set of uncategorized items, determine the most natural data-driven groupings Machine learning identifies the similarites and differences between existing documents Sydney Data Miners 2008-07-29 Robert Dale 15

Sydney Data Miners 2008-07-29 Robert Dale 16

Sydney Data Miners 2008-07-29 Robert Dale 17

Sydney Data Miners 2008-07-29 Robert Dale 18

Sydney Data Miners 2008-07-29 Robert Dale 19

Text Categorisation and Clustering Categorisation: Taxonomies are alive and well Vendor claim: deploying taxonomies can reduce the amount of time it takes to find information by 50% Clustering: Available in some second-tier search engines Why hasn't Google prime-timed it? Sydney Data Miners 2008-07-29 Robert Dale 20

Text Analytics Components Search Text Categorisation and Clustering Entity Recognition Cross-Document Name Matching Text Summarisation Event Recognition Sentiment Analysis Question Answering Sydney Data Miners 2008-07-29 Robert Dale 21

Named Entities The idea: entities are more important than words Traditionally: People, organizations, geographical locations Time expressions, currency amounts, weights and measures Now: All of the above, plus key concepts or terms in any field Sydney Data Miners 2008-07-29 Robert Dale 22

Approaches Gazetteers: large lists of names and variants Rule-based systems: Eg, "any initcapped string with 'Ltd' at the end is a company" Machine learning: Based on the words that precede and follow initcapped strings, determine if they are likely to be named entities of a specific type Sydney Data Miners 2008-07-29 Robert Dale 23

Sydney Data Miners 2008-07-29 Robert Dale 24

Sydney Data Miners 2008-07-29 Robert Dale 25

Sydney Data Miners 2008-07-29 Robert Dale 26

Sydney Data Miners 2008-07-29 Robert Dale 27

Sydney Data Miners 2008-07-29 Robert Dale 28

Sydney Data Miners 2008-07-29 Robert Dale 29

Sydney Data Miners 2008-07-29 Robert Dale 30

Sydney Data Miners 2008-07-29 Robert Dale 31

Sydney Data Miners 2008-07-29 Robert Dale 32

Sydney Data Miners 2008-07-29 Robert Dale 33

Sydney Data Miners 2008-07-29 Robert Dale 34

Issues Complex names: John and Mary Smith Proctor and Gamble Book and movie titles Not Every Initcapped String Is A Name Sydney Data Miners 2008-07-29 Robert Dale 35

Text Analytics Components Search Text Categorisation and Clustering Entity Recognition Cross-Document Name Matching Text Summarisation Event Recognition Sentiment Analysis Question Answering Sydney Data Miners 2008-07-29 Robert Dale 36

Sydney Data Miners 2008-07-29 Robert Dale 37

Sydney Data Miners 2008-07-29 Robert Dale 38

Sydney Data Miners 2008-07-29 Robert Dale 39

Issues Name variation: Alexander Smith, Alex Smith, A Smith, Mr Smith IBM = Big Blue; Qantas = the Red Kangaroo Two sides to the coin: One name, many people Many names, one person Sydney Data Miners 2008-07-29 Robert Dale 40

Text Analytics Components Search Text Categorisation and Clustering Named Entity Recognition Cross-Document Name Matching Text Summarisation Event Recognition Sentiment Analysis Question Answering Sydney Data Miners 2008-07-29 Robert Dale 41

Types of Summarisation Informative vs Indicative Extractive vs Abstractive Sydney Data Miners 2008-07-29 Robert Dale 42

Informative vs Indicative An informative summary represents or replaces the original document: Needs to contain all the core information and omit ancillary information An indicative summary suggests the contents of the article without giving away detail on the actual content Can serve to entice the user into retrieving the full form Examples: book jackets, card catalog entries, movie trailers Sydney Data Miners 2008-07-29 Robert Dale 43

Extractive vs Abstractive Extractive summaries (extracts) are made up of fragments (whole sentences or phrases) of the source text. Abstractive summaries (abstracts) may contain content not directly present in the source may require generalization or abstraction Sydney Data Miners 2008-07-29 Robert Dale 44

Techniques for Extractive Summarisation Basic idea: Extract key sentences Approaches: Rule-based vs machine-learned Relevant features Position of sentence in text; frequency of words appearing in sentence; density of named entities Sydney Data Miners 2008-07-29 Robert Dale 45

Related Techniques Sentence compression Geoff Dixon, CEO of Qantas, today announced that the company will sell off Qantas will sell off Microsoft's IntelliShrink DrctDepositPymntsWllBAvlbleInYrAccntWthn3BsnssDysFrmPymnt DteBlw 'Snippet' or 'teaser' extraction Sydney Data Miners 2008-07-29 Robert Dale 46

Sydney Data Miners 2008-07-29 Robert Dale 47

Issues Impacting on Performance Real summarisation requires real understanding Quality of knowledge-free summarisation relies on aspects of the document other than content Problems with out-of-context or dangling references: 'He said the price offered in the buyback scheme would be ' Sydney Data Miners 2008-07-29 Robert Dale 48

Sydney Data Miners 2008-07-29 Robert Dale 49

Sydney Data Miners 2008-07-29 Robert Dale 50

Sydney Data Miners 2008-07-29 Robert Dale 51

Sydney Data Miners 2008-07-29 Robert Dale 52

Sydney Data Miners 2008-07-29 Robert Dale 53

Text Analytics Components Search Text Categorisation and Clustering Named Entity Recognition Cross-Document Name Matching Text Summarisation Event Recognition Sentiment Analysis Question Answering Sydney Data Miners 2008-07-29 Robert Dale 54

Event Recognition Working out who did what to whom and when Also known as 'Information Extraction' Sydney Data Miners 2008-07-29 Robert Dale 55

Doing It Yourself GATE, from the University of Sheffield IBM s UIMA Sydney Data Miners 2008-07-29 Robert Dale 56

An Example Document San Salvador, 19 Apr 89 (ACAN-EFE) -- [TEXT] Salvadoran Presidentelect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime.... Incident: Date 19-Apr-89 Garcia Alvarado, 56, was Incident: killed when Location a bomb placed by urban El Salvador: San Salvador (CITY) guerrillas on his vehicle exploded as it came to a halt at an intersection in downtown Incident: San Salvador. Type Bombing... Perpetrator: Individual ID urban guerrillas Vice President-elect Francisco Merino said that when the attorney general's car stopped at Perpetrator: a light on a street Organization in downtown ID San FMLN Salvador, an individual placed a bomb Perpetrator: on the roof Confidence of the armored vehicle. Suspected or Accused by Authorities: FMLN Physical Target: Description vehicle Physical Target: Effect Some Damage: vehicle Human Target: Name Roberto Garcia Alvarado Human Target: Description attorney general: Roberto Garcia Alvarado Human Target: Effect Death: Roberto Garcia Alvarado Sydney Data Miners 2008-07-29 Robert Dale 57

Techniques Identify named entities (see earlier) Identify stated relationships between named entities: Generally achieved via pattern matching As pattern matching becomes more complex, it effectively becomes syntactic parsing Sydney Data Miners 2008-07-29 Robert Dale 58

Example: GainSpring Problem: Goal: 100k+ company announcements per year issued through the Australian Stock Exchange Extracting key facts from documents automatically and deliver them via different mechanisms web page, email alerts, SMS, synthesized voice Sydney Data Miners 2008-07-29 Robert Dale 59

Sydney Data Miners 2008-07-29 Robert Dale 60

Sydney Data Miners 2008-07-29 Robert Dale 61

Example Template Fill: Becoming a Substantial Holder Element Example Contents DocumentCategory 02001 AcquiringPartyASX TCN AcquiringParty TCNZ Australia Investments Pty Ltd AcquiredPartyASX AAP AcquiredParty AAPT Limited DateOfTransaction 4/07/1999 NumberOfShares 243,756,813 ShareType ordinary shares PercentageOfShares 79.90% Sydney Data Miners 2008-07-29 Robert Dale 62

Sydney Data Miners 2008-07-29 Robert Dale 63

Sydney Data Miners 2008-07-29 Robert Dale 64

A JAPE Rule from GATE Rule: FirstName ({Lookup.majorType == person_first}):person --> { gate.annotationset person = (gate.annotationset)bindings.get("person"); gate.annotation personann = (gate.annotation)person.iterator().next(); gate.featuremap features = Factory.newFeatureMap(); features.put("gender", personann.getfeatures().get("minortype")); features.put("rule", "FirstName"); outputas.add(person.firstnode(), person.lastnode(), "FirstPerson", features); } Sydney Data Miners 2008-07-29 Robert Dale 65

Issues Impacting on Performance Difficult to achieve high recall and precision because of variations in language Success depends on the predictability of document content High performance really requires full natural language processing Sydney Data Miners 2008-07-29 Robert Dale 66

Text Analytics Components Search Text Categorisation and Clustering Named Entity Recognition Cross-Document Name Matching Text Summarisation Event Recognition Sentiment Analysis Question Answering Sydney Data Miners 2008-07-29 Robert Dale 67

Sentiment Analysis Goal: To automatically determine the polarity of sentiment expressed towards a company, product or individual Increasingly seen as important because of the blogosphere This year s hot topic, aka Voice of the Customer Sydney Data Miners 2008-07-29 Robert Dale 68

Approaches to Sentiment Analysis Identify specific key words that are indicative: The General Inquirer Positives (1914 words) eg: decent, ardent, wholesome, triumph, Negatives (2293 words) eg: allege, despicable, harmful, worry, terrorize, Sydney Data Miners 2008-07-29 Robert Dale 69

Sydney Data Miners 2008-07-29 Robert Dale 70

Sydney Data Miners 2008-07-29 Robert Dale 71

Sydney Data Miners 2008-07-29 Robert Dale 72

Sydney Data Miners 2008-07-29 Robert Dale 73

Sydney Data Miners 2008-07-29 Robert Dale 74

Sydney Data Miners 2008-07-29 Robert Dale 75

Sydney Data Miners 2008-07-29 Robert Dale 76

Why Sentiment Analysis is Hard This laptop is a great deal. A great deal of media attention surrounded the release of the new laptop. This laptop is a great deal... and I've got a nice bridge you might be interested in. [Example from Lilian Lee] Sydney Data Miners 2008-07-29 Robert Dale 77

Text Analytics Components Search Text Categorisation and Clustering Named Entity Recognition Cross-Document Name Matching Text Summarisation Event Recognition Sentiment Analysis Question Answering Sydney Data Miners 2008-07-29 Robert Dale 78

Question Answering Current paradigm: Type in a query, get back a list of documents Future paradigm: Type in a question, get back an answer Sydney Data Miners 2008-07-29 Robert Dale 79

Sydney Data Miners 2008-07-29 Robert Dale 80

Sydney Data Miners 2008-07-29 Robert Dale 81

Sydney Data Miners 2008-07-29 Robert Dale 82

Sydney Data Miners 2008-07-29 Robert Dale 83

Sydney Data Miners 2008-07-29 Robert Dale 84

Sydney Data Miners 2008-07-29 Robert Dale 85

Sydney Data Miners 2008-07-29 Robert Dale 86

Sydney Data Miners 2008-07-29 Robert Dale 87

Sydney Data Miners 2008-07-29 Robert Dale 88

Sydney Data Miners 2008-07-29 Robert Dale 89

Issues Affecting Performance Hard to break the 2.3-word-query habit: In our four year 'Just Ask!' experiment, less than 10% of invited questions were actually questions 'Deictic' and other context-specific questions: are my phys149 results available? are practicals on this week? Sydney Data Miners 2008-07-29 Robert Dale 90

Other Related Areas Worth Watching Machine Translation Audio Search and Audio Mining Sydney Data Miners 2008-07-29 Robert Dale 91

Where We're At Search provides 80% of the value of text analytics but this decreasing as document datasets get bigger and bigger Text categorisation provides value if implemented appropriately Text clustering can help if 100% accuracy is not required Text summarisation needs to be tailored to document sets Named entity recognition already improves on basic search Event recognition is hampered by low accuracy Sentiment analysis is this year s hot topic Sydney Data Miners 2008-07-29 Robert Dale 92

What s Coming Fact extraction and aggregation: open information extraction Cross-document entity tracking Multi-document summarisation Sydney Data Miners 2008-07-29 Robert Dale 93

Finding Out More: Industry Check out the companies mentioned in this presentation Go to the Text Analytics Summit Type text analytics into any search engine Read my Industry Watch column http://www.ics.mq.edu.au/~rdale/publications/industrywatch/ Sydney Data Miners 2008-07-29 Robert Dale 94

Finding Out More: Research Conferences: The Association for Computational Linguistics Empirical Methods in Natural Language Processing COLING Journals: Computational Linguistics Natural Language Engineering The ACL Anthology: http://aclweb.org/anthology-new/ Sydney Data Miners 2008-07-29 Robert Dale 95

Finding Out More: Do a PhD! Current topics being pursued at the Centre for Language Technology include: Extracting tabular information from documents Identifying requests and commitments from emails Tracking entities across documents Automatic agenda construction from UN documents High-value PhD scholarships available via the Capital Markets Cooperative Research Centre Sydney Data Miners 2008-07-29 Robert Dale 96

Questions? Sydney Data Miners 2008-07-29 Robert Dale 97