Information extraction from texts. Technical and business challenges


Overview
- Mentis
- Text mining field overview
- Application: Information Extraction
- Motivation & Overview

Mentis - Overview
- Consulting and software company
- Created in 2005 as a spin-off of IRIDIA (artificial intelligence laboratory at ULB)
- Strong partnership with the university
- Multi-disciplinary team: math, physics, IT, optimization
- 3 PhDs, 7 engineers

Mentis - Some references

Mentis - Mission
Offer high-value solutions to help understand, structure, find, and enrich unstructured and structured textual information and transform it into knowledge.
- Understand: deep analysis based on domain-specific knowledge bases
- Structure: use domain-specific taxonomies to normalize and structure
- Find: better information retrieval through semantic search engines
- Enrich: assign entities and facts to words and terms

Mentis - Core business
- Technology provider: Text Analysis Engine
- Outsourcing: expert shopping, process outsourcing
- Solution provider: innovation business (solve the insolvable)

Research communities
[Diagram. Communities: Information Retrieval, Semantic Web, Computational Linguistics, Machine Learning. Other labels: Top/Down, Bottom/Up, Language, Statistics, Search, Interoperability, Approaches, Collaborative, Methodologies]

Levels of Representation
Levels: Semantic, Syntactic, Lexical
Representations, in the order listed on the slide:
- Semantic Models
- Ontologies
- Taxonomies
- Collaborative Tagging
- Cross-modality
- Full Parsing
- Language Models
- Vector Space Model
- Part-of-Speech Tags
- Phrases
- Words
- Characters

Levels of Representation (same levels as above, annotated with applications)
- Domain Knowledge - Reasoning
- Unifying Semantics and Data: Flickr
- Multi-Lingual Search
- Text/Image/Audio
- Machine Translation
- Spam Detection
- Classification
- Clustering - search
- Named Entity Extraction
- Language identification
- Copy detection

Levels of Representation (same levels as above, annotated with research communities)
- Machine Learning
- IR (Information Retrieval)
- CL (Computational Linguistics)
- Semantic Web

Store, Organize and Retrieve Knowledge

Digital Age
- World Wide Web
- Digital Libraries
- Special-purpose content providers (LexisNexis)
- Company intranet and digital assets
- Scientific literature (CiteSeer, Medline)
- Patent database

Types of documents
- Structured
- Unstructured
- Semi-structured

Examples of Semi-Structured Docs

Information Extraction

Information Extraction

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.

Richard Stallman, founder of the Free Software Foundation, countered saying

NAME  TITLE  ORGANIZATION  (to be filled)

Information Extraction

Applied to the excerpt above, IE fills the table:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Information Extraction

Used family of techniques:
Information Extraction = segmentation + classification + association + clustering

Segments found in the excerpt above (aka named entity extraction):
- Microsoft Corporation
- CEO
- Bill Gates
- Microsoft
- Gates
- Microsoft
- Bill Veghte
- Microsoft
- VP
- Richard Stallman
- founder
- Free Software Foundation


What is Information Extraction

Used family of techniques:
Information Extraction = segmentation + classification + association + clustering

Segments from the excerpt above: Microsoft Corporation; CEO; Bill Gates; Microsoft; Gates; Microsoft; Bill Veghte; Microsoft; VP; Richard Stallman; founder; Free Software Foundation

Result:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
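The four steps can be sketched on an abridged version of the excerpt. This is only an illustration under stated assumptions, not the method behind the slide: the gazetteers `ORGS` and `TITLES`, the two-capitalized-words pattern for person names, the nearest-mention association rule, and the suffix-stripping `canonical_org` used for clustering are all hypothetical choices that happen to fit this one text.

```python
import re

# Abridged excerpt; gazetteers below are hand-picked assumptions.
TEXT = (
    "For years, Microsoft Corporation CEO Bill Gates railed against the "
    "economic philosophy of open-source software. Gates himself says "
    "Microsoft will gladly disclose its crown jewels. "
    '"We love the concept of shared source," said Bill Veghte, a Microsoft VP. '
    "Richard Stallman, founder of the Free Software Foundation, countered."
)
ORGS = ["Microsoft Corporation", "Free Software Foundation", "Microsoft"]  # longest first
TITLES = ["CEO", "VP", "founder"]
PERSON = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def segment_and_classify(sentence):
    """Segmentation + classification: return (offset, surface, label) mentions."""
    mentions, taken = [], set()

    def add(pattern, label):
        for m in re.finditer(pattern, sentence):
            span = set(range(m.start(), m.end()))
            if not span & taken:  # skip text already claimed by an earlier match
                mentions.append((m.start(), m.group(), label))
                taken.update(span)

    for org in ORGS:
        add(r"\b" + re.escape(org) + r"\b", "ORG")
    for title in TITLES:
        add(r"\b" + re.escape(title) + r"\b", "TITLE")
    add(PERSON, "PERSON")
    return sorted(mentions)


def associate(mentions):
    """Association: link each TITLE to the nearest PERSON and ORG in the sentence."""
    facts = []
    for pos, title, label in mentions:
        if label != "TITLE":
            continue

        def nearest(kind):
            cands = [(abs(p - pos), s) for p, s, l in mentions if l == kind]
            return min(cands)[1] if cands else None

        person, org = nearest("PERSON"), nearest("ORG")
        if person and org:
            facts.append((person, title, org))
    return facts


def canonical_org(name):
    """Clustering: map surface variants of an organization to one name."""
    return re.sub(r" Corporation$", "", name)


def extract(text):
    rows = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for person, title, org in associate(segment_and_classify(sentence)):
            row = (person, title, canonical_org(org))
            if row not in rows:
                rows.append(row)
    return rows


if __name__ == "__main__":
    for row in extract(TEXT):
        print("%-18s %-8s %s" % row)
```

The sketch leaves the bare mention "Gates" unresolved; a fuller clustering step would also merge it with "Bill Gates".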

How It Works

Physical Document -> Scanning / OCR -> Mark-Up and Extraction -> Database -> Analyze

Part        Problem  Condition
Fuel Pump   Fails    corroded
Pump Relay  Shorts   Cold weather
Headlight   Fails    Running hot
Engine      Stalls   At low speeds
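Once extraction has filled a database, the "Analyze" step can be an ordinary query. A minimal sketch using Python's built-in `sqlite3`: the table name `findings`, its column names, and the query are illustrative assumptions. The rows are the part/problem/condition records from this slide, with the first two rows paired in the order they appear in the flattened table.

```python
import sqlite3

# Records as read from the slide; the pairing of the first two rows is assumed.
ROWS = [
    ("Fuel Pump", "Fails", "corroded"),
    ("Pump Relay", "Shorts", "Cold weather"),
    ("Headlight", "Fails", "Running hot"),
    ("Engine", "Stalls", "At low speeds"),
]


def problem_counts(rows):
    """Load extracted records into an in-memory table and count each problem."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE findings (part TEXT, problem TEXT, cond TEXT)")
    conn.executemany("INSERT INTO findings VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT problem, COUNT(*) FROM findings "
        "GROUP BY problem ORDER BY COUNT(*) DESC, problem"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for problem, count in problem_counts(ROWS):
        print(problem, count)
```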

Relevant IE Definitions
- Entity: an object of interest, such as a person or organization.
- Attribute: a property of an entity, such as its name, alias, descriptor, or type.
- Fact: a relationship held between two or more entities, such as the position of a person in a company.
- Event: an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction.
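These four definitions map naturally onto simple record types. The sketch below is one possible encoding, not a schema from the slides: the class and field names (`Entity`, `Fact`, `Event`, `attributes`, `details`, `participants`) are assumptions, and the event instance is a hypothetical placeholder rather than something stated in the excerpt.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An object of interest; attributes hold name variants, descriptors, etc."""
    name: str
    type: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Fact:
    """A relationship held between two or more entities."""
    relation: str
    arguments: tuple
    details: dict = field(default_factory=dict)


@dataclass
class Event:
    """An activity involving several entities."""
    kind: str
    participants: list


# Instances based on the news excerpt used earlier in the deck.
gates = Entity("Bill Gates", "Person", {"alias": "Gates"})
microsoft = Entity("Microsoft", "Organization", {"alias": "Microsoft Corporation"})
position = Fact("position_in_company", (gates, microsoft), {"position": "CEO"})

# Hypothetical event, only to show the shape of the record.
example_event = Event("management change", [gates, microsoft])
```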

Accuracy by Information Type

Information Type  Accuracy
Entities          90-98%
Attributes        80%
Facts             60-70%
Events            50-60%

Applications of Information Extraction
- Routing of information
- Infrastructure for IR and for categorization (higher-level features)
- Event-based summarization
- Automatic creation of databases and knowledge bases

Approaches for Building IE Systems: Knowledge Engineering Approach
- Rules are crafted by linguists in cooperation with domain experts.
- Most of the work is done by inspecting a set of relevant documents.
- Fine-tuning the rule set can take a lot of time.
- Best results were achieved with KB-based IE systems.
- Skilled/gifted developers are needed.
- A strong development environment is a MUST!
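To give a feel for what a hand-crafted rule set looks like, here is a minimal sketch written as regular expressions. The three patterns and the tiny title list are assumptions derived by inspecting the news excerpt used earlier in the deck, which mirrors the "inspect relevant documents, then tune" workflow; a real rule set would be far larger and is not shown in the slides.

```python
import re

# Building blocks (hypothetical, tuned to one excerpt).
NAME = r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"
TITLE = r"(?P<title>CEO|VP|founder|president)"
ORG = r"(?P<org>(?:[A-Z][a-z]+ ?)+)"

RULES = [
    # "Richard Stallman, founder of the Free Software Foundation"
    re.compile(NAME + r", " + TITLE + r" of (?:the )?" + ORG),
    # "Microsoft Corporation CEO Bill Gates"
    re.compile(ORG + r" ?" + TITLE + r" " + NAME),
    # "Bill Veghte, a Microsoft VP"
    re.compile(NAME + r", an? " + ORG + r" ?" + TITLE),
]


def apply_rules(text):
    """Return sorted (name, title, organization) triples matched by any rule."""
    rows = set()
    for rule in RULES:
        for m in rule.finditer(text):
            rows.add((m.group("name"), m.group("title"), m.group("org").strip()))
    return sorted(rows)


if __name__ == "__main__":
    sample = (
        "Microsoft Corporation CEO Bill Gates railed against open source. "
        '"We love shared source," said Bill Veghte, a Microsoft VP. '
        "Richard Stallman, founder of the Free Software Foundation, countered."
    )
    for row in apply_rules(sample):
        print(row)
```

Each new phrasing needs another rule, which is why fine-tuning takes time.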

Approaches for Building IE Systems: Automatically Trainable Systems
- The techniques are based on pure statistics and almost no linguistic knowledge.
- They are language independent.
- The main input is an annotated corpus.
- Building the rules takes relatively little effort, but creating the annotated corpus is extremely laborious.
- A huge number of training examples is needed to achieve reasonable accuracy.
- Hybrid approaches can utilize user input in the development loop.
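As an illustration of a trainable system, the sketch below learns a token labeller from a tiny annotated corpus. The slides do not name a learning algorithm, so Naive Bayes with add-one smoothing is swapped in here as an assumption. The three training sentences, the label set (PER, ORG, O), and the feature templates in `features` are toy choices, and the test sentence is a made-up combination of seen words.

```python
import math
from collections import Counter, defaultdict

# Toy annotated corpus: each sentence is a list of (token, label) pairs.
CORPUS = [
    [("Bill", "PER"), ("Gates", "PER"), ("runs", "O"), ("Microsoft", "ORG")],
    [("Richard", "PER"), ("Stallman", "PER"), ("founded", "O"), ("FSF", "ORG")],
    [("Bill", "PER"), ("Veghte", "PER"), ("joined", "O"), ("Microsoft", "ORG")],
]


def features(tokens, i):
    """Language-independent surface features for the token at position i."""
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    return ["w=" + tokens[i].lower(), "prev=" + prev,
            "cap=%s" % tokens[i][0].isupper()]


def train(corpus):
    """Count labels and label/feature co-occurrences in the annotated corpus."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for sentence in corpus:
        tokens = [tok for tok, _ in sentence]
        for i, (_, label) in enumerate(sentence):
            label_counts[label] += 1
            for f in features(tokens, i):
                feat_counts[label][f] += 1
                vocab.add(f)
    return label_counts, feat_counts, vocab


def predict(model, tokens):
    """Label each token with the class of highest smoothed log-probability."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    labels = []
    for i in range(len(tokens)):
        def score(label):
            denom = sum(feat_counts[label].values()) + len(vocab)
            s = math.log(label_counts[label] / total)
            for f in features(tokens, i):
                s += math.log((feat_counts[label][f] + 1) / denom)
            return s
        labels.append(max(label_counts, key=score))
    return labels


if __name__ == "__main__":
    model = train(CORPUS)
    print(predict(model, ["Bill", "Stallman", "joined", "FSF"]))
```

With only twelve annotated tokens the model merely memorizes patterns, which is the slide's point about needing many training examples.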

Good Features for IE
Creativity and Domain Knowledge Required!

Example word features:
- identity of word
- is in all caps
- ends in "-ski"
- is part of a noun phrase
- is in a list of city names
- is under node X in WordNet or Cyc
- is in bold font
- is in hyperlink anchor
- features of past & future
  - last person name was female
  - next two words are "and Associates"

Example line features:
- begins-with-number
- begins-with-ordinal
- begins-with-punctuation
- begins-with-question-word
- begins-with-subject
- blank
- contains-alphanum
- contains-bracketed-number
- contains-http
- contains-non-space
- contains-number
- contains-pipe
- contains-question-mark
- contains-question-word
- ends-with-question-mark
- first-alpha-is-capitalized
- indented
- indented-1-to-4
- indented-5-to-10
- more-than-one-third-space
- only-punctuation
- prev-is-blank
- prev-begins-with-ordinal
- shorter-than-30
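Features like these are cheap to compute. The sketch below implements a subset of the hyphenated feature names as boolean tests on a line of text. The slide gives only the names, so each definition here (for example, counting leading spaces for the `indented-*` features) is an assumed interpretation.

```python
import re


def line_features(line, prev_line=""):
    """Return the set of feature names that fire for this line."""
    stripped = line.strip()
    indent = len(line) - len(line.lstrip(" "))
    first_alpha = re.search(r"[A-Za-z]", line)
    tests = {
        "begins-with-number": bool(re.match(r"\s*\d", line)),
        "begins-with-punctuation": bool(re.match(r"\s*[^\w\s]", line)),
        "blank": stripped == "",
        "contains-http": "http" in line,
        "contains-number": bool(re.search(r"\d", line)),
        "contains-pipe": "|" in line,
        "contains-question-mark": "?" in line,
        "ends-with-question-mark": stripped.endswith("?"),
        "first-alpha-is-capitalized": bool(first_alpha and first_alpha.group().isupper()),
        "indented": indent > 0,
        "indented-1-to-4": 1 <= indent <= 4,
        "indented-5-to-10": 5 <= indent <= 10,
        "more-than-one-third-space": len(line) > 0 and line.count(" ") > len(line) / 3,
        "only-punctuation": stripped != ""
        and all(not c.isalnum() and not c.isspace() for c in stripped),
        "prev-is-blank": prev_line.strip() == "",
        "shorter-than-30": len(line) < 30,
    }
    return {name for name, fired in tests.items() if fired}


if __name__ == "__main__":
    print(sorted(line_features("  2. What is IE?", prev_line="")))
```

The resulting feature sets would then be fed to a classifier or sequence model; the features themselves carry no decision logic.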

Good Features for IE
Creativity and Domain Knowledge Required!

- Is Capitalized
- Is Mixed Caps
- Is All Caps
- Initial Cap
- Contains Digit
- All lowercase
- Is Initial
- Punctuation
- Period
- Comma
- Apostrophe
- Dash
- Preceded by HTML tag

- Character n-gram classifier says string is a person name (80% accurate)
- In stopword list (the, of, their, etc.)
- In honorific list (Mr, Mrs, Dr, Sen, etc.)
- In person suffix list (Jr, Sr, PhD, etc.)
- In name particle list (de, la, van, der, etc.)
- In Census lastname list; segmented by P(name)
- In Census firstname list; segmented by P(name)
- In locations lists (states, cities, countries)
- In company name list ("J. C. Penney")
- In list of company suffixes (Inc, & Associates, Foundation)

Word features:
- Lists of job titles
- Lists of prefixes
- Lists of suffixes
- 350 informative phrases

HTML/formatting features:
- {begin, end, in} x {<b>, <i>, <a>, <hn>} x {lengths 1, 2, 3, 4, or longer}
- {begin, end} of line
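A few of these word-level tests can be sketched as follows. The feature names, the exact definitions of the capitalization tests, and the tiny word lists (seeded with the examples given on the slide) are assumptions; real lists such as the Census name lists are far larger and are not reproduced here.

```python
import re

# Tiny stand-in lists, seeded from the examples on the slide.
STOPWORDS = {"the", "of", "their"}
HONORIFICS = {"mr", "mrs", "dr", "sen"}
PERSON_SUFFIXES = {"jr", "sr", "phd"}
NAME_PARTICLES = {"de", "la", "van", "der"}
COMPANY_SUFFIXES = {"inc", "foundation"}


def word_features(word):
    """Return the set of feature names that fire for a single token."""
    core = word.strip(".,").lower()  # list lookups ignore case and edge punctuation
    tests = {
        "initial-cap": word[:1].isupper() and word[1:].islower(),
        "all-caps": word.isupper(),
        "all-lowercase": word.islower(),
        "mixed-caps": any(c.isupper() for c in word[1:])
        and any(c.islower() for c in word),
        "contains-digit": any(c.isdigit() for c in word),
        "is-initial": bool(re.fullmatch(r"[A-Z]\.", word)),
        "contains-period": "." in word,
        "contains-apostrophe": "'" in word,
        "contains-dash": "-" in word,
        "in-stopword-list": core in STOPWORDS,
        "in-honorific-list": core in HONORIFICS,
        "in-person-suffix-list": core in PERSON_SUFFIXES,
        "in-name-particle-list": core in NAME_PARTICLES,
        "in-company-suffix-list": core in COMPANY_SUFFIXES,
    }
    return {name for name, fired in tests.items() if fired}


if __name__ == "__main__":
    for token in ["Dr.", "J.", "van", "PhD", "Foundation"]:
        print(token, sorted(word_features(token)))
```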

IE Is Difficult
Different languages
- Morphology is very easy in English, much harder in German.
- Identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese and Japanese.
- Some languages offer orthographic cues such as capitalization (like English), while others (like Hebrew and Arabic) do not have them.
Different types of style
- Scientific papers
- Newspapers
- Memos
- Emails
- Speech transcripts
Type of document
- Tables
- Graphics
- Small messages vs. books
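The word-boundary point is easy to demonstrate: splitting on whitespace is a workable first tokenizer for English but returns a single token for Chinese text, which is written without spaces. The two example sentences are assumptions added for illustration, and the Chinese one is a rough equivalent of the English one.

```python
def whitespace_tokens(text):
    """Naive tokenizer: split on runs of whitespace."""
    return text.split()


ENGLISH = "Bill Gates founded Microsoft"
CHINESE = "比尔盖茨创立了微软"  # rough Chinese equivalent, no spaces between words

if __name__ == "__main__":
    print(len(whitespace_tokens(ENGLISH)), whitespace_tokens(ENGLISH))
    print(len(whitespace_tokens(CHINESE)), whitespace_tokens(CHINESE))
```

For Chinese and Japanese, segmentation therefore has to be a separate, usually dictionary-based or statistical, step before any of the feature extraction above applies.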

Thanks
Questions?