Timeline (1) Text Mining 2004-2005 Master TKI. Timeline (2) Timeline (3) Overview. What is Text Mining?



Similar documents
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Natural Language to Relational Query by Using Parsing Compiler

A chart generator for the Dutch Alpino grammar

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

Semantic analysis of text and speech

Specialty Answering Service. All rights reserved.

Flattening Enterprise Knowledge

Search and Information Retrieval

The University of Amsterdam s Question Answering System at QA@CLEF 2007

Special Topics in Computer Science

Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System

Word Completion and Prediction in Hebrew

An Overview of a Role of Natural Language Processing in An Intelligent Information Retrieval System

CS 6740 / INFO Ad-hoc IR. Graduate-level introduction to technologies for the computational treatment of information in humanlanguage

Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Customizing an English-Korean Machine Translation System for Patent Translation *

A Framework-based Online Question Answering System. Oliver Scheuer, Dan Shen, Dietrich Klakow

NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR

D2.4: Two trained semantic decoders for the Appointment Scheduling task

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde

A Mixed Trigrams Approach for Context Sensitive Spell Checking

Building a Question Classifier for a TREC-Style Question Answering System

A Comparative Study on Sentiment Classification and Ranking on Product Reviews

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

Find the signal in the noise

Hybrid Strategies. for better products and shorter time-to-market

Role of Text Mining in Business Intelligence

INF5820 Natural Language Processing - NLP. H2009 Jan Tore Lønning jtl@ifi.uio.no

Why are Organizations Interested?

Stock Market Prediction Using Data Mining

Interactive Dynamic Information Extraction

Computational Linguistics and Learning from Big Data. Gabriel Doyle UCSD Linguistics

Facilitating Business Process Discovery using Analysis

GRAPHICAL USER INTERFACE, ACCESS, SEARCH AND REPORTING

Overview of the TACITUS Project

The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge

Parsing Software Requirements with an Ontology-based Semantic Role Labeler

» A Hardware & Software Overview. Eli M. Dow <emdow@us.ibm.com:>

Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives

Christian Leibold CMU Communicator CMU Communicator. Overview. Vorlesung Spracherkennung und Dialogsysteme. LMU Institut für Informatik

M LTO Multilingual On-Line Translation

Schema documentation for types1.2.xsd

Text Mining: The state of the art and the challenges

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition

ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS

Using Knowledge Extraction and Maintenance Techniques To Enhance Analytical Performance

Comprendium Translator System Overview

Master of Science in Computer Science

Software Development Training Camp 1 (0-3) Prerequisite : Program development skill enhancement camp, at least 48 person-hours.

What do Big Data & HAVEn mean? Robert Lejnert HP Autonomy

Semantic annotation of requirements for automatic UML class diagram generation

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

Processing: current projects and research at the IXA Group

The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT)

Semantic Analysis of Natural Language Queries Using Domain Ontology for Information Access from Database

Bridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project

INF5820, Obligatory Assignment 3: Development of a Spoken Dialogue System

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

Text-To-Speech Technologies for Mobile Telephony Services

Terminology Extraction from Log Files

CENG 734 Advanced Topics in Bioinformatics

The Prolog Interface to the Unstructured Information Management Architecture

Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization

LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH

Automated Extraction of Security Policies from Natural-Language Software Documents

Topics in basic DBMS course

Example-Based Treebank Querying. Liesbeth Augustinus Vincent Vandeghinste Frank Van Eynde

Automatic Text Analysis Using Drupal

Information extraction from texts. Technical and business challenges

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

A pixlogic White Paper 4984 El Camino Real Suite 205 Los Altos, CA T

INTEGRATED DEVELOPMENT ENVIRONMENTS FOR NATURAL LANGUAGE PROCESSING

Outline of today s lecture

A Framework of User-Driven Data Analytics in the Cloud for Course Management

The Seven Practice Areas of Text Analytics

Semantic Video Annotation by Mining Association Patterns from Visual and Speech Features

Exalead CloudView. Semantics Whitepaper

Parsing Technology and its role in Legacy Modernization. A Metaware White Paper

Presented to The Federal Big Data Working Group Meetup On 07 June 2014 By Chuck Rehberg, CTO Semantic Insights a Division of Trigent Software

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

Natural Language Database Interface for the Community Based Monitoring System *

Transcription:

Text Mining 2004-2005 Master TKI Antal van den Bosch en Walter Daelemans http://ilk.uvt.nl/~antalb/textmining/ Dinsdag, 10.45-12.30, SZ33 Timeline (1) [1 februari 2005] Introductie (WD) [15 februari 2005] Syntactic pipeline 1: Tokenization, POS tagging (AB) [22 februari 2005] Concept chunking (Sander Canisius) [1 maart 2005] Syntactic pipeline 2: chunking, relation finding (WD) Timeline (2) [8 maart 2005] Named-entity recognition (Toine Borgers) [15 maart 2005] Information extraction (WD) [5 april 2005] Tools (AB) [12 april 2005] Industrial information extraction Timeline (3) [19 april 2005] Factoids (AB) [26 april 2005] Ontology learning (Marie-Laure Reinberger) [3 mei 2005] Information extraction from spoken user input (Piroska Lendvai) [10 mei 2005] Presentaties Overview What is Text Mining? Which Language Technology tools are useful? What is Text Mining? Evaluation: 2 exercises: software assignments (programming / tuning / testing of modules) Final paper and presentation 1

LE Components Lexical / Morphological Analysis Tagging Chunking Syntactic Analysis Word Sense Disambiguation Shallow parsing Grammatical Relation Finding Named Entity Recognition Semantic Analysis Reference Resolution Discourse Analysis Text Meaning Applications OCR Spelling Error Correction Grammar Checking Information retrieval Document Classification Information Extraction Summarization Question Answering Text Mining Ontology Extraction and Refinement Dialogue Systems Machine Translation Text Mining Automatic extraction of reusable information (knowledge) from text, based on linguistic features of the text Goals: Data mining (KDD) from unstructured and semistructured data (Corporate) Knowledge Management Intelligence Examples: Email routing and filtering Finding protein interactions in biomedical text Matching resumes and vacancies Document Set of Documents Structured Information + existing data Author Recognition Document Dating Language Identification Text Categorization Information Extraction Summarization Question Answering Topic Detection and Tracking Document Clustering Terminology Extraction Ontology Extraction Knowledge Discovery Role of Language Technology: Compute Text Representation Units Character n-grams Words, phrases, heads of phrases POS tags Parse tree (fragment)s Grammatical Relations Frames and scripts meaning (?) Text is a special kind of data Direct entry, OCR (.99 accuracy), Speech Recognition output (.50-.90 accuracy), What we have: Characters, character n-grams, words, word n-grams, lay-out, counts, lengths, What we want: Meaning (answering questions, relating with previous knowledge) Bridging the gap: Tagging, lemmatization, phrase chunking, grammatical relations, I.e.: Language Technology What is Language Technology? 2

Language comprehension Speech recognition Meaning translation Text Speech Text generation Speech synthesis Language Technology (Natural Language Processing, Computational Linguistics) is based on the complex transformation of linguistic representations Examples from text to speech from words to morphemes from words to syntactic structures from syntactic structures to conceptual dependency networks In this transformation, two processes play a role segmentation of representations disambiguation of possible transformations of representation units Similar representations at input level correspond to similar representations at the output level Complexity because of context-sensitivity (regularities, subregularities, exceptions) gebruiksvriendelijkheid ge+bruik+s+vriend+elijk+heid The old man the boats det N-plur V-plur det N-plur Punc The old man the boats (S (NP (DET the) (N old)) (VP (V man) (NP (DET the) (N boats)))) The old man the boats De ouden bemannen de schepen (S (NP (DET the) (N old)) (VP (V man) (NP (DET the) (N boats)))) (man-action (agent (def plur old-person)) (object (def plur boat))) How to reach Language Understanding? A fundamental solution for the problem of language understanding presupposes Representation and use of knowledge / meaning Acquisition of human-level knowledge What is meaning? Eleni eats a pizza with banana Eleni eat pizza banana contain Pizza = {p1, p2, p3, } Eat ={<Nicolas,p1>,<Nicolas,p3>,<Eleni,p2>, } Contain ={<p1,ansjovis>,<p1,tomaat>,<p2,banaan>, } x=p2 Semantic networks, Frames First-order predicate calculus (x): pizza(x) eat(eleni,x) contain (x,banaan ) Set theory Symbol grounding problem Representation and processing of time, causality, modality, defaults, common sense, Meaning is in the mind of the beholder 3

Language Technology Language Data Acquisition Language Data Input Representation Model Output Representation Input Representation Model Output Representation Processing S -> NP VP NP -> ART A N NP -> ART N VP -> V NP... Acquisition Construct a (rule-based) model about the domain of the transformation. Processing Use rule-based reasoning, deduction, on these models to solve new problems in the domain. Input Representation Language Data De computertaalkunde, in navolging van de taalkunde aanvankelijk sterk regel-gebaseerd, is onder druk van toepasbaarheid en grotere rekenkracht de laatste tien jaar geleidelijk geevolueerd naar een meer statistische, c orpus-gebaseerde en inductieve benadering. De laatste jaren hebben ook technieken uit de theorie van z elflerende systemen (Machine Learning, zie Mitchell, 1998 voor een inleiding) aan belang gewonnen. Deze technieken zijn in z ekere zin Model Output Representation p(corpustaalkunde de), p(in corpustaalkunde), p(corpustaalkunde), p(de), p(in),... Advantages Acquisition Induce a stochastic model from a corpus of examples of the transformation. Processing Use statistical inference (generalization) from the stochastic model to solve new problems in the domain. Linguistic knowledge and intuition can be used Precision Fast development of model Good coverage Good robustness (preference statistics) Knowledge-poor Scalable / Applicable 4

Problems Representation of sub/irregularity Cost and time of model development (Not scalable / applicable) Sparse data Estimation of relevance statistical events Applications of Text Mining Question Answering Give answer to question (document retrieval: find documents relevant to query) Who invented the telephone? Alexander Graham Bell When was the telephone invented? 1876 QA System: Shapaqa Parse question When was the telephone invented? Which slots are given? Verb invented Object telephone Which slots are asked? Temporal phrase linked to verb Document retrieval on internet with given slot keywords Parsing of sentences with all given slots Count most frequent entry found in asked slot (temporal phrase) Shapaqa: example When was the telephone invented? Google: invented AND the telephone produces 835 pages 53 parsed sentences with both slots and with a temporal phrase is through his interest in Deafness and fascination with acoustics that the telephone was invented in 1876, with the intent of helping Deaf and hard of hearing The telephone was invented by Alexander Graham Bell in 1876 When Alexander Graham Bell invented the telephone in 1876, he hoped that these same electrical signals could Shapaqa: example (2) So when was the phone invented? Internet answer is noisy, but robust 17: 1876 3: 1874 2:ago 2:later 1:Bell System was developed quickly Precision 76% (Google 31%) International competition (TREC): MRR 0.45 5

4 x OSWALD * www.anusha.com/jfk.htm situation in which Oswald shot Kennedy on November 22, 1963. * www.mcb.ucdavis.edu/people/hemang/spooky.html Lee Harvey Oswald shot Kennedy from a warehouse and ran. * www.gallica.co.uk/monarch.htm November 1963 U.S. President Kennedy was shot by Lee Harvey Oswald. * astrospeak.indiatimes.com/mystic_corner.htm Lee Harvey Oswald shot Kennedy from a warehouse and fled. 2 x BISHOP * www.powells.com/biblio/0-200/000637901x.html The day Kennedy was shot by Jim Bishop. * www.powells.com/biblio/49200-49400/0517431009.html The day Kennedy was shot by by Jim Bishop. 1 x BULLET * www.lustlysex.com/index_m.htm President John F. Kennedy was shot by a Republican bullet. 1 x MAN * www.ncas.org/condon/text/appndx-p.htm KENNEDY ASSASSINATION Kennedy was shot by a man who was not. Who shot Kennedy? Reuters Topic Detection and Tracking Le Monde De Standaard CNN First Story Detection Topic Detection Topic Tracking Topic Segmentation Link Detection Radio 1 WWW = document classification 6