CS50AE: Information Extraction and Text Analytics

Introduction
Adam Wyner (Course Organiser), Advaith Siddharthan
Reading: Chapter 1 (Jurafsky & Martin)

Course Admin
Website: http://homepages.abdn.ac.uk/azwyner/pages/teaching/CS50AE/index.html

The MSc

Definition of Information Extraction
"Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most of the cases this activity concerns processing human language texts by means of natural language processing (NLP)."
Source: https://en.wikipedia.org/wiki/information_extraction
Suggested modification: read "automatically" as "(semi-)automatically", since some human interaction is useful when developing the analysis and when querying.

Definition of Text Analytics
"The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation."
Source: https://en.wikipedia.org/wiki/text_mining#text_mining_and_text_analytics
"Linguistic" here means adding information to the text (metadata/annotations) or using linguistic resources to support some of the other techniques.

Two approaches
Maths heavy/knowledge light (in terms of knowledge of the domain or of language): statistical or machine learning approaches.
- Algorithmically compare and contrast large bodies of textual data, identifying regularities and similarities.
- Large corpora.
- Sparse data problem.
- Often needs a gold standard.
- No rules extracted.
- Opaque to modification.
Maths light/knowledge heavy (in terms of lists, rules, and processes).
- Labour and knowledge intensive.
- Particular corpora (extensible).
- Create a gold standard.
- Transparent analysis.
One can use either approach or mix them. The choice depends on what one wants to do and what results one wants to achieve.

Examples of Text Analytics
- Text Classification
- Sentiment Analysis
- Information Retrieval
- Text Summarisation
- Named Entity Identification
- Argumentation Mining
- Concept analysis and extraction
- Ontology population
- Rule extraction
- Linking resources
- Coreference Resolution
- Relationship Identification

Introduction: What is Linguistics?
The study of language breaks down into a number of fields:
- Phonetics: sound signal <-> phonemes
- Morphology: eat, eating, eats, eaten, ate
- Syntax:
  - the dog ate the cat
  - the cat ate the dog

Introduction: What is Linguistics?
- Semantics: Delete all text files -> rm *.txt
- Pragmatics:
  - Do you know what time it is?
  - Can I have some cake?

Pragmatics

Natural Language Processing (NLP)
Computer programs that can analyse human-written texts:
- Use black-box models based on statistics or machine learning
- Implement algorithms and data structures based on linguistic theories
- Create linguistic resources which describe a language (dictionaries, grammars, corpora)

Example
How to extract relationships from:
The word of the Lord came to Zechariah, son of Berekiah, son of Iddo, the prophet.

Example
The word of the Lord came to Zechariah, son of Berekiah, son of Iddo, the prophet.
- son_of(Zechariah, Berekiah)
- son_of(Zechariah, Iddo)
- son_of(Berekiah, Iddo)
- prophet(Iddo)
- prophet(Berekiah)
- prophet(Zechariah)

Example: Local Attachment Heuristic
The word of the Lord came to Zechariah, son of Berekiah, son of Iddo, the prophet.
- son_of(Zechariah, Berekiah)
- son_of(Zechariah, Iddo)
- son_of(Berekiah, Iddo)
- prophet(Iddo)
- prophet(Berekiah)
- prophet(Zechariah)
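
One reading of the local attachment heuristic is that each modifier ("son of X", "the prophet") attaches to the nearest preceding name. The following is a minimal sketch under that assumption. The function name local_attachment, the comma-based chunking, and the capitalised-word test for names are all illustrative choices, not part of the course material, and they only cover sentences shaped like the example.

```python
import re

SENTENCE = ("The word of the Lord came to Zechariah, son of Berekiah, "
            "son of Iddo, the prophet.")


def local_attachment(sentence):
    """Attach each appositive to the nearest preceding name (hypothetical helper)."""
    relations = []
    last_name = None
    # Treat each comma-separated chunk as either a modifier or plain text.
    for chunk in sentence.rstrip(".").split(","):
        chunk = chunk.strip()
        son = re.fullmatch(r"son of (\w+)", chunk)
        if son and last_name:
            # "son of X" modifies the closest name seen so far.
            relations.append(("son_of", last_name, son.group(1)))
            last_name = son.group(1)
        elif chunk == "the prophet" and last_name:
            # "the prophet" also modifies the closest name seen so far.
            relations.append(("prophet", last_name))
        else:
            # Assumption: the last capitalised word in plain text is the name.
            names = re.findall(r"\b[A-Z]\w+", chunk)
            if names:
                last_name = names[-1]
    return relations


for relation in local_attachment(SENTENCE):
    print(relation)
```

Under this heuristic only son_of(Zechariah, Berekiah), son_of(Berekiah, Iddo) and prophet(Iddo) are produced. Whether those are the intended readings is a separate question that the heuristic cannot answer by itself.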

History: 1940s and 1950s
Fundamental theoretical developments:
- Formal language theory (e.g. Chomsky)
- Noisy channel model for transmission of language by identifying redundancy and patterns (Shannon and Weaver)
- The beginnings of Information Retrieval: Luhn (1957): "the frequency of word occurrence in an article furnishes a useful measurement of word significance"

History: 1960s
- Symbolic models inspired by Chomsky's context-free and transformational grammar
- Salton (1968): Vector Space Model for Information Retrieval
- Document Clustering based on vector similarity

History: 1970s
- Explicit use of grammars and parsing
- Development of hidden Markov models
- Logic-based approaches to syntax and reasoning
- K. Spärck Jones (1972): Inverse Document Frequency and tf*idf
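
The tf*idf weighting named above can be illustrated with a short sketch. It assumes one common variant: raw term counts for tf and log(N/df) for idf, where N is the number of documents and df the number of documents containing the term. The function name tf_idf, the whitespace tokenisation, and the three toy documents are assumptions for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = ["the dog ate the cat", "the cat ate the dog", "the prophet spoke"]
weights = tf_idf(docs)
print(weights[2])
```

A term that occurs in every document (here "the") gets weight zero, while a term confined to one document (here "prophet") gets the highest weight for its count.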

History: 1980s and 1990s
- Construction of Question-Answering systems for small domains (PHLIQA, Core Language Engine)
- Revival of work on finite-state models, e.g. for morphology
- Revival of probabilistic models based on IBM models of speech recognition: part-of-speech tagging, statistical parsing, connectionist approaches
- Beginning of work in information extraction (JASPER: real-time extraction of financial news)
- The beginning of the annual Text REtrieval Conference (TREC) and Message Understanding Conference (MUC), with a focus on system evaluation

History: 2000s to present
- Standard use of probabilistic and data-driven models throughout the field, informed by theoretical insights
- Increasingly rigorous evaluation methodologies
- Commercial exploitation (a billion-dollar business), e.g. Sentiment Analysis and Opinion Mining, NER, relationship mining

Ambiguity
Perhaps the most significant problem for language recognition/interpretation/understanding:
- Many sentences are ambiguous
  - Time flies like an arrow
  - I made her duck
- Computer sees ambiguities we don't
  - I shot an elephant in my pyjamas
- Resolve with knowledge
  - world knowledge, contextual knowledge, statistical knowledge
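
To see why a program finds more ambiguity than a human reader does, one can enumerate every part-of-speech assignment that a lexicon allows for "Time flies like an arrow". The sketch below uses a toy lexicon invented for this illustration (it is not a real linguistic resource, and the tag names are assumptions). It makes no attempt to filter out ungrammatical sequences.

```python
from itertools import product

# Toy lexicon: an assumption for illustration, not a real resource.
LEXICON = {
    "time": ["NOUN", "VERB"],
    "flies": ["NOUN", "VERB"],
    "like": ["VERB", "PREP"],
    "an": ["DET"],
    "arrow": ["NOUN"],
}


def candidate_taggings(sentence):
    """List every tag sequence the lexicon permits (hypothetical helper)."""
    words = sentence.lower().split()
    tag_options = [LEXICON[word] for word in words]
    # Cartesian product: one tag choice per word, all combinations.
    return [list(zip(words, tags)) for tags in product(*tag_options)]


readings = candidate_taggings("Time flies like an arrow")
print(len(readings), "candidate taggings")
for reading in readings:
    print(reading)
```

Even with two options for only three of the five words there are eight candidate taggings. Picking among them is where world, contextual, or statistical knowledge comes in.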

Research
We will be discussing:
- State-of-the-art systems which don't work perfectly, but often well enough for some practical purpose
- Theories and models which are the best we can do but might still have many problems
Text Analytics and Information Extraction are research areas!