CS50AE: Information Extraction and Text Analytics Introduction Adam Wyner (Course Organiser) Advaith Siddharthan Reading: Chapter 1 (Jurafsky & Martin)
Course Admin - Website http://homepages.abdn.ac.uk/azwyner/pages/teaching/CS50AE/index.html
The MSc
Definition of Information Extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most of the cases this activity concerns processing human language texts by means of natural language processing (NLP). https://en.wikipedia.org/wiki/Information_extraction Modify this to "(semi-)automatically": the point is that some human interaction is useful for analysis development and querying.
Definition of Text Analytics The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation. https://en.wikipedia.org/wiki/Text_mining#Text_mining_and_text_analytics "Linguistic" here means adding information to the text (metadata/annotations) or using linguistic resources to support some of the other techniques.
Two approaches
- Maths heavy / knowledge light (in terms of knowledge of the domain or of language): statistical or machine learning approaches. Algorithmically compare and contrast large bodies of textual data, identifying regularities and similarities. Large corpora. Sparse data problem. Often needs a gold standard. No rules extracted. Opaque to modification.
- Maths light / knowledge heavy (in terms of lists, rules, and processes): labour and knowledge intensive. Particular corpora (extensible). Create a gold standard. Transparent analysis.
One can use either approach or mix them, depending on what one wants to do and what results one wants to achieve.
Examples of Text Analytics
- Text Classification
- Sentiment Analysis
- Information Retrieval
- Text Summarisation
- Named Entity Identification
- Argumentation Mining
- Concept analysis and extraction
- Ontology population
- Rule extraction
- Linking resources
- Coreference Resolution
- Relationship Identification
Introduction: What is Linguistics? The study of language breaks down into a number of fields:
- Phonetics: sound signal <-> phonemes
- Morphology: eat, eating, eats, eaten, ate
- Syntax: "the dog ate the cat" vs. "the cat ate the dog"
Introduction: What is Linguistics?
- Semantics: "Delete all text files" -> rm *.txt
- Pragmatics: "Do you know what time it is?", "Can I have some cake?"
Pragmatics
Natural Language Processing (NLP) Computer programs that can analyse human written texts:
- Use black-box models based on statistics or machine learning
- Implement algorithms and data structures based on linguistic theories
- Create linguistic resources which describe a language: dictionaries, grammars, corpora, ...
Example How to extract relationships from: The word of the Lord came to Zechariah, son of Berekiah, son of Iddo, the prophet.
Example The word of the Lord came to Zechariah, son of Berekiah, son of Iddo, the prophet. Candidate relations: son_of(Zechariah, Berekiah) son_of(Zechariah, Iddo) son_of(Berekiah, Iddo) prophet(Iddo) prophet(Berekiah) prophet(Zechariah)
Example: Local Attachment Heuristic The word of the Lord came to Zechariah, son of Berekiah, son of Iddo, the prophet. Attaching each modifier to the nearest preceding name keeps: son_of(Zechariah, Berekiah) son_of(Berekiah, Iddo) prophet(Iddo)
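The local attachment heuristic can be sketched in code: walk through the sentence and attach each appositive ("son of X", "the prophet") to the nearest preceding name. The pattern matcher below is a hypothetical toy built only for this one sentence shape, not a general relation extractor.

```python
import re

def extract_relations(sentence):
    """Extract relations from an appositive chain, attaching each
    modifier to the nearest preceding name (local attachment)."""
    # Split on commas; each chunk is a name mention or an appositive.
    chunks = [c.strip() for c in sentence.rstrip(".").split(",")]
    relations = []
    last_name = None  # the nearest preceding name seen so far
    for chunk in chunks:
        m = re.search(r"came to (\w+)$", chunk)
        if m:  # main clause introduces the first name
            last_name = m.group(1)
            continue
        m = re.match(r"son of (\w+)$", chunk)
        if m:  # "son of X" attaches to the nearest preceding name
            relations.append(("son_of", last_name, m.group(1)))
            last_name = m.group(1)  # X is now the nearest name
            continue
        if chunk == "the prophet" and last_name:
            relations.append(("prophet", last_name))
    return relations

sentence = ("The word of the Lord came to Zechariah, "
            "son of Berekiah, son of Iddo, the prophet.")
print(extract_relations(sentence))
# -> [('son_of', 'Zechariah', 'Berekiah'),
#     ('son_of', 'Berekiah', 'Iddo'), ('prophet', 'Iddo')]
```

Note that the heuristic commits to one reading and silently discards the others, e.g. the (historically attested) reading prophet(Zechariah).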
History 1940s and 1950s Fundamental theoretical developments:
- Formal language theory (e.g. Chomsky)
- Noisy channel model for transmission of language, identifying redundancy and patterns (Shannon and Weaver)
- The beginnings of Information Retrieval: Luhn (1957): "the frequency of word occurrence in an article furnishes a useful measurement of word significance"
History 1960s Symbolic models inspired by Chomsky's context-free and transformational grammar Salton (1968): Vector Space Model for Information Retrieval Document clustering based on vector similarity
History 1970s Explicit use of grammars and parsing Development of hidden Markov models Logic-based approaches to syntax and reasoning K. Spärck Jones (1972): Inverse Document Frequency and tf*idf
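Spärck Jones's tf*idf weighting and Salton's vector-space similarity can be sketched together. The toy corpus and the natural-log idf variant below are illustrative choices; real systems use larger collections and smoothed weighting schemes.

```python
import math
from collections import Counter

docs = [
    "the dog ate the cat",
    "the cat ate the dog",
    "time flies like an arrow",
]

def tfidf(doc, corpus):
    """Weight each term w in doc by tf(w) * log(N / df(w))."""
    tf = Counter(doc.split())
    N = len(corpus)
    df = lambda w: sum(w in d.split() for d in corpus)
    return {w: tf[w] * math.log(N / df(w)) for w in tf}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(wt * v.get(w, 0.0) for w, wt in u.items())
    norm = lambda x: math.sqrt(sum(wt * wt for wt in x.values()))
    return dot / (norm(u) * norm(v))

v0, v1, v2 = (tfidf(d, docs) for d in docs)
print(round(cosine(v0, v1), 2))  # identical bags of words -> 1.0
print(round(cosine(v0, v2), 2))  # no shared terms -> 0.0
```

The first pair of documents gets similarity 1.0 even though "the dog ate the cat" and "the cat ate the dog" mean opposite things: bag-of-words vectors discard word order, which is exactly the syntactic distinction highlighted earlier.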
History 1980s and 1990s Construction of question-answering systems for small domains (PHLIQA, Core Language Engine) Revival of work on finite-state models, e.g. for morphology Revival of probabilistic models based on IBM models of speech recognition: part-of-speech tagging, statistical parsing, connectionist approaches Beginning of work in information extraction (JASPER: real-time extraction of financial news) The beginning of the annual Text REtrieval Conference (TREC) and Message Understanding Conference (MUC), with a focus on system evaluation
History 2000s to present Standard use of probabilistic and data-driven models throughout the field, informed by theoretical insights Increasingly rigorous evaluation methodologies Commercial exploitation (a billion-dollar business), e.g. Sentiment Analysis and Opinion Mining, NER, relationship mining
Ambiguity Perhaps the most significant problem for language recognition/interpretation/understanding: many sentences are ambiguous:
- Time flies like an arrow
- I made her duck
The computer sees ambiguities we don't:
- I shot an elephant in my pyjamas
Resolve with knowledge: world knowledge, contextual knowledge, statistical knowledge
Research We will be discussing: State-of-the-art systems which don't work perfectly, but often well enough for some practical purpose Theories and models which are the best we can do but might still have many problems Text Analytics and Information Extraction are research areas!