ConText: Network Generation from Text Data. Jana Diesner Bochum, Apr 2015



Similar documents
Hands-on Tutorial: From Words to Networks: Extraction and Analysis of Semantic Network Data from Text Data. What we will do today

Technical Report. The KNIME Text Processing Feature:

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Building a Question Classifier for a TREC-Style Question Answering System

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

An Overview of a Role of Natural Language Processing in An Intelligent Information Retrieval System

CAPTURING THE VALUE OF UNSTRUCTURED DATA: INTRODUCTION TO TEXT MINING

Special Topics in Computer Science

Word Completion and Prediction in Hebrew

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

Term extraction for user profiling: evaluation by the user

Search Engines. Stephen Shaw 18th of February, Netsoc

Computer-aided Document Indexing System

Presented to The Federal Big Data Working Group Meetup On 07 June 2014 By Chuck Rehberg, CTO Semantic Insights a Division of Trigent Software

Computer Aided Document Indexing System

Search and Information Retrieval

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

Data Pre-Processing in Spam Detection

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015

dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING

Web Document Clustering

Clustering Technique in Data Mining for Text Documents

Taxonomies for Auto-Tagging Unstructured Content. Heather Hedden Hedden Information Management Text Analytics World, Boston, MA October 1, 2013

Natural Language to Relational Query by Using Parsing Compiler

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

ARTIFICIAL INTELLIGENCE METHODS IN STOCK INDEX PREDICTION WITH THE USE OF NEWSPAPER ARTICLES

Stock Market Prediction Using Data Mining

Facilitating Business Process Discovery using Analysis

Investigating Clinical Care Pathways Correlated with Outcomes

Knowledge Discovery from patents using KMX Text Analytics

Software Development Training Camp 1 (0-3) Prerequisite : Program development skill enhancement camp, at least 48 person-hours.

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER

Mining Text Data: An Introduction

How To Write A Summary Of A Review

Bridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project

Cleaned Data. Recommendations

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

Clustering Connectionist and Statistical Language Processing

Reverse Engineering of Relational Databases to Ontologies: An Approach Based on an Analysis of HTML Forms

An Introduction to Data Mining

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

3 Paraphrase Acquisition. 3.1 Overview. 2 Prior Work

Latent Dirichlet Markov Allocation for Sentiment Analysis

How To Use Neural Networks In Data Mining

CS4025: Pragmatics. Resolving referring Expressions Interpreting intention in dialogue Conversational Implicature

Why is Internal Audit so Hard?

SPATIAL DATA CLASSIFICATION AND DATA MINING

Movie Classification Using k-means and Hierarchical Clustering

Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

Why are Organizations Interested?

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

1 o Semestre 2007/2008

Transaction-Typed Points TTPoints

Research of Postal Data mining system based on big data

Text Processing with Hadoop and Mahout Key Concepts for Distributed NLP

SENTIMENT ANALYSIS: A STUDY ON PRODUCT FEATURES

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Survey Results: Requirements and Use Cases for Linguistic Linked Data

Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization

Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision

Natural Language Database Interface for the Community Based Monitoring System *

Terminology Extraction from Log Files

From Terminology Extraction to Terminology Validation: An Approach Adapted to Log Files

Big Data Analytics and Healthcare

Recommender Systems: Content-based, Knowledge-based, Hybrid. Radek Pelánek

Specialty Answering Service. All rights reserved.

Data Warehousing and Data Mining in Business Applications

Big Data Text Mining and Visualization. Anton Heijs

Financial Trading System using Combination of Textual and Numerical Data

How To Make Sense Of Data With Altilia

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.

Automated Content Analysis of Discussion Transcripts

Segmentation and Classification of Online Chats

Interactive Dynamic Information Extraction

Colorado School of Mines Computer Vision Professor William Hoff

A Survey on Product Aspect Ranking

Machine Learning Approach To Augmenting News Headline Generation

Distributed Computing and Big Data: Hadoop and MapReduce

IBM SPSS Modeler Text Analytics 16 User's Guide

Text Analytics. A business guide

Morphological Analysis and Named Entity Recognition for your Lucene / Solr Search Applications

Discovering suffixes: A Case Study for Marathi Language

Projektgruppe. Categorization of text documents via classification

Analyzing survey text: a brief overview

Semantic analysis of text and speech

Get the most value from your surveys with text analysis

Collecting Polish German Parallel Corpora in the Internet

Chapter ML:XI. XI. Cluster Analysis

Transcription:

ConText: Network Generation from Text Data Jana Diesner Bochum, Apr 2015

The concept of Semantic Networks Structured representations of information and knowledge Originally meant to be used for reasoning and inference E.g. theory of spreading activation Collins and Loftus (1975) 2

Semantic Networks: Usage Summarization: retrieve a concept s ego-network to distill the essence of some data in concise and structured form Disambiguation: network clustering to identify different aspect of the meaning of a concept Prediction: forecast the set of concepts that will be evoked when a certain note is activated Griffiths, Steyvers& Tenenbaum 2007 3

Text based Networks: Association Networks Les Miserables: M. E. J. Newman and M. Girvan, Finding and evaluating community structure in networks, Physical Review E 69, 026113 (2004). 4

Text based Networks: Relation Extraction: Social Networks (of tribes in Sudan) 2003 2004 2005 2006 2007 2008 Diesner J, Carley KM, Tamabyong L (2012) Mapping socio-cultural networks of Sudan from opensource, large-scale text data. Journal of Computational and Mathematical Organization Theory (CMOT) 18(3), 328-339.

Text based Networks: Relation Extraction: Organizations and Resources Conflict: Agriculture, Livestock (farmers vs. herders) War: Land Resource (concept of dar) Conflict and War: Oil, Civic, Transportation 6

Text based Networks: Meta Data Networks 7

Basic Types of Information in Text Data Morphology: structure of words E.g. spelling, inflections, derivations Syntax: relationships between words e.g. parts of speech tagging Semantics: meaning of language e.g. word sense disambiguation, grammars Pragmatics: language in context and social use of language e.g. sentiment analysis, discourse analysis Relation Extraction (this lab): borrows from all of the above 8

Methods for Constructing Networks of Words 1. Mental Models (Spreading Activation) (Collins & Loftus 1975) 2. Case Grammar and Frame Semantics (Fillmore 1982, 1986) 3. Discourse Representation Theory (Kamp 1981) 4. Knowledge representation in AI, assertional semantic networks (Shapiro 1971, Woods 1975) 5. Centering Resonance Analysis (Corman et al. 2002) 6. Mind maps (Buzan 1974) 7. Concept maps (Novak & Gowin 1984) 8. Hypertext (Trigg & Weiser 1986) 9. Qualitative text coding (Grounded Theory) (Glaser & Strauss 1967) 10. Definitional semantic networks incl. text coding with ontologies (Fellbaum 1998) 11. Semantic Web (Berners-Lee et al. 2001, Van Atteveldt 2008) 12. Frames (Minsky 1974) 13. Semantic Grammars (Franzosi 1989, Roberts 1997) 14. Network Text Analysis in social science (Carley & Palmquist 1991) 15. Event Coding in pol. science (King & Lowe 2003, Schrodt et al. 2008) 16. Semantic networks in comm. science (Danowski 1993, Doerfel 1998) 17. Probabilistic graphical models (Howard 1989, Pearl 1988) Automation Abstraction Generalization Diesner, J. (2012). Uncovering and managing the impact of methodological choices for the computational construction of socio-technical networks from texts. CMU-ISR-12-101, Carnegie Mellon University. 9

Text Data Network Analysis Need: scalable, reliable, robust methods & technologies Network representation Applications Answer substantive questions about social systems Visualizations Fill databases for archiving, search, retrieval Forecasting: Explore future behavior and what-if scenarios Analysis tools

Putting it all Together What happens once we have extracted relational data from texts? Load data into network analysis toolkit and perform analysis, visualization, simulation, etc Relation extraction from texts can be an alternative or supplemental relational data collection technique when: Classic methods for collecting network data fail Analysis of large amounts of data is constrained 11

Bravo!You have passed the primer in network text analysis! 12

Hands-on part: Types of techniques 1. Network Construction 1. From Meta Data 2. From Text Data 1. Node identification (and classification) 1. Dictionary-based (manually and/ or automatically built) 2. Edge identification (and classification) 1. Identification: Proximity based approach (co-occurrence) 3. Text pre-processing 1. Natural Language Processing (NLP) techniques: precondition for finding meaningful information, incl. representations of nodes and edges, in text data 2. Selection of relevant entities (positive filter: dictionary, entity extraction) or removal of irrelevant information (negative filter: delete list) given the research question, data, domain 13

Extracting Network data from meta data 14

Construction of Meta Data Networks Exploit meta data information that come along with data Example: LexisNexis Academic How to: Data curation: lexisnexis Pre-process: split up, clean up, put into relational database Network construction: meta data Generate networks: construct weighted networks of people, organizations, locations, information Visualize resulting networks in Gephi 15

Extracting network data from text data 16

Association Networks and Relation Extraction: One-mode and Multi-mode Networks Example from UN News Service (New York), 12-28-2004: Jan Pronk, the Special Representative of Secretary-General Kofi Annanto Sudan, today called for the immediate return of the vehicles to World Food Programme(WFP) and NGOs. extract relational data: one mode or semantic networks multi-mode or meta-networks Jan Pronk Jan Pronk Sudan Kofi Annan Sudan Kofi Annan vehicles vehicles WFP NGO s WFP NGO s Knowledge Person Location Organization Resource Identify relevant nodes Classify identified nodes 17

Classification Schema Ontology: the study of being or existence Taxonomy: practice & science of classification 18

Dictionary: Data Structure Text term (incl. n-gram), concept, entity class Functions: disambiguation (1,2), consolidation (3,4), n-gram concatenation (3,4) Examples: 1. Apple, apple, organization 2. apple, apple, resource 3. Barack Obama, Barack_Obama, agent 4. Barack Hussein Obama, Barack_Obama, agent

Dictionary: Usage How to: ConText: Text Analysis: codebook application See impact of codebook on text data» Helps to refine codebook Options: Normalization vs. entity class tagging Refinement (replace and insert) versus positive filter Positive filter: notion of adjacency

Dictionary: Construction From rule-based and deterministic methods to probabilistic and machine-learning based methods ConText: Load raw data, no preprocessing (needs proper syntax if possible) Text Analysis: Analysis: Entity Detection Use as baseline, refine

Dictionary: More details How to use them: List relevant entities: serves as positive filter Then construct one-mode network, aka semantic network Cross-classify relevant entities with ontological categories Then construct multi-mode network Limitations of manual approach: Tedious, incomplete, outdated, deterministic Help for constructing thesauri: Computer-supported: terms with high (weighted) frequency Automated: entity detection External sources (e.g. CIA World Fact Book, WordNet) Other automated techniques, e.g. Bootstrapping 22

Codebook Construction: Corpus Statistics Cumulative frequency: Bag of Words How to: Load texts into ConText Create Corpus Statistics This is one dimension of salience, prominence, importance What are other dimensions? Zipf, George Kingsley (1949). Human Behavior and the Principle of Least Effort. Cambridge, Mass.: Addison-Wesley 23

Codebook Construction: Corpus Statistics: TF*IDF What determines word s importance in corpus? Discriminating and distinguishing tf = term frequency (importance of term within document) tf = cumulative occurrence of term x in document y total number of terms in document y idf = inverse document freq. (importance of a term in corpus) idf= log total number of documents in corpus total number of documents containing term x tfidf= tf* idf tfidf: strategy and measure High if tf= high and df= low High for signal, low for noise How to: Generate: Concept List: Union 24

Codebook Construction: Stop words Stop words listed in delete list Serves as negative filter (remove from text data what s contained in list) How to: Text Analysis: preprocessing: remove stop words How to construct a delete list? Use predefined lists (data folder) Construct your own, one entry per line (incl. n-grams) Notion of adjacency (direct vs. rhetorical, which maintains original distance of words)

Text pre-processing: Stemming Detects inflections and derivations of concepts Converts each term into its morpheme How to: ConText: Text Analysis: preprocessing: stemming Two families of stemmers: Porter (rule-based): high efficiency, poor human readability (ConText) Krovetz(dictionary-based): lower efficiency, better human readability Porter, M.F. 1980. An algorithm for suffix stripping. I 14(3): 130-137. Krovetz, Robert (1995). Word Sense Disambiguation for Large Text Databases. Unpublished PhD Thesis, University of Massachusetts. 26

Text pre-processing: Parts of Speech (POS) Tagging Task: Assign single best tag or parts of speech (POS) to each word in a corpus (e.g. noun, verb, adjective, or adverb) What is the challenge here? Ambiguity resolution (can) -words match multiple tags depending on their context ConText: text analysis: preprocessing: part of speech tagging Church, K. (1988). A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. 2nd Conference on Applied Natural Language Processing, Austin, TX, 136-143. Diesner, J., & Carley, K. M. (2008). Looking under the hood of stochastic part of speech taggers (No. CMU-ISR-08-131R): Carnegie Mellon University, School of Computer Science, Institute for Software Research. 27

Finding salient terms: N-grams Meaningful multi-words units How to: Generate: generalization thesaurus: bigram Sort by decreasing frequency and decreasing tfidf, pick suitable entries, removed duplicates

Text pre-processing: When to stop? When to stop? ( criteria ) Orcam srazor: 14th-century English logician and Franciscan friar, William of Ockham. Aka lex parsimoniae(law of parsimony) Basic idea: All other things being equal, the simplest solution is the best. Why does it matter? Don t want: Overfitting Want: Generalizability 29

Linking nodes: Approaches Syntax and surface patterns(fillmore, Schrodt) Linguistics: parsing trees Logical and Knowledge Representation in Artificial Intelligence (Shapiro) first order calculus, predicate logic (quantifiers) Distance based (Danowski) Communications: distance in text (windowing) or in space (Euclidean) Probabilistic, learning from data (McCallum) Machine learning techniques: probabilistic (Bayesian), kernels (N-dimensional similarity), graphical models (hidden markov models, conditional random fields), boot strapping Cites and summary in Diesner, J., & Carley, K. M. (2010). Relation Extraction from Texts (in German, title: ExtraktionrelationalerDatenausTexten). In C. Stegbauer& R. Häußling(Eds.), Handbook Network Research (Handbuch Netzwerkforschung) (pp. 507-521). Vs Verlag. 30

Distance-based node linkage How to: Right hand side of codebook construction panel ConText: network construction: codebook a One mode (semantic network) versus multi mode Set parameters: Codebook Aggregation: one network per document or for all documents Distance: an appropriate window size Unit of analysis: where to stop the sliding window

References Recommended further readings related to this session: McCallum, A. (2005). Information extraction: distilling structured data from unstructured text. ACM Queue, 3(9), 48-57. Diesner, J., Carley, K. M. (2011): Semantic Networks. In G. Barnett (Ed), Encyclopedia of Social Networking, (pp. 595-598). Sage Publications. Diesner, J., Carley, K. M. (2011): Words and Networks. In G. Barnett (Ed.), Encyclopedia of Social Networking, (pp. 958-961). Sage Publications. 32

References Mental Models: Johnson-Laird, P. (1983). Mental Models. Cambridge, MA: Harvard University Press. Klimoski, R., & Mohammed, S. (1994). Team mental model: Construct or metaphor? Journal of Management, 20, 403-437. Rouse, W. B., & Morris, N. M. (1986). On looking into the black box; prospects and limits in the search for mental models. Psychological Bulletin, 100, 349-363.

Thank you! Q&A For any questions, comments, feedback, follow-upnow and in the future: Jana Diesner Email: jdiesner@illinois.edu Phone: ++1 (412) 519 7576 Web: http://people.lis.illinois.edu/~jdiesner 34