Bridging CAQDAS with text mining: Text analyst's toolbox for Big Data: Science in the Media Project




Ahmet Suerdem
Istanbul Bilgi University; LSE Methodology Dept.
The Science in the Media project is funded by the British Council, TUBITAK and the Indian-European Research Networking Programme in the Social Sciences (ANR-DFG-ESRC-NWO with ICSSR).

OUTLINE
OBJECTIVE: GROUNDING TEXT ANALYTICS TOOLS ON A METHODOLOGICAL FOUNDATION BY USING CAQDAS
CASE: SCIENCE IN THE MEDIA MONITORING PROJECT
CORPUS CONSTRUCTION
SCIENCE IN CONTEXT: ONTOLOGIES AND CODING FRAMES
DETECTING ATTITUDES: SENTIMENT ANALYSIS
TRIANGULATING TEXT ANALYSIS FINDINGS WITH SOCIAL RESEARCH METHODS: SURVEYS, FOCUS GROUPS, ETC.

OBJECTIVE: GROUNDING TEXT ANALYTICS TOOLS ON A METHODOLOGICAL FOUNDATION USING CAQDAS

SCIENCE IN THE MEDIA MONITORING PROJECT
AIM: CITIZEN RESEARCH, ENGAGING THE PUBLIC WITH ST&I
Text analytics: a monopoly of government and big business.
Aim of SMM: to provide stakeholders (politicians, policy-makers, NGOs, social movements, consumer and patient associations, and individual researchers) with text analytics tools for monitoring public opinion about Science, Technology and Innovation (ST&I) issues as reflected in the popular media.
OPEN TEXT ANALYTICS
Citizens should be able to use CAQDAS and text analytics tools to collect evidence for their positions.

SCIENCE IN THE MEDIA MONITORING PROJECT
LSE AND ISTANBUL BILGI UNIVERSITY KNOWLEDGE PARTNERSHIP, FUNDED BY THE BRITISH COUNCIL
TRENDS IN PUBLIC OPINION ABOUT SCIENCE AND TECHNOLOGY
The system has several components and programs to crawl the web, classify the contexts, store the data and analyze the text.
http://capulingturkey.com/
Big data:
Retrieves news as RSS feeds every two hours and stores them in a database.
Covers all the news and columns in the popular newspaper Hurriyet since March 2013.
Filters ST&I-relevant news with the help of a dictionary.
Calculates visibility indices, such as the proportion of ST news in the total news body.
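The dictionary filter and the visibility index can be illustrated with a minimal sketch. The term list, the function names and the sample headlines are assumptions made for illustration; the project's actual dictionary and index definitions are not given in the slides.

```python
import re

# Hypothetical ST&I dictionary; the project's real dictionary is not shown in the slides.
STI_TERMS = {"science", "scientist", "technology", "research", "vaccine", "genome"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def is_sti_relevant(text, terms=STI_TERMS, min_hits=1):
    """Flag an item as ST&I relevant if it contains at least min_hits dictionary terms."""
    hits = sum(1 for tok in tokenize(text) if tok in terms)
    return hits >= min_hits

def visibility_index(items):
    """Share of ST&I-relevant items in the total news body (0.0 for an empty list)."""
    if not items:
        return 0.0
    relevant = sum(1 for item in items if is_sti_relevant(item))
    return relevant / len(items)

news = [
    "New vaccine research published by university scientists",
    "Local team wins the football derby",
    "Parliament debates the budget",
    "Technology firms invest in genome sequencing",
]
print(visibility_index(news))  # 0.5
```

Exact token matching is the simplest possible choice; a real dictionary for Turkish news would probably need stemming or lemmatization, which this sketch leaves out.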

web corpora: using the web as a linguistic data source
1) a web crawler;
2) a web interface for crawling management and validation;
3) conversion tools;
4) HTML cleaner tools;
5) anti-duplicate filters;
6) a PoS tagger;
7) metadata.
BUT also the context of the communication situation, i.e. who the speaker/writer is, what the topic is, what semantic domain the topic belongs to, what the mode of communication is, etc.
-> BRIDGING CAQDAS W/ TEXT ANALYTICS: SOME EXERCISES
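Two of the components listed above, the HTML cleaner and the anti-duplicate filter, can be sketched with the Python standard library alone. The class and function names are hypothetical, and the hash-based filter only catches pages whose text is identical after normalization; it is not the project's actual tooling.

```python
import hashlib
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

def clean_html(html):
    """Strip markup and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

def fingerprint(text):
    """Hash of the text after lowercasing and removing punctuation."""
    normalized = re.sub(r"\W+", " ", text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def deduplicate(pages):
    """Return the cleaned text of each page, keeping only the first copy of duplicates."""
    seen, unique = set(), []
    for html in pages:
        text = clean_html(html)
        key = fingerprint(text)
        if text and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

pages = [
    "<html><body><h1>Science news</h1><p>A new telescope.</p></body></html>",
    "<html><body><script>var x = 1;</script><h1>Science  News</h1><p>A new telescope!</p></body></html>",
    "<html><body><p>Football results.</p></body></html>",
]
print(deduplicate(pages))  # ['Science news A new telescope.', 'Football results.']
```

Near-duplicates that differ by a sentence or two would need a fuzzier technique such as shingling, which is outside this sketch.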

corpus construction
A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research. (Sinclair 2004)

What is the domain of Science and Technology? General vs. topical corpus
Complementing the linguistic information with contextual social information.
We are sampling not only text units but also opinions, attitudes, behaviours, events, social representations, etc.
Purposive sampling: starts with keywords relevant to the topic, followed by an iterative search for other relevant keywords.
Usually done with ad hoc keyword queries (Google, LexisNexis); it should be more methodologically grounded.
Semantic: lexical analysis of word meanings and the relations between them.
Pragmatic: involves feedback from multiple users (audience).
Hermeneutical circle: use of classical CAQDAS to build the initial categories.
Saturation: defining the boundaries of a knowledge domain.

Semantic and statistical description of a topic
Sub-corpora: a specific functional or semantic domain (law/administration, economy, literature, fashion, etc.).
Gathering linguistic data for each sub-corpus requires a targeted crawling strategy.
An underlying semantic theme: a document consisting of a large number of words might be concisely modelled as deriving from a smaller number of topics.
Statistical: a topic is a probability distribution over the terms in a vocabulary.
But also purposive: hermeneutical grounding of the terms in the social context.
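The statistical definition can be written compactly. The notation is an assumption introduced here, not taken from the slides: V is the vocabulary, K the number of topics, z the topic of a word w, and d a document.

```latex
\[
\phi_k = \bigl(p(w \mid z = k)\bigr)_{w \in V},
\qquad \sum_{w \in V} p(w \mid z = k) = 1,
\]
\[
p(w \mid d) = \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d),
\qquad K \ll |V| .
\]
```

The first line states that each topic is a probability distribution over the vocabulary; the second expresses a document's word distribution as a mixture over a much smaller number of topics.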

corpus theoretical paradox: solving the problem w/ text mining tools and CAQDAS
Iterative: this is a circular process. We start with initial keywords, and maybe nothing more than the keywords.
We select the corpus according to some representativeness criterion (i.e. keyword search), and then analyse it empirically to detect the keywords.
The question is how to select further keywords that are most informative about the topic domain: superordinate and subordinate concepts, hyponymy, hypernymy.
Some text mining solutions: context determination techniques such as:
Word seeding: seed a keyword reflecting the domain feature, e.g. animal, and automatically extract a large set of surrounding extraction patterns (context words). This can yield the hyponyms: pigs, chickens, horses, etc.
LDA: automatically discovers the topics that semantic contexts (sentences, paragraphs, chapters) contain. LDA represents documents as mixtures of topics that spit out words with certain probabilities.
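Word seeding can be sketched as a context-window count around the seed keyword. The window size, the stopword list and the sample sentences are assumptions for illustration, and the seed is matched as an exact token without stemming, so the plural "animals" is used here.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "and", "of", "in", "on", "are", "is", "such", "as", "to"}

def context_words(sentences, seed, window=5):
    """Count non-stopword tokens appearing within `window` tokens of the seed."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, tok in enumerate(tokens):
            if tok == seed:
                lo, hi = max(0, i - window), i + window + 1
                for neighbour in tokens[lo:i] + tokens[i + 1:hi]:
                    if neighbour not in STOPWORDS and neighbour != seed:
                        counts[neighbour] += 1
    return counts

sentences = [
    "Farm animals such as pigs and horses need space.",
    "Pigs and chickens are farm animals.",
    "The budget vote is on Monday.",
]
candidates = context_words(sentences, seed="animals")
print(candidates["pigs"], candidates["horses"], candidates["chickens"])  # 2 1 1
```

The counts only propose candidate keywords; deciding which of them belong to the topic domain remains an interpretive step for the analyst.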

LDA: example
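The original slide's example is not preserved in the transcript. As a stand-in, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling in pure Python. The function name, the hyperparameters (alpha, beta), the iteration count and the toy documents are all assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (theta, phi, vocab): theta[d][k] estimates p(topic k | doc d),
    phi[k][v] estimates p(word vocab[v] | topic k).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]       # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(n_topics)]  # times word v is assigned to topic k
    topic_total = [0] * n_topics                     # tokens assigned to topic k overall
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][index[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = index[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][v] + beta) / (topic_total[j] + V * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + V * beta) for v in range(V)]
        for k in range(n_topics)
    ]
    return theta, phi, vocab

docs = [
    "gene dna genome gene cell".split(),
    "dna cell gene genome".split(),
    "rocket orbit satellite rocket".split(),
    "satellite orbit launch rocket".split(),
]
theta, phi, vocab = lda_gibbs(docs)
for k, dist in enumerate(phi):
    top = sorted(zip(dist, vocab), reverse=True)[:3]
    print("topic", k, [w for _, w in top])
```

Each row of `theta` is a document's mixture over topics and each row of `phi` is a topic's distribution over the vocabulary. On a corpus this small the result depends heavily on the random seed, so the printed topics should be read as illustrative only.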

Supervised topic models
Start from already human-coded text segments.
Use the usual CAQDAS approaches: rigorous, methodological coding and thematization of the text.
Then use supervised machine learning techniques such as:
Supervised LDA
Naive Bayes
K-Means
SVM
etc.
Improve the topic keywords.
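As an illustration of learning from human-coded segments, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. The segments, the codes ("health", "energy") and the class name are hypothetical; in practice the training data would be exported from a CAQDAS project.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, segments, codes):
        self.class_counts = Counter(codes)
        self.word_counts = defaultdict(Counter)
        for text, code in zip(segments, codes):
            self.word_counts[code].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_code, best_score = None, -math.inf
        for code, n in self.class_counts.items():
            counts = self.word_counts[code]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior of the code
            for tok in tokenize(text):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_code, best_score = code, score
        return best_code

# Hypothetical human-coded segments.
segments = [
    "new vaccine trial protects patients",
    "hospital study on patient health",
    "solar power plant cuts energy costs",
    "wind energy production rises",
]
codes = ["health", "health", "energy", "energy"]

model = NaiveBayes().fit(segments, codes)
print(model.predict("vaccine study for patients"))  # health
print(model.predict("solar and wind power"))        # energy
```

The words with the highest per-code probabilities can then be inspected as candidate topic keywords, which is one way to feed the classifier's output back into the coding frame.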

Coding frame and ontological terminology engineering
Modelling concepts and the relations between them.
Concept: described by means of characteristics that denote properties of the individual referents belonging to the extension of that concept.
The idea is similar to codebook building.
Indexing
Available ontologies: SNOMED, DEWEY

Suggested terminological anthropology: OECD Frascati manual for ST&I classification
1. Exploration and exploitation of the Earth.
2. Infrastructure and general planning of land use.
3. Control and care of the environment.
4. Protection and improvement of human health.
5. Production, distribution and rational utilisation of energy.
6. Agricultural production and technology.
7. Industrial production and technology.
8. Social structures and relationships.
9. Exploration and exploitation of space.
10. Non-oriented research.
11. Other civil research.
12. Defence.
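A coding frame built on these categories can be sketched as a mapping from category labels to keyword sets. The three category labels come from the list above; the keyword lists and the sample sentence are illustrative assumptions, not the project's codebook.

```python
import re
from collections import Counter

# Category labels follow the list above; the keyword sets are illustrative assumptions.
CODING_FRAME = {
    "Protection and improvement of human health": {"vaccine", "hospital", "disease"},
    "Production, distribution and rational utilisation of energy": {"solar", "nuclear", "energy"},
    "Exploration and exploitation of space": {"satellite", "rocket", "orbit"},
}

def code_text(text, frame=CODING_FRAME):
    """Return a Counter mapping each matched category to its number of keyword hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = Counter()
    for category, keywords in frame.items():
        n = sum(1 for tok in tokens if tok in keywords)
        if n:
            hits[category] = n
    return hits

article = "A solar-powered satellite reached orbit while the nuclear energy debate continues."
for category, n in code_text(article).most_common():
    print(n, category)
```

A text can receive several codes at once, as in this example, which matches how segments are usually multi-coded in CAQDAS software.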

Ontology learning; grounded theory; word space theory
Bottom-up categorization: getting the themes out of the text itself.
Cluster analysis, correspondence analysis, formal concept analysis, semantic network analysis.
Grounded theory; thematic analysis.
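One of the bottom-up techniques named above, semantic network analysis, can be sketched as a word co-occurrence network. The text unit (one headline per unit), the stopword list and the weight threshold are assumptions for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "of", "and", "in", "for", "to", "is"}

def cooccurrence_network(texts):
    """Weighted edges: the number of text units containing both words of a pair."""
    edges = Counter()
    for text in texts:
        words = sorted(set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS)
        edges.update(combinations(words, 2))
    return edges

def neighbours(edges, word, min_weight=2):
    """Words linked to `word` by at least min_weight shared text units."""
    linked = set()
    for (a, b), weight in edges.items():
        if weight >= min_weight and word in (a, b):
            linked.add(b if a == word else a)
    return linked

texts = [
    "Gene therapy for the heart",
    "Gene therapy and cancer",
    "Rocket launch to orbit",
    "The rocket is in orbit",
]
edges = cooccurrence_network(texts)
print(neighbours(edges, "gene"))    # {'therapy'}
print(neighbours(edges, "rocket"))  # {'orbit'}
```

Groups of strongly linked words suggest candidate themes, which the analyst can then name and refine in the manner of grounded theory.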