Bridging CAQDAS with text mining: the text analyst's toolbox for Big Data: Science in the Media Project Ahmet Suerdem, Istanbul Bilgi University; LSE Methodology Dept. The Science in the Media project is funded by the British Council, TUBITAK and the Indian-European Research Networking Programme in the Social Sciences (ANR-DFG-ESRC-NWO with ICSSR)
OUTLINE
OBJECTIVE: GROUNDING TEXT ANALYTICS TOOLS IN A METHODOLOGICAL FRAMEWORK USING CAQDAS
CASE: SCIENCE IN THE MEDIA MONITORING PROJECT
CORPUS CONSTRUCTION
SCIENCE IN CONTEXT: ONTOLOGIES AND CODING FRAMES
DETECTING ATTITUDES: SENTIMENT ANALYSIS
TRIANGULATING TEXT ANALYSIS FINDINGS WITH SOCIAL RESEARCH METHODS: SURVEYS, FOCUS GROUPS, ETC.
OBJECTIVE: GROUNDING TEXT ANALYTICS TOOLS IN A METHODOLOGICAL FRAMEWORK USING CAQDAS
SCIENCE IN THE MEDIA MONITORING PROJECT
AIM: CITIZEN RESEARCH, ENGAGING THE PUBLIC WITH ST&I
Text analytics: a monopoly of government and big business
Aim of SMM: providing stakeholders such as politicians, NGOs, social movements, consumer and patient associations and policy-makers, as well as individual researchers, with text analytics tools to monitor public opinion about Science, Technology and Innovation (ST&I) issues as reflected in the popular media
OPEN TEXT ANALYTICS: citizens should be able to use CAQDAS and text analytics tools to collect evidence for their positions
SCIENCE IN THE MEDIA MONITORING PROJECT
LSE AND ISTANBUL BILGI UNIVERSITY KNOWLEDGE PARTNERSHIP FUNDED BY THE BRITISH COUNCIL
TRENDS IN PUBLIC OPINION ABOUT SCIENCE AND TECHNOLOGY
The system has several components and programs to crawl the web, classify the content, store the data and analyse the text. http://capulingturkey.com/
Big data: retrieves news as RSS feeds every two hours and puts them in a database
All the news and columns in the popular newspaper Hurriyet since March 2013
Filters ST&I-relevant news with the help of a dictionary
Calculates visibility indices, e.g. the proportion of ST&I items in the total news body
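A minimal sketch of the feed retrieval and dictionary-filtering step described above, assuming Python with the feedparser library; the feed URL and the ST&I keyword list are placeholders, not the project's actual resources.

```python
import feedparser

# Hypothetical feed URL and ST&I keyword dictionary (placeholders, not the project's resources)
FEED_URL = "https://example.com/rss/news"
STI_DICTIONARY = {"science", "technology", "innovation", "research", "genetic"}

def fetch_items(url=FEED_URL):
    """Retrieve the current items of an RSS feed (run e.g. every two hours via a scheduler)."""
    feed = feedparser.parse(url)
    return [{"title": e.get("title", ""), "summary": e.get("summary", "")} for e in feed.entries]

def is_sti_relevant(item, dictionary=STI_DICTIONARY):
    """Keep an item if any dictionary term occurs in its title or summary."""
    text = (item["title"] + " " + item["summary"]).lower()
    return any(term in text for term in dictionary)

items = fetch_items()
sti_items = [it for it in items if is_sti_relevant(it)]
# Visibility index: proportion of ST&I items in the total news body
visibility = len(sti_items) / len(items) if items else 0.0
print(f"ST&I visibility: {visibility:.2%}")
```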
web corpora: using the web as a linguistic data source
1) a web crawler; 2) a web interface for crawl management and validation; 3) conversion tools; 4) HTML cleaning tools; 5) anti-duplicate filters; 6) a PoS tagger; 7) metadata
BUT also the context of the communication situation, i.e. who the speaker/writer is, what the topic is, what semantic domain the topic belongs to, what the mode of communication is, etc.
-> BRIDGING CAQDAS W/ TEXT ANALYTICS: SOME EXERCISES
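A rough sketch of the cleaning end of this pipeline (HTML stripping, duplicate filtering, PoS tagging), assuming BeautifulSoup and NLTK are available; the sample pages are invented and a real crawl would need more robust boilerplate removal.

```python
import hashlib
from bs4 import BeautifulSoup
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def clean_html(html):
    """Strip markup and script/style blocks, keep running text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

def deduplicate(texts):
    """Drop exact duplicates by hashing the cleaned text."""
    seen, unique = set(), []
    for t in texts:
        h = hashlib.md5(t.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(t)
    return unique

pages = ["<p>Gene therapy trial announced.</p>", "<p>Gene therapy trial announced.</p>"]
corpus = deduplicate([clean_html(p) for p in pages])
tagged = [nltk.pos_tag(nltk.word_tokenize(doc)) for doc in corpus]
print(tagged[0])  # e.g. [('Gene', 'NNP'), ('therapy', 'NN'), ...]
```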
corpus construction A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research. (Sinclair 2004)
What is the domain of Science and Technology? General vs. topical corpus
Complementing the linguistic information with contextual social information
We are not only sampling text units but also opinions, attitudes, behaviours, events, social representations, etc.
Purposive sampling: starts with keywords relevant to the topic, then iteratively searches for other relevant keywords
Usually done with ad hoc keyword queries (GOOGLE, LEXIS/NEXIS); should be more methodological
Semantic: lexical; analysis of word meanings and the relations between them
Pragmatic: involves feedback from multiple users (the audience)
Hermeneutical circle: use of classical CAQDAS to build the initial categories
Saturation: defining the boundaries of a knowledge domain
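One way to make the purposive, iterative keyword search more systematic is to expand the seed list from the retrieved documents themselves; a toy sketch with invented seed terms and documents, where the proposed candidates would still go through human, CAQDAS-style validation.

```python
from collections import Counter

SEEDS = {"genetics", "vaccine"}  # initial keywords, assumed for illustration
docs = [
    "new vaccine trial raises hopes for immunisation policy",
    "genetics research maps dna mutations in cancer cells",
    "football scores and transfer rumours dominate the weekend",
]
STOPWORDS = {"the", "for", "and", "in", "new"}

def expand_keywords(seeds, documents, top_n=5):
    """Retrieve documents matching the seeds, then propose frequent co-occurring terms
    as candidate keywords for the next (human-validated) iteration."""
    relevant = [d for d in documents if any(s in d for s in seeds)]
    counts = Counter(w for d in relevant for w in d.split()
                     if w not in seeds and w not in STOPWORDS and len(w) > 3)
    return [w for w, _ in counts.most_common(top_n)]

print(expand_keywords(SEEDS, docs))  # candidate keywords to review
```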
Semantic and statistical description of a topic
Sub-corpora: a specific functional or semantic domain, e.g. law/administration, economy, literature, fashion, etc.
Gathering linguistic data for each sub-corpus requires a targeted crawling strategy
An underlying semantic theme: a document consisting of a large number of words can be concisely modelled as deriving from a smaller number of topics
Statistical: a topic is a probability distribution over the terms in a vocabulary
But also purposive: hermeneutical grounding of the terms in the social context
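To illustrate the statistical definition, a toy generative sketch: each topic is a probability distribution over the vocabulary, and a document is sampled from a mixture of topics. Vocabulary, probabilities and mixture proportions are all invented (numpy assumed).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["gene", "dna", "court", "law", "energy", "solar"]

# Each topic is a probability distribution over the vocabulary (values invented)
topics = {
    "health": np.array([0.45, 0.45, 0.02, 0.02, 0.03, 0.03]),
    "law":    np.array([0.02, 0.02, 0.48, 0.44, 0.02, 0.02]),
    "energy": np.array([0.03, 0.03, 0.02, 0.02, 0.45, 0.45]),
}

# A document is modelled as a mixture of topics, e.g. mostly health with some energy
doc_mixture = {"health": 0.7, "law": 0.0, "energy": 0.3}

def generate_document(mixture, n_words=10):
    """Sample each word by first picking a topic, then a term from that topic."""
    names = list(mixture)
    words = []
    for _ in range(n_words):
        topic = names[rng.choice(len(names), p=[mixture[n] for n in names])]
        words.append(vocab[rng.choice(len(vocab), p=topics[topic])])
    return " ".join(words)

print(generate_document(doc_mixture))
```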
The corpus-theoretical paradox: solving the problem with text mining tools and CAQDAS
Iterative: this makes corpus construction a circular process. We start with initial keywords, and perhaps nothing more than the keywords: we assume we can select the corpus according to some representativeness criterion (i.e. keyword search), yet we need empirical analysis of the corpus to detect the keywords
The question is how to select further keywords so as to be most informative about the topic domain: superordinate/subordinate concepts, hyponymy/hypernymy
Some text mining solutions: context determination techniques such as:
Word seeding: seed a keyword reflecting the domain feature, e.g. "animal", and automatically extract a large set of surrounding extraction patterns (context words). This can retrieve its hyponyms: pigs, chickens, horses, etc.
LDA: automatically discovering the topics that semantic contexts (sentences, paragraphs, chapters) contain. LDA represents documents as mixtures of topics that emit words with certain probabilities
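A very simplified sketch of the word-seeding idea using a single Hearst-style extraction pattern ("X such as A, B and C") around the seed term; real systems learn many such patterns automatically, and the sentences here are invented.

```python
import re
from collections import Counter

# One Hearst-style pattern; real systems learn many extraction patterns around the seed.
def hyponyms_of(seed, sentences):
    pattern = re.compile(rf"{seed}s?\s+such as\s+([\w\s,]+?)(?:\.|$)", re.IGNORECASE)
    found = Counter()
    for sent in sentences:
        for match in pattern.findall(sent):
            for term in re.split(r",|\band\b", match):
                term = term.strip()
                if term:
                    found[term] += 1
    return found

sentences = [
    "The farm keeps animals such as pigs, chickens and horses.",
    "The shelter houses animals such as dogs and cats.",
]
print(hyponyms_of("animal", sentences))
# Counter({'pigs': 1, 'chickens': 1, 'horses': 1, 'dogs': 1, 'cats': 1})
```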
LDA: example
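A compact LDA example using scikit-learn (gensim or MALLET would serve equally well); the mini-corpus and the number of topics are arbitrary and only meant to show the workflow.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "gene therapy trial shows promising results for rare disease",
    "researchers sequence dna of ancient human genome",
    "solar panels cut household energy bills this winter",
    "wind and solar energy output reached a record high",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)          # per-document topic mixtures
terms = vectorizer.get_feature_names_out()

# Show the most probable terms for each discovered topic
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top}")
print(doc_topic.round(2))                  # each row sums to ~1
```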
Supervised topic models: start from already human-coded text segments
Use the usual CAQDAS approaches: rigorous, methodological coding and thematisation of the text
Then use machine learning techniques such as: supervised LDA, Naive Bayes, K-Means, SVM, etc.
Improve the topic keywords
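A minimal supervised sketch, assuming segments already coded in a CAQDAS package have been exported with their codes; the texts and labels are invented, and Naive Bayes here stands in for any of the listed classifiers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical export of human-coded segments (text, code) from a CAQDAS project
segments = [
    ("new cancer drug approved after clinical trials", "health"),
    ("gene editing raises ethical questions", "health"),
    ("electric cars and battery research expand rapidly", "energy"),
    ("solar farm investment doubles this year", "energy"),
]
texts, codes = zip(*segments)

# Train a classifier on the coded segments, then apply it to uncoded documents
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, codes)

print(model.predict(["breakthrough in battery storage announced"]))  # ['energy']
```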
Coding frame and ontological terminology engineering: modelling concepts and the relations between them
Concept: described by means of characteristics that denote properties of the individual referents belonging to the extension of that concept
The idea is similar to codebook building and indexing
Available ontologies: SNOMED, DEWEY
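A toy sketch of a coding frame treated as a small concept hierarchy with characteristic terms, used to index documents; the concepts and terms are placeholders, not actual SNOMED or Dewey entries.

```python
# Hypothetical coding frame: concept -> (parent concept, characteristic terms)
CODING_FRAME = {
    "science":       (None,      {"research", "study", "scientist"}),
    "biotechnology": ("science", {"gene", "dna", "cloning"}),
    "energy":        ("science", {"solar", "nuclear", "wind"}),
}

def index_document(text, frame=CODING_FRAME):
    """Assign every concept whose characteristic terms occur in the text,
    propagating the assignment up to superordinate (parent) concepts."""
    tokens = set(text.lower().split())
    assigned = set()
    for concept, (parent, terms) in frame.items():
        if tokens & terms:
            assigned.add(concept)
            while parent is not None:            # inherit parent concepts
                assigned.add(parent)
                parent = frame[parent][0]
    return assigned

print(index_document("Scientists edit a gene linked to heart disease"))
# {'biotechnology', 'science'}
```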
Suggested terminological ontology: OECD Frascati Manual classification for ST&I
1. Exploration and exploitation of the Earth
2. Infrastructure and general planning of land use
3. Control and care of the environment
4. Protection and improvement of human health
5. Production, distribution and rational utilisation of energy
6. Agricultural production and technology
7. Industrial production and technology
8. Social structures and relationships
9. Exploration and exploitation of space
10. Non-oriented research
11. Other civil research
12. Defence
Ontology learning; grounded theory; word space theory
Bottom-up categorization: getting the themes out of the text itself
Cluster analysis, correspondence analysis, formal concept analysis, semantic network analysis
Grounded theory; thematic analysis
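For the bottom-up route, a short clustering sketch on TF-IDF vectors with K-Means (scikit-learn); the documents and cluster count are arbitrary, and labelling the emergent clusters remains an interpretive, grounded-theory style step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stem cell research offers new treatment hopes",
    "clinical trial tests gene therapy for blindness",
    "solar power capacity grows across the country",
    "new wind farms boost renewable energy supply",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Group documents by emergent cluster; naming the clusters is the analyst's task
for cluster_id in range(km.n_clusters):
    members = [docs[i] for i, label in enumerate(km.labels_) if label == cluster_id]
    print(cluster_id, members)
```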