Automated Content Analysis of Discussion Transcripts

Similar documents
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

Experiments in Web Page Classification for Semantic Web

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task

How To Predict Web Site Visits

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Web Document Clustering

Automated Problem List Generation from Electronic Medical Records in IBM Watson

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES

Clustering Connectionist and Statistical Language Processing

Data Mining Analysis (breast-cancer data)

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Measuring Critical Thinking within Discussion Forums using a Computerised Content Analysis Tool

Word Completion and Prediction in Hebrew

Author Gender Identification of English Novels

Course Description This course will change the way you think about data and its role in business.

An Introduction to Data Mining

Machine learning for algo trading

Prediction Models for a Smart Home based Health Care System

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Assessing & Improving Online Learning Using Data from Practice

New Developments in the Automatic Classification of Records. Inge Alberts, André Vellino, Craig Eby, Yves Marleau

Scalable Developments for Big Data Analytics in Remote Sensing

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100

Blog Post Extraction Using Title Finding

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University

Trading Strategies and the Cat Tournament Protocol

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

A Supervised Forum Crawler

Accelerometer Based Real-Time Gesture Recognition

Special Topics in Computer Science

Feature Selection for Electronic Negotiation Texts

Data Mining with Weka

Sentiment analysis: towards a tool for analysing real-time students feedback

E-coaching and Feedback Practices to Promote Higher Order Thinking Online

How To Understand How Weka Works

Segmentation and Classification of Online Chats

Predicting Student Performance by Using Data Mining Methods for Classification

and Hung-Wen Chang 1 Department of Human Resource Development, Hsiuping University of Science and Technology, Taichung City 412, Taiwan 3

Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES

Classifying Manipulation Primitives from Visual Data

Semantic Sentiment Analysis of Twitter

DATA MINING TECHNIQUES AND APPLICATIONS

Equity forecast: Predicting long term stock price movement using machine learning

Robust Sentiment Detection on Twitter from Biased and Noisy Data

Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News

Predict Influencers in the Social Network

Intrusion Detection via Machine Learning for SCADA System Protection

The Delicate Art of Flower Classification

Mining an Online Auctions Data Warehouse

8. Machine Learning Applied Artificial Intelligence

Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System

A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

A Content based Spam Filtering Using Optical Back Propagation Technique

Facilitating Business Process Discovery using Analysis

Active Learning SVM for Blogs recommendation

DESIGN OF DIGITAL SIGNATURE VERIFICATION ALGORITHM USING RELATIVE SLOPE METHOD

Optimizing content delivery through machine learning. James Schneider Anton DeFrancesco

Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms

Spam detection with data mining method:

Crowdfunding Support Tools: Predicting Success & Failure

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Critical Thinking, Cognitive Presence, and Computer Conferencing in Distance Education

Stakeholder Communication in Software Project Management. Modelling of Communication Features

Online Farsi Handwritten Character Recognition Using Hidden Markov Model

Towards Effective Recommendation of Social Data across Social Networking Sites

Automatic Resolver Group Assignment of IT Service Desk Outsourcing

PREDICTING STOCK PRICES USING DATA MINING TECHNIQUES

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS

Data Mining. Dr. Saed Sayad. University of Toronto

Learning is a very general term denoting the way in which agents:

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

FAdR: A System for Recognizing False Online Advertisements

Mining Online Diaries for Blogger Identification

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15

Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies

Cell Phone based Activity Detection using Markov Logic Network

Cloud Storage-based Intelligent Document Archiving for the Management of Big Data

CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis

SENTIMENT ANALYSIS: A STUDY ON PRODUCT FEATURES

Applying Machine Learning to Stock Market Trading Bryce Taylor

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Micro blogs Oriented Word Segmentation System

II. RELATED WORK. Sentiment Mining

Behavior Usage Model to Manage the Best Practice of e-learning

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Spyware Prevention by Classifying End User License Agreements

Sentiment Analysis of Movie Reviews and Twitter Statuses. Introduction

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

Maschinelles Lernen mit MATLAB

Transcription:

Automated Content Analysis of Discussion Transcripts Vitomir Kovanović v.kovanovic@ed.ac.uk Dragan Gašević dgasevic@acm.org School of Informatics, University of Edinburgh Edinburgh, United Kingdom v.kovanovic@ed.ac.uk 31 Aug 2015, University of Edinburgh, United Kingdom

Asynchronous online discussions - gold mine of information (Henri, 1992) They are frequently used for all types of education delivery, Their use produced large amount of data about learning processes, Their use is well supported by the social-constructivist pedagogies. V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 1 / 18

Asynchronous online discussions - issues and challenges Produced data is used mainly for research after the courses are over, Content analysis techniques are complex and time consuming, Content analysis had almost no impact on educational practice (Donnelly and Gardner, 2011), There is a need for more proactive use of the data through automation: Few attempts for automated content analysis, Focus mostly on surface level characteristics, and Not based on well established theories of education. V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 2 / 18

Overall idea Overall idea To examine how we can use text mining for automation of content analysis of discussion transcripts. More specifically, We looked at the automation of content analysis of cognitive presence, one of the three main components of Community of Inquiry framework. V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 3 / 18

Community of Inquiry (CoI) model Community of Inquiry model (Garrison, Anderson, and Archer, 1999) Conceptual framework outlying important constructs that define worthwhile educational experience in distance education setting. Three presences: Social presence: relationships and social climate in a community. Cognitive presence: phases of cognitive engagement and knowledge construction. Teaching presence: instructional role during social learning. CoI model is: Extensively researched and validated. Adopts Content Analysis for assessment of presences. V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 4 / 18

Community of Inquiry (CoI) model Community of Inquiry model (Garrison, Anderson, and Archer, 1999) Conceptual framework outlying important constructs that define worthwhile educational experience in distance education setting. Three presences: Social presence: relationships and social climate in a community. Cognitive presence: phases of cognitive engagement and knowledge construction. Teaching presence: instructional role during social learning. CoI model is: Extensively researched and validated. Adopts Content Analysis for assessment of presences. V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 4 / 18

Cognitive presence Cognitive Presence an extent to which the participants in any particular configuration of a community of inquiry are able to construct meaning through sustained communication. (Garrison, Anderson, and Archer, 1999, p.89) Four phases of cognitive presence: 1 Triggering event: Some issue, dilemma or problem is identified. 2 Exploration: Students move between private world of reflection and shared world of social knowledge construction. 3 Integration: Students filter irrelevant information and synthesize new knowledge. 4 Resolution: Students analyze practical applicability, test different hypotheses, and start a new learning cycle. V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 5 / 18

Cognitive presence coding scheme Use of whole message as unit of analysis, Look for particular indicators of different sociocognitive processes, Requires expertise with coding instrument and domain knowledge. V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 6 / 18

Community of Inquiry (CoI) model Issues and challenges: Very labor intensive, Crude coding scheme, Requires experienced coders, Can t be used for real-time monitoring, Not explaining reasons behind observed levels of presences, and Not providing suggestions and guidelines for instructors to direct their pedagogical decisions. V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 7 / 18

Data set Six offerings of graduate level course in software engineering. Total of 1747 messages, 81 students, Manually coded by two coders (agreement = 98.1%, Cohen s κ = 0.974), ID Phase Messages (%) 0 Other 140 8.01% 1 Triggering Event 308 17.63% 2 Exploration 684 39.17% 3 Integration 508 29.08% 4 Resolution 107 6.12% All phases 1747 100% Number of Messages in Different Phases of Cognitive Presence V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 8 / 18

Feature extraction Unigrams, Bigrams and Trigrams, Part-of-Speech Bigrams and Trigrams, Backoff Bigrams and Trigrams: Example: John is working. Bigrams: john is, is working. Backoff Bigrams: john verb, noun is, is verb verb working. V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 9 / 18

Feature extraction Dependency triplets: rel, head, modifier Example: Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas. nsubjpass, submitted, Bills auxpass, submitted, were agent, submitted, Brownback nn, Brownback, Senator appos, Brownback, Republican prep of, Republican, Kansas prep on, Bills, ports conj and, ports, immigration prep on, Bills, immigration V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 10 / 18

Feature extraction Backoff dependency triplets: Example: Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas. Dependency triplet: conj and, ports, immigration Backoff dependency triplets: conj and, noun, immigration conj and, ports, noun conj and, noun, noun V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 11 / 18

Additional features Number of named entities in the message Brainstorming should involve more concepts than posing a question, Is message first in the discussion? Posing questions is more likely to be initiating discussions, Is message a reply to the first message in the discussion? V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 12 / 18

Classification Classifier: SVM classifier with RBF kernel. Accuracy and kernel parameter tuning evaluated using nested 5-fold cross-validation. Only features with support of 10 or more, Accuracy evaluated using 10 fold cross-validation, Comparison of models using McNemar s test. Implementation: Implemented in Java, Feature extraction using Stanford CoreNLP 1 toolkit, Tokenization, Part-of-Speech, and Dependency parsing modules Classification using Weka (Witten, Frank, and Hall, 2011) and LibSVM (Chang and Lin, 2011), and Statistical comparison using Java Statistical Classes (JSC) 2 1 http://nlp.stanford.edu/software/corenlp.shtml 2 http://www.jsc.nildram.co.uk/index.htm V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 13 / 18

Results We achieved Cohen s κ of 0.42 for our classification problem. Better then the existing Neural Network system (Cohen s κ=0.31). Unigram baseline model achieved Cohen s κ of 0.33. Error analysis: Predicted Actual Other Trigg. Expl. Integ. Resol. Other 17 04 05 02 00 Triggering 01 42 1 14 03 01 Exploration 02 09 98 24 04 Integration 01 03 38 1,2 56 04 Resolution 00 00 03 15 2 03 Confusion Matrix V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 14 / 18

Challenges 1 Effect of the large relative size of the exploration class, 2 Effect of the code-up rule for coding, 3 No relative importance of features, and 4 Context is not taken into the account. Code-up rule for coding V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 15 / 18

In progress: making use of tread context Discussions (and students learning) progresses from triggering to resolutions. Content of a message depends on the content of the previous messages. Content of a message depends on the learning progress of a given student. Model for message classification V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 16 / 18

Approach: Hidden Markov models (HMMs) & Conditional random fields (CRFs) Hidden Markov Models: HMMs used to models system states and their transitions in a variety of contexts. Widely used, Bayesian Knowledge Tracing models based on HMMs. Challenges with HMM: Can this be modeled as HMM (2nd order HMMs?) Dependency only on a single previous state, One manifest variable for each state Conditional random fields: Used for structured predictions (e.g., speech recognition) For speech recognition, take into the account the classes of all letters in a word. Widely used in natural language processing, More flexible than HMMs, Challenges with CRF: Too many parameters to estimate with little data V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 17 / 18

Conclusions and future work Summary: Promising path to explore, Use of backoff trigrams, plain and backoff dependency triplets, entity count and first message indicator seems useful, Future work: Additional types of features which look at the context of previous messages (e.g., convergence vs. divergence), Moving away from SVM, explore other classification methods which are better at explanation Give associated probabilities for each classification, Give relative importance of different features. Challenges: Challenges with message unit of analysis and surface-level features, Low frequency of resolution messages. V. Kovanović et al. (EDI) Automated Analysis of Discussion Transcripts 31 Aug 2015 18 / 18

Thank you Vitomir Kovanovic v.kovanovic@ed.ac.uk

References I Chang, Chih-Chung and Chih-Jen Lin (2011). LIBSVM: A library for support vector machines. In: ACM Transactions on Intelligent Systems and Technology 2 (3), 27:1 27:27. Donnelly, Roisin and John Gardner (2011). Content analysis of computer conferencing transcripts. In: Interactive Learning Environments 19.4, pp. 303 315. Garrison, D. Randy, Terry Anderson, and Walter Archer (1999). Critical Inquiry in a Text-Based Environment: Computer Conferencing in Higher Education. In: The Internet and Higher Education 2.2 3, pp. 87 105. Henri, France (1992). Computer Conferencing and Content Analysis. en. In: Collaborative Learning Through Computer Conferencing, pp. 117 136. Witten, Ian H., Eibe Frank, and Mark A. Hall (2011). Data Mining: Practical Machine Learning Tools and Techniques, Third Edition. 3rd ed. Morgan Kaufmann.