Relative N-Gram Signatures: Document Visualization at the Level of Character N-Grams

Similar documents
Visual Analytics: Combining Automated Discovery with Interactive Visualizations

Evaluation of Authorship Attribution Software on a Chat Bot Corpus

Adaptive Framework for Network Traffic Classification using Dimensionality Reduction and Clustering

Chapter 1 Learning to Program With Alice

Blogs and Twitter Feeds: A Stylometric Environmental Impact Study

Graduate Studies in Computer Science at Dalhousie University. Evangelos Milios Faculty of Computer Science Dalhousie University

Readability Visualization for Massive Text Data

LINCOLN SCHOOL Course Syllabus: English 7 Theme: How does literature challenge, change, and define us?

Write my paper intelligence studies. Physics accompanies submission of dissertation in Part I and submission of a Project.

RNA Structure and folding

2.0. Specification of HSN 2.0 JavaScript Static Analyzer

Author Gender Identification of English Novels

Computer Aided Document Indexing System

Integrating Web Content Clustering into Web Log Association Rule Mining

Genetic & Evolutionary Feature Selection for Author Identification of HTML Associated with Malware

Map-like Wikipedia Visualization. Pang Cheong Iao. Master of Science in Software Engineering

From lowest energy to highest energy, which of the following correctly orders the different categories of electromagnetic radiation?

Data Deduplication in Slovak Corpora

Interpreting areading Scaled Scores for Instruction

Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum

Detecting Internet Worms Using Data Mining Techniques

Master of Arts. Program in English

9. Text & Documents. Visualizing and Searching Documents. Dr. Thorsten Büring, 20. Dezember 2007, Vorlesung Wintersemester 2007/08

Unit 10.4: Stories of Other Worlds: Science Fiction, Fantasy, and Imaginative Literature

Guidelines for Establishment of Contract Areas Computer Science Department

Microsoft Band Web Tile

Continuous Biometric User Authentication in Online Examinations

Computer-aided Document Indexing System

Strategy Formulation in Japanese Management

Abstract. Introduction

Applying Static Analysis to High-Dimensional Malicious Application Detection

Cross-Language Authorship Attribution

INFRARED SPECTROSCOPY (IR)

Search Engines. Stephen Shaw 18th of February, Netsoc

Determining if Two Documents are by the Same Author

Electromagnetic Radiation (EMR) and Remote Sensing

Technical Report. The KNIME Text Processing Feature:

Search and Information Retrieval

The First Online 3D Epigraphic Library: The University of Florida Digital Epigraphy and Archaeology Project

The Title of a Yale University Doctoral. Dissertation

TIBCO Spotfire Network Analytics 1.1. User s Manual

North Dakota Advanced Placement (AP) Course Codes. Computer Science Education Course Code Advanced Placement Computer Science A

Mining a Corpus of Job Ads

Biometric Authentication using Online Signatures

Tattoo Detection for Soft Biometric De-Identification Based on Convolutional NeuralNetworks

Adventures in Alice Programming

Tutorial for proteome data analysis using the Perseus software platform

Spam Detection Using Customized SimHash Function

PENNSYLVANIA COMMON CORE STANDARDS English Language Arts Grades 9-12

Content Management System User Guide

Caml Virtual Machine File & data formats Document version: 1.4

Visualizing Repertory Grid Data for Formative Assessment

Successful graduates of the MA program in American Studies will be awarded a degree in which the following two elements will appear:

Visualizing molecular simulations

Group Theory and Chemistry

A Practical Attack to De Anonymize Social Network Users

Term extraction for user profiling: evaluation by the user

Interactive Visual Data Analysis in the Times of Big Data

School Library Website Components

Creating While Loops with Microsoft SharePoint Designer Workflows Using Stateful Workflows

Interactive Exploration of Decision Tree Results

Lesson 15 - Fill Cells Plugin

Getting Started with Scratch

Visualizing Poetry: Creating Tools for Critical Analysis. Introduction Current debates over distant reading (Moretti) seem to imply that digital tools

Custom Linetypes (.LIN)

Rolling the Dice on Big Data. Ilse Ipsen Department of Mathematics

Proceedings of Student/Faculty Research Day, CSIS, Pace University, May 4 th, 2007

Development of Leadership Skills in Engineering Employment. Charles W Turner Emeritus Professor of Electrical Engineering, King s College London

Narrative Literature Response Letters Grade Three

This use study analyzes a specific scenario for a financial credit interaction for an online personal loan request.

Simple Language Models for Spam Detection

ELEVATING FORENSIC INVESTIGATION SYSTEM FOR FILE CLUSTERING

Reading is the process in which the reader constructs meaning by interacting with the text.

Interactive Timeline Viewer (ItLv): A Tool to Visualize Variants Among Documents

WAFFle: Fingerprinting Filter Rules of Web Application Firewalls

Classifying Manipulation Primitives from Visual Data

Chemistry 102 Summary June 24 th. Properties of Light

Learning is a very general term denoting the way in which agents:

COSC 6397 Big Data Analytics. Mahout and 3 rd homework assignment. Edgar Gabriel Spring Mahout

Global Music Management MPAMB-GE points; NYU in London; Spring 2016

Cyber Security Through Visualization

Visual Structure Analysis of Flow Charts in Patent Images

Learning Objectives. Required Resources. Tasks. Deliverables

INFORMATION VISUALIZATION TECHNIQUES USAGE MODEL

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition

Movie Classification Using k-means and Hierarchical Clustering

Atomic Calculations. 2.1 Composition of the Atom. number of protons + number of neutrons = mass number

Web Modules for Garden Centers

COPYRIGHT ACT -- FAIR DEALING (Advisory for SUTD Faculty, Researchers, Staff and Students)

App Building Guidelines

SOUTH DAKOTA Reading and Communication Arts Standards Grade 9 Literature: The Reader s Choice Course

Data Integration through XML/XSLT. Presenter: Xin Gu

AP CHEMISTRY 2007 SCORING GUIDELINES (Form B)

Detection and mitigation of Web Services Attacks using Markov Model

Ivy Tech Community College of Indiana

Working Title: Web Development/User Experience Specialist Classification: Analyst/Programmer (Career) Job Code: 0400 Range Code: 2

Supervised and unsupervised learning - 1

1 st day Basic Training Course

UT Martin Password Policy May 2015

Thesis Format Guide. Denise Robertson Graduate School Office 138 Woodland Street Room

Transcription:

1 Relative N-Gram Signatures: Document Visualization at the Level of Character N-Grams Magdalena Jankowska, Evangelos Milios, Vlado Kešelj Faculty of Computer Science, Dalhousie University aaaa June 2013

2 Relative N-Gram Signatures Interactive classification of a document Who wrote this book? Analysis of characteristics of a document What are the characteristics of the author s style? Language independent method

5 Character N-Grams Strings of n consecutive characters from a given text Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: Alice's Adventures in the Wonderland by Lewis Carroll

6 Character N-Grams Strings of n consecutive characters from a given text Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: n=4 4-grams Alice's Adventures in the Wonderland by Lewis Carroll ALIC

7 Character N-Grams Strings of n consecutive characters from a given text Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: n=4 4-grams Alice's Adventures in the Wonderland by Lewis Carroll ALIC LICE

8 Character N-Grams Strings of n consecutive characters from a given text Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: n=4 4-grams Alice's Adventures in the Wonderland by Lewis Carroll ALIC LICE ICE_

9 Character N-Grams Strings of n consecutive characters from a given text Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: n=4 4-grams Alice's Adventures in the Wonderland by Lewis Carroll ALIC LICE ICE_ CE_W

10 Character N-Grams Strings of n consecutive characters from a given text Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: n=4 4-grams ALIC LICE ICE_ CE_W n-grams in our system: uppercase Alice's Adventures in the Wonderland by Lewis Carroll each sequence of non-word characters replaced by an underscore

Common N-Gram (CNG) Classifier assigns a document to a class from a given set of classes? works of Carrol works of Twain works of Shakespeare Proposed by Vlado Kešelj, Fuchun Peng, Nick Cercone, and Calvin Thomas. N-gram-based author profiles for authorship attribution. In Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING 03, 2003. 11

Common N-Gram (CNG) Classifier assigns a document to a class from a given set of classes? comparison of the frequency of the most common character n-grams works of Carrol works of Twain works of Shakespeare Proposed by Vlado Kešelj, Fuchun Peng, Nick Cercone, and Calvin Thomas. N-gram-based author profiles for authorship attribution. In Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING 03, 2003. 12

Common N-Gram (CNG) Classifier assigns a document to a class from a given set of classes? comparison of the frequency of the most common character n-grams works of Carrol works of Twain works of Shakespeare Applications: Authorship attribution Malicious code detection Gene classification Web page genre classification 13

14 CNG Classifier - Dissimilarity Profile a sequence of L most common n-grams of a given length n document 1: Alice's Adventures in the Wonderland by Lewis Carroll document 2: Tarzan of the Apes by Edgar Rice Burroughs _ T O _ I N G A N D _ A N D I N G O F _ A N D _ A N D _ T H E _ T H E T H E f 1 _ T H E n=4, L=6 n-gram normalized frequency

CNG Classifier - Dissimilarity Profile a sequence of L most common n-grams of a given length n document 1: Alice's Adventures in the Wonderland by Lewis Carroll document 2: Tarzan of the Apes by Edgar Rice Burroughs _ T O _ I N G A N D _ A N D I N G O F _ A N D _ distance(f 1 (x), f 2 (x) ) A N D _ T H E _ T H E _ n=4, L=6 _ T H E f 1 _ T H E f 2 f 2 (x)=0 f 1 (x)=0 15

CNG Classifier - Dissimilarity Profile a sequence of L most common n-grams of a given length n document 1: Alice's Adventures in the Wonderland by Lewis Carroll document 2: Tarzan of the Apes by Edgar Rice Burroughs _ T O _ I N G A N D _ A N D I N G O F _ A N D _ distance(f 1 (x), f 2 (x) ) A N D _ T H E _ T H E _ n=4, L=6 _ T H E f 1 _ T H E f 2 CNG dissimilarity between two documents f 1 (x)=0 sum of the distances with respect to all n-grams in the union of the profiles 16

Motivation text visualization on the language-independent level of character n-grams similarity of documents characteristics of documents visualization of the CNG classifier reasons for the classification result possibility of influencing the classification 17

RNG-Sig Web application 18 Implemented as a web application d3.js JavaScript library for visualization Available online with pre-loaded data at: http://cs.dal.ca/~jankowsk/ngram_signatures

19 Relative N-Gram Signature Relative signature of Tarzan of the Apes by Burroughs with respect to ( on the background of ) Alice's Adventures in the Wonderland by Carroll (base document) Visual representation of the CNG dissimilarity between two documents n=4 (4-grams) L=500 (500 most common n-grams)

20 Relative N-Gram Signature Relative signature of Tarzan of the Apes by Burroughs zoom with respect to ( on the background of ) Alice's Adventures in the Wonderland by Carroll (base document) Each strip represents an n-gram n=4 (4-grams) L=500 (500 most common n-grams)

21 Relative N-Gram Signature Relative signature of Tarzan of the Apes by Burroughs with respect to ( on the background of ) Alice's Adventures in the Wonderland by Carroll (base document) n=4 (4-grams) L=500 (500 most common n-grams) Each strip represents an n-gram 500 most common n-grams of Alice's Adventures decreasing frequency in Alice's Adventures

22 Relative N-Gram Signature Relative signature of Tarzan of the Apes by Burroughs with respect to ( on the background of ) Alice's Adventures in the Wonderland by Carroll (base document) n=4 (4-grams) L=500 (500 most common n-grams) these among 500 most common n-grams of Tarzan that are not among 500 most common n-grams of Alice's Adventures Each strip represents an n-gram 500 most common n-grams of Alice's Adventures decreasing frequency in Tarzan decreasing frequency in Alice's Adventures

23 Relative N-Gram Signature Relative signature of Tarzan of the Apes by Burroughs with respect to ( on the background of ) Alice's Adventures in the Wonderland by Carroll (base document) Color: distance of two documents with respect to this n-gram n=4 (4-grams) L=500 (500 most common n-grams)

24 Relative N-Gram Signature Visualizes similarity between documents on the level of character n-grams Relative signature of two documents that do not share any of their respective 500 most common n-grams Signature of a document with respect to itself

Relative N-Gram Signature Visual metaphor 25 Inspiration: emission spectrum spectrum of frequencies of electromagnetic emissions by atoms or molecules picture from Wikipedia

26 Sequence of signatures scenario authorship analysis

27 Sequence of signatures The same base document

28 Sequence of signatures Signature of the most similar document Carrol s Through the looking glass

29 Sequence of signatures CNG dissimilarity score sum of the distances over all n-grams in a signature

30 Sequence of signatures minimum dissimilarity = classifier result

31 Sequence of signatures zooming in

Interactive exploration of signatures 32 browsing

Interactive exploration of signatures 33 context: most common words

Interactive exploration of signatures 34

Interactive exploration of signatures 35

Interactive exploration of signatures 36 context: concordance style given n-gram within the text

Language independence 37 Polish authors

Language independence 38 Polish authors searching for n-grams

Language independence 39 Polish authors

40 Motivation for analysis of Mark Twain novels D. A. Keim and D. Oelke. Literature Fingerprinting: A New Method for Visual Literary Analysis. In Proceedings of the 2007 IEEE Symposium on Visual Analytics Science and Technology, 2007. Hapax Legomena Visual analysis of works of Mark Twain: Adventures of Huckleberry Finn stands out from the other works of Mark Twain with respect to: Function words frequency Simpson's index Hapax Legomena Function words (first dimension after PCA)

Example analysis: comparison of novels by Mark Twain Adventures of Huckleberry Finn 41

42 Complementary Comparison View N-grams ordered separately in each signature, according to their distance Better for comparison of signatures but not for exploring

43 Interactively influencing the visualization and the classifier Ad-hoc Authorship Attribution Competition, 2004 Manual, task-dependent adaptation of the classification process Problem G, sample 02

44 Interactively influencing the visualization and the classifier Ad-hoc Authorship Attribution Competition, 2004 Problem G, sample 02

45 Interactively influencing the visualization and the classifier Ad-hoc Authorship Attribution Competition, 2004 Problem G, sample 02

Interactively influencing the visualization and the classifier n-grams originating mostly from proper names ignoring selected n-grams in the base document Two options: the length of the list of n-grams in the base document is kept Intact by adding less frequent n-grams at the top no new n-grams are added the list of n-grams for the base document becomes shorter 46

Interactively influencing the visualization and the classifier ignoring selected n-grams in the base document correct classification result 47

Thank you! 65