9. Text & Documents. Visualizing and Searching Documents. Dr. Thorsten Büring, 20. Dezember 2007, Vorlesung Wintersemester 2007/08

Similar documents
Topic Maps Visualization

Visualizing an Auto-Generated Topic Map

Information Visualization Multivariate Data Visualization Krešimir Matković

Visualization Quick Guide

IR User Interfaces and Visualization

Graph/Network Visualization

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

TIBCO Spotfire Business Author Essentials Quick Reference Guide. Table of contents:

Visualization Software

Understanding Data: A Comparison of Information Visualization Tools and Techniques

The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. Ben Shneiderman, 1996

RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING

SuperViz: An Interactive Visualization of Super-Peer P2P Network

3 Information Visualization

TIBCO Spotfire Network Analytics 1.1. User s Manual

Visualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures

SAS/GRAPH Network Visualization Workshop 2.1

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

NakeDB: Database Schema Visualization

Environmental Remote Sensing GEOG 2021

Reconstructing Self Organizing Maps as Spider Graphs for better visual interpretation of large unstructured datasets

EXPLORING SPATIAL PATTERNS IN YOUR DATA

Visual Mining of E-Customer Behavior Using Pixel Bar Charts

Data representation and analysis in Excel

JustClust User Manual

InfiniteInsight 6.5 sp4

an introduction to VISUALIZING DATA by joel laumans

Self Organizing Maps for Visualization of Categories

Exploratory Data Analysis for Ecological Modelling and Decision Support

MyOra 3.0. User Guide. SQL Tool for Oracle. Jayam Systems, LLC

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

Utilizing Microsoft Access Forms and Reports

A Tutorial on dynamic networks. By Clement Levallois, Erasmus University Rotterdam

What is Visualization? Information Visualization An Overview. Information Visualization. Definitions

Creating Charts in Microsoft Excel A supplement to Chapter 5 of Quantitative Approaches in Business Studies

MultiExperiment Viewer Quickstart Guide

Guide To Creating Academic Posters Using Microsoft PowerPoint 2010

Regression Clustering

Data Visualization Handbook

Self Organizing Maps: Fundamentals

An Introduction to Point Pattern Analysis using CrimeStat

Hierarchical Clustering Analysis

Space-filling Techniques in Visualizing Output from Computer Based Economic Models

Multi-Dimensional Data Visualization. Slides courtesy of Chris North

Principles of Data Visualization for Exploratory Data Analysis. Renee M. P. Teate. SYS 6023 Cognitive Systems Engineering April 28, 2015

Visualization of Phylogenetic Trees and Metadata

Visualization of Software Metrics Marlena Compton Software Metrics SWE 6763 April 22, 2009

Visualization Techniques in Data Mining

Diagrams and Graphs of Statistical Data

Interactive Voting System. IVS-Basic IVS-Professional 4.4

Visualization of Breast Cancer Data by SOM Component Planes

Visually Encoding Program Test Information to Find Faults in Software

Machine Learning using MapReduce

R Graphics Cookbook. Chang O'REILLY. Winston. Tokyo. Beijing Cambridge. Farnham Koln Sebastopol

Visualization methods for patent data

Portal Connector Fields and Widgets Technical Documentation

Excel Unit 4. Data files needed to complete these exercises will be found on the S: drive>410>student>computer Technology>Excel>Unit 4

Data topology visualization for the Self-Organizing Map

Getting Started With Mortgage MarketSmart

Utilizing spatial information systems for non-spatial-data analysis

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

GeoGebra. 10 lessons. Gerrit Stols

Data Exploration Data Visualization

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Gephi Tutorial Quick Start

Public Online Data - The Importance of Colorful Query Trainers

The following is an overview of lessons included in the tutorial.

Exploratory Data Analysis

Adobe Illustrator CS5 Part 1: Introduction to Illustrator

Tutorial Segmentation and Classification

Object Recognition. Selim Aksoy. Bilkent University

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Final Project Report

Visual Data Mining. Motivation. Why Visual Data Mining. Integration of visualization and data mining : Chidroop Madhavarapu CSE 591:Visual Analytics

Interactive Data Mining and Visualization

The Value of Visualization 2

Integration of Cluster Analysis and Visualization Techniques for Visual Data Analysis

Personalized Information Management for Web Intelligence

Visibility optimization for data visualization: A Survey of Issues and Techniques

VISUALIZING HIERARCHICAL DATA. Graham Wills SPSS Inc.,

DATA LAYOUT AND LEVEL-OF-DETAIL CONTROL FOR FLOOD DATA VISUALIZATION

Data Visualization. Prepared by Francisco Olivera, Ph.D., Srikanth Koka Department of Civil Engineering Texas A&M University February 2004

By LaBRI INRIA Information Visualization Team

Spotfire v6 New Features. TIBCO Spotfire Delta Training Jumpstart

SQL Server 2014 BI. Lab 04. Enhancing an E-Commerce Web Application with Analysis Services Data Mining in SQL Server Jump to the Lab Overview

Chapter ML:XI (continued)

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

Hierarchical Data Visualization

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

OECD.Stat Web Browser User Guide

Plotting: Customizing the Graph

Choosing Colors for Data Visualization Maureen Stone January 17, 2006

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Visualizing Data: Scalable Interactivity

Foundations of Artificial Intelligence. Introduction to Data Mining

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

HDDVis: An Interactive Tool for High Dimensional Data Visualization

Using Data Mining for Mobile Communication Clustering and Characterization

Transcription:

9. Text & Documents Visualizing and Searching Documents Dr. Thorsten Büring, 20. Dezember 2007, Vorlesung Wintersemester 2007/08 Slide 1 / 37

Outline Characteristics of text data Detecting patterns SeeSoft Arc diagrams Visualizing Plagiarism Keyword search TextArc Enhanced scrollbar TileBars Cluster Maps Visualization for the document space WEBSOM ThemeScapes Cluster map vs keyword search Slide 2 / 37

Text & Documents The main mean to store information Huge existing resources: libraries, WWW What to visualize? Text is of nominal data type, but with many additional and interesting properties Text structure Meta data Author Dates Descriptions Relations between documents (e.g. citation, similarity) Relevance of documents to a query Text statistics (e.g. frequency of different words) Content / Semantics Slide 3 / 37

Outline Characteristics of text data Detecting patterns SeeSoft Arc diagrams Visualizing Plagiarism Keyword search TextArc Enhanced scrollbar TileBars Cluster Maps Visualization for the document space WEBSOM ThemeScapes Cluster map vs keyword search Slide 4 / 37

SeeSoft Eick et al. 1993 Software visualization tool to display code line statistics (e.g. age, programmer, number of execution in recent test, etc.) Encoding Each column represents a file Height of column: length of document Files exceeding the height of the screen are continued over to the next columns Each row represents a line of code Width of row: length of line Color: age of the line (red: newest; blue: oldest) Scales up to 50,000 lines on a single screen Example: 20 files with 9,365 lines of code Reading windows controlled by virtual magnifying boxes Slide 5 / 37

Arc Diagrams Wattenberg 2002 Visualizes repetition in string data Application domains: text, DNA sequences, music Approach: to avoid clutter, only visualize an essential subset of all possible pairs of matching substrings Display string on a single line Connect the consecutive intervals by a semi-circular arc Thickness of the arc: length of the matching substring Height of the arc: proportional to the distance of substrings Slide 6 / 37

Arc Diagrams Apply translucency to not obscure matches Still: for strings with a high frequency of small repeated substrings the visualization may cause clutter Provide users with the ability to filter by minimum substring length to consider Slide 7 / 37

Arc Diagrams Comparison to a dotplot diagram Recap Matrix diagram Correlation m atrix String of n sym bols a 1, a 2, a n is represented by an n*n m atrix Pixel at coordinate (i, j) is black if a i = a j Can handle very large datasets Shows both sm all and large-scale structures Heavy clutter caused by small substrings with high frequency: n repetitions of a substring lead to n 2 visual marks Arc Diagrams mark only similar substrings, which are subsequent Slide 8 / 37

Arc Diagrams Applied to music, Minuet in G Major, Bach Shows classic pattern of a minuet: two main parts, each consisting of a long passage played twice Parts are loosely related: bundle of thin arcs connecting the two main parts Overlap of the two main arcs shows that the end of the first passage is the same as the beginning of the second passage Slide 9 / 37

Visualizing Plagiarism Ribler & Abrams 2000 Problem: programming assignment in a class with large number of students High probability of plagiarism Need to compare every document (code file) with every other document Visualization must support two steps Highlight suspicious documents Allow for detailed examination of the similar passages - high level of similarity between documents may not be due to cheating (e.g. headers) Slide 10 / 37

Visualizing Plagiarism Categorical Patterngram Visualize frequencies of sequences of characters present in more than one document Remove all non-printable characters in the document collection Define length of character sequence to analyse (in the example: 4) Histogram-like approach X-axis: start character of sequence Y-axis: number of documents containing the sequence Doc at Y = 1: base document to compare against all other documents Slide 11 / 37

Visualizing Plagiarism Composite Categorical Patterngram Visualizes which particular documents are similar Y-axis: each value corresponds to an individual document Slide 12 / 37

Visualizing Plagiarism Case study Students were asked to extend a sample program of about 30 lines of code Average completed program was about 150 lines Submission via email Graphic shows categorical patterngram for a single submission Sequence length = 10 Lines not text due to high density Rather confusing color coding Color coding (not very reasonable) Green: frequency >= 10 Red: frequency < 10 Blue: base document Plagiarism or not? Slide 13 / 37

Visualizing Plagiarism What to look out for? Sequences that occur frequently are not of interest - all points with y >= 10 are plotted as y = 10 Suspicious: accumulation of points with low frequencies Analysis Majority of points are plotted at Y = 1 Hence most 10-char sequences are unique to the base document Number of points plotted at Y = 2, but evenly distributed Slide 14 / 37

Visualizing Plagiarism Composite Categorical Patterngram for the submission Solid line represents the base document (submission number 23) Large number of points plotted in the range of x = [0; 500]: email message header Other frequent sequences due to the sample program Pattern typical for independent work Slide 15 / 37

Visualizing Plagiarism Example of patterngrams indicating extensive plagiarism Slide 16 / 37

Visualizing Plagiarism Patterngram of more subtle plagiarism Slide 17 / 37

Visualizing Plagiarism What may a student do to mask plagiarized code Change variable names Minimize masking effect by replacing all alphanumeric strings in all documents into single characters Two documents with the same code but different variable names will produce identical patterngrams Slide 18 / 37

Outline Characteristics of text data Detecting patterns SeeSoft Arc diagrams Visualizing Plagiarism Keyword search TextArc Enhanced scrollbar TileBars Cluster Maps Visualization for the document space WEBSOM ThemeScapes Cluster map vs keyword search Slide 19 / 37

TextArc http://www.textarc.org/ - demo Represents the entire text as 1 pixel lines in an outer circle Text is revealed via mouse-over Words are repeated in inner circle at a readable size Position of the words depend on where the word appears in the document Words that appear throughout the novel will be drawn to the center Frequent words stand out Example visualizes the novel Alice in Wonderland Various visualization features Association of words Word frequency Reading order of words (animated) Slide 20 / 37

Search Terms on a Scrollbar Byrd 1999 Searching of keywords in a single document Color coding to map each occurrence of a keyword in the document as a small colored icon in the scrollbar Provides an overview of the entire document, not only of the portion currently visible Users can directly jump to keyword occurences by moving the slider thumb Slide 21 / 37

TileBars Hearst 1995 Problem with document ranking of common search engines? Ranking approach is opaque: What role did the query terms played in the ranking process What is the relationship between the query terms in the document TileBars attempts to le the users make informed decisions about which documents and passages to view Slide 22 / 37

TileBars Users provide sets of query terms OR within a set AND between sets Documents are partitioned into adjacent, non-overlapping multi-paragraph segments Each document of the result set is represented by a rectangle - width indicates relative length of the document Stacked squares correspond to text segments Each row of the stack corresponds to a set of query terms Darkness of the square indicates the frequency of terms from the corresponding term set - (Why is this a reasonable color mapping?) Title + initial words appear next to each document Users can click on segments to retrieve the corresponding text Slide 23 / 37

TileBars Analysis hints Overall darkness indicates that all term sets are discussed in detail throughout the document When terms are discussed simultaneously the tiles blend together causing an easy to spot block Scattered term set occurrence show large areas of white space Helps to distinguish between passing remarks and prominent topic terms Users may also set distribution constraints to refine the query Minimum number of hits per term set Minimum distribution (percentage of tiles containing at least one hit) Minimum adjacent overlap span Slide 24 / 37

Outline Characteristics of text data Detecting patterns SeeSoft Arc diagrams Visualizing Plagiarism Keyword search TextArc Enhanced scrollbar TileBars Cluster Maps Visualization for the document space WEBSOM ThemeScapes Cluster map vs keyword search Slide 25 / 37

Cluster Maps Downscaling of n-dimensional document space to 2D Map of a document collection Similar documents are placed close to each other Dissimilar documents are placed farer apart from each other Provide thematic overview for exploration (same concept as product arrangements in a store) How to - Vector space model and map construction Create inverted index of document collection Exclude stop words and the most frequent words ( and may not be a good discriminator of content) Matrix of indexing words versus documents gives you document vectors A document vector reflects the frequency of index words occurring in the document Slide 26 / 37

Cluster Maps How to - Vector space model and map Document vectors construction (continued) Doc 1 Doc 2 Doc 3 Compute similarity between pairs of documents (e.g. dot product of vectors) Artificial 1 2 0 Layout documents in 1D/2D/3D Creativity 2 1 0 Common approaches Java 0 0 3 Spring model of graph layout Multi-dimensional scaling Similarity Matrix Clustering (e.g. hierarchical) Doc 1 Doc 2 Doc 3 Self-organizing maps (SOM aka Kohonen Doc 1 1 0.66 0 map) Doc 2 0.66 1 0 Doc 3 0 0 1 Slide 27 / 37

SOM Unsupervised learning algorithm SOM map is formed from a regular grid of neurons (nodes) Each node has An x y coordinate in the grid A weight vector of the same dimensionality as the input vectors Input vectors Network of 4x4 nodes Used to train the map Represent collection of objects In case of visualizing text, input vectors are usually equal to document vectors Slide 28 / 37

SOM - Algorithm 1. Start with assigning small random weights to the nodes of the grid 2. Chose a vector at random from the set of input vectors and present it to the grid 3. For each node: calculate the Euclidean distance between each node's weight vector and the current input vector - the closest node is called the Best Matching Unit (BMU) 4. Calculate the radius of the BMU (radius diminishes with each time-step) 5. For each node within the radius of the BMU: adjust the weights to make them more similar to the input vector - the closer a node is to the BMU, the more its weights get altered 6. Repeat step 2 for N iterations When training is completed each document is assigned to its BMU Slide 29 / 37

Cluster Maps Lin 1992 Personal collection of 660 research documents 2500 learning iterations Labeled word show most frequent title words Size maps to frequencies of occurrence of the words Neighboring relationships of areas indicate frequencies of the co-occurrence of words Slide 30 / 37

Cluster Maps Research interest changing over time Slide 31 / 37

WEBSOM http://websom.hut.fi/websom/ SOM of Finnish news bulletins for exploring and retrieving documents Labels show the topics of areas in the SOM Coloring encodes density - light areas contain more documents Navigation via zooming and panning Documents can be retrieved on the lowest level of the visualization Demo Slide 32 / 37

ThemeScapes Wise et al. 1995 Map document density to third dimension News article visualized as an abstract 3D landscape Mountains represent frequent themes in the document corpus (height proportional to number of documents relating to the theme) Spatial characteristics of the map should map to interconnections of themes http://nd.loopback.org/hyperd/zb/spire/spire.html http://infoviz.pnl.gov/technologies.html Slide 33 / 37

Outline Characteristics of text data Detecting patterns SeeSoft Arc diagrams Visualizing Plagiarism Keyword search TextArc Enhanced scrollbar TileBars Cluster Maps Visualization for the document space WEBSOM ThemeScapes Cluster map vs keyword search Slide 34 / 37

Cluster Map vs Keyword Search Chris North Cluster Map pros Facilitates non-targeted exploration and browsing by spatially organizing documents Provides overview of document set: major themes, sizes of clusters, relationships between themes Scales up Cluster Map cons How to label groups? What does the space mean? How to label space? Where to locate documents with multiple themes: both mountains, between mountains,? Relationships within documents? Algorithm (SOM) is time-consuming Slide 35 / 37

Cluster Map vs Keyword Search Chris North Keyword search pros Reduces the browsing space according to user s interests Keyword search cons What keywords do I use? What about other related documents that don t use these keywords? No initial overview Mega-hit, zero-hit problem Slide 36 / 37

Additional Sources Jonn Stasko, lecture material, CS 7450 Chris North, lecture material, CS 5764 Slide 37 / 37