Semantic Word Clouds




Semantic Analysis in Language Technology
http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm
Semantic Word Clouds
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden
Spring 2016

Previous lecture: Ontologies

Semantic Web & Ontologies
The goal of the Semantic Web is to allow web information and services to be more effectively exploited by humans and automated tools. Essentially, the focus of the Semantic Web is to share data instead of documents. This data must be meaningful both for humans and for machines (i.e. automated tools and web applications).
Q: How are we going to represent meaning and knowledge on the web?
A: Via annotation. Knowledge is represented in the form of rich conceptual schemas/formalisms called ontologies. Therefore, ontologies are the backbone of the Semantic Web. Ontologies give formally defined meanings to the terms used in annotations, transforming them into semantic annotations.

Ontologies are concepts that are hierarchically organized
- Tree of Porphyry, III AD
- WordNet, XXI AD (see Lect 5, e.g. similarity measures)

Reasoning: RDF/OWL vs Databases (and other data structures)
OWL axioms behave like inference rules rather than database constraints.
    Class: Phoenix
        SubClassOf: ispetof only Wizard
    Individual: Fawkes
        Types: Phoenix
        Facts: ispetof Dumbledore
Fawkes is said to be a Phoenix and to be the pet of Dumbledore, and it is also stated that only a Wizard can have a pet Phoenix. In OWL, this leads to the implication that Dumbledore is a Wizard. That is, if we were to query the ontology for instances of Wizard, then Dumbledore would be part of the answer.
In a database setting the schema could include a similar statement about the Phoenix class, but in this case it would be interpreted as a constraint on the data: adding the fact that Fawkes ispetof Dumbledore without Dumbledore being already known to be a Wizard would lead to an invalid database state, and such an update would therefore be rejected by a database management system as a constraint violation.
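The contrast between the two readings can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it is not an OWL reasoner or a real database, the function names `infer_types` and `check_constraint` are hypothetical, and the single axiom "a Phoenix is the pet of Wizards only" is hard-coded.

```python
def infer_types(types, facts):
    """Inference-rule reading (OWL style): derive new class memberships."""
    inferred = {name: set(classes) for name, classes in types.items()}
    for pet, owner in facts:
        # Axiom: a Phoenix is the pet of Wizards only, so any owner is a Wizard.
        if "Phoenix" in inferred.get(pet, set()):
            inferred.setdefault(owner, set()).add("Wizard")
    return inferred


def check_constraint(types, facts):
    """Constraint reading (database style): reject data that violates the rule."""
    for pet, owner in facts:
        if "Phoenix" in types.get(pet, set()) and "Wizard" not in types.get(owner, set()):
            raise ValueError(f"constraint violation: {owner} is not known to be a Wizard")


types = {"Fawkes": {"Phoenix"}}          # Individual: Fawkes, Types: Phoenix
facts = [("Fawkes", "Dumbledore")]       # Facts: ispetof Dumbledore

inferred = infer_types(types, facts)     # Dumbledore is inferred to be a Wizard
```

With the same input, `check_constraint(types, facts)` raises an error instead of adding knowledge, which is the behaviour the slide attributes to a database management system.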

So, what is an ontology for us?
"An ontology is a FORMAL, EXPLICIT specification of a SHARED conceptualization"
- FORMAL: machine-readable
- EXPLICIT: concepts, properties, relations, functions, constraints, axioms are explicitly defined
- SHARED: consensual knowledge
- Conceptualization: abstract model and simplified view of some phenomenon in the world that we want to represent
Studer, Benjamins, Fensel. Knowledge Engineering: Principles and Methods. Data and Knowledge Engineering. 25 (1998) 161-197
"An ontology is an explicit specification of a conceptualization"
Gruber, T. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition. Vol. 5. 1993. 199-220

How to build an ontology
Generally (and roughly) speaking, when designing an ontology, four main components are used:
1. Classes
2. Relations
3. Axioms
4. Instances

Practical Activity: emotions
Your remarks:
- Emotions are ambiguous: e.g. happiness can also be ill-directed
- The polarity of some emotions cannot be assessed
- etc.
Classes, Relations, Axioms, Instances, etc.

Occupational psychology (Wikipedia)
Industrial and organizational psychology (also known as I-O psychology, occupational psychology, work psychology, WO psychology, IWO psychology and business psychology) is the scientific study of human behavior in the workplace and applies psychological theories and principles to organizations and individuals in their workplace. I-O psychologists are trained in the scientist-practitioner model. I-O psychologists contribute to an organization's success by improving the performance, motivation, job satisfaction, occupational safety and health as well as the overall health and well-being of its employees. An I-O psychologist conducts research on employee behaviors and attitudes, and how these can be improved through hiring practices, training programs, feedback, and management systems.

In summary
Why build an ontology?
- To share common understanding of the structure of information among people or machines
- To make domain assumptions explicit
  - Often based on controlled vocabulary
- To analyze domain knowledge
- To enable reuse of domain knowledge

Ontologies and Tags
Ontologies and tagging systems are two different ways to organize the knowledge present on the Web.
- Ontologies have a formal foundation that derives from description logic and artificial intelligence. Domain experts decide the terms.
- Tagging systems are simpler: they integrate heterogeneous content and are based on the collaboration of users in Web 2.0 (user-generated annotation).

Folksonomies
Tagging facilities within Web 2.0 applications have shown how it might be possible for user communities to collaboratively annotate web content, and create simple forms of ontology via the development of loosely hierarchically organised sets of tags, often called folksonomies.

Folksonomy = Social Tagging
Folksonomies (also known as social tagging) are user-defined metadata collections. Users do not deliberately create folksonomies and there is rarely a prescribed purpose, but a folksonomy evolves when many users create or store content at particular sites and identify what they think the content is about. Tag clouds pinpoint the frequency of certain tags.

A common way to organize tags is in tag clouds
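Since a tag cloud mainly conveys tag frequency, the basic mapping from frequency to font size can be sketched as follows. This is a minimal illustration: the function name `tag_cloud_sizes`, the linear scaling and the point sizes are assumptions, not part of the lecture.

```python
from collections import Counter


def tag_cloud_sizes(tags, min_pt=10, max_pt=40):
    """Map each tag to a font size that grows linearly with its frequency."""
    counts = Counter(tags)
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all tags are equally frequent
    return {tag: min_pt + (max_pt - min_pt) * (count - lo) / span
            for tag, count in counts.items()}


# Hypothetical tags assigned by users to a set of bookmarks
sizes = tag_cloud_sizes(["python", "nlp", "python", "ontology", "python", "nlp"])
```

The most frequent tag gets the largest font and the rarest the smallest; real tag clouds often use logarithmic scaling instead, so that a few very popular tags do not dwarf the rest.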

Automatic folksonomy construction
The collective knowledge expressed through user-generated tags has great potential. However, we need tools to efficiently aggregate data from large numbers of users with highly idiosyncratic vocabularies and invented words or expressions. Many approaches to automatic folksonomy construction combine tags using statistical methods...
Ample space for improvement

Ontology, taxonomy, folksonomy, etc.
Many different definitions. A good summary and interpretation is here: http://www.ideaeng.com/taxonomies-ontologies-0602

Today
We will talk more generally about word clouds

Further Reading
Semantic Similarity from Natural Language and Ontology Analysis by Sébastien Harispe, Sylvie Ranwez, Stefan Janaqi, and Jacky Montmain. Synthesis Lectures on Human Language Technologies, May 2015, Vol. 8, No. 1
The two state-of-the-art approaches for estimating and quantifying semantic similarities/relatedness of semantic entities are presented in detail: the first one relies on corpora analysis and is based on Natural Language Processing techniques and semantic models, while the second is based on more or less formal, computer-readable and workable forms of knowledge such as semantic networks, thesauri or ontologies.

Previous lecture: the end

Acknowledgements
This presentation is based on the following paper: Barth et al. (2014). Experimental Comparison of Semantic Word Clouds. In Experimental Algorithms, Volume 8504 of the series Lecture Notes in Computer Science, pp 247-258
Link: https://www.cs.arizona.edu/~kobourov/wordle2.pdf
Some slides have been borrowed from Sergey Pupyrev.

Today
Experiments on semantics-preserving word clouds, in which semantically related words are close to each other.

Outline
- What is a Word Cloud?
- 3 early algorithms
- 3 new algorithms
- Metrics & Quantitative Evaluation

Word Clouds
Word clouds have become a standard tool for abstracting, visualizing and comparing texts.
We could apply the same or similar techniques to the huge amounts of tags produced by users interacting in social networks.

Comparison & conceptualization tool
Word clouds as a tool for conceptualizing documents. Cf. ontologies
Ex.: 2008, comparison of speeches: Obama vs McCain
Cf. Lect 10: Extractive summarization & abstractive summarization

Word clouds and tag clouds are often used to represent importance among terms (e.g. band popularity) or serve as a navigation tool (e.g. Google search results).

The Problem
How to compute semantics-preserving word clouds, in which semantically related words are close to each other?

Wordle
http://www.wordle.net
Practical tools, like Wordle, make word cloud visualization easy. They offer an appealing way to SUMMARIZE text.
Shortcoming: they do not capture the relationships between words in any way, since word placement is independent of context.

Many word clouds are arranged randomly (look also at the scattered colours)

Patterns and Vicinity/Adjacency
Humans are spontaneous pattern-seekers: if they see two words close to each other in a word cloud, they spontaneously think they are related.

In Linguistics and NLP
This natural tendency to link spatial vicinity to semantic relatedness is exploited: proximity is taken as evidence that words are semantically related or semantically similar.
Remember? "You shall know a word by the company it keeps" (Firth, J. R. 1957:11)

So, it makes sense to place such related words close to each other (look also at the color distribution)

Semantic word clouds have higher user satisfaction compared to other layouts

All recent word cloud visualization tools aim to incorporate semantics in the layout

but none of them provide any guarantee about the quality of the layout in terms of semantics

Early algorithms: Force-Directed Graph
Most of the existing algorithms are based on force-directed graph layout. Force-directed graph drawing algorithms are a class of algorithms for drawing graphs in an aesthetically pleasing way.
- Attractive forces between pairs of words reduce empty space
- Repulsive forces ensure that words do not overlap
- A final force preserves semantic relations between words
Some of the most flexible algorithms for calculating layouts of simple undirected graphs belong to a class known as force-directed algorithms. Such algorithms calculate the layout of a graph using only information contained within the structure of the graph itself, rather than relying on domain-specific knowledge. Graphs drawn with these algorithms tend to be aesthetically pleasing, exhibit symmetries, and tend to produce crossing-free layouts for planar graphs.
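To make the idea of forces concrete, here is one update step of a very simplified force-directed layout. It is a sketch under stated assumptions, not the method of any of the tools discussed: words are treated as points rather than rectangles, attraction is proportional to a hypothetical similarity value and to distance, repulsion falls off with the square of the distance, and the name `force_step` and its parameters are invented for illustration.

```python
import math


def force_step(pos, sim, step=0.1, repulse=0.01):
    """Move every word once: similar words attract, all words repel."""
    new_pos = {}
    for a in pos:
        fx = fy = 0.0
        for b in pos:
            if a == b:
                continue
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            d = math.hypot(dx, dy) or 1e-9            # guard against coincident points
            attract = sim.get(frozenset((a, b)), 0.0) * d   # pulls a towards b
            repel = repulse / (d * d)                       # pushes a away from b
            force = attract - repel
            fx += force * dx / d
            fy += force * dy / d
        new_pos[a] = (pos[a][0] + step * fx, pos[a][1] + step * fy)
    return new_pos


pos = {"data": (0.0, 0.0), "computer": (10.0, 0.0), "apricot": (0.0, 10.0)}
sim = {frozenset(("data", "computer")): 1.0}   # only this pair is related
new_pos = force_step(pos, sim)
```

After one step the related pair ("data", "computer") is closer together, while the unrelated word is only subject to the weak repulsion. A full layout would iterate such steps until the positions stabilise.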

Newer Algorithms: rectangle representation of graphs
Vertex-weighted and edge-weighted graph:
- The vertices of the graph are the words
  - Their weight corresponds to some measure of importance (e.g. word frequency)
- The edges capture the semantic relatedness of pairs of words (e.g. co-occurrence)
  - Their weight corresponds to the strength of the relation
- Each vertex can be drawn as a box (rectangle) with dimensions determined by its weight
- A realized adjacency is the sum of the edge weights for all pairs of touching boxes
- The goal is to maximize the realized adjacencies
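The definition of realized adjacency above can be sketched directly. The box format `(x, y, width, height)` with `(x, y)` as the lower-left corner, the function names and the toy edge weights are assumptions made for this illustration.

```python
def touching(a, b, eps=1e-9):
    """True if two boxes (x, y, w, h) share a boundary segment of positive length."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x_overlap = min(ax + aw, bx + bw) - max(ax, bx)
    y_overlap = min(ay + ah, by + bh) - max(ay, by)
    side_by_side = abs(x_overlap) <= eps and y_overlap > eps
    stacked = abs(y_overlap) <= eps and x_overlap > eps
    return side_by_side or stacked


def realized_adjacency(boxes, weights):
    """Sum the edge weights over all pairs of touching boxes."""
    words = sorted(boxes)
    total = 0.0
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            if touching(boxes[u], boxes[v]):
                total += weights.get(frozenset((u, v)), 0.0)
    return total


boxes = {
    "data": (0, 0, 4, 2),
    "computer": (4, 0, 6, 2),   # shares its left edge with "data"
    "apricot": (0, 3, 5, 2),    # separated from the others by a gap
}
weights = {
    frozenset(("data", "computer")): 0.8,
    frozenset(("data", "apricot")): 0.1,
}
score = realized_adjacency(boxes, weights)
```

Only the pair ("data", "computer") touches, so only its weight is realized. A layout algorithm that maximizes this score tries to make the heavily weighted pairs touch.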

Purpose of the experiments that are shown here: semantics preservation in terms of closeness/vicinity/adjacency

Example
A contact of 2 boxes is a common boundary.
The contact of two boxes is interpreted as semantic relatedness.
The contact of 2 boxes can be calculated, so the adjacency can be computed and evaluated.

Preprocessing:
1) Term Extraction
2) Ranking
3) Similarity/Dissimilarity Computation

Similarity/dissimilarity matrix

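Lect 6: Repetition
$$\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|} \cdot \frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$$
Which pair of words is more similar?
              large   data   computer
apricot         1       0       0
digital         0       1       2
information     1       6       1
$$\text{cosine}(\text{apricot}, \text{information}) = \frac{1 + 0 + 0}{\sqrt{1 + 0 + 0}\,\sqrt{1 + 36 + 1}} = \frac{1}{\sqrt{38}} = .16$$
$$\text{cosine}(\text{digital}, \text{information}) = \frac{0 + 6 + 2}{\sqrt{0 + 1 + 4}\,\sqrt{1 + 36 + 1}} = \frac{8}{\sqrt{38}\,\sqrt{5}} = .58$$
$$\text{cosine}(\text{apricot}, \text{digital}) = \frac{0 + 0 + 0}{\sqrt{1 + 0 + 0}\,\sqrt{0 + 1 + 4}} = 0$$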
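The worked example above can be checked with a few lines of Python. This is a plain, dependency-free sketch; the dictionary `vectors` simply restates the co-occurrence counts from the table (columns: large, data, computer).

```python
import math


def cosine(v, w):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)


vectors = {
    "apricot": [1, 0, 0],
    "digital": [0, 1, 2],
    "information": [1, 6, 1],
}

for first, second in [("apricot", "information"),
                      ("digital", "information"),
                      ("apricot", "digital")]:
    print(first, second, round(cosine(vectors[first], vectors[second]), 2))
```

The pair (digital, information) comes out as the most similar, and (apricot, digital) share no context word at all.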

Lect 06: Other possible similarity measures

Input - Output
The input for all algorithms is:
- a collection of n rectangles, each with a fixed width and height proportional to the rank of the word
- a similarity/dissimilarity matrix
The output is a set of non-overlapping positions for the rectangles.

Early Algorithms
1. Wordle (Random)
2. Context-Preserving Word Cloud Visualization (CPWCV)
3. Seam Carving

Wordle → Random
The Wordle algorithm places one word at a time in a greedy fashion, i.e. aiming to use space as efficiently as possible.
First, the words are sorted by weight/rank in decreasing order.
Then, for each word in that order, a position is picked at random.
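The two steps above can be sketched as follows. This is not Wordle's actual implementation: the slide only says that a position is picked at random, so the retry-until-free strategy, the canvas size, the fixed seed and the names `random_layout` and `overlaps` are all assumptions made for the example.

```python
import random


def overlaps(a, b):
    """True if two boxes (x, y, w, h) intersect in an area of positive size."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def random_layout(sizes, weights, width, height, seed=0, max_tries=10_000):
    """Place words in decreasing weight order at random non-overlapping positions."""
    rng = random.Random(seed)
    placed = {}
    for word in sorted(sizes, key=lambda w: weights[w], reverse=True):
        w, h = sizes[word]
        for _ in range(max_tries):
            box = (rng.uniform(0, width - w), rng.uniform(0, height - h), w, h)
            if not any(overlaps(box, other) for other in placed.values()):
                placed[word] = box
                break
        else:
            raise RuntimeError(f"could not place {word!r}")
    return placed


sizes = {"data": (20, 8), "computer": (30, 8), "apricot": (15, 5), "digital": (18, 6)}
weights = {"data": 9, "computer": 7, "apricot": 2, "digital": 4}
layout = random_layout(sizes, weights, width=200, height=100)
```

Note that the similarity matrix is never consulted: this is exactly why related words end up in unrelated places.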

Random: animation steps 1-6 (figures)

Context-Preserving Word Cloud Visualization (CPWCV)
First, a dissimilarity matrix is computed and Multidimensional Scaling (MDS) is performed. MDS aims at detecting meaningful underlying dimensions in the data.
Second, an effort is made to create a compact layout.
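To show what the MDS step produces, here is a small, dependency-free sketch that turns a dissimilarity matrix into 2D positions. It uses iterative stress majorization (the SMACOF update), which is one of several ways to compute MDS; the lecture does not say which variant CPWCV uses, and the function name `mds_2d`, the circular starting layout and the iteration count are assumptions.

```python
import math


def mds_2d(dissim, iterations=300):
    """Place n items in the plane so that distances approximate the dissimilarities."""
    n = len(dissim)
    # Deterministic starting layout: items evenly spaced on a unit circle
    pts = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
           for i in range(n)]
    for _ in range(iterations):
        new_pts = []
        for i in range(n):
            x = y = 0.0
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(pts[i], pts[j])
                # Ratio > 1 means the pair is currently too close, < 1 too far apart
                ratio = dissim[i][j] / d if d > 0 else 0.0
                x += ratio * (pts[i][0] - pts[j][0])
                y += ratio * (pts[i][1] - pts[j][1])
            new_pts.append((x / n, y / n))
        pts = new_pts
    return pts


# Toy dissimilarities between three words (a 3-4-5 triangle fits them exactly)
dissim = [[0, 3, 4],
          [3, 0, 5],
          [4, 5, 0]]
points = mds_2d(dissim)
```

The resulting points give each word an initial position; words with low dissimilarity land close together, which is the property the later layout steps try to keep.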

1: Context-Preserving

2: Context-Preserving: repulsive force

3: Context-Preserving: attractive force

Seam Carving
Basically, an algorithm for image resizing
It was invented at Mitsubishi's

1: Seam Carving

2: Seam Carving: space is divided into regions

3: Seam Carving: empty paths trimmed out iteratively

4: Seam Carving

5: Seam Carving

6: Seam Carving: space divided into regions

7: Seam Carving

3 New Algorithms
1. Inflate and Push
2. Star Forest
3. Cycle Cover

Inflate-and-Push
Simple heuristic method for word layout, which aims to preserve semantic relations between pairs of words. Based on:
1. Heuristics: scaling down all word rectangles by some constant;
2. Computing MDS (multidimensional scaling) on the dissimilarity matrix;
3. Iteratively increasing the size of the rectangles by 5% (i.e. inflating the words);
4. When words overlap, applying a force-directed algorithm to push words away.
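The four steps can be sketched as below. This is a simplified illustration, not the authors' implementation: the MDS positions are taken as given input, the starting scale of 0.1 is an assumption, and the "push" is a plain displacement along the axis of least overlap rather than a full force-directed algorithm. All function names are hypothetical.

```python
def penetration(a, b):
    """Overlap depth along x and y of two boxes given as (center_x, center_y, w, h)."""
    px = (a[2] + b[2]) / 2 - abs(a[0] - b[0])
    py = (a[3] + b[3]) / 2 - abs(a[1] - b[1])
    return px, py


def push_apart(boxes, eps=1e-6, max_rounds=1000):
    """Resolve overlaps by moving each overlapping pair apart (step 4)."""
    names = sorted(boxes)
    for _ in range(max_rounds):
        moved = False
        for i, u in enumerate(names):
            for v in names[i + 1:]:
                a, b = boxes[u], boxes[v]
                px, py = penetration(a, b)
                if px <= 1e-9 or py <= 1e-9:
                    continue  # this pair does not overlap
                moved = True
                if px <= py:   # cheaper to separate horizontally
                    shift = (px + eps) / 2
                    sign = 1 if a[0] >= b[0] else -1
                    boxes[u] = (a[0] + sign * shift, a[1], a[2], a[3])
                    boxes[v] = (b[0] - sign * shift, b[1], b[2], b[3])
                else:          # cheaper to separate vertically
                    shift = (py + eps) / 2
                    sign = 1 if a[1] >= b[1] else -1
                    boxes[u] = (a[0], a[1] + sign * shift, a[2], a[3])
                    boxes[v] = (b[0], b[1] - sign * shift, b[2], b[3])
        if not moved:
            break
    return boxes


def inflate_and_push(centers, sizes, start_scale=0.1, growth=1.05):
    """Shrink all words, then grow them by 5% per round, pushing overlaps apart."""
    scale = start_scale                      # step 1: scale everything down
    pos = dict(centers)                      # step 2: positions assumed to come from MDS
    while True:
        boxes = {w: (pos[w][0], pos[w][1], sizes[w][0] * scale, sizes[w][1] * scale)
                 for w in pos}
        push_apart(boxes)
        pos = {w: (b[0], b[1]) for w, b in boxes.items()}
        if scale >= 1.0:
            return boxes
        scale = min(1.0, scale * growth)     # step 3: inflate by 5%


centers = {"data": (0.0, 0.0), "computer": (1.0, 0.0), "apricot": (0.0, 1.0)}
sizes = {"data": (4.0, 2.0), "computer": (3.0, 2.0), "apricot": (2.0, 1.0)}
final_boxes = inflate_and_push(centers, sizes)
```

Because the words grow gradually, each round only has to fix small overlaps, so words that started close together tend to stay neighbours in the final layout.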

Inflate: starting point

Inflate: scaling down

Inflate: semantically related words are placed close to each other. Inflate the words (by 5%) iteratively.

Inflate: push words: repulsive force to resolve overlaps

Inflate: final stage

Star Forest
A star is a tree in which one center vertex is adjacent to all the other vertices (the leaves).
A star forest is a forest whose connected components are all stars.

Repetition: trees and graphs
A tree is a special form of graph, i.e. a minimally connected graph having only one path between any two vertices.
In a general graph there can be more than one path between two vertices, and the edges between nodes can be uni-directional or bi-directional.

Three steps
1. Extracting the star forest: partition a graph into disjoint stars
2. Realising a star: build a word cloud for every star
3. Pack all the stars together

Star Forest: star = tree
1. Extract stars greedily from a dissimilarity matrix → disjoint stars = star forest
2. Compute the optimal stars, i.e. the best set of words to be adjacent
3. Attractive force to get a compact layout
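The greedy extraction in step 1 can be sketched as follows. This is a simplification made for illustration: it works on similarities (e.g. 1 minus the dissimilarity), it picks as center the word with the largest total similarity to the remaining words, and it simply attaches the `max_leaves` most similar words as leaves, whereas the slide speaks of computing the optimal set of adjacent words. The function name and the toy similarity values are hypothetical.

```python
def extract_star_forest(words, sim, max_leaves=2):
    """Greedily partition the words into disjoint stars (center, leaves)."""
    def s(u, v):
        return sim.get(frozenset((u, v)), 0.0)

    remaining = sorted(words)
    stars = []
    while remaining:
        # Center: the word most strongly related to everything still unassigned
        center = max(remaining,
                     key=lambda u: sum(s(u, v) for v in remaining if v != u))
        others = [w for w in remaining if w != center]
        # Leaves: the words most similar to the chosen center
        leaves = sorted(others, key=lambda v: s(center, v), reverse=True)[:max_leaves]
        stars.append((center, leaves))
        remaining = [w for w in others if w not in leaves]
    return stars


pairs = {("a", "b"): 0.9, ("a", "c"): 0.8, ("a", "d"): 0.1, ("a", "e"): 0.1,
         ("b", "c"): 0.2, ("b", "d"): 0.1, ("b", "e"): 0.1,
         ("c", "d"): 0.1, ("c", "e"): 0.1, ("d", "e"): 0.7}
sim = {frozenset(pair): value for pair, value in pairs.items()}
forest = extract_star_forest(["a", "b", "c", "d", "e"], sim)
```

Every word ends up in exactly one star, so each star can be laid out on its own (step 2) before the stars are packed together (step 3).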

Cycle Cover
This algorithm is based on a similarity matrix.
First, a similarity path is created.
Then, the optimal level of compactness is computed.

Quantitative Metrics
1. Realized Adjacencies: how close are similar words to each other?
2. Distortion: how distant are dissimilar words?
3. Uniform Area Utilization: uniformity of the distribution (overpopulated vs sparse areas in the word cloud)
4. Compactness: how well utilized is the drawing area?
5. Aspect Ratio: width and height of the bounding box
6. Running Time: execution time
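Two of these metrics are easy to compute from the final boxes. The sketch below is one plausible reading of the slide, not the exact definitions used in the experiments: compactness is taken as the fraction of the bounding box covered by words, and aspect ratio as bounding-box width divided by height. Boxes are `(x, y, width, height)` with `(x, y)` as the lower-left corner.

```python
def bounding_box(boxes):
    """Width and height of the smallest axis-aligned rectangle containing all boxes."""
    left = min(x for x, y, w, h in boxes)
    bottom = min(y for x, y, w, h in boxes)
    right = max(x + w for x, y, w, h in boxes)
    top = max(y + h for x, y, w, h in boxes)
    return right - left, top - bottom


def compactness(boxes):
    """Fraction of the bounding box covered by (non-overlapping) word boxes."""
    width, height = bounding_box(boxes)
    used = sum(w * h for x, y, w, h in boxes)
    return used / (width * height)


def aspect_ratio(boxes):
    """Width of the bounding box divided by its height."""
    width, height = bounding_box(boxes)
    return width / height


layout = [(0, 0, 4, 2), (4, 0, 6, 2), (0, 2, 5, 2)]
print(compactness(layout), aspect_ratio(layout))
```

Here the three words cover 30 of the 40 area units of a 10 x 4 bounding box; a layout with less wasted space would score closer to 1.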

2 datasets
(1) WIKI, a set of 112 plain-text articles extracted from the English Wikipedia, each consisting of at least 200 distinct words
(2) PAPERS, a set of 56 research papers published in conferences on experimental algorithms (SEA and ALENEX) in 2011-2012

Cycle Cover wins

Seam Carving wins

Random wins

Inflate wins

Random and Seam Carving win

All ok except Seam Carving

Demo

The end