Automatically Generated Tag Clouds




Geraldo Xexéo 1,2, Fernando Morgado 1, Patrícia Fiuza 1
1 Programa de Engenharia de Sistemas e Computação, COPPE/UFRJ
2 Departamento de Ciência da Computação, IM/UFRJ
{xexeo,fernandofm,patfiuza}@cos.ufrj.br

Abstract. This paper proposes a formal model for discussing the construction of tag clouds, applicable both to human beings and computer programs. From this model we also propose a methodology for the automatic generation of tag clouds, which aims to achieve results similar to the ones produced by human beings. We also present a specific implementation and describe some of our initial results.

1. Introduction

With the advancement of social software sites, like Delicious[1] and Flickr[2], and the increasing incentive to use tags to describe various types of information objects, it became common to present these tags in a format known as a tag cloud. According to Hassan and Herrero Solana[3], a tag cloud is a visual list of tags arranged so as to transmit information and meaning through the use of different font sizes, styles and colors, based on their importance within the group in which they appear. The more popular a tag is, the larger its font and the more prominent it is in the tag cloud in which it occurs. Therefore, tag clouds provide a summary or a semantic view of the concepts most important to represent an object [4].

Human beings build this semantic view by associating the concepts they understand from an observed object with the words that represent them, under their specific points of view. If these objects are documents, those words can be found in their contents or inferred from the understanding of these contents. In general, people can easily produce words that represent concepts, since they have a reasonable knowledge of the language and of the world. They can also describe concepts as sentences, even if partial ones.
Automatic systems, however, do not possess this knowledge; therefore, they must infer knowledge exclusively from the words that compose the document or from some other information, such as metadata or the tagging applied to similar documents.

We should also point out that the relation between words and concepts is not bijective. A single word can represent more than one concept, which is known as polysemy. The word crane, for example, can represent a bird or a piece of construction equipment. On the other hand, a concept can be represented by many words (or sequences of words), such as film, movie, picture, motion picture and flick, which is known as synonymy. Also, there may be no single word that represents a concept, as with blood bank or gene bank. Finally, there are also words with no associated meaning, such as prepositions, that are nevertheless very frequent in textual documents. One could possibly also argue for the existence of concepts that cannot be represented by words,

but this is not the mainstream reasoning. However, it is common sense that there are concepts that need a complex text structure to be described.

Facing these limitations, we understand that an automatic solution, such as the one we propose, can only be effective when developed as an approximation of human behavior. To enable this approximation, we judge it necessary to create a model that allows the description of the semantic view of a document, both when analyzed from the human point of view and when analyzed from the point of view of a computer program.

Therefore, this work aims to present a formal model for building tag clouds, applicable both to human beings and computer programs, which is presented in section 3. From this model we also propose a methodology for the automatic generation of tag clouds, in section 4, which aims to achieve results similar to the ones produced by human beings, a claim defended in section 5. In the next section we give a brief introduction to tag clouds and their automatic generation.

2. Tag Clouds

In this section we provide some informal definitions for tag clouds and some related concepts, aiming to provide a common understanding that will allow us to later build a formal framework. We also discuss some questions regarding the difference between human generated and computer generated tag clouds.

We adopt the definition proposed by Rivadeneira et al.[5]: tag clouds are visual presentations of a set of words, typically a set of tags selected by some rationale, in which attributes of the text such as size, weight, or color are used to represent features of the associated terms. Moreover, a tag cloud usually has another component: a reference, such as a title or a caption, that indicates the object to which the tag cloud relates. There is no need for the object to be concrete or accessible through a URL, such as a document file. For instance, a tag cloud can refer to an event that occurred or will occur, such as a rock show.
A tag can be defined as any label or symbol attached to an object, such as a document, an image, a piece of music, etc. Usually, these labels are short and often consist of a single word[6]. Each tag usually denotes a concept related to the object to which it is bound. Concepts can include ideas of origin, purpose, description, and others.

Marinchev [7] identified the abstract set of concepts related to an object, and represented in the tag cloud, with a semantic field. A semantic field is the set of concepts connected to a focus, but in a form that is now independent of the originating taggers, and available to other people for understanding [7]. As a result, a tag cloud is a visual representation of the semantic field of an object.

The entire process of creating a tag cloud can be summarized in three steps[7]:

1. Understanding an object/focus and the concepts that can be applied to it.
2. Capturing the semantic field around the object/focus.
3. Transforming the semantic field into a tag cloud.

A fourth step of this process of creation is the actual use of the tag cloud, which is recreated as the final user interprets it and attempts to understand it as an actual object or as possible objects inside a context.

Two main aspects of presentation are considered during the construction of tag clouds: the properties of the font and the placement of the tags. The font size is usually used to represent the importance of tags, while a common use of color is to highlight possible categories in the set of tags. Tags can be arranged in alphabetical order or based on frequency. They may also have a random position. Terms that have the same semantic classification can be placed close to each other[4]. Although first generation tag clouds were 2-dimensional, nowadays one can find 3-dimensional tag clouds.

2.1. Computers and Tag Clouds

In our work, we automatically generate tag clouds from text documents, aiming to approximate a human generated tag cloud. To accomplish that, we decided to mimic the process described by Marinchev [7] (explained above). However, the first step of the process is a challenge, since it says that to create a tag cloud one must first create concepts in human minds. These concepts are abstract thoughts that cannot always be perfectly described in words. For example, reading the end of Shakespeare's Romeo and Juliet can induce an overall sentiment of sadness that is only approximately described by tags such as sad or unhappy.

Therefore, to adopt that process we must first decide how to represent concepts in a computer. It should be clear that this representation is not the same as the representation provided by tags. Tags are symbols, usually words, that humans can understand and to which they can assign some meaning. Concepts, in the cognitive sense, are abstract thoughts, while in the computational sense they must be modeled as some data structure, or even as a procedure or rule. One can select, for example, WordNet synsets [8] to represent concepts. It is not unreasonable, however, to select words to represent them, even the same words used as tags.

There is some previous work on automatic tag cloud generation.
PubCloud[9] is a tool, based on the use of tag clouds, that summarizes the results returned by a search and allows navigation from the tag cloud to the results. CloudMine[6] is a tool that categorizes, summarizes and displays the most important terms of a document as text clouds. Some articles, such as [10] and [11], predict tags by learning from previously tagged documents. None of them builds a formal model to discuss tag cloud generation, which is the main contribution of this article.

3. A Formal Model to Discuss Tag Clouds

In this section we create from scratch a formal and conceptual definition of tag clouds that allows us to derive an abstract method for building them, similar to the one briefly described by Marinchev[7].

We start by defining resources and contexts. Our motivation for these definitions is that we are interested in creating tag clouds that describe resources in a context.

Resources are any abstract concept or physical entity that can be uniquely identified on the web or outside it[12]. The definition of resource is left open, to follow the approach used in the RFC line of documents. In our case, a resource is any identifiable object that can be described, at least partially, by a set of tags. These tags act as representations of concepts that reside in the human mind or in computer data structures and can be applied to the resource according to some rationale. In the Web, resources

are identified by URIs. There are many ways to represent resource properties (such as metadata); however, RDF[13] representations are standard and stable. A resource is represented by the letter $r$, possibly indexed.

A context, denoted by the letter $w$, is a set of resources that can be analyzed as a whole. The context containing all resources is denoted by $W$, for the web. Therefore, $w$ is not an element of $W$, but a subset of it. Contexts can be abstract, as when defined by a single word such as Medicine, or very objective, such as the answers to the query soccer game by a specific search engine. Contexts can include documents, real life objects and events, i.e., anything that can be described as a resource. There is no restriction that all resources of a context be of the same type.

Figure 1. UML Model for resources and contexts

3.1. Preliminary definitions: attribute pair set

In this subsection we define some preliminary concepts that will lead us to the definition of an attribute pair set, which is an abstraction created to allow us to assign a dynamic set of attributes and their values to an object.

An object is a primitive concept and therefore is not defined in the theory. As in object oriented theory, objects are the root set to which all our other defined concepts belong. Both atomic elements and sets belong to the set of objects. The set of all objects is the universal set, denoted $U$.

A domain, or value set, denoted by $V_i$, is a set of values. We use domains as they are used in most of database theory: to define a set of admissible values for an attribute. They are indexed, as in $V_i$, to represent the fact that we are using multiple domains. Values in a domain can be indicated by a doubly indexed letter $v$, as in $v_{ij}$, to show that the value $v_{ij}$ belongs to domain $V_i$. We place no prior requirement on a domain, such as being finite or composed of atomic values. The set of all domains is represented by the letter $V$ (non indexed).
An attribute $a$ of an object $o$ is a property that describes $o$. The set of all possible attributes is denoted by $A$. Although abstract, attributes are usually represented (named) by strings. One should expect these strings to be words or sequences of words with some clear meaning. Humans can usually associate a domain with an attribute easily, e.g., meters to evaluate distance or integers to evaluate age. Computer programs, on the other hand, need this association to be made explicit by some declaration, e.g., a type declaration such as those found in different programming languages.

A domain attribution function is a function that associates a domain with an attribute:

$f_{da} : A \to V$

When defined, a domain attribution function represents the types of values that can be assigned to an attribute.

From now on, we suppose that our sets $A$, $V$ and the function $f_{da}$ are defined. One simple example of possible values is:

$A = \{\text{color}, \text{size}\}$
$V = \{\text{Colors}, \text{SmallIntegers}\}$
$\text{Colors} = \{\text{red}, \text{green}, \text{blue}, \text{yellow}, \text{black}, \text{white}\}$
$\text{SmallIntegers} = \{1..256\}$
$f_{da} = \{(\text{color}, \text{Colors}), (\text{size}, \text{SmallIntegers})\}$

An attribute pair is an ordered pair $(a_i, v_{ij})$:

$(a_i, v_{ij}) \in A \times V_i$, where $f_{da}(a_i) = V_i$.

Attribute pairs describe the value of an attribute $a_i$ in some particular context. We create attribute pairs to allow for the dynamic selection of attributes that can be applied to an object. In this way, later on, we will not be obliged to define in advance which attributes can be used to describe an object, i.e., its class, as in traditional object-oriented theory.

An attribute pair set, or type restricted map, or simply a map, is a set of attribute pairs where every first element of an ordered pair is unique among all members of the map. Maps will be used further on to represent the set of attributes that can be used to describe an object. A map will be denoted by $m$, and formally defined as:

$m = \{ (a_i, v_{ij}) \mid (a_i, v_{ij}) \in A \times V_i \land f_{da}(a_i) = V_i \land (\forall (a_k, v_{kn}) \in m : a_i = a_k \Rightarrow v_{ij} = v_{kn}) \}$

The set of all possible maps will be denoted by $M$.

Figure 2. UML Model representing the basic framework.

3.2. Classification and attribution functions

In this section we use the definitions in the basic framework to apply the concept of attributes to objects.

A classification function is a function $f_c$ that, given an object $o$, generates a set of attributes $A_i$, $A_i \subseteq A$, that can be used to represent the attributes of the object.

$f_c : O \to \mathcal{P}(A)$
$f_c(o) = A_i = \{ a_{ij} \mid a_{ij} \text{ is an attribute of } o \}$

A map attribution function is a function $f_{ma}$ that, given an object and a set of attributes $A_i$, $A_i \subseteq A$, generates a map in which an attribute pair corresponds to each attribute of $A_i$.

$f_{ma} : O \times \mathcal{P}(A) \to M$
$f_{ma}(o, A_i) = \{ (a_{ij}, v_{ij}) \mid a_{ij} \in A_i \land ((a_{ij}, v_{ij}) \in f_{ma}(o, A_i) \Rightarrow f_{da}(a_{ij}) = V_i) \}$
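The definitions of sections 3.1 and 3.2 can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes that domains are finite Python sets, that objects are plain dictionaries, and that a map is a dictionary from attribute names to values. The names `F_DA`, `make_map`, `classify` and `map_attribution` are hypothetical and were chosen only for this example.

```python
# Domains (value sets), using the example values from the text.
COLORS = frozenset({"red", "green", "blue", "yellow", "black", "white"})
SMALL_INTEGERS = frozenset(range(1, 257))  # {1..256}

# Domain attribution function f_da: attribute name -> domain.
F_DA = {"color": COLORS, "size": SMALL_INTEGERS}


def make_map(pairs):
    """Build a type restricted map from (attribute, value) pairs.

    Hypothetical helper. Raises ValueError if an attribute is unknown,
    if a value lies outside the attribute's domain, or if the same
    attribute appears with two different values.
    """
    result = {}
    for attribute, value in pairs:
        if attribute not in F_DA:
            raise ValueError(f"unknown attribute: {attribute!r}")
        if value not in F_DA[attribute]:
            raise ValueError(f"{value!r} is outside the domain of {attribute!r}")
        if attribute in result and result[attribute] != value:
            raise ValueError(f"attribute {attribute!r} has two values")
        result[attribute] = value
    return result


def classify(obj):
    """Toy classification function f_c: attributes that describe obj.

    Assumption: an object is a dict, and its attributes are the keys
    for which f_da is defined.
    """
    return {name for name in F_DA if name in obj}


def map_attribution(obj, attributes):
    """Toy map attribution function f_ma: one pair per attribute."""
    return make_map((name, obj[name]) for name in sorted(attributes))


tag_like_object = {"color": "red", "size": 16, "note": "not an attribute"}
object_map = map_attribution(tag_like_object, classify(tag_like_object))
```

Using a dictionary enforces the uniqueness condition of the map definition by construction; the explicit checks cover the domain condition $f_{da}(a_i) = V_i$.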

3.3. Applying attributes to resources

In this section we use the definitions in the basic framework to apply the concept of attributes to resources.

A resource classification function is a classification function $f_{rc}$ for which the set of objects is restricted to the set of resources.

$f_{rc} : W \to \mathcal{P}(A)$
$f_{rc}(r) = A_i = \{ a_{ij} \mid a_{ij} \text{ is an attribute of } r \}$

A map attribution function for a resource $r$ is a map attribution function $f_{mar}$ for which the set of objects is restricted to the set of resources. The resulting map usually represents properties of the resource.

$f_{mar} : W \times \mathcal{P}(A) \to M$
$f_{mar}(r, A_i) = \{ (a_{ij}, v_{ij}) \mid a_{ij} \in A_i \land ((a_{ij}, v_{ij}) \in f_{mar}(r, A_i) \Rightarrow f_{da}(a_{ij}) = V_i) \}$

A resource classification function is a function that, given a resource, establishes which attributes can be evaluated for it. A map attribution function is an evaluation function that returns the values for a set of properties of a resource. It can also be understood as the application of a set of evaluation functions, each one returning the value of a specific property of an object.

From the above definitions, we now have the vocabulary to discuss how, given a resource, we can dynamically generate a set of attributes and their values. The concepts described as sets and functions can be seen in Figure 1 as a UML model.

A resource representation, or resource map, for a resource $r$, $RM(r)$, is a map:

$RM(r) = f_{ma}(r, f_{rc}(r))$

Resource maps act as representations of resources. For example, the resource map of a document can be formed by tuples representing its bag of words.

Figure 3. A UML model representing the use of maps to describe a resource (as a ResourceMap).

3.4. Concepts and Semantic Fields

In this section we gradually build the concept of an abstract tag cloud that could be applied to any object.

We start by formalizing Marinchev's [7] concept of semantic field. For that, we must start by supposing the existence not only of a set of resources, but also of a set of concepts, denoted $C$.
Concepts can be extremely abstract, as when we are talking about concepts formed in the human mind, or much more concrete, as when concepts are represented in computer data structures.

The formal definition of concept is not an easy task, and has led to many discussions in philosophy. We use the conservative approach of adopting the Classical Theory of Concepts, that is: most concepts are structured mental representations that encode a set of necessary and sufficient conditions for their application, if possible, in sensory and perceptual terms [14]. However, we aim to explain concepts both as human and as computer based phenomena. Therefore, we will accept that concepts do not need to be mental representations, but only adequate cognitive representations.

Given a resource $r$, from a context $w$, and a set of abstract concepts $C$, a semantic field for $r$ is a set $SF(r)$ of concepts

$SF(r) = \{ c_i \mid c_i \in C \land \mathit{applies}(c_i, r) \}$

where $\mathit{applies}$ is a logical predicate that represents the fact that a concept can be used to describe, in some way, an object or an object property.

Therefore, a semantic field is a set of abstract concepts that, somehow, can be applied to an object with the aim of building some understanding of it. At times we will be interested in describing the semantic field of an object under a specific context, and we will use a subscript to represent it, as in $SF_w(r)$.

A concept classification function is a classification function $f_{cc}$ that, given a resource $r$ and a concept $c$, generates a set of attributes $A_i$, $A_i \subseteq A$, that can be used to represent the attributes of the concept $c$ when referring to the resource $r$.

$f_{cc} : W \times C \to \mathcal{P}(A)$
$f_{cc}(r, c) = A_i = \{ a_{ij} \mid a_{ij} \text{ is an attribute of } c \text{ when referring to } r \}$

A map attribution function for a concept $c$ is a map attribution function $f_{mac}$ that, given a concept $c$, a resource $r$, and a set of attributes $A_i$, $A_i \subseteq A$, generates a map containing one attribute pair for each attribute of $A_i$:
$f_{mac} : W \times C \times \mathcal{P}(A) \to M$
$f_{mac}(r, c, A_i) = \{ (a_{ij}, v_{ij}) \mid a_{ij} \in A_i \land ((a_{ij}, v_{ij}) \in f_{mac}(r, c, A_i) \Rightarrow f_{da}(a_{ij}) = V_i) \}$

A valued semantic field for a resource $r$ is a set of ordered pairs where the first element is a concept applicable to the resource, and the second element is an attribute pair set composed of the attributes induced by $c_i$ in $r$.

$VSF(r) = \{ (c_i, m_i) \mid c_i \in SF(r) \land m_i = f_{mac}(r, c_i, f_{cc}(r, c_i)) \}$

Although Marinchev's article[7] only discusses semantic fields, i.e., the mapping of concepts to resources, we believe that this mapping cannot be assumed to be free of subtleties and additional information. Moreover, computers do not really deal with concepts, but with some representation that can be mapped to a concept. These representations benefit greatly from being able to carry additional information with them.

For example, given that we choose to represent concepts by WordNet synsets, an SF for a document $d$ can be the set S = {person, individual, someone, somebody, mortal, soul}. However, it is interesting to know which words were used to obtain the synset. To that end, we can have the label original words defining an attribute pair in our map for synset S, with the set {person, individual} as its value, describing which words found in the document $d$ generated the synset.
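The synset example above can be illustrated with a small sketch of a valued semantic field. It is only an illustration under stated assumptions: `TOY_SYNSETS` is a hand-written, hypothetical stand-in for a real WordNet lookup, concepts are represented as frozen sets of words, and the only attribute recorded is the `original words` label mentioned in the text.

```python
# Hypothetical stand-in for a WordNet lookup: word -> synset (concept).
PERSON = frozenset(
    {"person", "individual", "someone", "somebody", "mortal", "soul"}
)
TOY_SYNSETS = {word: PERSON for word in PERSON}


def valued_semantic_field(document_words):
    """Return {concept: map} for the words of one document.

    Each map holds a single attribute pair, labelled 'original words',
    whose value is the set of document words that led to the concept.
    """
    vsf = {}
    for word in document_words:
        synset = TOY_SYNSETS.get(word)
        if synset is None:
            continue  # no known concept applies to this word
        attribute_map = vsf.setdefault(synset, {"original words": set()})
        attribute_map["original words"].add(word)
    return vsf


vsf = valued_semantic_field(["a", "person", "is", "an", "individual"])
semantic_field = set(vsf)  # SF(r): the concepts alone, without their maps
```

Keeping the map next to each concept is what distinguishes the valued semantic field from the plain semantic field: the concept set is recovered by dropping the maps.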

A semantic field generator is a function $f_{sfg}$ that, given a context $w$, a specific resource $r$, $r \in w$, and a set of concepts, generates a semantic field $SF(r)$ which indicates a set of concepts that can be considered, under some reasoning, to be applicable to $r$ in context $w$.

$f_{sfg} : W \times \mathcal{P}(W) \to \mathcal{P}(C)$
$f_{sfg}(r, w) = \{ c_i \mid c_i \in C \land \mathit{applies}_w(c_i, r) \} = SF_w(r)$

A valued semantic field generator is a function $f_{vsfg}$ that, given a context $w$, a specific resource $r$, $r \in w$, and a set of concepts, generates a valued semantic field $VSF(r)$ which indicates a set of concepts that can be considered, under some reasoning, to be applicable to $r$, together with their corresponding maps.

$f_{vsfg} : W \times \mathcal{P}(W) \to \mathcal{P}(C \times M)$
$f_{vsfg}(r, w) = \{ (c_i, m_i) \mid c_i \in SF_w(r) \land m_i = f_{mac}(r, c_i, f_{cc}(r, c_i)) \}$

We would like to point out that although it is possible to generate a semantic field from a single resource, it is much more reasonable to consider that generating this semantic field requires analyzing not only the resource, but also the context in which it is inserted. This context is characterized by all other resources that can somehow be viewed in the same scope as the analyzed resource.

Figure 4. Modelling Semantic Fields in UML.

3.5. From Tags to Abstract Tag Clouds

A tag classification function is a function $f_{tc}$ that, given a resource $r$ and a tag $t$, generates a set of attributes $A_i$, $A_i \subseteq A$, that can be used to represent the attributes of the tag $t$ when referring to the resource $r$.

$f_{tc} : W \times T \to \mathcal{P}(A)$
$f_{tc}(r, t) = A_i = \{ a_{ij} \mid a_{ij} \text{ is an attribute of } t \text{ when referring to } r \}$

A map attribution function for a tag $t$ is a function $f_{mat}$ that, given a tag $t$, a resource $r$, and a set of attributes $A_i$, $A_i \subseteq A$, generates a map containing one attribute pair for each attribute of $A_i$:
$f_{mat} : W \times T \times \mathcal{P}(A) \to M$
$f_{mat}(r, t, A_i) = \{ (a_{ij}, v_{ij}) \mid a_{ij} \in A_i \land ((a_{ij}, v_{ij}) \in f_{mat}(r, t, A_i) \Rightarrow f_{da}(a_{ij}) = V_i) \}$

Given a resource $r$, from a context $w$, and a set of tags $T$, a tag field for $r$ is a set of tags

$TF_w(r) = \{ t_j \mid t_j \in T, \exists c \in SF(r) : \mathit{represents}(t_j, c) \}$,

where each tag $t_j$ is a symbol, usually a word or a short sequence of words, that represents one or more concepts that can be applied to $r$. Again, we leave the definition of the predicate $\mathit{represents}$ open to interpretation. However, we point out that there is no requirement that it be total over $SF(r)$.

A tag field assumes the role of a concrete representation of a semantic field. Also, there is no difference between a tag field created by humans and one created by computers. Both are sets of concrete symbols. One semantic field can induce different tag fields according to the symbols (words) available and the $\mathit{represents}$ function chosen.

A tag field generator is a function $f_{tfg}$ that, given a context $w$ and a specific resource $r$, $r \in w$, generates a tag field $TF(r)$ which represents the set of tags that can be considered, under some reasoning, to be applicable to $r$ in context $w$.

$f_{tfg} : W \times \mathcal{P}(W) \to \mathcal{P}(T)$
$f_{tfg}(r, w) = \{ t_i \mid \exists c_j \in SF_w(r) : \mathit{represents}_w(t_i, c_j) \} = TF_w(r)$

Given a tag field $TF(r)$, an abstract tag cloud $ATC(r)$ is a set of tuples

$ATC(r) = \{ (t_i, m_i) \}$,

where $t_i$ is a tag belonging to $TF(r)$, and $m_i$ is a map that represents the attributes of the tag.

An abstract tag cloud generator is a function $f_{atcg}$ that, given a context $w$ and a specific resource $r$, $r \in w$, generates an abstract tag cloud $ATC(r, w)$ which indicates a set of tags that can be considered, under some reasoning, to represent the concepts applicable to $r$, together with their corresponding maps.

$f_{atcg} : W \times \mathcal{P}(W) \to \mathcal{P}(T \times M)$
$f_{atcg}(r, w) = \{ (t_i, m_i) \mid t_i \in TF_w(r) \land m_i = f_{mat}(r, t_i, f_{tc}(r, t_i)) \}$

For example, given that the text Romeo and Juliet is an object, the concept of forbidden love (which we cannot avoid representing as words) can be represented by two tags, forbidden and love, which can appear in the abstract tag cloud as:

{ (forbidden, {(color, black), (size, 12), (x, 1), (y, 10)}), (love, {(color, red), (size, 16), (x, 20), (y, 20)}) }

One should notice that not all attributes must be representative of visual characteristics.
It is possible to have hidden attributes, such as the tf-idf [15] value of a word in a text, which will be used in some representation or algorithm.

Figure 5. Modeling Tags and Abstract Tag Clouds in UML.
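The Romeo and Juliet example can be written down as a small data structure to show how a tag field and an abstract tag cloud relate, and how hidden attributes sit next to visual ones. This is a sketch under assumptions: the `REPRESENTS` table, the helper names, and the `tfidf` values are hypothetical and invented for illustration; only the visual attribute values come from the example in the text.

```python
# Hypothetical `represents` relation: concept -> tags that represent it.
# It does not have to cover every concept of the semantic field.
REPRESENTS = {"forbidden love": {"forbidden", "love"}}

# Attributes that a renderer would use; everything else is hidden.
VISUAL_ATTRIBUTES = {"color", "size", "x", "y"}


def tag_field(semantic_field):
    """Collect every tag that represents some concept of the field."""
    tags = set()
    for concept in semantic_field:
        tags |= REPRESENTS.get(concept, set())
    return tags


# Abstract tag cloud: tag -> map. The tfidf values are made up and act
# as hidden (non-visual) attributes.
abstract_tag_cloud = {
    "forbidden": {"color": "black", "size": 12, "x": 1, "y": 10, "tfidf": 0.8},
    "love": {"color": "red", "size": 16, "x": 20, "y": 20, "tfidf": 1.3},
}


def visual_part(cloud):
    """Keep only the attributes that describe visual characteristics."""
    return {
        tag: {k: v for k, v in attributes.items() if k in VISUAL_ATTRIBUTES}
        for tag, attributes in cloud.items()
    }
```

Here the concept sadness, if present in the semantic field, would simply contribute no tag, which mirrors the remark that the represents predicate need not be total.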

3.6. Tag Clouds are Visual Representations

We are now able to define tag clouds as a suitable visual representation of an abstract tag cloud.

To illustrate the idea, we present a tag cloud (Figure 6) generated from a document about the classification of text documents using wavelets, a member of a set of documents about web intelligence. Due to printing limitations, we use only size as the visual attribute of a tag, to indicate its importance. Larger font sizes indicate greater importance of a tag. In contrast, smaller font sizes are used for less important tags. The arrangement, i.e., x positions, will follow the frequency of the tags in the document.

[Tag cloud, font sizes not recoverable: wavelets classification term signal representation document text domain compression transformation recall precision original signal reduction compression daubechies haar]

Figure 6. A simple tag cloud

3.7. Creating Abstract Tag Clouds

From this sequence of definitions, one can derive the process of generating tag clouds for a document as:

1. Select a document and its context.
2. Build a valued semantic field for the document.
3. Use the valued semantic field to define an abstract tag cloud for the document.
4. Generate a suitable visual representation for the abstract tag cloud.

We make no assumptions about how these procedures will be developed. Many questions are left open and should be settled only in a particular implementation. For example, we make no decision on how to represent concepts.

4. Generating Tag Clouds for a Document

In this section we discuss the steps involved in the process of creating tag clouds. These steps follow the theoretical model proposed in the previous section.

The tag creation process starts by applying text preprocessing techniques to a document (the resource), as described in [16]. We continue by selecting the nouns from the clean text. We focus on nouns because it is easier to understand them as complete concepts and they are the most common type of tag used by humans.
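The last step of the process, turning an abstract tag cloud into a visual one, can be sketched for the size-only case used in Figure 6. The text does not say how importance is mapped to font size, so the linear interpolation below, the size range, and the function names `font_sizes` and `layout` are assumptions made for illustration; the frequencies in the usage example are invented.

```python
def font_sizes(weights, min_size=10, max_size=32):
    """Map each tag weight to a font size by linear interpolation.

    Assumption: the least important tag gets min_size and the most
    important tag gets max_size.
    """
    if not weights:
        return {}
    low, high = min(weights.values()), max(weights.values())
    span = high - low
    sizes = {}
    for tag, weight in weights.items():
        if span == 0:
            sizes[tag] = max_size  # all tags are equally important
        else:
            scaled = (weight - low) / span
            sizes[tag] = round(min_size + scaled * (max_size - min_size))
    return sizes


def layout(weights):
    """Order tags by decreasing weight and attach their font sizes."""
    sizes = font_sizes(weights)
    ordered = sorted(weights, key=lambda tag: (-weights[tag], tag))
    return [(tag, sizes[tag]) for tag in ordered]


# Invented frequencies for three of the tags shown in Figure 6.
example = layout({"wavelets": 9, "classification": 6, "haar": 1})
```

Other mappings, such as logarithmic scaling, would fit the model equally well, since the model only requires that size be one attribute in the map of each tag.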
Following our model, the first step is to create a resource representation, RM(r). We do that by extracting the following information from the document:

1. A list of terms (stems) and their tf-idf values.
2. A list of bigrams of stems and their tf-idf values.
3. A map from terms and bigrams to sets of words, describing which words were reduced to each term or bigram.
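As an illustration of the three items listed above, the following is a minimal sketch of a resource representation, not the implementation used in this work. The names `naive_stem` and `resource_representation` are hypothetical, and the toy suffix-stripping stemmer stands in for whatever stemmer a real implementation would use. The sketch skips noun selection and stop-word removal, lets bigrams cross sentence boundaries, and returns raw frequencies only, since tf-idf values also require the document collection.

```python
import re
from collections import Counter, defaultdict


def naive_stem(word):
    """Toy stemmer (an assumption for illustration): strip a few English suffixes."""
    if word.endswith("ss"):
        return word  # keep words such as "compress" intact
    for suffix in ("ations", "ation", "ings", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def resource_representation(text):
    """Return (term frequencies, bigram frequencies, map from stems to original words)."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(w) for w in words]

    term_tf = Counter(stems)                    # item 1: terms (stems)
    bigram_tf = Counter(zip(stems, stems[1:]))  # item 2: bigrams of stems

    # Item 3: which original words were reduced to each term or bigram.
    surface = defaultdict(set)
    for word, stem in zip(words, stems):
        surface[stem].add(word)
    for i in range(len(words) - 1):
        surface[(stems[i], stems[i + 1])].add(words[i] + " " + words[i + 1])

    return term_tf, bigram_tf, dict(surface)


if __name__ == "__main__":
    tf, bigrams, surface = resource_representation(
        "Wavelets compress signals. A wavelet compresses a signal."
    )
    print(tf.most_common(3))
    print(surface["wavelet"])
```

Keeping the map from stems back to the original words is what later allows a readable word, rather than a truncated stem, to be shown as the tag.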

Creating this resource representation is the implementation of $f_{mar}$. Tf-idf is a traditional measure of term relevance used in information retrieval, defined as [15]:

$$w_{ij} = \frac{tf_{ij}}{\max_k (tf_{ik})} \log\left(\frac{N}{n_j}\right)$$

where:

$w_{ij}$ is the weight of term j in document i
$tf_{ij}$ is the frequency of term j in document i
$N$ is the number of documents in the collection
$n_j$ is the number of documents containing term j

The next step is calculating the semantic field, SF(r). In our implementation we actually use RM(r) as input, considering that $f_{sfg}(r) = g_{sfg}(f_{mar}(r))$, where $g_{sfg}$ is an auxiliary function. Our current strategy is to rank all terms and bigrams by tf-idf and select the top 40. The weighting provided by the tf-idf measure allows us to select terms representing the documents in a collection in a simple and efficient way. However, we are currently experimenting with other measures, such as information gain [17].

Currently, our semantic field is a list of terms and term bigrams. To further improve our results, we extend our semantic field using WordNet [8]. WordNet is a lexical database of English in which nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations [8]. These relations allow us to simulate some human behavior when creating tag clouds, such as looking for synonyms and hypernyms, thereby enriching the set of terms.

From the semantic field we generate additional information that is useful for tag cloud calculation and create a valued semantic field, VSF(r). This additional information includes the (original) words present in document r that correspond to the terms selected for the semantic field.

Figure 7. Activity Diagram representing the steps for tag cloud creation.

We use VSF(r) to calculate the tag field, TF(r).
Our approach is to look for the lemmas of the words that generated the terms chosen for the semantic field and arbitrarily select one or more of them, according to frequency considerations. A lemma is the canonical form of a set of words (a lexeme); for nouns it is the nominative singular, usually masculine. Finally, to generate the abstract tag cloud, we again use tf-idf as the first (hidden) attribute and calculate font size and position according to its value.
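The weighting, ranking and sizing steps described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's actual code: the function names are hypothetical, the tf-idf variant is the max-normalized form cited from [15], and the linear mapping from weight to font size (with assumed bounds of 10 and 32 points) is our own choice, since the text only says that font size is calculated from the tf-idf value.

```python
import math
from collections import Counter


def tf_idf(doc_tf, doc_freq, n_docs):
    """w_ij = (tf_ij / max_k tf_ik) * log(N / n_j) for a single document i."""
    max_tf = max(doc_tf.values())
    return {
        term: (tf / max_tf) * math.log(n_docs / doc_freq[term])
        for term, tf in doc_tf.items()
    }


def top_terms(weights, k=40):
    """Rank terms by weight and keep the k best (the text selects the top 40)."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


def font_sizes(ranked, min_pt=10, max_pt=32):
    """Map weights linearly to font sizes (the linear mapping is an assumption)."""
    if not ranked:
        return {}
    lo = min(w for _, w in ranked)
    hi = max(w for _, w in ranked)
    span = hi - lo
    return {
        term: max_pt if span == 0
        else round(min_pt + (w - lo) / span * (max_pt - min_pt))
        for term, w in ranked
    }


if __name__ == "__main__":
    # Toy collection of three already-stemmed documents (made-up data).
    docs = [
        ["wavelet", "wavelet", "signal", "text"],
        ["text", "search", "text"],
        ["text", "emotion", "search"],
    ]
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = tf_idf(Counter(docs[0]), doc_freq, len(docs))
    print(font_sizes(top_terms(weights)))
    # → {'wavelet': 32, 'signal': 21, 'text': 10}
```

In this toy collection, "text" occurs in every document, so its weight is zero and it receives the smallest font, while "wavelet", frequent in the first document and absent elsewhere, receives the largest.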

5. Example of Use and Evaluation

For an initial analysis of the proposal, we applied the tool described above to a set of 20 articles dealing with the subject "Web intelligence". These articles were selected at random from the Web Intelligence 2008 conference proceedings [18]. From the individual results obtained, we selected three to exemplify the ideas described here. In our examples, tags generated directly from the document are shown on a gray background. Tags shown in white on a black background originate from WordNet.

Figure 8 shows the tag cloud of an article that presents a proposal for mining and exploiting feedback reported by customers in unstructured texts. To enable this, the tool uses a tree structure and displays results in the form of clusters. Clusters are shown in different colors in an attempt to represent the feelings of customers. In this article we avoid the use of colors due to printing limitations. It is easy to see that the fundamental concepts presented in the article are highlighted as terms in the tag cloud. Note in particular the presence of the words feedback, keyphrases, cluster and unstructured information.

Figure 8. Tag cloud for first result

Figure 9 shows the tag cloud generated for an article that tries to discover the social context in which a person is involved, by obtaining information from social networking websites such as Orkut and Facebook, and then uses this social context to improve the information retrieval algorithms used by that person. In the tag cloud we observe tags that summarize this idea, for example social context, comment and people preferences.

Figure 9. Tag cloud for second result

Figure 10 presents a tag cloud generated for an article that aims to classify documents into reader emotion categories and to integrate this classification into a web search engine. As before, we observe in the tag cloud the presence of tags that convey this idea, such as emotion, emotion classification and reader.

Figure 10. Tag cloud for third result

All tag clouds generated for this collection showed these characteristics, i.e., their tags provided specific information that can be used to highlight the particular features of a text in relation to a collection of documents.

5.1. Subjective Evaluation

To evaluate the tag clouds that we generated, we carried out an evaluation based on the idea proposed in [19]. For this evaluation we chose to compare our results against the tag clouds generated by a well-known visualization web site that produces different kinds of visualizations, among them tag clouds. We refer to this site as Site M.

We were able to set up a small evaluation session for the three examples in this paper. This session was conducted with seven experts on the topic. All of them are master's or doctoral students at UFRJ and have satisfactory knowledge of English. In addition, all evaluators were familiar with the subjects addressed in the articles used to generate the tag clouds.

For the first example, the evaluators concluded that our tag cloud was better than (5) or equal to (2) the one obtained from Site M. For the third example the evaluation was similar: our tag cloud was considered better by 2 evaluators and equal by the other 5. However, for the second example all evaluators agreed that both tag clouds failed. The tables below reproduce the evaluation results. Each cell shows how many evaluators found the tag cloud better than, equal to or worse than the other.

Table 1: Evaluation for the first example

          Our tag cloud   Tag cloud generated by M
Better    5               0
Equal     2
Worse     0               5

Table 2: Evaluation for the second example

          Our tag cloud   Tag cloud generated by M
Better    0               0
Equal     0
Worse     7               7

Table 3: Evaluation for the third example

          Our tag cloud   Tag cloud generated by M
Better    2               0
Equal     5
Worse     0               2

Based on this evaluation and on the opinions requested from the evaluators, we believe that our tag clouds are equal to or better than the tag clouds of Site M with respect to information, but worse with respect to visual representation. We therefore believe that our generated tag clouds satisfactorily meet the requirement of representing, through their tags, the main concepts and features of their respective articles.

An important advantage that we noticed in our model for generating tag clouds was the preference for nouns to represent concepts. This allowed us to create representative tag clouds with a smaller number of tags. We will also create a measure to evaluate the quality of the generated tag clouds. We intend to use parameters such as the number of tags in the tag cloud and the percentage of n-grams, among others, to obtain this measure.

6. Conclusion

We presented a formal model to describe tag clouds. A methodology for their construction was also presented, and some preliminary results were shown. These results demonstrate the practical applicability of the proposal. Due to printing limitations, we decided to present our tag clouds as grayscale representations. Our tool supports colors, 2-D distribution of tags and clustering.

As future work, we intend to apply term clustering algorithms to generate tag clouds that can highlight the characteristics of groups of documents within the collection. Since this is mainly a proposal to substitute a human activity, it is not yet clear how to evaluate it, or what type of functionality we should add to achieve a good user experience and an effective evaluation of our proposal. For example, we can create hypertext links between tags and tag clouds, and also between tags and documents.
These links would greatly enhance the usability of our tools; however, they also make it difficult to assess the individual influence of our proposal for tag cloud generation. We also plan to investigate the formal model further and make it compatible with an ontology of information developed by our group.

7. References

[1] http://delicious.com/
[2] http://www.flickr.com/
[3] Hassan-Montero Y. and Herrero-Solana V. Improving Tag Clouds as Visual Information Retrieval Interfaces. In Proc. of the 1st International Conference on Multidisciplinary Information Sciences and Technologies, InSCiT 2006. 2006.
[4] Lamantia J. Tag Clouds: Navigation for Landscapes of Meaning. Joe Lamantia Blog. <http://www.joelamantia.com/blog/archives/ideas/tag_clouds_navigation_for_landscapes_of_meaning.html>. 2006.
[5] Rivadeneira A. W., Gruen D. M., Muller M. J., Millen D. R. Getting our head in the clouds: toward evaluation studies of tagclouds. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '07. ACM, New York, NY, 995-998. 2007.
[6] Watters D. Meaningful Clouds: Towards a novel interface for document visualization. http://danwatters.com/documents/cloudmine_dwatters.pdf (visited 16/5/2009)
[7] Marinchev I. Practical Semantic Web Tagging and Tag Clouds. Journal Cybernetics and Information Technologies, v. 6, n. 3, pp. 33-39. 2006.
[8] Fellbaum, Christiane. WordNet: An Electronic Lexical Database. Bradford Books. 1998.
[9] Kuo B. Y., Hentrich T., Good B. M. and Wilkinson M. D. Tag clouds for summarizing web search results. In Proceedings of the 16th International Conference on World Wide Web, WWW '07. ACM, New York, NY, 1203-1204. 2007.
[10] Song, Y., Zhuang, Z., Li, H., Zhao, Q., Li, J., Lee, W., and Giles, C. L. Real-time automatic tag recommendation. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08. ACM, New York, NY, 515-522. 2008.
[11] Heymann, P., Ramage, D., and Garcia-Molina, H. Social tag prediction. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08. ACM, New York, NY, 531-538. 2008.
[12] Berners-Lee, T., Fielding, R., Masinter, L. Uniform Resource Identifier: Generic Syntax. RFC 3986, January 2005.
[13] Manola, F., Miller, E. RDF Primer. W3C Recommendation, 10 February 2004.
[14] Laurence, S., Margolis, E. Concepts and Cognitive Science. In Margolis, E. and Laurence, S. (eds.) Concepts: Core Readings. Cambridge, Mass.: MIT Press, 1999.
[15] Manning, C., Raghavan, P., and Schütze, H. Introduction to Information Retrieval. Cambridge University Press. 2008.
[16] Weiss S., Indurkhya N., Zhang T., Damerau F. Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer. 2005.
[17] Dasgupta A., Drineas P., Harb B., Josifovski V., Mahoney M. Feature Selection Methods for Text Classification. Proceedings of KDD07. 2007.
[18] Jain, L. et al. Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2008, 9-12 December 2008, University of Technology, Sydney, Australia.
[19] Harper, S. and Patel, N. Gist summaries for visually impaired surfers. In Proceedings of the 7th International ACM SIGACCESS Conference on Computers and Accessibility, Assets '05. ACM, New York, NY, 90-97. 2005.

8. Acknowledgements

The authors would like to thank CNPq, CAPES, FAPERJ and Fundação Coppetec for their financial support.